PaperHub
Rating: 6.3 / 10 (Poster; 3 reviewers; scores 7, 6, 6; min 6, max 7, std. dev. 0.5)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.7
NeurIPS 2024

On Feature Learning in Structured State Space Models

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords

State space models, feature learning, hyperparameter transfer, scaling theory, deep learning

Reviews and Discussion

Official Review (Rating: 7)

This paper studies the large-width scaling behavior of a recently popular class of models known as structured state space models (SSMs). The authors demonstrate that previous work on large-width neural scaling, which prescribes a parametrization known as the Maximal Update Parametrization (muP) as the optimal scaling for neural networks, does not cover the case of SSMs. Furthermore, muP turns out to be suboptimal for SSMs and does not achieve the desired consequence of stable feature learning and hyperparameter transfer. The authors then propose the proper scaling for SSMs and demonstrate numerically that hyperparameter transfer is achieved.

Strengths

Understanding the proper scaling of hyperparameters with model size is a practically important problem, especially in the age of scaling. State space models are an important class of models with competitive performance and desirable attributes such as fast inference. Identifying the proper scaling for SSMs is important for ensuring stable performance as SSM model sizes increase and for unlocking hyperparameter transfer, which can greatly reduce the cost of hyperparameter tuning. The authors provide a fairly simple and thorough explanation of the correct scaling for SSMs, which turns out to differ from the previous muP scaling. The numerical experiments give further confidence in the correctness of the results.

Weaknesses

The presentation is at times a bit hard to follow. More introduction to state space models would help orient readers who are not already familiar with them. Many terms are introduced with very little explanation. Some diagrams could also be very helpful, so that the reader can better visualize the SSM forward pass, etc.

On the mathematical side, it would be great if things were cleaner and better organized. I think the original spectral scaling paper [1] is a great example of this. The notation and results used should be clearly established (e.g., the definition of the spectral norm, Kolmogorov's strong law of large numbers, etc.). It would probably be most helpful if the proofs could be "modularized" into basic operations. Right now, all of the analysis goes through very specific instantiations for specific models. The original paper [1] focuses on what happens for elementary operations, which can be transparently composed. The results for SSMs should then follow as corollaries.

[1] Yang, G., Simon, J. B., & Bernstein, J. (2023). A spectral condition for feature learning. arXiv preprint arXiv:2310.17813.

Questions

None.

Limitations

Yes.

Author Response

We sincerely appreciate your thoughtful review and positive assessment of our work. Your feedback has been invaluable in helping us improve the paper. We have addressed your concerns regarding improved presentation and organization of the proofs. We hope that our revisions will merit your consideration for an increased score.

Improved presentation. To enhance the visualization of the Mamba forward pass, we have created a new figure (please see Figure 2 in the PDF attached to the global response). We will integrate this figure into the main paper and expand Section 3.1 to provide additional background on SSMs for enhanced readability. We will also introduce all notation and results used, such as the definition of the spectral norm, Kolmogorov's SLLN, and the Lindeberg-Feller CLT, in the appendix.

Modularization of the proofs. Thank you for your suggestion. We concur that modularizing the proofs will certainly enhance the paper's accessibility. However, analyzing signal propagation in the Mamba architecture is inherently more complex than in MLPs, necessitating some analysis of specific modules. Nevertheless, we came up with a strategy to modularize the analysis by noting that the Mamba layer can be mechanistically viewed in the forward pass as a sequence of 3 components/sub-layers:

  1. Selection
  2. Discretization
  3. Linear recurrence

Our visualization of the Mamba layer is based on this perspective (in the PDF attached to the global response). To our knowledge, all SSM layers comprise some or all of these sub-layers, allowing us to modularize both forward and backward signal propagation analysis through each sub-layer. It's worth noting that different discretization rules (e.g., ZOH or Euler) would require separate analyses. Beyond this, while SSMs primarily differ in their initialization of the transition matrix A, our analysis remains applicable to any initialization chosen according to HiPPO theory.
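
To make this decomposition concrete, below is a minimal, heavily simplified Python/NumPy sketch of one selective-SSM (Mamba-style) layer organized as the three sub-layers listed above. The shapes, the negative-real diagonal A, the Euler-style simplification when discretizing B, and the initialization scales are illustrative assumptions for this sketch, not the paper's exact formulation or its derived parameterization.

    import numpy as np

    N_u, N_x, L = 64, 16, 32          # input/channel width, latent state size, sequence length
    rng = np.random.default_rng(0)

    A   = -np.arange(1, N_x + 1, dtype=float)             # structured diagonal transition (HiPPO-inspired, assumed)
    W_B = rng.normal(0, 1 / np.sqrt(N_u), (N_x, N_u))      # selection weights for B_t
    W_C = rng.normal(0, 1 / np.sqrt(N_u), (N_x, N_u))      # selection weights for C_t
    W_d = rng.normal(0, 1 / np.sqrt(N_u), (N_u, N_u))      # produces the per-channel step size Delta_t

    def ssm_layer(u_seq):
        """u_seq: (L, N_u) -> (L, N_u); latent state x has shape (N_u, N_x)."""
        x, ys = np.zeros((N_u, N_x)), []
        for u in u_seq:
            # 1. Selection: input-dependent B_t, C_t, Delta_t -- activations that are
            #    then reused as "weights" in the recurrence below.
            B_t = W_B @ u                                  # (N_x,)
            C_t = W_C @ u                                  # (N_x,)
            delta_t = np.log1p(np.exp(W_d @ u))            # softplus, (N_u,)
            # 2. Discretization: ZOH for A, Euler-style simplification for B.
            A_bar = np.exp(np.outer(delta_t, A))           # (N_u, N_x)
            B_bar = np.outer(delta_t, B_t)                 # (N_u, N_x)
            # 3. Linear recurrence (diagonal, per channel) and readout.
            x = A_bar * x + B_bar * u[:, None]
            ys.append(x @ C_t)                             # (N_u,)
        return np.stack(ys)

    print(ssm_layer(rng.normal(size=(L, N_u))).shape)      # (32, 64)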

We believe these changes will significantly enhance the clarity and accessibility of our work, and we look forward to incorporating them in our revised manuscript.

Comment

I appreciate the efforts the authors have made in the rebuttal and the proposed changes. I think implementing these changes will greatly help the paper. I will raise my score slightly.

Official Review (Rating: 6)

This paper investigates the scaling behavior of state-space models (SSMs) and structured variants like Mamba as their width approaches infinity. The authors demonstrate that established scaling rules such as maximal update parameterization (μP) and spectral scaling conditions fail to yield feature learning in SSMs at infinite width. They derive a new scaling rule, μP SSM, enabling feature learning in SSMs as width approaches infinity. Empirical results show that μP SSM leads to improved stability, generalization, and hyperparameter transferability compared to standard parameterization and spectral scaling.

Strengths

  • Addresses an important theoretical question about SSM scaling behavior
  • Provides rigorous analysis of forward and backward signal propagation in SSMs
  • Identifies limitations of existing scaling approaches and proposes a principled correction
  • Empirically validates results on real SSM architectures like Mamba
  • Has potential implications for training larger, more efficient SSMs

Weaknesses

  • Theoretical analysis limited to N_u then N_x approaching infinity setting
  • Narrow scope of empirical validation (text generation with Mamba on a single dataset)
  • Lacks comparison to other recent SSM variants beyond Mamba
  • Insufficient discussion of potential negative implications of enabling feature learning in larger SSMs
  • No clear roadmap for extending results to more practical settings

Questions

  1. Can you provide a more intuitive explanation of why SSMs violate the tensor program ansatz?
  2. How do you expect these results to generalize to other SSM variants and tasks?
  3. What are the computational trade-offs of applying μP SSM scaling in practice?
  4. Can you elaborate on the implications of spectral scaling leading to parts of the SSM being in the lazy regime?
  5. Have you explored whether μP SSM enables training of wider SSMs than previously possible?
  6. Can you outline a path for extending your analysis to the proportional limit case?

Limitations

The authors discuss the key limitation of restricting analysis to the N_u then N_x approaching infinity setting. However, they should more explicitly address limitations of their empirical evaluation and potential challenges in applying results to practical scenarios.

Author Response

We would like to express our gratitude for your thorough and insightful review of our manuscript, and useful feedback for improving our work.

Order of limits and practical applicability. In SSM models, N_u (the input dimension of the SSM component) typically increases much faster than N_x (the latent state dimension of the SSM component) during scaling (see Table 10 in the original Mamba paper [1]). This makes the limit where N_u approaches infinity before N_x more practically relevant. Fortunately, these limits commute, which allows us to extend our results to the proportional limit setting.
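
For readability, the two regimes referred to here can be written out explicitly; this is only a symbolic restatement of the response (f denotes a generic width-dependent quantity and c > 0 an arbitrary fixed ratio, both notation introduced here for illustration):

    % Sequential limit analyzed in the paper: N_u -> infinity first, then N_x.
    \lim_{N_x \to \infty} \; \lim_{N_u \to \infty} f(N_u, N_x)
    % Proportional limit mentioned above: both widths grow at a fixed ratio.
    \lim_{\substack{N_u, N_x \to \infty \\ N_x / N_u \to c}} f(N_u, N_x)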

Additional Experiments. We have conducted additional experiments to further validate our theory. We have summarized them in the global response and the results are provided in the PDF attached to the global response. Addressing your concerns, we now also present results on a randomly sampled subset of the Fineweb dataset, suggesting similar conclusions to the wikitext-103 results in Figure 2 (main paper). Separately tuning the learning rates in SSM layers and non-SSM layers shows the benefit of muP-SSM over spectral scaling more clearly: While muP-SSM continues to improve with scale, spectral scaling no longer monotonically improves beyond a certain width threshold. Even without separate learning rates, muP-SSM markedly improves training stability at larger learning rates over spectral scaling. SP consistently performs worse than muP-SSM in terms of stability and generalization. This indicates that muP-SSM can indeed improve performance at larger scale.

Implications for other SSM variants. Mamba is among the most complex SSM architectures, containing all components present in other recent SSM variants. By providing a scaling analysis for Mamba, analogous analyses for S4 or LRUs follow as corollaries. We will clarify the implications for other SSM variants in the revision (see our answer to Reviewer iUa6).

Implications of enabling feature learning and computational considerations. We see no negative implications in enabling feature learning in every layer. If feature learning is undesired in a specific layer, training can be explicitly disabled, saving computation instead of letting the updates implicitly vanish with scale. muP-SSM therefore provides the flexibility to achieve feature learning in SSM layers when desired. muP-SSM introduces no additional computational complexity, as it is merely a different parameterization of the weights.

Have you explored whether μP SSM enables training of wider SSMs than previously possible? While we lack the computational resources to experiment at such large scales, there is evidence suggesting that muP-SSM could enable training of wider SSMs. Even at relatively small scales, muP-SSM shows markedly improved training stability at larger learning rates compared to spectral scaling. According to [3], instabilities in small-scale models at large learning rates can predict general training instabilities at larger scales. This provides some support for the possibility of training wider SSMs with muP-SSM, though direct large-scale experiments are needed for confirmation. We hope that the SSM community will conduct experiments at larger scales to further investigate this potential.

Intuitive explanation of why SSMs violate the TP ansatz. There are two primary reasons why SSMs such as Mamba cannot be represented via TP:

  1. Structured transition matrix (A). Typically, the state transition matrix A is highly structured and chosen according to (or loosely based on) HiPPO theory [2]. One example is the diagonal matrix with the i-th diagonal entry being i+1 (see the short sketch after this list). The TP framework crucially relies on matrices with i.i.d. entries (such as i.i.d. Gaussians) and cannot represent such structured matrices.
  2. Selection mechanism. Activations and weights play different roles under TP. The selection mechanism first computes activations (linear transformations of the input, parameterized by some weights) and then uses them as weights in the linear recurrence (see the PDF attached to the global response for a visualization). This is a second source of incompatibility of SSMs with the TP framework.
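
A tiny NumPy sketch of the first point, contrasting the kind of random matrix the TP framework reasons about with the structured transition matrix from the example above (sizes are arbitrary and chosen only for illustration):

    import numpy as np

    N_x = 16
    rng = np.random.default_rng(0)

    # TP-style weight: i.i.d. Gaussian entries, the object the Tensor Program ansatz covers.
    A_tp_style = rng.normal(0.0, 1.0 / np.sqrt(N_x), size=(N_x, N_x))

    # Structured SSM transition matrix from the example above: diagonal, with the i-th
    # diagonal entry equal to i + 1. Entries are deterministic and depend on their index,
    # so they admit no i.i.d. description in the large-width limit.
    A_structured = np.diag(np.arange(N_x) + 1.0)

    print(np.diag(A_structured)[:5])   # [1. 2. 3. 4. 5.]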

[1] Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv preprint arXiv:2312.00752 (2023).

[2] Gu, A., Dao, T., Ermon, S., Rudra, A., & Ré, C. "Hippo: Recurrent memory with optimal polynomial projections." NeurIPS 2020.

[3] Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., ... & Kornblith, S. "Small-scale proxies for large-scale transformer training instabilities." arXiv:2309.14322 (2023).

Comment

I have read the authors' responses and discussion with other reviewers, and thank you for the detailed responses. I will be maintaining my score.

Official Review (Rating: 6)

Following the tensor program and maximal update parameterization line of work, this paper studies the parameterization of initialization and learning-rate hyperparameters in structured state space models (SSMs), e.g., S6/Mamba. The authors let the input dimension and latent dimension of each vector in the sequence go to infinity and analyze the proper initialization and hyperparameter parameterization such that the initial output at each layer remains stable when passing through multiple layers and the feature updates are comparable in scale to the initialization after one gradient step. The paper provides a detailed analysis of signal propagation in SSMs, both in the forward and backward passes, as the width of the network increases. It also reveals that established scaling rules, such as the maximal update parameterization and spectral scaling conditions, fail to maintain feature learning in SSMs.

Strengths

This paper is the first to study hyperparameter scaling for infinite-width SSMs, a topic that has not been extensively explored compared to other neural network architectures like MLPs and CNNs. This work has practical implications for improving the training and performance of large-scale SSMs and sets the stage for future research in this area. By tackling practical issues such as vanishing or exploding gradients, the paper provides solutions that may enhance the stability and efficiency of training state-of-the-art state space models, making it a valuable resource for practitioners.

Weaknesses

  1. There should be more experiments and empirical comparisons among standard parameterization, maximal update parameterization, spectral scaling, and the μP SSM parameterization, in terms of test loss and parameterization transferability on different types of datasets and learning tasks. The only empirical experiments in Fig 2 in the paper seem to indicate that spectral scaling performs similarly to the μP SSM parameterization. It would be more convincing if we could compare these parameterizations in various situations.

  2. There is a lack of explanation of why we need to consider N_x and N_u going to infinity. For SSMs, we usually consider the length of the sequence to be large, and SSMs can preserve long-range dependencies. Besides, in the analysis, the authors only consider layer inputs that are i.i.d., which is quite different from practical sequential datasets. These assumptions, and the definition of feature learning in this paper, require more explanation.

Questions

  1. Typo in Line 47: ℝ_u^N

  2. Can you provide more motivation for defining feature learning in layer-wise sequential models in Definition 2.1? In (1) and (2), why do you only require the existence of one output satisfying stability and feature-update assumptions? Do we need to ensure these bounds for all outputs?

  3. In Line 70, you mentioned that for feature learning we need the inputs to the layer to be asymptotically independent and identically distributed. Does this mean we need to assume u_1, …, u_L are asymptotically i.i.d.?

  4. Line 76: typo "Parmeterization"

  5. In (6), W should be W_l

  6. Explain the functions in (8-10) for completeness.

  7. Below Line 172: typo, σ_B^2 should be σ_C^2

Limitations

N/A

Author Response

Thank you for your constructive feedback and positive evaluation of our work. We appreciate your insights and we have carefully addressed your major concerns in the following response. We will rectify minor issues like typos in our revision. We hope that our revisions warrant your consideration for an increased score.

Additional Experiments. We have conducted additional experiments to further validate our theory. We have summarized them in the global response, and the results are provided in the attached PDF. Addressing your concerns, we conduct experiments on a randomly sampled subset of the Fineweb dataset, which suggest similar conclusions to those on wikitext-103 in Figure 2 (main paper). muP-SSM outperforms SP in terms of test loss, training stability, monotonic improvement of generalization with scale, and transferability of the optimal learning rate across scales. Observations in comparison to spectral scaling are more nuanced. muP-SSM markedly improves training stability at large learning rates over spectral scaling, and slightly improves generalization performance. First tuning the learning rate in non-SSM layers and subsequently tuning a separate learning rate in SSM layers separates muP-SSM and spectral scaling more clearly: while muP-SSM continues to improve monotonically with scale, spectral scaling no longer improves beyond a certain width, suggesting that the reasonable performance of spectral scaling stems from the non-SSM layers. In SSM layers, the spectral scaling approach lacks rigorous theoretical justification.

Clarification on the asymptotically i.i.d. assumption. In Sections 2.2 and A.1 of our paper, we discuss that activations and their updates in neural networks representable as Tensor Programs (TPs) become asymptotically independent and identically distributed (i.i.d.) with increasing width. This result underlies our i.i.d. assumption. Note that we consider the practical setting where SSM layers are embedded within a network of non-SSM layers (such as MLPs or LayerNorms) representable via TP. Accordingly, it is natural to assume that inputs to the SSM layer (e.g., activations from the previous non-SSM layer) are asymptotically i.i.d. at each recurrence step. In fact, this i.i.d. assumption is not strictly necessary. It suffices to assume that inputs are correctly scaled, i.e., ||u||_2 ∈ Θ(√N_u). This scaling has been demonstrated for correctly scaled network modules representable via TP. Crucially, we do not assume that the sequence of inputs u_1, u_2, …, u_L is i.i.d. Rather, we assume that the coordinates of a single input u_i are asymptotically i.i.d. (or at least correctly scaled) in the sense described above.
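
A quick numerical illustration of the "correctly scaled" condition mentioned above: a vector with i.i.d., O(1) coordinates has norm growing like √N_u. The specific widths and distribution below are arbitrary choices for the sketch.

    import numpy as np

    # Check that ||u||_2 / sqrt(N_u) concentrates around a constant as N_u grows,
    # i.e. ||u||_2 is Theta(sqrt(N_u)) for i.i.d., unit-variance coordinates.
    rng = np.random.default_rng(0)
    for N_u in [256, 1024, 4096, 16384]:
        u = rng.normal(0.0, 1.0, N_u)
        print(N_u, np.linalg.norm(u) / np.sqrt(N_u))   # ratio approaches 1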

Why should N_u and N_x go to infinity? We would like to clarify that our analysis does not require both N_u (the input dimension of the SSM component) and N_x (the latent state dimension of the SSM component) to go to infinity; it merely allows both quantities to scale up. To derive the results for when only N_u goes to infinity, one can simply set N_x ∈ Θ(1). In this simplified scenario where the dimension of the latent states N_x is fixed, heuristic muP spectral scaling yields η_B, η_C = Θ(1/N_u), while muP-SSM gives η_B, η_C = Θ(1/√N_u) (see Table 1 in the main paper). Experiments carried out independently by other researchers (to be acknowledged in the public version) indeed verify that the correct width-independent scaling aligns with muP-SSM (results shown in Figure 4 in the attachment). Note, however, that as models are scaled up, both N_u and N_x may be increased in practice, albeit at very different rates; for example, see Table 10 in the original Mamba paper [1].
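
For concreteness, a small Python sketch of how the two prescriptions scale the learning rates of B and C with width when N_x is fixed; the base learning rate and base width are hypothetical tuning constants, not values from the paper.

    import numpy as np

    eta_base, N_u_base = 1e-3, 256   # hypothetical base learning rate and base width

    for N_u in [256, 512, 1024, 2048, 4096]:
        eta_spectral = eta_base * (N_u_base / N_u)          # spectral scaling: Theta(1/N_u)
        eta_mup_ssm  = eta_base * np.sqrt(N_u_base / N_u)   # muP-SSM:          Theta(1/sqrt(N_u))
        print(f"N_u={N_u:5d}  eta_B,eta_C  spectral={eta_spectral:.2e}  muP-SSM={eta_mup_ssm:.2e}")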

In Definition 2.1, why do we require that there exists one input that is correctly scaled? This follows the same logic as the paper that proposed muP [3, Definition H.9], which defines a parameterization to be feature learning iff there exists a training routine and input that result in the correct update scaling. This is a technicality we have to adopt as there exist degenerate combinations of inputs and learning rates that result in smaller scalings. Even under this definition, there exists only one choice of layerwise initialization and learning rate scalings that achieves feature learning.

[1] Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv:2312.00752 (2023).

[2] Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., ... & Kornblith, S. "Small-scale proxies for large-scale transformer training instabilities." arXiv:2309.14322 (2023).

[3] Yang, Greg, and Edward J. Hu. "Feature learning in infinite-width neural networks." arXiv:2011.14522 (2020).

Comment

I thank the authors for a detailed rebuttal and additional experiments. I appreciate the helpful explanation of my questions and the further experiments. I believe incorporating them, and the responses to other reviewers, into the revision of this paper will significantly improve the writing and clarity.

Author Response

Dear Reviewers,

We sincerely appreciate the time and effort you have invested in evaluating our work. Your insightful comments and constructive feedback have been invaluable in helping us improve the clarity and quality of our research. We have included a pdf attachment to this response which contains additional figures that address some of the questions raised by multiple reviewers. We will address individual points raised by each reviewer in detail in the reviewer-specific responses.

Improved Accessibility. Following the suggestion from Reviewer iUa6, we have included a new illustration (Figure 2 in the attached PDF) that demonstrates the forward pass of the Mamba SSM layer. This addition aims to enhance the accessibility of our work for readers less familiar with State Space Models.

Additional Experiments. To address concerns raised by Reviewers QtDa and JuBx and to strengthen our empirical evidence, we conducted additional experiments. Specifically,

  1. Results on the Fineweb Dataset. We present results on a randomly sampled subset of the Fineweb dataset (Figure 3 in the attachment). While computational and time constraints prevented us from training on the entire dataset or using larger model widths, our observations on Fineweb are consistent with the wikitext-103 results presented in the main paper (Figure 2).

  2. Decoupled Learning Rates. To isolate the effects of SSM scaling, we decoupled the learning rates for SSM and non-SSM layers. We first tuned the learning rate of the non-SSM layers, then compared test performance across different scales for various SSM learning rates, using the optimal non-SSM learning rate (Figure 1 in the attachment); a minimal sketch of this setup is given after this list.

  3. Verification of Correct Scaling. In a simplified scenario where the dimension of the latent states N_x is fixed, heuristic muP spectral scaling yields η_B, η_C = Θ(1/N_u), while muP-SSM gives η_B, η_C = Θ(1/√N_u) (see Table 1 in the main paper). Experiments carried out independently by other researchers (to be acknowledged in the public version), shown in Figure 4 in the attachment, verify that the correct width-independent scaling indeed aligns with muP-SSM.
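
As referenced in point 2 above, here is a minimal PyTorch-style sketch of decoupling the learning rates of SSM and non-SSM parameters via optimizer parameter groups. The toy module, the name-based filter, and the numerical values are illustrative placeholders, not the authors' actual model, naming scheme, or tuned settings.

    import torch
    import torch.nn as nn

    class ToyBlock(nn.Module):
        """Stand-in for one block containing both non-SSM and SSM parameters."""
        def __init__(self, d_model=256, d_state=16):
            super().__init__()
            self.mlp = nn.Linear(d_model, d_model)                 # non-SSM layer
            self.ssm_B = nn.Linear(d_model, d_state, bias=False)   # SSM input projection B
            self.ssm_C = nn.Linear(d_model, d_state, bias=False)   # SSM output projection C

    model = ToyBlock()
    lr_non_ssm = 1e-3                     # tuned first, shared by non-SSM layers
    lr_ssm = 1e-3 / (256 ** 0.5)          # swept separately; e.g. muP-SSM would scale it ~ 1/sqrt(N_u)

    ssm_params     = [p for n, p in model.named_parameters() if "ssm" in n]
    non_ssm_params = [p for n, p in model.named_parameters() if "ssm" not in n]

    optimizer = torch.optim.AdamW([
        {"params": non_ssm_params, "lr": lr_non_ssm},   # non-SSM learning rate
        {"params": ssm_params,     "lr": lr_ssm},       # decoupled SSM learning rate
    ])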

These new experiments further validate our theoretical results and address the specific concerns raised in the reviews. We plan to incorporate these results into the revised manuscript. Below, we summarize our key findings based on both the main paper experiments and the additional experiments included in the pdf.

Summary of empirical findings. Our work evaluates the scaling behaviour of different parameterizations for State Space Models (SSMs) using four criteria: generalization (test loss), training stability, monotonic improvement of generalization with scale, and transferability of optimal learning rate across scales. Our experiments reveal that while Standard Parameterization (SP) consistently performs worse than muP-SSM across all metrics, the comparison between muP-SSM and spectral scaling yields nuanced results at the scales we test:

  1. Generalization. Spectral scaling has only slightly worse test loss compared to muP-SSM at the largest scale we were able to test across all the experiments.
  2. Training stability. However, muP-SSM demonstrates markedly improved training stability at larger learning rates compared to the spectral scaling approach, already at the relatively small scales we test. Note that instabilities in small-scale models at large learning rates can be predictive of general training instabilities at larger scales [2].
  3. Monotonic improvement of generalization with scale.
    • Using a single global learning rate for the entire network (SSM + non-SSM layers), similar to muP-SSM, spectral scaling appears to improve performance monotonically with scale up to the largest scale tested (Figure 2 in main paper, Figure 3 in rebuttal attachment).
    • However, we hypothesized this might be an artifact of the correct scaling of non-SSM modules under spectral scaling. To test this, we decoupled the learning rates for SSM and non-SSM layers: we first tune the learning rate of the non-SSM layers and then, under the optimal non-SSM learning rate, compare model performance across scales for different SSM learning rates. Figure 1 in the rebuttal attachment shows that beyond a certain width threshold, performance under spectral scaling no longer improves monotonically with scale, contrasting sharply with muP-SSM.
  4. HP transferability. At tested scales, both muP-SSM and spectral scaling demonstrate similar transferability. This is not entirely unexpected when using a global learning rate, as both methods identically parameterize non-SSM layers.

It's important to note that unlike muP-SSM, the spectral scaling approach used in our experiments is heuristically derived and lacks rigorous theoretical justification.

Final Decision

This paper studies feature learning in SSMs in the infinite-width regime. There are several contributions: (1) showing that the muP parameterization leads to unbounded outputs; (2) identifying that this problem is due to tensor programs not being able to represent SSMs; (3) developing a correction that leads to more stable training and bounded outputs; and (4) proposing a scaling rule for SSMs and empirically testing it on several architectures.

The reviewers appreciate the interesting problem and the theoretical analysis. Even though the empirical validation could be more comprehensive, given the initial promising results and the potential impact on the SSM space, I recommend acceptance.