ICML 2025 (Poster)
Overall rating: 4.9/10 from 4 reviewers; scores 3, 3, 3, 2 (min 2, max 3, std 0.4)

Quantifying Memory Utilization with Effective State-Size

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose the Effective State-Size (ESS) metric to quantitatively analyze how sequence models utilize memory and context, offering insights into model in-context recall, initialization strategies, efficiency, and architecture design.

Abstract

Keywords

model analysis, interpretability, linear systems, attention, state-space models, sequence models, memory utilization, context utilization

Reviews and Discussion

Review (Score: 3)

The paper proposes to study the memory stored in a wide range of sequential neural network architectures through the notion of effective state-size, which is motivated by minimal realization theory and is applicable out of the box to many architectures. The authors empirically validate it as a sound measure by correlating it with performance on memory-intensive tasks, and also use it to better understand which kinds of models are easier to distill and to derive initialization strategies.

[EDIT 04/02/2025]: updated score from 2 to 3.

Questions for the Authors

  • I am confused by the fact that the effective state-size is a function of $i$. I cannot really make sense of it: from my understanding $i$ indexes time, and I would expect the effective state-size to be something independent of $i$, for example the maximum of all the $\text{ESS}_i$ values. Could the authors clarify that point? Having a working example, as I mentioned in the previous section, could be a good way to clarify things.
  • I cannot really make intuitive sense of why the ESS decreases before the end of the sequence in Figure 7. Intuitively, I would expect the separator to remove information about previous states but not the tokens appearing before, at least in a causal system. Why isn't it the case? Are there some boundary effects of the method?
  • Can the authors compare their method to the following approach: for all mentioned models except softmax attention, look at $ds_{i+j} / ds_j$, compute on average for how many time steps it is greater than some threshold, and compute the capacity of the model as this value times the number of states. Intuitively, this would correspond to the "kv" cache size of the model. I would appreciate it if the authors could compare with this type of metric (theoretically / empirically through a comparison in a simple setting / ...). Linked to that point is the last sentence of the last paragraph on page 1: I could not find a detailed criticism of approaches like Wu et al. 2024. Discussing them in more detail would help better understand the contribution of the present work.

Claims and Evidence

The claims are supported through extensive theoretical and empirical verification.

Methods and Evaluation Criteria

The focus on memory intensive tasks is particularly relevant for the proposed approach. The distillation part also makes sense. I have some concerns regarding the setting of Section 5.2 that I detail below.

Theoretical Claims

The theoretical claims are mostly rooted in established results from the theory of recurrent realizations. I haven't checked the correctness of the claims, though, and I am not familiar with this literature.

Experimental Design and Analysis

No, but I didn't notice anything that raised concerns.

Supplementary Material

Only had a look at it, but did not review it more thoroughly.

Relation to Existing Literature

The paper does a great job at linking the ideas it introduces / problems it studies with existing work.

Missing Important References

No

Other Strengths and Weaknesses

Strengths:

  • the paper studies a very timely problem, namely the in-context memory of sequential models.
  • the approach is theoretically well motivated; looking at the minimal representation is an interesting angle.
  • most of the experiments make sense. The memory experiments are a good test bed and the proposed method brings an interesting angle to the distillation case.
  • the state saturation vs. state collapse argument is an interesting lens on studying failures of models on memory heavy tasks.

Weaknesses:

  • the main weakness to me is the presentation of the paper: I find it hard to get an intuitive understanding of the method (see next box for more details).
  • as a result of the previous point, I have some trouble judging whether the proposed method makes sense. This justifies my conservative score for now (weak reject), and I am happy to revise my score once I am more confident in my understanding of the method.
  • The authors claim that model performance on language depends on the model's "ability to dynamically modulate its ESS in response to inputs". While I agree with the underlying intuition, I fail to see what ESS is bringing here and would appreciate clarification from the authors.

Other Comments or Suggestions

Regarding the writing. The clarity of the paper could be greatly improved in my opinion. I appreciate the authors' effort in summarizing important results in boxes, but more should be done for the paper to be ready for acceptance. To list some improvement points:

  • I find the introduction overly technical. As a rule of thumb, I would avoid any mathematical formulation and dynamical system jargon in the introduction.
  • Introducing one or two working examples while going through the theoretical section could be useful, e.g. a simple linear SSM (possibly with a rank constraint on the A matrix) and/or an SSM with an input-dependent transition matrix. This would help the reader (me included) get a better feeling for what the proposed measure is capturing.
  • Moving section 3.1 from the main text to the appendix might be a good idea (keeping a high-level summary in the main text would be enough). I find it overly technical for the main text and it does not really seem to be needed in what follows.
  • [Minor] Using the terms "memory capacity" and "memory utilization" would make the text easier to read than the repetitive use of ESS / TSS, which I find slightly confusing.
  • [Minor] Figure 1 is not really dense in terms of information, and the current version could likely be removed without hindering the understanding of the paper.
  • [Minor] Suggestion to improve Figure 5: I would find it more interesting to see how the loss evolves as a function of the student TSS. In particular I would like to see how much bigger it needs to be than the ESS of the teacher to get better-than-chance performance. The current visualization makes this difficult to assess.
Author Response

Thank you for your detailed comments! The paper will be revised to better explain each of the following points:

Why is ESS a function of time? Why does ESS decrease before the EOS token?

ESS at each time step $i$ captures the lower bound for the minimal state-size required at that specific step. Intuitively, this bound reflects at least how much information from the past is relevant for future computation. As the sequence progresses, the model may retain more or less information depending on upcoming needs. While we can summarize ESS over time with an aggregate measure (e.g. max or average, as done in Sections 4 and 5.1), examining ESS at each time step can reveal interesting temporal patterns, particularly w.r.t. a model’s ability to store and discard information dynamically.

Because our formulation of ESS depends on both past and future relevance, it decreases when the model requires less prior context to process upcoming tokens. Near the EOS, fewer future tokens remain to be influenced by previous tokens, so ESS tends to shrink before the actual EOS token.

We have also identified ESS variants that measure a causally determinable minimal state-size (i.e. it depends only on the past). Under this metric, state-size drops sharply at the EOS token, rather than exhibiting a more gradual decrease. In Section D.2.2, we briefly discuss these “causal ESS” metrics. Instead of computing the rank of $H$, one can decompose $H$ into its causal and anti-causal parts ($H = \mathcal{O}\mathcal{C}$) and then compute the rank of $\mathcal{C}$ to obtain a causal ESS. However, this metric may fail to capture certain insights, because for models like softmax attention, $\mathcal{C}$ is effectively an identity matrix that grows with sequence length (substitute Eq. D.2.5 into $\mathcal{C}$). As a result, the causal ESS for softmax attention increases linearly with sequence length, failing to reflect how state-size might rise and fall near the EOS token.
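
To make the time-dependence concrete, here is a minimal sketch (not the paper's exact procedure) of a per-step ESS computed from a materialized causal operator T: the value at step $i$ is an effective rank of the block of T mapping inputs before step $i$ to outputs at and after step $i$. The single-channel softmax-attention toy, the tolerance rule, and the name `ess_per_step` are illustrative assumptions; the paper's multi-channel flattening and tolerance/entropy conventions may differ.

```python
import numpy as np

def ess_per_step(T, tol=1e-6):
    """Illustrative per-step ESS: effective rank of the block of the causal
    operator T that maps inputs at steps < i to outputs at steps >= i.
    Sketch only; the paper's exact conventions may differ."""
    L = T.shape[0]
    ess = np.zeros(L, dtype=int)
    for i in range(1, L):
        H_i = T[i:, :i]                      # "past -> future" block at split i
        s = np.linalg.svd(H_i, compute_uv=False)
        if s.size and s.max() > 0:
            ess[i] = int((s > tol * s.max()).sum())
    return ess

# Toy example: single-channel causal softmax attention over a random sequence.
L, d = 16, 4
rng = np.random.default_rng(0)
q, k = rng.normal(size=(L, d)), rng.normal(size=(L, d))
scores = q @ k.T / np.sqrt(d)
scores[np.triu_indices(L, k=1)] = -np.inf    # causal mask
T = np.exp(scores)
T /= T.sum(axis=1, keepdims=True)            # row-wise softmax
print(ess_per_step(T))                        # per-step ESS varies with the split index i
```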

Comparing ESS to an Effective Window-Based Metric

Counting the number of state derivatives above a certain threshold effectively measures the window of the model’s memory. Intuitively, a smaller window suggests lower memory utilization. This phenomenon is also reflected in the ESS metric: when the operator decays quickly, most rows/columns in $H_i$ are zeroed out, leading to a lower rank and hence a smaller ESS.

However, the effective window alone does not capture the complexity of dependencies within the window. For instance, for linear attention, $\frac{ds_{i+j}}{ds_j}$ is always an identity matrix, regardless of $i$ or $j$. Consequently, a window-based metric that simply counts these derivatives would increase indefinitely with sequence length and fail to reflect the model’s true memory capacity or utilization, which is capped. Moreover, because the window-based metric does not account for complexity, it ignores the effects of $B$ and $C$ (input and output projections), which themselves can be degenerate (e.g. zeroed-out rows).

For these reasons, ESS provides a more accurate and theoretically grounded measure of memory utilization. Nonetheless, we acknowledge that alternative metrics, such as the one proposed by the reviewer, can still offer useful heuristic insights when analyzing a model’s memory.
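
To illustrate the contrast in a toy setting (our own sketch, not the paper's experiment), take unnormalized linear attention: $\frac{ds_{i+j}}{ds_j}$ is the identity, so a threshold-based window count grows with the split index, while the rank of the past-to-future block of the operator stays capped at the feature dimension. All names and sizes below are illustrative.

```python
import numpy as np

# Toy contrast at a single split index i, for unnormalized linear attention
# with T[i, j] = q_i . k_j (j <= i). All sizes/names are illustrative.
L, d, i = 64, 4, 32
rng = np.random.default_rng(1)
q, k = rng.normal(size=(L, d)), rng.normal(size=(L, d))

# Window-based count: ds_{i+j}/ds_j is the identity for linear attention, so
# every past step clears any reasonable threshold and the count grows with i.
window_count = i

# ESS-style rank of the past -> future block, capped by the feature dimension d.
H_i = q[i:] @ k[:i].T
ess_i = np.linalg.matrix_rank(H_i)

print(window_count, ess_i)   # e.g. 32 vs. 4: the rank reflects the d-dimensional state cap
```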

Criticism of Approaches in Wu et al. (2024) and Related Works

We are not criticizing Wu et al.; rather, we cited their work as they similarly highlight how some spectral analyses of the attention matrix (i.e. the operator T) overlook the causality inherent in sequence decoders. That said, we do point out drawbacks in approaches such as Min et al. (2024) and Bhojanapalli et al. (2020), which derive their metrics by taking the SVD of the operator T without accounting for causal masking. For instance, Min et al. state: “It should be noted that the training of GPT-2 necessitated the masking of attention matrices to prevent the model from accessing future data. However, this mask was not applied during the rank evaluation.” Consequently, these approaches fail to tie together key notions like memory capacity and memory utilization.

Further improvements in writing

  1. We will reduce technical jargon or add contextual information where needed in the introduction to improve accessibility.
  2. We will move Section 3.1 to the appendix to shorten the main text, enabling space for a toy example of ESS (illustrating how ESS varies with time/input). Figure 1 will be moved to the appendix if needed.
  3. Regarding your question on what a model's "ability to dynamically modulate its ESS in response to inputs" captures, our results in Section 5.2 indicate that recall performance depends on a model’s ability to modulate its ESS, not just total memory capacity. Although WLA, GLA, and LA share the same memory capacity, their recall differs, implying differences in memory utilization. Nonetheless, we will clarify in the text that ESS only suggests this dependence, rather than conclusively proving it.
Review (Score: 3)

This work introduces the Effective State-Size (ESS) metric to quantify memory utilization in sequence models while previous approaches focus on memory capacity (such as cache size/total memory available). ESS aims to measure how effectively a model uses its available memory. Using this metric the authors analyze 4 kinds of sequence models on a synthetic task and demonstrate how the metric helps to explain their performance, as well as how it predicts performance after distillation for some models.

Questions for the Authors

Claims and Evidence

The claims seem to be generally supported:

  1. The appendix details their experiments on trying to improve performance through ESS-informed regularization (although the accuracy only reaches 0.3 - would this keep going up with more regularization or has it saturated at this point? This needs to be shown.)
  2. The appendix details their experiments on trying to improve performance through ESS-informed initialization for GLA vs S6 models.
  3. They show how their metric correlates with the performance of distillation dependent on whether the state utilization of the teacher model was high or low (if it was high then it should be expected that the distillation should not be very successful if I understand correctly).

Methods and Evaluation Criteria

  1. The method to study sequence models' usage of their state space through a newly defined metric makes sense. It would be good, though, to get some intuition for why rank represents utilization of state for recurrent operators, since I am not from a signal processing background.
  2. The task used throughout the paper is synthetic - MQAR (with additional results in appendix for compression and copying - also synthetic).
  3. The paper reports correlations between their metric and quantities of interest, but something that is worrying is that the actual correlation values are pretty low (0.5–0.7 range), even if the trends in changes in correlation (with changes in models etc.) make sense.

Theoretical Claims

I was not able to check the proofs thoroughly since I am not very familiar with the literature in this area, and the paper's description of notation was confusing (e.g., what C and B are in Section 3 is not defined anywhere and has to be guessed).

Experimental Design and Analysis

I read the experiments section thoroughly, but I'm not very familiar with prior work that has shown results on the same tasks.

Supplementary Material

  1. The notation section.
  2. The section on regularization and initialization
  3. The plots showing how utilization changes across models for English and code sequences when separator tokens are introduced.

Relation to Existing Literature

Many prior studies focus on measuring memory capacity rather than utilization; ESS refines this by measuring actual usage rather than theoretical capacity. Also, prior work mostly used qualitative measures (e.g., attention visualizations, synthetic task accuracy). The introduction of ESS as a precise quantitative metric allows drawing insights in a more automated way.

Missing Important References

Other Strengths and Weaknesses

I found the visualizations of ESS around EOS tokens pretty interesting.

Other Comments or Suggestions

Author Response

We thank the reviewer for the positive feedback! Below, we respond to the points raised.

Claims and Evidence

  1. Yes, this is a good point; we will extend the regularization experiments to include parameters beyond the range in the plot and update the plot in the paper. We anticipate that at some point the performance will begin to decline, since continuing to increase the regularization strength corresponds to increasingly decaying the model towards linear attention, which, as shown in Section B, performs quite poorly.
  2. Yes, your understanding of how the ESS metric correlates with performance of distillation is correct. We will clarify this in the paper.

Methods and Evaluation Criteria

  1. For theoretical justification as to why the rank represents state utilization, we refer the reviewer to Theorem 3.2. For intuition, we offer the following interpretations of rank as it pertains to state utilization:
    • Distinct directions of influence: Through this lens, rank counts how many linearly independent “directions” connect past inputs to future outputs, i.e., how many unique ways inputs can shape the future.
    • Minimal internal memory: Because each independent direction requires its own coordinate in memory, the rank matches the smallest state dimension that can exactly replicate the operator.
    • Effective rank: Of course, in practice we compute effective rank to capture the “dominant” directions of interest, where “dominant” is determined either by some threshold (e.g. tolerance ESS) or by the decay rate of the singular values (e.g. entropy ESS); see the sketch after this list.
  2. Here, we address the concern posed by the reviewer regarding the correlation values being too “low”.
    • Having sufficient ESS/kv results in a non-linear effect: As shown in Figure 2, once ESS/kv exceeds a certain threshold for a given model, the majority of task-model configurations achieve an accuracy of 1. At this point, performance saturates, and further increases in ESS/kv do not translate into higher accuracy. This weakens the correlation, as correlation coefficients do not adequately capture such non-linear relationships. We chose not to omit these data points in our reported correlations, since the non-linear trend is clearly visible in the plot. However, we are happy to also report correlations with these saturated points omitted, should the reviewer find that informative.
    • On the interpretation of “low” correlation values: Our claims about correlation are relative rather than absolute. While it is fair to consider correlations in the range of 0.85–1.0 as “high,” our point is that ESS consistently correlates better with task performance than TSS. In this comparative context, we maintain that ESS is a significantly better proxy. This is relevant since most works consider only the TSS when evaluating model memory. Note that we do not claim that ESS is the best possible performance predictor—there may be stronger alternatives—but evaluating such alternatives is beyond the scope of this work.
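
As a rough illustration of the two effective-rank variants mentioned in point 1 above, here is a hedged sketch using a relative singular-value threshold and the exponential of the singular-value entropy as stand-ins for tolerance ESS and entropy ESS; the paper's exact definitions may differ, and the function names and tolerance value are illustrative.

```python
import numpy as np

def tolerance_ess(singular_values, tol=1e-3):
    """Count singular values above a fraction of the largest one
    (illustrative stand-in for a tolerance-based effective rank)."""
    s = np.asarray(singular_values, dtype=float)
    return int((s > tol * s.max()).sum())

def entropy_ess(singular_values):
    """Exponential of the entropy of the normalized singular-value
    distribution (illustrative stand-in for an entropy-based effective rank)."""
    s = np.asarray(singular_values, dtype=float)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

s = np.array([10.0, 5.0, 1.0, 1e-4, 1e-6])
print(tolerance_ess(s), round(entropy_ess(s), 2))   # e.g. 3 and ~2.3
```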

Clarifying notation

To answer your question in particular, $C$ and $B$ are analogous to the $Q$ and $K$ matrices in attention. In the context of recurrent models, they canonically represent the state-to-output and input-to-state matrices, respectively. However, we agree that this is unclear in the paper and we will clarify this upon revision along with analogous points raised by reviewers KvDY and KxgK.
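
A minimal sketch of this analogy, assuming a toy unnormalized linear-attention layer (our own illustration, not the paper's notation): the key plays the $B$-like input-to-state role and the query the $C$-like state-to-output role, and the recurrence matches the materialized causal operator.

```python
import numpy as np

# Toy unnormalized linear attention written as a recurrence: the key k_i writes
# the input into the state (the B-like, input-to-state role) and the query q_i
# reads the state out (the C-like, state-to-output role).
L, d_k, d_v = 8, 4, 3
rng = np.random.default_rng(2)
q = rng.normal(size=(L, d_k))
k = rng.normal(size=(L, d_k))
v = rng.normal(size=(L, d_v))

s = np.zeros((d_k, d_v))          # recurrent state
y_rec = np.zeros((L, d_v))
for i in range(L):
    s = s + np.outer(k[i], v[i])  # B-like: write input into state
    y_rec[i] = q[i] @ s           # C-like: read state out

# The same computation, materialized as the causal operator T[i, j] = q_i . k_j (j <= i).
T = np.tril(q @ k.T)
y_op = T @ v
assert np.allclose(y_rec, y_op)
```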

Review (Score: 3)

The paper introduces Effective State-Size (ESS) as a novel measure for quantifying memory utilization in causal sequence modeling architectures. ESS provides interpretable and actionable metrics that can enhance initialization strategies, regularizers, and model distillation. The paper develops a unified framework for analyzing systems with input-invariant and input-varying linear operators (LIVs), and demonstrates the correlation between ESS and performance across various tasks and models. Applications of ESS include model-order reduction, predicting model compressibility, and state modulation in large language models. The empirical validation shows ESS's utility in improving performance-efficiency trade-offs and highlights cross-architectural differences in memory utilization.

Questions for the Authors

NA

Claims and Evidence

  1. The introduction of Effective State-Size (ESS) as a quantitative measure of memory utilization is a novel and valuable contribution. It provides interpretable and actionable measurements that can enhance initialization strategies, regularizers, and model distillation.
  2. The paper explores multiple applications of ESS, such as model-order reduction, predicting model compressibility, and state modulation in large language models. These applications are well-explained and show the practical relevance of ESS.

Methods and Evaluation Criteria

The authors provide thorough empirical validation of ESS across various tasks and models. The correlation between ESS and performance is demonstrated, highlighting the utility of ESS in improving performance-efficiency trade-offs.

Theoretical Claims

The paper's theoretical sections, particularly the formal definition and computation of ESS, look solid, but may be challenging for readers who are not well-versed in dynamical systems theory. Simplifying these sections or providing additional explanatory material could improve accessibility.

Also, I understand the implementation details for computing ESS are discussed in the appendix, but I feel it would be great if pseudocode could be included in the main text.

Experimental Design and Analysis

The authors showed the correlation between ESS and accuracy on some synthetic tasks, which I understand were created specifically to conduct experiments in a controlled manner. However, most people in the field are more familiar with other real-world benchmarks. Does the same correlation extend to popular real-world benchmarks, e.g., machine translation (e.g., WMT), language understanding (e.g., GLUE, MMLU), question answering (e.g., SQuAD, PIQA), or image classification with Vision Transformers (e.g., ImageNet)? It would be great if the authors could include some of these common benchmarks in their experiments, which would strengthen this paper.

Supplementary Material

Skimmed through it but didn't check the details.

Relation to Existing Literature

I believe understanding the memory utilization is of interest to many researchers in the field.

Missing Important References

looks good to me

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

Thank you for your constructive review and positive outlook on the work!

Presentation and Accessibility

Thank you for pointing out that some theoretical sections may be inaccessible to many due to technical jargon. We will simplify those parts and add working examples of computing ESS, as Reviewer KxgK suggested. We will also simplify the introduction by stripping some of the more technical motivation in order to make it more accessible. In addition, we will distill Section 3.1 to make room to add the pseudocode for the implementation of ESS, as you have suggested.

Real-World Benchmarks

We agree that applying ESS to popular benchmarks could be valuable. However, we note that many real-world NLP benchmarks often test factual recall (i.e., returning accurate information from the pre-training/post-training data) and not working memory utilization. For example, the Mamba paper (Gu et al., 2024) shows strong performance on standard benchmarks (i.e., against Pythia models (Biderman et al., 2023) of the same scale) but performs poorly on tasks that test raw memory utilization, such as the phone book recall task (Jelassi et al., 2024).

For this reason, we focused on tasks that directly test memory recall and chose bigram perplexity (Arora et al., 2024) for evaluating the language models (Section 5.2). Bigram perplexity evaluates the model's perplexity on repeated bigrams in any arbitrary dataset. This allows us to use diverse and large-scale NLP pre-training datasets (that the model wasn’t trained on), as opposed to a narrow hand-crafted evaluation set, to specifically evaluate the model’s ability to utilize past context. In our case, we extracted 16k randomly selected sequences from the Fineweb dataset (Penedo et al., 2024). Therefore, we maintain that the controlled experiments we have conducted effectively convey the utility of ESS as a diagnostic tool for analyzing memory utilization.
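
For concreteness, here is a hedged sketch of the kind of repeated-bigram evaluation described above; the exact definition in Arora et al. may differ in detail (e.g., which bigram occurrences count and how per-token losses are obtained), and `repeated_bigram_nll` is an illustrative name.

```python
import numpy as np

def repeated_bigram_nll(token_ids, token_nll):
    """Average per-token negative log-likelihood over positions i whose bigram
    (token_ids[i-1], token_ids[i]) already appeared earlier in the sequence.
    Illustrative sketch only; token_nll[i] is assumed to come from the model."""
    seen, losses = set(), []
    for i in range(1, len(token_ids)):
        bigram = (token_ids[i - 1], token_ids[i])
        if bigram in seen:
            losses.append(token_nll[i])
        seen.add(bigram)
    return float(np.mean(losses)) if losses else float("nan")

# Repeated-bigram perplexity would then be exp(repeated_bigram_nll(ids, nll)).
```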

  • Gu, A., & Dao, T. (2024). Mamba: Linear-time sequence modeling with selective state spaces. arXiv. https://arxiv.org/abs/2312.00752
  • Jelassi, S., Brandfonbrener, D., Kakade, S. M., & Malach, E. (2024). Repeat after me: Transformers are better than state space models at copying. arXiv. https://arxiv.org/abs/2402.01032
  • Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S. S., Raff, E., Skowron, A., Sutawika, L., & van der Wal, O. (2023). Pythia: A suite for analyzing large language models across training and scaling. arXiv. https://arxiv.org/abs/2304.01373
  • Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., & Ré, C. (2023). Zoology: Measuring and improving recall in efficient language models. arXiv. https://arxiv.org/abs/2312.04927
  • Penedo, G., Kydlíček, H., Ben Allal, L., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., & Wolf, T. (2024). The FineWeb datasets: Decanting the web for the finest text data at scale.
Review (Score: 2)

This paper claims to propose a new metric, "effective state-size" (ESS), which can not only evaluate the memory utilization of different models but also inform the selection of initialization/regularization and distillation strategies. Several empirical results are presented to support the effectiveness of ESS.

Questions for the Authors

None

Claims and Evidence

  • Calculating the actual (as opposed to theoretical) information capacity using Singular Value Decomposition (SVD) and the rank of parameter matrices/hidden states has been a widely adopted metric among researchers and engineers for many years—its origins are difficult to trace.
  • Additionally, the relationship between singular values and the rank of parameter matrices/hidden states appears trivial and intuitive, raising questions about the necessity of the overly complex modeling.

Methods and Evaluation Criteria

  • Some aspects hold significance, yet there are technical details that necessitate further verification and elaboration.

Theoretical Claims

  • However, this paper claims to introduce ESS, an SVD-based metric, for evaluating model memory utilization. This approach appears highly similar to the common practice mentioned above. Simply renaming an established method and presenting it as a novel contribution is unlikely to be encouraged.

Experimental Design and Analysis

I have reviewed the relevant experimental sections, and the pertinent information can be found under the section labeled "Weakness."

Supplementary Material

yes

Relation to Existing Literature

None

Missing Important References

  • Wang, Shida, and Qianxiao Li. "Stablessm: Alleviating the curse of memory in state-space models through stable reparameterization." arXiv preprint arXiv:2311.14495 (2023).
  • Qi, Biqing, et al. "Smr: State memory replay for long sequence modeling." arXiv preprint arXiv:2405.17534 (2024).

Other Strengths and Weaknesses

Strengths:

  • Through the lens of ESS, this paper provides many comparisons between models with different architectures, such as attention/recurrent/convolution. These comparisons may bring insight into the design of new architectures in the future.

Weaknesses:

1. Over-claimed Contribution (which is also their primary claimed contribution)

  • Calculating the actual (as opposed to theoretical) information capacity using Singular Value Decomposition (SVD) and the rank of parameter matrices/hidden states has been a widely adopted metric among researchers and engineers for many years—its origins are difficult to trace.
  • However, this paper claims to introduce ESS, an SVD-based metric, for evaluating model memory utilization. This approach appears highly similar to the common practice mentioned above. Simply renaming an established method and presenting it as a novel contribution is unlikely to be encouraged.
  • Additionally, the relationship between singular values and the rank of parameter matrices/hidden states appears trivial and intuitive, raising questions about the necessity of the overly complex modeling.

2. Ambiguous Presentation and Inappropriate Typography

  • Certain metrics, such as ESS/kv and TSS/kv, are used before being properly defined.
  • The captions and explanations of figures are too simple, making it difficult for readers to follow the paper’s contributions.
  • The paper claims that ESS is valuable for initialization, regularization, and distillation strategies, highlighting its practical significance. However, no relevant empirical results are provided in the main body to support this claim.

Other Comments or Suggestions

See weakness.

Author Response

We appreciate your detailed feedback. Below, we address the main concerns.

Mischaracterization of Our Claims

Your review states that our work involves measuring the rank of parameter matrices or hidden states using SVD, a practice that has existed for years. However, we would like to clarify that this is a misrepresentation of our contribution, and we believe that the subsequent criticisms (that we are simply renaming an established method while proposing an overly complex approach to analyzing memory utilization) do not apply to the actual claims made in our paper.

We do not simply apply SVD to parameter matrices or “hidden states” (i.e., the input/output activations of a neural network layer); the former capture complexity only across the model’s channels and cannot be used to analyze memory utilization or capacity across the temporal dimension. Instead, the ESS metric is computed by applying SVD to specific submatrices of the materialized, flattened operator T in causal sequence models. These submatrices capture complexity across both channel and time dimensions, and we show that their rank provably lower-bounds the minimal state size of an equivalent linear recurrence, offering a theoretically grounded proxy for memory utilization in causal sequence models.

Moreover, prior approaches that apply SVD over the entire attention matrix (i.e., the operator T) typically ignore causal masking, which greatly influences the rank of T and distorts any interpretation related to memory usage. Our metric is explicitly designed to account for this, and we discuss this aspect of our contribution in the introductory section of the paper.

Importantly, we note that the underlying theoretical framework for computing minimal recurrent realizations stems from classical signal processing and control theory, and to the best of our knowledge, has not been adapted to analyze the memory utilization of modern deep learning sequence models such as input-varying SSMs, linear attention and softmax attention.

We summarize the core contributions below:

  1. We demonstrate that most modern deep learning causal sequence models (SSMs, convolutions, attention, etc.) can be cast as input-varying linear models (LIVs), which uniformly realize a materialized operator T. In doing so, we are not introducing an overly complex framework; rather, we show that diverse sequence models can be analyzed under a unified lens.
  2. Drawing from classical control theory, we prove that the rank of specific submatrices of T provides a lower bound on the minimal state-size of an equivalent SSM realization. Based on this result, we propose ESS as a principled and model-agnostic proxy for memory utilization.
  3. We empirically validate ESS using a wide range of modern sequence models and synthetic tasks that are explicitly designed to test memory utilization.
  4. Finally, we show how ESS can be used in practice to inform and analyze downstream tasks such as model distillation, initialization, and regularization, as well as provide insight into memory-related phenomena in language models.

We hope this clarifies our main claims and the novelty of our contributions.

Improvements on Presentation

We thank the reviewer for pointing out areas where the presentation can be improved. We will more clearly define metrics such as ESS/kv and TSS/kv (i.e., ESS normalized by the number of key-value pairs in a task), and clarify related notation earlier in the manuscript.

Regarding the experiments, ESS can be used to support a range of downstream analyses. First, we would like to clarify that the distillation results are in the main body of the paper (see Section 5.1). However, due to space constraints, the initialization and regularization experiments are presented in the appendix along with all relevant experimental details. We will improve the referencing of these results in the main text to ensure clarity and visibility. Also, we are happy to rearrange the paper to fit some of these results in the main body if the reviewer sees fit.

Final Decision

In light of the increasing popularity of new language model architectures, this paper proposes a new quantitative metric of memory utilization in general sequence models that can be represented as input-dependent linear transformations. This effective state size (ESS) metric relates to the effective state capacity of a recurrent model that can capture the given sequence transformation, and provides interesting new quantitative and qualitative analysis of the performance of recurrent sequence models. The proposed contributions increase the set of tools available to researchers studying and designing modern sequence models.