DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
We extend DeltaNet by using products of Householder matrices as state-transition matrices, allowing us to trade off expressivity and computational complexity.
Abstract
Reviews and Discussion
This paper addresses the fundamental trade-off between expressiveness and computational efficiency in State Space Models (SSMs). While diagonal SSM blocks are computationally tractable but limited in expressive power, and unstructured transition matrices offer full expressiveness at prohibitive computational costs, the authors propose a novel middle-ground approach. Their solution involves taking several rank-one steps for each token, effectively using low-rank transition matrices with rank equal to the number of steps.
The paper provides theoretical results proving the well-posedness and expressive power of the new SSM on state tracking tasks, supported by extensive experiments comparing against state-of-the-art competing architectures.
Strengths and Weaknesses
Strengths
- Thorough Theoretical Foundation: The paper provides comprehensive theoretical backing for the proposed approach, with formal proofs of well-posedness and expressive power.
- Extensive Experimental Validation: The experimental section offers comprehensive comparisons with state-of-the-art architectures.
- Clear Problem Motivation: The paper tackles an important and well-defined problem in the SSM literature with a principled approach.
- Transparency: The authors are transparent about both the limitations and strengths of their approach.
Major Weaknesses
- Errors in Main Proofs: There are errors in key theoretical results that need correction, though these appear to be fixable without changing the main conclusions.
- Inconsistent Proof Quality: Some proofs are hastily written and inconsistent with the rest of the paper, making them very difficult to follow and verify.
- Experimental Fairness Concerns: The use of identical hyperparameters and training setups across different models without justification raises questions about fair comparison.
Questions
Notation and Consistency Issues
- Line 97: The definition "yi = x1 · x2 · · · xi" is inconsistent with parts of the appendix where "yi = xi ... x1". For non-abelian products, this distinction is crucial. Please maintain consistency or clearly indicate when different conventions are used.
Direct Parametrization Questions
- Lines 184-188: You have a couple of lines on direct parametrization of the low-rank structure. It is true that in the presented case (line 188) the matrix is restricted to be symmetric, but is there something blocking you from choosing the second ki to be different from the first? In that case there would not be any imposed symmetry. Is there some reason why you did not consider this?
Appendix Issues
Proposition 1
- Line 1034 (CRITICAL): In computing D explicitly, you appear to have omitted the term -4det(A). Fortunately, since (2-b1-b2)² - 4(1-b1)(1-b2) = (b1-b2)², the conclusion still holds. With this correction, you could derive explicit bounds on cos(θ).
Unclear Statements and Missing Information
- Line 1064 (Pedantic): Can this really model ANY product of Householders, or do restrictions on operator norm apply? This may be inconsistent with later comments.
- Line 1069: "The assumption on H0" - which specific assumption are you referring to?
- Line 1073: In the RMS formula, the xi terms need to be squared.
- Lines 1078-1082: Please clarify this section as it's currently difficult to follow.
- Line 1084: "H_t^j = ei" - the indices i and j are not quantified. Are they related? Can they be any possible tuple?
Theorem 3 Issues
- Lines 1117-1119: Please expand for clarity. This is where the definition of m is crucially used.
- Lines 1120-1127 (CRITICAL): This section requires substantial clarification, as it is a crucial part of the proof but appears hastily written:
  - Line 1120: What do you mean by "with enough information in xi about previous tokens"?
  - Lines 1121-1122: Please place this within your established framework (A, B, k, v, etc.). Why is x bold now? What are the matrices "P_{lm+i}"? These haven't been defined. Aren't the "G_{l,m}...G_{l,i+1}" functions of tokens coming after the current one? How is this information encoded in the tuple (Hi, xi)?
- Line 1128: The 'a' should not be bold here.
Lemma 2 Concerns
- Line 1141: Use clearer set notation: write "2-d ≤ i ≤ 0" instead of "i ∈ {2-d, ..., 0}".
- Line 1142: Where does the bold 'a' come from? Is this the tuple (t mod d, a_{t-d+1}, ..., at)?
- Line 1145: There is no j>1 since nh = 1.
- Line 1146 (Important): While claiming that the system can learn any function on the specified domain is technically correct, it seems somewhat disingenuous. What you actually prove is that your recurrence can memorize input up to a certain time, while learning is performed by a universal approximator in the decoder (e.g., MLP). Do you believe this occurs in practice, or can your system perform meaningful computation independently of decoder universality? This deserves discussion.
Theorem 4 Issues
- Line 1152: Which specific isomorphism are you referring to?
- Line 1153: "that" appears to be a typo.
- Lines 1159-1161: Please cite this result - it's unlikely the intended audience is familiar with it.
- Lines 1162-1164: Why recover y only at the last step? Shouldn't it be recovered at all steps according to your state tracking problem definition?
Lemma 3
- Line 1198: "exist" should be "exists".
- Lines 1215-1229: Same critical issues as found in Theorem 3.
Theorem 6
- Lines 1247-1248: There are two (ii) labels and no (i); line 1256 should be (i).
- Line 1252: Consistency issue with "yi = x1 · x2 · · · xi" as discussed above.
- Line 1265: Is the angle clockwise or counterclockwise? To be pedantic, the line passes from the origin to the POINT (1,0); this line is, in some sense, the vector (1,0) itself.
- Lines 1268-1269: Mismatch between j and i indices in the last two θ expressions.
- Line 1270 (Important): When referencing "standard identities of products of 2D rotations and reflections," clarify that these are matrix representations of the group as given in equation (5). You should consider placing this paragraph (1270-1275) at the beginning for clarity, then introduce the system. In general I believe this theorem lacks a paragraph giving intuition, explaining what d_0 and c_0 are supposed to represent, etc.
- Lines 1271-1272: First line - d0 is not bold. Second line - need to prove "H(π/m)d0 = R(π/m)d0"; avoid ambiguous ± notation.
- Lines 1273-1274: "d0 = H(π/m)c0" appears true but unproven. "R(-2iπ/m)d0 = H(-2iπ/m)d0" is not proven.
- Lines 1274-1275 (Critical Error): There appears to be an error in the last line: θ(s_j, 0) - θ(s_{m-i}, 0) = (-2m + 2(i+j))π/m, since (2+2j) - (2+2(m-i)) = 2j - 2m + 2i = -2m + 2(i+j). The expression in R should be (-2m + 2(i+j))π/m instead of (2 + 2(i+j))π/m.
Additional Issues
- Line 1292 (Pedantic): "When this property is not satisfied, the norm of the state will diverge exponentially fast to infinity." Does this not depend on how fast the norms diverge?
- Lines 1375-1377 and 1407-1411: Using identical hyperparameters for different models without justification raises fairness concerns. How do you ensure fair comparison?
Recommendation
This paper tackles an important problem with substantial theoretical and empirical evidence supporting it. The approach is well-motivated and the experimental work is comprehensive. However, the apparent mistakes in the main proofs (particularly in Proposition 1, where the issue is not a mere typo) and the unclear arguments in Theorem 3 and Lemma 3 prevent acceptance of the paper in its current form.
Path to Acceptance: Please correct the identified errors in the proofs (or prove me wrong) and clarify the murky arguments in Theorem 3 and Lemma 3. With these corrections, I would raise my recommendation to at least a 5, as the underlying contribution is valuable and the work is otherwise very well-executed.
Limitations
yes
Final Justification
This paper tackles an important problem with substantial theoretical and empirical evidence supporting it. The approach is well-motivated and the experimental work is comprehensive. Mistakes pointed out before rebuttal have been addressed and corrected.
As the contribution is valuable I raise my score and recommend acceptance.
Formatting Issues
None
We greatly thank the reviewer for the thoughtful and thorough review and apologize for the lack of clarity, typos, and small mistakes in the theoretical sections. We modified our draft to incorporate the reviewer's feedback, which greatly improved the clarity of the paper. After careful checking, and as the reviewer points out, none of our theoretical results change. Due to space constraints, we have to be concise in our response, but we will be happy to provide any clarification during the discussion period.
Direct Parametrization Questions
Lines 184-188: The issue discussed in that paragraph concerns bounding the norm of the recurrence. If we choose the second key to be different from the first, we cannot in general guarantee the stability of the recurrence, as explained in Appendix B.5 for RWKV-7, which also has two different keys. Please also see our first response to Reviewer HNXg, where we expand on the expressivity of RWKV-7 products.
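To make the contrast concrete, here is a minimal numerical sketch (illustrative values, not taken from the paper): a generalized Householder factor I − β k kᵀ with unit k and β ∈ [0, 2] is always a contraction, whereas a factor of the form I − β k₁ k₂ᵀ with two different keys (the situation discussed above) can have operator norm above 1, and the spectral radius of a product of two such factors can exceed 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def unit(v):
    return v / np.linalg.norm(v)

# Symmetric case (same key): I - beta * k k^T with beta in [0, 2] has operator norm <= 1,
# so any product of such factors stays contractive.
k = unit(rng.standard_normal(d))
H = np.eye(d) - 1.9 * np.outer(k, k)
print("||I - beta k k^T||_2 =", np.linalg.norm(H, 2))

# Two different keys: the factor need not be a contraction, and the spectral radius of a
# product of two such factors can blow up, so stability is no longer guaranteed.
e1, e2 = np.eye(d)[0], np.eye(d)[1]
A = np.eye(d) - 2.0 * np.outer(e1, e2)
B = np.eye(d) - 2.0 * np.outer(e2, e1)
rho = max(abs(np.linalg.eigvals(A @ B)))
print("spectral radius of (I - 2 e1 e2^T)(I - 2 e2 e1^T) =", rho)  # ~5.83 > 1
```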
Appendix Issues
Proposition 1
Thank you for identifying this error in Proposition 1. Our corrected proof shows that real eigenvalues are guaranteed if at least one is in by proving the discriminant where and is the subspace spanned by the vectors , .
- If one and the other is in , then , which ensures .
- If both , the AM-GM inequality shows , which implies and thus . Hence complex eigenvalues only arise when both .
Following your suggestion, we computed explicit bounds on for which complex eigenvalues occur:
For the special case of two reflections (), this simplifies to , which holds whenever the keys are linearly independent but not orthogonal.
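As an empirical sanity check of the corrected proposition (a sketch we add here for illustration; it assumes the generalized Householder form I − β k kᵀ with unit-norm keys and β ∈ [0, 2]), the following snippet samples products of two such matrices and records when complex eigenvalues appear; in our runs they occur only when both β values exceed 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
complex_betas = []

for _ in range(20000):
    b1, b2 = rng.uniform(0.0, 2.0, size=2)
    k1 = rng.standard_normal(d); k1 /= np.linalg.norm(k1)
    k2 = rng.standard_normal(d); k2 /= np.linalg.norm(k2)
    H1 = np.eye(d) - b1 * np.outer(k1, k1)
    H2 = np.eye(d) - b2 * np.outer(k2, k2)
    eig = np.linalg.eigvals(H1 @ H2)
    if np.max(np.abs(eig.imag)) > 1e-6:      # numerically complex eigenvalues
        complex_betas.append(min(b1, b2))

print(f"complex cases: {len(complex_betas)} / 20000")
if complex_betas:
    # In every complex case the *smaller* beta was still above 1, matching the corrected proposition.
    print("smallest beta seen in any complex case:", round(min(complex_betas), 3))
```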
Unclear Statements and Missing Information
Line 1078-1082: We agree that this section was difficult to follow. The general idea is that, for the decoder to support our theoretical constructions, the map from all possible states of the RNN and the inputs to the decoder must be injective, so that these can be distinguished by the MLP. There are two challenges to making this happen. The first is the dimensionality reduction via the scalar product with the query vector: the vector-valued state that we generally use would be compressed to a scalar after this operation. The second is the RMSNorm operation, which, in the case of a scalar input and ε = 0 (in practice, also if ε is too small), gives a binary output, so no matter how many states our RNN has, we only have two different inputs to the MLP. Since the number of states used in our constructions is always finite, by setting the query appropriately and ε > 0 (in practice, not too small) we can still guarantee injectivity despite these bottlenecks. We will include a proof of this fact in the revised paper and will be happy to clarify this further during the discussion period if needed. We also noticed a small mistake: the vector should be defined as **b** = (b, b², ..., b^d)^⊤, with b chosen sufficiently large relative to the finite set of values of each element of the RNN state matrix.
Line 1064: You are correct. The restriction on operator norm applies here. We’ll rephrase this in the updated version.
Line 1069: We will explicitly state which assumption: “Alternatively, the assumption on (task-dependent with rank at most )”.
Line 1084: j is the index of the head, which is not related to the content of the hidden state. We will rewrite this to make it clearer.
Line 1073, Line 1128: Thank you, we corrected it.
Theorem 3
We agree with the reviewer that this section was not clear. We’ll expand and clarify it in the revised version. The main idea of the proof is to divide the tokens into blocks of matrices. The recurrence of the last layer computes the decomposition for the tokens in the previous blocks, while the current block is computed via the last layer decoder. We’ll be happy to provide further clarification and possibly the complete proof during the discussion period if needed.
Line 1117-1119: We’ll rewrite as “We divide the input sequence into blocks of elements: we factorize the position into where is the index of the previous block and is the position within the current block (index ).”
Line 1120: We explain what we mean by “with enough information in xi about previous tokens” later in the text (we’ll clarify this): each token of the last layer should contain the position modulo 2m and the last 2 elements of the input sequence: including both the (partial) current and previous blocks. is now bold because it is the input of the last layer, which can be a vector, as opposed to the input to the model. It should be , not , we’ll fix it. The matrices are not functions of the tokens coming after the current ones because they are part of the “previous block” of tokens. The current block of permutations is computed entirely in the decoder via lookup-table.
Notation and Consistency Issues
Line 97: Thanks for pointing this out, we will make the main paper consistent with the appendix by reverting the order, i.e. using .
Lemma 2
Line 1146: You are right to highlight the role of the decoder. Our theoretical constructions (like RWKV-7's) rely on the universality of the MLP, whose required size grows with the complexity of the task (e.g., the group size). We will clarify this. The MLP complexity for complex groups with small values of is quite high, since the lookup table that it has to learn grows exponentially with . Indeed, in our experiments (Figure 5, bottom row), the word problem cannot be learned when even with 10 layers. However, it might be possible to learn the construction if is closer in value to .
Line 1141, 1142, 1145, 1153, 1198, 1247-1248, 1252, 1268-1269: Thank you, we agree and we fixed the typos in our manuscript.
Line 1265: The direction of the angle does not matter for the proof. We will change “vector” to “point” as suggested.
Theorem 4
Line 1152: We will change the word isomorphism to the word assumption. From the context, it will be clear that we are talking about the assumption of the group being isomorphic to the orthogonal or rotation groups.
Line 1159-1161: The fact that we only need products of Householder matrices to represent all elements in when is even, is a consequence of the Cartan-Dieudonné theorem, which we already mention since it explains DeltaProduct’s expressivity, and the fact that now we are focusing on orthogonal transformations with determinant +1. The determinant of the product of matrices equals the product of the determinants and each Householder has determinant -1. Hence, an even number of Householders cannot be in . We’ll add this clarification.
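A small numerical illustration of the determinant argument above (a sketch with randomly chosen reflections, added here for clarity; not part of the paper): each Householder reflection has determinant −1, so a product of n_h of them has determinant (−1)^{n_h}, and only products with even n_h can have determinant +1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

def householder(n):
    k = rng.standard_normal(n)
    k /= np.linalg.norm(k)
    return np.eye(n) - 2.0 * np.outer(k, k)  # full reflection, determinant -1

for nh in range(1, 5):
    P = np.eye(n)
    for _ in range(nh):
        P = P @ householder(n)
    # det(P) = (-1)^nh, so the product has determinant +1 exactly when nh is even.
    print(f"nh = {nh}: det = {np.linalg.det(P):+.3f}, orthogonal = {np.allclose(P @ P.T, np.eye(n))}")
```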
Lines 1162-1164: The same reasoning can be applied at any step (not only the last one), we will change the index from t to i to highlight this.
Lemma 3
Line 1215-1229: The proof for this will follow the same structure as our clarified proof for Theorem 3, and we will refer to it to maintain conciseness.
Theorem 6
Line 1270: We have added an introductory part to the proof explaining the geometric intuition and defining the matrix representations of reflections and rotations upfront.
Line 1271-1272: We’ll ensure that is always bold. The unproven equation can be directly derived from the definition and the definitions of rotation and reflection matrices, given that the first column of a rotation matrix is identical to that of a reflection matrix (based on our notation).
Lines 1273-1274: We do not use that equality but just use the definition here. See the above point for the proof of the second equation.
Lines 1274-1275 (Critical Error): You're right, that was a typo; your expression is correct. The −2m term can be removed due to the 2π-periodicity of the rotation matrix, yielding the desired result R(2(i+j)π/m). We'll fix this in the revised version.
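For completeness, here is a small numerical check of the 2D identities used in this step (our sketch; the angle convention below is an assumption and may differ from the one in the paper): a product of two line reflections is a rotation by twice the angle between the lines, and rotations are 2π-periodic, which is why the −2π contribution above can be dropped.

```python
import numpy as np

def R(theta):  # rotation by theta
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def H(theta):  # reflection about the line through the origin at angle theta
    c, s = np.cos(2 * theta), np.sin(2 * theta)
    return np.array([[c, s], [s, -c]])

a, b = 0.7, 0.2
# Applying H(b) and then H(a) rotates by 2*(a - b).
assert np.allclose(H(a) @ H(b), R(2 * (a - b)))
# Rotations are 2*pi periodic, so a -2*pi term in the angle can be dropped.
assert np.allclose(R(a - 2 * np.pi), R(a))
print("2D reflection/rotation identities verified")
```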
Additional Issues
Line 1292: Correct. We will revise to remove "exponentially fast".
Lines 1375-1377 (Chomsky Hierarchy): In our experiments, we compared against the results by Grazzi et al. 2025 for DeltaNet, Mamba and mLSTM / sLSTM. Our hyperparameters were chosen based on the ones found to produce the best results on the state-tracking experiments.
Lines 1407-1411 (Language Modeling): Due to compute budget constraints, we decided to use the hyperparameters used by Yang et al. 2024 for similar experiments.
References:
Grazzi, R., Siems, J., Zela, A., Franke, J. K., Hutter, F., & Pontil, M. Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues. ICLR. 2025
Yang, S., Wang, B., Zhang, Y., Shen, Y., & Kim, Y. (2024). Parallelizing linear transformers with the delta rule over sequence length. Advances in neural information processing systems, 37, 115491-115522.
Thank you for the extensive response, you seem to have addressed all the issues raised. Before the final decision, it would be really helpful to have access to the revised proofs. Would you be able to provide them in any manner? I'm thinking chiefly about Theorem 3.
Given the constraints of the rebuttal, we think the only way to provide the proofs is in the comments. We would like to note that this requires re-formatting due to the absence of macros and the different syntax, which might produce typos. In the following comments we provide the proof and statement of Theorem 3, the proof of Proposition 1 case 3 and the first part of the Decoder implementation section. Please let us know if other parts would be helpful.
Theorem 3.
For any there exists a DeltaProduct model with one of the following configurations that can solve the word problem of the symmetric group : (i) one layer with (ii) 3 layers with (iii) 4 layers with .
The construction for (ii) and (iii) requires that the MLP at the second last layer computes a lookup-table of size , function of the last input tokens and the position modulo with .
Proof. One way to solve the group word problem for the symmetric group is to map each element of the group to the corresponding permutation matrix and then for each input sequence with compute each element of the output sequence as
where is a surjective map from vectors in to the elements of , which we consider integers for simplicity, i.e., .
(i). Since a permutation of elements is a series of at most swaps of elements, if , then we can solve the problem with a 1-layer DeltaProduct by setting , , (), . The latter is achieved by setting for the -th element in the product ), either and with being the -th element of the canonical basis of (swap element at index with the one at index ), or (identity).
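(Aside, not part of the formal proof: a small numerical illustration of point (i) with an arbitrary permutation of 5 elements. It uses the standard choice β = 2 and k = (e_a − e_b)/√2 for a swap, which we take to be the construction described above; the specific permutation is chosen only for illustration.)

```python
import numpy as np

def swap_householder(m, a, b):
    """Transposition (a b) as a generalized Householder I - beta*k*k^T with beta = 2, k = (e_a - e_b)/sqrt(2)."""
    k = np.zeros(m)
    k[a], k[b] = 1.0, -1.0
    k /= np.sqrt(2.0)
    return np.eye(m) - 2.0 * np.outer(k, k)

def perm_matrix(perm):
    m = len(perm)
    P = np.zeros((m, m))
    P[perm, np.arange(m)] = 1.0  # column j is e_{perm[j]}
    return P

def transpositions(perm):
    """Decompose the permutation j -> perm[j] into at most m-1 transpositions via its cycles."""
    seen, swaps = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, j = [start], perm[start]
        while j != start:
            cycle.append(j)
            j = perm[j]
        seen.update(cycle)
        # (c0 c1 ... ck) = (c0 ck)(c0 c_{k-1}) ... (c0 c1), rightmost factor applied first
        swaps.extend((cycle[0], c) for c in reversed(cycle[1:]))
    return swaps

perm = [2, 0, 3, 1, 4]  # an element of S_5, written as j -> perm[j]
A = np.eye(len(perm))
for a, b in transpositions(perm):
    A = A @ swap_householder(len(perm), a, b)

print("Householder factors used:", len(transpositions(perm)), "(at most m - 1)")
print("product equals the permutation matrix:", np.allclose(A, perm_matrix(perm)))
```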
(ii) and (iii). If , then the state transition matrix is not sufficiently expressive to represent all permutations of elements. However, we can use additional layers to overcome this issue as follows. We divide the input sequence into blocks of elements: we factorize the position into where is the index of the previous block and is the position within the current block (index ). First, consider the case when . Let be the product of the permutations of the previous block. Since is a permutation matrix of elements, we can factor it into where we recall that and each of is a product of generalized Householder matrices and thus can be a state-transition matrix of our model. We fix one factorization for each possible permutation matrix and we set , with to handle the case when .
Now let be the input of the last layer. If contains enough information about previous tokens (as we specify later), we can set the recurrence and decoder of the last layer as
where , , , using the construction at point (i), since is a product of at most Householders. Note that contains the product of the input permutations only up to token and a partial computation of the previous block of permutations . Hence, the decoder completes the computation by applying two additional components: (1) the remaining transformations through needed to complete , and (2) the actual permutations from the current partial block through .
Proof of Theorem 3 — Part 2
We can check that this ends up computing the correct output by substituting the expression for and unrolling the recurrence as follows.
Note that to compute and , should contain and the last (in general the last ) tokens, corresponding to the current and previous block. Hence, the layers before the last one are dedicated to compute at each time-step a lookup table for the possible values of whose output will be included in the input of the last layer . The first layers (two layers if , one if ) can provide by using Lemma 2 with . Finally, the second to last layer can output any function of the last tokens and the position modulo through Lemma 3 with and , by using from the first layer(s).
End of Proof of Theorem 3
Proposition 1
Proof of Case 3: Complex Spectrum via Non-orthogonal Directions and Real Subcase
Let span a 2D subspace with such that . The product acts as the identity on (preserving eigenvalues at 1) and non-trivially on . The restriction of to has trace and determinant . The discriminant of its characteristic equation is . Complex eigenvalues arise if .
To find the explicit bounds, we expand the inequality :
Rearranging this expression as a quadratic in yields:
This inequality holds if and only if lies strictly between the two roots of the corresponding equation. Solving for the roots gives the explicit bounds for complex eigenvalues:
This inequality is only satisfiable when . For the special case of two standard reflections where , the condition simplifies to , confirming that the product of any two distinct reflections is a rotation.
Conversely, we show that if at least one coefficient , the eigenvalues are real as . We analyze this by cases:
Case 1: One coefficient is in , the other is in . Without loss of generality, let and . This implies and , so their product . The term is therefore non-negative. Since , their sum must be non-negative.
Case 2: Both coefficients are in . By the AM-GM inequality on the non-negative terms and , we have . Since includes an additional non-negative term , it also holds that . Squaring both sides gives , ensuring .
This analysis confirms that complex eigenvalues, which enable rotations, can arise if and only if both and . When at least one , real eigenvalues restrict the transformations in that subspace to scaling or reflection.
Decoder implementation.
In our implementation, is the same as in Gated DeltaNet:
where are two learned matrices, , corresponds (when ) to a projection onto an axis-aligned ellipse, where is learnable and determines the lengths of the axes, and is a small value set to avoid numerical errors. Meanwhile, is a two-layer MLP. This structure can be limiting. For instance, when , then , and if , then , which means that after RMSNorm we are left with effectively only 2 possible values, while our constructions might require more: as many as the number of possible states. Indeed, for all our constructions to work in practice, we would need a sufficiently large , so that the output of can retain some magnitude information. Since in our constructions the number of states is finite, with and an appropriate value for we are guaranteed that the map , and consequently (by appropriately setting ) the map from states and inputs to the input of the MLP, is injective, which we show in the next lemma. Hence, thanks to the MLP, the decoder can approximate any continuous function of even after the bottlenecks caused by the scalar product with and the RMSNorm.
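The RMSNorm bottleneck described above is easy to see numerically. The sketch below (ours, using a plain RMSNorm y = x / sqrt(mean(x²) + eps), which we assume matches the normalization discussed; the inputs are illustrative) shows that with eps = 0 a scalar input collapses to ±1, while a sufficiently large eps keeps the map strictly monotonic so distinct states remain distinguishable.

```python
import numpy as np

def rmsnorm(x, eps):
    return x / np.sqrt(np.mean(x**2) + eps)

states = np.array([-3.0, -1.0, 0.5, 2.0, 4.0])  # distinct scalar inputs to the decoder

# eps = 0: every nonzero scalar maps to sign(x), so only two distinct values survive.
print([round(float(rmsnorm(s, 0.0)), 3) for s in states])

# larger eps: the map stays strictly monotonic in x, so magnitude information is retained.
print([round(float(rmsnorm(s, 25.0)), 3) for s in states])
```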
Lemma 1. Let be a finite set of values. Define and . Let and set . If satisfies , then the mapping given by is injective.
Proof. The mapping is a composition , where the component functions are: where . The overall mapping is injective if each component is injective.
1. Injectivity of : Intuitively, encodes the vector as a scalar: a number in base , where each acts as a "digit" drawn from the finite set . By choosing sufficiently large relative to the spread of (captured by ), we ensure that different vectors produce distinct scalars. To prove this formally, we show that for any non-zero difference vector , the difference is also non-zero. Let . The magnitude of the highest-order term is lower-bounded by: The magnitude of the sum of lower-order terms is upper-bounded as where we used the triangle inequality and the assumption on , which implies . Comparing the bounds, since the magnitude of the highest-order term is strictly greater than that of the sum of all other terms, their sum cannot be zero. Thus, is injective.
2. Injectivity of and : The function is a linear scaling by the non-zero constant and is therefore injective. For , its derivative is strictly positive for , meaning is strictly monotonic and also injective.
Since , , and are all injective, their composition is also injective.
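A brute-force check of the injectivity claim (our sketch; the value set, dimension, and the "large enough" base are illustrative choices, and the precise condition on b is the one stated in the lemma):

```python
import itertools
import numpy as np

values = np.array([-1.0, 0.0, 0.5, 1.0])   # finite set of possible entries of the state/input vector
d = 4                                       # vector dimension
spread = values.max() - values.min()
b = 10.0 * (1.0 + spread)                   # base chosen "large enough" in the spirit of the lemma

weights = b ** np.arange(1, d + 1)          # encode x as sum_i b^i * x_i (a base-b "digit" expansion)
codes = {tuple(x): float(weights @ np.array(x)) for x in itertools.product(values, repeat=d)}

# If the number of distinct codes equals the number of vectors, the encoding is injective.
print("vectors:", len(codes), "distinct codes:", len(set(codes.values())))
```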
Theorem 3.
i)
From how I understand it, you are trying to show that a system of the type defined in Sec. 4, lines 156-162, can be constructed such that . The strategy is to write the permutation as a sequence of transpositions (order matters here), each of which can be written in matrix form as for vectors of the type (here I am making all dependencies explicit).
What I'm missing here is how you choose the right and with a map of the type . I see a way of doing it by fixing the sequence of transpositions a priori, sending into a space in spirit equal to in order to encode this ordering of transpositions, and then using the to choose the -th transposition and encode it, with the identity. Is this similar to what you had in mind? If I understand correctly, this construction is assumed for all the parts of the proof.
ii) and iii)
I believe that for a first-time reader it could be useful to have a comment giving intuition on why there is a "delay" in the construction of . By this I mean that is constructed from by using , which is a component of the "previous" block.
Lemma 1.
- Once you have you could simply conclude by Lagrange's bounds on polynomial roots (https://en.wikipedia.org/wiki/Geometrical_properties_of_polynomial_roots), which implies your result.
Thank you again for taking the time to check the details of the updated theorems.
Theorem 3
i) That is correct, if for example is a one-hot vector (where the dimension accounts for all possible swaps of elements and the identity) then it can select the appropriate columns of that will results in the right decomposition for the permutation . We also note that the construction holds not only if is the identity, but also when each component of the image of (which is applied elementwise) includes an interval that contains zero in its interior, since this implies that by setting appropriately , can be any point on the sphere of unit radius. This is the case for the SiLU which is used in our experiments.
ii and iii) Thank you for the suggestion. Right above “We can check that this ends up computing the correct output ” we will add the comment “The delay in the recurrence is necessary, since to compute even the first matrix of the factorization for a block of elements of the input sequence, all the elements in such a block need to be processed.”
Lemma 1
Thanks for the pointer to the Cauchy upper bound for polynomial roots. We will add it to the lemma.
DeltaProduct extends DeltaNet by performing rank-n updates instead of rank-1, yielding state-transition matrices as products of generalized Householder transformations. This creates a principled mechanism to interpolate between diagonal matrices (efficient) and more expressive transformations while maintaining stability through an operator norm less than 1. The authors prove that DeltaProduct can solve symmetric group word problems with improved trade-offs and that Gated DeltaProduct recognizes any regular language with finite layers.
Experimentally, DeltaProduct demonstrates superior performance on state-tracking tasks involving permutation groups and achieves dramatically improved length extrapolation in language modeling compared to DeltaNet, with minimal performance degradation on sequences beyond training context. The effective rank analysis reveals that multiple Householder products enable more efficient state forgetting mechanisms. The work provides both theoretical insights and practical improvements in synthetic reasoning and real-world language modeling tasks.
Strengths and Weaknesses
This paper demonstrates exceptional scientific rigor with a clear, meaningful contribution that advances linear RNN research. The theoretical foundation is mathematically sound yet accessible through intuitive geometric explanations. The experimental validation is comprehensive, effectively combining synthetic state-tracking tasks with realistic language modeling benchmarks at an appropriate scale. The work exemplifies proper incremental research, extending DeltaNet in a principled direction that yields genuine improvements. The related work section is particularly thorough and well-organized, while the authors' commitment to reproducibility strengthens the work's impact. I consider the presentation quality to be outstanding throughout.
Questions
I'm curious about a FLOPs comparison in the scaling analysis in Figure 10. To match parameters, the number of heads was reduced, but how do these variants compare w.r.t. FLOPs?
Did DeltaProduct display any training instabilities?
In Figure 5 bottom 2nd and 4th plots, an increased number of layers seems to hurt performance on the state tracking tasks. You addressed it, but isn't this unexpected? Is there an explanation? Also, the number of layers is not consistent in that row, which reduces the readability of the figure as a whole.
In Figure 9, the Gated DeltaNet effective ranks are extremely jagged. Why is this?
Limitations
yes
Final Justification
The work is convincing, and after seeing the other reviews, I'm even more certain about it. This work should be accepted.
Formatting Issues
line 42 should be an em-dash (https://en.wikipedia.org/wiki/Dash)
We thank the reviewer for the positive feedback.
I'm curious about a FLOPs comparison in the scaling analysis in Figure 10. To match parameters, the number of heads was reduced, but how do these variants compare w.r.t. FLOPs?
This is a great question. The FLOPs of the recurrence scale linearly in such that the complexity is (Yang et al., 2024), where is the sequence length, is the chunk size and is the dimension of keys and values. The recurrence computation also requires sequential steps. However, in our setup the total FLOPs of the model are dominated by the large matrix multiplications in the input/output projections and the MLP, rather than the recurrence itself, and so, the contribution of the recurrence to total model FLOPs is minor.
In our parameter-matched scaling experiments (Figure 10 and Table 4), we adjusted the number of heads to equalize parameters. For instance, at the ~800M scale, we compare DeltaNet (16 heads) with DeltaProduct (12 heads). The increase in FLOPs from the additional key/value projections and the doubled recurrence in DeltaProduct is largely offset by the reduction in FLOPs from having smaller query, key, and value projections due to the fewer number of heads. Our results in Figure 13 in Appendix C.4 also support this. There we can see that for non-parameter matched models, the training throughput scales sub-linearly in . We conducted additional throughput experiments with an improved model implementation that fuses the projection layers and causal 1d-convolutions. Here we find that DeltaProduct with 805M parameters achieves 86% (compared to 75% without fusing projections and causal 1d-convolutions) of the training throughput of DeltaNet with 805M parameters at 4096 sequence length and a per-device batch size of 12 (on a NVIDIA H100-94GB). Moreover, we expect this to further improve with an optimized custom triton kernel (we are using DeltaNet’s).
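As a purely illustrative back-of-the-envelope sketch of this head-count trade-off (our assumptions for this comment only: made-up dimensions, and only the per-token projection matmuls are counted, treating DeltaProduct as adding n_h key/value/beta projections; the real model configurations, the recurrence term, and the MLP are not reflected here):

```python
# Rough per-token projection FLOPs (2 * in_dim * out_dim per matmul). The recurrence,
# which scales with n_h but is comparatively small, and the MLP (identical in both) are omitted.
def proj_flops(d_model, n_heads, head_dim, n_h):
    qkv_width = n_heads * head_dim
    q = 2 * d_model * qkv_width                                            # one query projection
    kv_beta = n_h * (2 * 2 * d_model * qkv_width + 2 * d_model * n_heads)  # n_h keys, values, betas
    out = 2 * qkv_width * d_model                                          # output projection
    return q + kv_beta + out

d_model, head_dim = 2048, 128                  # illustrative only, not the paper's configuration
deltanet = proj_flops(d_model, n_heads=16, head_dim=head_dim, n_h=1)
deltaproduct = proj_flops(d_model, n_heads=12, head_dim=head_dim, n_h=2)
print(f"DeltaProduct / DeltaNet projection FLOPs ratio: {deltaproduct / deltanet:.2f}")
```

Under these assumed numbers the extra n_h projections are partly compensated by the narrower per-head projections, which is the kind of accounting behind the "largely offset" statement above.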
Did DeltaProduct display any training instabilities?
No, we did not encounter any training instabilities with (Gated) DeltaProduct. We attribute this to the inherent stability of the recurrence, as the operator norm of the transition matrix is bounded by 1, which keeps the gradients well-behaved. The training loss curves for in Figure 14 of the supplementary material demonstrate this stability empirically.
In Figure 5 bottom 2nd and 4th plots, an increased number of layers seems to hurt performance on the state tracking tasks. You addressed it, but isn't this unexpected? Is there an explanation? Also, the number of layers is not consistent in that row, which reduces the readability of the figure as a whole.
Indeed, increasing layers from 8 to 10 reduces performance for the and groups. However, we see that these experiments are quite challenging and require multiple seeds to find a run that could fit the training context length of 128 permutations and show extrapolation to 512 permutations: we use 5 seeds per configuration yet sometimes still couldn't fit the context length. This observation reinforces our claim: increasing expressivity horizontally via offers a more parameter efficient and stable optimization path for these state-tracking tasks than increasing it vertically via depth. We thank you for pointing out the inconsistent legend labels in the colormap; we will fix the colormap in the revised figure to improve readability.
In Figure 9, the Gated DeltaNet effective ranks are extremely jagged. Why is this?
Both Gated DeltaNet and Gated DeltaProduct produce very jagged effective ranks. We found that both models learned to set the forget gate close to zero at the beginning of sequence tokens, effectively erasing the memory for the new sequence. Since the forget gate is a scalar for each hidden state per head the reset is very drastic.
line 42 should be an em-dash (https://en.wikipedia.org/wiki/Dash)
Thank you for pointing this out, we have fixed it in our manuscript.
Thank you again for helping us improve the paper.
References
S. Yang, B. Wang, Y. Zhang, Y. Shen, and Y. Kim. Parallelizing Linear Transformers with the Delta Rule over Sequence Length. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Pa- quet, J. Tomczak, and C. Zhang, editors, Proceedings of the 37th International Conference on Advances in Neural Information Processing Systems (NeurIPS’24), 2024
Thank you for your response. I'm looking forward to seeing this paper presented at the conference.
This paper introduces DeltaProduct, a novel linear Recurrent Neural Network (RNN) architecture that improves upon existing models like DeltaNet by enhancing state-tracking and expressivity. While DeltaNet's recurrence is interpreted as a single online gradient descent step per token, DeltaProduct performs multiple (nh) steps, resulting in a more expressive state-transition matrix formed by the product of nh generalized Householder transformations. This creates a diagonal plus rank-nh matrix structure, providing a tunable mechanism to balance expressivity with computational efficiency. The authors provide theoretical analysis showing that increasing nh improves the model's capacity for complex state-tracking, such as solving group word problems. Through extensive experiments, they demonstrate that DeltaProduct outperforms DeltaNet in both state-tracking and language modeling tasks, and exhibits significantly improved length extrapolation capabilities.
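To make the mechanism concrete, here is a schematic of the recurrence as I read it (illustrative numpy; the names and shapes are my simplifications, and gating/normalization details are omitted): each token applies n_h delta-rule micro-steps, so the per-token state transition is a product of n_h generalized Householder matrices.

```python
import numpy as np

def deltaproduct_step(S, ks, vs, betas):
    """One token update made of n_h delta-rule micro-steps.

    S: (d_k, d_v) state; ks: (n_h, d_k) unit keys; vs: (n_h, d_v) values; betas: (n_h,) in [0, 2].
    Each micro-step multiplies the state by (I - beta * k k^T) -- a generalized Householder --
    and adds beta * k v^T, so the overall transition matrix is a product of n_h such factors.
    """
    for k, v, beta in zip(ks, vs, betas):
        S = S - beta * np.outer(k, k @ S) + beta * np.outer(k, v)
    return S

# toy usage: n_h = 2 micro-steps starting from a zero state
rng = np.random.default_rng(0)
d_k, d_v, n_h = 8, 8, 2
S = np.zeros((d_k, d_v))
ks = rng.standard_normal((n_h, d_k)); ks /= np.linalg.norm(ks, axis=1, keepdims=True)
vs = rng.standard_normal((n_h, d_v))
betas = rng.uniform(0.0, 2.0, size=n_h)
print(deltaproduct_step(S, ks, vs, betas).shape)
```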
Strengths and Weaknesses
Strengths:
- The paper constructs complex, non-symmetric state-transition matrices by composing products of simple, symmetric Householder reflections. This allows the model to learn rotational and oscillatory dynamics, crucial for advanced state-tracking, while mathematically guaranteeing stability since the operator norm remains bounded.
- The paper demonstrates nice scientific rigor by not just showing that the model solves a task (S4 permutations), but by verifying how it solves it. Through model introspection, it provides strong evidence that the network learns the underlying geometric structure of the problem (the isomorphism to a cube's rotation group) exactly as predicted by theory.
Weaknesses:
- The model's core design using Householder products enforces a strict norm ||A(xi)|| <= 1 on its state-transition matrix. While this guarantees stability, it is also a fundamental expressive bottleneck that prevents the model from performing norm-increasing operations, which are crucial for amplifying signals and are a key reason why competing models like RWKV-7 achieve stronger theoretical guarantees.
- Increasing the model's expressivity via the hyperparameter nh (the number of Householder products) is inefficient. It leads to a linear growth in both model parameters and, more critically, the sequential computation time of the recurrence, making it nh times slower per token. This architectural choice limits the practical scalability of using a large nh to achieve higher performance.
- The number of operations, nh, is a fixed, global hyperparameter. The model applies the same amount of costly computation to every token, regardless of its importance. This is wasteful for simple tokens (e.g., punctuation) and may be insufficient for complex ones (e.g., resolving a long-range dependency), representing a missed opportunity for adaptive computation that could allocate resources more efficiently.
Questions
None.
Limitations
Yes.
Final Justification
The authors' rebuttal, particularly the new experiments and the generalization of their framework, has addressed my initial concerns. The planned revisions will substantially improve the paper's quality and significance. Therefore, I have updated my recommendation.
Formatting Issues
None.
We thank the reviewer for the feedback and address each point below.
The model's core design using Householder products enforces a strict norm ||A(xi)|| <= 1 on its state-transition matrix. While this guarantees stability, it is also a fundamental expressive bottleneck that prevents the model from performing norm-increasing operations, which are crucial for amplifying signals and are a key reason why competing models like RWKV-7 achieve stronger theoretical guarantees.
We agree this is a crucial trade-off. We intentionally designed DeltaProduct with Householder products to prioritize guaranteed stability (||A(xi)|| <= 1), a criterion for stable training on long sequences. This design choice directly contrasts with the potential instability of models like RWKV-7. As we demonstrate in Appendix B.6, the RWKV-7 recurrence becomes unstable in the very cases that grant it higher expressivity (i.e., when the vectors vary per token); we show that the product of just two such matrices can have a spectral radius > 1.2.
However, the core product-of-matrices mechanism we propose is flexible. It can incorporate other matrix families, including the norm-increasing ones from RWKV-7. Indeed, a minor extension of our theory, building on Merrill et al. (2024) (Theorem 5.2) and Peng et al. (2025) (Lemma 3), shows that this generalized framework has stronger theoretical guarantees. For instance, a single-layer model using RWKV-7 matrix products can recognize any regular language accepted by a finite state automaton with states, while if , a three-layer model can recognize any regular language. We will add these results to the revised paper.
To validate this, we ran new experiments with products of RWKV-7-type state-transition matrices. We used as state-transition matrix , where is unit norm and is obtained by transforming through a sigmoid. We find that a single layer using products of 2 RWKV-7 matrices obtains comparable results to DeltaProduct on . We will add new experiments for the other permutation groups using products of RWKV-7 state-transition matrices to a revised version of our manuscript.
Increasing the model's expressivity via the hyperparameter (the number of Householder products) is inefficient. It leads to a linear growth in both model parameters and, more critically, the sequential computation time of the recurrence, making it nh times slower per token. This architectural choice limits the practical scalability of using a large nh to achieve higher performance.
We see the linear increase in sequential computation with as a tunable trade-off rather than a fixed limitation. Our experiments in Figure 8 show that a small increase in (e.g., to 2 or 3) yields substantial gains in length extrapolation, making it a highly effective trade-off for many applications. Despite the theoretical linear increase in computational complexity, we find that in practice the training time increases sub-linearly with , as shown in Figure 13 in Appendix C.4. The scaling rate also depends on the training sequence length and batch size and could be further improved by a custom Triton kernel. After submission, we benchmarked the training throughput of the parameter-matched 805M-parameter DeltaNet and DeltaProduct on an improved implementation of our model with fused projections and causal 1d-convolutions (see also the response to reviewer RGKp). We find that DeltaProduct achieves 86% (compared to our previous 75% without fusing projections and causal 1d-convolutions) of the training throughput of DeltaNet at a 4096-token sequence length and a per-device batch size of 12.
The number of operations, nh, is a fixed, global hyperparameter. The model applies the same amount of costly computation to every token, regardless of its importance. This is wasteful for simple tokens (e.g., punctuation) and may be insufficient for complex ones (e.g., resolving a long-range dependency), representing a missed opportunity for adaptive computation that could allocate resources more efficiently.
Thank you for the suggestion. Adaptive computation is a promising direction that could greatly enhance the model's efficiency by allocating computation dynamically. While this falls outside the scope of our current work, which focuses on establishing the fundamental properties of the Householder product structure, we agree it is a valuable avenue for future research. We will add a discussion of adaptive computation, such as using a Mixture-of-Experts (MoE) approach to select per token, to our future work section.
References:
Merrill, W., Petty, J., & Sabharwal, A. (2024, July). The Illusion of State in State-Space Models. In International Conference on Machine Learning (pp. 35492-35506). PMLR.
Peng, B., Zhang, R., Goldstein, D., Alcaide, E., Du, X., Hou, H., ... & Zhou-Zheng, C. (2025). RWKV-7 "Goose" with expressive dynamic state evolution. arXiv preprint arXiv:2503.14456.
I thank the authors for their detailed and thoughtful rebuttal.
Regarding my primary concerns:
- On the Stability vs. Expressivity Trade-off: The authors' response was helpful. They rightly framed the guaranteed stability (||A(xi)|| <= 1) of the Householder product as an intentional and important design choice, providing a clear counterpoint to models like RWKV-7, which can suffer from instability. The proposed extension to incorporate norm-increasing matrices (like those in RWKV-7) and the new preliminary experiments showing comparable performance are very convincing. This generalization elevates the contribution from a specific architecture to a more flexible and powerful framework. I support the inclusion of these new theoretical and experimental results in the revised manuscript, as they turn a perceived limitation into a significant strength.
- On Computational Inefficiency: The authors' new throughput benchmarks are very helpful. The updated finding that an improved DeltaProduct-2 model can achieve 86% of DeltaNet's throughput demonstrates that the practical cost of increasing nh is manageable and represents a worthwhile trade-off for the performance gains observed. This alleviates my concerns about the practical scalability of the approach for small, effective values of nh.
- On Adaptive Computation: I accept the authors' position that implementing adaptive computation is a promising avenue for future work and falls outside the scope of this foundational paper. Their commitment to discussing this limitation and future direction is appropriate.
The authors' rebuttal, particularly the new experiments and the generalization of their framework, has addressed my initial concerns. The planned revisions will substantially improve the paper's quality and significance. Therefore, I have updated my recommendation.
Summary:
Linear recurrent neural networks are an area of active study for language models because they avoid the quadratic complexity of dense softmax attention in transformers. Earlier models such as Mamba or GLA use a diagonal transition matrix, which limits their expressivity. More recent models such as DeltaNet and RWKV-7 improve on certain tasks by using a diagonal plus rank-1 transition matrix. In this paper the authors introduce a new LRNN, DeltaProduct, which extends DeltaNet by taking multiple steps per token. The state-transition matrix thus becomes rank-N rather than rank 1, and is formed as a product of N generalized Householder transformations.
The paper provides both mathematical proofs, and experimental results which demonstrate that the technique works.
Meta-review.
The reviewers were unanimously in favor of accepting this paper, and I see no reason to disagree. The content of the paper lies outside my area of expertise, but it seems to be well-written. The reviews also seem to be careful and thorough.
Reviewer HNXg thought that "the paper demonstrates nice scientific rigor", but was concerned that the use of householder products was a "fundamental expressive bottleneck." However, the authors provided new throughput benchmarks, and a detailed explanation of the "stability vs. expressivity tradeoff" in their rebuttal, which addressed the reviewer's concerns.
Reviewer RGKp had nothing but praise:
- "This paper demonstrates exceptional scientific rigor with a clear, meaningful contribution that advances linear RNN research."
- "The related work section is particularly thorough and well-organized."
Reviewer bkha did an exceptionally detailed review (thank you!), went through the paper and the proofs line-by-line, and found some errors in key theoretical results. However, the authors provided updated proofs during the rebuttal period. In their final justification, reviewer bkha writes: "Mistakes pointed out before rebuttal have been addressed and corrected."
Given the attention to detail with which bkha reviewed this paper, I feel confident that the paper was fairly reviewed, and should be accepted. However, given that none of the reviewer scores were above 5, it probably does not qualify for a spotlight or oral.