PaperHub
Rating: 4.8 / 10 · Rejected · 4 reviewers
Ratings: 8, 3, 3, 5 (min 3, max 8, std 2.0)
Confidence: 4.0 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.3
ICLR 2025

You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
LightNet, multi-dimensional sequential modeling

Reviews and Discussion

Review (Rating: 8)

The authors start by exploring the space of linear recurrences, and when they can represent linear combinations. They then define a constraint to keep this recurrence numerically stable, and two different ways to satisfy the constraint. They call these “additive” and “multiplicative”.

They show that SSMs normally use a “multiplicative” linear recurrence, which stops information from the “future” of the sequence affecting the current position, requiring two scans along a sequence, one forward and one backward. Instead, they modify their “additive” linear recurrence to be a “global” summary of the information present by modifying the normalization. They show this modified and novel “additive” linear recurrence is significantly computationally cheaper than a 2-scan approach (Figure). This is particularly useful when dealing with multi-dimensional arrays, where there is no natural scan order, and multiple scans in different orders might be required. They also demonstrate that using one scan is significantly faster in practice (Figure 2), but they do not compare speed for their other experiments.

However, their ‘additive’ linear recurrence loses the relative position of tokens, so they introduce two new forms of positional encoding they call Multi-Dimensional Toeplitz Position Encoding (MD-TPE), and Multi-Dimensional Linearized Relative Positional Encoding (MD-LRPE), where MD-TPE is applied after the input embedding and MD-LRPE is applied after each Q/K computation.

Together, they use their one-scan attention and MD-LRPE blocks to construct a LightNet Attention (LNA) block, which alternates with a Gated Linear Unit (GLU) to construct what they call the LightNet.

They empirically test the performance of LightNet on multi-dimensional image classification and generation tasks, as well as on language modeling tasks.

In image classification, they show that LightNet performs similarly to systems using full softmax attention or multi-scan attention, while only requiring a single scan, which should be faster. In image generation, they demonstrate SOTA FID compared to the other methods.

In bidirectional language modeling, they compare in a 24-hour training regime, and LightNet significantly outperforms all other methods. In autoregressive language modeling, they match Mamba's performance, which makes sense.

Strengths

Originality: To my knowledge both the “additive” linear recurrence and corresponding improvements to the positional encodings system are novel, and important for non-linear regimes.

Quality: The technical claims are clear and appear to be solid. However, see the note in the weaknesses section about it being hard to compare the computational efficiency of their method vs. competitors due to the missing timings.

Clarity: The paper is clearly written and relatively easy to follow, with the experiments and the network described in sufficient detail that they could be reproduced. However, note that the appendices appear to be missing, including one of the smaller proofs from the paper.

Significance: They demonstrate SOTA on some of their tasks, so it is definitely the best method to use in at least certain situations. This combined with their claims of gains in computational efficiency could make LightNet the network of choice in various situations. It’s also surprising that their modification works as well as it does, so that itself may lead to more interesting results and modifications.

Weaknesses

It would be more informative if the timing information for the evaluation/training iteration per experiment were reported along with the results, as opposed to only being shown in Figure 2. As it is, it is hard to judge what computational speed difference it makes in practice, which is one of the main attractions of their method. Assuming the results in Figure 2 apply to the other methods, it would make it more impressive. At least why this wasn't included should be addressed in the text.

There are no error bars on any results, which makes it harder to be sure that the differences are meaningful. However, this could be understandable given the computational expense involved. A comment on how much variability was seen if multiple runs were performed would be useful.

Their results tables are not that easy to parse - bolding the winner in each table would make them quicker and easier to read.

There are some places where an appendix is referenced, but there are no appendices. E.g., the proof of Equation 4 is missing.

Questions

There is still a “directionality” in the additive linear recurrence, as the terms before and after a specific position have different weightings (there is a sum to the LHS, but not the RHS of a term). Could you provide more intuition about why this asymmetry doesn't matter in practice?

It’s not clear what the setup is in Figure 2 - you say you substitute the attention block, but in what setting/dataset?

Comment

Q1. More speed results

A1. Thank you for your suggestion. We have added a comparison of the speed of Mamba2, LightNet, and Flash Attention, as shown in the tables below (times in seconds).

Forward Pass

| seqlen | Mamba2 | LightNet | Flash Attn |
|--------|--------|----------|------------|
| 256    | 0.40   | 0.05     | 0.05       |
| 512    | 0.72   | 0.08     | 0.11       |
| 1024   | 1.14   | 0.12     | 0.36       |
| 2048   | 2.15   | 0.21     | 1.34       |
| 4096   | 4.20   | 0.36     | 5.31       |
| 8192   | 9.57   | 0.69     | 21.05      |
| 16384  | 18.87  | 1.34     | 83.83      |

Backward Pass

| seqlen | Mamba2 | LightNet | Flash Attn |
|--------|--------|----------|------------|
| 256    | 1.13   | 0.22     | 0.25       |
| 512    | 2.18   | 0.34     | 0.55       |
| 1024   | 3.60   | 0.57     | 1.53       |
| 2048   | 7.00   | 1.06     | 4.70       |
| 4096   | 13.84  | 2.05     | 16.54      |
| 8192   | 28.68  | 4.03     | 62.30      |
| 16384  | 57.36  | 7.99     | 242.16     |

Q2. Error bars on experiments

A2. Thank you for your suggestion. Noting that some experiments (e.g., LLMs and Diffusion models) are computationally expensive, it is challenging to run multiple repetitions. To verify the robustness of our method, we conducted five experiments on the ImageNet-1k dataset. The results are shown below:

| Model    | Acc (± std)  | Params (M) | Acc (± std)  | Params (M) | Acc (± std)  | Params (M) |
|----------|--------------|------------|--------------|------------|--------------|------------|
| LightNet | 74.44 ± 0.08 | 6          | 80.08 ± 0.05 | 22.64      | 81.87 ± 0.04 | 87.74      |

The results demonstrate that LightNet achieves stable performance on ImageNet-1k.

Q3. Results table not easy to parse

A3. Thank you for your suggestion. In the next version, we will bold the best results for better clarity.

Q4. Missing reference

A4. Thank you for your suggestion. We will address these issues in the revised version.

Q5. "Directionality" in additive decay

A5. Thank you for your suggestion. This describes a causal setting, such as in language modeling, where each time step depends only on previous ones. In this scenario, only the LHS exists without the RHS, which is reasonable.

Q6. Setup in Figure 2

A6. In Figure 2, we compare LightNet Attention with Linear Attention (2-scan) at the operator level. The input is randomly generated, and we vary the sequence length to measure the time consumed by each operator.
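As an illustration of what such an operator-level comparison looks like, here is a minimal timing-harness sketch (ours, assuming PyTorch; `linear_attention_one_scan` is a generic stand-in for the actual LightNet and 2-scan kernels, which are not reproduced here):

```python
import time
import torch

def linear_attention_one_scan(q, k, v):
    # Stand-in operator: non-causal linear attention via the right-product
    # trick, q @ (k^T v), so the cost grows linearly with sequence length.
    kv = torch.einsum("bnd,bne->bde", k, v)
    return torch.einsum("bnd,bde->bne", q, kv)

def time_op(fn, *args, repeats=10):
    # Simple wall-clock timing; synchronize when running on GPU.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(repeats):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / repeats

if __name__ == "__main__":
    d = 64
    for n in [256, 512, 1024, 2048, 4096]:
        q, k, v = (torch.randn(1, n, d) for _ in range(3))
        print(n, f"{time_op(linear_attention_one_scan, q, k, v):.6f} s")
```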

Comment

Thanks for the clarifications and the extra results. However, I'm confused by your response.

Q There is still a “directionality” in the additive linear recurrence, as the terms before and after a specific position have different weightings (there is a sum to the LHS, but not the RHS of a term). Could you provide more intuition about why this asymmetry doesn't matter in practice?

A5. Thank you for your suggestion. This describes a causal setting, such as in language modeling, where each time step depends only on previous ones. In this scenario, only the LHS exists without the RHS, which is reasonable.

We're talking about Lines 246-253 (lines don't quite line up). You modify the denominator to incorporate "global" information, by changing the normalization constant. You say this only applies in the causal case - but by definition you don't want to incorporate global information in the causal case as that'll leak future information into the current prediction.

In the non-causal case, the question about asymmetry still applies.

Comment

Thank you for your response! Our previous response explained that in the causal setting, we only need the LHS because information leakage must be avoided (hence, there is no asymmetry issue). In contrast, in the non-causal setting, we need to sum both the LHS and RHS, i.e., $\Delta = \sum_{s=1}^n \delta(s)$.

Let me further clarify the logic here:

  • In the causal setting, to prevent information leakage, we do not modify the denominator. Thus, $a_t = \frac{\sum_{s=1}^{t-1} \delta(s)}{\sum_{s=1}^t \delta(s)}$, meaning the denominator only sums over the LHS.
  • In the non-causal setting, since we require $a_t$ to incorporate global information (to avoid multiple scans), we modify the denominator to $\Delta = \sum_{s=1}^n \delta(s)$ to obtain this global information. In this case, $a_t = \frac{\sum_{s=1}^{t-1} \delta(s)}{\Delta}$, where the denominator effectively includes both the LHS and RHS; a small sketch of both cases follows below.
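To make the two normalizations concrete, here is a minimal NumPy sketch (ours, purely illustrative; δ is just an arbitrary positive sequence):

```python
import numpy as np

def a_causal(delta):
    # a_t = sum_{s<t} delta(s) / sum_{s<=t} delta(s): the denominator uses only the LHS.
    csum = np.cumsum(delta)
    return np.concatenate([[0.0], csum[:-1]]) / csum

def a_noncausal(delta):
    # a_t = sum_{s<t} delta(s) / Delta, with Delta summed over the whole sequence,
    # so the denominator carries global (LHS + RHS) information.
    csum = np.cumsum(delta)
    return np.concatenate([[0.0], csum[:-1]]) / csum[-1]

delta = np.exp(np.random.randn(8))  # delta(s) >= 0
print(a_causal(delta))
print(a_noncausal(delta))
```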

I hope this resolves your confusion! If you have further questions, please feel free to ask.

Comment

Thanks for the answer.

Consider the non-causal setting - you define:

$a_t = \frac{\sum^{t-1}_{s=1}\delta(s)}{\Delta}$

But you could define:

$a_t = \frac{\sum^{T}_{s=t+1}\delta(s)}{\Delta}$

These are not the same - so there is an asymmetry introduced in the scan direction, unless I'm failing to understand something. Do you empirically observe any effects when you flip the scan direction in your experiments?

Review (Rating: 3)

This paper proposes a formulation for the decay function of linear attention models in order to enable single-pass processing of multi-dimensional signals (such as images) in transformers. The authors also propose two positional embedding functions for multi-dimensional data, extending the previous formulations in the literature to multi-dimensional settings.

Strengths

The paper tackles an important problem re: processing of multi-dimensional data using linear transformers and its performance implications.

The paper also proposes two novel positional embedding formulations for multi-dimensional data which are interesting and to my limited knowledge novel.

The range of the empirical study is also quite wide and contains image classification, image generation and language modeling with lots of comparisons across different models.

Weaknesses

1- Overall, this paper is poorly written; the general argument is quite vague, and the derivation is unclear with vague notation. In particular:

A) What does it mean that $a_t$ "cannot provide global information"? What do the authors mean by "global information" here? I can think of any monotonic function of $t$ to capture the global information re: the position, or any aggregation function (sum, product, etc.) to capture the global information re: the value.

B) In Sec. 3, the decay function is presented as a function of $t$; however, in Sec. 4, all of a sudden the decay is a function of the key variables themselves with no further explanation. Why should the decay be a function of keys and not simply their time indices? There's no explanation whatsoever.

C) Figure 3 is not consistent with the formulation presented in Eqs. 7, 8 & 9. For instance, it's not clear if the $\phi$ transformation is applied on $U$ or not and, if so, why that should be the case.

D) What is "y" in Eq. 10 & 11? Also there's no intuitive explanation of where these formulations come from.

I'd highly recommend the authors to (I) rewrite Sec 4 and their preceding arguments in a clear manner, (II) provide a section or appendix re: any background relevant to their work because the current manuscript is not self-contained, and (III) make sure all the notations used in the paper are consistent and clearly defined.

2- The main contribution of the paper is not clear. In particular,

A) Theorem 3.1 is simply the textbook solution for first-order non-homogeneous recurrence equations with variable coefficients (which is what Eq. 3 is, mathematically speaking), which is already an established result.

B) The distinction between the "multiplicative" and "additive" cases is not clear. The additive case is indeed a special case of the multiplicative case where one chooses $\rho$ to be the ratio proposed for the additive case. In other words, the proposed supposedly "additive" decay formulation is in fact multiplicative underneath.

Instead of using confusing terminology, I'd recommend authors to clearly state their contribution re: their choice of decay function and the theoretical intuitions behind it.

3- The empirical results are not particularly conclusive. More precisely,

A) If the main contribution of the paper is achieving better results with a single pass, then it makes sense for the baselines to run in a single-pass setting as well, and then measure the potential improvement the proposed method achieves using the proposed decay formulation. For the currently reported results, it's not exactly clear where the benefit comes from.

B) In table 1, it's really hard to pinpoint the main source of improvement as there are many other factors among these models that are not controlled.

C) In addition to controlling for the number of passes, the authors should also control for model size. For instance, in Table 3, the model sizes need to be reported.

D) Since language is not a multi-dimensional signal, it is not clear what hypothesis exactly the authors are trying to test by the language experiments. In particular, is the comparison against other linear attention models of the same size? Or is it against any model of the same size? Or is it against any model whatsoever (i.e. beating the leaderboard)? In other words, what exact part of the proposed framework are we testing here and how do we control for other factors that are NOT being tested? There are too many uncontrolled factors here: models' architectures, models' sizes, models' kernel embedding functions (for linear attention models), the decay formulation, etc. This means that the presented results are to be interpreted as leaderboard benchmark results, but is that your claim to beat the leaderboard for these tasks?

E) The "parameter sharing" contribution is first mentioned in the ablation study section and unlike what is claimed there, there is no single mention of it in the methodology sections before! Is it another contribution of the paper? If so, then it needs to clearly stated so and fully explained in the methodology section.

Questions

Please see my comments above.

Comment

Q1. What does "global information" mean?

A1. Thank you for your question. Let me clarify the meaning of "global information" as described in lines 249–250. Based on Eq. (3) and the assumption $0\le a_t \le 1$, $a_t$ can take two forms:

  • $a_t=\rho(t)$, $0\le \rho(t)\le 1$, where $a_t$ depends only on the current time step $t$.
  • $a_t=\frac{\sum_{s=1}^{t-1} \delta(s)}{\sum_{s=1}^t \delta(s)}$, $\delta(s) \geq 0$, where $a_t$ depends on all time steps from 1 to $t$, i.e., "global information."

We will clarify this distinction further in the updated manuscript.


Q2. Definition of the decay function

A2. Thank you for your suggestion. In Sec. 3, we use a simple linear recursion as a toy example to analyze the decay function. The implementation in real models differs. In Sec. 4, the decay function depends on $k_t$, indicating it is data-dependent, which is a standard approach in the Linear Attention community, as seen in [1], [2], and [3].

Q3. Figure 3 not clear

A3. Thank you for your suggestion. In Figure 3, we applied $\psi$ to $U$. We will update the figure for clarity in the revised version.

Q4. Typo in Eq. 10 & 11

A4. This was indeed a typo. The RHS should be $\mathbf{t}$. These equations describe relative position encoding [4]:

$$\mathbf{y}_{n_1, \ldots, n_k}=\sum_{m_k \leq n_k} \cdots \sum_{m_1 \leq n_1} \mathbf{t}_{n_1-m_1, \ldots, n_k-m_k}\, \mathbf{x}_{m_1, \ldots, m_k}.$$


Q5. Significance of Theorem 3.1

A5. Thank you for your suggestion. While the result of Theorem 3.1 is textbook material, its application to analyzing decay in Linear Attention is, to our best knowledge, novel. Current approaches to decay in Linear Attention predominantly use multiplicative decay. To highlight our contribution, we will emphasize Proposition 3.2, which represents the core of this work, in future versions.

Q6. Differences between multiplicative and additive decay

A6. Thank you for your suggestion. We believe our naming conventions are appropriate, as decay can be unified through the following framework:

  • Decay: $\lambda_t = s_{t-1}/s_t$, where:
    • Multiplicative decay: $\log s_t = \log s_{t-1} + \delta(t)$.
    • Additive decay: $s_t = s_{t-1} + \delta(t)$.

Since addition in logarithmic space is equivalent to multiplication, this justifies the term "multiplicative decay." We will add this explanation in the next version of the paper.
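As a small numerical illustration of this unification (our sketch, using an arbitrary positive δ): both parametrizations produce a decay $\lambda_t = s_{t-1}/s_t$ between 0 and 1; they differ only in whether δ is accumulated in linear or logarithmic space.

```python
import numpy as np

delta = np.abs(np.random.randn(6)) + 0.1  # arbitrary positive increments

# Additive decay: s_t = s_{t-1} + delta_t
s_add = np.cumsum(delta)
lam_add = s_add[:-1] / s_add[1:]          # depends on the whole prefix sum

# Multiplicative decay: log s_t = log s_{t-1} + delta_t, i.e. s_t = s_{t-1} * exp(delta_t)
log_s = np.cumsum(delta)
lam_mul = np.exp(log_s[:-1] - log_s[1:])  # = exp(-delta_t), depends only on the current step

print(lam_add)
print(lam_mul)
```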


Q7. Interpretation of experimental results

A7. Our main contributions are twofold:

  1. Achieving a global receptive field in a single scan, leading to faster processing.
  2. Matching the performance of multi-scan and Transformer-based models, as demonstrated in Tables 1, 2, and 3.

A single-scan baseline has limited value in tasks requiring a global receptive field, as it inherently performs worse. For example, GPT underperforms BERT in text classification [5], and Table 4 in [6] shows significant advantages of multi-scan models in image classification.

Additionally, we demonstrate that our model is competitive in language modeling—a domain where most Linear Attention models are applied. This is the significance of Tables 4 and 5.


Q8. More details about Table 1

A8. In Table 1, we replaced the token mixer with LightNet Attention. This controlled comparison, a standard approach in the field [3], [7], evaluates model performance end-to-end. The results validate the effectiveness of LightNet.


Q9. Updating model sizes in tables

A9. Thank you for your suggestion. We will include model sizes in the revised manuscript. As a simple comparison, LightNet has 672 million parameters, while DiT has 675 million parameters.


Q10. Parameter sharing

A10. Thank you for your suggestion. This approach was first introduced in [4], so it is not our primary contribution. We will include proper citations in future versions.


Citations:

  1. Albert Gu, & Tri Dao. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
  2. Tri Dao, & Albert Gu. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.
  3. Zhen Qin, Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. (2024). HGRN2: Gated Linear RNNs with State Expansion.
  4. Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, & Yiran Zhong. (2023). Toeplitz Neural Network for Sequence Modeling.
  5. Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  6. Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, & Johannes Brandstetter. (2024). Vision-LSTM: xLSTM as Generic Vision Backbone.
  7. Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, & Xinggang Wang. (2024). Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model.
Comment

Dear Reviewer, thank you for reviewing our manuscript and providing valuable feedback. We would like to kindly inquire if you have any further comments or suggestions, as we look forward to your response to further improve our work.

Comment

I appreciate the authors taking time to address some of my questions and confusions. Nevertheless, the new draft is still not satisfying the minimum clarity and coherence required for an ICLR submission in my opinion. I'd highly encourage the authors to make a significant revision to the methodology sections and clearly state the motivations and their thought process w.r.t. their proposed approach.

Review (Rating: 3)

The paper addresses the question of how to efficiently model multi-dimensional data with modern architectures. The authors first categorize existing linear recurrence approaches into additive and multiplicative scans, demonstrating that the additive approach is significantly more efficient. By exploring the fundamental formulation of additive scans, they propose LightNet - a linear-attention-based model variant that can be implemented in a single scan, resulting in a highly efficient model. Empirical analysis shows that LightNet achieves competitive results in both language modeling and vision tasks.

Strengths

  • The paper addresses an important question: how to efficiently process multimodal data using modern DL models.

  • The proposed '1 Scan' strategy significantly reduces latency.

  • The results are promising, demonstrating improvements over the selected baselines.

Weaknesses

W.1. The paper lacks clarity and contains numerous typos in critical sections, making it difficult to understand key aspects of the methods and results. For example:

  • In the method section, line 280, a softmax operation is mentioned, yet the authors claim the transformer relies on linear attention. This inconsistency is especially problematic given the absence of a reference code.

  • In the method section, line 283, the variables W_{u1} and W_{u2} are referenced, but they do not appear to be introduced or explained.

  • In the method section, line 247, there seems to be a double exponent function. Is "exp(-exp(x))" correct, or is this a typo?

  • Broken references are present (lines 180, 401).

  • Unclear notation: In Section 4.1, equations (7-9), the distinction between bold and standard letters is not defined.

  • Additional typos: line 159 (“E.q 3” should be “Eq. 3”), line 413 (“modeling..” should be “modeling.”).

W.2. Missing Baselines (and related work) in linear attention variants: The paper does not discuss several existing transformer-based linear attention models that have demonstrated improved performance in certain contexts. Examples include Mega [2] and Megalodon [3], which are also based on gating and decay mechanisms. The authors should also include a discussion, possibly in a new subsection, on how LightNet (equations 7-9) compares to similar variants that omit the softmax.

W.3. Missing Baselines and related work on N-Dimensional position encoding: As far as I understand, the authors did not discuss or benchmark previously proposed N-D position encoding methods. Is this correct? Numerous approaches address this, see [1] as an example.

W.4. Limited scope of experimental validation: While the paper claims to support n-dimensional data, experiments are conducted only on 1-D and 2-D cases.

W.5. The statistical significance of the results is not well-defined. Were the experiments run over multiple seeds? In some cases, such as the ablations without PE, performance gains are marginal. Could the authors clarify the robustness and significance of these results?


[1] Rethinking and Improving Relative Position Encoding for Vision Transformer. Wu et al. ICCV 2021

[2] Mega: Moving Average Equipped Gated Attention. Ma et al. ICLR 2023

[3] Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length. Ma et al.

Questions

Q.1. Can the authors present FLOPs and latency analysis against highly optimized implementations, such as FlashAttention and Mamba?

Q.2. Selective mechanisms, such as Mamba and the selective transformer, emphasize empirical advantages. Can these (or similar variations) be incorporated with the 1-S approach?

See additional questions under Weaknesses (e.g., W.5).

Comment

Q6. FLOPs and latency analysis against FlashAttention and Mamba

A6. We compared the results of LightNet with TNL's Lightning Attention [9] in Figure 2. As described in [9], Lightning Attention is faster than FlashAttention, and LightNet outperforms FlashAttention in speed. Additionally, Mamba2 employs data-dependent decay, making it slower than Lightning Attention. To verify this, we conducted further tests; the results are as follows (times in seconds).

Forward Pass

| seqlen | Mamba2 | LightNet | Flash Attn |
|--------|--------|----------|------------|
| 256    | 0.40   | 0.05     | 0.05       |
| 512    | 0.72   | 0.08     | 0.11       |
| 1024   | 1.14   | 0.12     | 0.36       |
| 2048   | 2.15   | 0.21     | 1.34       |
| 4096   | 4.20   | 0.36     | 5.31       |
| 8192   | 9.57   | 0.69     | 21.05      |
| 16384  | 18.87  | 1.34     | 83.83      |

Backward Pass

| seqlen | Mamba2 | LightNet | Flash Attn |
|--------|--------|----------|------------|
| 256    | 1.13   | 0.22     | 0.25       |
| 512    | 2.18   | 0.34     | 0.55       |
| 1024   | 3.60   | 0.57     | 1.53       |
| 2048   | 7.00   | 1.06     | 4.70       |
| 4096   | 13.84  | 2.05     | 16.54      |
| 8192   | 28.68  | 4.03     | 62.30      |
| 16384  | 57.36  | 7.99     | 242.16     |

Theoretical FLOPs Analysis

Let d denote the embedding dimension, n the sequence length, h the number of heads, and B the chunk size in Mamba2.

Softmax Attention FLOPs:

  • qk multiplication: $2 n^2 d$
  • Softmax (exp, sum, divide): $3 n^2 h$.
  • Softmax(qk)v multiplication: $2 n^2 d$.

Total: $4 n^2 d + 3 h n^2$.

Mamba2:

  • Intra block: $(n/B)\times (4 B^2 d + B^2 h) = 4Bnd + nBh$.
  • Inter block: $(n/B)\times \frac{2 B d^2}{h} = 2nd^2/h$.
  • State update: $(n/B) \times \left( \frac{2 B d^2}{h}+\frac{d^2}{h} \right) = 2nd^2/h + nd^2/(Bh)$.

Total (2 passes): $8nd^2/h + 8Bnd + 2Bnh + 2nd^2/(Bh)$.

LightNet FLOPs:

  • kv multiplication: $2nd^2/h$
  • q(kv) multiplication: $2nd^2/h$

Total: $4nd^2/h$.

These results demonstrate that LightNet has theoretical and practical speed advantages.
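For concreteness, a small helper (ours) that plugs example values into the three FLOP formulas above:

```python
def softmax_attention_flops(n, d, h):
    # 4*n^2*d for the qk and (qk)v matmuls plus 3*h*n^2 for exp/sum/divide.
    return 4 * n**2 * d + 3 * h * n**2

def mamba2_flops(n, d, h, B):
    # Two-pass total: intra-block + inter-block + state update, doubled.
    return 8 * n * d**2 / h + 8 * B * n * d + 2 * B * n * h + 2 * n * d**2 / (B * h)

def lightnet_flops(n, d, h):
    # kv and q(kv) matmuls only.
    return 4 * n * d**2 / h

n, d, h, B = 4096, 2048, 16, 256  # example sizes, not the paper's configurations
print(f"softmax : {softmax_attention_flops(n, d, h):.3e}")
print(f"mamba2  : {mamba2_flops(n, d, h, B):.3e}")
print(f"lightnet: {lightnet_flops(n, d, h):.3e}")
```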

Q7. Can models similar to Mamba utilize a 1-scan method?

A7. Thank you for your question. The answer is no. As discussed in our Proposition 3.2, Mamba-like models belong to the Multiplicative Decay category. Let me explain the specific reasons. Linear Attention has shown superior speed and competitive performance compared to Softmax Attention in one-dimensional language modeling. However, when dealing with multi-dimensional data, Softmax Attention can obtain results through a single parallel computation. In contrast, existing Linear Attention methods require at least a two-scan process to achieve the same receptive field as Softmax Attention due to the presence of multiplicative decay. LightNet, on the other hand, uses additive decay, which allows it to achieve results with just one scan.

Citations:

  1. Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, Soham De. "Resurrecting Recurrent Neural Networks for Long Sequences." (2023).
  2. Mega: Moving Average Equipped Gated Attention. Ma et al. ICLR 2023
  3. Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length. Ma et al.
  4. Rethinking and Improving Relative Position Encoding for Vision Transformer. Wu et al. ICCV 2021
  5. Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao. "Latte: Latent Diffusion Transformer for Video Generation". arXiv preprint arXiv:2401.03048. (2024).
  6. Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, Oskar van der Wal. "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling." (2023).
  7. Teven Le Scao, et al.. "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model". CoRR abs/2211.05100. (2022).
  8. William Peebles, Saining Xie. "Scalable Diffusion Models with Transformers." (2022).
  9. Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong. "Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models." (2024).
Comment

Thank you for your clarifications and the additional details, which have undoubtedly improved the manuscript. While some of my concerns have been fully addressed, others remain unresolved. My two main concerns are as follows:

(A) Writing and Clarity: The paper’s writing and overall clarity need significant improvement, and I suggest a major revision. Many reviewer questions arose because key decisions were not motivated or discussed, negatively impacting the review process. For instance, I now better understand the motivation for designing new PE schemes, and even appreciate this research direction. However, the motivation should be explicitly discussed in the paper, including detailed explanations of why other alternatives are less efficient or unsuitable.

(B) Justification of Design Choices and Missing Ablation Studies: The lack of justification for design choices and the absence of comprehensive ablation studies significantly reduce the insights that can be drawn from the results. For example, how critical is the use of Softmax in the activation function? Is this discussed in the paper? Is it considered standard practice?

Comment

Thank you for your suggestions. Regarding these two issues, our responses are as follows:

(A) Writing and Clarity: Following the reviewers' suggestions, we have made extensive revisions to the manuscript. Details can be found in the General Response.

(B) Justification of Design Choices and Missing Ablation Studies: We have included additional ablation studies in Section 5.3 of the manuscript, specifically in the Effectiveness of Additive Decay subsection.

Comment

Q1. The paper lacks clarity.

A1. Thank you for your question. We will address these issues in future versions. Here are the answers to your concerns:

  • Softmax operation in line 280: This refers to the activation function used in the Key module, and it has no relation to Softmax Attention.
  • $W_{u1}$ and $W_{u2}$ in line 283: These correspond to the weight matrices of the "low-rank output gate from TNL3" mentioned in lines 268 to 269.
  • Line 247: This is not a typo. The function $\exp(-\exp(x))$ maps $\mathbb{R}$ to $[0, 1]$, aligning with the definition of decay. Additionally, this is illustrated in Figure 1 of [1].
  • Broken references and typos: We will fix these in the updated version.

Q2. Missing baselines

A2. Thank you for your suggestion. We will include discussions of [2] and [3] in subsequent revisions. Note that in [2] and [3], only the Mega-chunk variant exhibits linear complexity. Moreover, the Softmax function in equations 7–9 is merely an activation function derived from additive decay. To prevent confusion, we will clarify this point in future versions.


Q3. Missing baselines and related work on N-Dimensional position encoding

A3. Thank you for your suggestion. Most N-Dimensional position encoding methods involve operations on the attention matrix, which are incompatible with Linear Attention as we utilize the right-product trick. For instance, the official code for [4] includes:

attn = (q @ k.transpose(-2, -1))

# image relative position on keys
if self.rpe_k is not None:
  attn += self.rpe_k(q)

This approach clearly cannot be applied to Linear Attention, motivating the design of our own N-Dimensional position encoding.
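To make the incompatibility concrete, a minimal PyTorch sketch (ours): with the right-product trick, linear attention never materializes the $n \times n$ attention matrix to which an additive relative-position bias such as `rpe_k(q)` would have to be added.

```python
import torch

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))

# Softmax-style attention: an (n, n) matrix exists, so a positional bias can be added to it.
attn = q @ k.transpose(-2, -1)   # shape (n, n)
# attn += rpe_bias               # this is where attention-matrix RPE would be applied

# Linear attention with the right-product trick: only a (d, d) state is formed.
kv = k.transpose(-2, -1) @ v     # shape (d, d)
out = q @ kv                     # shape (n, d); there is no (n, n) matrix to bias
```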


Q4. Limited scope of experimental validation

A4. Thank you for your suggestion. To evaluate the proposed LightNet on higher-dimensional tasks, we replaced the Transformer in Latte [5] with LightNet. The results are shown below, demonstrating consistent performance improvements on both datasets. The metric used is FVD (lower is better).

| Method   | UCF101 | Taichi |
|----------|--------|--------|
| Latte    | 477.97 | 159.60 |
| LightNet | 257.38 | 132.82 |

Q5. Robustness of the results

A5. Thank you for your suggestion. Some experiments, such as those involving LLMs and Diffusion, are computationally expensive and difficult to repeat multiple times (notably, baselines [6], [7], [8] also ran only one trial). To assess robustness, we conducted five runs on the ImageNet-1k dataset. The results are shown below:

| Model    | Accuracy (± std) | Params | Accuracy (± std) | Params | Accuracy (± std) | Params |
|----------|------------------|--------|------------------|--------|------------------|--------|
| LightNet | 74.44 ± 0.08     | 6M     | 80.08 ± 0.05     | 22.64M | 81.87 ± 0.04     | 87.74M |

The results show that LightNet delivers stable performance on ImageNet-1k.

Review (Rating: 5)
  1. The paper presents LightNet, a new variant of State Space Models (SSMs) incorporating an additive decay/selectivity parameter. Unlike Mamba or Linear Attention (LA) with decay, LightNet avoids recursion-based parallelization, achieving faster parallelization using only matrix multiplications, similar to LA without decay.

  2. To address permutation invariance in non-causal settings (a characteristic it shares with LA), the authors introduce two multi-dimensional positional encodings MD-TPE and MD-LRPE.

  3. The model is evaluated across diverse tasks, including autoregressive and bidirectional language modeling, image classification and generation.

Strengths

  1. The authors identify that SSM-based methods like Mamba and LA with decay suffer from slow GPU performance due to recursion-based parallelization. In contrast vanilla LA is fully parallelizable via matrix multiplications but underperforms other SOTA SSMs.

  2. To introduce decay to improve LA's performance while retaining full parallelizability, they derive the mathematical formulation of the linear operation induced by a decayed recurrence and, using this, motivate an “Additive” decay mechanism that allows for decay with matrix-multiplication-based parallelization.

  3. They evaluate their model across a broad range of tasks and demonstrate near competitive performance against SOTA methods.

Weaknesses

1. Saturation of Additive Decay for Long Sequences

The proposed additive decay mechanism, defined as $g(t) = \sum_{s=1}^{t} \delta(s)$, tends to saturate for long sequences. As the decay accumulates $\delta$ values over the course of a long sequence, $g(t)/g(s)$ reaches a point where it can no longer distinguish between nearby tokens effectively. This may be problematic for language modeling as it has a strong local bias.

This saturation effect will likely be more pronounced with very long sequences or on large-scale models. I am curious how LightNet might perform compared to Mamba in this setting, which has the opposite issue: difficulty paying attention to very distant tokens despite selectivity. I suspect LightNet may struggle with paying attention locally at scale, potentially leading to underperformance on long sequences.

2. Comparing LA and LightNet on Long Sequences

In line with the reasoning above, I feel that LA and LightNet may have similar performance at scale. This is due to LightNet's decay factor saturating to 1. To better understand this, an ablation study comparing LightNet directly with Linear Attention, keeping all other factors constant, would be useful. Specifically, I suggest replacing the additive decay with vanilla LA's $\phi(K)$ (substituting softmax(K) with $\phi(K)$). Additionally, scaling up model sizes could reveal whether the performance gap between LightNet and Linear Attention narrows, providing insight into the effectiveness of additive decay.

3. It is unclear how much of the heavy lifting is being done by the position embeddings.

I believe there is a chance that the position embeddings may be doing a lot of the heavy lifting in getting good performance. For instance, the Toeplitz matrix in itself is a performant SSM [1] and may artificially boost the apparent efficacy of Additive Decay. Can the authors maybe use 2D RoPE as the position embedding and see if it improves on/competes with the original LightNet, LightNet with no position embeddings, and Linear Attention?

[1] Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers. Sukjun Hwang, Aakash Lahoti, Tri Dao, Albert Gu

Questions

  1. In the paper you do claim that you have results on 1B language modeling as well. Could you please share them?
  2. Could you also compare against vanilla LA especially on larger scales. This helps prove/disprove my hypothesis.
  3. Could the authors please explain how is the method implemented through matrix multiplications in the causal setting? I am not able to figure out the vectorized formula for the same that computes in O(Sequence Length) time.
  4. Could the authors please clarify what they mean by "parameter tiling strategy" in Line 266, and if and why this is not implemented for non-causal settings?
Comment

Q1. Saturation of Additive Decay for Long Sequences

A1. Thank you for your suggestion. From an intuitive perspective, the saturation of additive decay does not prevent distinguishing local information. This is comparable to Softmax Attention, which can still focus on local information in long texts. To validate this point, we compared the Needle-in-a-Haystack experiments of Mamba-2 3b [1] and LightNet 3b [2], with both models trained on the same 300b tokens. As shown in the figure below, LightNet significantly outperforms Mamba in this experiment, without any issues in distinguishing local information.

[Figures: Needle-in-a-Haystack results for LightNet and Mamba2]

Q2. Comparing LA and LightNet

A2. Thank you for your suggestion. We removed the additive decay from LightNet, replaced it with $\phi(x) = \mathrm{silu}(x)$, and trained LA models of size 1b and 3b. We then compared them to LightNet 1b (trained on the same 300b tokens). The table below presents the results on Commonsense Reasoning Tasks, showing that LightNet's average accuracy is nearly 2 points higher, demonstrating the effectiveness of additive decay.

| Model    | Params | Tokens (B) | PIQA  | HS    | WG    | ARC-E | ARC-C | OBQA  | AVG   |
|----------|--------|------------|-------|-------|-------|-------|-------|-------|-------|
| LA       | 3b     | 300        | 74.27 | 57.38 | 57.30 | 63.22 | 31.40 | 35.20 | 53.13 |
| LightNet | 3b     | 300        | 75.14 | 60.00 | 59.75 | 65.99 | 33.87 | 35.80 | 55.09 |
| LA       | 1b     | 300        | 70.73 | 45.55 | 50.51 | 55.22 | 27.30 | 31.00 | 46.72 |
| LightNet | 1b     | 300        | 71.06 | 47.27 | 51.30 | 56.31 | 27.56 | 33.00 | 47.75 |

Q3. The Role of Positional Encoding

A3. Thank you for your suggestion. Let me explain why RoPE is not suitable in this scenario. Assuming RoPE [3] is used, take a language model as an example:

$$
\begin{aligned}
\mathbf o_t^{\top} &= \sum_{s\le t}\mathbf q_t^{\top} \mathbf W_t^{\top} \left(\left(\mathbf W_s \exp(\mathbf k_s)\right)\Big/\left( \sum_{j=1}^t \exp(\mathbf k_j) \right) \right)\mathbf v_s^\top \\
&= \sum_{s\le t}\mathbf q_t^{\top} \mathbf W_t^{\top} \mathrm{diag}\left\{ \sum_{j=1}^t \exp(\mathbf k_j) \right\}^{-1} \left(\mathbf W_s \exp(\mathbf k_s)\right)\mathbf v_s^\top \\
&\neq \sum_{s\le t}\mathbf q_t^{\top} \mathbf W_{t}^{\top}\mathbf W_s\, \mathrm{diag}\left\{ \sum_{j=1}^t \exp(\mathbf k_j) \right\}^{-1} \exp(\mathbf k_s)\,\mathbf v_s^\top \\
&= \sum_{s\le t}\mathbf q_t^{\top} \mathbf W_{t-s}^{\top}\, \mathrm{diag}\left\{ \sum_{j=1}^t \exp(\mathbf k_j) \right\}^{-1} \exp(\mathbf k_s)\,\mathbf v_s^\top.
\end{aligned}
$$

The inequality in the second-to-last step arises because block-diagonal matrices (RoPE matrices) and diagonal matrices are non-commutative. However, this issue does not occur in LRPE [4], which is implemented as follows:

$$f_{\mathrm{lrpe}}(\mathbf x,\Theta)=\mathrm{concat}([\mathbf x \odot\cos \Theta,\ \mathbf x \odot\sin \Theta],\ \mathrm{dim}=-1).$$

Thus, the computation becomes:

$$
\begin{aligned}
\mathbf o_t^{\top} &= \sum_{s\le t}[\mathbf q_t \odot\cos(t\Theta),\ \mathbf q_t \odot\sin(t\Theta)]^{\top} \left[\exp(\mathbf k_s)\Big/\left( \sum_{j=1}^t \exp(\mathbf k_j) \right) \odot\cos(s\Theta),\ \exp(\mathbf k_s)\Big/\left( \sum_{j=1}^t \exp(\mathbf k_j) \right) \odot\sin(s\Theta) \right]\mathbf v_s^\top \\
&= \sum_{s\le t}\mathbf q_t^\top\, \mathrm{diag}\{\cos((t-s)\Theta)\}\, \exp(\mathbf k_s)\Big/\left( \sum_{j=1}^t \exp(\mathbf k_j) \right) \mathbf v_s^\top.
\end{aligned}
$$

This shows that LRPE can capture relative positional information, which is why we chose LRPE.
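A minimal NumPy sketch (ours) of the LRPE map above, where $\Theta$ is a vector of per-dimension frequencies: because $\cos((t-s)\Theta) = \cos(t\Theta)\cos(s\Theta) + \sin(t\Theta)\sin(s\Theta)$, the inner product of LRPE-encoded queries and keys depends only on the relative offset $t-s$.

```python
import numpy as np

def lrpe(x, t, theta):
    # f_lrpe(x, t*Theta) = concat([x * cos(t*theta), x * sin(t*theta)], dim=-1)
    return np.concatenate([x * np.cos(t * theta), x * np.sin(t * theta)], axis=-1)

d = 4
theta = np.random.rand(d)
q, k = np.random.randn(d), np.random.randn(d)

t, s = 7, 3
lhs = lrpe(q, t, theta) @ lrpe(k, s, theta)
rhs = (q * np.cos((t - s) * theta)) @ k
print(np.allclose(lhs, rhs))  # True: only the relative position t - s matters
```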

Q4. 1b Result

A4. Refer to A2.

Comment

Q5. $O(\text{Sequence Length})$ Time Algorithm in Causal Setting

A5. Thank you for your suggestion. This is not particularly complicated. We discuss two cases separately:

Without LRPE:

$$
\begin{aligned}
&\mathbf s_t=\mathbf s_{t-1}+\exp(\mathbf k_t), \quad \overline{\mathbf k}_t=\exp(\mathbf k_t) / \mathbf s_t, \\
&\lambda_t = \mathbf s_{t-1}/\mathbf s_t, \\
&\mathbf{kv}_t=\operatorname{diag}\{\lambda_t\}\, \mathbf{kv}_{t-1}+\overline{\mathbf k}_t \mathbf v_t^{\top}, \\
&\mathbf o_t^\top = \mathbf{kv}_t^\top \mathbf q_t.
\end{aligned}
$$

With LRPE:

The recursive formulation becomes:

$$
\begin{aligned}
&\mathbf s_t=\mathbf s_{t-1}+\exp(\mathbf k_t), \quad \tilde{\mathbf k}_t=\exp(\mathbf k_t) / \mathbf s_t, \\
&\lambda_t = \mathbf s_{t-1}/\mathbf s_t, \\
&\overline{\mathbf q}_t = [\mathbf q_t \odot\cos(t\Theta),\ \mathbf q_t \odot \sin(t\Theta)], \quad \overline{\mathbf k}_t = [\tilde{\mathbf k}_t \odot\cos(t\Theta),\ \tilde{\mathbf k}_t \odot \sin(t\Theta)], \quad \bar\lambda_t = [\lambda_t, \lambda_t], \\
&\mathbf{kv}_t=\operatorname{diag}\{\bar\lambda_t\}\, \mathbf{kv}_{t-1}+\overline{\mathbf k}_t \mathbf v_t^{\top}, \\
&\mathbf o_t^\top = \mathbf{kv}_t^\top \overline{\mathbf q}_t.
\end{aligned}
$$

In either case, after transformation, we convert it into a recursive formulation consistent with FLA [5], enabling $O(n)$ complexity computation, where $n$ is the sequence length.
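For reference, a minimal NumPy sketch (ours) of the causal recurrence without LRPE, written as a plain per-step loop rather than the chunked FLA kernel:

```python
import numpy as np

def lightnet_causal_recurrence(q, k, v):
    # q, k: (n, d); v: (n, e). Additive-decay linear attention, O(n) in sequence length.
    n, d = k.shape
    e = v.shape[1]
    s = np.zeros(d)            # running denominator: s_t = s_{t-1} + exp(k_t)
    kv = np.zeros((d, e))      # running state kv_t
    out = np.zeros((n, e))
    for t in range(n):
        ek = np.exp(k[t])
        s_prev, s = s, s + ek
        lam = s_prev / s                       # lambda_t = s_{t-1} / s_t
        k_bar = ek / s                         # normalized key
        kv = lam[:, None] * kv + np.outer(k_bar, v[t])
        out[t] = kv.T @ q[t]                   # o_t = kv_t^T q_t
    return out

q, k, v = np.random.randn(16, 8), np.random.randn(16, 8), np.random.randn(16, 4)
print(lightnet_causal_recurrence(q, k, v).shape)  # (16, 4)
```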

Q6. Explain "Parameter Tiling Strategy"

A6. What we actually meant was "parameter sharing strategy." Specifically, the nn.Linear parameters in the additive decay part are shared with the nn.Linear parameters in the key computation, a technique also used in HGRN2 [6].
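A minimal sketch (ours, with illustrative module and variable names, not the released code) of that sharing in PyTorch: a single nn.Linear output serves both as the key and as the input to the additive-decay accumulator.

```python
import torch
import torch.nn as nn

class SharedKeyDecay(nn.Module):
    # Illustrative only: one projection feeds both the key and the decay term.
    def __init__(self, d_model, d_head):
        super().__init__()
        self.k_proj = nn.Linear(d_model, d_head)  # shared parameters

    def forward(self, x):
        k = self.k_proj(x)                        # key
        s = torch.cumsum(torch.exp(k), dim=1)     # additive decay reuses the same projection
        return torch.exp(k) / s                   # normalized key used in the recurrence

m = SharedKeyDecay(32, 16)
print(m(torch.randn(2, 10, 32)).shape)  # torch.Size([2, 10, 16])
```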

Citations

  1. Tri Dao, & Albert Gu. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.
  2. gkamradt. GitHub - gkamradt/LLMTest_NeedleInAHaystack: Doing simple retrieval from LLM models at various context lengths to measure accuracy. github.com.
  3. Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, & Yunfeng Liu. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
  4. Zhen Qin, Weixuan Sun, Kaiyue Lu, Hui Deng, Dongxu Li, Xiaodong Han, Yuchao Dai, Lingpeng Kong, & Yiran Zhong. (2023). Linearized Relative Positional Encoding.
  5. Yang, S., & Zhang, Y. (2024). FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism. https://github.com/sustcsonglin/flash-linear-attention
  6. Zhen Qin, , Songlin Yang, Weixuan Sun, Xuyang Shen, Dong Li, Weigao Sun, Yiran Zhong. "HGRN2: Gated Linear RNNs with State Expansion." (2024).
Comment

Dear Reviewer, thank you for reviewing our manuscript and providing valuable feedback. We would like to kindly inquire if you have any further comments or suggestions, as we look forward to your response to further improve our work.

Comment

Dear Reviewers,

This is a gentle reminder that the authors have submitted their rebuttal, and the discussion period will conclude on November 26th AoE. To ensure a constructive and meaningful discussion, we kindly ask that you review the rebuttal as soon as possible and verify if your questions and comments have been adequately addressed.

We greatly appreciate your time, effort, and thoughtful contributions to this process.

Best regards, AC

Comment

We sincerely appreciate all the reviewers for their thorough reviews and constructive feedback. Based on their comments, we have revised the manuscript, with modifications highlighted in brown font. Below, we summarize the main changes:

  1. Addressing minor errors and missing citations: Following the reviewers' suggestions, we corrected typographical errors and added missing citations, making the paper more self-contained.
  2. Improved table visualization: Following Reviewer 4dh3's suggestion, we enhanced the presentation of table results.
  3. Refined Section 3.1: Following Reviewer TjCt's suggestion, we revised the recursive definitions of additive decay and multiplicative decay, and provided clearer explanations of global information.
  4. Revised Figure 3: Following Reviewer TjCt's suggestion, we have corrected Figure 3.
  5. Expanded motivation for LightNet positional encoding: Following Reviewers tk6m and YtrF's suggestions, we elaborated on the design motivation for LightNet’s positional encoding, reflected in Section 4.2 and Appendix A2.
  6. Added ablation studies: Following Reviewers tk6m and YtrF's suggestions, we included additional ablation studies on removing additive decay/softmax activation function, detailed in Section 5.3, Effectiveness of Additive Decay.
  7. Extended discussion of related work: Following Reviewer YtrF's suggestion, we added discussions on related works [1], [2].
  8. Comparison with Mamba retrieval capabilities: Following Reviewer tk6m's suggestion, we included a comparison with Mamba’s retrieval performance, as shown in Figure 5.

Citations:

[1] Mega: Moving Average Equipped Gated Attention. Ma et al. ICLR 2023

[2] Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length. Ma et al.

AC Meta-Review

The paper proposes a new form of additive linear attention to efficiently process multi-dimensional data. Further, the paper also proposes new forms of positional embeddings for the same. Empirical evaluations show performance improvements over linear attention in many multi-dimensional settings. While the reviewers appreciated the theoretical motivations of the work, there were concerns regarding the clarity of exposition and the lack of sufficient ablations and stress testing (e.g., in long-sequence tasks specifically).

Additional Comments from the Reviewer Discussion

See above on reviewer concerns. The authors provided a few additional experiments and clarifications, but some of the reviewers were not convinced. For example, in distinguishing the importance of PEs vs. the additive recurrence as asked by others, the authors chose to rely on a theoretical argument when an experiment was needed. Reviewer 4dh3 was the only positive reviewer, but had a low confidence score and also an unaddressed query about asymmetry in the non-causal setting. Overall, the lack of a strong champion amongst the reviewers and insufficient clarity in exposition and experimental ablation makes me consider a reject rating.

Final Decision

Reject