PaperHub
Overall score: 7.8/10 · Poster · 4 reviewers (lowest 4, highest 5, std 0.4)
Ratings: 4, 5, 5, 5 · Confidence: 3.5
Novelty: 2.5 · Quality: 3.3 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Spectral Conditioning of Attention Improves Transformer Performance

Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We propose spectral conditioning of attention layers to improve Jacobian conditioning, leading to more stable and efficient optimization with negligible computational overhead and consistent gains across diverse transformer architectures.

Abstract

Keywords
spectral conditioning, attention

Reviews and Discussion

Review
Rating: 4

This paper presents a theoretical analysis of the Jacobian matrix of self-attention layers in Transformers, revealing its dependence on the query, key, and value projection matrices. Based on this insight, the authors propose a method to systematically adjust the spectral properties of self-attention layers, improving their overall conditioning within Transformer networks. The method demonstrates consistent performance improvements across various architectures.

Strengths and Weaknesses

Strengths

Solid theoretical foundation: Establishes the mathematical relationship between the condition number of self-attention layer Jacobian matrices and the condition numbers of WQ, WK, WV matrices, transforming Transformer optimization into a matrix condition number optimization problem.

Simple and practical method: The proposed spectral conditioning method serves as a drop-in replacement, requires no additional parameters, and is applicable to various attention mechanisms with good generalizability.

Comprehensive experimental validation: Validates the method's effectiveness across multiple tasks including image classification, object detection, instance segmentation, and NLP.

Weaknesses:

Insufficient motivation: While the authors mention the effectiveness of Jacobian conditioning optimization in feedforward networks [1], they do not adequately demonstrate whether similar problems exist in Transformers or the necessity of condition number optimization.

Lack of hyperparameter guidance: The $\lambda$ parameter relies on grid search, lacking theoretical guiding principles.

[1] Weight conditioning for smooth optimization of neural networks. ECCV 2025.

Questions

  1. Suggest experimental analysis to verify whether attention layers in Transformers exhibit similar condition number problems as feedforward networks.

  2. Recommend extending the analysis to feedforward components to comprehensively validate the applicability of this analytical framework to Transformer architectures.

  3. Suggest providing theoretical selection criteria for the $\lambda$ parameter to enhance the method's practicality in large-scale models.

Limitations

Yes

Final Justification

The authors have addressed my concerns regarding the research motivation and have further supported the necessity of Jacobian conditioning optimization in the Transformer architecture through additional experiments. They have also effectively clarified the role of the FFN component in Transformer blocks.

However, the hyperparameter selection strategy remains insufficiently addressed, which may limit the applicability of the proposed method to larger-scale models. Therefore, I have raised my score from 3 to 4.

Formatting Issues

None

Author Response

We sincerely thank the reviewer for taking the time to review our paper and for providing thoughtful and constructive feedback. Below, we address each of the points and questions raised in the review.

  1. Insufficient motivation: While the authors mention the effectiveness of Jacobian conditioning optimization in feedforward networks [1], they do not adequately demonstrate whether similar problems exist in Transformers or the necessity of condition number optimization / Suggest experimental analysis to verify whether attention layers in Transformers exhibit similar condition number problems as feedforward networks.

We thank the reviewer for this comment, though we find it somewhat surprising. In Figures 2 and 3, we compute the condition number of the Jacobian of the self-attention block for the ViT-B and XCiT models, respectively (see the rightmost plots in each figure). For ViT-B, the condition number reaches approximately $\mathcal{O}(10^{10})$, while for XCiT it is around $\mathcal{O}(10^8)$, both of which are extremely high. These same figures also show the effect of applying spectral conditioning: the condition number for ViT-B is reduced to about $\mathcal{O}(10^7)$, and for XCiT it drops to approximately $\mathcal{O}(10^5)$. In both cases, this represents a reduction of nearly three orders of magnitude, clearly demonstrating that the Jacobian of self-attention can become severely ill-conditioned during training, and that spectral conditioning provides a simple and effective remedy.

We further conducted this analysis on the Nyströmformer using the text classification task from the LRA benchmark. As shown in Figure 4, the condition number of the attention block's Jacobian is around $\mathcal{O}(10^6)$, and spectral conditioning brings it down to $\mathcal{O}(10^4)$. Additional results are provided in Appendix A.3, where we carry out a similar analysis for the Nyströmformer on the ListOps task.

Together, these results provide strong empirical evidence that the Jacobians of attention blocks in various transformer architectures are highly ill-conditioned, and that spectral conditioning offers a simple and broadly effective method for improving their conditioning.

  2. Recommend extending the analysis to feedforward components to comprehensively validate the applicability of this analytical framework to Transformer architectures.

The primary focus of our paper is to investigate the condition number of the Jacobian of the self-attention block in Transformer architectures and to show that it is often highly ill-conditioned. Motivated by this observation, we propose a simple and theoretically grounded method, spectral conditioning, that effectively reduces the Jacobian's condition number. We then demonstrate that applying this technique improves performance across a variety of Transformer-based applications. We believe this is more than sufficient for a NeurIPS paper, especially given the theoretical derivations and the comprehensive experimental validation.

To address the reviewer's comment, we conducted an additional set of experiments applying spectral conditioning to the feedforward (FFN) layers within each Transformer block. Specifically, for each weight matrix $W$ in the feedforward layers, we applied a spectral correction matrix $C_W$ as defined by Theorem 3.8, forming the updated matrix $W + C_W$. Note that Theorem 3.8 ensures that the conditioned matrix $W + C_W$ has a lower condition number than the original $W$. We then trained ViT-B on the ImageNet-1k dataset under four settings:

  1. No spectral conditioning applied
  2. Spectral conditioning applied only to the attention layers
  3. Spectral conditioning applied only to the feedforward layers
  4. Spectral conditioning applied to both attention and feedforward layers

The results are summarized in the table below. We observe that applying spectral conditioning only to the feedforward layers yields a minor performance gain. In contrast, conditioning only the attention layers results in an approximate 1% improvement. Applying spectral conditioning to both components further boosts accuracy by about 1.1% overall.

| Model | Accuracy |
| --- | --- |
| ViT-B (original) | 80.7 |
| ViT-B with Spec. Cond. on Attention | 81.7 |
| ViT-B with Spec. Cond. on Feedforward | 80.9 |
| ViT-B with Spec. Cond. on Attention and Feedforward | 81.8 |

These findings support our main claim: for ViT-B models trained on ImageNet-1k, applying spectral conditioning to the self-attention layers significantly improves performance, and conditioning the feedforward layers provides a small additional benefit. This reinforces the importance of controlling the Jacobian’s condition number in attention mechanisms and further highlights the effectiveness of our proposed method.

We also computed the condition number of the Jacobian for both the self-attention and feedforward layers of a ViT-B model during training on ImageNet-1k. For the self-attention layers, we extracted the condition number of the Jacobian for each head in each layer, then averaged across all heads, all layers, and across the entire training pass to obtain a single average condition number representing the self-attention block over training. Similarly, for the feedforward layers, we computed the condition number of the Jacobian for each layer and averaged these values across layers and training steps to obtain an overall average for the feedforward blocks.
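
To make this procedure concrete, the following is a minimal, illustrative PyTorch sketch of how the condition number of one head's Jacobian (here taken with respect to $W_Q$; the exact Jacobian definition used for our measurements may differ) can be computed at toy sizes. This is not the measurement code used for the numbers reported below.

```python
import torch
from torch.autograd.functional import jacobian

def attention_head(W_Q, W_K, W_V, X):
    # Single-head self-attention output for a token matrix X of shape (n, d).
    scores = (X @ W_Q) @ (X @ W_K).T / W_Q.shape[1] ** 0.5
    return torch.softmax(scores, dim=-1) @ (X @ W_V)

n, d = 8, 16  # toy sizes: 8 tokens, 16-dimensional embeddings
X = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))

# Jacobian of the head output with respect to W_Q, flattened to a 2-D matrix.
J = jacobian(lambda W: attention_head(W, W_K, W_V, X), W_Q)
J_2d = J.reshape(n * d, d * d)

# Condition number = largest singular value / smallest singular value.
s = torch.linalg.svdvals(J_2d)
print((s[0] / s[-1]).item())
```

Averaging this quantity over heads, layers, and training steps gives a single per-block number of the kind reported in the tables below.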

We repeated this procedure after applying spectral conditioning. The results are shown in the table below. Notably, we observe that the average condition number of the self-attention block is significantly higher than that of the feedforward block. Applying spectral conditioning to the attention layers reduces their condition number by nearly three orders of magnitude, whereas applying it to the feedforward layers results in a reduction of roughly one order of magnitude.

These results clearly indicate that the attention mechanism is the most ill-conditioned component of the Transformer architecture, and that spectral conditioning is particularly effective in addressing this issue.

| Component | Average condition number |
| --- | --- |
| Attention | 9.8e9 |
| Feedforward | 2.6e4 |
| Spec. Cond. Attention | 2.8e7 |
| Spec. Cond. Feedforward | 3.2e3 |

We repeated the above analysis for the Nyströmformer on the text classification task in the LRA benchmark set.

| Model | Accuracy |
| --- | --- |
| Nyströmformer (original) | 63.8 |
| Nyströmformer with Spec. Cond. on Attention | 64.8 |
| Nyströmformer with Spec. Cond. on Feedforward | 63.9 |
| Nyströmformer with Spec. Cond. on Attention and Feedforward | 64.9 |

| Component | Average Condition Number |
| --- | --- |
| Attention | 9.1e6 |
| Feedforward | 4.6e3 |
| Spec. Cond. Attention | 2.1e4 |
| Spec. Cond. Feedforward | 8.9e2 |

  3. Lack of hyperparameter guidance: The $\lambda$ parameter relies on grid search, lacking theoretical guiding principles / Suggest providing theoretical selection criteria for the $\lambda$ parameter to enhance the method's practicality in large-scale models.

Theorem 3.8 provides a theoretical lower bound for the $\lambda$ value. While we agree that a deeper theoretical explanation for selecting $\lambda$ would be valuable, we believe that such an analysis falls outside the scope of the current work and intend to explore it in future research. That said, we do provide a concrete ablation study in Appendix A.2.1, treating $\lambda$ as a tunable hyperparameter. This study illustrates how performance varies as $\lambda$ is adjusted, offering clear empirical insight into its effect.

Comment

Thank you for the clarifying response. Given that my concerns have been addressed, I will consider revising my score upward.

Comment

Thank you for your thoughtful follow-up and for considering a higher score now that your concerns have been addressed. We’re grateful for your time and constructive feedback.

Review
Rating: 5

This paper focuses on analyzing the conditioning of the self-attention mechanism in Transformers to enable more stable training. An upper bound on the condition number of the Jacobian of the self-attention function is derived, providing a principled way to control Jacobian conditioning during training. Specifically, the paper introduces a technique called spectral conditioning, which can be applied not only to the classical attention mechanism but also to more advanced variants such as shifted-window attention (e.g., Swin Transformer) and low-rank approximations (e.g., Nyströmformer). Empirical results demonstrate that spectral conditioning improves the conditioning of attention modules during training. Experiments across diverse learning tasks confirm the enhanced performance brought by this technique.

Strengths and Weaknesses

Strength: Conditioning analysis in Transformer models remains relatively under-explored compared to other network architectures. Hence, this paper could inspire conditioning-aware design that improves the numerical stability of attention-based models.

Weakness: A discussion and comparison with low-rank attention mechanisms may be necessary.

Questions

While I do appreciate the work, there are some related concerns:

  1. Low-rank approximation is closely related to the conditioning of the attention mechanism. Both Theorem 3.3 and Theorem 3.4 involve the term $\text{softmax}(XW_QW_K^\top X^\top)$, which is a core target of many low-rank Transformer variants, such as Nyströmformer, Linformer [1], Performer [2]. First, to what extent do the theoretical bounds provided in Theorems 3.3 and 3.4 generalize to these low-rank attention variants? From my perspective, these theorems may be applicable to these methods since some spectral components can be controlled under the low-rank condition. Second, is it sufficient to control only the weight matrices $(W_Q, W_K, W_V)$ in these low-rank attentions in order to reduce their Jacobian's condition number? A broader discussion and numerical comparison across different low-rank self-attention mechanisms is necessary, as it could reveal how spectral conditioning complements or overlaps with structural approximations in improving training stability and performance.
  2. To better evaluate the impact of Jacobian conditioning, I feel it is necessary to add experiments that directly reflect numerical stability during training, such as tracking of gradient norms, losses, etc.
  3. There are still some unclear notations and typos in the manuscript. In Eq. (4), $J$ should be removed from the partial derivative; in Table 1, it is not immediately clear whether "Spec. cons." refers to "Spec. cons. + the corresponding attention mechanism", and the same applies to the subsequent experiments. In the caption of Figure 14, ViT-B should be corrected to Nyströmformer. Please double check the typos and expressions.

I would like to consider raising the score if these concerns are resolved.

References

[1] Wang et al., Linformer: Self-Attention with Linear Complexity, arXiv 2020

[2] Choromanski et al., Rethinking Attention with Performers. ICLR 2021

Limitations

yes

Final Justification

The authors have addressed my concerns with extra experimental supports. Hence, I would like to increase the rating.

Formatting Issues

No

Author Response

We sincerely thank the reviewer for taking the time to review our paper and for providing thoughtful and constructive feedback. Below, we address each of the points and questions raised in the review.

  1. Low-rank approximation is closely related to the conditioning of the attention mechanism. Both Theorem 3.3 and Theorem 3.4 involve the term $\text{softmax}(XW_QW_K^TX^T)$, which is a core target of many low-rank Transformer variants, such as Nyströmformer, Linformer [1], Performer [2]. First, to what extent do the theoretical bounds provided in Theorems 3.3 and 3.4 generalize to these low-rank attention variants? From my perspective, these theorems may be applicable to these methods since some spectral components can be controlled under the low-rank condition.

This is a very good question and we thank the reviewer for asking it. The reviewer is exactly right. Since both Theorems 3.3 and 3.4 involve the term $\text{softmax}(XW_QW_K^TX^T)$, and this is the critical term used in any low-rank variant of self-attention, the theoretical bounds in Theorems 3.3 and 3.4 generalize to a low-rank approximation of such attention terms. The key point is that in deriving Theorems 3.3 and 3.4 we make use of differentiating the term $\text{softmax}(XW_QW_K^TX^T)$ with respect to $W_Q$ and $W_K$, which holds for low-rank attention variants. In fact, this is exactly why spectral conditioning worked for the Nyströmformer on the LRA benchmark shown in Section 4.3 of the paper. To further show the reviewer that spectral conditioning helps in the low-rank setting, we applied it to the Performer architecture, as referenced by the reviewer, on the LRA benchmark. As can be seen from the table below, spectral conditioning yields better performance.

| Model | ListOps | Text | Retrieval | Image | Pathfinder |
| --- | --- | --- | --- | --- | --- |
| Performer | 18.04 | 65.40 | 53.83 | 42.77 | 77.05 |
| Spec. Cond. Performer | 19.08 | 66.61 | 54.65 | 43.79 | 78.21 |

  2. Second, is it sufficient to control only the weight matrices $(W_Q, W_K, W_V)$ in these low-rank attentions in order to reduce their Jacobian's condition number? A broader discussion and numerical comparison across different low-rank self-attention mechanisms is necessary, as it could reveal how spectral conditioning complements or overlaps with structural approximations in improving training stability and performance.

Theorems 3.3 and 3.4 establish that the condition number of the self‑attention Jacobian is controlled by the projection matrices $(W_Q, W_K, W_V)$. Theorem 3.3 gives the explicit relationship. Low‑rank variants such as Nyströmformer, Performer, and Linformer still compute attention in the form $\text{softmax}(XW_QW_K^TX^T)$, or compute an approximation of it that still uses the terms $(W_Q, W_K, W_V)$. Consequently, their Jacobians inherit the same dependence on $(W_Q, W_K, W_V)$. By conditioning these matrices, we can therefore reduce the Jacobian's condition number in both the standard and low‑rank settings. We will add a short derivation outlining this point in the appendix.

  3. To better evaluate the impact of Jacobian conditioning, I feel it is necessary to add experiments that directly reflect numerical stability during training, such as tracking of gradient norms, losses, etc.

Tracking the gradient norm only captures the magnitude of the gradient, and since the Jacobian is the transpose of the gradient, this is equivalent to tracking the norm of the Jacobian. However, we would like to emphasize that conditioning is fundamentally different from merely normalizing the size of the Jacobian norm. In general, two matrices can have the same condition number but very different norms. As a concrete example, consider the matrices

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 10.1 & 0 \\ 0 & 10 \end{bmatrix}.$$

The matrix $A$ has condition number $1$ and the matrix $B$ has condition number $1.01$, so both matrices have condition numbers close to $1$. However, the Frobenius norm of $B$ is significantly larger than that of $A$. There are also examples where the norms of the matrices are similar but their condition numbers are very different. For example, consider the matrices $C$ and $D$:

$$C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} \sqrt{2} & 0 \\ 0 & 0.1 \end{bmatrix}.$$

$C$ has a condition number of $1$ and $D$ has a condition number of approximately $14.14$, so $D$ has a much larger condition number than $C$. However, their Frobenius norms are similar.

This demonstrates that computing the norm of the Jacobian during training does not necessarily provide any information about its condition number.
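
As a quick numerical check of these examples (illustrative only):

```python
import numpy as np

A = np.eye(2)
B = np.diag([10.1, 10.0])
C = np.eye(2)
D = np.diag([np.sqrt(2), 0.1])

for name, M in [("A", A), ("B", B), ("C", C), ("D", D)]:
    s = np.linalg.svd(M, compute_uv=False)  # singular values, descending
    print(name, "cond =", s[0] / s[-1], " frob =", np.linalg.norm(M, "fro"))

# A, B: condition numbers 1.0 and 1.01, but Frobenius norms ~1.41 vs ~14.2.
# C, D: Frobenius norms ~1.41 vs ~1.42, but condition numbers 1.0 vs ~14.1.
```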

In our paper, we apply spectral conditioning by adding predefined matrices $C_Q$, $C_K$, and $C_V$ to the weight matrices $W_Q$, $W_K$, and $W_V$, respectively. According to Theorem 3.3, this increases the Frobenius norm of the self-attention Jacobian. However, all transformer architectures used in the paper employ Layer Normalization, which regulates the scale of the weight norms and ensures that Jacobian norms do not explode during backpropagation.

Therefore, while spectral conditioning may increase the Jacobian norm, Layer Normalization helps mitigate this effect and prevents the norm from becoming excessively large. We explicitly discuss this interaction in the paper: at line 211, we refer the reader to Appendix A.3, where we present an ablation study on spectral conditioning and Layer Normalization in the context of the Nyströmformer; see lines 539–542 and Table 8 in Appendix A.3. The table shows that removing Layer Normalization while retaining spectral conditioning leads to worse performance. This is because spectral conditioning alone can increase the Jacobian norm, and without Layer Normalization to regulate it, training becomes unstable.

In summary, spectral conditioning improves the conditioning of the singular value spectrum but does not necessarily reduce the Jacobian norm. In contrast, Layer Normalization helps control the norm of the Jacobian. Thus, spectral conditioning should be viewed as complementary to Layer Normalization, not a replacement. We will add this clarification to Appendix A.3 to make the distinction clear.
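
For concreteness, here is a minimal PyTorch-style sketch of a spectrally conditioned projection in the practical form discussed above (a fixed $\lambda I_k$ added to the weight, with $\lambda = 10$ as in our experiments). This is an illustration only, not the exact implementation in the paper, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class SpectrallyConditionedLinear(nn.Module):
    """Linear projection whose effective weight is W + lambda * I_k (hypothetical sketch)."""

    def __init__(self, in_dim: int, out_dim: int, lam: float = 10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) / in_dim ** 0.5)
        k = min(out_dim, in_dim)
        # Fixed (non-trainable) correction: lambda on the leading k x k diagonal.
        correction = torch.zeros(out_dim, in_dim)
        correction[:k, :k] = torch.eye(k)
        self.register_buffer("correction", lam * correction)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The conditioned weight W + lambda * I_k is used in place of W.
        return x @ (self.weight + self.correction).T

# Example: conditioned query/key/value projections for one attention head.
d_model, d_head = 768, 64
proj_q = SpectrallyConditionedLinear(d_model, d_head)
proj_k = SpectrallyConditionedLinear(d_model, d_head)
proj_v = SpectrallyConditionedLinear(d_model, d_head)
x = torch.randn(4, 16, d_model)  # (batch, tokens, d_model)
q, k, v = proj_q(x), proj_k(x), proj_v(x)
```

Whether the correction is added once at initialization or re-applied on every forward pass, as sketched here, is a design choice this example does not settle; the property the analysis relies on is only that the effective projection weight is $W + \lambda I_k$.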

We would also like to add that we did compute the condition numbers of the Jacobian of the attention during training for the vision transformers and the Nyströmformer, both with and without spectral conditioning; see Figures 2, 3, and 4 (rightmost plots). In each case you can see that spectral conditioning reduces the condition number of the Jacobian by almost 2–3 orders of magnitude, clearly showing the stability of our methodology.

  4. There are still some unclear notations and typos in the manuscript. In Eq. (4), $J$ should be removed from the partial derivative; in Table 1, it is not immediately clear whether "Spec. cons." refers to "Spec. cons. + the corresponding attention mechanism", and the same applies to the subsequent experiments. In the caption of Figure 14, ViT-B should be corrected to Nyströmformer. Please double check the typos and expressions.

We sincerely thank the reviewer for pointing this out. The $J$ in Equation (4) was a typo and will be removed from the partial derivative. In Tables 1–4, "Spec. Cond." consistently refers to applying spectral conditioning to the matrices $W_Q$, $W_K$, and $W_V$ used in the attention mechanism of the corresponding architecture. We will clarify this explicitly for the reader. The caption of Figure 14 will also be corrected to Nyströmformer.

Comment

Many thanks to the authors for the detailed responses. As my concerns have been well solved, I would like to increase the score.

Review
Rating: 5
  • The paper presents a theoretical analysis of the conditioning of the Jacobian of a self-attention layer of transformers.
  • Citing previous work, the authors argue that poorly conditioned Jacobians can hinder performance and improving Jacobian conditioning can lead to better optimization and generalization performance.
  • A theoretical analysis is performed, showing how the conditioning of the Jacobian of a self-attention layer is influenced by the conditioning of the query (Q), key (K), and value (V) matrices.
  • Based on this analysis, the authors propose spectral conditioned self-attention, which adds correction terms to the Q, K, V matrices during self-attention.
  • Since computing the correction terms using SVD at each iteration is prohibitive, the paper proposes an implementation-friendly form of the correction terms as $\lambda \mathbf{I}$, for some scalar hyperparameter $\lambda \geq 2$ (in the experiments, $\lambda = 10$ is chosen).
  • The authors empirically show improved conditioning and generalization on ImageNet classification, COCO object and instance segmentation, LRA text classification, and the GLUE benchmark.

Strengths and Weaknesses

Strengths

  • Quality
    • The proposed method is simple and technically sound.
    • Experiments are performed across multiple domains and tasks.
    • Assumption of full rank in the theorems is empirically validated.
  • Clarity
    • The paper is clear and easy to follow.
  • Significance
    • Despite being simple, the method leads to generalization improvements across diverse settings.
  • Originality
    • The proposed spectral conditioned self-attention layer is novel.

Weaknesses

  • Quality
    • The method only aims to minimize the upper bound of the condition number, and the practical implementation further loosens the upper bound being minimized.
    • Theorem 3.5 is a trivial result that is unnecessary to derive the final practical method (Theorem 3.8 is all you need).
    • Self-attention is a special case of attention, and it is not clear why the analysis does not extend to cases like cross-attention.
  • Significance
    • The paper evaluates relatively small models and the generalization improvements are relatively small. There is no clear signal that the generalization improvements will persist in larger-scale settings.
  • Originality
    • The idea behind the final practical method (Theorem 3.8), adding values to the diagonal to improve condition number, is not novel and is a fairly standard approach in linear algebra.

Questions

Is it possible to generalize the paper's analysis and method beyond self-attention (e.g., cross-attention)?

Limitations

Yes, except the limitation to the self-attention case of attention.

Final Justification

In the response to the authors' rebuttal, I pointed out counterexamples to Theorem 3.8, which, if left unresolved, would have led to a clear recommendation of rejection.

  • The authors were able to fix the theorem and its proof without invalidating the paper. Based on the updated proof provided by the authors, I recommended simplifying and strengthening the theorem statement to not require constants like 0.4 and 2, which the authors agreed to incorporate into the paper.

One of my concerns was that the paper aims to minimize the upper bound of the condition number, but the bound used in the final method is so loose that one of their own intermediate upper bound theorems (3.5) is made unnecessary by the practical upper bound theorem (3.8), which they argue leads to a faster algorithm. In addition to a larger perceived gap between theory and practice, it also leads to a confusing presentation where a notable part of the paper feels disconnected from the rest.

  • The authors provided an empirical link to reduce this perceived gap: baseline is fast and performs the worst, the method based on theorem 3.5 is the slowest but performs the best, and the method based on theorem 3.8 is comparably fast as the baseline with performance between that of the other two settings.

Another concern, echoed by other reviewers, was that self-attention is a special case of attention, and it is not clear why the analysis does not extend to cases like cross-attention.

  • The authors already have empirical results where the method is applied to more general forms of attention. During the discussion, they stated that the theoretical results are also general and that they can generalize the paper's narrative accordingly.

Finally, my original review noted that the models and generalization improvements are relatively small.

  • The authors were limited by resources and pointed out that this is stated in their limitations. This is entirely reasonable.
  • Regarding the generalization improvements being relatively small, the performance improvements using a method based on theorem 3.5 (discussed above) are larger, providing better support for the more general message for spectral conditioning of attention.

Overall, I am satisfied with the discussions, and I increase my score to 5 (Accept).

Formatting Issues

On pages 4 and 5, non-standard colored boxes are used for theorems and other details.

Author Response

We sincerely thank the reviewer for taking the time to review our paper and for providing thoughtful and constructive feedback. Below, we address each of the points and questions raised in the review.

  1. The method only aims to minimize the upper bound of the condition number, and the practical implementation further loosens the upper bound being minimized.

We thank the reviewer for their comment. However, we would like to note that this point is already acknowledged in the limitations section (Section 5) of the paper. We kindly ask the reviewer to follow the NeurIPS reviewer guidelines, see https://neurips.cc/Conferences/2025/ReviewerGuidelines, where it clearly states that authors should not be punished for being up front about the limitations of their work. While our method minimizes an upper bound on the condition number, our experimental results demonstrate that doing so is nonetheless effective, leading to improved performance across a range of tasks.

In particular, Figures  2, 3, and 4 (rightmost subplots) show that spectral conditioning substantially reduces the condition number of the self-attention Jacobian, by 2 to 3 orders of magnitude, and that this reduction is maintained consistently throughout training. These findings suggest that, even though our approach targets an upper bound, the resulting improvements in conditioning have meaningful and beneficial consequences for performance.

  2. Theorem 3.5 is a trivial result that is unnecessary to derive the final practical method (Theorem 3.8 is all you need).

Thank you for the suggestion. We introduced Theorem 3.5 to show that a tighter bound on the condition number can be obtained when one explicitly applies the singular‑value decomposition (SVD). Because computing an SVD at every iteration is prohibitively expensive, Theorem 3.8 provides a practical, SVD‑free alternative. We believe highlighting this trade‑off is helpful for readers who may not be familiar with such techniques, especially given NeurIPS’ broad, general‑machine‑learning audience. That said, we are happy to move Theorem 3.5 to the appendix or restate it as a proposition if the reviewer feels that would be better.

  3. Self-attention is a special case of attention, and it is not clear why the analysis does not extend to cases like cross-attention.

Our conditioning method does extend to cross attention. This is because cross attention uses the same kernel as regular self-attention. The difference between the two is that cross attention allows an input stream of tokens to interact with a different stream of tokens, while self-attention only allows tokens to interact with each other within each input stream. However, both forms of attention use the same kernel defined by Eq. (3) in the paper. Theorems 3.3 and 3.4 of the paper are only dependent on the kernel used, as we are taking derivatives with respect to $W_Q$, $W_K$ and $W_V$ and not the input tokens $X$, and thus also hold for cross attention. We will include a paragraph explaining this in the appendix.

We would also like to point out to the reviewer that we empirically demonstrated that spectral conditioning works on a variety of different attention forms, such as cross-covariance attention used in XCiT, windowed attention in the Swin Transformer, and dual attention in DaViT. These results were all shown in Table 1 of the paper. In each case, applying spectral conditioning to the associated attention form led to improved performance.

  4. The paper evaluates relatively small models and the generalization improvements are relatively small. There is no clear signal that the generalization improvements will persist in larger-scale settings.

The models we trained were at most roughly 100 million parameters, primarily because our resources only allowed us to experiment with models at this scale. We would also like to point out that we clearly stated this as a limitation in Section 5 of the paper. We would kindly ask the reviewer to follow the NeurIPS 2025 reviewer guidelines (https://neurips.cc/Conferences/2025/ReviewerGuidelines), where it says that authors should not be punished for being up front about the limitations of their paper; see point 4 under the Main Task heading and point 8 under the Reviewer Form heading.

We believe the insights derived at the roughly 100M‑parameter scale are still valuable to the community: they reveal consistent trends and provide a solid stepping‑stone for future work on larger models.

  5. The idea behind the final practical method (Theorem 3.8), adding values to the diagonal to improve condition number, is not novel and is a fairly standard approach in linear algebra.

While the technique itself is rooted in linear‑algebraic reasoning, our literature search uncovered no prior work, either in machine learning or in the linear‑algebra community, that applies it to attention mechanisms within Transformer architectures. If we have overlooked a relevant citation, we would greatly appreciate the reviewer’s guidance and will gladly include any suggested references to ensure the paper is both comprehensive and fair.

Comment

Thank you for your response.

We kindly ask the reviewer to follow the NeurIPS reviewer guidelines, see https://neurips.cc/Conferences/2025/ReviewerGuidelines, where it clearly states that authors should not be punished for being up front about the limitations of their work.

First, I would like to reassure the authors that they are not being punished for the act of being up front about the limitations of their work. As a general statement, you do gain credit for being up front about a limitation, but it doesn't mean that the limitation itself ceases to count as a weakness; it still has to stand up to scrutiny (e.g., does it undermine the paper's message/claims? does it break under realistic conditions? etc). Listing limitations does not grant immunity against such scrutiny.

The idea behind the final practical method (Theorem 3.8), adding values to the diagonal to improve condition number, is not novel and is a fairly standard approach in linear algebra.

our literature search uncovered no prior work

I thank the authors for pointing this out. Indeed, I was too quick with this point and incorrectly borrowed intuition from the case of the matrix $A$ being p.s.d.

Accordingly, I looked through the details of Theorem 3.8 again, and was able to construct a counterexample:

Let $A = \begin{bmatrix} 0.1 - \lambda & 0 \\ 0 & 0.3 \end{bmatrix}$, where $\lambda \geq 2$. Here, the singular values are simply the absolute values of the diagonal entries. Thus, we have $\sigma_{\min}(A) = 0.3$ and $\sigma_{\max}(A) = \lambda - 0.1$. This setup satisfies the constraints of the theorem.

Then, we have $A + \lambda I = \begin{bmatrix} 0.1 & 0 \\ 0 & \lambda + 0.3 \end{bmatrix}$, with $\sigma_{\min}(A + \lambda I) = 0.1$ and $\sigma_{\max}(A + \lambda I) = \lambda + 0.3$.

Now, $\kappa(A) = \frac{\lambda}{0.3}$ and $\kappa(A + \lambda I) = \frac{\lambda + 0.3}{0.1}$. For $\kappa(A + \lambda I) < \kappa(A)$ to hold, this requires $\lambda < -0.45$. But $\lambda \geq 2$, so we have a contradiction.
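
As a quick numerical check of this counterexample (with $\lambda = 2$), using NumPy:

```python
import numpy as np

lam = 2.0  # any lambda >= 2 exhibits the same behaviour
A = np.diag([0.1 - lam, 0.3])

print(np.linalg.cond(A))                    # kappa(A)           ~ 6.33
print(np.linalg.cond(A + lam * np.eye(2)))  # kappa(A + lam * I) = 23.0, i.e. larger
```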

Looking at the proof in the appendix, we have the following problems:

  1. One of the first inequalities the proof starts with is incorrect (Equation 53): $\sigma_{\min}(A + \lambda I) \geq \sigma_{\min}(\lambda I) - \sigma_{\min}(A)$.

    • Counterexample: $A = \begin{bmatrix} -\lambda & 0 \\ 0 & 2 \end{bmatrix}$ with $\lambda > 2$. Then,
      • $\sigma_{\min}(\lambda I) = \lambda$
      • $\sigma_{\min}(A) = 2$
      • $\sigma_{\min}(A + \lambda I) = 0$
      • $\sigma_{\min}(\lambda I) - \sigma_{\min}(A) = \lambda - 2 > 0 = \sigma_{\min}(A + \lambda I)$, which contradicts Equation 53.
  2. The assumptions on $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ seem to be different from those in the main paper. This discrepancy is confusing.

    • Main paper: $\sigma_{\min}(A) < 0.4$ and $\sigma_{\max}(A) \geq 1$
    • Appendix: $\sigma_{\min}(A) < 0.5$ and $\sigma_{\max}(A) > 5$

If there are any mistakes in this reasoning, please let me know. If the flaws are valid, are there any missing assumptions, that are satisfied in practice, that can help fix the theory?

Conditioned on this concern being addressed, I provide other responses to the rebuttal below.

The method only aims to minimize the upper bound of the condition number, and the practical implementation further loosens the upper bound being minimized.

we would like to note that this point is already acknowledged in the limitations section (Section 5) of the paper.

Theorem 3.5 is a trivial result that is unnecessary to derive the final practical method (Theorem 3.8 is all you need).

We believe highlighting this trade‑off is helpful for readers who may not be familiar with such techniques

A cleaner narrative can be presented by including results that utilize Theorem 3.5, showing it to perform better than standard attention and the alternative inspired by a corrected Theorem 3.8 (but slower). Such intermediate results also help towards reducing the perceived gap between theory and the method.

Self-attention is a special case of attention, and it is not clear why the analysis does not extend to cases like cross-attention.

Theorems 3.3 and 3.4 of the paper are only dependent on the kernel used, as we are taking derivatives with respect to $W_Q$, $W_K$ and $W_V$ and not the input tokens $X$, and thus also hold for cross attention.

We would also like to point out to the reviewer that we empirically demonstrated spectral conditioning works on a variety of different attention forms such as cross covariance attention

If the theory holds more broadly, it is better to present the theory more generally and note the special case of self-attention. This, along with empirical results that are not specific to self-attention (which the authors already seem to have), the paper can avoid specializing its title and discussions to self-attention. If this is not straightforward, can the authors please explain the primary obstacle?

Comment

We sincerely thank the reviewer for their thorough examination of both the main text and the appendix. Your detailed feedback is greatly appreciated.

  1. If there are any mistakes in this reasoning, please let me know...are there any missing assumptions, that are satisfied in practice, that can help fix the theory?

Thank you so much for pointing this out. As you have shown, inequality (53) as written in the paper is indeed incorrect. The correct form of the Weyl inequality that we need to use is

$\sigma_{\min}(A + \lambda I) \geq \sigma_{\min}(\lambda I) - \sigma_{\max}(A)$.

In order for our result to hold, we do need to add a condition, namely that

$\lambda - \sigma_{\max}(A) > 1$.

We further assume $\sigma_{\min}(A) < 0.4$ and $\sigma_{\max}(A) > 2$.

With these assumptions we can give the proof as follows:

$\kappa(A + \lambda I) = \frac{\sigma_{\max}(A + \lambda I)}{\sigma_{\min}(A + \lambda I)}$

$\leq \frac{\sigma_{\max}(A + \lambda I)}{\lambda - \sigma_{\max}(A)}$ (by the above Weyl inequality; note the denominator is positive by our condition)

$\leq \frac{\sigma_{\max}(A) + \lambda}{\lambda - \sigma_{\max}(A)}$ (by Weyl inequality (52) in the appendix)

$< \sigma_{\max}(A) + \frac{\lambda}{\lambda - \sigma_{\max}(A)}$ (by our condition above)

$\leq \sigma_{\max}(A) + 1 + \frac{\sigma_{\max}(A)}{\lambda - \sigma_{\max}(A)}$

$< 2\sigma_{\max}(A) + 1$ (by our condition above)

$= 2\left(\sigma_{\max}(A) + \frac{1}{2}\right)$

$< \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$ (by our assumed conditions on $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$)

$= \kappa(A)$.

Given this we feel it is best to write theorem 3.8 as:

Theorem: Let $A \in \mathbb{R}^{m \times n}$ and assume that $\sigma_{\min}(A) < 0.4$ and $\sigma_{\max}(A) > 2$. Let $I_k \in \mathbb{R}^{m \times n}$ denote the matrix that has 1 on its main $k \times k$ diagonal, where $k = \min(m, n)$, and zero elsewhere. Assume at least one of the following two conditions holds:

(1) $\lambda - \sigma_{\max}(A) > 1$, or

(2) $\sigma_{\min}(A + \lambda I_k) \geq \sigma_{\min}(\lambda I_k) - \sigma_{\min}(A)$ and $\lambda \geq 2$.

Then $\kappa(A + \lambda I_k) < \kappa(A)$.

We believe the theorem stated above applies Weyl's inequality correctly and provides conditions under which the result holds.

We reanalyzed the saved attention weights and found that condition (2) held for most transformers, even though it need not for arbitrary matrices, while in ViT-B (and similarly DaViT-B) condition (1) held for WQW_Q and WVW_V and condition (2) for WKW_K. In every case at least one condition held. We’ll expand the appendix with a detailed discussion and a plot of both conditions over training.
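
As an illustrative numerical sanity check (not taken from the paper), the following NumPy snippet builds a matrix with prescribed singular values satisfying the stated assumptions ($\sigma_{\max}(A) > 2$, $\sigma_{\min}(A) < 0.4$) and a $\lambda$ satisfying condition (1), and confirms that $\kappa(A + \lambda I) < \kappa(A)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    # Orthogonal factor from the QR decomposition of a random Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

# Prescribed singular values: sigma_max = 3 > 2 and sigma_min = 0.2 < 0.4.
singular_values = np.array([3.0, 1.0, 0.5, 0.2])
n = len(singular_values)
A = random_orthogonal(n) @ np.diag(singular_values) @ random_orthogonal(n)

lam = 4.5  # condition (1): lam - sigma_max = 1.5 > 1

print("kappa(A)         =", np.linalg.cond(A))                    # 15.0
print("kappa(A + lam*I) =", np.linalg.cond(A + lam * np.eye(n)))  # strictly smaller
```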

A cleaner narrative can be presented by including results that utilize Theorem 3.5, showing it to perform better than standard attention...

Thank you. In the original draft we only had Theorem 3.5, but its stronger guarantee came at a cost, for example vision transformers took 12–20 hours longer to train. To address this, we derived Theorem 3.8, which substantially cuts training time at the expense of performance. We’ve preserved all experiments with Theorem 3.5 and can include them in the appendix to illustrate this trade-off. Below, we show the training times for the different methods for ViTs.

| Model | Acc. | Training time (hrs:mins) |
| --- | --- | --- |
| ViT-B (original) | 80.7 | 29:29 |
| ViT-B spec. cond. (Thm. 3.5) | 82.0 | 41:38 |
| ViT-B spec. cond. (Thm. 3.8) | 81.7 | 29:33 |
| DeiT-B (original) | 81.6 | 26:16 |
| DeiT-B spec. cond. (Thm. 3.5) | 82.9 | 37:54 |
| DeiT-B spec. cond. (Thm. 3.8) | 82.6 | 26:24 |
| Swin-B (original) | 83.4 | 53:12 |
| Swin-B spec. cond. (Thm. 3.5) | 84.3 | 68:32 |
| Swin-B spec. cond. (Thm. 3.8) | 84.1 | 53:26 |
| XCiT-M (original) | 82.6 | 91:03 |
| XCiT-M spec. cond. (Thm. 3.5) | 83.9 | 109:12 |
| XCiT-M spec. cond. (Thm. 3.8) | 83.5 | 91:18 |
| DaViT-M (original) | 84.3 | 75:09 |
| DaViT-M spec. cond. (Thm. 3.5) | 85.2 | 93:12 |
| DaViT-M spec. cond. (Thm. 3.8) | 84.9 | 75:28 |

If the theory holds more broadly, it is better to present the theory more generally and note the special case of self-attention..

We focused on self-attention because it’s the most common form. We can definitely include a general formulation in the appendix and retitle the paper to “Spectral Conditioning of Attention Improves Transformer Performance.” As you noted we already have empirical results for broader attention mechanisms in the paper.

Comment

Thank you for your response.

With these assumptions we can give the proof as follows:

$\kappa(A + \lambda I) = \frac{\sigma_{\max}(A + \lambda I)}{\sigma_{\min}(A + \lambda I)}$

$\leq \frac{\sigma_{\max}(A + \lambda I)}{\lambda - \sigma_{\max}(A)}$ (by the above Weyl inequality; note the denominator is positive by our condition)

$\leq \frac{\sigma_{\max}(A) + \lambda}{\lambda - \sigma_{\max}(A)}$ (by Weyl inequality (52) in the appendix)

I think you can stop and discard the other steps of the proof from here. All you need is for this quantity to be $\leq \sigma_{\max}(A) / \sigma_{\min}(A)$ to prove $\kappa(A + \lambda I) \leq \kappa(A)$ (or with $<$ if you prefer), which gives you a bound on $\lambda$ in terms of $\sigma_{\max}$ and $\sigma_{\min}$ without requiring constants like $0.4$ and $2$.
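
For reference, solving this inequality explicitly (assuming $\lambda > \sigma_{\max}(A) > \sigma_{\min}(A) > 0$) gives the bound alluded to here:

$$\frac{\sigma_{\max}(A) + \lambda}{\lambda - \sigma_{\max}(A)} \leq \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)} \;\Longleftrightarrow\; \lambda\,\big(\sigma_{\max}(A) - \sigma_{\min}(A)\big) \geq \sigma_{\max}(A)\,\big(\sigma_{\max}(A) + \sigma_{\min}(A)\big) \;\Longleftrightarrow\; \lambda \geq \frac{\sigma_{\max}(A)\,\big(\sigma_{\max}(A) + \sigma_{\min}(A)\big)}{\sigma_{\max}(A) - \sigma_{\min}(A)}.$$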

We’ve preserved all experiments with Theorem 3.5 and can include them in the appendix to illustrate this trade-off. Below, we show the training times for the different methods for ViTs.

Thank you. I think these results help prevent Theorem 3.5 from becoming obsolete and provide empirical support for your arguments regarding benefits of spectral conditioning and the faster practical algorithm. I suggest revising your presentation to make this clear.

Comment

Thank you for your response and the time you've dedicated to this discussion with us.

I think you can stop and discard other steps of the proof from here....

Sure. That will in fact streamline the proof and make it much easier for the reader to follow. We will just keep the condition you point out and discard the rest.

Thank you. I think these results help prevent Theorem 3.5 from becoming obsolete and provide empirical support for your arguments...

Yes we agree. We will revise the presentation and add a discussion about this to the appendix.

Review
Rating: 5

This paper introduces a theoretically grounded and computationally efficient method to enhance Transformer models by improving the conditioning of their self-attention layers. The authors show that the condition number of the Jacobian of the self-attention block—an important factor affecting optimization stability—depends on the spectral properties of the query, key, and value projection matrices (WQ, WK, WV). To mitigate ill-conditioning, they propose spectral conditioning, which adds fixed correction terms to these matrices, significantly reducing their condition numbers and, in turn, improving the conditioning of the self-attention Jacobian.

The method is simple to implement and compatible with a wide range of attention variants and Transformer architectures. It does not introduce additional trainable parameters and incurs minimal computational overhead. The authors validate their approach across diverse tasks including image classification (ViT, Swin, XCiT, etc.), object detection and segmentation (COCO dataset), long-range sequence modeling (LRA benchmark with Nyströmformer), and language modeling (Crammed BERT on GLUE). In all cases, spectral conditioning consistently improves performance and leads to more stable training dynamics.

While the proposed method optimizes an upper bound on the Jacobian condition number rather than the true value itself, the empirical gains support the theoretical motivation. The technique’s ease of integration and general applicability position it as a valuable tool for improving Transformer robustness without modifying their architecture or training procedure.

Strengths and Weaknesses

Strengths

  • This paper is well-written and accessible, offering a clear presentation of both theoretical and empirical contributions. The mathematical analysis is rigorous and well-motivated, with theorems and proofs presented in a concise and structured manner. The accompanying figures, especially those analyzing singular values and condition numbers, are clean and effectively support the theoretical claims. Overall, the exposition is strong and reinforces the soundness of the proposed method.

  • Despite the simplicity of the idea, the core contribution is conceptually elegant and broadly applicable. The method—spectral conditioning of self-attention—targets a foundational property of Transformers (the conditioning of the Jacobian), and can be applied as a drop-in modification to virtually any self-attention mechanism. This generality strengthens its practical impact. In terms of novelty, the work stands out by focusing specifically on self-attention layers, where previous conditioning methods (e.g., reference [21]) have typically targeted fully connected layers. Thus, I would characterize the level of novelty as medium to high.

  • The experimental section covers both vision and language domains, evaluating the method across diverse tasks. In vision, the authors explore not only image classification but also object detection and instance segmentation. The inclusion of results on long-range sequence modeling (via the LRA benchmark) and language modeling (Crammed BERT on GLUE) further supports the generality of the approach.

Weaknesses and Suggestions

  • While the experiments are diverse in terms of task types, the breadth of datasets and model variants could be improved to strengthen the empirical evidence. For image classification, all experiments are limited to ImageNet-1k. Given current standards, it would be beneficial to include evaluations on additional datasets, such as Places365, iNaturalist, or fine-grained benchmarks like CUB-200 or CARS-196. These additions would test the robustness and generalization of spectral conditioning beyond ImageNet and increase the confidence in its effectiveness across visual domains.

  • Similarly, for detection and segmentation tasks, results are only provided for XCiT within the Mask R-CNN framework. To better demonstrate the generality of the method across vision backbones, I would suggest incorporating one or two additional models in this setup.

  • A potential omission is the lack of evaluation on vision-language transformer models, such as CLIP or SigLIP. These models also rely on self-attention, and a simple adaptation of spectral conditioning could be tested on tasks like zero-shot image recognition. Given the popularity and widespread use of such models, this would be a valuable addition and highlight the method’s potential in multimodal settings.

  • Lastly, while the focus is on self-attention, it would be helpful for the authors to briefly discuss whether the proposed conditioning method could extend to cross-attention mechanisms as well. Even a short theoretical or empirical remark on this could broaden the relevance of the work and preempt reader questions.

Questions

Questions and Suggestions for the Authors

  1. Additional Classification Datasets.
  2. More Detection/Segmentation Models.
  3. Vision-Language Models.
  4. Discussion on Applicability to Cross-Attention.

My current rating is borderline accept, and I lean towards accepting the paper due to its solid theoretical contributions and promising empirical results. Addressing the questions above—particularly expanding the experimental validation—could strengthen my recommendation.

Limitations

Yes.

Final Justification

The authors addressed my comments. They provided extra experiments with consistent results. In general, they had a good rebuttal. Thus, I raise my score by 1.

Formatting Issues

No concerns.

Author Response

We sincerely thank the reviewer for taking the time to review our paper and for providing thoughtful and constructive feedback. Below, we address each of the points and questions raised in the review.

1. While the experiments are diverse in terms of task types, the breadth of datasets and model variants could be improved to strengthen the empirical evidence. For image classification, all experiments are limited to ImageNet-1k. Given current standards, it would be beneficial to include evaluations on additional datasets....

We have run spectral conditioning on a ViT-B architecture and on TransFG, a fine-grained transformer architecture (as in He et al.), on the CUB-200, CARS-196, and iNat2017 datasets. The results below clearly show that spectral conditioning helps in fine-grained image recognition.

(i) Results on CUB-200

| Model | Test Acc. on CUB-200 |
| --- | --- |
| ViT-B | 90.4 |
| Spec. Cond. ViT-B | 91.5 |
| TransFG | 91.7 |
| Spec. Cond. TransFG | 92.9 |

(ii) Results on CARS-196

| Model | Test Acc. on CARS-196 |
| --- | --- |
| ViT-B | 93.6 |
| Spec. Cond. ViT-B | 94.8 |
| TransFG | 94.8 |
| Spec. Cond. TransFG | 95.8 |

(iii) Results on iNat2017

| Model | Test Acc. on iNat2017 |
| --- | --- |
| ViT-B | 68.9 |
| Spec. Cond. ViT-B | 71.1 |
| TransFG | 71.8 |
| Spec. Cond. TransFG | 72.7 |

  2. I would suggest incorporating more Detection/Segmentation Models.

We ran a Swin-base model for both object detection and instance segmentation. The table below shows that spectral conditioning yields better performance. We are happy to include this in the appendix as an extra experiment.

| Model | APᵇ | APᵇ₅₀ | APᵇ₇₅ | APᵐ | APᵐ₅₀ | APᵐ₇₅ |
| --- | --- | --- | --- | --- | --- | --- |
| Swin-base | 45.9 | 67.1 | 50.1 | 41.1 | 63.7 | 43.8 |
| Spec. Cond. Swin-base | 46.8 | 68.1 | 50.7 | 41.8 | 64.2 | 44.6 |

We also ran a Swin-small architecture on both object detection and instance segmentation. The table below shows the results, once again showing that spectral conditioning yields better performance.

| Model | APᵇ | APᵇ₅₀ | APᵇ₇₅ | APᵐ | APᵐ₅₀ | APᵐ₇₅ |
| --- | --- | --- | --- | --- | --- | --- |
| Swin-small | 44.8 | 66.3 | 49.0 | 40.3 | 63.0 | 42.9 |
| Spec. Cond. Swin-small | 45.7 | 66.9 | 49.8 | 40.7 | 63.4 | 43.5 |

  3. A potential omission is the lack of evaluation on vision-language transformer models, such as CLIP or SigLIP....

We thank the reviewer for this suggestion. While vision-language transformer models are very interesting, we believe our experimental analysis is already very thorough. We performed experiments on image classification, object detection and instance segmentation, long-range sequence modelling (LRA benchmark), and language modelling. In addition, we have above carried out three fine-grained image classification tasks, as well as evaluating another model on both object detection and instance segmentation. In each case we have shown that spectrally conditioned attention yields better results, clearly showing that our methodology is useful to the community.

  4. Lastly, while the focus is on self-attention, it would be helpful for the authors to briefly discuss whether the proposed conditioning method could extend to cross-attention mechanisms as well...

We thank the reviewer for their question. Yes, our conditioning method does extend to cross attention. This is because cross attention uses the same kernel as regular self-attention. The difference between the two is that cross attention allows an input stream of tokens to interact with a different stream of tokens, while self-attention only allows tokens to interact with each other within each input stream. However, both forms of attention use the same kernel defined by Eq. (3) in the paper. Theorems 3.3 and 3.4 of the paper are only dependent on the kernel used, as we are taking derivatives with respect to the $W_Q$, $W_K$ and $W_V$ parameters and not the input tokens $X$, and thus also hold for cross attention. We will include a paragraph explaining this in the appendix.

We would also like to point out to the reviewer that we empirically demonstrated that spectral conditioning works on a variety of different attention forms, such as cross-covariance attention used in XCiT, windowed attention in the Swin Transformer, and dual attention in DaViT. These results were all shown in Table 1 of the paper. In each case, applying spectral conditioning to the associated attention form led to improved performance.

Comment

I would like to thank the authors for the extra experiments. Given their consistent results and the fact that, in this rebuttal, the authors managed to address the comments of the other reviewers as well, I will raise my score.

Final Decision

This paper studies the conditioning of an attention layer by studying its Jacobian. The authors identify that the Q, K, V matrices dominate the condition number, and propose a new parameterization of the attention layer that makes it better conditioned. All reviewers found this work interesting and its findings relevant to the ML community. The AC agrees with this consensus and recommends acceptance.

One note: this work's finding is highly similar to the SigmaReparam work [1], which applies spectral normalization to the Q, K, V matrices instead. The authors are encouraged to cite and discuss this relation in the camera-ready version.

References

[1] Stabilizing Transformer Training by Preventing Attention Entropy Collapse, Zhai et al, ICML 2023