Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
We show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space and propose Attention with Robust Principal Components, a novel robust attention that is resilient to data contamination.
Abstract
Reviews and Discussion
This paper introduces a new method, Attention with Robust Principal Components (RPC-Attention), designed to enhance the self-attention mechanisms in transformers widely used in sequence modeling for both natural language processing and computer vision. The paper derives self-attention from kernel principal component analysis, showing that attention projects query vectors onto the principal component axes of the key matrix in a feature space, and uses this view to prioritize essential components and improve task performance. The effectiveness of RPC-Attention is empirically validated through improved results in ImageNet object classification, language modeling on WikiText-103, and image segmentation on the ADE20K dataset.
Strengths
- The paper reveals the intrinsic connection between the self-attention mechanism and PCA/kernel PCA, providing a thorough mathematical explanation.
- The derivation part is convincing and interesting.
Weaknesses
- My main concern is about the practicality of the novel RPC-Attention mechanism. In your experiments, RPC-SymViT only scales to a small size, and it does not show a significant performance advantage compared to ViT-Small.
- Additionally, RPC-SymViT does not have an advantage in computational speed; for example, in Table 11, the inference time for RPC-SymViT (2iter/layer) is 70% higher than that of ViT. As it stands, the authors have not convinced me to abandon the traditional ViT in favor of RPC-SymViT.
- I am also concerned about the scalability of this model, which can outperform ViT by ~1% on ImageNet at the tiny size, but the performance becomes very close at the small size. What happens if the model is scaled up even more? Do the assumptions in the paper still hold when the feature dimension is higher?
Questions
- Did the authors try larger models for RPC-Attention?
Limitations
The authors have discussed their limitations; scalability might be a key limitation of the models proposed in this paper.
Q1. My main concern is about the practicality of the novel RPC-Attention mechanism. In your experiments, RPC-SymViT can only scale to a small size, and it does not show a significant performance advantage compared to ViT-Small.
I am also concerned about the scalability of this model, which can outperform ViT by ~1% on ImageNet at the tiny size, but the performance becomes very close at the small size. What happens if the model is scaled up even more? Do the assumptions in the paper still hold when the feature dimension is higher?
Did the authors try larger models for RPC-Attention?
Answer: We have conducted an additional experiment on RPC-SymViT-Base for the ImageNet object classification task to address the reviewer's concern about the scalability and practicality of our RPC-Attention and reported our results in Tables 1 and 2 in the attached PDF. Our RPC-SymViT-Base outperforms the baseline in terms of clean accuracy and is also more robust to data perturbation and adversarial attacks than the baseline. Specifically, we improve on ImageNet-O by over 2 AUPR points and on PGD and FGSM by more than 1% in top-1 accuracy. These results further demonstrate the effectiveness of RPC-Attention even in larger models.
Our assumptions in Remark 2 do not hold when the feature dimension is higher than the sequence length, as the number of principal components cannot exceed the rank of the covariance matrix. In practical tasks, however, the feature dimension is very often smaller than the sequence length since transformers are used to capture long-range dependencies in the input sequence.
Q2. Additionally, RPC-SymViT does not have an advantage in computational speed; for example, in Table 11, the inference time for RPC-SymViT (2iter/layer) is 70% higher than that of ViT. As it stands, the authors have not convinced me to abandon the traditional ViT in favor of RPC-SymViT.
Answer: In Table 11, in our first three settings, RPC-SymViT (4iter/layer1, 5iter/layer1, 6iter/layer1), we only apply RPC-Attention to the first layer of the transformer model. This adds minimal cost to the inference speed. In particular, Table 11 shows the inference runtime cost (in seconds/sample) of our RPC-Attention. As can be seen in that table, during inference ("Test" column in Table 11), our RPC-Attention models with 4iter/layer1 (4 Principal Attention Pursuit (PAP) iterations at layer 1) and 5iter/layer1 (5 PAP iterations at layer 1) have the same inference runtime as the baseline SymViT, 0.01 seconds/sample. Our RPC-Attention with 6iter/layer1 (6 PAP iterations at layer 1) only requires slightly more inference time than the baseline, 0.011 seconds/sample vs. 0.01 seconds/sample. Due to the effectiveness of our RPC-Attention, applying it at the first layer of a transformer model is already enough to improve the model's robustness significantly (see Tables 1, 2, and 3 in our main text and Tables 4, 5, 6, 7, 8, 9, 10, and 12 in our Appendix). Note that the results reported in Table 11 also show that our RPC-Attention is as memory-, parameter-, FLOP-, and training-efficient as the baseline SymViT.
While applying RPC at all layers, as in RPC-SymViT (2iter/all-layer), leverages more of RPC's advantages, it is not really necessary in most settings due to its small improvement over applying RPC at the first layer and its higher computational overhead, as the reviewer mentioned.
We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.
We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.
We would be happy to do any follow-up discussion or address any additional comments.
Dear Reviewer rAS8,
We sincerely appreciate the time you have taken to provide feedback on our work, which has helped us greatly improve its clarity, among other attributes. This is a gentle reminder that the Author-Reviewer Discussion period ends in less than two days from this comment, i.e., 11:59 pm AoE on August 13th. We are happy to answer any further questions you may have before then, but we will be unable to respond after that time.
If you agree that our responses to your reviews have addressed the concerns you listed, we kindly ask that you consider whether raising your score would more accurately reflect your updated evaluation of our paper. Thank you again for your time and thoughtful comments!
Sincerely,
Authors
Thanks to the authors for their time and efforts in the rebuttal. The comparison of inference speed is now clear to me, but my concern about scalability remains.
As the authors pointed out, "Our assumptions in Remark 2 do not hold when the feature dimension is higher than the sequence length as the number of principal components cannot exceed the rank of the covariance matrix." Does this mean that your base-sized model can only process sequences longer than 768? If so, this would be a strong limitation of your model.
The claim " in practical tasks, the feature dimension is very often smaller than the sequence length" actually does not hold in many cases. For example, the most commonly compared baseline of ViT is a ViT-Base with 768-d features but only 197 tokens in the sequence (for 224 input and 16 patch size).
Thanks for your reply. We respectfully disagree with the reviewer that scalability is a concern about our work. Please allow us to explain why scalability is not an issue below.
In practice, and also in our experiments, transformers use multihead attention with $H$ heads. The feature dimension mentioned in our assumptions in Remark 2 is the feature dimension at each head, which is $D/H$, where $D = 768$ and $H = 12$ in ViT-Base. Thus, in this case, the feature dimension ($D/H = 64$) is smaller than the sequence length ($N = 197$). For ViT-Base, our RPC-Attention model works with sequences longer than 64. Furthermore, we can always reduce this number (64) by introducing more heads, but tasks with sequence lengths less than 64 are not very practical. The same holds for ViT-Tiny, which has $D = 192$ and $H = 3$. As a result, we believe our statement that "in practical tasks, the feature dimension is very often smaller than the sequence length" is correct, and our assumptions in Remark 2 are still valid and should not raise any concern about the scalability of our RPC-Attention. Below, we provide a table that contains the total feature dimension, the number of heads, and the feature dimension per head of each ViT model.
| Model | Total Feature Dimension ($D$) | Number of Heads ($H$) | Feature Dimension per Head ($D/H$) |
|---|---|---|---|
| ViT-Tiny-Patch16 | 192 | 3 | 64 |
| ViT-Small-Patch16 | 384 | 6 | 64 |
| ViT-Base-Patch16 | 768 | 12 | 64 |
| ViT-Large-Patch16 | 1024 | 16 | 64 |
| ViT-Huge-Patch16 | 1280 | 16 | 80 |
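As a quick sanity check (an illustrative snippet; the token count of 197 assumes the standard 224×224 input with 16×16 patches plus one class token), one can verify that the per-head feature dimension stays below the sequence length for all of these configurations:

```python
# Illustrative check: per-head feature dimension (D/H) vs. token count (N)
# for standard ViT configurations at 224x224 resolution with 16x16 patches.
configs = {
    "ViT-Tiny":  (192, 3),
    "ViT-Small": (384, 6),
    "ViT-Base":  (768, 12),
    "ViT-Large": (1024, 16),
    "ViT-Huge":  (1280, 16),
}
num_tokens = (224 // 16) ** 2 + 1  # 196 patch tokens + 1 class token = 197

for name, (dim, heads) in configs.items():
    head_dim = dim // heads
    print(f"{name}: D/H = {head_dim}, N = {num_tokens}, D/H < N: {head_dim < num_tokens}")
```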
Thanks for the authors' detailed clarification. My concern is now addressed, and I raise my score to 5.
Thanks for your response, and we appreciate your endorsement.
This paper studies the self-attention mechanism. The authors recover self-attention from kernel principal component analysis (kernel PCA). The authors empirically verify that the projection error is minimized during training, suggesting that the transformer model is in fact learning to perform PCA. They also empirically confirm that the value matrix captures the eigenvectors of the Gram matrix. Lastly, using this analysis framework, the authors propose a robust attention based on principal component pursuit that is robust to input corruption and attacks.
Strengths
- Viewing self-attention from kernel PCA perspective is very interesting.
- The kernel PCA analysis of self-attention is supported with empirical evidence.
- The analysis not only provides the theoretical understanding of self-attention, but also helps guiding the design of a new class of robust self-attention that is robust to input corruption and attacks.
Weaknesses
- The RPC-Attention uses an iterative algorithm PAP to solve the convex formulation of principal component pursuit. How is back-propagation via PAP algorithm calculated? Does it lead to any instability?
- Appendix E.8 shows that the runtime and memory of RPC-Attention are comparable to the baseline despite using an iterative algorithm. I am curious about why it is not much slower. Is there any analysis?
- I noticed that the baseline SymViT that uses symmetric attention has significantly worse accuracy compared to vanilla ViT on clean ImageNet (for the tiny model, SymViT accuracy is 70.44, while vanilla ViT is around 75). I am wondering what the reason would be. If this is caused by the use of symmetric attention, is there a way to adopt RPC-Attention for vanilla ViT?
- Given that RPC-Attention is more robust and as efficient as the baseline, why do the majority of experiments only apply it to the first few layers or the first layer, rather than applying RPC-Attention to all layers (for example, 6iter/all-layer)?
Questions
See the Weaknesses section.
Limitations
The authors discussed a limitation about the potential efficiency issue of the proposed algorithm.
Q1. The RPC-Attention uses an iterative algorithm PAP to solve the convex formulation of principal component pursuit. How is back-propagation via PAP algorithm calculated? Does it lead to any instability?
Answer: As shown in Algorithm 1 in the main text, the Principal Attention Pursuit (PAP) in our RPC-Attention is quite simple. Most operators in PAP are linear except for the softmax operator and the shrinkage (soft-thresholding) operator. Like the softmax operator in self-attention, the softmax operator in RPC-Attention is easy to differentiate through. Similarly, like the ReLU operator in feedforward layers, the shrinkage operator is also easy to differentiate through. We do not encounter any instabilities with RPC-Attention, and the standard backpropagation algorithm is used to handle gradient propagation in RPC-Attention. As can be seen in Figure 1 (Right) in the attached PDF, the training and validation loss curves of the transformer with RPC-Attention are quite stable.
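To make this point concrete, below is a minimal PyTorch sketch of a generic soft-thresholding shrinkage operator (a standalone illustration, not our Algorithm 1): it is piecewise linear, so standard autograd differentiates through it exactly as it does through ReLU.

```python
import torch

def shrink(x: torch.Tensor, lam: float) -> torch.Tensor:
    """Soft-thresholding (shrinkage): sign(x) * max(|x| - lam, 0).
    Piecewise linear, hence as easy to backpropagate through as ReLU."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

x = torch.randn(4, 4, requires_grad=True)
y = shrink(x, lam=0.1).sum()
y.backward()      # standard autograd; no implicit differentiation is needed
print(x.grad)     # entries are 1 where |x| > lam and 0 elsewhere
```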
Q2. Appendix E.8 shows that the runtime and memory of RPC-Attention are comparable to the baseline despite using an iterative algorithm. I am curious about why it is not much slower. Is there any analysis?
Answer: Since we only apply RPC-Attention to the first layer of the transformer model, it adds minimal cost to the runtime and memory. In particular, in Table 11 in our Appendix, we show the inference runtime cost (in seconds/sample) of our RPC-Attention. As can be seen in that table, during inference ("Test" column in Table 11), our RPC-Attention models with 4iter/layer1 (4 Principal Attention Pursuit (PAP) iterations at layer 1) and 5iter/layer1 (5 PAP iterations at layer 1) have the same inference runtime and use the same memory during inference as the baseline SymViT, 0.01 seconds/sample and 1181MB, respectively. Our RPC-Attention with 6iter/layer1 (6 PAP iterations at layer 1) only requires slightly more inference time than the baseline, 0.011 seconds/sample vs. 0.01 seconds/sample, and uses the same memory during inference as the baseline, 1181MB. Due to the effectiveness of our RPC-Attention, applying it at the first layer of a transformer model is already enough to improve the model's robustness significantly (see Tables 1, 2, and 3 in our main text and Tables 4, 5, 6, 7, 8, 9, and 10 in our Appendix). Note that the results reported in Table 11 also show that our RPC-Attention is as parameter-, FLOP-, and training-efficient as the baseline SymViT.
Q3. I noticed that the baseline SymViT that uses symmetric attention has significantly worse accuracy compared to vanilla ViT on clean ImageNet (for the tiny model, SymViT accuracy is 70.44, while vanilla ViT is around 75). I am wondering what would be the reason. If this is caused by the use of symmetric attention, is there a way to adopt RPC-Attention for vanilla ViT?
Answer: Thanks for your comments. For the ImageNet object classification task, as the reviewer pointed out and as reported in our paper (see Tables 1 and 12), SymViT has worse performance than the asymmetric vanilla ViT. However, for other tasks, such as neural machine translation (NMT) [Vaswani, 2017] and sequence prediction (SP) [Dai et al., 2019], symmetric attention has comparable or even better performance than asymmetric attention, as reported in [Tsai, 2019]. This evidence shows the promise of applying our RPC-Attention to practical models for enhancing their accuracy and robustness.
In Appendix E.9, we discuss the extension of RPC-Attention to asymmetric attention. We report the results of the RPC-Asymmetric ViT (RPC-AsymViT) vs. the baseline asymmetric ViT in Table 12. Our RPC-AsymViT improves over the baseline on most of the corrupted datasets. However, as explained in Appendix E.9, since the PAP in Algorithm 1 in our manuscript is not designed for multiple data matrices, it is not as effective in the asymmetric case as in the symmetric case.
Additionally, in Table 4 of the attached PDF, we provide positive results of finetuning an asymmetric Sparse Mixture of Experts (SMoE) language model using RPC on the downstream natural language understanding tasks, Stanford Sentiment Treebank v2 (SST2) and IMDB Sentiment Analysis (IMDB). The significant improvements over the baseline SMoE validate the effectiveness of both of our methods, symmetric RPC-Attention and asymmetric RPC-Attention.
Q4. Given that RPC-Attention is more robust and as efficient as the baseline, why do the majority of experiments only apply it to the first few layers or the first layer, rather than applying RPC-Attention to all layers (for example, 6iter/all-layer)?
Answer: As mentioned in our answer to your Q2 above, due to the effectiveness of our RPC-Attention, applying it at the first layer of a transformer model is already good enough to improve the model's robustness significantly (see Tables 1, 2, and 3 in our main text and Tables 4, 5, 6, 7, 8, 9, and 10 in our Appendix). This strategy also adds minimal cost to the runtime and memory.
While applying RPC at all layers, as in RPC-SymViT (2iter/all-layer) in Tables 1, 2, 4, and 6, leverages more of RPC's advantages, it is not really necessary in most settings due to its small improvement over applying RPC at the first layer and its higher computational overhead (see Table 11).
References
Vaswani, A., et al. Attention is all you need. 2017.
Dai, Z., et al. Transformer-XL: Attentive language models beyond a fixed-length context. 2019.
Tsai, Y. H. H., et al. Transformer dissection: a unified understanding of transformer's attention via the lens of kernel. 2019.
We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.
Regarding applying RPC for more iterations at all layers, as we mentioned, this results in higher computational overhead and may only lead to small improvements. We conducted an additional experiment with 4 iterations in all layers (RPC-SymViT (4iter/all-layer)) and provide the results in Tables 1 and 2 below. We observe that RPC-SymViT (4iter/all-layer) indeed outperforms RPC-SymViT (2iter/all-layer) on almost all robustness benchmarks but still does not perform as well as RPC-SymViT (5iter/layer1) and RPC-SymViT (6iter/layer1). Interestingly, while RPC-SymViT (4iter/all-layer) performs significantly better than the baseline and RPC-SymViT (4/5/6iter/layer1) on PGD and FGSM attacked ImageNet-1K, RPC-SymViT (2iter/all-layer) still has the best accuracy.
Table 1: Top-1 and top-5 accuracy (%), mean corruption error (mCE), and area under the precision-recall curve (AUPR) of RPC-SymViT and SymViT on clean ImageNet-1K data and popular standard robustness benchmarks for image classification. RPC-SymViT ($n$iter/layer1) applies $n$ PAP iterations only at the first layer. RPC-SymViT ($n$iter/all-layer) applies $n$ PAP iterations at all layers.
| Model | IN-1K Top-1 ↑ | IN-1K Top-5 ↑ | IN-R Top-1 ↑ | IN-A Top-1 ↑ | IN-C Top-1 ↑ | IN-C mCE ↓ | IN-O AUPR ↑ |
|---|---|---|---|---|---|---|---|
| SymViT (baseline) | 70.44 | 90.17 | 28.98 | 6.51 | 41.45 | 74.75 | 17.43 |
| RPC-SymViT (4iter/layer1) | 70.94 | 90.47 | 29.99 | 6.96 | 42.35 | 73.58 | 19.32 |
| RPC-SymViT (5iter/layer1) | 71.31 | 90.59 | 30.28 | 7.27 | 42.43 | 73.43 | 20.35 |
| RPC-SymViT (6iter/layer1) | 71.49 | 90.68 | 30.03 | 7.33 | 42.76 | 73.03 | 20.29 |
| RPC-SymViT (2iter/all-layer) | 70.59 | 90.15 | 29.23 | 7.55 | 41.64 | 74.52 | 19.18 |
| RPC-SymViT (4iter/all-layer) | 71.17 | 90.61 | 30.09 | 7.43 | 42.55 | 73.35 | 19.43 |
Table 2: Top-1 and top-5 accuracy (%) on PGD and FGSM attacked ImageNet-1K validation data with the highest perturbation. RPC-SymViT ($n$iter/layer1) applies $n$ PAP iterations only at the first layer. RPC-SymViT ($n$iter/all-layer) applies $n$ PAP iterations at all layers.
| Model | PGD Top-1 ↑ | PGD Top-5 ↑ | FGSM Top-1 ↑ | FGSM Top-5 ↑ |
|---|---|---|---|---|
| SymViT-Tiny (baseline) | 4.98 | 10.41 | 23.38 | 53.82 |
| RPC-SymViT (4iter/layer1) | 5.15 | 11.20 | 26.62 | 56.87 |
| RPC-SymViT (5iter/layer1) | 5.11 | 11.13 | 26.75 | 57.19 |
| RPC-SymViT (6iter/layer1) | 5.20 | 11.34 | 27.22 | 57.55 |
| RPC-SymViT (2iter/all-layer) | 6.12 | 13.24 | 29.20 | 59.63 |
| RPC-SymViT (4iter/all-layer) | 5.46 | 12.17 | 27.99 | 59.01 |
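For reference, the FGSM attack used in this kind of robustness evaluation is the standard single-step signed-gradient method; the sketch below is illustrative only (`model`, `images`, `labels`, and `eps` are placeholder names, and our actual evaluation settings may differ). PGD simply iterates this step several times with a projection back onto the allowed perturbation ball.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, eps):
    """Single-step FGSM: perturb inputs in the direction of the loss gradient sign."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adv = images + eps * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid range
```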
We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.
We would be happy to do any follow-up discussion or address any additional comments.
Dear Reviewer 85Cd,
We sincerely appreciate the time you have taken to provide feedback on our work, which has helped us greatly improve its clarity, among other attributes. This is a gentle reminder that the Author-Reviewer Discussion period ends in less than two days from this comment, i.e., 11:59 pm AoE on August 13th. We are happy to answer any further questions you may have before then, but we will be unable to respond after that time.
If you agree that our responses to your reviews have addressed the concerns you listed, we kindly ask that you consider whether raising your confidence score (3 currently) would more accurately reflect your updated evaluation of our paper. Thank you again for your time and thoughtful comments!
Sincerely,
Authors
The paper proposes a new perspective for understanding the underlying operation learned by scaled dot-product self-attention from a kernel PCA perspective. The paper recovers the self-attention formulation starting from PCA over feature projections of data points. In particular, the authors show that different parameterizations for the value matrix, while mathematically equivalent, can lead to different optimization problems and therefore different outcomes. Based on the new perspective provided in the paper, the authors propose RPC-Attention, a method to address one of the known vulnerabilities of PCA with corrupted data. The authors provide several experiments to demonstrate the effectiveness of RPC-Attention.
Strengths
- The paper provides a very interesting and important perspective for understanding self-attention from the lens of Kernel PCA. Given the ubiquity and effectiveness of the transformer architecture, such insights are extremely valuable.
- The paper is very well written, and the derivations are clear and well-explained.
- In addition to clearly explaining the new perspective for understanding self-attention, the authors went ahead and addressed one of the weaknesses self-attention might suffer from given this new understanding. This is very valuable and gives a good example of how new perspectives can pave the way to key improvements in the underlying mechanisms.
- The experimental results reflect the effectiveness of the proposed RPC-Attention method in improving the robustness of the models against adversarial attacks and corrupted data.
Weaknesses
- [minor] The experiments use relatively small-scale models and short training runs (e.g., 50 epochs on ImageNet). While not a core issue given the objective of the paper, it would be nice to see results in more standardized settings (e.g., ViT-Base with 300 epochs).
Questions
- Given the new perspective provided in the paper, do the authors have any insights into how to reduce the complexity of self-attention while retaining the same quality?
Limitations
Yes
Q1. [minor] The experiments use relatively small-scale models and short training runs (e.g., 50 epochs on ImageNet). While not a core issue given the objective of the paper, it would be nice to see results in more standardized settings (e.g., ViT-Base with 300 epochs).
Answer: Thanks for your suggestion. All of our models are trained for 300 epochs; we provide the 50-epoch plot in the main text to emphasize the faster initial convergence during training. A plot of the full 300-epoch training curve is provided in Appendix E.1, Fig. 4. Following the reviewer's suggestion, we have conducted an additional experiment on ViT-Base for the ImageNet object classification task with 300 epochs and reported our results in Tables 1 and 2 in the attached PDF. Our RPC-ViT-Base outperforms the baseline in terms of clean accuracy and is also more robust to data perturbation and adversarial attacks.
Q2. Given the new perspective provided in the paper, do the authors have any insights into how to reduce the complexity of self-attention while retaining the same quality?
Answer: Our kernel principal component analysis (kernel PCA) framework allows the development of efficient attentions. In the following, we will show how to derive two popular efficient attentions: the linear attention in [Katharopoulos, 2020] and the sparse attention in [Child, 2019]. Combining linear and sparse attentions helps reduce the complexity of self-attention while maintaining its quality [Nguyen, 2021; Zhu, 2021; Chen, 2021].
Deriving the linear attention: The self-attention output in Eqn. (8) in our main text can be re-written as follows:
Here, we again choose . Following the derivation in [Katharopoulos, 2020], we use the associative property of matrix multiplication to obtain
We can then choose to achieve the linear attention in [Katharopoulos, 2020], which is one of the most popular efficient attentions.
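As an illustration of this associativity trick, below is a minimal sketch of the linear attention of [Katharopoulos, 2020] with their $\phi(x) = \mathrm{elu}(x) + 1$ feature map (a self-contained example, not the exact formulation of Eqn. (8)); forming $\phi(K)^\top V$ first reduces the cost from quadratic to linear in the sequence length.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Linear attention of Katharopoulos et al. (2020): the softmax kernel is
    replaced by phi(x) = elu(x) + 1, and phi(K)^T V is computed first,
    giving O(N * d^2) cost instead of O(N^2 * d)."""
    phi_q = F.elu(q) + 1              # (N, d)
    phi_k = F.elu(k) + 1              # (N, d)
    kv = phi_k.T @ v                  # (d, d_v), shared by all queries
    z = phi_q @ phi_k.sum(dim=0)      # (N,) normalization terms
    return (phi_q @ kv) / (z.unsqueeze(-1) + eps)

N, d = 197, 64
q, k, v = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)
out = linear_attention(q, k, v)       # shape (N, d)
```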
Deriving the sparse attention: For each query , we consider a subset of the dataset , where . Following Eqn. (8) and (9) in our main text, the projection of the query onto principal components of is given by:
where . Note that the subsets are different for different , . Again, as in Section 2.1 in our manuscript, selecting and , we obtain a formula of the sparse attention in [Child, 2019] where the binary matrix becomes the sparse masking matrix.
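A minimal sketch of the corresponding masked (sparse) attention is given below; the local-window mask here is just an arbitrary example pattern for illustration, not the specific sparsity pattern of [Child, 2019].

```python
import torch

def sparse_attention(q, k, v, mask, scale=None):
    """Masked dot-product attention: query i only attends to keys j with
    mask[i, j] = True; all other keys are excluded from the softmax."""
    scale = scale or q.shape[-1] ** -0.5
    scores = (q @ k.T) * scale                        # (N, N)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

N, d, window = 197, 64, 8
q, k, v = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)
idx = torch.arange(N)
mask = (idx[:, None] - idx[None, :]).abs() < window   # simple local-window pattern
out = sparse_attention(q, k, v, mask)
```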
References
Katharopoulos, A, et al. Transformers are rnns: Fast autoregressive transformers with linear attention. 2020.
Child, R., et al. Generating long sequences with sparse transformers. 2019.
Nguyen, T., et al. Fmmformer: Efficient and flexible transformer via decomposed near-field and far-field attention. 2021.
Zhu, C., et al. Long-short transformer: Efficient transformers for language and vision. 2021.
Chen, B., et al. Scatterbrain: Unifying sparse and low-rank attention. 2021.
We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.
We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.
We would be happy to do any follow-up discussion or address any additional comments.
Dear Reviewer hhSg,
We sincerely appreciate the time you have taken to provide feedback on our work, which has helped us greatly improve its clarity, among other attributes. This is a gentle reminder that the Author-Reviewer Discussion period ends in less than two days from this comment, i.e., 11:59 pm AoE on August 13th. We are happy to answer any further questions you may have before then, but we will be unable to respond after that time.
If you agree that our responses to your reviews have addressed the concerns you listed, we kindly ask that you consider whether raising your confidence score (2 currently) would more accurately reflect your updated evaluation of our paper. Thank you again for your time and thoughtful comments!
Sincerely,
Authors
I thank the authors for the rebuttal. The rebuttal has addressed my main questions/concerns. The additional insights about possible ways to reduce the attention complexity are much appreciated. Therefore, I would like to keep my score of Accept.
Thanks for your response, and we appreciate your endorsement.
The paper derives self-attention from kernel principal component analysis (kernel PCA), showing that the attention outputs are projections of query vectors onto the principal component axes of the key matrix in a feature space. Using this kernel PCA framework, the authors propose Attention with Robust Principal Components (RPC-Attention), a robust attention that is resilient to data contamination, and demonstrate its effectiveness on ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation tasks.
Strengths
- This paper offers a novel perspective on attention mechanisms, proposing that attention outputs are projections of query vectors onto the principal component axes of the key matrix in a feature space.
- The authors provide a detailed mathematical proof to support this claim.
- The paper's experimental evaluation is comprehensive, demonstrating the applicability of the proposed RPC-attention mechanism across both vision and language tasks.
- The writing is clear and well-organized, making the paper a pleasure to read.
Weaknesses
- The results in Table 1 show that the best performance on different tasks is achieved with varying settings. For example, RPC achieves the best performance with 6 iterations/layer 1 on IN-1K, while RPC achieves the best performance with 2 iterations/all layers on IN-A. Moreover, for IN-1K top-5, 2 iterations/all layers even performs worse than the baseline attention. This suggests that more RPC attention is not always better for different tasks, and sometimes even performs worse than the baseline. Can the authors provide an explanation for this phenomenon?
- As shown in Appendix Table 11, the computational cost of RPC attention increases substantially with the number of iterations/layers. This raises concerns about its practical applicability when extended to multi-layer settings. Specifically, it is unclear whether the performance gains will be sufficient to justify the increased computational cost. This directly affects the method's usability in applications.
Questions
For language tasks, perplexity (ppl) may not accurately reflect the final performance. Have the authors conducted experiments on other downstream natural language understanding tasks to evaluate the effectiveness? Will RPC-Attention be applicable to pre-trained language models?
Limitations
Please refer to the Questions section.
Q1. The results in Table 1 show that the best performance on different tasks is achieved with varying settings. For example, RPC achieves the best performance with 6 iterations/layer 1 on IN-1K, while RPC achieves the best performance with 2 iterations/all layers on IN-A. Moreover, for IN-1K top-5, 2 iterations/all layers even performs worse than the baseline attention. This suggests that more RPC attention is not always better for different tasks, and sometimes even performs worse than the baseline. Can the authors provide an explanation for this phenomenon?
Answer: Our RPC-Attention needs more iterations to converge. While applying RPC to all layers leverages more of its advantages, 2 iterations per layer are too few for convergence. For 4, 5, and 6 iterations at layer 1, more iterations tend to give better results since the algorithm converges better. However, 6 iterations sometimes perform worse than 5 iterations, which might be due to overshooting.
Q2. As shown in Appendix Table 11, the computational cost of RPC attention increases substantially with the number of iterations/layers. This raises concerns about its practical applicability when extended to multi-layer settings. Specifically, it is unclear whether the performance gains will be sufficient to justify the increased computational cost. This directly affects the method's usability in applications.
Answer: Since we only apply RPC-Attention to the first layer of the transformer model, it adds minimal cost to the inference speed. In particular, in Table 11 in our Appendix, we show the inference runtime cost (in seconds/sample) of our RPC-Attention. As can be seen in that table, during inference ("Test" column in Table 11), our RPC-Attention models with 4iter/layer1 (4 Principal Attention Pursuit (PAP) iterations at layer 1) and 5iter/layer1 (5 PAP iterations at layer 1) have the same inference runtime as the baseline SymViT, 0.01 seconds/sample. Our RPC-Attention with 6iter/layer1 (6 PAP iterations at layer 1) only requires slightly more inference time than the baseline, 0.011 seconds/sample vs. 0.01 seconds/sample, which we believe is a modest overhead. Due to the effectiveness of our RPC-Attention, applying it at the first layer of a transformer model is already enough to improve the model's robustness significantly (see Tables 1, 2, and 3 in our main text and Tables 4, 5, 6, 7, 8, 9, 10, and 12 in our Appendix). Note that the results reported in Table 11 also show that our RPC-Attention is as memory-, parameter-, FLOP-, and training-efficient as the baseline SymViT.
Q3. For language tasks, ppl may not accurately reflect the final performance. Have the authors conducted experiments on other downstream natural language understanding tasks to evaluate the effectiveness? Will RPA be applicable to pre-trained language models?
Answer: Thanks for your suggestion. We have conducted additional experiments on downstream natural language understanding tasks, and their results can be found in Table 4 of the attached PDF. RPC models outperform the baseline models significantly on these tasks. In particular, we use two baseline Sparse Mixture of Experts (SMoE) models, one with symmetric attention (SymSMoE) and one with asymmetric attention (SMoE), pretrained on WikiText-103 without RPC. We finetune these models on the Stanford Sentiment Treebank v2 (SST2) task and the IMDB Sentiment Analysis (IMDB) task, without RPC as our baseline and with RPC for comparison. Due to the tight schedule, we did not report SymSMoE's results on IMDB and will do so during the discussion period. We further aim to test our method on more downstream tasks and will continue updating the results in the meantime. From the current results, we observe strong advantages of RPC on downstream tasks, with large increases in training and validation accuracies.
We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.
We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.
We would be happy to do any follow-up discussion or address any additional comments.
Dear Reviewer 8tie,
We sincerely appreciate the time you have taken to provide feedback on our work, which has helped us greatly improve its clarity, among other attributes. This is a gentle reminder that the Author-Reviewer Discussion period ends in less than two days from this comment, i.e., 11:59 pm AoE on August 13th. We are happy to answer any further questions you may have before then, but we will be unable to respond after that time.
If you agree that our responses to your reviews have addressed the concerns you listed, we kindly ask that you consider whether raising your score would more accurately reflect your updated evaluation of our paper. Thank you again for your time and thoughtful comments!
Sincerely,
Authors
To further address your concerns, we have conducted an additional experiment on another downstream natural language understanding task in addition to those in Table 4 in the attached PDF. We use a pre-trained transformer language model from the NAACL 2019 tutorial on "Transfer Learning in Natural Language Processing" and finetune the model on the 5-class sentiment classification task, Stanford Sentiment Treebank (SST-5). We implement RPC-Attention during finetuning only (RPC-LM) and compare the results with the baseline model (Baseline-LM) on SST-5 in Table 1 below. As can be seen from the table, our RPC-Attention is applicable to pre-trained language models and performs significantly better than the baseline. Hence, RPC-Attention is highly effective in downstream natural language understanding tasks.
Table 1: Validation and test accuracy of RPC-Attention implemented in a pre-trained transformer language model during finetuning versus the baseline transformer language model. We finetune both models on the 5-class Stanford Sentiment Treebank (SST-5) task.
| Model | Validation Accuracy (%) | Test Accuracy (%) |
|---|---|---|
| Baseline-LM | 46.51 | 49.23 |
| RPC-LM | 48.68 | 50.36 |
We are happy to answer any further questions you may have. If you agree that our responses to your reviews have addressed the concerns you listed, we kindly ask that you consider whether raising your score would more accurately reflect your updated evaluation of our paper. Thank you again for your time and thoughtful comments!
Thank you to the authors for addressing my concerns. I've increased my score to 6.
Thanks for your response, and we appreciate your endorsement.
This work introduces a method for understanding attention mechanisms using kernel principal component analysis (kernel PCA). From this perspective, self-attention is viewed as projecting the query vectors onto the principal components of the key matrix. Building on this framework, this work develops a new robust attention mechanism, termed Attention with Robust Principal Components (RPC-Attention).
RPC-Attention formulates the attention layer as an optimization problem of low-rank matrix recovery under sparse noise conditions, through Robust Principal Component Analysis. It utilizes the Alternating Direction Method of Multipliers (ADMM) algorithm to perform the forward pass of RPC-Attention. The proposed RPC-Attention layer converges within 4 to 6 iterations during the forward pass and shows robustness against data corruption and adversarial attacks.
Strengths
- Soundness: The analysis of the attention mechanism presented in the manuscript is generally sound. The development of RPC-Attention is well-motivated by the formulation of standard self-attention.
- Writing: The paper is well-motivated and written with clarity.
- Evaluations: The evaluations encompass a wide range of common tasks in computer vision and language modeling. These evaluations support the claim of the proposed RPC-Attention's robustness against data corruption and adversarial attacks.
Weaknesses
- Related works. The manuscript omits several relevant works. References [1,2] are particularly pertinent given the connection between attention mechanisms and decomposition methods for recovering clean signal subspaces discussed in this work. Additionally, RPC-Attention, as an iterative optimization layer, also links to OptNet [3] and Deep Equilibrium Models (DEQs) [4].
[1] Is Attention Better Than Matrix Decomposition?
[2] Graph Neural Networks Inspired by Classical Iterative Algorithms
[3] OptNet: Differentiable Optimization as a Layer in Neural Networks
[4] Deep Equilibrium Models
- Speed. A major deficiency of RPC-Attention is that it adds cost to the inference speed.
Questions
- Gradient Flow: Iterative optimization layers often face issues with unstable gradient flow. How does RPC-Attention handle gradient propagation through the layer? Does it utilize implicit differentiation [3,4], inexact gradient methods [1], or Backpropagation Through Time (BPTT)? Specifically, are there any instabilities when differentiating through the softmax operator?
- Convergence: How does the RPC-Attention layer converge in the forward pass when using 4 or 6 iterations? What happens if the layer is trained with 4 iterations but inference is performed with 6? Would improved convergence at test time enhance the results?
Limitations
The authors have discussed the limitations in the conclusion section.
Given the connection between attention mechanisms and Graph Neural Networks (GNNs), it would be interesting to see further exploration of RPC-Attention applied to graph data.
Q1. Related works. The manuscript omits several relevant works such as those below
[1] Is Attention Better Than Matrix Decomposition?
[2] Graph Neural Networks Inspired by Classical Iterative Algorithms
[3] OptNet: Differentiable Optimization as a Layer in Neural Networks
[4] Deep Equilibrium Models
Answer: Thanks for your suggestion. We agree with the reviewer that [1], [2], [3], and [4] should be discussed in the Related Work section.
Hamburger in [1] models global context discovery as the low-rank recovery of the input tensor and solves it via matrix decomposition. Both Hamburger and our Attention with Robust Principal Components (RPC-Attention) try to recover clean signal subspaces via computing a low-rank approximation of a given matrix. The key differences between our RPC-Attention and Hamburger are: (1) our RPC-Attention finds a low-rank approximation of the key matrix, while Hamburger finds a low-rank approximation of the input matrix, and (2) our RPC-Attention models the corruption by a sparse matrix, while Hamburger does not enforce this condition. The entries of this sparse corruption can have an arbitrarily large magnitude and help model grossly corrupted observations in which only a portion of the observation vector is contaminated by gross error. Numerous critical applications exist where the data being examined can be naturally represented as a combination of a low-rank matrix and a sparse contribution, such as video surveillance, face recognition, and collaborative filtering [Candès, 2011].
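For completeness, the classical principal component pursuit of [Candès, 2011] can be solved with an inexact ALM/ADMM-style loop that alternates singular value thresholding for the low-rank term and soft-thresholding for the sparse term. The sketch below implements this generic convex program on a plain data matrix; it is meant only as background and is not our Principal Attention Pursuit (Algorithm 1).

```python
import numpy as np

def robust_pca(M, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Classical PCP: decompose M into low-rank L plus sparse S by minimizing
    ||L||_* + lam * ||S||_1 subject to L + S = M (inexact ALM / ADMM-style)."""
    m, n = M.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    mu = mu or 0.25 * m * n / np.abs(M).sum()
    Y = np.zeros_like(M)            # dual variable
    S = np.zeros_like(M)

    shrink = lambda X, tau: np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

    for _ in range(max_iter):
        # Singular value thresholding for the low-rank term
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt
        # Soft-thresholding for the sparse term
        S = shrink(M - L + Y / mu, lam / mu)
        # Dual update
        residual = M - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(M):
            break
    return L, S
```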
[2] derives each component in a Graph Neural Network (GNN) from the unfolded iterations of robust descent algorithms applied to minimizing a principled graph-regularized energy function. In particular, propagation layers and nonlinear activations implement proximal gradient updates, and graph attention results from iterative reweighted least squares (IRLS). While this is an interesting approach, it has not yet been extended to explain the architecture of transformers, including self-attention. Even though graph attention in GNNs and self-attention in transformers share many similarities, they are not the same. For example, query, key, and value matrices are introduced in the transformer's self-attention but not in GNN graph attention. In contrast, our kernel principal component analysis (kernel PCA) framework allows us to derive self-attention in transformers, showing that the attention outputs are projections of the query vectors onto the principal component axes of the key matrix in a feature space.
[3] and [4] implement each layer as an optimization solver and a fixed-point solver, respectively. In particular, an OptNet layer in [3] solves a quadratic program, and a Deep Equilibrium layer in [4] computes the fixed point of a nonlinear transformation. Different from these layers, our RPC-Attention solves a principal component pursuit, which is a convex program. Also, neither the OptNet layer in [3] nor the Deep Equilibrium layer in [4] sheds light on the derivation and formulation of self-attention, which our kernel PCA framework does.
Also, we would like to point out that our RPC-Attention aims at improving the robustness of the self-attention mechanism to data contamination (see Tables 1, 2, and 3 in our main text and Tables 4, 5, 6, 7, 8, 9, 10, and 12 in our Appendix). None of the methods proposed in [1], [2], [3], and [4] addresses the model's robustness, either theoretically or empirically.
Following the reviewer's suggestion, we will include the discussion above in the Related Work section of our revision.
Q2. Speed. RPC-Attention adds cost to the inference speed.
Answer: Please see our answer to Q1 in the Global Rebuttal.
Q3. Gradient Flow: How does RPC-Attention handle gradient propagation through the layer? Does it utilize implicit differentiation [3,4], inexact gradient methods [1], or Backpropagation Through Time (BPTT)? Are there any instabilities when differentiating through the softmax operator?
Answer: As shown in Algorithm 1, the Principal Attention Pursuit (PAP) in our RPC-Attention is quite simple. Most operators in PAP are linear except for the softmax operator and the shrinkage (soft-thresholding) operator. Like the softmax operator in self-attention, the softmax operator in RPC-Attention is easy to differentiate through. Similarly, like the ReLU operator in feedforward layers, the shrinkage operator is also easy to differentiate through. We do not encounter any instabilities with RPC-Attention, and the standard backpropagation algorithm is used to handle gradient propagation in RPC-Attention. Figure 1 (Right) in the attached PDF in the Global Rebuttal shows that the training and validation loss curves of the transformer with RPC-Attention are quite stable.
Q4. Convergence: RPC-Attention layer's convergence in the forward pass when using 4 or 6 iters? What if training the layer using 4 iters and performing inference with 6 iters? Would improved convergence at test time enhance the results?
Answer: In Figure 1 (Left) in the attached PDF, we plot the objective loss given in Eqn. (14) in the main text of our manuscript versus the number of PAP iterations in RPC-Attention (see Algorithm 1 in our manuscript) for models that use 4 and 6 PAP iterations. In Table 3 in the attached PDF, we report the results of using 6 iterations at inference on a model trained with 4 iterations.
Q5. Further exploration of RPC-Attention applied to graph data.
Answer: Thanks for your suggestion. We agree with the reviewer that given the connection between the attention mechanism and GNNs, as pointed out in [Joshi, 2020], our RPC-Attention can be extended to apply to GNNs. For example, RPC-Attention can be incorporated into the graph attention of Graph Attention Networks [Veličković, 2018].
References
Candès, E. J., et al. Robust principal component analysis? 2011.
Joshi, C. Transformers are graph neural networks. 2020.
Veličković, P., et al. Graph attention networks. 2018.
I appreciate the authors' efforts to address my concerns.
Particularly,
- Comparisons with relevant works differentiate RPC-Attention from existing approaches, which are beneficial for broader readers.
- Clarifying the backward pass definition eases reproduction and improves clarity.
- I also acknowledge the efforts on graph data.
I have no concerns about acceptance.
We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.
Regarding your suggestion on further exploration of RPC-Attention applied to graph data, we have implemented RPC-Attention in a GAT model to replace its edge attention. We train the baseline GAT model and RPC-GAT on the Cora dataset with the same settings as in [Veličković, 2017] for 10 seeds and report the average best validation accuracy, together with its standard deviation, in Table 1 below. Our RPC-GAT outperforms the baseline GAT by 0.76%.
Table 1: RPC-GAT vs. GAT on the Cora dataset.
| Model | Val Acc (%) |
|---|---|
| GAT (baseline) | 81.28 ± 0.3 |
| RPC-GAT | 82.04 ± 0.5 |
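For reference, a minimal baseline GAT on Cora with PyTorch Geometric looks roughly as follows (an illustrative sketch with hyperparameters following [Veličković, 2017]; the dataset `root` path is a placeholder, and this is not our exact training script).

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GATConv

dataset = Planetoid(root="/tmp/Cora", name="Cora")  # placeholder download path
data = dataset[0]

class GAT(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # 8 attention heads of 8 hidden features each, as in Veličković et al.
        self.conv1 = GATConv(dataset.num_features, 8, heads=8, dropout=0.6)
        self.conv2 = GATConv(8 * 8, dataset.num_classes, heads=1, dropout=0.6)

    def forward(self, x, edge_index):
        x = F.elu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

model = GAT()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
```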
We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.
We would be happy to do any follow-up discussion or address any additional comments.
References
Veličković, P., et al. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
The demonstration of robustness gain from RPC-Attention is sound, considering RPCA is designed for arbitrarily large sparse corruptions.
I have raised my score to 5.
Thanks for your response, and we appreciate your endorsement. Following your suggestions, we will include comparisons with relevant works, clarification on the backward pass, and experiments with graph data discussed in our rebuttal in our revision.
Dear Reviewer 9vVz,
We sincerely appreciate the time you have taken to provide feedback on our work, which has helped us greatly improve its clarity, among other attributes. This is a gentle reminder that the Author-Reviewer Discussion period ends in less than two days from this comment, i.e., (11:59 pm AoE on August 13th). We are happy to answer any further questions you may have before then, but we will be unable to respond after that time.
If you agree that our responses to your reviews have addressed the concerns you listed, we kindly ask that you consider whether raising your score would more accurately reflect your updated evaluation of our paper. Thank you again for your time and thoughtful comments!
Sincerely,
Authors
Global Rebuttal
Dear AC and reviewers,
Thanks for your thoughtful reviews and valuable comments, which have helped us improve the paper significantly. We are encouraged by the endorsements that: 1) the perspective for understanding self-attention from the lens of kernel PCA provided by the paper is very interesting (Reviewers hhSg, 85Cd), important (Reviewer hhSg), novel (Reviewer 8tie), and reveals an intrinsic connection (Reviewer rAS8); 2) the kernel PCA analysis of self-attention is sound (Reviewer 9vVz), with detailed mathematical derivations and proofs to support the claims (Reviewers 8tie, hhSg, rAS8), and is supported with empirical evidence (Reviewer 85Cd); 3) the experimental evaluation is comprehensive, demonstrating the applicability of the proposed RPC-Attention mechanism across both vision and language tasks (Reviewers 8tie, 9vVz) and supporting the claim of the proposed RPC-Attention's robustness against data corruption and adversarial attacks (Reviewers 9vVz, hhSg). We have included the additional experimental results requested by the reviewers in the 1-page attached PDF.
One of the main concerns from the reviewers is that our RPC-Attention might add cost to the inference speed and require more memory. Another concern is that we need to verify the performance of our RPC-Attention on a larger model, like ViT-Base with 300 epochs. Additionally, the reviewers ask about the stability of back-propagation through PAP iterations. We address these questions here.
Q1: Efficiency of RPC-Attention
Answer: Since we only apply RPC-Attention to the first layer of the transformer model, it adds minimal cost to the inference speed. In particular, in Table 11 in our Appendix, we show the inference runtime cost (in seconds/sample) of our RPC-Attention. As can be seen in that table, during inference ("Test" column in Table 11), our RPC-Attention models with 4iter/layer1 (4 Principal Attention Pursuit (PAP) iterations at layer 1) and 5iter/layer1 (5 PAP iterations at layer 1) have the same inference runtime as the baseline SymViT, 0.01 seconds/sample. Our RPC-Attention with 6iter/layer1 (6 PAP iterations at layer 1) only requires slightly more inference time than the baseline, 0.011 seconds/sample vs. 0.01 seconds/sample, which is not a "major deficiency". Due to the effectiveness of our RPC-Attention, applying it at the first layer of a transformer model is already enough to improve the model's robustness significantly (see Tables 1, 2, and 3 in our main text and Tables 4, 5, 6, 7, 8, 9, 10, and 12 in our Appendix). Note that the results reported in Table 11 also show that our RPC-Attention is as memory-, parameter-, FLOP-, and training-efficient as the baseline SymViT.
While applying RPC at all layers, as in RPC-SymViT (2iter/all-layer), leverages more of RPC's advantages, it is not really necessary in most settings due to its small improvement over applying RPC at the first layer and its higher computational overhead, as some reviewers mentioned.
Q2: Experiments on a larger model, e.g., ViT-Base with 300 epochs
Answer: We have conducted an additional experiment on ViT-Base for the ImageNet object classification task with 300 epochs and reported our results in Tables 1 and 2 in the attached PDF. Our RPC-SymViT-Base outperforms the baseline in terms of clean accuracy and is also more robust to data perturbation (Table 1) and adversarial attacks (Table 2) than the baseline.
Q3 (Gradient Flow): Stability of back-propagation through PAP iterations
Answer: As shown in Algorithm 1 in the main text, the Principal Attention Pursuit (PAP) in our RPC-Attention is quite simple. Most operators in PAP are linear except for the softmax operator and the shrinkage (soft-thresholding) operator. Like the softmax operator in self-attention, the softmax operator in RPC-Attention is easy to differentiate through. Similarly, like the ReLU operator in feedforward layers, the shrinkage operator is also easy to differentiate through. We do not encounter any instabilities with RPC-Attention, and the standard backpropagation algorithm is used to handle gradient propagation in RPC-Attention. As can be seen in Figure 1 (Right) in the attached PDF, the training and validation loss curves of the transformer with RPC-Attention are quite stable.
We hope that our rebuttal has addressed your concerns about our work. We are glad to answer any further questions you have on our submission, and we would appreciate it if we could receive your further feedback at your earliest convenience.
Dear Reviewers,
We would like to thank you very much for your feedback, and we hope that our response addresses your previous concerns. In case you have not responded to our rebuttal so far, please feel free to let us know if you have any further comments on our work as the discussion period is expected to end soon (11:59pm AoE on August 13th). We would be more than happy to address any additional concerns from you.
Thank you again for spending time on the paper, we really appreciate that!
Best regards,
The Authors
This work aims to understand attention mechanisms from the perspective of kernel PCA. The presentation is clear, the problem is well-motivated, and the perspective is novel and insightful. More interestingly, a robust attention mechanism (RPC-Attention) is further proposed. Major concerns include the performance stability and the computational cost. Overall, the reviewers give positive scores after the rebuttal discussions. The AC agrees that this is an interesting perspective on the understanding and design of attention mechanisms and thus recommends acceptance.