Fine-grained Local Sensitivity Analysis of Standard Dot-Product Self-Attention
We propose a fine-grained local sensitivity analysis of dot-product self-attention with respect to l_2-bounded input perturbations.
Abstract
Reviews and Discussion
This paper studies the local sensitivity of dot-product self-attention in Transformers. Though the outputs of all heads are not globally Lipschitz, a weaker notion, i.e., the local sensitivity, can be theoretically analyzed by deriving an upper bound. Besides, the upper bound is empirically verified, and certification on practical models is also given.
Strengths
- An upper bound for the local sensitivity is derived
- Numerical validations are provided to support that the upper bound is tight and reasonable
Weaknesses
- The dimensions of some matrices are undefined, e.g., and ?
- Solving Eq. (10) requires SVD for the n-by-d matrix. How to ensure the computational efficiency?
- To bound the second term in Eq. (14), the authors first use the triangle inequality to obtain the upper bound. However, could this also be obtained with a closed-form solution? This is because the objective function and constraint are both linear.
Questions
- Before Proposition 1, the authors mention the robustness under l_2 perturbation. How about using l_∞ perturbations for robustness when compared to the adversarially chosen l_2 perturbations? In this case, Eq. (5) will be changed to the l_∞ norm, but I'm not sure the techniques used are still valid.
Thanks for your comments. We believe that the significance of our contribution has been underestimated. We share Reviewer d9ub's opinion that our paper has made significant contributions in achieving the first non-zero certified robustness result of standard dot-product self-attention networks on CIFAR. We address each of your individual comments below.
The dimension of some matrices are undefined, e.g., and ?
We consider a standard setting with and . We have now made the dimensions of these matrices clear in the revised paper.
Solving Eq. (10) requires SVD for the n-by-d matrix. How to ensure the computational efficiency?
To be clear, our approach only requires computing the spectral norm, and a full SVD is not needed for this. In our paper, we use the power iteration method, which is known to be very efficient at computing the spectral norm and has been used many times in deep learning (e.g., Tsuzuku'18, Miyato'18, Meunier'22). A minimal sketch is given below.
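The following is an illustrative sketch of the kind of power iteration we have in mind, assuming a plain dense matrix and NumPy; the actual implementation in our code may differ in details such as initialization and the number of iterations.

```python
import numpy as np

def spectral_norm(W, n_iters=50, eps=1e-12):
    """Estimate the spectral norm (largest singular value) of W by power iteration."""
    v = np.random.randn(W.shape[1])
    v /= np.linalg.norm(v) + eps
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u) + eps   # estimate of the top left singular vector
        v = W.T @ u
        v /= np.linalg.norm(v) + eps   # estimate of the top right singular vector
    return float(u @ (W @ v))          # Rayleigh-quotient-style estimate of sigma_max
```

Each iteration costs only two matrix-vector products, which is why this scales much better than computing a full SVD of the n-by-d matrix.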
To bound the second term in Eq. (14), the authors first use the triangle inequality to obtain the upper bound. However, could this also be obtained with a closed-form solution? This is because the objective function and constraint are both linear.
Unfortunately, there is no closed-form solution for the second term of Eq. (14), since this is a problem of maximizing the spectral norm under a quadratic constraint. The objective function is not a linear function of due to the appearance of the spectral norm. The constraint is also not linear in due to the appearance of the Frobenius norm; it is actually quadratic. For problems minimizing the spectral norm under a quadratic constraint, it is possible to reformulate them as semidefinite programs (SDPs). However, the second term of Eq. (14) requires maximizing the spectral norm, and hence it is difficult to obtain a bound tighter than our current bound based on the triangle inequality. A schematic statement of this distinction is given below.
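Schematically, in placeholder notation (the original symbols are not rendered on this page, so $g$, $\Delta$, and $\epsilon$ below are generic stand-ins rather than the exact quantities of Eq. (14)):

```latex
% Sub-problem structure: maximizing a spectral norm over a Frobenius-norm (quadratic) ball
\max_{\|\Delta\|_F \le \epsilon} \; \big\| g(\Delta) \big\|_2 .
% The minimization counterpart can use the standard SDP characterization of the spectral norm,
\|M\|_2 \le t
\;\Longleftrightarrow\;
\begin{pmatrix} tI & M \\ M^\top & tI \end{pmatrix} \succeq 0 ,
% but maximizing a convex function (the spectral norm) over a convex set does not
% admit such a reformulation, hence the triangle-inequality upper bound.
```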
Before Proposition 1, the authors mention the robustness under perturbation. How about using perturbations for robustness when compared to the adversarially chosen perturbations? In this case, Eq. (5) will be changed to the norm but I’m not sure the used techniques are still valid.
Extending our fine-grained analysis to the case is definitely non-trivial and would require some major changes, since we are using the fact that softmax is -Lipschitz with respect to the norm. We want to emphasize that it is entirely reasonable for our current paper to focus on the perturbation case, and there are many papers published in top machine learning venues (e.g., Singla'21, Trockman'21, Meunier'22, Singla'22, Prach'22, Araujo'23, Wang'23, Hu'23) that only focus on the robustness and sensitivity of neural networks under the setting. Our paper is the first to obtain a non-vacuous certified robustness result for standard dot-product self-attention networks on CIFAR.
[Singla'21] Sahil Singla and Soheil Feizi. Skew orthogonal convolutions. ICML
[Trockman'21] Asher Trockman and J Zico Kolter. Orthogonalizing convolutional layers with the Cayley transform. ICLR
[Meunier'22] Laurent Meunier, Blaise J Delattre, Alexandre Araujo, and Alexandre Allauzen. A dynamical system perspective for lipschitz neural networks. ICML
[Singla'22] Sahil Singla, Surbhi Singla, and Soheil Feizi. Improved deterministic l2 robustness on CIFAR-10 and CIFAR-100. ICLR
[Prach'22] Bernd Prach and Christoph H Lampert. Almost-orthogonal layers for efficient general-purpose lipschitz networks. ECCV
[Araujo'23] Alexandre Araujo, Aaron J Havens, Blaise Delattre, Alexandre Allauzen, and Bin Hu. A unified algebraic perspective on lipschitz neural networks. ICLR
[Wang'23] Ruigang Wang and Ian Manchester. Direct parameterization of lipschitz-bounded deep networks. ICML
[Hu'23] Kai Hu, Andy Zou, Zifan Wang, Klas Leino, and Matt Fredrikson. Unlocking deterministic robustness certification on imagenet. NeurIPS
Thanks for the authors' response.
If the author(s) would like to focus on the first non-zero certified robustness result of standard dot-product self-attention networks on CIFAR, there are some references that are missing:
https://openreview.net/forum?id=BJxwPJHFwS
Shi, Z., Zhang, H., Chang, K.W., Huang, M. and Hsieh, C.J., 2020. Robustness verification for transformers. ICLR 2020.
Bonaert, G., Dimitrov, D.I., Baader, M. and Vechev, M., 2021, June. Fast and precise certification of transformers. In Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (pp. 466-481).
Sorry for the late reply but my score will remain unchanged.
This paper aims to theoretically analyze the sensitivity of the self-attention mechanism. Local perturbations are imposed on the input, and the authors quantify the relationship between the sensitivity and the input, weight matrices, etc. Experiments are done to validate the theory, and insights are provided to achieve a more stable self-attention structure.
Strengths
- This paper captures a common problem of the popular Transformer model: the self-attention mechanism can be sensitive. The work quantifies the sensitivity and provides insight into how to make the self-attention structure stable. This topic is important for the performance of the Transformer model, which is widely applied in NLP and CV tasks.
- I do not have doubts about the theoretical results, as they are clearly derived.
- The experiments are closely related to the theory.
Weaknesses
- My main concern is that this work does not provide enough contribution. In Section 4, the gap caused by the perturbation is derived. However, these results are not novel; in fact, they are easy to derive. The main idea of Section 4 is just finding a Lipschitz constant to bound the gap when a perturbation is added to the input. This can be easily done if we take the derivative over the input X and find an upper bound for the Frobenius norm of the gradient over X. In some other works, the closed-form gradients (maybe over , but similar to the gradient over X) are easily derived, e.g., Tian, Yuandong, et al. "Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer." arXiv preprint arXiv:2305.16380 (2023). Thus, I do not think the theory has much contribution.
- The theory in Section 4 implies that weight matrices and data with small magnitude are better. However, 'small magnitude' does not mean a self-attention mechanism is a good model. Consider an extreme case where all weight matrices are close to 0; then the attention mechanism has poor representation ability. We usually require a model with both expressivity and stability, while in this work, the expressivity is ignored.
Questions
- How to theoretically guarantee that a model can have both good expressivity and stability?
- When the weights follow some specific distribution, can the sensitivity bound be improved? Or is the bound only related to the magnitude of the weights?
For your convenience, here we list the detailed information of the references mentioned in our above response. Hopefully this makes reading our response easier. We are also willing to address any follow-up questions.
[Singla'21] Sahil Singla and Soheil Feizi. Skew orthogonal convolutions. ICML
[Trockman'21] Asher Trockman and J Zico Kolter. Orthogonalizing convolutional layers with the Cayley transform. ICLR
[Meunier'22] Laurent Meunier, Blaise J Delattre, Alexandre Araujo, and Alexandre Allauzen. A dynamical system perspective for lipschitz neural networks. ICML
[Singla'22] Sahil Singla, Surbhi Singla, and Soheil Feizi. Improved deterministic l2 robustness on CIFAR-10 and CIFAR-100. ICLR
[Prach'22] Bernd Prach and Christoph H Lampert. Almost-orthogonal layers for efficient general-purpose lipschitz networks. ECCV
[Araujo'23] Alexandre Araujo, Aaron J Havens, Blaise Delattre, Alexandre Allauzen, and Bin Hu. A unified algebraic perspective on lipschitz neural networks. ICLR
[Wang'23] Ruigang Wang and Ian Manchester. Direct parameterization of lipschitz-bounded deep networks. ICML
[Hu'23] Kai Hu, Andy Zou, Zifan Wang, Klas Leino, and Matt Fredrikson. Unlocking deterministic robustness certification on imagenet. NeurIPS
The theory in Section 4 implies that weight matrices and data with small magnitude are better. In this work, the expressivity is ignored.
Our theory should not be interpreted as implying that weight matrices and data with small magnitude are necessarily better for network design. Previous studies on certifiably robust networks (e.g., Singla'21, Trockman'21, Meunier'22, Prach'22, Araujo'23, Wang'23) have revealed that there is a trade-off between deterministic certified robustness and clean performance for standard feed-forward networks or residual networks (dot-product self-attention has not been covered in these previous works). The right interpretation is that our bound can be used to quantify such a robustness/performance trade-off for dot-product self-attention (based on our bound, one can sacrifice clean performance to achieve non-vacuous certified robust accuracy for standard dot-product self-attention). Importantly, while constraining the weight/data norm can improve certified robustness, there is a price to pay, i.e., the clean accuracy will drop. So we never claim that one should always maximize robustness. Instead, our theory offers a tool to quantify the robustness/performance trade-off for dot-product self-attention so that one can explore this trade-off for the tasks at hand (different tasks require different levels of robustness). We have revised our paper accordingly to better reflect this point.
How to theoretically guarantee that a model can both have good expressivity and stability?
As mentioned before, our focus is on certified robustness rather than stability (which is typically used for generalization). We address the above comment by emphasizing the trade-off between certified robustness and clean performance. Our results and other papers on certifiably robust networks (e.g., Singla'21, Trockman'21, Meunier'22, Prach'22, Araujo'23, Wang'23) all demonstrate that there is a trade-off between deterministic certified robustness and clean performance (expressivity). Currently, improving certified robustness typically leads to degraded performance. One navigates the optimal Pareto trade-off between robustness and expressivity depending on the level of perturbations expected for the task at hand. Continuing to boost the clean performance of certifiably robust models is an ongoing research effort in the robust learning community. Our main contribution is characterizing the robustness/performance trade-off for dot-product self-attention, which is a reasonable self-contained topic. How to improve the structure of Transformers to obtain better Pareto trade-off curves for robustness/performance is beyond the scope of our paper and should be investigated in the future. Our work serves as a foundation for such future study.
When weights follows some specific distribution, can the sensitivity bound be improved?
We follow the standard setup in the certified robustness literature and consider the certified robustness of a trained model with deterministic weights (Singla'21, Trockman'21, Meunier'22, Prach'22, Araujo'23, Wang'23). Evaluating the robust accuracy of a fixed trained model makes sense, since robustness certification is typically applied to a trained model. To the best of our knowledge, certified robustness under distributions of the network weights has not been extensively studied in the robust learning field. It is unclear how to extend our analysis, or any other existing deterministic certified robustness analysis, to a setting that considers a weight distribution.
Or the bound is only related to the magnitude of weights?
Our bound in Theorem 1 not only depends on the spectral norm of the weights, but is highly dependent on the actual matrix elements and how they interact with the input and the other attention heads. This is especially true in the "key sensitivity metric" of Eq. (12), since we are looking at the spectral norm after summing many attention weights together, which makes our bound more fine-grained and less conservative. Here is a simple example: suppose we have an attention layer with two attention heads. We set , , (), and . Then if , the sensitivity metric of Eq. (12) is: . However, if , then we have , which is a bound 3 times larger. In this case, we have not changed the spectral norm of any weight matrix, only the sign direction, yet our key sensitivity metric changes significantly. To summarize, our bound is mainly related to our key sensitivity metric (Lemma 1). Decreasing the spectral norms of can eventually lead to a decreased value of the key sensitivity metric, but that is not the only way to decrease it.
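To illustrate the same phenomenon numerically (the matrices below are hypothetical toy values, not the actual weights of Eq. (12)): the spectral norm of a sum depends on how the summands align, not only on their individual norms.

```python
import numpy as np

spec = lambda M: np.linalg.norm(M, 2)   # spectral norm = largest singular value

A = np.array([[1.0, 0.0],
              [0.0, 0.0]])
B = -A                                   # same spectral norm as A, opposite sign direction

print(spec(A), spec(B))   # 1.0 1.0  -> individual norms are identical
print(spec(A + A))        # 2.0      -> aligned terms add up
print(spec(A + B))        # 0.0      -> opposing terms cancel
```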
Thanks for your comments. We believe that our contribution has been misunderstood. We share Reviewer d9ub's opinion that our paper has made significant contributions in achieving a non-vacuous certified robustness result for standard dot-product self-attention networks on CIFAR. We address each of your individual comments below.
However, these results are not novel, in fact, they are easy to derive. This can be easily done if we take derivative over input X and find an upper bound for the Frobenius norm of the gradient over X.
We think our results are novel for the following reason. Our local sensitivity analysis is not a local Lipschitz bound result (see the explanation in Appendix A of our paper), and our analysis is really needed to achieve our main goal, which is to produce tight local bounds that can lead to non-vacuous certified robustness of standard dot-product self-attention networks on realistic data sets such as CIFAR. We emphasize that obtaining non-vacuous certified robust accuracy is an important topic that is being studied in the deep learning field (e.g., see Singla'21, Trockman'21, Meunier'22, Singla'22, Prach'22, Araujo'23, Wang'23, Hu'23), and this is very different from studying generalization bounds, which are typically used to provide design guidelines and hence are not required to be non-vacuous. If one follows the suggestion in this comment (e.g., finding an upper bound for the Frobenius norm of the gradient), then the resulting bound (e.g., the local Lipschitz bound in a concurrent submission available at https://openreview.net/forum?id=mivL0akE5E is derived using exactly this gradient norm idea) is too loose and can only produce 0 certified robust accuracy for self-attention networks on CIFAR. We elaborate as follows.
- Non-vacuous certified robustness: Given a data point , a classifier is said to be certifiably robust at radius at this data point with label if for all such that , we have . Given a perturbation level , the certified robust accuracy is defined to be the percentage of data points (in the test set) where the classifier is certifiably robust at radius , and this notion of certified robustness has been adopted in many recent works (e.g., Singla'21, Trockman'21, Meunier'22, Singla'22, Prach'22, Araujo'23, Wang'23, Hu'23). Based on our Proposition 1, our local sensitivity bound can be combined with the prediction margin to calculate the certified robust accuracy. As commented by Reviewer d9ub, achieving non-vacuous certified accuracy for dot-product self-attention on the CIFAR10 task is one of our main contributions. If one tries to develop a local Lipschitz bound and uses it for certified robustness, the bound is usually too loose and the robust accuracy will be 0. For example, the local Lipschitz bound in the concurrent submission (SpecFormer) available at https://openreview.net/forum?id=mivL0akE5E is derived using exactly this gradient norm idea (see Theorem 4.3 of that paper). If we compare Theorem 4.3 of the SpecFormer paper with our bound (see the table below), we can see that our bounds are better by orders of magnitude. The consequence is that Theorem 4.3 of the SpecFormer paper can only achieve 0 certified accuracy on CIFAR. Our key sensitivity metric (Lemma 1 in our paper) is novel and crucial for obtaining a non-vacuous certified accuracy result.
| perturbation bound | .01 | .02 | .03 | .04 | .05 | .06 | .07 | .08 | .09 | .10 |
|---|---|---|---|---|---|---|---|---|---|---|
| PGD Lower Bound | 0.0286 | 0.0573 | 0.0860 | 0.1147 | 0.1427 | 0.1719 | 0.2008 | 0.2295 | 0.2582 | 0.2865 |
| Local Bound Theorem 1. (ours) | 0.0291 | 0.0582 | 0.0875 | 0.1168 | 0.1462 | 0.1757 | 0.2052 | 0.2348 | 0.2646 | 0.2943 |
| SpecFormer Local Lipschitz Bound (Gradient-based) | 16.901 | 34.403 | 52.515 | 71.245 | 90.600 | 110.589 | 131.219 | 152.499 | 174.436 | 197.036 |
[Figure: comparison to the gradient-based bound]
- Our analysis is not a local Lipschitz constant analysis: To reduce conservatism, the local bounds derived in our paper are actually quite different from a local Lipschitz constant bound. Our local bound fixes and only allows to vary. This is sufficient for calculating the certified robust accuracy. The Lipschitz constant bound you describe actually holds for any two points in the -ball around . This introduces unnecessary conservatism when applied to compute the certified robustness. A schematic comparison of the two quantities is given below.
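In placeholder notation (generic $f$, $X$, $\Delta$, $\epsilon$, since the original symbols are not rendered on this page), the two quantities being contrasted are, schematically:

```latex
\underbrace{\sup_{\|\Delta\|_F \le \epsilon} \big\| f(X + \Delta) - f(X) \big\|}_{\text{local sensitivity: one endpoint fixed at } X}
\qquad \text{vs.} \qquad
\underbrace{\sup_{\substack{X_1, X_2 \in \mathcal{B}(X, \epsilon) \\ X_1 \neq X_2}} \frac{\big\| f(X_1) - f(X_2) \big\|}{\| X_1 - X_2 \|}}_{\text{local Lipschitz constant: both endpoints vary}}
```

The first quantity is all that certification needs, and it never exceeds $\epsilon$ times the second, so bounding it directly avoids the extra conservatism of a local Lipschitz argument.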
Thanks for the clarification. I have changed the score. But I still think the work is a little below the acceptance margin, considering that the theoretical contribution is not sufficient.
This paper provides a fine-grained theoretical analysis of the local sensitivity of self-attention. The primary constrained optimization for this local sensitivity is , where is the residual self-attention. The authors divide into and (Equations 7 & 8). For , the authors provide an analytical upper bound (the first contribution of this paper) as
For , the authors first apply the triangle inequality to split Equation 13 into the perturbation on the self-attention score matrix (Equation 17) and (Equation 15). The second contribution of this paper is bounding Equation 17 as in Lemma 2 and Equation 19. Putting these together, we obtain an upper bound for .
In the experiments, the authors first study the and values versus the PGD lower bound across epsilon values in the single-head and multi-head (8) cases. The authors also analyze a ViT's certified robust accuracy on the CIFAR10 task and provide nonzero robustness (the third contribution).
Strengths
The proof organization of this paper is quite clear and easy to follow in Section 4. The authors meticulously describe the looseness of each naive bound and the strategies to further tighten the bound.
This local sensitivity analysis would be insightful for both the adversarial and the general machine learning communities.
The experiments are also conducted on real-world tasks (ViT on CIFAR10), which makes this theoretical analysis practical for understanding the robustness of self-attention.
Weaknesses
There are multiple naive bounds described in theory but not evaluated in practice. For example, Equation 9 for bounding and for bounding should also be evaluated in Figure 1 to make the conservatism argument stronger.
Minor:
Typo in Equation 5: should be
Questions
Is it possible to perform another trial of robustness experiments on NLP tasks (text classification, entailment, etc.)? The analysis of this paper is applied to general self-attention and it is definitely great to see a practical evaluation on vision tasks. But it would be even better to see if the same robustness argument is applicable across domains.
In the network design section, it is mentioned that Theorem 1 would shed light on constraining weight norms for self-attention. It would be nice to see a concrete use case. For example, given a particular quadruple and an input , could we ablate each weight individually and use Theorem 1 to predict the local sensitivity?
Overall, this is a good paper, but I believe the evaluation section could be further improved. I would give a weak accept score at this moment, but I am willing to raise my score if my above concerns / questions are addressed.
Thank you for your valuable feedback. We have addressed your comments below.
There are multiple naive bounds described in theory but not evaluated in practice... they should also be evaluated in Figure 1 to make the conservatism argument strong.
We agree that evaluating these naive bounds, especially the naive estimate of eq (9), would make the conservatism argument strong. We have revised the experimental section (Section 5) and Figure 1 with the additional naive bounds. The results further show that our choice of the key-sensitivity metric is crucial for getting tight local sensitivity bounds of dot-product self-attention. We have also included an additional study on the tightness of our bound similar to Figure 1 for several input norm values found in Appendix C.2, Figure 4. We hope this provides a more complete picture of how our bound and architecture deal with conservatism.
Is it possible to perform another trial of robustness experiments on NLP tasks (text classification, entailment, etc.)? The analysis of this paper is applied to general self-attention and it is definitely great to see a practical evaluation on vision tasks. But it would be even better to see if the same robustness argument is applicable across domains.
Thanks for this comment. We followed your suggestion and produced a robustness study of dot-product self-attention on the Stanford Sentiment Treebank (SST) dataset. We also revised our paper to include these new results in Appendix C.3 and Figure 5. We follow the setup in Xu’23, Wang’20, Zhu’20, and Li’21, and consider deterministic certified robustness in the word embedding space (we use the BERT embedding from Devlin'19). Our analysis here only serves as a proof of concept for applying our bounds to NLP tasks, and we acknowledge that considering perturbations in the word embedding space (instead of on the tokens directly) is a common limitation of applying sensitivity analysis to NLP benchmarks, shared by many other papers (Hou’23). The certified robustness radii that we obtain are measured using the norm, and they are similar to those in prior work that does not use dot-product attention (Xu’23). Our results demonstrate that our sensitivity analysis bounds lead to non-vacuous robustness results on SST. For more details, see Appendix C of our revised paper. If the reviewer would like us to work on other specific tasks, please let us know and we are willing to provide more results.
[Xu’23] Xu et al. Certifiably Robust Transformers with 1-Lipschitz Self-Attention.
[Hou’23] Hou et al. TextGrad: Advancing Robustness Evaluation in NLP by Gradient-Driven Optimization. ICLR
[Wang’20] Wang et al. InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective. ICLR
[Zhu’20] Zhu et al. FreeLB: Enhanced Adversarial Training for Natural Language Understanding. ICLR
[Li’21] Li et al. TAVAT: Token-Aware Virtual Adversarial Training for Language Understanding. AAAI
[Devlin’19] Devlin et al. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT
In the network design section, it is mentioned that Theorem 1 would shed light on constraining weight norms for self-attention. It would be nice to see a concrete use case. For example, given a particular quadruple and an input , could we ablate each weight individually and use Theorem 1 to predict the local sensitivity?
We can certainly use Theorem 1 to predict the local sensitivity, and we agree that ablating the weights as suggested by the reviewer could give useful insight. These new experiments are presented in Appendix C.1, Figure 3. The ablation was already partially carried out in our CIFAR10 experiments of Figure 2, where we display the results for different spectral norm constraints on . In this additional experiment, we consider a setup very similar to your suggestion. We take a tuple from the first layer of a trained ViT on CIFAR10 and a normalized input. We then sample perturbations of each element while keeping the others fixed, observing how our local upper bound is affected as the parameter perturbation size increases. In addition, we have included an extended study on the tightness of our bound similar to Figure 1, but with varying input norms, found in Appendix C, Figure 4. In practice, it is possible to choose the norms of to control the expansion of each term in our local bound through each layer, and we can directly examine these values in the design process (this is how we designed our networks and determined the appropriate pre-attention layer projections). A generic sketch of such a norm constraint is given below.
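For illustration, one common way to impose a spectral-norm constraint on a weight matrix (e.g., the query/key/value/output projections) is a simple rescaling after each training step; the sketch below is generic and not necessarily the exact mechanism used in our experiments.

```python
import numpy as np

def clip_spectral_norm(W, c):
    """Rescale W so that its spectral norm is at most c (identity if already satisfied)."""
    sigma = np.linalg.norm(W, 2)   # largest singular value of W
    return W if sigma <= c else W * (c / sigma)
```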
A1 Thank you for adding the Figure 4! This figure looks great and makes the conservatism argument stronger in practice.
A2 Thank you for adding the BERT on the SST experiment! This experiment looks great and it strengthens the applicability of the derived bounds across domains.
A3 Thank you for adding Figure 3! This is a really practical ablation study, but it would be even better to include a value derived from Theorem 1 as an upper bound shown in the figure. It would be informative to see whether these curves are close to their Theorem 1 upper bound.
Overall, I believe that this is a good paper and I have bumped my score to accept.
First of all, thank you for taking all of our responses into consideration.
Regarding A3, we want to clarify that the values in Figure 3 do correspond to the upper bound given by Theorem 1 for the entire multi-head attention layer (a bound can be computed for each sampled set of parameters). However, we do think it would be useful to also plot the PGD lower bound for each sample, which would give more information about the tightness of the bound under these parameter perturbations. We will revise the figure to communicate its purpose more clearly.
Dear Reviewers,
Thank you once again for your comments and feedback. We would like to kindly remind you that we are approaching the end of the rebuttal period, so please don’t hesitate to reach out if you still have any questions or comments about our work. We are committed to addressing all concerns related to our work. Below we summarize the main points we have addressed based on your requests:
- We have strengthened our numerical experiments: We improved the study on the tightness of our derived bound by comparing to additional naive estimates (Section 5, Fig. 1) and adding several new parameter ablation studies to Appendix C.1-2 (Fig. 3 and 4). We have also expanded the application of our bound to a language sentiment analysis task (the Stanford Sentiment Treebank task), showing certified robustness results on the word-embedding space as a proof of concept. These results are added to Appendix C.3, Fig. 5.
- We have shown that our derived local bound is different from, and much tighter than, a Lipschitz-based bound obtained through upper-bounding the gradient in a local region about the input (see the table provided in our response to Reviewer VymQ). Thus, our derivation is practically useful for quantitatively predicting the robustness of dot-product self-attention and computing non-vacuous certified robustness.
- We have added a clarifying remark to Section 4 on how our bound should be interpreted and used in design of self-attention layers to trade-off robustness and performance.
The paper studies fine-grained local sensitivity analysis of the standard dot-product self-attention. It is known that dot-product self-attention is not globally Lipschitz. This paper develops theoretical local bounds quantifying the effect of input feature perturbations and shows the role of the attention weight matrices and the unperturbed input.
Why not a higher score
More discussion on the technical novelty and the computational efficiency of the approach will be valuable to appreciate the full contributions of the paper.
Why not a lower score
Local sensitivity of standard dot-product self-attention is not well studied, and I believe the paper has interesting technical contributions which could be valuable from both theoretical and practical perspectives.
Reject