Questions:

Q1: Please see the response to weakness 2a.

Q2: How does the variance impact the results?

Thank you for the question. We evaluate the sensitivity values of the five models compared on the CIFAR10 dataset at the end of training using different values of . We added the results in Appendix A.3 and we see that even though the sensitivity values are different for different (as expected), the conclusion that transformers learn lower sensitivity functions than CNNs is robust to the value of . Appendix A.4 also has results for the QQP dataset with a different value of than considered in the paper.

Q3: Using fixed noise level for different tokens.

We believe that token norms are comparable and thus, it makes sense to use a fixed noise level. Specifically, noise is added at patch level for images, and as mentioned in line 377 in the paper, for language tasks, noise is added after layer normalization.

Q4: Attention architecture considered in Section 3.1

The model considered in Section 3.1 is a standard attention layer with key, query, and value weights , composed with a linear decoder . Please also see the response to weakness 1a.

Q5: Does test accuracy correlate with sensitivity?

Fig 13 in the Appendix shows the test accuracies for the models considered in Fig. 4 on the CIFAR-10 dataset. Although the ViTs have a slightly higher test accuracy, it is comparable across the five models.

Q6: How does model scale affect the difference between sensitivity values?

This is an interesting question. However, as mentioned in the response to weakness 1b, comparing larger models is challenging because we also have to control the pretraining data and strategies for a fair comparison.

Q7: “Section 5 only considers the RoBERTa model. Do the claims hold for more recent language models?”

As mentioned in line 401 in the paper, we include the results with GPT-2 in the Appendix which lead to the same conclusions. Please also see the response to weakness 1b.