Softmax is not Enough (for Sharp Size Generalisation)
We prove that the softmax function cannot robustly model sharp functions as input size increases, and support this with several controlled experimental observations over simple attention heads, as well as in language models.
Abstract
Reviews and Discussion
This paper can be divided into 3 parts:
- The observation and proof that softmax-based architectures (such as Transformers) will have a "dispersion" phenomenon when tested on longer inputs than they are trained on.
- The observation that this dispersion phenomenon can degrade the length-generalization performance of transformers on simple algorithmic tasks such as finding the maximum element in a list.
- An ad hoc "adaptive temperature sampling" scheme that seeks to remedy the dispersion phenomenon and leads to performance improvements on the "maximum element" task and on several problems in the CLRS-text benchmark.
Questions for Authors
N/A
Claims and Evidence
The proofs are clear.
Methods and Evaluation Criteria
Yes, these make sense.
Theoretical Claims
Yes, I checked the main theorem and its proof.
Experimental Design and Analysis
Yes, I checked the experimental details for the maximum task and the CLRS-text task and they seemed ok to me. The adaptive temperature scheme is arguably very ad hoc, but it seems to give some small gains.
Supplementary Material
Yes, I reviewed Appendices A-D.
Relation to Existing Literature
The paper's observation is simple, but thought-provoking. The maximum-element task is a convincing illustration that this dispersion phenomenon in LLMs is real and could be one of the barriers to length generalization. On the other hand, the adaptive temperature sampling scheme is less convincing, more ad hoc, and leads to seemingly small gains on the problems tested. Therefore, I overall find highlighting the dispersion phenomenon to be the more valuable contribution in this paper, since it may well motivate future work on this topic.
Missing Important References
Not as far as I know.
Other Strengths and Weaknesses
I had some trouble following the explanation of the adaptive temperature sampling scheme.
Other Comments or Suggestions
The paper is written in a somewhat informal style, which I am fine with because it is mostly clear. However, in particular this sentence could be improved: "We prove this important result now" before Theorem 2 could be changed to simply read "We prove this result now".
Dear Reviewer n5x1,
We are highly pleased to read your review, and really appreciate your positive view on our results and their significance!
In what follows, we reply to all of the points you raised:
On the other hand, the adaptive temperature sampling scheme is less convincing, more ad hoc, and leads to seemingly small gains on the problems tested. Therefore, I overall find highlighting the dispersion phenomenon to be the more valuable contribution in this paper, since it may well motivate future work on this topic.
We fully agree with you that the key outcome of our work should be highlighting the dispersion effect, improving understanding of it, and stimulating future work towards addressing it. Adaptive temperature was designed as a mostly ad-hoc method to illustrate that even simple interventions can counter dispersion in a way that leads to measurable improvements, but it does not escape the confines of our theoretical results – as we clearly highlight in our Conclusions.
I had some trouble following the explanation of the adaptive temperature sampling scheme.
We appreciate this remark and commit to adding a new section (potentially in the Appendix) which will provide a step-by-step overview of how the sampling scheme was arrived at.
However, in particular this sentence could be improved: "We prove this important result now" before Theorem 2 could be changed to simply read "We prove this result now".
This is a great suggestion, and we will tone down by removing the word ‘important’ here.
The function softargmax (commonly referred to as softmax), which is used to create probability output vectors and attention heads within neural networks, becomes less like argmax (less sharp) as the number of elements over which the softmax is applied increases. This is detrimental for learnt circuits within transformer architectures that need sharp attention, especially when deployed at inference time on longer sequences than presented during training.
Questions for Authors
None.
Claims and Evidence
The proofs given in Lemma 2.1 and Theorem 2.2 are rather weak with respect to the arguments made in the paper. For example, Lemma 2.1 assumes that logits are bounded below by m and above by M, both of which are finite values, and provides a proof for the limiting case n → ∞. This is an incredibly loose bound. The experiments in the paper go up to n = 16,384 = 2^14, while machine precision limits would only provide a lower bound of m = −10^38 ≈ −2^126, so this sequence length and lower bound is going to induce minimal dispersion on the attainable sharpness. The proofs provided by the authors demonstrate that softmax must disperse in the limiting case, but not that it does in practice for the scales actually used with transformer models.
The authors also included empirical studies (Fig 2 and 3), where a model is tasked with identifying the maximal element in a sequence, which I appreciated. However, I think it would help support the paper if there were additional empirical measurements. For example, what is the distribution of logit values seen in LLM attention heads at inference time across a range of tasks? The minimum and maximum values from this would be informative to establish a practical range for m and M.
Additionally, I think there could be more discussion on the factors which prevent the model from achieving an arbitrarily sharp distribution (e.g. label noise prevents the model from learning arbitrarily large parameter values; the derivative to make off-target logits arbitrarily negative can vanish when the softmax output is already sufficiently sharp).
Methods and Evaluation Criteria
Yes.
Theoretical Claims
The proofs in the main paper (Lemma 2.1, Theorem 2.2, Prop 3.1, Prop 3.2) are correct.
Experimental Design and Analysis
For the experiments comparing the adaptive-θ model presented by the authors against the baseline model for the max task (Table 1), I think these results could be usefully expanded by including additional "oracle" measurements which use the optimal θ at inference time. This will help to indicate how much of the possible gains which could be attained by only changing the temperature were achieved by the adaptive temperature.
The discussion of the comparison for the adaptive-θ method introduced in the paper (Fig. 8) could be more detailed. In particular, some tasks see the adaptive temperature model perform worse than the baseline (namely heapsort, mst kruskal, and bubble sort). I would appreciate if the authors could comment on whether this deficiency is meaningful; for instance, are there some features these tasks have in common which make adaptive-θ perform poorly here? Is it just because adaptive-θ was fit on the max task and does not generalize well to these tasks?
Supplementary Material
Appendix A.
Relation to Existing Literature
The issue of softmax dispersion is already known within the community broadly speaking, but I appreciate that this work presents the problem well and raises awareness of this issue.
Missing Important References
Not that I am aware of.
Other Strengths and Weaknesses
None.
Other Comments or Suggestions
Figure 5.
The title and y-axis indicate that in half the domain you are dividing the logits by a negative number, which doesn't make sense w.r.t. the meaning of temperature. I suggest this is refactored. The implication is more like the sequence is changed by negation rather than the temperature adjusted. But the sequence is controlled by lambda on the x-axis, so it is odd to have two different sequences stacked one above the other and joined at the line where the temperature changes sign. To be honest, I am not sure what I am supposed to learn from this figure anyway. Is this power series supposed to be indicative of real data? If so, then how is it related to real data?
The "jet" colormap used here is not perceptually smooth. Its use creates the visual appearance of discontinuities at the luminosity inflection points (e.g. yellow and cyan). I recommend using a more modern colormap instead. There is a good reason why the defaults in matplotlib and MATLAB have both moved away from jet several years ago. Seaborn throws an error if the user requests to use jet, rather than comply. There are many resources which have published which discuss the issues with jet, e.g.
- https://www.youtube.com/watch?v=xAoljeRJ3lU
- http://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/
If you like jet because it breaks the data down into blocks of distinct colours and feel that the new colormaps with smooth luminosity changes (e.g. viridis) are too smooth making them hard to read, then you should be breaking the data up into discrete colours with intentionality, rather than as the side-effect of a poor colormap. In this case, you should use a contour plot, either with contour lines or with discrete blocks of ~24 colours, instead of a smooth colormap.
Typos and snags.
- L018 "circuits" incorrect quote mark
- L055 right column, missing word "A strong current [topic] in this space"
- L073 right column, it would be helpful to know how samples are OOD for this point (i.e. is it OOD in sequence length or not, as that is pertinent to the topic of the paper)
- L105 right column, "a collection of [n] nodes" as n is the number of nodes, not the label of the collection.
- L112 The notation is a bit confusing since k_i and v_i are vectors instead of elements within vectors.
- L124 Eq 3, suffix variable j should be i
- Eq 3, 4: Adding a comma between equations on the same line may help delineate between them.
- Fig 1: It appears that the caption text colours are intended to indicate the block types within diagram. However, the token and MLP colours are too similar in the figure and the text colours are not quite the same shades as the figure colours. I thus recommend the authors add labels within the figure, and/or describe the colours in the caption to make the reference clear.
- Fig 2: What's the number of items used in the base case? Please add this to the caption.
- Fig 3: Add units for y-axis (presumably bits)
- Fig 4: ~L222 "see Figure 5" should be "see Figure 6"
- Tab 1, L382: There shouldn't be a space after the thousands separator; having it there makes the numbers harder to read.
- Fig 8: Some tasks have the adaptive temperature model worse than the baseline (heapsort, mst kruskal, bubble sort). Can the authors comment on what features these tasks possess?
- Fig 8: Legend should be at the bottom of bridges or activity selector instead of the top of bfs so it doesn't obstruct data.
- L692: Usually it is just called the Adam optimizer, not the Adam SGD optimiser. Referring to it as such may cause confusion.
- L654: The hard-coded "Equations 2--3" is incorrect and should be replaced with actual equation numbers as a reference.
Citations:
- Be careful with protecting casing of acronyms.
- L462: "Transformers need glasses! information..."
- L497: "clrs"
- L583: "gpt-2"
- Please consider adding links to references which currently don't have one. This makes it easier for a reader to navigate to the cited works. You appear to have your arXiv references set up in a consistent manner, so adding either a URL or a DOI field can be done automatically for these with a regex call.
Dear Reviewer jdu5,
Thank you so much for the very careful review and the wealth of useful suggestions for improving our work! We are very happy you appreciated our efforts to raise awareness of the dispersion issues of softmax in the ICML community!
We provide detailed answers to your comments below, and we hope they will be to your liking.
Bound looseness
We highly appreciate your remark about the looseness of our bound, and wholeheartedly agree it is worth discussing in more depth. We stratify our answer by your specific comments, and commit to updating the paper to incorporate all aspects of our discussion!
The experiments in the paper go up to n = 16,384 = 2^14, while machine precision limits would only provide a lower bound of m = −10^38 ≈ −2^126, so this sequence length and lower bound is going to induce minimal dispersion on the attainable sharpness.
Just to confirm our understanding: given the exponents you presented, we find it likely that your bounds were derived under the assumption of the bfloat16 type being used. This is a fair assumption, though we remark that large-scale frontier models are frequently served at even lower-precision types – sometimes with aggressive quantisation. This might well affect the practical lower bound obtained for m due to machine precision.
The proofs provided by the authors demonstrate that softmax must disperse in the limiting case, but not that it does in practice for the scales actually used with transformer models.
In this regard, it is worth taking Theorem 2.2 together with Proposition 3.1, which shows that models aiming to improve sharpness have no choice but to amplify their weights at some point in the architecture, and that the weight magnitudes directly control the empirically observed values of m and M. But amplifying weights is generally risky due to the danger of overfitting; as such, the model has no choice but to keep the empirical spread somewhat contained. This immediately brings us to your next point:
For example, what is the distribution of logit values seen in LLM attention heads at inference time across a range of tasks? The minimum and maximum values from this would be informative to establish a practical range for m and M.
This is a fantastic suggestion and we really appreciate it!
To measure this, we fed an entire code sample (Gemma's modules.py file, available at https://github.com/google-deepmind/gemma/blob/main/gemma/modules.py, ~4,000 tokens) to the Gemma 2B and 7B models. The empirically observed values of the logit spread, Δ = M − m, across all attention heads were as follows:
| Model | Average Δ | Maximum Δ | Minimum Δ |
|---|---|---|---|
| Gemma 2B | 5.69 ± 2.05 | 14.78 | 2.28 |
| Gemma 7B | 5.82 ± 2.61 | 32.74 | 0.09 |
This shows that empirical logit spreads in practical queries are, even in their maximal occurrence, rather small compared to the machine epsilon-induced bounds mentioned before, and should give further credibility to the practicality of our results. We of course will include this table in the paper!
Additionally, I think there could be more discussion on the factors which prevent the model from achieving an arbitrarily sharp distribution
Another excellent proposal – we fully agree with all of your specific factors, and will make sure to thoroughly discuss those in our revision.
Specifics of Adaptive-θ
As suggested, we have performed preliminary experiments with a custom oracle multiplier for Adaptive-θ. While this led to small (2–3 percentage point) improvements on lesser OOD sizes, there was no consistent improvement on longer sequence lengths.
We are also happy to enrich the discussion of the comparisons in Figure 8. While we cannot make very strong claims, a unifying property of Heapsort, MST Kruskal and Bubble Sort in CLRS-Text is that they all occupy relatively large chunks of Gemma 2B's context window, stretching further beyond the largest contexts over which the polynomial of Adaptive-θ was fit; this might cause an unintended shift, which is in line with your suggestion.
On Figure 5
- We acknowledge your point regarding `jet`. We agree that perceptually uniform colormaps like `viridis` or `plasma` offer better visual representation, and we will switch!
- We used power series (controlled by λ) as a simple model to represent sequences with varying degrees of "sharpness" in their logit distributions.
- While negative θ does not have a direct thermodynamic meaning, it allows us to explore the behavior of the `softmax` function beyond the typical regime.
Miscellaneous issues
We are very grateful for your thorough comments about several miscellaneous minor issues in the paper. We fully agree with your remarks, and commit to correcting them in our revision.
I am glad to hear my constructive feedback was well received!
To a first order approximation, logits within neural networks are typically distributed like a standard normal distribution. So an average delta around 6 is what I would intuitively expect, and the reason why the issue of the dispersal is intuitively salient. I was of course being somewhat flippant when I referred to machine precision limits setting a lower bound for m: the problem was that the typical values for m and M or delta were not discussed in the paper, so the scale of the issue of the dispersal was not made as clear as it deserved. I appreciate the addition of the measurements made on the Gemma code base, but to better integrate the measurements I encourage the authors to measure the delta observed with stock Gemma models on CLRS-Text as well.
Figure 5
We used power series (controlled by λ) as a simple model to represent sequences with varying degrees of "sharpness" in their logit distributions. While negative θ does not have a direct thermodynamic meaning, it allows us to explore the behavior of the softmax function beyond the typical regime.
I still think I would prefer to see this figure as four subplots, one per combination of the sign of θ and whether λ is above or below 1. This clearly delineates the four regimes which are being considered:
- θ < 0, λ > 1: power series where smaller terms are high density and big terms are low density
- θ > 0, λ > 1: power series where small terms are low density and bigger terms are high density
- θ > 0, λ < 1: power series where small terms are low density and bigger terms are high density, and everything is near 0
- θ < 0, λ < 1: power series where smaller terms are high density and big terms are low density, and everything is near 0
As the authors have addressed my critique sufficiently, I have raised my score.
Dear Reviewer jdu5,
Thank you so much for acknowledging our efforts and raising your score!
We completely share your motivation about the utility of reporting the empirical delta values and will report them for CLRS-Text as well.
Your suggestion about four subplots makes sense and we will amend the figure accordingly.
Best, Authors
The authors argue that modern deep learning architectures are fundamentally incapable of learning sharp functions (for example, max) due to the dispersive nature of the softmax function in out-of-distribution settings. In addition, the authors propose an adaptive temperature mechanism as a plug-in technique at inference time for improving the sharpness.
Questions for Authors
- The authors clearly discuss the dispersion problem of the softmax function. I wonder how big the problem is when we zoom out to the entire transformer, given all the residual connections and normalization, etc.?
Claims and Evidence
The authors argue about the limitation of the softmax function and provide both theoretical and empirical evidence for it. The authors also provide theoretical analysis and empirical visualizations to support adaptive temperature.
Methods and Evaluation Criteria
Both the argument around the softmax limitation and the proposed method make sense.
However, the evaluation is rather "toy". I understand that max retrieval provides a clean setting; however, with only CLRS-Text beyond it, it is hard to gauge the usefulness of adaptive temperature, or the severity of the problem posed by the dispersed nature of softmax, in the real world.
Theoretical Claims
I checked Lemma 2.1 and Theorem 2.2, which seem correct.
Experimental Design and Analysis
I checked the settings for max retrieval and CLRS-Text, and the experimental design and analysis make sense.
Supplementary Material
I reviewed Appendix A to understand the max retrieval settings.
Relation to Existing Literature
The limitation on softmax and adaptive temperature is relevant for a broader audience, given the prevalence of softmax in modern ML systems.
Missing Important References
I am not an expert on theoretical work around softmax or out-of-distribution generalisation. However, the discussion of the background, the primer on attention heads and Transformers, and the related work on adapting temperature seem complete enough to understand the problem and the proposed method.
Other Strengths and Weaknesses
I think the primary significance of this paper lies in the discussion on softmax limitation. My main concern is discussed in the evaluation criterion area.
Other Comments or Suggestions
None
Dear Reviewer 5ZBY,
Thank you for your careful review and the positive assessment of our contribution! We are very grateful for your comments, and provide our responses below – hoping that they are to your liking!
The authors clearly discuss the dispersion problem of the softmax function. I wonder how big the problem is when we zoom out to the entire transformer, given all the residual connections and normalization, etc.?
This is an excellent question!
We study exactly this, to varying extents, in Appendix B and Appendix C of the original submission. We provide a brief summary of these results for your convenience, and are very happy to discuss them further:
- [Corollary B.1] We prove that the dispersion effect necessarily leads to classification failures in a single-layer Transformer architecture on simple reasoning tasks (such as predicting the maximum).
- [Remark B.2] We make an informal sketch of how Corollary B.1 can be extended to deep Transformer architectures (both BERT- and GPT-style) to show the same kind of classification failures must occur past a certain number of input tokens. This argument explicitly takes into account residual connections.
- [Remark C.1] We prove that, in BERT-style Transformers without residual connections, the situation is particularly dire: when a particular layer disperses, all layers after it will be immediately dispersed as well.
The residual connections play an important role with depth, as evidenced by Remark C.1: they allow a model to “shortcut” a dispersed layer and retain its original embeddings for longer. However, Theorem 2.2 shows that, no matter how many residual connections are used, each individual layer still must disperse past a certain size.
The normalisation layers often play a counter-productive role, which we discussed in Section 3: they clamp the input to a certain expected norm, meaning that there is higher pressure on the key/query matrices (W_K, W_Q) in order to achieve a sufficient logit spread (cf. Proposition 3.1).
However, the evaluation is rather "toy". I understand that max retrieval provides a clean setting; however, with only CLRS-Text beyond it, it is hard to gauge the usefulness of adaptive temperature, or the severity of the problem posed by the dispersed nature of softmax, in the real world.
Thank you for your remarks! Since our study concerns out-of-distribution generalisation specifically, we have focused our analysis on tasks requiring out-of-distribution generalisation (such as CLRS-Text, a collection of thirty challenging algorithmic execution tasks across many problem sizes). In most other static benchmarks, it might be very difficult to measure the distribution shift in the test set.
We also remark that focusing on synthetic execution tasks is the standard approach in papers studying length generalisation in LLMs. As a standard representative we refer to “Exploring Length Generalization in Large Language Models” (Anil et al., NeurIPS’22), which studies only two synthetic problems: parity and variable assignment. In contrast, CLRS-Text studies thirty such problems, with a significant increase in their complexity.
This paper studies the sharpness of the softmax function from a size generalization perspective. The authors regard a function as sharp if its output can be expressed using a constant number of inputs. The authors refer to size generalization as the study of what happens when the function is subject to a larger number of inputs. In this paper, the authors argue that using an adaptive temperature parameter can help preserve sharpness by lowering the temperature enough to reduce entropy while keeping the trained model accurate. More generally, the authors argue in their main theoretical results that it is not possible to preserve the sharpness of the softmax function as the number of inputs grows arbitrarily large.
Questions for Authors
By the last paragraph of Section 2, my impression is that the whole argument of this paper is that there are clear limits to the transformer architecture when taking arbitrarily large inputs. Is that the case, or do you believe that there is a better function than softmax for the purpose that it serves?
Claims and Evidence
My main concern with this paper is not correctness, but rather significance: the theoretical results claimed do not seem surprising or nontrivial to prove for someone working on that line of inquiry. From the examples at the top of Page 2 that the max function is sharp and the average function is not sharp, the fact that softmax is not sharp for an arbitrarily large number of inputs seems evident and proving that result seems in line with a doctoral-level homework exercise.
Moreover, if we are to assume that softmax is sharper with smaller inputs or subject to a lower temperature parameter, then I believe that we lack a definition for proper theoretical discussion: there should be a threshold for the minimum contribution of an input to consider it relevant to the output of the function (and possibly how many inputs significantly contributing to the function output is too many). Otherwise, any contribution of an input should be counted, and then softmax is trivially not sharp.
More generally, any discussion about models when their dimensions tend to infinity changes the nature of the beast. For example, arbitrarily deep or wide neural networks may hold properties that a finite neural network architecture cannot promise. Hence, I believe that the proper framing here should have been about scaling up sharpness with respect to input size - and how to overcome the challenges associated with that.
Methods and Evaluation Criteria
See item above.
Theoretical Claims
See two items above.
Experimental Design and Analysis
See three items above.
Supplementary Material
No.
Relation to Existing Literature
I am not sufficiently familiar with the line of work to which this paper contributes to make a comment on this.
Missing Important References
I am not sufficiently familiar with the line of work to which this paper contributes to make a comment on this.
Other Strengths and Weaknesses
Strength: the authors have a clear writing and frame well some interesting aspects of the theoretical work around attention. I am curious to read more about it following their description.
Weakness: I personally find the terminology "reasoning device" speculative. I would recommend making the discussion more objective without such and related terms.
Other Comments or Suggestions
The use of rephrasing in theoretical statements (the "That is" in Lemma 2.1, Theorem 2.2, and Proposition 3.1) is not adequate. If needed, those can be added either before or after the formal statement (or after the proof), but not in it.
Drawing conclusions inside a theoretical statement (the "Thus" in Proposition 3.2) is not adequate. If relevant, that part should have been a separate corollary after the parent result.
Abstract:
- ' "circuits" ': replace '' with `` before this word
Page 1:
- "does not have a chance": too informal
Page 4:
- In the proof of Theorem 2.2: "for [some choice of] constants"
Dear Reviewer WiLd,
We would like to thank you for carefully considering our paper. While we regret that your initial rating of our paper was negative, we believe you raised important points and that there is a clear discussion to be had, and that we may be able to provide relevant arguments for you to reconsider the relevance of our work.
To that end, we address all of your points in order:
On significance of our results
We do not dispute that this result was not very difficult to prove, but we argue that it is definitely not evident to a significant part of the ICML community. And we hope you agree we should preferably judge a proof's significance by the latter criterion rather than the former; simplicity is not in and of itself bad.
Therefore, we will focus this part of the response on discussing whether our results are evident.
Due to the anonymous nature of the reviewing process, the only concrete evidence we can provide towards this are the reactions of the other three reviewers:
- [Reviewer n5x1] The paper's observation is simple, but thought-provoking.
- [Reviewer jdu5] I appreciate that this work presents the problem well and raises awareness of this issue.
- [Reviewer 5ZBY] The limitation on softmax and adaptive temperature is relevant for a broader audience, given the prevalence of softmax in modern ML systems.
We wish that we could provide more concrete evidence, but we cannot for obvious reasons. Suffice it to say, we have had numerous discussions with our previous collaborators (who are at many varying levels of seniority and expertise about self-attention) about this result, and the overwhelming majority of them initially reacted to our result with surprise; i.e., not expecting that the dispersion effect is guaranteed in softmax. In fact, it was exactly these interactions that compelled us to write this paper in the first place!
It is interesting to ponder why our result is surprising to such a broad audience of AI researchers. Our hypothesis is that this is due to several current trends:
- The overall prevalence of the Transformer architecture and the many expressivity results for it;
- The elevated importance of the dataset for training AI models, and the diminished importance of architecture choice;
- Mechanistic interpretability research, which reverse-engineers sharp behaviours in trained LLMs.
Such trends may easily lead to a naïve intuition that softmax should always be able to pick out the key elements / circuits to apply to the data, so long as we choose the “right data” to train on.
However, the length generalisation setting challenges this preconception, because it by default focuses on evaluating beyond the largest training data point. Further, mechanistic interpretability research typically does not operate in such regimes, and the circuits discovered therein do not generalise to ever-increasing inputs.
We believe our paper plays an important part in grounding the limitations of the softmax function, especially when considering how it is leveraged in modern LLMs. We hope you will agree with this motivation!
On thresholding contributions
We appreciate your comment and believe addressing it will improve the rigour of our argument! In our revision, we will explicitly mention thresholding contributions when defining sharpness, and how our Theorem 2.2 proves that no fixed threshold (ε) is sufficient to maintain sharpness on larger inputs.
On the infinity dimensions and the ‘scaling up sharpness’ framing
To be clear, our work does not assume infinitely deep or wide architectures. We start with the practical assumption of a model of fixed depth & width, and then quantify how its coefficients’ sharpness decays with respect to input size – exactly as you suggested. We commit to adding further sentences around the problem description to make this crystal clear.
Improvements to softmax
Our argument is slightly more nuanced than what you suggested. We suggest that it is the combination of the softmax function and how it is used within Transformers (e.g. tokenisation, global attention, etc.) that causes limitations over arbitrarily large inputs. That is, certainly the Transformer itself could be improved, but there are also possibilities of improving the softmax function itself.
We cited several examples of possible alternative aggregation functions in the Conclusions: linear attention, sigmoidal attention, and stick-breaking attention. There are also proposals such as selective attention (Leviathan et al.), which retain softmax but modify the algorithm which allocates logits in a more size-robust manner. For several of these proposals, Theorem 2.2 would not apply.
We are happy to add this discussion to the revised paper!
Miscellaneous issues
We are happy to correct all minor nits you pointed out, as well as avoid usage of terms like ‘reasoning device’ in a revised paper.
Given the argument of the other reviewers about the relevance of the paper to the community, I will update my score. I hope that the authors make the paper a little more clear and precise, as requested in my review.
Dear Reviewer WiLd,
Thank you very much for taking our response into consideration and improving your score!
I hope that the authors make the paper a little more clear and precise, as requested in my review.
We reiterate our full commitment to incorporating all the clarity and precision changes you requested, along with any other opportunity we find for doing so.
Best, Authors
After a productive discussion, there is consensus among the reviewers that this paper provides a solid contribution to the ML community. I urge the authors to incorporate the feedback by the reviewers, particularly by Reviewer WiLd, into the final version.