PaperHub
Rating: 5.5/10 · Poster · 4 reviewers
Reviewer scores: 4, 2, 3, 3 (min 2, max 4, std 0.7)
TL;DR

The Number Token Loss (NTL) is a regression-like loss on number tokens that augments cross-entropy to improve Language Models in numerical tasks.

Abstract

While language models have exceptional capabilities at text generation, they lack a natural inductive bias for emitting numbers and thus struggle in tasks involving quantitative reasoning, especially arithmetic. One fundamental limitation is the nature of the cross-entropy (CE) loss, which assumes a nominal scale and thus cannot convey proximity between generated number tokens. In response, we here present a regression-like loss that operates purely on token level. Our proposed Number Token Loss (NTL) comes in two flavors and minimizes either the $\mathcal{L}_p$ norm or the Wasserstein distance between the numerical values of the real and predicted number tokens. NTL can easily be added to any language model and extend the CE objective during training without runtime overhead. We evaluate the proposed scheme on various mathematical datasets and find that it consistently improves performance in math-related tasks. In a direct comparison on a regression task, we find that NTL can match the performance of a regression head, despite operating on token level. Finally, we scale NTL up to 3B parameter models and observe improved performance, demonstrating its potential for seamless integration into LLMs. We hope to inspire LLM developers to improve their pretraining objectives and distribute NTL as a minimalistic and lightweight PyPI package *ntloss*: https://ibm.biz/ntl-pypi-repo. Development code for full paper reproduction is available separately.
Keywords

language models, mathematical reasoning, arithmetics, number representation, number token loss

Reviews and Discussion

Review
Rating: 4

The authors start with a clear objective in mind: rather than relying on strategies to patch number-token behaviour, it would be better to treat numbers as a special class and impose a new loss that strongly penalizes the model when its prediction is far off. The authors therefore introduce a new loss, called NTL, to improve model performance on arithmetic tasks.

The authors also designed this line of work to be model-agnostic, so that it can be added without overhead.

Questions for the Authors

How does NTL perform on even larger models (say 7B or 20B)?

It is relatively unclear how the model performs on floating-point numbers.

Can you add a breakdown of where the model fails (maybe numerical ranges or specific types of arithmetic operations) to give more intuition about where NTL can be improved in the future?

Claims and Evidence

The claims are pretty rigorous. The authors provide extensive experiments across different datasets and show that NTL outperforms standard CE loss and additional baselines on numerical tasks. Another critical point shown by the authors is that NTL does not degrade performance on text-only tasks, which is a strong point in favor of the authors' claim that it is easy to plug and play.

Methods and Evaluation Criteria

The evaluation criteria (accuracy, MAE, R²) are appropriate for the tasks being evaluated, and the authors provide a thorough comparison with baseline methods.

Theoretical Claims

No theoretical claim has been made by the authors; for future work, I strongly suggest investigating whether theoretical guarantees can be established for this new loss.

Experimental Design and Analysis

The experimental design makes sense, with a good ablation study to validate the proposed approach. They compare NTL with multiple baselines on different tasks and model sizes.

Supplementary Material

N/A

Relation to Prior Literature

The paper sits within the literature on improving the numerical abilities of language models. The authors cite relevant prior work, such as the Regression Transformer, CoT, and verifiers. I suggest adding symbolic reasoning methods or program-aided language models, so that NTL can be compared with these techniques as well.

Missing Important References

Although this comes from a different area, https://arxiv.org/pdf/2402.01082 provides a new loss to treat numerical tokens. Similarly, https://arxiv.org/abs/2410.03569 shows how to use this loss on very hard mathematical problems.

Other Strengths and Weaknesses

The paper is particularly original and significant. Before integrating it into any large-scale pretraining, it would be beneficial if the authors could train a larger model (say 7B or 20B parameters) to be more conclusive on the pretraining part.

Other Comments or Suggestions

N/A

Author Response

We appreciate your constructive and positive review of our work!

(1) Larger models: We completely agree that evaluating NTL on larger models beyond 3B would provide further evidence regarding its scalability. We aim to do this in future work; however, at present, our computational resources are limited to a standard academic setting, and running experiments on models that don't fit on a single GPU is beyond our capacity for this rebuttal. However, please be aware of the additional experiments on integer multiplication with multiple decoder-only architectures, namely GPT2 (up to 1.5B) and Granite (up to 3B), that confirm the previous results from our paper (see response to R1 aka p8jQ for details).

(2) Floats: Regarding your comment on unclear performance on floating-point numbers: First, note our existing experiment on a proper regression task (rJokes dataset, Table 4 and also Table 5), where labels are in $[0, 11]$. In this experiment NTL matches the performance of a regression head, whereas CE performs substantially worse. Secondly, we added an experiment on a real-world task from chemistry, where the labels are floats with ~2 digits of precision, and NTL outperformed CE (see response to R3 / bnfs). Third, note that DeepMind's Mathematics dataset (see Table 1) contains almost 5M samples (18.7% of all 25M training samples) where labels are floats and not integers. This is particularly pronounced in the extrapolation test (28%), where we see a substantial improvement with NTL over CE. Together, these findings make us confident that the benefits of NTL extend well to floats.

(3) Error analysis: One practical issue for models that tokenize numbers as multi-digit tokens is that some tokens have large numerical values, which disproportionately affects the loss (even if the logit is low), especially if the number tokens are not regularly spaced. We recommend enforcing digit-level tokenization, as this ensures NTL is well-behaved. As a more conceptual mitigation strategy, we note that NTL is not limited to the Euclidean distance between numbers. We will update Eq. (4) in the paper to reflect this and emphasize that distances can be defined in a fully flexible manner. For example, one can squash the distances so that, for a label 0, predicting 9 is not exactly 9x worse than predicting 1 but only 2x. We ran an experiment confirming that this still works better than CE, but slightly worse than vanilla NTL, on multiplying integers with up to 6 digits with GPT2-Large, as measured by Mean Absolute Percentage Error (MAPE):

| Loss | MAPE |
|---|---|
| CE | 0.502% |
| NTL-Squash-2 | 0.491% |
| NTL | 0.485% |
For further details on this experiment see our response "decoder-only" to R1/p8jQ.

Moreover, note that this transformation does not need to adhere to the mathematical definition of a distance; the user could provide any pairwise distance matrix between number tokens. This allows handling even exotic cases like modular arithmetic (for details see comment from R2 aka fuzE). We will clarify this in the final manuscript.
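For illustration, one way such a squashed cost matrix could look (a sketch only; the exact transformation behind NTL-Squash-2 is not given here, and the power-law exponent log 2 / log 9 is just one choice that makes a distance of 9 cost only twice a distance of 1):

```python
# Sketch: a "squashed" pairwise distance matrix over the digit tokens 0-9.
# Assumption: distances are squashed via d**alpha with alpha = log(2)/log(9),
# so that dist(0, 9) costs only 2x dist(0, 1) instead of 9x.
import math
import torch

values = torch.arange(10.0)
euclidean = (values[:, None] - values[None, :]).abs()   # standard |i - j| cost matrix

alpha = math.log(2) / math.log(9)
squashed = euclidean ** alpha                           # maps 0 -> 0, 1 -> 1, 9 -> 2

# Any such matrix can serve as the pairwise cost between number tokens,
# e.g. to down-weight large errors or to encode non-Euclidean structure.
```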

Additionally, we conducted a detailed error analysis on the GSM8K dataset to examine predictions for numbers ending with specific digits (0–9), comparing CE and NTL. The error histograms (see last_digit_vs_distance_histogram.png) reveal a consistent pattern across all digit groups: NTL error distributions are narrower and concentrated around zero, confirming improved numerical reasoning and lower systematic biases compared to CE.

We further investigated errors specifically at digit boundaries (e.g., numbers ending in 0 or 9) on the GSM8K dataset. The table below breaks down model errors at specific digit boundaries and highlights how often predictions are overestimations, underestimations, and exact matches.

| Sample Type | Metric | CE | NTL |
|---|---|---|---|
| Ends with 0 | Overestimation Rate | 28.4% | 29.4% |
| | Underestimation Rate | 56.5% | 51.1% |
| | Exact Match Rate | 15% | 19.4% |
| Power of 10 | Overestimation Rate | 54.4% | 49.1% |
| | Underestimation Rate | 24.6% | 24.6% |
| | Exact Match Rate | 21.1% | 26.3% |
| Ends with 9 | Overestimation Rate | 33.3% | 38.8% |
| | Underestimation Rate | 46.7% | 32.6% |
| | Exact Match Rate | 20% | 28.6% |

The results show that NTL achieves a more balanced error distribution. Its exact-match rate is consistently higher, particularly so for samples ending with the 9 token, implying that NTL handles those digit boundaries better.

(4) Literature: Thanks for sharing the literature on modular arithmetic and cryptography. We will highlight these works in the final version of the paper.

Reviewer Comment

Thanks for these answers and for the MAPE, this is exactly what I had in mind. I will keep the score!

Review
Rating: 2

The paper introduces a Number Token Loss (NTL), a regression-like loss function designed to improve the numerical reasoning capabilities of Language Models (LMs). The core contribution is twofold:

NTL-MSE: A loss function that computes the Mean Squared Error (MSE) between the numerical value of the label and the predicted probability distribution of number tokens.

NTL-WAS: A loss function based on the Wasserstein-1 distance, measuring the difference between predicted and true distributions of number tokens.

The authors claim that NTL improves arithmetic tasks, can match regression heads, and scales well to large models while adding minimal computational overhead. Empirical results on mathematical reasoning tasks, regression benchmarks, and large-scale models (T5-3B on GSM8K) support these claims.
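For concreteness, a minimal PyTorch sketch of the two flavors for digit-level tokenization (an illustration reconstructed from this summary, not the authors' implementation or the ntloss package API; `digit_ids`, the vocabulary ids of the single-digit tokens "0".."9", is an assumed input):

```python
import torch
import torch.nn.functional as F

def ntl_mse(logits, labels, digit_ids):
    """MSE between the expected digit value and the true digit value."""
    values = torch.arange(10.0, device=logits.device)   # numeric value of "0".."9"
    mask = torch.isin(labels, digit_ids)                 # positions whose label is a digit
    if not mask.any():
        return logits.new_zeros(())
    p = logits[mask][:, digit_ids].softmax(-1)           # probability mass over digit tokens
    predicted = (p * values).sum(-1)                     # expected value, i.e. a dot product
    target = (labels[mask].unsqueeze(-1) == digit_ids).float() @ values
    return F.mse_loss(predicted, target)

def ntl_was(logits, labels, digit_ids):
    """Wasserstein-1 between the predicted digit distribution and the one-hot target."""
    values = torch.arange(10.0, device=logits.device)
    mask = torch.isin(labels, digit_ids)
    if not mask.any():
        return logits.new_zeros(())
    p = logits[mask][:, digit_ids].softmax(-1)
    q = (labels[mask].unsqueeze(-1) == digit_ids).float()
    # For 1-D distributions on equally spaced values, W1 equals the L1 distance of the CDFs.
    return (p.cumsum(-1) - q.cumsum(-1)).abs().sum(-1).mean()
```

During training, either term would be added to the usual cross-entropy loss with some weight λ.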

Questions for the Authors

For NTL-MSE, does the model tend to predict the highest-probability digit slightly below the true number?

If the true number is 9, does the model predict 8 with a probability mass summing to 9?

How does NTL-WAS affect model confidence in its numerical predictions?

Does the Wasserstein loss lead to over-smoothing, making models less confident?

In practice, you need to extract all numbers in the training data to apply your NTL. What happens to numbers like phone numbers or years?

If you directly project the last hidden state to a number using a linear layer, will your method still be better? This should be a baseline.

Claims and Evidence

NTL improves arithmetic performance compared to standard CE loss. The results on the DeepMind Mathematics Dataset (Table 1) show that both NTL variants improve accuracy and reduce MAE. The improvement is most pronounced in interpolation tasks, while extrapolation benefits are more limited.

On the rJokes dataset, NTL matches a regression head in RMSE and Pearson correlation (Table 4).

NTL does not degrade performance on non-numerical tasks. The MultiRC dataset results (Table 7) confirm that adding NTL does not harm text generation.

NTL scales well to large models. The authors apply NTL to a 3B parameter T5 model on GSM8K, improving accuracy from 13.5% to 17.7% (Table 8). The computational overhead is minimal (Figure 4).

Methods and Evaluation Criteria

The experimental design is mostly rigorous but has a few gaps:

Appropriate Choice of Tasks: The use of arithmetic benchmarks (DeepMind Mathematics, GSM8K) is well-justified. However, evaluating real-world numerical tasks (e.g., time series, finance, physics) would strengthen applicability.

Evaluation Metrics: Accuracy, MAE, and R² are appropriate but could be complemented by a finer analysis of failure cases, such as systematic bias towards certain number magnitudes.

Theoretical Claims

The theoretical justification for NTL is mostly sound:

Cross-entropy's failure in numerical tasks is well-motivated. The issue that CE treats numbers as categorical rather than ordinal is a widely recognized problem.

NTL-MSE's non-uniqueness issue (where a sum of probabilities can approximate the correct number without a peaked distribution) is correctly identified, and NTL-WAS addresses this.

Wasserstein-1 distance as a better loss function is conceptually strong. However, the claim that NTL-WAS is always preferable to NTL-MSE is not fully substantiated, as it depends on task properties.

Experimental Design and Analysis

Tokenization Choices: While single-digit tokenization improves performance, it is unclear whether models trained with multi-digit tokenization + NTL can reach comparable results.

Supplementary Material

I reviewed the supplementary materials, including:

Algorithmic details for NTL: Pseudo-code for NTL-MSE and NTL-WAS.

Additional ablation studies: Testing different λ values and combining NTL with Gaussian Cross Entropy.

Implementation details: Training settings, tokenization choices.

The supplementary material is well-structured and clarifies implementation details.

Relation to Prior Literature

The paper positions itself well within the literature on numeracy in LMs.

Missing Important References

n/a

Other Strengths and Weaknesses

Strengths: Novel and simple loss function that is easy to integrate. Comprehensive experiments covering multiple architectures and tasks. Solid theoretical motivation for why CE is suboptimal for numerical tasks.

Other Comments or Suggestions

n/a

Author Response

Thanks for the constructive feedback. Below are detailed responses & new analyses:

  1. Real-world task from physics: To demonstrate applicability to scientific data, we evaluate NTL on estimating molecular solubility, as studied by the Regression Transformer (RT). Each molecule is a SMILES string & the goal is to predict its solubility as a float. Following the RT paper, we report performance over 3 random splits. Our results show, again, that training with NTL significantly improves performance over standard CE loss, reducing RMSE and increasing $R^2$:

| Loss | RMSE | $R^2$ |
|---|---|---|
| CE | 1.08 ± 0.16 | 0.72 ± 0.07 |
| NTL | 0.91 ± 0.07 | 0.80 ± 0.03 |

Our approach outperforms the baselines from the RT paper, including a random forest (RMSE: 1.16) and XGBoost (RMSE: 1.05). Regarding time series: this can be modeled with NTL in principle, but time series transformers do not typically employ token-based generation like language models do. Since NTL is an improvement to cross entropy, we believe that studying its utility for time series forecasting goes beyond the scope of this paper.

  2. "The claim that NTL-WAS is always preferable to NTL-MSE is not fully substantiated, as it depends on task properties": Indeed, from a theoretical perspective, NTL-WAS is always preferable since it has a unique minimum, as shown in Figure 3. However, there is one advantage of NTL-MSE: it explicitly computes a numerical value from the logits (via dot product) during loss calculation. This float can be combined with arbitrary loss functions (MSE, MAE, etc.), or even be used at inference time for decoding numbers as a dot product over all logits (rather than via beam search). In the future we aim to test an expression loss where all parts of a mathematical expression are transformed to numbers in this way & the loss penalizes inconsistencies in the equation. Thus, in some use cases NTL-MSE can be advantageous, although its theoretical properties are weaker. We will clarify this in the manuscript.

  3. Extrapolation benefits are more limited: Note that the extrapolation performance with NTL doubles (from 5% to 10%) on the arithmetic dataset. In relative terms, the extrapolation benefits are stronger than the interpolation benefits (Table 2). To strengthen the point, we ran new experiments with 2 decoder-only models (GPT-2, IBM Granite) on a multiplication dataset. NTL again outperforms CE in both interpolation & extrapolation, but particularly for extrapolation tasks. For details, see response "Decoder-only" to R1/p8jQ.

  4. Finer analysis of error cases: Please see the error analysis for R4/4obq.

  5. Effect of NTL-WAS on confidence: We analyzed the logit distributions over all number tokens for simple arithmetic tasks throughout training: NTL increases the model's confidence in its numerical predictions, particularly in early training. NTL-WAS yields logits that are more sharply centered around the correct number compared to CE. See plot at: http://bit.ly/445tlAD

  6. Tendency to predict the highest-probability digit slightly below the true number for NTL-MSE? Since NTL-MSE computes the dot product, or weighted sum, there is no reason why it should systematically underestimate the value. We also confirmed this empirically; see the analysis right above: logits are more centered for NTL, plot at: http://bit.ly/445tlAD

  7. Extraction of numbers in training data & handling of phone numbers: In practice, we don't extract numbers explicitly but use an indicator vector from the tokenizer to index the logits of the number tokens to compute NTL. NTL is not suitable in cases where numerical proximity is irrelevant (e.g., phone numbers). However, such tokens could either be excluded from NTL or a squashing transformation could be applied (for details see "Error Analysis" to R4/4obq). Such cases are exceptions rather than the norm. Also note that we still minimize CE for all tokens; NTL is just an extra loss term. For many common number types, such as years, quantities, and measurements, numerical proximity matters and NTL is more meaningful.

  8. Unclear results for NTL + multi-digit tokenization (MDT): We already ran this experiment on a regression task (Table 5). Single-digit tokenization (SDT) and NTL yield complementary benefits: NTL+MDT is better than just MDT, but SDT+NTL is the best.

  9. Projection of the last hidden state through a linear layer: We already ran this experiment, see Section 4.3 ("NTL can match regression models"). We even used a more complex regression head than just a simple linear layer (i.e., 2 linear layers with dropout). Nevertheless, NTL performs on par with the model trained with this regression head (see Table 4). This shows its competitiveness, considering that it is an LM that can still be used on non-numeric tasks.

Reviewer Comment

Thanks for your detailed response. I have a follow-up question: How practical is it to apply NTL in large-scale LLM pretraining, given that it requires knowing which tokens represent numbers? Specifically, how do you obtain the indicator vector over a large, noisy corpus? And if NTL still needs such a mechanism, isn’t that functionally equivalent to selectively activating a regression head — meaning NTL isn't truly general-purpose without explicit number detection?

In the molecular solubility task, why not use a dedicated regression head instead of NTL?

Author Comment

Thank you for your follow up questions! We appreciate the opportunity to clarify further:

(1) Practicality of NTL in large-scale pretraining:

Applying NTL is extremely practical for large-scale LLM pretraining. Access to the data corpus is not needed; we only need to run a single pass over the tokenizer instance before training. This offline step identifies which tokens have numerical values by trying to convert each token string to a float. The result is the indicator vector you mention. This step is basically instant: it takes 80 ms for Llama3 or DeepSeekV3, which have vocabularies of ~128K tokens. It also requires only 10 lines of code (see anonymous code: https://bit.ly/4lkvpee). During training, no scanning is needed; we just index the logits at the number-token positions. We also already showed in Figure 4 that there is no overall runtime overhead.
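A rough sketch of how such an indicator vector could be built (an assumption-laden illustration, not the authors' anonymized code; it assumes a Hugging Face tokenizer and strips the "Ġ"/"▁" whitespace markers used by BPE/SentencePiece vocabularies):

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # any tokenizer works
vocab_size = len(tokenizer)

is_number = torch.zeros(vocab_size, dtype=torch.bool)
token_value = torch.full((vocab_size,), float("nan"))

for token, idx in tokenizer.get_vocab().items():
    text = token.lstrip("Ġ▁")             # strip BPE / SentencePiece whitespace markers
    try:
        token_value[idx] = float(text)     # tokens like "inf"/"nan" may need extra filtering
        is_number[idx] = True
    except ValueError:
        pass                               # not a number token

# `is_number` indexes the logits of number tokens during NTL computation;
# `token_value` holds their numerical values.
```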

(2) NTL vs. selective regression head:

"Selectively activating a regression head" would require model modifications (extra parameters) and a custom decoding strategy (which head to use when?), and is thus not general-purpose. NTL, by contrast, has no notion of heads. It is fully general-purpose, e.g., it can be added as a plug-and-play loss function to any LM without requiring any model modifications (for a minimalistic training example with Llama see anonymous code: https://bit.ly/4iXLtBp). Please re-read the end of our introduction for details (page 2 bottom, line 089 following).

Thus, calculating a regression-like loss directly on the token head is probably the most general-purpose approach, exactly because it requires zero modifications to the model and tokenizer. The loss class only needs access to the tokenizer once, to extract the tokens corresponding to numbers. So NTL is fully practical for large-scale LLM pretraining, whereas carrying an extra regression head is impractical.

Note that our approach is entirely novel. Despite its simplicity nobody has yet demonstrated how a regression-like loss can be calculated directly on the logits produced by a standard LM head.

(3) Molecular solubility regression example:

We did not include a regression head in this rebuttal experiment because we already show an extensive comparison of NTL against an explicit regression head in the main body of our paper (see Table 4 and Table 5). Those experiments already demonstrate that performance of a regression head can be matched by NTL. Beyond this, as we said above, please note that regression heads are generally impractical in LMs and not commonly used. Taken together, our main competing method should be the standard cross entropy loss, because this is the loss used to train general-purpose LMs in practice.

We would appreciate if you would reconsider your score in light of these clarifications.

Review
Rating: 3

This study proposes Number Token Loss (NTL), a new regression-like loss to better handle numbers in texts. With the NTL loss, the prediction of a number token is determined by the weighted average of numbers with their softmax probability from logits, and the NTL loss measures MSE between the weighted average and the ground truth. This approach is advantageous over prior methods in that a) it is model agnostic with minimal assumptions about the vocabulary and b) it comes with minor computational overhead. The experiments demonstrate its superiority in accuracy over the standard cross-entropy based approach and several prior methods including Gaussian Cross Entropy, Regression Transformer, and xVal, particularly in classification tasks.

update after rebuttal.

I appreciate the authors' full elaborations and answers to my concerns, which greatly deepened my understanding of their work. The explanations and additional results address my concerns, and I believe this work is worth presenting at the main conference.

Questions for the Authors

Please refer to the other cells, but my major concerns are: a) Are the baseline methods (e.g., xVal) implemented and compared in a fair manner? b) The applicability of the proposed method to modular arithmetic and other hard math tasks is not tested or discussed.

Claims and Evidence

The claims are generally well tested by experiments. The experiments cover classification and regression tasks, and they examine the variants of the NTL loss, the scaling characteristics on model size, and training speed.

Methods and Evaluation Criteria

The NTL loss naturally implements the idea that incorrect predictions on number tokens should be evaluated with some distance metric, so that the model can know whether its predictions were close or not. While this idea itself has been tested by several methods including Gaussian Cross Entropy and xVal, NTL is more advantageous in that it works within the framework of classification. This allows users to introduce minimal modification in their code. The experiments compare these methods with standard metrics including prediction accuracy, MAE, and R² scores.

A caveat of the proposed method is that it assumes the Euclidean topology or its variants; namely, it assumes that 2 and 3 are closer than 2 and 16. However, this is not always the case, and this should be carefully discussed. For example, when one performs modular arithmetic, or operations that involve periodicity, these numbers are not equipped with such a distance (it violates the triangle inequality). Thus, in such a case, injecting the Euclidean topology should be harmful to learning, and the cross-entropy loss should perform better.

I agree that NTL is useful, but the authors should make readers aware of this.

Theoretical Claims

This study does not contain any theoretical claims.

Experimental Design and Analysis

I checked the setup of experiments. Overall, the experiments investigate the proposed method and variants well. There are several unclear points.

[l.190, right] "this scheme (of xVal) is incompatible with T5": Appendix A.5 explains that xVal is not compatible with the T5 model, but then the authors should test it on other reasonable models. The experiments may underestimate the performance of xVal since they do not use xVal as is.

[l.212, left] "The effective number range of xVal is limited to [-5, 5] due to the combination of its scaling of the number token embedding and the layer-norm in the backbone." Over the experiments, the performance of xVal is very poor. This part needs more explanation. Further, Tables 1(a) and 1(b) do not show the accuracy of xVal. While xVal does not directly predict discrete tokens, the accuracy can be computed simply by nearest neighbor. Regarding "To take this into account, we scale our dataset for xVal with log(1+x)": are the results in Table 1 using this logarithm map, or not? I assume not, but then where can I find the results?

Supplementary Material

I read part of it (e.g., Appendix A.5) to find if it answers my concerns and questions.

Relation to Prior Literature

The key contributions of this study relate to the studies that address number tokens. This study addresses general documents and basic math tasks, but it also relates to the literature on hard mathematical problems as given in the next cell.

Missing Important References

Relevant to the earlier comment on the modular arithmetic, I encourage the authors to include the literature on modular arithmetic, or more broadly arithmetic/symbolic computation tasks.

  1. Learning the greatest common divisor: explaining transformer predictions, Francois Charton, ICLR'24
  2. Impact of Pretraining Term Frequencies on Few-Shot Numerical Reasoning, Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, Sameer Singh, ACL'22
  3. Learning to compute Gröbner bases, Hiroshi Kera, Yuki Ishihara, Yuta Kambe, Tristan Vaccon, Kazuhiro Yokoyama, NeurIPS'24
  4. Learning Label Encodings for Deep Regression, Deval Shah & Tor M. Aamodt, ICLR'23

The first one discusses the dependency of successful GCD calculation on the base used to represent numbers. The second one provides insight into number embeddings from the frequency perspective. The third introduces a regression head to regress the coefficients in polynomials (similar to xVal) and observes that it performs particularly poorly on finite-field coefficients (i.e., the case involving modular arithmetic). The last one is not about Transformers, but regression by classification has been widely studied, so covering this literature makes this study richer.

Other Strengths and Weaknesses

Most of the major strengths have been mentioned above, but I appreciate the proposed method in terms of its compatibility with the standard classification-based pipeline. It is natural to introduce regression to provide richer supervisory signals for incorrect number token predictions, but simply introducing a regression head requires additional modifications in the code, such as in auto-regressive generation. The proposed method does not require this.

Other Comments or Suggestions

Nothing. The paper is well-written and easy to follow.

Author Response

Thanks for your valuable feedback! We appreciate your recognition of NTL’s effectiveness and practicality, considering its compatibility with standard pipelines. To address your questions:

(1) Non-Euclidean topology:

Thanks a lot for raising this interesting point regarding transferability beyond Euclidean topology. We agree that our specific loss (Equation 4) makes this assumption, but note that the general NTL-WAS formulation (Equation 2) uses an arbitrary cost function $c$ defining the pairwise cost between tokens. This allows covering more general relationships between the numbers.

We confirmed this experimentally. Instead of using Euclidean distances, we squashed the distances so that, for a label 0, predicting 9 is not exactly 9x worse than predicting 1 but only 2x. Results showed performance better than CE but slightly worse than vanilla NTL for multiplying integers with up to 6 digits (GPT2, measured by MAPE; details in response Error Analysis for R4/4obq).

Moreover, regarding modular arithmetic: Consider the modular addition task, where $y(n,m) = (n + m) \bmod p$, as described in "Grokking modular arithmetic" (Gromov, 2023). Here, a reasonable cost function is $c(y_1, y_2) = \min(|y_1 - y_2|,\ p - |y_1 - y_2|)$, which accounts for the wrap-around / periodicity of numbers in modular arithmetic.
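A small sketch of such a wrap-around cost matrix (p = 7 is an arbitrary illustrative choice, not a value from the paper):

```python
# Modular cost: the distance between residues wraps around,
# c(y1, y2) = min(|y1 - y2|, p - |y1 - y2|).
import torch

p = 7
residues = torch.arange(p, dtype=torch.float32)
diff = (residues[:, None] - residues[None, :]).abs()
modular_cost = torch.minimum(diff, p - diff)

print(modular_cost[0])   # tensor([0., 1., 2., 3., 3., 2., 1.]) -- 6 is as close to 0 as 1 is
```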

(2) Extended explanation for xVal:

We have carefully revisited the corresponding section and would like to clarify several points.

  1. xVal Incompatibility with T5
    As noted in Appendix A.5 of our paper, xVal multiplies the [NUM] token embedding $X$ by the number value $a$. In T5, however, a per-sample pre-layer normalization is applied immediately after the embedding, which effectively removes the scaling by $a$. Specifically:
$$\frac{aX - E[aX]}{\sigma(aX)} = \frac{aX - aE(X)}{\sqrt{a^2 E(X^2) - a^2 E(X)^2}} = \frac{X - E(X)}{\sigma(X)}$$

Hence, under T5’s architecture, all numbers collapse to the same embedding, making xVal incompatible with T5. Consequently, we do not use xVal with T5 in our experiments. Instead, we follow the original xVal encoder architecture (as in the xVal paper).
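A quick numerical check of this cancellation (a standalone illustration with standard LayerNorm; T5 actually uses an RMS-style norm, for which the same argument applies):

```python
# Scaling an embedding by a positive factor `a` is undone by per-sample
# normalization, so all number values collapse to the same vector.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(16)                       # [NUM] token embedding X
for a in (0.5, 3.0, 1000.0):              # different number values a
    normed = F.layer_norm(a * x, x.shape)
    print(a, torch.allclose(normed, F.layer_norm(x, x.shape), atol=1e-4))  # prints True
```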

  2. Limited Dynamic Range of xVal & Log-scaling

    Even in the original xVal architecture, the range of values xVal can process meaningfully is limited by the layer normalization that follows the positional embedding step. For further information, please see the xVal paper, Section 2 ("Implicit normalization via layer-norm"). Therefore, xVal normalizes each value to [-5,5] prior to training to mitigate this issue.

    We argue that this approach cannot be applied in practice, since in real texts the range of numbers is not known in advance, and thus a simple min-max normalisation to [-5,5] prior to training or inference is not really practical.

    Therefore we opted for a simpler approach in our experiments: applying a signed log⁡(1+x) transformation to all numeric inputs. This avoids the overhead of parsing and re-scaling each number to [-5,5] prior to training, but it also has the drawback that large numbers are squashed in the embedding space, making fine-grained distinctions difficult for the model.
    Thus: Yes, the results in Table 1 do use this logarithmic mapping. We will clarify this explicitly in the table caption and experimental setup.

  3. Experimental results on the dataset from the xVal paper
    For a direct comparison without any modifications to the xVal processing, we repeated the 3-digit multiplication experiment from the xVal paper. Again, our model beats xVal (see response Simpler baseline for R1/p8jQ).

  4. Accuracy of xVal Predictions
    We appreciate your request for explicit accuracy metrics and your idea of using the nearest neighbor for xVal predictions.

    When rounding xVal predictions to match the decimal places of our dataset, the resulting accuracies are quite low:

    • Interpolate: 0.052
    • Extrapolate: 0.018

    If we reduce precision by rounding to only two decimals, the accuracy improves somewhat, but remains modest:

    • Interpolate: 0.096
    • Extrapolate: 0.058

    These findings underscore that xVal struggles with larger numbers in particular.

(3) Hard math tasks:

While the paper indeed only covers tasks related to Euclidean topology, we respectfully disagree that it does not include "hard math tasks". E.g., the extrapolation task of DeepMind's math dataset is very difficult (maximal accuracy of 10%, see Table 3). It includes a large variety of tasks, covering not only arithmetic but also algebra, number conversions, and polynomials. Furthermore, the GSM8K dataset (see Table 8) is an accepted benchmark for reasoning, and even commercial LLMs cannot solve it perfectly so far.

(4) Literature:

Thanks for sharing additional references about arithmetic & number representation in LMs. We will add those to the final version of the manuscript.

Reviewer Comment

I appreciate the authors' answers to the details. The explanation and additional results address my concerns so I'll keep my positive score.

Author Comment

Thanks for the response. We're glad that all your concerns have been addressed and look forward to updating the final paper, if it gets accepted.

Since our paper has received very tight scores overall, we would truly appreciate it if your score fully reflected your support of our paper.

Review
Rating: 3

This paper introduces Number Token Loss (NTL), a loss function designed to improve numerical reasoning in language models (LMs). The core idea is that standard cross-entropy (CE) loss treats numbers as categorical variables, disregarding numerical proximity. NTL aims to address this by incorporating numerical relationships into training, proposing two variants:

  1. NTL-MSE: Uses Mean Squared Error (MSE) between predicted number token distributions and the ground truth.
  2. NTL-WAS: Uses the Wasserstein-1 distance to align predicted and target number distributions.

The authors argue that NTL is model-agnostic, computationally efficient, and improves numerical reasoning while maintaining performance on text-based tasks. Experiments on mathematical tasks and a real-world regression dataset suggest that LMs trained with NTL perform comparably to those with dedicated regression heads.

=======

update after rebuttal.

I appreciate the authors for additional experiments on a simple arithmetic baseline and on decoder-only models. Although I am not fully convinced about the results on the effectiveness of the claimed methods, I would like to increase my score to 3.

Questions for the Authors

  1. For digit-level tokenization, how does NTL handle edge cases where predictions differ by a single digit? For example, if the ground truth is "20" but the model predicts "19", does NTL penalize this more than it should because "9" and "0" are numerically distant tokens? Can the authors analyze error cases to determine whether this issue occurs frequently?

  2. How does NTL affect learned number embeddings? A visualization of how number tokens cluster in embedding space (e.g., using PCA) would help illustrate whether NTL improves numerical representation learning. For example, it is known that training on CE loss leads to the embeddings of number tokens to form a circular shape with PCA.

  3. Can NTL be used in decoder-only architectures? Since the method is only tested on T5, it is unclear if it generalizes to decoder-only transformers like GPT.

  4. Does NTL improve sample efficiency for learning arithmetic tasks? Does it allow models to learn numerical relationships with fewer training samples compared to CE?

Claims and Evidence

Their claim that NTL improves performance in arithmetic is clear and convincing.

  1. NTL improves performance on arithmetic tasks
  • Supported by empirical results from the DeepMind Mathematics dataset, showing improved accuracy and lower error rates compared to CE.
  • NTL shows consistent improvements in both interpolation and extrapolation tests.
  • However, a straightforward baseline using simple arithmetic tasks is missing.
  2. NTL does not degrade text generation performance
  • MultiRC dataset experiments confirm that NTL does not negatively impact text-based tasks.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

The paper does not present formal theoretical claims.

Experimental Design and Analysis

Yes, I assessed the validity of the experiment designs.

The experiments are well-designed, with strengths including:

  • Model scale comparisons: Results on both small-scale (T5-small) and larger models (T5-3B) suggest scalability.
  • Computational efficiency analysis: Confirms that NTL adds minimal overhead (<2%).

However, some gaps remain:

  • Limited alternative baselines: While xVal and Regression Transformer are tested, other numeracy-enhancing strategies (e.g., GCE, continuous number embeddings) are not considered.
  • No decoder-only architecture evaluation: The method is only tested on an encoder-decoder (T5) model. It is unclear whether NTL applies effectively to decoder-only models like GPT-style transformers.

Supplementary Material

I only roughly checked the experiment settings and their pseudo-code algorithm.

Relation to Prior Literature

The paper situates itself within work on:

  • Numeric representations in LMs (Geva et al., 2020; Golkar et al., 2023)
  • Mathematical reasoning and arithmetic in LMs (Cobbe et al., 2021; Dziri et al., 2024)
  • Tokenization strategies for numbers (Born & Manica, 2023)

Missing Important References

Some relevant references are not discussed: Number embeddings:

  • Do NLP Models Know Numbers? Probing Numeracy in Embeddings (Wallace et al., 2019)
  • Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs (Singh and Strouse, 2024)

Sequential Predictions, Training Objective:

  • Teaching Arithmetic to Small Transformers (Lee et al., 2023)
  • The Pitfalls of Next-Token Prediction (Bachmann and Nagarajan, 2024)
  • Length Generalization in Arithmetic Transformers (Jelassi et al., 2023)

Other Strengths and Weaknesses

Strengths

  • The method is straightforward and easy to implement.
  • The writing is clear and well-organized.
  • The empirical results show consistent improvements across tasks.

Weaknesses

  • The method is only evaluated on an encoder-decoder model (T5). It is unclear if NTL applies to decoder-only architectures.
  • The connection between NTL and learned numerical representations is not analyzed.
  • The paper lacks a direct comparison with Gaussian label smoothing, which is another method for handling numeric proximity.

Other Comments or Suggestions

  • Synthetic arithmetic tasks: Simple arithmetic tasks, such as those from Length Generalization in Arithmetic Transformers (Jelassi et al., 2023), could serve as a baseline. This would allow comparison of NTL's performance, sample efficiency, and training efficiency.
  • Applicability to decoder-only models: The paper should test whether NTL works for autoregressive models like GPT.
Author Response

Thanks for the constructive feedback. We are delighted that you agree that NTL shows consistent improvements. To clarify your questions, we ran additional experiments with decoder-only models on “simpler arithmetic tasks” (multiplication):

(1) Decoder-only: NTL is simply a loss function and can be used with any model, also beyond Transformers. To prove our point, we constructed an arithmetic task, much akin to the length generalization task by Jelassi et al. (2023): multiplication of 2 numbers with $k$ and $l$ digits, with $k, l \in [1..5]$ in training and $k, l \in [1..6]$ in evaluation. The model was trained to answer the question: "What is the result of multiplying x with y?". We tested 2 different decoder-only models (GPT2 and IBM's Granite), both in varying sizes from 125M to 2B parameters. We report the mean absolute percentage error (MAPE) separately for unseen interpolation (up to $5 \times 5$ digits) and extrapolation ($m \times 6$ digits) samples:

| Model | Size | Interpolation CE | Interpolation NTL | Extrapolation CE | Extrapolation NTL |
|---|---|---|---|---|---|
| GPT2 Small | 125M | 0.55% | 0.49% | 1.11% | 1.00% |
| GPT2 Medium | 350M | 0.43% | 0.42% | 0.82% | 0.82% |
| GPT2 Large | 774M | 0.39% | 0.37% | 0.76% | 0.75% |
| GPT2 XL | 1.5B | 0.43% | 0.40% | 0.83% | 0.82% |
| Granite 3.2 | 2B | 0.35% | 0.21% | 0.60% | 0.42% |
| Granite 3.1 | 1B | 0.28% | 0.15% | 0.68% | 0.23% |

NTL works consistently better for 2 different decoder-only architectures across all model sizes. Length generalization is also improved for NTL, see detailed results (digit by digit and extra metrics): http://bit.ly/41Sx8jp

We hope that this comprehensive experiment with 6 models & 2 backbones rules out your concern regarding generalizability to decoder-only models.

(2) Simpler baseline task & limited baselines: We had already run experiments on the (simple) 3-digit multiplication problem from xVal. But since this dataset is too easy, we didn't show it in the paper, in favor of more challenging experiments. This comparison also includes number-encoding strategies other than xVal. We will add this table to the final Appendix.

| Encoding | R² Value |
|---|---|
| P10 | 0.9989 |
| P1000 | 0.9997 |
| B1999 | 0.9998 |
| FP15 | 0.7119 |
| xVal | 0.9986 |
| T5 CE | 0.999934 |
| T5 NTL-WAS | 0.999997 |
| T5 regression head | 0.999891 |

(3) Comparison with Gaussian label smoothing: Note that we already reported results for Gaussian label smoothing (GCE) in the submitted version (Table 3). We add a more direct comparison below. The results on the arithmetic dataset show that GCE combined with NTL yields best performance, showing that both methods are complementary.

Interpolation test set

| GCE | σ | NTL | Accuracy | MAE | R² |
|---|---|---|---|---|---|
| X | – | X | 0.34 | 2.15 | 0.95 |
| X | – | O | 0.43 | 0.91 | 0.99 |
| O | 0.5 | X | 0.42 | 0.95 | 0.99 |
| O | 0.5 | O | 0.48 | 0.76 | 0.99 |

Extrapolation test set

| GCE | σ | NTL | Accuracy | MAE | R² |
|---|---|---|---|---|---|
| X | – | X | 0.05 | 61.92 | 0.61 |
| X | – | O | 0.10 | 58.18 | 0.68 |
| O | 0.5 | X | 0.10 | 58.55 | 0.65 |
| O | 0.5 | O | 0.10 | 66.97 | 0.59 |

(4) Additional questions

  1. Single-digit edge cases (19 vs 20 vs 21): Yes, with single-digit tokenization, NTL would give a higher loss for 19 than for 21 since 9 is far from the ground truth. However, cross entropy has the same problem: The loss for 19 is higher than for 21, since two tokens are incorrect. However, NTL also works with multi-digit tokens, where such a case would not occur. Nevertheless, we looked into the frequency of such cases. Please see the response to R4 aka 4obq for details.

  2. Number embedding: A PCA of the number token embeddings of CE and NTL did not show significant differences; the embeddings were indeed roughly circular in both cases. We will add the plots to the final appendix.

  3. Continuous number embeddings: You mention that we lack such a comparison; however, xVal does use continuous number embeddings (see R2/fuzE for details).

  4. Sample efficiency: As suggested, we analyze the sample efficiency for the decoder-only digit multiplication task described above. We show the evolution of the MAPE during training. As expected, the errors decrease much faster for NTL than for CE loss.

| MAPE @ Epoch | 1 | 5 | 10 | 20 | 40 | 100 |
|---|---|---|---|---|---|---|
| CE | 3.1% | 2.0% | 1.6% | 0.6% | 0.7% | 0.4% |
| NTL | 2.7% | 1.3% | 1.1% | 0.9% | 0.5% | 0.3% |

This corresponds to 3.43 epochs on average to achieve a MAPE < 0.5% with CE loss, compared to 2.55 epochs with NTL. The difference is more pronounced for more difficult multiplications; if of interest, see the figure at: bit.ly/4lc6wkY

Final Decision

The paper proposes two new losses for language modeling that are meant to encourage better predictions on numerical tasks, by leveraging the ordered structure of numbers, either through an expectation of predicted numbers with an MSE loss, or through a specific choice of transport costs with a Wasserstein loss. The authors obtain encouraging results on mathematical and scientific datasets.

The reviewers were generally positive, particularly after the author response, thus I am leaning towards acceptance. That said, some additional experiments on decoder-only models would be good to include, and a study of sample/compute efficiency and scaling laws would be helpful here to assess if the proposed losses should actually be preferred to discrete tokenization in different regimes. The authors should include the additional results and datasets provided during the rebuttal in the final version, if the paper is accepted.