Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Language models are surprisingly robust to non-canonical tokenizations of the input, which can even lead to improved performance
Abstract
Reviews and Discussion
This paper presents a study of how non-canonical tokenization of the input text affects language models’ ability to provide correct outputs. First, the authors show that the instruction-tuned versions of Qwen-2.5-7B, Llama-3.1-8B, and Olmo-2-7B exhibit only a small performance degradation due to non-canonical tokenization across a wide set of benchmarks, and that their performance is negatively correlated with the granularity of the tokenization (i.e., the more tokens used to express the same input, the lower the performance). Then, the authors show that non-canonical tokenization can lead to higher model performance in tasks such as counting characters, identifying common morphemes in different words, and performing arithmetic calculations. Finally, the authors find that the instruction-tuned models’ robustness to different tokenizations arises mainly during the SFT stage of training, and in particular from the introduction of the chat template that distinguishes separate turns in the model-user dialogue.
Strengths and Weaknesses
Strengths
- The paper is very clearly written, with well-structured experiments and thorough analyses.
- The findings offer practical insights for researchers working on tokenization, model robustness, and internal representations of language.
- The observation that non-canonical tokenization can improve performance in certain tasks is novel and may inspire future work on task-aware or adaptive tokenization methods.
Weaknesses
- The paper could benefit from a deeper discussion of why language models are able to interpret non-canonically-tokenized inputs effectively. See Question 1 below.
Questions
- The robustness of language models to non-canonical tokenizations is likely due to the presence of misspellings in the pre-training data. For instance, GPT-4o’s tokenizer has a single token for “Seattle”, but splits misspellings of it in different ways (e.g., “Seat-le” or “Se-att-lle”). I wonder if the authors looked into whether a high number of misspellings of a specific word in the training data might lead to better “understanding” (in the sense described in Section 4.3) of the non-canonically-tokenized word.
- In Section 4.1, can you provide the value for the “Spelling” and “Grammaticality” metrics for the canonical tokenization?
Limitations
Yes
Final Justification
I will maintain my score, as I think the paper presents an interesting study and should be accepted.
Formatting Issues
None
We thank the reviewer for their recognition of the “thorough analysis” in our experiments, the “novelty” and “practical insights” of our findings, and the potential to “inspire future work on adaptive tokenization.”
Are LMs robust to non-canonical tokenizations due to misspellings?
We thank the reviewer for this insightful hypothesis. Because it would be difficult to automatically identify misspellings of words, we designed a similar experiment to study this question. In particular, we consider the diversity of ways in which a token is internally segmented (when appearing within a larger word), and measure whether that is correlated with the model’s understanding of the spelling of that token. Consider the token “bio” — when appearing as a substring in, e.g., “microbiology,” it may be segmented internally, e.g., as [microbi, ology], with the token boundary appearing between “bi” and “o.” We hypothesize that LMs will better understand tokens that are segmented in a larger number of distinct ways.
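To make the measurement concrete, here is a minimal sketch of how one might count the distinct internal segmentations of a vocabulary token inside larger words. The GPT-2 tokenizer and the in-memory word list are illustrative assumptions only; the experiment itself uses the model's own vocabulary and a sample of The Pile.

```python
from collections import defaultdict
from transformers import AutoTokenizer

# Illustrative tokenizer; the actual experiment uses the evaluated model's own
# BPE vocabulary. Assumes ASCII-only words so token strings align with characters.
tok = AutoTokenizer.from_pretrained("gpt2")

def internal_segmentations(target: str, words):
    """Count the distinct ways the characters of `target` are split across the
    canonical tokens of each word that contains `target` as a substring."""
    splits = defaultdict(int)
    for word in words:
        start = word.find(target)
        if start == -1:
            continue
        end = start + len(target)
        pieces, pos = [], 0
        for piece in tok.convert_ids_to_tokens(tok.encode(word, add_special_tokens=False)):
            # Record which characters of `target` fall inside this token.
            lo, hi = max(pos, start), min(pos + len(piece), end)
            if lo < hi:
                pieces.append(word[lo:hi])
            pos += len(piece)
        splits[tuple(pieces)] += 1
    return splits

# e.g. internal_segmentations("bio", ["microbiology", "biology", "symbiotic"])
# might return {("bi", "o"): 1, ("bio",): 2}, i.e. two distinct internal splits.
```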
To test this, we use tokens from the model’s vocabulary that are 5-15 characters long and occur 50-500 times in a 13GB sample of The Pile. From there, we construct two subsets of tokens: those that are segmented in at least 5 different ways, and those that are consistently segmented the same way. We then evaluate the model on intra-token understanding using task constructions from CUTE. As shown in the table below, we find that the model consistently understands commonly segmented tokens better. We believe that this indirectly provides support for the reviewer’s hypothesis that misspellings would also lead to better understanding, as a misspelling similarly splits a string into a different sequence of subtokens.
| Variation | Commonly Segmented | Rarely Segmented |
|---|---|---|
| Char Deletion | 38.8 | 20.5 |
| Char Substitution | 18.0 | 4.1 |
| Char Swapping | 16.2 | 8.2 |
| Spell | 82.5 | 54.8 |
| Inverse Spell | 60.2 | 36.3 |
Spelling and grammaticality comparison for canonical tokenization in §4.1
We provide a comparison to canonical tokenizations below. We see that the performance of the SFT, DPO, and Instruct models is similar to that given non-canonical tokenizations (Figure 3), whereas the Base model performs much better given canonical compared to non-canonical tokenizations. This is consistent with what we expect, given that robustness to non-canonical tokenization arises after the SFT stage.
Spelling
| Model | Base | SFT | DPO | Instruct |
|---|---|---|---|---|
| OLMo | 61.3 | 79.6 | 78.3 | 77.5 |
| TULU | 67.6 | 78.5 | 79.4 | 79.7 |
Grammaticality
| Model | Base | SFT | DPO | Instruct |
|---|---|---|---|---|
| OLMo | 72.4 | 92.9 | 92.9 | 92.4 |
| TULU | 78.2 | 91.8 | 95.4 | 95.5 |
Thank you for your response! I appreciate the additional analysis about the misspellings. I think it would be valuable to include it in the final version of the paper.
I will maintain my score, as I think the paper presents an interesting study and should be accepted.
This work studies the problem of tokenization. In particular, the authors show that instruction-finetuned LLMs can handle broken tokenizations, i.e., encodings that do not follow the canonical BPE segmentation, such as character-level IDs. They study the problem through a systematic evaluation and also show that broken tokens can help LLMs improve performance on certain character-level-aware tasks.
Strengths and Weaknesses
Strength: The paper's message is interesting and can provide further insights for future development of tokenization.
Weakness: Certain experiments are not very convincing and open up questions. See Questions.
Questions
The experimental design in Section 3 lacks persuasiveness, primarily due to the suboptimal digit tokenization strategy employed by Llama3. It remains uncertain whether the proposed approach generalizes effectively to other large language models (LLMs). Does the strategy hold for tokenizers that do not merge digits?
In Section 3, why were 10-digit numbers specifically chosen for the arithmetic task? Is there a particular rationale behind this decision? Also, the Counting Characters dataset contains only 331 samples, which makes the evaluation less convincing.
It seems that Qwen and Llama3 share many common tokens (correct me if I'm wrong). Do these findings extend to other models, such as Llama2/Mistral? It would be more compelling if the authors could demonstrate similar results for task-specific models like CodeLlama, where spelling errors could significantly impact performance.
Regarding the findings in Section 4.2, when does this robustness emerge? Specifically, how many training epochs are required for this robustness to emerge, and how does this align with the reported results? Also, I wonder if we can take the idea from Section 3 to improve performance during the SFT stage. From Figure 1, it seems that the model always outputs canonical encodings; I wonder if the authors have further insight into why this is the case. Does your finding in Section 3 apply to the Llama3-1B model in Section 4.2?
Figure 4 suggests that the chat template is critical for enabling the model to become "character" aware. Can you provide further insights into why this might be the case? My understanding is that the template encourages the model to learn from fragmented or broken tokens. This approach seems similar to techniques used in code completion LLMs, where random token insertion is employed to train models to handle incomplete or broken tokens effectively.
Overall, although I find the message the paper's conveying is interesting, I feel like the experiments conducted do not fully investigate the phenomenon.
Limitations
While the paper presents interesting observations, it would be more informative for readers if additional details about the experimental setup were provided. See Questions.
I'll happily increase my score if the authors address my questions.
Final Justification
While the paper leaves some open questions on why this phenomenon emerges during the SFT finetuning phase (with the chat template), this work can potentially inspire much interesting follow-up work in this area.
Formatting Issues
No formatting concern
We thank the reviewer for recognizing that our paper presents “interesting observations,” as well as the detailed comments and insightful questions. We hope you find that our response addresses your concerns.
On testing models that do not merge digits
LMs whose tokenizers do not merge digits will not have multi-digit tokens in the vocabulary, making it impossible to evaluate right-to-left tokenization (which chunks digits into groups of 3, e.g., 1000 → [1, 000]) at inference-time.
We also want to mention that the best representation of digits is still an open question. For instance, [1] found that character-level tokenization is still suboptimal and underperforms “fixed-character” tokenization, which is somewhat similar to R2L because it ensures consistent digit positions. In addition, grouping digits in chunks of 3 is more common among tokenizers of frontier LMs today.
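For illustration, here is a minimal sketch of right-to-left digit grouping applied at encoding time. The helper names are ours, and the assumption that every 1-3 digit string is a single vocabulary token holds for Llama-3-style tokenizers but should be checked for others.

```python
def r2l_digit_chunks(number: str):
    """Split a digit string into 3-digit groups aligned from the right,
    e.g. "1234567" -> ["1", "234", "567"]."""
    chunks = []
    for i in range(len(number), 0, -3):
        chunks.append(number[max(0, i - 3):i])
    return list(reversed(chunks))

def r2l_encode(number: str, tokenizer):
    """Encode a number with right-to-left digit grouping instead of the
    canonical left-to-right segmentation. Assumes every 1-3 digit string is a
    single token, as in Llama-3-style vocabularies."""
    ids = []
    for chunk in r2l_digit_chunks(number):
        ids.extend(tokenizer.encode(chunk, add_special_tokens=False))
    return ids

# r2l_digit_chunks("1000") -> ["1", "000"], matching the [1, 000] example above.
```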
Expanding the test sets in Section 3
Thank you for identifying the limitation of small test sets. Following the reviewer’s suggestion, we expand the test sets for Arithmetic and Counting Characters to >1000 examples. For Arithmetic, we include numbers with 5 to 15 digits. We find that right-to-left tokenization of digits still significantly outperforms the canonical left-to-right tokenization, achieving 30.8% compared to 9.8% accuracy. We also expand the Counting Characters dataset by considering tokens of 5 to 10 characters in Llama3’s vocabulary. Again, we replicate the finding that character-level tokenization leads to better performance, achieving 73.5% compared to 66.5% for canonical tokenization. We will update the test sets and corresponding results in the next revision.
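For concreteness, here is a minimal sketch of how the expanded Arithmetic test set can be generated; the prompt wording and sampling details are illustrative assumptions rather than our exact script.

```python
import random

def make_arithmetic_examples(n=1000, min_digits=5, max_digits=15, seed=0):
    """Sample random addition/subtraction problems over 5-15 digit operands."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        d1 = rng.randint(min_digits, max_digits)
        d2 = rng.randint(min_digits, max_digits)
        a = rng.randint(10 ** (d1 - 1), 10 ** d1 - 1)
        b = rng.randint(10 ** (d2 - 1), 10 ** d2 - 1)
        op = rng.choice(["+", "-"])
        answer = a + b if op == "+" else a - b
        examples.append({"question": f"What is {a} {op} {b}?", "answer": str(answer)})
    return examples
```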
On evaluating models with tokenizers that differ more from Qwen’s & Llama3’s
Following the reviewer’s suggestion, we perform evaluations on instruction-tuned Mistral and Llama2 models. Note that we do not use CodeLlama or related code models (based on Llama2) due to very poor performance, even with canonical tokenization. The findings are consistent with those reported in the paper — one model retains 94% of performance and the other retains 80% of performance with random tokenization (full results in the two tables below; a sketch of how the Random and Char tokenizations can be produced follows the tables).
Average Retention: Random: 94.1%, Char: 89.6%
| Dataset | Normal | Random | Char |
|---|---|---|---|
| ARC-C | 0.816 | 0.772 | 0.712 |
| ARC-E | 0.906 | 0.860 | 0.806 |
| COPA | 0.932 | 0.898 | 0.902 |
| Winogrande | 0.544 | 0.538 | 0.522 |
| Winograd | 0.722 | 0.676 | 0.634 |
| CSQA | 0.794 | 0.714 | 0.664 |
| OpenbookQA | 0.810 | 0.700 | 0.650 |
| PIQA | 0.842 | 0.798 | 0.734 |
| MMLU | 0.656 | 0.558 | 0.546 |
| BoolQ | 0.826 | 0.802 | 0.792 |
| HellaSwag | 0.760 | 0.676 | 0.596 |
| WikidataQA | 0.552 | 0.630 | 0.576 |
| TOFU | 0.795 | 0.778 | 0.769 |
| TriviaQA | 0.510 | 0.384 | 0.354 |
| JeopardyQA | 0.252 | 0.176 | 0.158 |
| AlpacaEval | 0.500 | 0.5923 | 0.5293 |
| MATH | 0.488 | 0.448 | 0.456 |
| CUTE | 0.457 | 0.467 | 0.500 |
| GSM8K | 0.804 | 0.741 | 0.701 |
| DROP | 0.814 | 0.816 | 0.786 |
Average Retention: Random: 80.3%, Char: 70.5%
| Dataset | Normal | Random | Char |
|---|---|---|---|
| ARC-C | 0.710 | 0.512 | 0.438 |
| ARC-E | 0.846 | 0.642 | 0.538 |
| COPA | 0.944 | 0.840 | 0.766 |
| Winogrande | 0.520 | 0.454 | 0.406 |
| Winograd | 0.622 | 0.526 | 0.436 |
| CSQA | 0.724 | 0.480 | 0.454 |
| OpenbookQA | 0.742 | 0.506 | 0.438 |
| PIQA | 0.824 | 0.668 | 0.612 |
| MMLU | 0.576 | 0.428 | 0.382 |
| BoolQ | 0.857 | 0.754 | 0.734 |
| HellaSwag | 0.704 | 0.460 | 0.390 |
| WikidataQA | 0.820 | 0.652 | 0.542 |
| TOFU | 0.846 | 0.803 | 0.718 |
| TriviaQA | 0.760 | 0.576 | 0.498 |
| JeopardyQA | 0.446 | 0.240 | 0.180 |
| AlpacaEval | 0.500 | 0.4391 | 0.3552 |
| MATH | 0.106 | 0.096 | 0.093 |
| CUTE | 0.286 | 0.271 | 0.205 |
| GSM8K | 0.491 | 0.406 | 0.356 |
| DROP | 0.798 | 0.752 | 0.728 |
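For reference, the “Random” and “Char” tokenizations evaluated above can be produced roughly as in the sketch below, assuming a byte-level BPE tokenizer in the HuggingFace transformers style; our exact randomization procedure may differ.

```python
import random

def random_tokenize(text: str, tokenizer, p_cut=0.1, seed=0):
    """Produce a generally non-canonical tokenization of `text` by cutting the
    string at random character boundaries and encoding each piece separately.
    With a byte-level BPE tokenizer, the pieces decode back to `text`, but the
    token sequence usually differs from the canonical one. A sketch only."""
    rng = random.Random(seed)
    cuts = [i for i in range(1, len(text)) if rng.random() < p_cut]
    ids, prev = [], 0
    for cut in cuts + [len(text)]:
        ids.extend(tokenizer.encode(text[prev:cut], add_special_tokens=False))
        prev = cut
    return ids

def char_tokenize(text: str, tokenizer):
    """Character-level tokenization: encode each character on its own."""
    ids = []
    for ch in text:
        ids.extend(tokenizer.encode(ch, add_special_tokens=False))
    return ids
```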
How many training epochs are required for robustness to emerge?
Thank you for the question; we should have stated explicitly that we trained for two epochs, which we will include in the next revision. To observe how robustness varies with the amount of SFT training, we also evaluate the epoch 1 checkpoint on our robustness metrics (spelling, grammaticality, and win rate). We find that the 1-epoch checkpoint’s performance lies between the base model and the 2-epoch model. This suggests that robustness increases during the SFT process.
Why do models always produce canonical tokenizations?
We hypothesize that models are able to understand non-canonical tokenizations in their context because they have the opportunity to “re-aggregate” information from multi-token words into the last token’s hidden representation. This process has been observed in how models process multi-token entities (e.g., [␣New, ␣York]) and multi-token words [1, 2, 3]. This learned process of “internal retokenization” in early layers likely grants the robustness to non-canonical tokenizations.
In contrast, models are not trained to predict non-canonical token sequences. Of course, mathematically, models must place non-zero probability on non-canonical token sequences [4, 5]. This means that over many samples, models should sometimes produce non-canonical tokenizations (regardless of whether the input is canonically tokenized). However, we qualitatively observe that generations are overwhelmingly canonical.
Applying more optimal tokenizations during SFT
We interpret the reviewer’s suggestion (“take the idea from §3 to improve the performance during the SFT stage”) as training with non-canonical tokenizations during SFT to further improve upon the gains that we found at inference-time alone.
To test this, we finetune the base model on 30K random examples from the SFT mixture twice: once with L2R digit grouping, and once with R2L grouping. Then, we compare the following settings: (1) the L2R-trained model with L2R inference, (2) the L2R-trained model with R2L inference, and (3) the R2L-trained model with R2L inference. We expect (3) to outperform (2) due to the additional opportunity to adapt to R2L tokenization during training. We evaluate these models on 1,000 instances of addition and subtraction of 4-8 digit numbers. We find that (1) gets 2%, (2) gets 24%, and (3) gets 12.3%, counter to our hypothesis. We conjecture that this is because (3) learns to predict two different representations of numbers between pretraining and finetuning, and when generating output, divides probability among R2L and L2R representations. In contrast, (2) continues to generate numbers L2R, even when conditioned on the R2L representation of digits.
Evaluating alternative tokenization schemes on Llama-3.2-1B-Instruct
Following the reviewer’s suggestion, we investigate whether the performance improvements we identify in §3 also apply to the smaller Llama-3.2-1B-Instruct. On the Arithmetic task (expanded to 1000 examples over 4-8 digit numbers instead of 5-15 digit numbers, to account for the weaker model’s abilities), we find that R2L segmentation achieves 32.7% accuracy, a substantial improvement over 9.7% with canonical tokenization.
However, for the other four tasks (Common Morpheme, Codeline Description, Acronym, and Counting Characters tasks), Llama-3.2-1B-Instruct consistently performs worse when presented with character-level tokens. We hypothesize this is because, as we observed, weaker models are less robust to non-canonical tokenizations compared to stronger models.
Why is the chat template crucial to robustness?
We believe that the chat template is crucial for indicating a new turn of conversation, “freeing” the model from continuing the perceived misspellings in the context. Indeed, we show in §4.2 that it does not need to be an actual “chat template,” and a generic “Question:/Answer:” template that separates the user and model turn produces the same effect.
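To make the comparison concrete, here is a schematic sketch of the two prompt formats; the template strings are illustrative placeholders, not the exact templates used in our experiments.

```python
def chat_format(user_text: str) -> str:
    # Schematic stand-in for a real chat template (e.g. Llama-3's special
    # header tokens); the only property that matters here is that it marks a
    # clear boundary between the user turn and the model's turn.
    return f"<|user|>\n{user_text}\n<|assistant|>\n"

def generic_qa_format(user_text: str) -> str:
    # The generic "Question:/Answer:" template discussed in §4.2, which also
    # separates the two turns but uses no chat-specific special tokens.
    return f"Question: {user_text}\nAnswer:"

# In both cases, a non-canonically tokenized user_text is followed by a turn
# boundary, so the model need not continue what it may perceive as misspellings.
```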
Regarding more detailed documentation
Thank you for raising this important point. We plan on updating the manuscript to include more details about the experimental setup, alongside the exact script/settings that we used to train/evaluate our models and our benchmarks.
Thank you for your reply. I updated my score accordingly. I hope future work can rigorously explain why the chat template fixes this issue.
The paper investigates how instruction-tuned language models handle non-canonical tokenizations of input text. Through systematic evaluation on 20 benchmarks and three model families, the authors show that models retain over 90 percent of their performance under random or character-level tokenizations. They further demonstrate that swapping in alternative tokenizations such as character-level segmentation for string tasks or right-aligned digit grouping for arithmetic can yield substantial gains (up to +33 percent) without additional training. Finally, an analysis of model stages reveals that robustness to non-canonical tokenizations emerges during the supervised instruction-tuning phase.
Strengths and Weaknesses
Strengths:
- Addresses the under-explored topic of tokenization, with a large-scale empirical study on diverse tasks.
- Demonstrates opportunities for performance gains via simple inference-time interventions with a creative and diverse task suite.
- Rigorous analysis pinpoints instruction-tuning as the key driver of robustness, supported by ablations on training data and procedure.
Weaknesses:
- Non-canonical tokenizations do not show performance gain across the board. Would be great to have an adaptive tokenization scheme with consistent improvements.
Questions
- Would be good to see if fine-tuning on some non-canonical scheme (e.g. right-to-left digit grouping) further boosts improvements.
- Would be good to see some studies on non-latin languages.
Limitations
yes
Final Justification
This paper provides meaningful insight to the often overlooked tokenization in LLMs. My questions have been addressed with additional insightful experiments.
Formatting Issues
no
We thank the reviewer for recognizing our “systematic evaluation” of model robustness to non-canonical tokenizations, the “performance gains” delivered by “simple inference-time interventions,” and “rigorous analysis” into the source of robustness. We hope you find that our response addresses your concerns.
On the possibility of automatically adapting tokenization schemes
The reviewer is correct that, in this work, we do not find a single adaptive tokenization strategy that improves performance across the board. The goal of the work was not to improve tokenization, but instead to demonstrate that LMs can handle non-canonical tokenization, which is surprising and important because these tokenizations are provably out-of-distribution. We believe that automatically finding the best segmentation of text at inference-time is a promising direction for future work.
Finetuning on non-canonical tokenizations
Following the reviewer’s suggestion, we explore whether finetuning on a non-canonical tokenization scheme, which has proven useful at inference-time, can further improve downstream performance. In particular, we finetune the base model on 30K random examples from the Tulu3 SFT mixture twice: once with L2R digit grouping, and once with R2L grouping. Then, we compare the following settings: (1) the L2R-trained model with L2R inference, (2) the L2R-trained model with R2L inference, and (3) the R2L-trained model with R2L inference. We expect (3) to outperform (2) due to the additional opportunity to adapt to R2L tokenization during training. We evaluate these models on 1,000 instances of addition and subtraction of 4-8 digit numbers. We find that (1) gets 2%, (2) gets 24%, and (3) gets 12.3%, counter to our hypothesis. We conjecture that this is because (3) learns to predict two different representations of numbers between pretraining and finetuning, and when generating output, divides probability among R2L and L2R representations. In contrast, (2) continues to generate numbers L2R, even when conditioned on the R2L representation of digits.
Evaluations on non-Latin languages
Thank you for pointing out the value of additional evaluations on non-Latin-script languages! This is indeed a point that we overlooked. We did additional evaluations on Chinese with Qwen (since Llama and OLMo do not officially support Chinese), comparing character-level tokenization and canonical tokenization. We evaluate on Chinese MMLU and Chinese GSM. As shown in the table below, Qwen is robust to character-level tokens in Chinese. We plan to expand these results and include them in the next revision.
| Benchmark | Normal | Char |
|---|---|---|
| Chinese MMLU | 77.8 | 74.2 |
| Chinese GSM | 78.8 | 76.8 |
I'd like to thank the authors for additional insightful experiments. I'll maintain my score.
This paper studies how LLMs trained with a single, deterministic BPE tokenizer behave when they are fed input that is tokenized in “non-canonical” ways at inference time. Non-canonical tokenization ranges from random alternative BPE segmentations to pure character-level splits. Across 20 benchmarks the authors show that performance drops surprisingly little, with instruction-tuned models retaining more accuracy than their pre-train–only counterparts. Moreover, they uncover scenarios where deliberately altering the tokenization improves results: character-level splits boost string-manipulation and code tasks, while right-aligned digit grouping markedly improves large-number arithmetic.
Strengths and Weaknesses
Strengths:
- The work focuses on an important yet under-explored aspect of LLMs—how BPE tokenization choices at inference time affect model behavior. This paper reveals unexpected flexibility and generalizability of LLMs.
- The comparison of instruction-tuned vs. base models is very insightful. By comparing Llama-3 with its SFT-fine-tuned variant, the paper clearly shows that instruction tuning significantly increases robustness to tokenization change, showing the practical value of SFT.
- This paper finds a simple yet effective number tokenization trick. The right-aligned digit grouping technique is very easy to implement and delivers substantial gains on arithmetic tasks, offering a “free” performance boost.
Weaknesses:
- This paper has limited model diversity in the SFT analysis. Section 3 compares only one base/SFT pair (Llama-3 vs. Llama-3-Instruct), leaving open whether the observed gap generalizes to other architectures and scales.
- This paper misses related work on the same topic. Existing work (e.g., DeepSeek-V3, see question 1) already uses randomized tokenization during training; the paper neither acknowledges this clearly nor tests such models. I suggest adding DeepSeek-V3 to the experiments, to see how a small percentage of random tokenization in the pretraining stage can make a difference.
- The tri-color bar plots in Figures 3 & 4 occupy significant space yet convey highly correlated and redundant information: showing one bar instead of 3 bars makes little difference to the conclusions drawn; a more compact format (e.g., a table with a few numbers) could communicate the key numbers more efficiently.
Questions
The study focuses on non-canonical tokenization at inference. Could you clarify why you do not explore randomized tokenization during pre-training, as in DeepSeek-V3 (https://arxiv.org/pdf/2412.19437, Sec. 4.1)? What challenges or trade-offs led you to focus solely on inference-time?
Limitations
N/A
Formatting Issues
N/A
We thank the reviewer for recognizing the “important” and “unexpected” finding that LLMs are robust to non-canonical tokenizations of text, and furthermore, can see “substantial gains” from R2L tokenization in arithmetic.
Model diversity in SFT analysis
We wish to clarify which section the reviewer is referring to. §3 (mentioned by the reviewer) identifies tasks where non-canonical tokenization leads to better performance, and uses only instruction-tuned models, not base/chat pairs. §4 (perhaps what the reviewer intended) does use base/chat pairs to identify the source of robustness, but we study both the OLMo and Tulu model families. We would be happy to run additional experiments with clarification from the reviewer!
Related work in DeepSeek-V3
We thank the reviewer for this important reference that we were not aware of. Indeed, DeepSeek-V3 trains with non-deterministic tokenization, randomly dropping merges of punctuation tokens like \n\n to overcome the “prompt boundary problem” (so that the model can generate \n when conditioned on \n). However, we regret to share that we were not able to run DeepSeek-V3, which has 671B parameters, with our compute budget. We considered distilled versions, but this would not test the same hypothesis, since the distilled versions themselves were not trained with non-deterministic tokenization. We will definitely discuss DeepSeek-V3 in the next revision of the paper.
More compact format for Figures 3 & 4
Thank you for the suggestion, we plan to present this data in a table as suggested for the next revision.
On random tokenizations during pretraining
It would be very interesting to control for the extent of randomness in tokenization during pretraining and measure its impact on models’ robustness to non-canonical tokenizations. However, pretraining from scratch is not possible with our compute budget, and moreover, our paper focuses instead on demonstrating the unexpected finding that even with deterministic tokenization, models are surprisingly robust to non-canonical tokenizations. We believe the reviewer’s suggestion is a promising direction for future work, as non-deterministic tokenization during pretraining may enable further inference-time gains from adaptive tokenization.
I understand that the large-scale experiments I proposed may not be possible given computation budget constraints. Still, I'd like to thank the authors for their response, and keep my score of 5 (Accept)
This paper presents a thorough study of how well LMs handle non-canonical tokenizations of input text. This is an important, but under-explored problem, and the findings are surprising: LMs are surprisingly robust. This opens up the door to swap in alternative tokenizations at inference time to boost performance, as is shown in the paper, which presents an exciting direction for future work.
Reviewers agree that this paper studies an important topic, that the experiments conducted are thorough, and that the findings are (very) interesting. There remains some uncertainty about precisely what role SFT plays in the emergence of this behavior, and how to best leverage the insight for adaptive tokenization at inference time. However, these can be addressed in future work and should not stand in the way of publication.