PaperHub
Overall rating: 6.0/10
Poster · 6 reviewers (min 3, max 8, standard deviation 1.8)
Ratings: 4, 7, 8, 7, 3, 7
Average confidence: 3.2
COLM 2025

UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8

Submitted: 2025-03-20 · Updated: 2025-08-26
TL;DR

Byte-level subword tokenization creates vocabularies whose tokens aren't well-formed UTF-8, requiring workarounds to interpret them as code points.

Abstract

Keywords
tokenization

Reviews and Discussion

Official Review
Rating: 4

This paper provides a proof that invalid UTF-8 sequences can be generated as a result of the tokenizer vocabulary not covering the full range of Unicode code points. The authors propose an algorithm to prevent errors in UTF-8 decoding. The paper is well organized, but lacks clarity in some key areas. The paper addresses an important issue, but does not make a sufficiently large contribution to the understanding of the issue or make clear steps towards a solution.

Reasons to Accept

Incomplete and invalid UTF-8 sequences constitute an important issue with tokenization. They cause errors in tokenizations through the sequence and can lead to glitch tokens (https://arxiv.org/pdf/2405.05417), which can lead to unpredictable and unsafe model behavior (https://arxiv.org/pdf/2402.14020).

Reasons to Reject

(1) The paper highlights a genuine problem with using UTF-8 encoding. However, the authors motivate this problem theoretically without highlighting the empirical evidence of this problem (https://arxiv.org/pdf/2410.23684). Given the previous work on the topic and the fact that the proposed algorithm is not tested, the proof itself is not a sufficient contribution.

(2) On page 7 of the paper, the authors write “ This violates Equation 7, which says that tokenizing and detokenizing an input should return exactly that input. Tokenizers that do not satisfy this property have their own problems that are outside of the scope of this paper.” This surprised me as I believed that tokenizers that do not satisfy that property were one of the key topics of discussion in the paper. On page 1, the authors write: “Detokenization should be a morphism: combining two token sequences and detokenizing them should yield the same result as detokenizing each sequence separately and combining the results.” As this is core to the aims and contribution of the paper, this contradiction significantly weakens the paper.

(3) The paper's use of terminology is imprecise, and the authors introduce novel terms where terms already exist. Furthermore, the authors claim that the paper advances terminology on this topic, which I do not agree to be a contribution of the paper. For example, the authors use the word ‘cutting’ throughout the paper. In most of the paper (second to last paragraph of p.9), this term is being used instead of ‘segmenting’ or ‘tokenizing’. In the final subsection on p. 8, the authors use it instead of ‘pre-tokenization’. The authors at no point in the paper use the term pre-tokenization to refer to that process. Another case of imprecise terminology usage concerns the usage of the term ‘detokenization’. It is more common in the literature to refer to this as ‘decoding’ (and the authors do use this term throughout the process to refer to the same concept). The authors should use only one term for clarity.

(4) The authors do not sufficiently engage with the literature on this topic. In some cases, prior work is cited, but the paper does not reflect the findings of the cited work. The most obvious case relates to the claim that “the power of BPE has attained the status of a folk theorem and is taken for granted” (p. 7). However, in the last paragraph of Section 5 (p. 9), the authors cite Bostrom & Durrett (2020) and Gallé (2019) and discuss previous work on the features of BPE that make it effective and widely adopted. I think the authors should add to this and discuss other evidence about the pros and cons of BPE (e.g. https://aclanthology.org/2024.findings-naacl.247.pdf, https://arxiv.org/pdf/2411.08671, https://aclanthology.org/2023.cl-4.5/). I also recommend the authors also look at this discussion of tokenization as lossless (https://arxiv.org/pdf/2309.10668), the definition of partial UTF-8 sequences (https://arxiv.org/pdf/2405.05417), and work on glitch tokens (https://arxiv.org/pdf/2402.14020, https://arxiv.org/pdf/2405.05417).

(5) The term “code points” is used multiple times (including in the abstract) before it is defined on p. 5. The authors should either define this in the introduction or use a different word to refer to this, which does not require definition.

Questions to the Authors

The authors list "tools and code" as a keyword in the paper, however I did not see any code or tools released with the paper. What are the tools and/or code that the paper is contributing?

Typos:

  • utf-8 → UTF-8
  • greek → Greek
  • lettter → letter
  • “Gastaldi et al. (2024); Ahia et al. (2023).” (Section 3.1) → use \citep{}
  • "Petrov et al. (2023); Ahia et al. (2023)" (p.8) → use \citep{}
  • "Synchromesh Poesia et al. (2022) and SynCode Ugare et al. (2024)"→ Synchromesh (Poesia et al., 2022) and SynCode (Ugare et al., 2024); use \citep{}
  • Error with OpenLLM citations (p. 8)
  • langauges → languages
  • smal → small
  • Missing or incorrect \ref{} in second paragraph of Appendix A

Formatting issues:

  • The ≀ in Table 1 indicate token boundaries, but it is difficult to visually determine byte boundaries. Could the authors add color coding or some other way to aid the reader?
  • The authors didn’t submit the paper in preprint format (with line numbers)
Comment

On page 7 of the paper, the authors write “This violates Equation 7, which says that tokenizing and detokenizing an input should return exactly that input. Tokenizers that do not satisfy this property have their own problems that are outside of the scope of this paper.” This surprised me as I believed that tokenizers that do not satisfy that property were one of the key topics of discussion in the paper. On page 1, the authors write: “Detokenization should be a morphism: combining two token sequences and detokenizing them should yield the same result as detokenizing each sequence separately and combining the results.” As this is core to the aims and contribution of the paper, this contradiction significantly weakens the paper.

The sentence you point out on page 7 comes at the end of a discussion on byte-fallback tokenization. It is an error, which we will fix as described in the following paragraph.

Byte-fallback tokenization is a special case of byte-level tokenization where each token in the vocabulary is constrained to be well-formed UTF-8, except for 256 single-byte tokens, each representing one of the 256 bytes. The tokenizer uses these single-byte tokens as a last resort to represent a part of the input that cannot be represented using any of the other tokens in the vocabulary. Because a byte-fallback tokenizer's vocabulary contains tokens that are ill-formed UTF-8 (all the single-byte tokens 0x80 through 0xFF), it is capable of generating ill-formed UTF-8 sequences: this follows from our Proposition 1 ("any vocabulary that contains a member that is ill-formed UTF-8 will be able to generate ill-formed sequences").
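For concreteness, here is a small Python illustration of ours (a sketch for this response, not code from the paper) showing why byte-fallback vocabularies fall under Proposition 1: any single-byte token in the range 0x80 through 0xFF is ill-formed UTF-8 on its own.

```python
# Our illustration: a byte-fallback vocabulary contains one token per byte,
# and every token for a byte in 0x80-0xFF is ill-formed UTF-8 when it stands alone.
for b in (0x41, 0x80, 0xE0, 0xFF):
    try:
        print(hex(b), "decodes to", repr(bytes([b]).decode("utf-8")))
    except UnicodeDecodeError as err:
        print(hex(b), "is ill-formed on its own:", err.reason)
```

Only 0x41 ("A") decodes; the other three raise, which is exactly the situation Proposition 1 covers.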

There is no substantive contradiction with the remainder of the paper. The passage you point out results from an error in our exposition of byte-fallback tokenization, which, as we have just shown, is covered fully by our framework. Thank you for pointing out our mistake and spurring us to clarify the exposition of byte-fallback tokenization.

The authors claim that the paper advances terminology on this topic, which I do not agree to be a contribution of the paper

The terminological clarity we hope to advance is the explicit use of "byte-level" and "character-level" to describe token vocabularies. Seldom do papers introducing models state explicitly whether their tokenizer works with bytes or characters, though such information is relevant to those implementing downstream applications for those models. We will add clarifying notes on our choices of terminology in the paper's final version.

The authors use the word ‘cutting’ throughout the paper. In most of the paper (second to last paragraph of p.9), this term is being used instead of ‘segmenting’ or ‘tokenizing’. In the final subsection on p. 8, the authors use it instead of ‘pre-tokenization’. The authors at no point in the paper use the term pre-tokenization to refer to that process.

We are happy to vary our term from "cutting" to some other description such as "segmenting" or "chunking". We believe "tokenizing" to be conceptually distinct from "cutting", since "tokenizing" can be represented by a transducer or another formalism, whereas "cutting" refers specifically to our formalism based on monoids.

Another case of imprecise terminology usage concerns the usage of the term ‘detokenization’. It is more common in the literature to refer to this as ‘decoding’ (and the authors do use this term throughout the process to refer to the same concept). The authors should use only one term for clarity.

It is important to distinguish between the process of interpreting a byte sequence according to the UTF-8 encoding scheme and converting a sequence of tokens to a sequence of bytes. We use "decode" for the former and "detokenize" for the latter. In terms of the monoids in the paper, tokenization and detokenization refer to mapping from some $\Sigma^*$ to some $\Sigma^{*\circ\wr*}$ and back, whereas encoding and decoding refer to mapping from $\mathrm{B}^*$ to $\Upsilon^*$ and back.

If the reviewer thinks using "encode" and "decode" is clearer, we are happy to use that terminology, yet we want to maintain a terminological distinction between mapping in and out of the token vocabulary and mapping in and out of UTF-8.
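As a purely illustrative sketch of the two mappings (ours, with a hypothetical two-token vocabulary, not code from the paper):

```python
# Hypothetical vocabulary mapping token ids to the bytes they stand for;
# neither token is well-formed UTF-8 on its own.
vocab = {0: b"\xe0\xa4", 1: b"\xb3"}

def detokenize(token_ids):
    """Tokens -> bytes (what we call detokenization)."""
    return b"".join(vocab[t] for t in token_ids)

def decode(byte_seq):
    """Bytes -> code points (what we call decoding); may raise on ill-formed input."""
    return byte_seq.decode("utf-8")

print(decode(detokenize([0, 1])))  # prints the single character U+0933
```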

Comment

The authors list "tools and code" as a keyword in the paper, however I did not see any code or tools released with the paper. What are the tools and/or code that the paper is contributing?

We selected this keyword because our paper discusses existing tools and codebases.

References

Please see common response for references.

Comment

This is a reporting/transparency problem

This is reasonable. We emphasize the importance of reporting explicitly whether the vocabulary is byte- or character-level, but we will reframe the contribution in terms of transparency rather than terminology in the next version of the paper. Papers are often vague or unclear about what level of tokenization they use (e.g. Llama 1 and 2 say “SentencePiece” rather than “character-level”, and OLMo 1 and 2 say “BPE-based” and “borrowed from cl100k”, never mentioning “byte-level”).

I believe ‘partial UTF-8 sequences’ [3] and ‘incomplete tokens’ [4] exactly refer to ‘ill-formed sequences’.

The terms you point out are used to refer to individual tokens that are ill-formed. We refer to sequences of tokens that collectively make up ill-formed UTF-8.

[4] assembles individually ill-formed tokens to create sequences of tokens that detokenize to bytes that can be successfully interpreted (“decoded”) as UTF-8. They do not work with ill-formed sequences, only with ill-formed tokens. [3] excludes partial UTF-8 sequences (and unreachable tokens) from their experiments: they introduce the term solely for the purpose of naming and avoiding the problem we attack in our paper. We, on the contrary, concern ourselves with the question of ill-formed sequences, the conditions under which they cannot be avoided, and ways in which we can cope with them to avoid breaking systems that expect well-formed UTF-8.

We give an example of what we mean. We will represent bytes by two-character hexadecimal values and use $\wr$ to separate tokens. Let our example vocabulary contain the tokens $\{\wr\mathrm{E0\ A4}\wr, \wr\mathrm{B3}\wr\}$. Note that the sequence of bytes $\mathrm{E0\ A4\ B3}$ is well-formed UTF-8, but neither of the tokens in the vocabulary is well-formed on its own.

[3, 4] deal with the tokens in this example vocabulary as “partial UTF-8” and “incomplete tokens” respectively. [3] would simply leave these tokens out of their consideration entirely. [4] would only deal with the bigram $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr$ (not that this bigram is particularly improbable) because only together do these tokens form well-formed UTF-8.

We deal with sequences like $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr\mathrm{E0\ A4}\wr$, which is ill-formed UTF-8 and something that a model that worked with this example vocabulary could generate. This sequence of 5 bytes is ill-formed UTF-8 (once it is detokenized): though the first three bytes make up a well-formed character encoding form, the sequence as a whole contains bytes which are ill-formed (i.e. the last two bytes, the last token). This situation is fully out of scope for both [3], because they do not consider ill-formed (“partial”) tokens in their study, and [4], because they only consider bigrams that create well-formed UTF-8. It is, however, the kind of situation we treat in our paper.
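The same example can be checked directly in Python (our illustration of this response's byte values, not code from the paper):

```python
# The sequence ≀E0 A4≀ B3 ≀E0 A4≀ detokenizes to five bytes: a complete
# character followed by a truncated one, so the whole sequence is ill-formed UTF-8.
seq = b"\xe0\xa4" + b"\xb3" + b"\xe0\xa4"
print(seq[:3].decode("utf-8"))            # the well-formed prefix decodes fine
try:
    seq.decode("utf-8")
except UnicodeDecodeError as err:
    print("whole sequence is ill-formed:", err.reason)
```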

This is clear as you explain it, but please add clarification on the use of the terminology in the paper. That makes sense to me. I think this should be clarified as you introduce these terms in the paper.

We will clarify our introduction and use of terms in the next version of the paper.

“The power of BPE…”

We will remove this sentence from the next version of the paper and improve our discussion of algorithms in the related work section.

Choice of keyword

We will remove this keyword from the paper.

Comment

I thank the authors for addressing my comments so thoroughly. I will slightly raise my score, but I agree with Reviewer Vin2 that the paper does not offer a sufficient contribution, as it contains neither empirical results nor a novel algorithm.

Comment

We appreciate the increased score and would like to take this opportunity to clarify our contributions: we give a formal structure to a problem that has so far only been encountered empirically, present an abstract formalization of existing solutions, and show a novel impossibility result about the compatibility of byte-level subword tokenization with UTF-8. We thank you for the conversations and comments, which will make our paper better.

Comment

The terminological clarity we hope to advance is the explicit use of "byte-level" and "character-level" to describe token vocabularies.

This isn’t a terminology problem, but a reporting/transparency problem. These terms are already the default terms for those concepts. The authors should de-emphasize this terminology usage as a contribution.

"tokenizing" can be represented by a transducer or another formalism, whereas "cutting" refers specifically to our formalism based on monoids

This is clear as you explain it, but please add clarification on the use of the terminology in the paper.

We use "decode" for the former and "detokenize" for the latter.

That makes sense to me. I think this should be clarified as you introduce these terms in the paper.

[2,3,4] all exclude ill-formed sequences from their analyses.

I believe ‘partial UTF-8 sequences’ [3] and ‘incomplete tokens’ [4] exactly refer to ‘ill-formed sequences’.

Of the efficacy of BPE: we are not invested in the superiority of any cutting algorithm over another.

I am specifically referring to this claim: “the power of BPE has attained the status of a folk theorem and is taken for granted” (Section 4.2), for which no citation was provided.

We selected this keyword because our paper discusses existing tools and codebases.

I do not think the choice of keyword is appropriate.

Comment

The paper highlights a genuine problem with using UTF-8 encoding. However, the authors motivate this problem theoretically without highlighting the empirical evidence of this problem (https://arxiv.org/pdf/2410.23684) [4].

Thank you for the reference. The problem [4] examines is distinct from our own. They show how to construct improbable but well-formed sequences of characters from models' vocabularies and that these sequences cause undesirable behavior in models. This vulnerability is similar to that caused by glitch tokens [2,3], where undertrained tokens cause bad behavior in models. Both issues stem from undertrained tokens or sequences of tokens forcing the model into generating a sequence distant from any in its training data.

We show that the presence of ill-formed tokens in the model's vocabulary permits the model to generate sequences of bytes that are not well-formed UTF-8. The presence of ill-formed UTF-8 in the output presents a vulnerability to systems that work with LLMs and expect their inputs and outputs to be well-formed UTF-8. [2,3,4] all exclude ill-formed sequences from their analyses.

The authors do not sufficiently engage with the literature on this topic.

Thank you for the detailed bibliographic citations. Since you raise several points in this question, we shall treat them one at a time.

Of the efficacy of BPE: we are not invested in the superiority of any cutting algorithm over another. In other words, we are not concerned with where $\Sigma^{*\wr\circ}$ comes from. BPE and Unigram are both members of the class of tokenizers covered by our formalism, as are any other algorithms that work by producing a finite vocabulary and cutting inputs such that each part of the input between the cut marks is a member of the vocabulary. We merely point out a problem that will affect any byte-level tokenizer, regardless of how the vocabulary is determined. Discussing the pros and cons of BPE as a compression algorithm for producing subword vocabularies is not in the scope of our paper, though we appreciate the references to expand our related work section.

"Glitch tokens" are undertrained tokens in the model's vocabulary and are distinct from the ill-formed UTF-8 sequences we discuss, though the latter also cause glitches. [3] discusses what they call "partial UTF-8" tokens, defined as "tokens representing byte sequences that cannot be converted to Unicode characters [decoded, in our terms] as they contain only part of the full UTF-8 encoding for a character." This is a special case of our ill-formed tokens: ill-formed byte sequences need not contain any part of a UTF-8 encoding form. Ill-formed tokens are explicitly excluded from [3]'s experiments because they "are not suitable for building verification prompts", presumably because these prompts must be well-formed UTF-8 to work with existing interfaces (e.g. Huggingface's).

Our language of character-level and byte-level would clarify [3]. In particular, [3] section 3.1.1 describes character-level tokenizers with byte-fallback and points out the interesting bug that these vocabularies include both a single-character token and a byte-fallback token for all the ASCII characters, code points U+00 through U+7F, which UTF-8 encodes as themselves (see our Appendix B). This is a crucial bug in these models, precisely because it introduces glitch tokens by necessity, but one that is hard to understand because their paper does not discuss it this clearly.
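The redundancy is easy to verify, since UTF-8 encodes each ASCII code point as the identical single byte (a quick check of ours, not code from either paper):

```python
# Every code point up to U+007F is encoded by UTF-8 as itself, so a character
# token "A" and a byte-fallback token for 0x41 detokenize to the same byte.
assert "A".encode("utf-8") == bytes([0x41])
assert all(chr(cp).encode("utf-8") == bytes([cp]) for cp in range(0x80))
```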

ASCII byte-fallback tokens are a source of very glitchy tokens (see [2] Figure 23, in the Appendix) and one that is important to be explicit about for those models that use character-level tokenizers with byte-fallback. [2] exclude byte-fallback tokens from their Figure 9 in the main body of the paper, which figure reports the glitchiest tokens in the Llama 2 vocabulary. They drop these tokens presumably because they are unreachable [3], namely, they cannot be the product of ordinary tokenization. However, attackers that control the tokenization process could force these tokens into the input of a model in a way that would be concealed on the output, since the detokenization process converts these byte-fallback tokens into the bytes they represent.

We clarify this discussion in a novel, explicit way, and thereby contribute to the discourse by permitting the disambiguation of the role of bytes, characters, and byte-fallback tokens in tokenization.

The term “code points” is used multiple times (including in the abstract) before it is defined on p. 5. The authors should either define this in the introduction or use a different word to refer to this, which does not require definition.

We will clarify this in the next version of the paper. Thank you for the feedback.

Official Review
Rating: 7

This paper formalizes subword tokenization and treats issues that arise when converting back and forth between bytes and Unicode.

Reasons to Accept

The general formal approach is quite nice and the morphism requirement is a nice one. It's not clear what the value of adding the stochastic dimension is. It doesn't play any role here.

The homomorphism proof seems solid but would be easier to follow if homomorphism were actually defined explicitly.

The general assumption seems to be that we might encounter characters we've not seen before. Is this realistic? It's certainly true, but unless the text sample is really small, do these new characters ever really matter?

I was surprised to see no explicit contact with the finite-state literature. I think this is more opportunity than problem though.

Reasons to Reject

There are a number of undefined items, e.g. serving engine and foundation model.

The general proposal seems to be: have a buffer. Is this new?

It looks like a number of systems have strategies that are like what's proposed here, so it's not clear what this paper offers except perhaps a deeper understanding of the issue.

Comment

It's not clear what the value of adding the stochastic dimension is. It doesn't play any role here.

We include stochasticity to show that the problem is not unique to deterministic tokenizers; should one implement a tokenizer that returns multiple possible tokenizations, or dynamically retokenizes the input on the fly according to some assessment of the probability or semantic aptness of a given tokenization, the same problem of potentially ill-formed UTF-8 would be present. Although deterministic tokenizers are currently more common, it is possible that future LLMs will use stochastic tokenizers, which might result in more robust models [5, 10, 13].

Other works analyzing BPE (e.g. [11, 12]) assume that the algorithm is deterministic or have to cope with the ways it might not be. We include the mention of stochasticity to explicitly announce that we do not assume this about the tokenizers we study; deterministic tokenizers are included a fortiori because they are a special case of stochastic tokenizers where the probability distribution is such that exactly one tokenization has a probability of 1 and all others have a probability of 0.

The homomorphism proof seems solid but would be easier to follow if homomorphism were actually defined explicitly

Thank you for pointing out the inconsistency, which we shall resolve in the next version of the paper.

The general assumption seems to be that we might encounter characters we've not seen before. Is this realistic? It's certainly true, but unless the text sample is really small, do these new characters ever really matter?

Rare and out-of-vocabulary characters may arise in practice. We include the multiocular o ꙮ as an example in Table 2 because it appears once in all of written Old Church Slavonic but is nevertheless present in models' training data via Wikipedia. Even if the vocabulary does not contain a single token for ꙮ and has to break it down into bytes, those bytes have appeared in the model’s training set, which allows the model to make use of those tokens for meaningful inference.
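For illustration (ours, not code from the paper), the character's UTF-8 encoding spans three bytes, which a byte-level tokenizer may scatter across tokens that are each ill-formed on their own:

```python
# U+A66E, the multiocular O from Table 2: three UTF-8 bytes.
ch = "\N{CYRILLIC LETTER MULTIOCULAR O}"
print(ch.encode("utf-8").hex(" "))   # ea 99 ae
```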

There are a number of undefined items, e.g. serving engine and foundation model.

We will clarify the definition of these terms in the next version of the paper. Thank you for pointing this out.

I was surprised to see no explicit contact with the finite-state literature. I think this is more opportunity than problem though.

We considered and chose not to use the abstraction of transducers [8] for our description of tokenization.

The general proposal seems to be: have a buffer. Is this new? It looks like a number of systems have strategies that are like what's proposed here, so it's not clear what this paper offers except perhaps a deeper understanding of the issue.

Algorithm 1 is presented to document behavior that is already present in deployed systems. The contribution we offer is a formal proof that ill-formed UTF-8 sequences are always possible for models that have vocabularies containing tokens made up of ill-formed UTF-8, as well as a formalization of the relationship between tokens, bytes, and UTF-8 encoding forms.

References

Please see common response for references.

Comment

The point in the response about why the authors give a statistical version is good...and should be in the paper.

Comment

We will explain it more clearly in terms like the ones we used above in the next version of the paper. Thank you for the comment requesting clarification.

Official Review
Rating: 8

This paper investigates a subtle but critical issue in language model tokenization: how subword tokenizers, particularly byte-level ones, can violate UTF-8 validity and morphism properties when detokenizing outputs. Using a rigorous algebraic formalism based on monoids, the authors derive a fundamental impossibility result: no byte-level tokenizer with a finite vocabulary can simultaneously guarantee UTF-8 validity and full coverage of Unicode inputs. The authors introduce a theoretical framework that explains how incremental detokenization and streaming decoding break when using byte-level tokenization. They propose Algorithm 1 as a partial solution, evaluate it on real-world systems (SynCode, HuggingFace TGI, vLLM), and demonstrate bug fixes guided by the theory. The paper also categorizes tokenizers used by major foundation models and discusses implications for model serving, constrained generation, and Unicode safety.

Reasons to Accept

  1. Tackles a foundational systems issue underlying nearly all LMs.
  2. Presents novel formal results on tokenization morphisms using monoid theory. Derives rigorous proofs that relate tokenizer vocabulary composition to UTF-8 correctness.
  3. Demonstrates how theoretical results predict real bugs in existing model infrastructure. Offers a practical solution (Algorithm 1), with empirical validation on SynCode and inference engines.
  4. Provides a compelling mix of formal theory, practical systems analysis, and engineering recommendations.

Reasons to Reject

  1. Heavy reliance on abstract algebraic notation and formal language theory (monoids, morphisms) may limit accessibility for practitioners. It is difficult to follow the paper.
  2. Algorithm 1 is presented as a workaround, not a fundamental fix—it avoids errors but introduces unbounded memory risk. The mitigation strategies are not fully benchmarked in terms of latency or robustness across model sizes.
  3. While the theoretical analysis is excellent, the paper stops short of offering new tokenizer designs that inherently satisfy morphism and UTF-8 constraints.

Questions to the Authors

  1. Could your impossibility theorem be extended to cover other encodings (e.g., UTF-16, UTF-32), or does it only apply to UTF-8?
  2. Have you observed model output behavior (e.g., hallucinations, truncation) that can be directly attributed to UTF-8 detokenization errors?
  3. Could an embedding-level mitigation be proposed—e.g., enforcing character-level constraints during decoding without restricting the tokenizer vocabulary?
  4. How does Algorithm 1 affect streaming performance in real deployments? Are memory buffers manageable at inference scale?
Comment

Algorithm 1 is presented as a workaround, not a fundamental fix—it avoids errors but introduces unbounded memory risk.

The buffer in the pseudocode of the algorithm is an abstraction of an implementation trick whereby one indexes into the array of generated tokens to keep track of the last successfully decoded token. The memory cost of Algorithm 1 is no greater than the storage of the generated tokens, plus the storage for the pointer into the array.
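A minimal sketch of this pointer-based incremental detokenization, assuming a hypothetical `vocab` mapping from token ids to bytes (an illustration of the idea, not the paper's Algorithm 1 verbatim):

```python
def stream_decode(token_ids, vocab):
    """Yield decoded text incrementally; only an index into token_ids is kept."""
    last = 0  # index of the first token whose bytes have not yet been decoded
    for i in range(1, len(token_ids) + 1):
        pending = b"".join(vocab[t] for t in token_ids[last:i])
        try:
            text = pending.decode("utf-8")   # strict decode of the pending suffix
        except UnicodeDecodeError:
            continue                          # not yet well-formed; keep waiting
        last = i                              # advance the pointer
        if text:
            yield text

# e.g. list(stream_decode([0, 1, 0], {0: b"\xe0\xa4", 1: b"\xb3"})) yields ["ळ"]
# and leaves the trailing ill-formed token undecoded.
```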

While the theoretical analysis is excellent, the paper stops short of offering new tokenizer designs that inherently satisfy morphism and UTF-8 constraints.

We show that there is no simple solution to this problem: if the tokenizer is able to produce an arbitrary sequence of tokens from its vocabulary, and the vocabulary contains tokens made up of ill-formed UTF-8, then the model will be able to produce ill-formed UTF-8 sequences.

Could your impossibility theorem be extended to cover other encodings (e.g., UTF-16, UTF-32) or only applies to UTF-8?

No, it only applies to UTF-8. We will add a note clarifying this in the next version of the paper. UTF-8 is used by the vast majority of web sites, is the standard encoding used by programming language standard libraries, and is the format used to transmit the data sets used to train large language models; our work therefore focuses primarily on UTF-8.

Have you observed model output behavior (e.g., hallucinations, truncation) that can be directly attributed to UTF-8 detokenization errors?

One author experienced a crash in an interactive LLM application while generating the Coptic Small Letter Sima ⲥ. We believe this to have been caused by the interface's attempting to decode the model's output token-by-token, failing entirely because the letter was represented by multiple tokens, the first of which was ill-formed.

The mitigation strategies are not fully benchmarked in terms of latency or robustness across model sizes. How does Algorithm 1 affect streaming performance in real deployments?

The latency of the incremental decoding is negligible compared to the cost of generating tokens and can be performed in parallel with generation. Our tests on a MacBook M1 Pro with vLLM's and TGI's implementations of this algorithm result in a latency of less than 0.001s/token. On the same machine, generating a token with Qwen3-8b takes about 0.05s/token. The cost of detokenization is constant with regard to model size.

Are memory buffers manageable at inference scale?

The memory buffer in our algorithm is an abstraction of the implementation trick of indexing into the array of input ids the model has generated to keep track of the position of the last successfully decoded token; there is no memory overhead beyond storing the model's output itself and the constant memory required for the pointers.

Heavy reliance on abstract algebraic notation and formal language theory (monoids, morphisms) may limit accessibility for practitioners. It is difficult to follow the paper.

Thank you for the comment. A future revision will include more examples and clarify the introduction.

References

Please see common response for references.

Comment

Dear Authors,

I am fine with your responses. However, I would request that you add the limitations highlighted by me and confirmed by you in the paper clearly.

Comment

We will include the limitations clearly in the next version of the paper. Thank you for the comments.

Official Review
Rating: 7

This paper formalizes the non-morphism issue of detokenization in systems based on byte-level detokenization. It uses a formalism based on monoids and stochastic maps, which yields a formal impossibility result that can be used to explain issues in incremental detokenization, a problem prevalent in modern streaming services based on language models. The authors propose a potential solution to this issue using an algorithm which incrementally detokenizes tokens as long as they are well-formed UTF-8 and otherwise "collects" tokens until they form a well-formed sequence together with the previous ones collected. While this algorithm introduces other issues (e.g., unlimited memory costs), it nevertheless provides a simple approach to overcome detokenization issues encountered in current systems. This is highlighted with a concrete case study.

Reasons to Accept

  • formalization of detokenization issues encountered in byte-level tokenizers which are used by many state-of-the-art language models
  • algorithm on how to deal with issues encountered during incremental detokenization (albeit not feasible for production systems due to unlimited memory costs)
  • the paper follows a clear structure and is mostly well-written

Reasons to Reject

  • some more examples are needed alongside the formalized theoretical framework to help the reader better understand the formal definitions
  • The proposed solution (i.e. Algorithm 1) to the problem is not applicable to real-world systems due to unlimited memory costs (which can break a running system when confronted with a malicious user)

Questions to the Authors

  • I suggest to shortly explain the term "monoid" if you use it in the Introduction section
  • The transition from the explanatory example for decoding issues (Table 1) to stating the contributions can be improved; there should be a clear motivation explaining your work/contributions given the issues discussed before.
  • What do you exactly refer to with "cell" in the text explaining Table 1?
  • Can you clarify the difference between the monoid $\Sigma^{*}$ and $\Sigma^{*\wr\circ*}$ in Definition 7?
  • Contribution Nr 4. states that a fundamental problem is due to "incremental detokenization", but I don't see where this is inferred from the theoretical framework introduced in Section 3. This conclusion needs to be made more explicit.
  • In Algorithm 1, what is epsilon? How do you deal with a growing buffer that never sees a well-formed UTF-8 sequence?
Comment

Can you clarify the difference between the monoid $\Sigma^{*}$ and $\Sigma^{*\wr\circ*}$ in Definition 7?

The monoid $\Sigma^*$ is all possible finite sequences of members of $\Sigma$ (e.g. $\mathrm{abcd}$). The monoid $\Sigma^{*\wr}$ is the set of all possible finite sequences of members of $\Sigma$ with squiggles pre- and postpended (e.g. $\wr\mathrm{abcd}\wr$). $\Sigma^{*\wr\circ}$ is a finite subset of $\Sigma^{*\wr}$ and is intended to represent the vocabulary of a given tokenizer (e.g. $\{\wr\mathrm{abc}\wr, \wr\mathrm{def}\wr, \wr\mathrm{acf}\wr\}$). Then $\Sigma^{*\wr\circ*}$ contains all possible sequences of members of $\Sigma^{*\wr\circ}$ and is intended to represent all possible sequences that can be made using the vocabulary of a given tokenizer (e.g. $\wr\mathrm{abc}\wr\mathrm{acf}\wr$; note that we elide adjacent squiggles for readability).

In principle, we could say $\Sigma^{*\circ\wr}$, which would be a finite subset of $\Sigma^*$ with a squiggle pre- and postpended to each of its members, instead of $\Sigma^{*\wr\circ}$. We believe that the distinction does not matter formally, though the choice between them could be motivated by clarity of exposition.

The proposed solution (i.e. Algorithm 1) to the problem is not applicable to real-world systems due to unlimited memory costs (which can break a running system when confronted with a malicious user)

The memory buffer in Algorithm 1 is an abstraction of the implementation trick used in deployment of indexing into the array of generated input ids to keep track of where the last successfully-interpreted encoding form ends. The pointer is advanced when more tokens are successfully decoded. There is no memory overhead beyond storing the input ids the model has generated and the pointer into that array.

Some more examples are needed alongside the formalized theoretical framework to help the reader better understand the formal definitions.

Thank you for the comment. We will incorporate examples in the next version of the paper.

Contribution Nr 4. states that a fundamental problem is due to "incremental detokenization" but I don't see where this is inferred from the theoretical framework introduced in Section 3. This conclusion needs to be made more explicit.

In a future revision, this contribution will be rephrased to something like "We show that decoding a sequence of tokens as UTF-8 can violate formal assumptions about the properties of tokenizers, such as their being lossless or homomorphisms." Thank you for pointing out the inconsistency.

In Algorithm 1, what is epsilon?

The empty string. This will be stated more clearly in the next version of the paper.

How do you deal with a growing buffer that never sees a well-formed UTF-8 sequence?

Algorithm 1, as deployed, always returns an empty string in this case.

What do you exactly refer to with "cell" in the text explaining Table 1?

The cells in question are the vertical columns of the table, one for each code point. These are separated by space, but in the next version of the paper we will distinguish more clearly between them.

I suggest to shortly explain the term "monoid" if you use it in the Introduction section.

Thank you for the comment. We will clarify the introduction in the next version of the paper.

References

Please see common response for references.

Official Review
Rating: 3

This paper notices the problem of UTF-8 detokenization when using byte-level tokenizers. Essentially, it shows that any byte-level tokenizer that may produce invalid UTF-8 tokens can be potentially leaky and produce discrepancies between the raw byte strings obtained by concatenating two token sequences before detokenization versus detokenizing them separately and concatenating the results. The paper formalizes this using monoid theory and shows that any vocabulary that can lead to invalid UTF-8 tokens has this problem. The paper finally tries to mitigate this problem by introducing an incremental detokenization algorithm. A detailed case study is also conducted to show the practicality of the problem.

Reasons to Accept

  • The paper uses Monoid to formalize the problem and rigorously proves all the results
  • The case study of the paper is extensive and covers the majority of tokenizers used in popular models

Reasons to Reject

  • The problem itself is fairly simple and highly restrictive. The paper also acknowledges in the case study that many current tokenizers have already taken this into account.
  • The paper formalizes the problem with Monoid theory but makes little to no theoretical contribution in terms of nontrivial results or novel techniques.
  • The solution proposed in this paper is the standard incremental decoding algorithm, which is not novel at all.
  • The evaluation only compares with SynCode on a set of tasks with singular goals and doesn't compare with other works like GCD, Guidance, automata-based constraint decoding, etc.

Questions to the Authors

  • Why are Monoids necessary for illustrating the problem? Are there more straightforward approaches to show the existence of the problem?
  • What are the differences between the algorithm proposed and the incremental decoding algorithm that are implemented in, say, Python?
  • How are other constraint decoding schemes affected by this problem?
  • How does Algorithm 1 react when it never reaches a correct UTF-8 token? How resilient is Algorithm 1 in terms of the extent to which the raw byte string is erroneous?
Comment

Why are Monoids necessary for illustrating the problem? Are there more straightforward approaches to show the existence of the problem?

We found monoids to be a natural abstraction to formalize our problem. A monoid structure for a language (i.e. a space defined by a finite set of elements that can be concatenated to one another to form sequences) is already common in the literature on combinatorics on words. To give examples from the tokenization literature, the structure of monoids describes the assumptions [14, 15] make about their algorithms' inputs. The primary alternative formalization we are aware of would be to use transducers [8], which we believe would significantly complicate the presentation.

The solution proposed in this paper is the standard incremental decoding algorithm, which is not novel at all.

Algorithm 1 documents a solution already present in deployed tools such as vLLM, OpenLLM, and TGI. We provide it to document an ad hoc solution that arose to this problem in the wild. Our paper formalizes the problem of ill-formed UTF-8 in byte-level tokenizer vocabularies, which will always permit the generation of ill-formed byte sequences.

What are the differences between the algorithm proposed and the incremental decoding algorithm that are implemented in, say, Python?

Our Algorithm 1 is an abstract generalization of the implementations in Python used by the serving engines surveyed.

How does Algorithm 1 react when it never reaches a correct UTF-8 token? How resilient is Algorithm 1 in terms of the extent to which the raw byte string is erroneous?

Algorithm 1 will simply return the empty string indefinitely, never successfully decoding anything. This is in fact the behavior of deployed systems.

The evaluation only compares with SynCode on a set of tasks with singular goals and doesn't compare with other works like GCD, Guidance, automata-based constraint decoding, etc. How are other constraint decoding schemes affected by this problem?

We also examined Synchromesh and observed the same issue. Given the focus of our paper on a theoretical impossibility result, we focused our evaluation on these two popular tools. As we demonstrate, other constraint decoding schemes will also have to cope with incomplete or ill-formed UTF-8 in their inputs.

The problem itself is fairly simple and highly restrictive. The paper formalizes the problem with Monoid theory but makes little to no theoretical contribution in terms of nontrivial results or novel techniques.

Our main contribution is a proof that the presence of tokens containing ill-formed UTF-8 in a byte-level tokenizer's vocabulary makes it possible to produce sequences of tokens that are never valid UTF-8. This has real impacts in production, as acknowledged by reviewers uy5s, MUTr, x3kb, and cvmT. Our use of the monoid formalism to prove this property and our ability to predict and detect the presence of software bugs support the value and interest of our contribution.

References

Please see common response for references.

Comment

Our Algorithm 1 is an abstract generalization of the implementations in Python used by the serving engines surveyed.

I'm confused at this since the wordings in the paper (like "resolving leaky abstractions" in the title and "We introduce Algorithm 1" in section 4.1) indicate that the algorithm is one of the novel contributions by this work.

Our main contribution is a proof that the presence of tokens containing ill-formed UTF-8 in a byte-level tokenizer's vocabulary makes it possible to produce sequences of tokens that are never valid UTF-8.

I agree that the rigorous proof in this case would be the biggest contribution of the paper. However, the problem of decoding ill-formed tokens has already been shown in previous works (https://arxiv.org/pdf/2410.23684) and is, from my understanding, not a non-trivial problem to show by using the language of Monoids. Thus the proof for the problem might not be a sufficient main contribution of the work.

This has real impacts in production.

I agree that the problem highlighted in this paper is a real problem in production. However, as reviewer uy5s also pointed out, the paper does not make sufficient effort to demonstrate the empirical evidence of the severity of the problem, which, in this case, constitutes a necessary part of the paper for introducing the problem and motivating the theoretical proof.

Comment

You are correct that the wording in the paper overemphasizes the significance of Algorithm 1. We have already changed the title of the paper (to be reflected in the next version posted to openreview) to “Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8”. We believe this more accurately describes our contribution. We will also reframe our discussion of Algorithm 1 in the next version of the paper in response to your and other reviewers’ apt critique of our introduction of it. See our discussion with reviewer cwmT for improvements and expansions to Algorithm 1.

The paper you cite discusses the production of well-formed UTF-8 from pairs of tokens that are never collocated in the model’s training data. This pushes the model outside of its trained distribution and causes issues similar to those provoked by glitch tokens for similar reasons: models behave badly when forced to continue sequences radically unlike the sequences they were trained on. The paper by Jang et al. that you cite explicitly does not cope with ill-formed UTF-8, and neither do the papers on glitch tokens that reviewer uy5s introduced.

Our proof is that ill-formed UTF-8 sequences are possible wherever the model’s vocabulary contains tokens made up of ill-formed UTF-8, which is novel and distinct from the paper you cite. See our responses to reviewer uy5s for further discussion of this issue.

We maintain that the paper about improbable bigrams you and reviewer uy5s have pointed out, as well as the work on glitch tokens uy5s introduced, are distinct from ours, for these other works deal with well-formed though improbable sequences. Our work deals with ill-formed sequences, which the other works do not deal with.

The primary empirical evidence we introduce is constrained generation systems’ failure to cope with ill-formed UTF-8. Other breakage is also possible as a result of ill-formed UTF-8. The problems serving engines have faced provide one example, which we introduce in the paper.

Comment

Thank you again for the comments. As the conversation period draws to a close, we would like to ask whether your concerns have been resolved; if not, are there any further points we can clarify?

Comment

I thank the authors for the comments. However, I don't think the comments address the concern raised by both reviewer uy5s and me regarding the contribution of the paper. I will keep my original score.

Official Review
Rating: 7

The main thesis of this paper is that two tokenized sequences should detokenize to the same output whether they are combined or detokenized separately. The authors formalise BPE tokenization to prove that most modern byte-level tokenizers break this morphism assumption, and therefore can generate invalid UTF-8 sequences during decoding. The authors propose a simple mitigation algorithm (Algorithm 1) to solve token-level decoding to UTF-8 by queueing byte-level tokens until a valid UTF-8 character is formed. They show that this fix, already implemented in most major serving engines, works on a toy problem of emoji decoding when compared to naively decoding tokens to UTF-8 eagerly. The authors also do some analysis on broader tokenizers to categorise them into byte-level (Hugging Face and others) or character-level (SentencePiece).

Reasons to Accept

  • The morphism property that the authors outline is a relevant and interesting point to raise when thinking about and designing tokenizers, and has downstream consequences in broader LLM inference (for instance when streaming the tokens of a model directly to UTF-8).
  • The authors provide mathematical proofs to formalise the problem of current byte-level tokenization
  • The authors motivate the implementation of Algorithm 1 which is already present in several LLM inference libraries
  • The paper is well written and clear

Reasons to Reject

  • The main motivation behind this paper is to argue detokenization must be a morphism, because otherwise malformed UTF-8 is possible. However, as the authors note, this is important only in token-by-token streaming.
  • The Algorithm 1 that the authors propose has already been implemented by various different open-source implementations and is therefore not entirely novel. Additionally, as the authors themselves also note, this algorithm does not provide guarantees that the decoding process will lead to valid UTF-8 and only solves the problem for common issues. Ultimately, this paper provides theoretical and empirical background for an engineering trick that has already been implemented by different sources.

Questions to the Authors

Apologies for the lack of line numbers; I'm not sure why, but they were not present in your submission.

The biggest question I have with this is why not try to use constrained decoding to go further than Algorithm 1, and implement a grammar/constraints that guarantee no invalid UTF-8 sequences of tokens? One could mask out tokens with potential to create invalid sequences and thus expand Algorithm 1 further to entirely guarantee only valid UTF-8 during decoding.

You note that ablations on tokenizers are seldom performed but some previous work in multilingual (Liang et al., 2023) and code generation (Dagan et al., 2024) have done ablations at small scale -- though more looking at tokenizers from the perspective of compression.

Typos/Grammar: "tokeniation" -> "tokenization"; "vocabluary" -> "vocabulary"; "return the same cutting for a given input: pace Gastaldi et al.."

References:

Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, and Madian Khabsa. 2023. XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13142–13152, Singapore. Association for Computational Linguistics.

Gautier Dagan, Gabriel Synnaeve, and Baptiste Rozière. 2024. Getting the most out of your tokenizer for pre-training and domain adaptation. In Proceedings of the 41st International Conference on Machine Learning (ICML'24), Vol. 235. JMLR.org, Article 387, 9784–9805.

Comment

The main motivation behind this paper is to argue detokenization must be a morphism, because otherwise malformed UTF-8 is possible. However, as the authors note, this is important only in token-by-token streaming.

Ill-formed UTF-8 is not only important in token-by-token detokenization. It is possible for the model to take in or put out sequences of tokens that contain ill-formed UTF-8 at any point within the sequence. Attempting to interpret such a sequence of bytes as UTF-8 will lead to errors caused by ill-formed bytes.

Token-by-token streaming sees wide deployment: the serving engines TGI, OpenLLM, and vLLM all use token-by-token detokenization. Hugging Face Transformers' logits processors also work with model outputs token-by-token.

The Algorithm 1 that the authors propose has already been implemented by various different open-source implementations and is therefore not entirely novel.

As stated in the paper, Algorithm 1 is not our primary contribution. We contribute a mathematical formalization of the problem of ill-formed UTF-8 byte sequences, which can always be generated by a byte-level tokenizer's vocabulary. By presenting it to the community, we open the problem up to scientific exploration and debate.

This algorithm does not provide guarantees that the decoding process will lead to valid UTF-8 and only solves the problem for common issues.

We prove that there is no such guarantee where the model is allowed to generate tokens that are ill-formed UTF-8. The only way to guarantee that the model only generates well-formed UTF-8 at each step of generation is to make sure that each token in the vocabulary is well-formed UTF-8.
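A hedged sketch of that vocabulary-level check, assuming one can obtain the raw bytes behind each token id (the `vocab` mapping below is hypothetical, not an API of any specific tokenizer library):

```python
def is_wellformed_utf8(tok_bytes):
    try:
        tok_bytes.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def illformed_tokens(vocab):
    """Token ids whose bytes are not well-formed UTF-8 on their own."""
    return [tid for tid, tok in vocab.items() if not is_wellformed_utf8(tok)]

print(illformed_tokens({0: b"\xe0\xa4", 1: b"\xb3", 2: b"abc"}))  # [0, 1]
```

If this list is non-empty, the guarantee above cannot hold.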

The biggest question I have with this is why not try to use constrained decoding to go further than Algorithm 1, and implement a grammar/constraints that guarantee no invalid UTF-8 sequences of tokens?

Thank you for suggesting this approach. Constraining generation to only allow valid UTF-8 would prevent the model from generating any character whose bytes stretched across multiple tokens. When the first token containing the beginning of the character's bytes had been generated, the overall sequence would be ill-formed until the next tokens with the remaining bytes had been generated.

You note that ablations on tokenizers are seldom performed but some previous work in multilingual [6] and code generation [7] have done ablations at small scale -- though more looking at tokenizers from the perspective of compression.

Thank you for pointing out examples of ablations on tokenizers; we will incorporate these examples in the next revision of the paper. We note that none of the models in Table 3, which we believe to be a representative sample of contemporary large foundation models, underwent ablations on their tokenizers.

References

Please see common response for references.

Comment
The biggest question I have with this is why not try to use constrained decoding to go further than Algorithm 1, and implement a grammar/constraints that guarantee no invalid UTF-8 sequences of tokens?

Constraining generation to only allow valid UTF-8 would prevent the model from generating any character whose bytes stretched across multiple tokens. When the first token containing the beginning of the character's bytes had been generated, the overall sequence would be ill-formed until the next tokens with the remaining bytes had been generated.

I'm confused at your answer to this. Are you saying that constrained decoding would not solve token-by-token detokenization? In which case, I agree. However you could certainly combine constrained decoding with a buffer approach as in Algo. 1 to obtain additional guarantees about eventual UTF-8 validity?

Constrained decoding could condition the next-token prediction based on the current buffer of UTF-8 bytes and mask out any tokens which are not a valid continuation of UTF-8 bytes (partial or complete). Therefore it should be possible to ensure that you can still generate characters whose representation spread across multiple tokens.

Comment

You are correct on all counts. The following example is intended to ensure that we have the same approach in mind.

We will represent bytes by two-character hexadecimal values and use $\wr$ to separate tokens. Let our vocabulary contain the tokens $\{\wr\mathrm{E0\ A4}\wr, \wr\mathrm{B3}\wr\}$. Note that the sequence of bytes $\mathrm{E0\ A4\ B3}$ is well-formed UTF-8, but neither of the tokens in the vocabulary is well-formed on its own.

We would have to permit or prohibit a particular token based on whether appending that token to what has been generated so far would make a sequence that was a prefix of a well-formed UTF-8 sequence, rather than requiring that appending that token to what has been generated so far would make the whole sequence well-formed UTF-8.

For example, suppose the model has generated $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr$. For the next token, the constraint algorithm would have to permit $\wr\mathrm{E0\ A4}\wr$ and prohibit $\wr\mathrm{B3}\wr$, even though neither $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr\mathrm{E0\ A4}\wr$ nor $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr\mathrm{B3}\wr$ is well-formed UTF-8. This constraint is correct because $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr\mathrm{E0\ A4}\wr$ is a prefix of a well-formed UTF-8 sequence, but $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr\mathrm{B3}\wr$ is not.

Such an approach could guarantee that no token combined with previous tokens to make ill-formed UTF-8 but would not guarantee that the overall sequence was always well-formed UTF-8.

As for Algorithm 1, as you point out, the mask would be computed based only on the tokens in the buffer, and we could ignore the tokens that have already been successfully interpreted as characters.

For example, suppose that the model has already generated $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr\mathrm{E0\ A4}\wr$. Applying Algorithm 1, we would have already detokenized and decoded $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr$ as the character ळ and have $\wr\mathrm{E0\ A4}\wr$ in the buffer. The mask would be computed based only on the token $\wr\mathrm{E0\ A4}\wr$ in the buffer, and the previous tokens $\wr\mathrm{E0\ A4}\wr\mathrm{B3}\wr$ would be ignored.
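A sketch of this mask computation, using Python's incremental UTF-8 decoder as a prefix check (our illustration of the reviewer's suggestion, with a hypothetical `vocab` mapping token ids to bytes; not an implementation from the paper):

```python
import codecs

def utf8_prefix_ok(data):
    """True iff data is a prefix of some well-formed UTF-8 byte sequence."""
    dec = codecs.getincrementaldecoder("utf-8")()
    try:
        dec.decode(data, final=False)  # tolerates a truncated final character
        return True
    except UnicodeDecodeError:
        return False

def allowed_next_tokens(buffer_bytes, vocab):
    """Token ids whose bytes keep buffer + token a prefix of well-formed UTF-8."""
    return {tid for tid, tok in vocab.items() if utf8_prefix_ok(buffer_bytes + tok)}

vocab = {0: b"\xe0\xa4", 1: b"\xb3"}
print(allowed_next_tokens(b"", vocab))          # {0}: a bare continuation byte is barred
print(allowed_next_tokens(b"\xe0\xa4", vocab))  # {1}: only the completing byte is allowed
```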

Thank you for the suggestion; we will incorporate this as an expansion to Algorithm 1 in the next version of the paper along with benchmarks testing the overhead of such a constraint.

Comment

You are correct on all counts. The following example is intended to ensure that we have the same approach in mind.

Thank you, this confirms we are on the same page.

Your plan to expand Algorithm 1 further directly addresses my primary concerns with the novelty of your work. Any improvement to the status quo and the already widespread Algorithm 1 would help your paper stand on its own beyond the Monoid formalism. As such, I have raised my score and hope to see a final revision that improves on Algorithm 1.

Comment

Thank you for the updated score; we will make these changes in the next version of the paper.

Comment

Common response

We thank the reviewers for their time and their feedback. We are happy to see their interest in this topic and the different considerations in their responses. We believe that this discussion about systematic approaches to handling Unicode in general, and UTF-8 in particular, is necessary to the language modeling community. Our paper offers a starting point for this discussion and provides a terrain for further engagement with the issue.

References

[1] https://arxiv.org/pdf/2309.10668 "Language Modeling is Compression" (Delétang et al. 2024)

[2] https://arxiv.org/pdf/2402.14020 "Mesmerizing the Machine" (Geiping et al. 2024)

[3] https://aclanthology.org/2024.emnlp-main.649/ "Fishing for Magikarp" (Land et al. 2024)

[4] https://arxiv.org/pdf/2410.23684 “Improbable Bigrams Expose Vulnerabilities“ (Jang et al. 2024)

[5] https://arxiv.org/abs/2503.02174 "Adversarial Tokenization" (Geh et al. 2025)

[6] https://aclanthology.org/2023.emnlp-main.813/ "XLM-V" (Liang et al. 2023)

[7] https://arxiv.org/pdf/2402.01035 "Getting the most out of your tokenizer" (Dagan et al. 2024)

[8] https://arxiv.org/abs/2410.15696, "Tokenization as Finite-State Transduction", (Cognetta et al. 2024)

[9] https://arxiv.org/abs/2112.10508, "Between Words and Characters" (Mielke et al. 2021)

[10] https://aclanthology.org/2024.findings-emnlp.86.pdf "Tokenization Falling Short" (Chai et al. 2024)

[11] https://arxiv.org/abs/2412.03160 "Byte BPE Tokenization as an Inverse string Homomorphism" (Geng et al. 2024)

[12] https://github.com/github/rust-gems/blob/main/crates/bpe/README.md "bpe README" (Antwerpen et al. 2024)

[13] https://aclanthology.org/2022.acl-short.43.pdf "An Embarrassingly Simple Method" (Hofmann et al. 2022)

[14] https://aclanthology.org/P16-1162/ "Neural Machine Translation" (Sennrich et al. 2016)

[15] https://aclanthology.org/D18-2012/ "SentencePiece" (Kudo et al. 2018)

Final Decision

This paper presents a formal proof that byte-level tokenizers with ill-formed UTF-8 tokens can generate invalid UTF-8 sequences, which is an underexplored issue with practical implications in LLM serving and constrained decoding. The use of monoid theory provides a rigorous foundation that sheds light on a fundamental weakness in widely used tokenization pipelines. While the paper's scope is niche and its empirical evaluation limited, as noted by Reviewers Vin2 and uy5s, its contribution lies in surfacing a subtle yet impactful problem that has already manifested in real-world systems. The authors have addressed key reviewer concerns and clarified that their primary aim is to initiate broader discussion, not to propose a fully novel algorithmic solution. The writing could be improved to enhance clarity, particularly for a broader NLP audience unfamiliar with the formalism. Nonetheless, the work is likely to spark valuable discussions during the poster session and may serve as a stepping stone for more robust tokenizer design in future LLM systems. I lean toward acceptance.