PaperHub
Score: 5.5 / 10
Poster · 4 reviewers
Ratings: 2, 3, 3, 4 (min 2, max 4, std 0.7)
ICML 2025

Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

Our novel approach for learning interchangeable tokens generalizes to larger vocabulary sizes as demonstrated on a copying task and two logic tasks.

Abstract

Keywords

embedding methods, interchangeable tokens, extendable vocabulary, transformer models, linear temporal logic, propositional logic, formal reasoning, token generalization, alpha-equivalence, inductive bias, language models, symbolic representation, neural networks

Reviews and Discussion

Review
Rating: 2

The paper introduces a method for handling interchangeable tokens in neural networks, particularly for tasks involving formal logic and symbolic reasoning. The proposed method employs a token embedding strategy that combines a shared component for semantic consistency and a randomized component for token differentiation. The authors also define a metric called alpha-covariance to measure robustness against alpha-equivalent transformations. The method is evaluated on three tasks: copying with an extendable vocabulary, linear temporal logic (LTL) solving, and propositional logic assignment prediction. Results show improved generalization to unseen tokens and an inductive bias favoring alpha-equivalence.

Questions for Authors

  • Is it necessary to explicitly define the set of interchangeable tokens before training, or can the model learn to infer interchangeable tokens dynamically during the training process?

  • Would the proposed method still be effective if applied to tasks with human-readable variable names instead of abstract symbolic tokens?

Claims and Evidence

  • Traditional language models fail to recognize alpha-equivalence.

    • Evidence: Experiments on symbolic reasoning tasks demonstrate poor generalization with traditional embedding methods.
  • The proposed dual-part embedding enhances a model’s ability to handle interchangeable tokens.

    • Evidence: Evaluations on copying, LTL solving, and propositional logic tasks show performance improvements.

Methods and Evaluation Criteria

The authors adopt a Transformer encoder-decoder architecture with RoPE positional encoding.

Evaluation metrics include accuracy, edit distance, and alpha-covariance.

The proposed model is compared against learned embeddings and an alpha-renaming-based baseline.

Theoretical Claims

No theoretical claim was made in this work.

Experimental Design and Analysis

  • Experiments cover three tasks. Tests evaluate both in-distribution and out-of-distribution generalization.
  • There is a hyperparameter-search ablation exploring the impact of different embedding structures.

Supplementary Material

I did not review the supplementary material.

Relation to Broader Scientific Literature

No direct relation to broader scientific literature.

Essential References Not Discussed

The paper covers relevant literature on neural-symbolic reasoning and transformer-based formal logic models.

Other Strengths and Weaknesses

Strengths

  • The method is well-motivated with clear formalization.

  • Comprehensive experimental validation across multiple tasks.

  • Strong empirical results demonstrating out-of-distribution generalization.

Weaknesses

  • The embedding matrix is constructed with a fixed number of interchangeable tokens, which suggests that the set of interchangeable tokens must be predetermined before training.

  • The structure of symbolic expressions significantly differs from natural language expressions. This potentially limits the model’s applicability to tasks involving human-readable variables.

Other Comments or Suggestions

  • It would be interesting to explore whether this approach generalizes to broader NLP tasks beyond symbolic reasoning.

  • It would be useful to report computational efficiency metrics (e.g., training time, inference speed).

Author Response

We kindly thank the reviewer for their constructive criticism.

The embedding matrix is constructed with a fixed number of interchangeable tokens, which suggests that the set of interchangeable tokens must be predetermined before training.

This is the case for the alpha-renaming baseline, but not exactly true for our proposed method. The embeddings we propose are composed of two parts: a learnable common part, and a random part. Thanks to the random part, we can generate as many embeddings for the interchangeable tokens as needed, both in training and inference. The only limitation here is that when a discrete random generation method is used, the sampling set for the random part is naturally limited (Table 1 in the submission). However, since the sampling set grows exponentially as the number of dimensions increases, this is not a problem in practice.
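To make the two-part construction concrete, here is a minimal PyTorch-style sketch. It assumes the common and randomized parts are simply concatenated and that the randomized part is drawn from hypercube vertices (one of the randomization methods mentioned later in this thread); the exact composition, normalization, and uniqueness handling follow the paper, not this sketch.

```python
import torch

def make_interchangeable_embeddings(common, n_tokens, rand_dim, generator=None):
    """Sketch: build embeddings for n_tokens interchangeable tokens by pairing
    one shared learnable vector ("common") with a per-token random part drawn
    from the hypercube vertices {-1, +1}^rand_dim.

    Assumptions (not from the paper): the two parts are concatenated, and
    uniqueness is enforced by naive resampling of duplicates.
    """
    # Random part: one hypercube vertex per token; the sampling set has
    # 2**rand_dim elements, so it grows exponentially with rand_dim.
    rand = torch.randint(0, 2, (n_tokens, rand_dim), generator=generator) * 2 - 1
    # Naive uniqueness check: resample until all rows are distinct.
    while torch.unique(rand, dim=0).shape[0] < n_tokens:
        rand = torch.randint(0, 2, (n_tokens, rand_dim), generator=generator) * 2 - 1
    # Shared semantic part: the same learned vector repeated for every token.
    shared = common.unsqueeze(0).expand(n_tokens, -1)
    return torch.cat([shared, rand.to(common.dtype)], dim=-1)

# Usage: a single learnable common vector serves arbitrarily many variable tokens,
# so new interchangeable tokens can be added at inference time by sampling new rows.
common = torch.nn.Parameter(torch.randn(32))
emb = make_interchangeable_embeddings(common, n_tokens=10, rand_dim=32)
print(emb.shape)  # torch.Size([10, 64])
```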

The structure of symbolic expressions significantly differs from natural language expressions. This potentially limits the model’s applicability to tasks involving human-readable variables.

It would be interesting to explore whether this approach generalizes to broader NLP tasks beyond symbolic reasoning.

We agree that applying these ideas to natural language expressions requires vastly different considerations and is therefore left beyond the scope of this paper. Please also refer to our discussion with Reviewer u7Gr for more information.

It would be useful to report computational efficiency metrics (e.g., training time, inference speed).

We have used a variety of hardware for training. But among the models that we trained using NVIDIA H100, the average training durations are approximately as follows:

  1. Baseline: 2 hours 12 minutes
  2. Alpha-renaming Baseline: 2 hours 33 minutes
  3. Proposed method: 2 hours 29 minutes.

We performed a simple benchmark to analyze the runtime performance of our best LTL model using NVIDIA RTX A4000 (randomization method is “Hypercube Vertices”, with uniqueness checking enabled):

  • Preparing embeddings: 0.0003 seconds
  • Forward pass: 0.206 seconds
  • Autoregressive generation: 9.808 seconds

We used a batch size of 768 in both forward pass and autoregressive generation, as noted in Table 4. For autoregressive generation, we used beam search with a beam size of 3. These were the settings we used in all LTL experiments.

Note that as our method does not alter the internal architecture of the model, the only additional requirement is preparing the embeddings.

As the benchmark results indicate, the extra computational cost introduced by our method is negligible compared to running the model. Furthermore, during inference, it is sufficient to generate the embeddings once at the start.

Memory overhead: Our method does not increase the parameter count of the model. On the contrary, model parameters are actually reduced since only one common embedding is learned for all interchangeable tokens. Constructing the embedding matrix may require some extra memory (e.g., for generating the randomized part) but it’s in the same order of magnitude as the embedding matrix, which is quite small compared to the rest of the model parameters.

Is it necessary to explicitly define the set of interchangeable tokens before training, or can the model learn to infer interchangeable tokens dynamically during the training process?

The set of interchangeable tokens must be identified in advance. The model currently cannot learn which tokens are interchangeable.

Would the proposed method still be effective if applied to tasks with human-readable variable names instead of abstract symbolic tokens?

Since enforcing alpha-equivalence using our method would remove the word meanings from interchangeable tokens (the only discerning property of interchangeable token embeddings is the randomized part), there is a trade-off between exploiting alpha-equivalence and reasoning from natural language semantics. We have also elaborated on this in our discussion with Reviewer u7Gr.

Overcoming the said trade-off is an interesting research direction as noted in the conclusion section of our submission.


After the rebuttal period, we will update our paper according to this discussion.

Review
Rating: 3

The paper introduces a novel method for handling interchangeable (semantically equivalent yet distinct) tokens in a language model. The proposed approach uses a dual-part token embedding strategy consisting of a shared semantic component and a randomized component to ensure distinguishability. The paper also introduces a new metric, alpha-covariance, to evaluate robustness to alpha-equivalent transformations (where renaming bound variables preserves meaning). The proposed approach is evaluated on three tasks and is shown to generalize better than the baseline methods.

Questions for Authors

  1. Can you add details on memory and time overhead?
  2. Can you discuss the scalability tradeoffs as you scale to larger vocabularies beyond a few hundred thousand tokens?
  3. Does this approach work for larger models like GPT-4?

Claims and Evidence

The paper makes the following claims. Claims with sufficient support:

  1. Language models struggle with interchangeable tokens and alpha-equivalence: experiments shown in Table 3 support the claim.
  2. The proposed dual-embedding strategy helps improve generalization to larger vocabularies: results in Table 2 and Figure 2 support the claim.
  3. The model introduces an inductive bias that makes it robust to data perturbations: results in Table 2 support this.

Claims that require further justification:

  4. The method generalizes better than data augmentation: this claim needs stronger justification, as the alpha-renaming method performs better than the proposed method in Table 2, and a deeper analysis is required to justify it.
  5. The random embedding approach is critical to success: proper ablation studies are required to make this claim stronger.
  6. The proposed approach is computationally efficient: a runtime analysis is required to compare with standard embeddings.

Methods and Evaluation Criteria

Yes, the overall framework makes sense; however, there is room for improvement:

  1. The proposed approach is evaluated on three tasks which are relevant to the problem. However, it would be good to add a real-world dataset in addition to these synthetic datasets.
  2. The paper proposes a new metric, alpha-covariance, which is relevant for the problem at hand. However, it should also discuss the memory and time overheads of the proposed approach relative to baselines.

Theoretical Claims

No

Experimental Design and Analysis

No

Supplementary Material

No

Relation to Broader Scientific Literature

The paper has two key contributions:

  1. Dual-embedding strategy to scale to unseen tokens. This is one of the ways to expand the vocabulary in specific settings such as logic and reasoning with alpha-equivalent representations. This extends vocabulary without retraining, enabling systematic generalization.
  2. Alpha-covariance: A new metric to measure robustness to alpha-equivalence: This provides a new way to measure symbolic consistency.

Essential References Not Discussed

NA

Other Strengths and Weaknesses

Strengths:

  1. The paper is well-written and easy to follow.
  2. There are extensive experiments on 3 datasets to support the claims.

Weaknesses:

  1. There are no results on large-scale language models to see if the proposed approach works for larger, GPT-4-like models.
  2. The paper evaluates on synthetic datasets, which raises concerns about whether the proposed approach generalizes to real-world applications.
  3. The paper does not discuss the scalability tradeoffs, specifically as we scale to extremely large vocabularies.

Other Comments or Suggestions

  • Section 3, Line 94: V* is not defined.
  • Section 3: V_i is not defined.
  • Typo, Section 4.2, Line 159: "need make sure" -> "need to make sure".
  • Typo, Abstract, Line 37: "toward" -> "towards".

Author Response

We kindly thank the reviewer for their detailed analysis.

The method generalizes better than data augmentation: This claim needs stronger justification as the alpha-renaming method performs better than the proposed method in Table 2 and a deeper analysis is required to justify this.

We would like to emphasize that the alpha-renaming baseline in Table 2 was trained using only 5 APs instead of 10 (as in Fig. 3) since the purpose of this experiment was to evaluate alpha-covariance rather than generalization. In other words, all models in Table 2 were trained and evaluated on in-distribution data. Therefore, we believe we can conclude that although alpha-renaming achieves better performance on in-distribution data, the proposed method generalizes better to out-of-distribution data, as seen in Fig. 3.

The random embedding approach is critical to success: Proper ablation studies are required to make this claim stronger

Firstly, we note that in Appendix E.3, alongside the hyperparameter search over embedding hyperparameters, we performed ablation studies on the copying task by disabling the normalization features and AdaCos.

Here, we perform a similar experiment on the two logic tasks. We use the best hyperparameters that we found and used in Fig. 3. We train each model multiple times and report the test-split results for the model that performs best on the validation set. The values we report in this section correspond to the overall accuracy values reported in Fig. 3. We ablate one aspect at a time, except for $f_{fn}$, because AdaCos requires $f_{fn}$ to function correctly.

LTL task

  • Proposed Model (Fig. 3a): 92.37%
  • Without AdaCos: 84.69%
  • Without $f_{fn}$ & AdaCos: 84.04%
  • Without $f_{bn}$: 32.27%

Propositional logic task

  • Proposed Model (Fig. 3b): 80.72%
  • Without AdaCos: 81.13%
  • Without $f_{fn}$ & AdaCos: 80.41%
  • Without $f_{bn}$: 15.05%

The results indicate that $f_{bn}$ is absolutely necessary for both tasks. This is intuitive because it keeps the balance between the common and randomized parts of the embedding. $f_{fn}$ is less important but still helpful. The performance of AdaCos varies depending on the task. On LTL, ablating AdaCos decreases the accuracy from 92.37% to 84.69%. However, on propositional logic, AdaCos does not make a clear difference in accuracy.
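For readers trying to picture what is being ablated: the exact definitions of $f_{bn}$ and $f_{fn}$ are given in the paper, not in this thread. The sketch below is a rough, purely illustrative guess, assuming $f_{bn}$ rescales the common and randomized parts to comparable norms before they are combined and $f_{fn}$ normalizes the final embedding.

```python
import torch
import torch.nn.functional as F

def balance_and_normalize(shared, rand):
    """Illustrative guess only; the real f_bn and f_fn are defined in the paper.
    f_bn-like step: rescale the shared and random parts to equal L2 norm so
    neither part dominates. f_fn-like step: normalize the combined embedding."""
    shared = F.normalize(shared, dim=-1)   # unit-norm shared (common) part
    rand = F.normalize(rand, dim=-1)       # unit-norm randomized part
    emb = torch.cat([shared, rand], dim=-1)
    return F.normalize(emb, dim=-1)        # final normalization of the whole embedding

shared = torch.randn(10, 32)   # illustrative shared parts for 10 tokens
rand = torch.randn(10, 32)     # illustrative randomized parts
print(balance_and_normalize(shared, rand).shape)  # torch.Size([10, 64])
```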

Can you add details on memory and time overhead?

Please see our discussion with Reviewer xcub.

Can you discuss the scalability tradeoffs as you scale to larger vocabularies beyond a few hundred thousand tokens?

Unlike other approaches, our method only needs to learn a single common part for all interchangeable tokens, even if there are thousands of them. However, the hyperparameter choices regarding the randomized embedding part require some special considerations. Notable scalability concerns include:

  1. Sampling set size of the random generation method: As outlined in Table 1, using a discrete random generation method leads to a limited sampling set size. However, since it grows exponentially with the number of embedding dimensions, this can be addressed by simply increasing the dimensions of the random embeddings. Nevertheless, larger embedding dimensions may call for an overall increase in model size.
  2. Ensuring the uniqueness of the random embeddings: Increasing the number of embedding dimensions makes uniqueness checks harder (see line 190). At the same time, however, the larger sampling set means the probability of generating the same embedding twice becomes essentially negligible (a quick back-of-the-envelope sketch follows this list).
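To make point 2 concrete, here is a quick back-of-the-envelope check using the standard birthday-problem approximation. The numbers are hypothetical and the paper's own analysis may differ.

```python
import math

def collision_probability(n_tokens, rand_dim):
    """Approximate probability that at least two of n_tokens independently
    sampled hypercube-vertex embeddings in {-1, +1}^rand_dim coincide,
    via the birthday-problem approximation p ~ 1 - exp(-k(k-1) / (2N))."""
    sampling_set_size = 2.0 ** rand_dim      # size of the discrete sampling set
    k = n_tokens
    return -math.expm1(-k * (k - 1) / (2.0 * sampling_set_size))

# Hypothetical numbers: even a few hundred thousand interchangeable tokens
# are extremely unlikely to collide once rand_dim is modestly large.
print(collision_probability(300_000, 64))    # ~ 2.4e-09
print(collision_probability(300_000, 128))   # ~ 1.3e-28
```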

Does this approach work for larger models like GPT-4?

In principle, the proposed approach has no inherent limitation that prevents it from being used with larger models, but there are several aspects that need to be considered:

  1. Natural language: As discussed in our response to Reviewer u7Gr, natural languages don’t conform to alpha-equivalence, which leads to a trade-off between enforcing alpha-equivalence and preserving variable name semantics.
  2. Identifying interchangeable tokens: Our method requires interchangeable tokens (variables) to be identified in advance since their embeddings are handled in a special way.
  3. Training from scratch: Since our method changes the architecture of the embeddings, it requires training the whole model from scratch. However, future work may potentially address this.

After the rebuttal period, we will update the paper according to this discussion.

Review
Rating: 3

The authors propose an interesting token embedding technique to handle alpha-equivalent programs—those that share the same semantic meaning but differ in surface representation. By employing a dual-part token representation, the embedding effectively captures both semantic consistency and token distinguishability. Compared to the baseline, which relies on data augmentation with alpha renaming, this approach demonstrates improved generalizability in reasoning outcomes.

Questions for Authors

See Methods And Evaluation Criteria and Claims And Evidence

Claims and Evidence

Claim 1. Identify the challenge of generalizing to larger vocabularies in (formal) language modeling tasks and define an experimental protocol to systematically study this problem.

Question: Does the formal language include natural elements, such as comments and textual descriptions? If so, could the embedding approach lead to representational conflicts when switching between these contexts?

Claim 2. Propose alpha-covariance, a novel metric for measuring robustness against alpha-conversions.

Question: Alpha-covariance appears to be defined exclusively for LTL formulas. Is it a highly specialized metric that applies only to LTL programs?

Claim 3: Introduce a dual-part embedding method to enhance vocabulary generalization and improve alpha-covariance.

This claim is well supported.

Claim 4: Verify the proposed method thoroughly on three tasks: copying with extendable vocabulary, solving LTL formulae, and predicting assignments for propositional logic.

Question: The applicability of this embedding technique seems limited to synthetic datasets, particularly those derived from PropRandom35 in DeepLTL. To strengthen its impact, it would be valuable to demonstrate its effectiveness on a real-world LTL dataset that includes human annotations and comments.

Methods and Evaluation Criteria

The dual embedding method is conceptually sound, but I believe there are two key aspects that the evaluation has not sufficiently addressed:

  1. Handling domain shift in evaluation: In the current setup, the baseline model is trained on a perturbed dataset but tested on the original test set. Why not apply an actual alpha-conversion step as a preprocessing step during testing to mitigate domain shift and ensure a more direct evaluation of robustness?

  2. Comparison with general-purpose code language models: How do existing large-scale code models, such as Codex, perform on these tasks? Evaluating their performance could provide a stronger baseline and contextualize the effectiveness of the proposed approach.

Theoretical Claims

N/A

Experimental Design and Analysis

See Methods And Evaluation Criteria and Claims And Evidence

Supplementary Material

I have read the appendix for more experimental details.

Relation to Broader Scientific Literature

This work is generally related to the field of program modeling.

Essential References Not Discussed

[1] Finnie-Ansley, James, et al. "The robots are coming: Exploring the implications of OpenAI Codex on introductory programming." Proceedings of the 24th Australasian Computing Education Conference. 2022.

Other Strengths and Weaknesses

The paper is clearly written and easy to follow.

Other Comments or Suggestions

N/A

Author Response

We kindly thank the reviewer for their comments.

Does the formal language include natural elements, such as comments and textual descriptions? If so, could the embedding approach lead to representational conflicts when switching between these contexts?

The applications we considered in this paper do not include natural language comments or descriptions.

Alpha-covariance appears to be defined exclusively for LTL formulas. Is it a highly specialized metric that applies only to LTL programs?

No, alpha-covariance is not specific to the LTL domain. It is applicable in any domain where alpha-equivalence is relevant. Most prominently, this includes formal languages where user-defined variables are present, such as programming languages. For example, we could use alpha-covariance to measure how sensitive a model is to renaming bound variables in a Python programming task.
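The precise definition of alpha-covariance is given in the paper. Purely to illustrate the point that it can be computed in any domain with renameable bound variables, here is a sketch under the assumption that the score is the average agreement of a model's outputs across variable renamings; `model` is a hypothetical callable and the whitespace-based `rename` helper is an illustrative placeholder.

```python
from itertools import permutations

def rename(text, mapping):
    """Token-wise variable renaming on a whitespace-tokenized string (illustrative)."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())

def alpha_covariance(model, samples, variables):
    """Illustrative sketch only; the paper defines the actual metric.
    For each sample, apply every permutation of the interchangeable variable
    names and score how often the model's output matches its output on the
    original input after the same renaming. `model` is a hypothetical callable
    mapping an input string to an output string."""
    scores = []
    for formula in samples:
        reference = model(formula)
        agree = [
            model(rename(formula, dict(zip(variables, perm))))
            == rename(reference, dict(zip(variables, perm)))
            for perm in permutations(variables)
        ]
        scores.append(sum(agree) / len(agree))
    return sum(scores) / len(scores)
```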

To strengthen its impact, it would be valuable to demonstrate its effectiveness on a real-world LTL dataset that includes human annotations and comments.

Since our method randomizes the discernable parts of the interchangeable tokens to enforce alpha-equivalence, applying our method as-is in a natural language setting would remove the discerning semantics from the variable names. This is an important trade-off as we elaborated further in our discussion with Reviewer u7Gr. As our focus in this work was formal languages, we left this direction for future work.

Handling domain shift in evaluation: In the current setup, the baseline model is trained on a perturbed dataset but tested on the original test set. Why not apply an actual alpha-conversion step as a preprocessing step during testing to mitigate domain shift and ensure a more direct evaluation of robustness?

Firstly, we would like to emphasize that the baseline model was trained on a perturbed dataset in only one experiment. Since the purpose of this experiment is to measure generalization capabilities, introducing a perturbation allows us to see how models perform when the training dataset is biased.

In the rest of the experiments, the baselines were not subjected to any perturbations during training.

Applying alpha-conversion as a preprocessing step during testing, as suggested by the reviewer, is performed in alpha-covariance experiments.

Comparison with general-purpose code language models: How do existing large-scale code models, such as Codex, perform on these tasks? Evaluating their performance could provide a stronger baseline and contextualize the effectiveness of the proposed approach.

We thank the reviewer for this suggestion.

We evaluated Llama 3.2 (3B ollama version) on the LTL task. In the prompt, a brief description of the task and several examples were given. The LLM is asked to generate the output trace only. Unlike our models, the input formulas and the output traces are encoded in infix notation instead of prefix (Polish) notation since general-purpose LLMs are more familiar with infix notation due to its popularity in natural language.
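For reference, the snippet below is a hedged sketch of this kind of evaluation loop, assuming a local ollama server exposing its standard /api/generate endpoint. The prompt text, few-shot examples, and the `check_trace` semantic checker are illustrative placeholders, not the authors' actual setup.

```python
import requests

PROMPT_TEMPLATE = """You are given a linear temporal logic (LTL) formula in infix notation.
Produce only a satisfying symbolic trace for it.

{examples}

Formula: {formula}
Trace:"""

def query_llm(formula, examples, model="llama3.2:3b"):
    """Illustrative only: queries a local ollama server (default port 11434).
    The real prompt, few-shot examples, and answer parsing are not reproduced here."""
    prompt = PROMPT_TEMPLATE.format(examples=examples, formula=formula)
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"].strip()

def evaluate(dataset, examples, check_trace):
    """check_trace(formula, trace) -> bool is a hypothetical semantic checker:
    a predicted trace counts as correct if it satisfies the formula."""
    correct = sum(check_trace(f, query_llm(f, examples)) for f in dataset)
    return correct / len(dataset)
```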

Results on LTLRandom35 Test Split (over 10000 samples):

  • Correct: 25.3%
  • Alpha Covariance 3, 4, 5 AP: 87.83%, 83.48%, 86.69%. (All permutations of 100 samples for each AP)

These results should be compared to those in Table 2, cf. Proposed - Correct: 95.94%.

As a larger example, we also evaluate Llama 3.2 using the same dataset and sample size as in Fig. 3a from the paper. We cannot include images in this rebuttal, but the overall accuracy is 21.38%, cf. 92.37% (Proposed). Although Llama 3.2 is a small model by LLM standards, its size still dwarfs all dedicated models we evaluated in the manuscript. Despite this, as a general-purpose model, it performs much worse than the dedicated models since LTL is a highly specialized domain. Even though LLM performance could be improved by using larger models and/or various CoT techniques, we hope that these results help put the results from the paper into context.

After the rebuttal period, we will update the paper according to this discussion.

Reviewer Comment

I appreciate the authors' effort in clarifying my questions, and I have raised my score accordingly.

Review
Rating: 4

The paper proposes a method for handling interchangeable tokens—like bound variables—in language models by splitting token embeddings into a shared semantic part and a unique random part. This improves generalisation to unseen vocabularies and supports alpha-equivalence. Evaluated on copying tasks, LTL solving, and propositional logic, the approach shows improved robustness and consistency, particularly under vocabulary shifts.

Questions for Authors

If I understand correctly, the proposed techniques are designed for problems—like LTL and propositional logic—where the grammar is clear or simple enough to reliably identify bound variables. How well do you think these techniques would generalise to settings where that’s not the case—for example, natural language tasks without well-defined grammar, or programming languages where the abstract syntax tree isn’t readily available?

Claims and Evidence

Yes, both full vocabulary and alpha-renaming are natural approaches to address the challenge of extended vocabularies in the context of alpha-equivalence. The fact that the proposed method performs competitively with these strategies—despite being trained on a limited vocabulary—is impressive and clearly highlights its value.

Methods and Evaluation Criteria

Overall, it’s very good. It would be even more interesting to see the proposed method validated on a real-world programming task where alpha-equivalence naturally occurs—but I’m already quite satisfied with the current experimental setup.

Theoretical Claims

N/A

Experimental Design and Analysis

Looks good to me.

Supplementary Material

The appendix looks good to me, though I haven’t run the code myself.

Relation to Broader Scientific Literature

Bound variables and alpha-equivalence are common in real-world mathematical reasoning and programming tasks. I fully appreciate the value of this paper in addressing challenges related to unseen variables.

Essential References Not Discussed

No to the best of my knowledge.

Other Strengths and Weaknesses

  • The paper is well written, easy to follow, and has illustrative figures/experiments to validate its hypothesis.
  • I really appreciate the introduction of alpha-covariance, which quantifies the model's ability to tackle alpha-conversions.

Other Comments or Suggestions

N/A

Author Response

We kindly thank the reviewer for their positive feedback.

If I understand correctly, the proposed techniques are designed for problems—like LTL and propositional logic—where the grammar is clear or simple enough to reliably identify bound variables. How well do you think these techniques would generalise to settings where that’s not the case—for example, natural language tasks without well-defined grammar, or programming languages where the abstract syntax tree isn’t readily available?

It is correct that the method we proposed in this paper needs bound variables to be identified. As the reviewer noted, the situations in which this is not possible can be divided into two categories.

1. Natural language

We believe that applying alpha-equivalence in natural language settings requires different considerations. Let us demonstrate this through an example that involves programming with natural language variable names.

Consider the functions get_total_bill(electricity_bill, water_bill) = electricity_bill + water_bill and get_end_time(start_time, duration) = start_time + duration. Although these two functions perform the same operation, the additional meaning conveyed through the variable names may be relevant in a larger context. Therefore, alpha-equivalence does not hold in natural language statements in a strict sense since these variable names are not truly interchangeable. A function like get_end_time(start_time, water_bill) = start_time + water_bill would be nonsensical despite being functionally equivalent to the aforementioned examples.

In addition, each word carries connotations in addition to the primary, literal meaning. For instance, many programmers use the variable name i for iterating over indices. Although renaming this variable would result in a functionally equivalent code, it may confuse an LLM.

To summarize, since our method depends on randomizing the discernable parts of the interchangeable tokens, a trade-off arises between enforcing alpha-equivalence and preserving semantics/connotations in natural language contexts. As a result, application to natural language—albeit beyond the scope of this paper—is an interesting direction for future work as we noted in the conclusion section.

2. The abstract syntax tree (AST) isn’t readily available

We thank the reviewer for highlighting this interesting point. Since our focus in this work is formal languages, we believe it is reasonable to assume that the AST can be parsed as needed. In addition, in many formal languages, identifying the bound variables does not require parsing the full AST either. For example, our implementation does not actually depend on the AST being available to detect interchangeable tokens because the set of AP (atomic proposition) tokens is predetermined. Similarly, if an alternative preprocessing approach can identify the bound variables in other languages, the AST may not be needed.
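As a small illustration of that last point (not the authors' implementation): in a language like Python, where a parser ships with the standard library, bound names can be collected without any custom AST machinery.

```python
import ast

def bound_variables(source):
    """Illustrative only: collect names bound by function parameters and
    assignment/loop targets in a Python snippet. The paper's LTL and
    propositional tasks instead use a predetermined set of atomic-proposition
    tokens, so no AST is needed there."""
    bound = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.arg):                 # function parameters
            bound.add(node.arg)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)                        # assignment / loop targets
    return bound

print(bound_variables("def f(x, y):\n    z = x + y\n    return z"))
# {'x', 'y', 'z'}
```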

Final Decision

The reviewers all agree on the utility of the proposed method for encoding interchangeable tokens within a language model. Identified strengths include:

  • The paper is well written and easy to follow.
  • The alpha-covariance metric is appropriate for evaluating robustness to renaming of interchangeable tokens.
  • The claim that existing language models struggle with interchangeable tokens and alpha-equivalence is clear and supported by experimental results.
  • The proposed dual-embedding strategy is shown to improve generalization to larger vocabularies.
  • Training on the full vocabulary and alpha-renaming are natural approaches to address the challenge of extended vocabularies in the context of alpha-equivalence, and hence are appropriate baselines. The proposed method, given its limitations, performs well against these baselines in experiments.

The reviewers raise some questions and limitations including:

  • A number of questions about how this method can be extended to more naturalistic data settings, including those without a well-defined grammar/syntax.
  • The claim that the proposed method generalizes better than data augmentation needs stronger justification, as the alpha-renaming method performs better than the proposed method in Table 2, and the work would benefit from a deeper analysis of this.
  • The claim that the random embedding approach is critical to success would benefit from proper ablation studies.
  • A runtime analysis to compare with standard approaches would strengthen the paper.

There is sufficient approval among reviewers and agreement on the benefits to warrant acceptance.