PaperHub
Rating: 4.8/10 · Poster · 3 reviewers (scores: 2, 4, 2; min 2, max 4, std 0.9)
ICML 2025

Improving LLMs for Recommendation with Out-Of-Vocabulary Tokens

OpenReview | PDF
Submitted: 2025-01-23 · Updated: 2025-08-12

Abstract

Keywords
multi-task recommender;out-of-vocabulary token;user representation

Reviews and Discussion

Review (Rating: 2)

The paper proposes META ID, an Out-Of-Vocabulary (OOV) tokenization mechanism for improving user/item ID representation in LLM-based recommendation systems. While traditional methods struggle with token diversity and semantic conflicts in token representation, the proposed META ID OOV tokens, generated by clustering meta-path representations of historical user-item interactions, improve both memorization and diversity. The method is evaluated across various recommendation tasks (sequential recommendation, direct recommendation, rating prediction, explanation generation, and review summarization), outperforming traditional ID-construction methods. Experiments also show that LLM-based recommenders incorporating META ID tokens outperform popular non-LLM recommenders.

Questions for Authors

Besides the questions in previous sections,

  1. How much computational overhead does generating meta-paths and performing clustering introduce? Will it be too computationally intensive for extremely large datasets?

Claims and Evidence

  1. In Table 1 the authors compare their proposed method with other sequential recommenders to demonstrate its superiority. One issue is that the other methods are not LLM-based, which makes it hard to tell where the major gain comes from, i.e., the compute and world knowledge the LLM offers, or the proposed META ID. The following data points should also be presented to justify the conclusion:
  1. Training- and inference-efficiency numbers for these methods
  2. Comparisons with LLM-based recommenders

Methods and Evaluation Criteria

Theoretical Claims

Experimental Design and Analysis

  1. See the first point in "Claims and Evidence".

  2. The effectiveness of certain parts of the model is not fully discussed, such as the performance difference with and without the Linear Transformation layer.

  3. As a tokenization method, more discussion of ID-space collisions and of the relationship between token size and the number of distinct IDs would help validate the proposed method. Figure 4 only presents the performance of different token sizes on 3 categories of the dataset, which has the following limitations:

     3.1 As shown in Table 9, the numbers of distinct IDs are similar across the 3 categories, so how the required token size grows with the number of distinct IDs is unclear.

     3.2 The experiments are done on each category separately. How will the proposed method perform when ID spaces become more complex (e.g., mixed categories)?

Supplementary Material

Relation to Prior Work

Essential References Not Discussed

Other Strengths and Weaknesses

Other Comments or Suggestions

Author Response

We thank the reviewer for the thoughtful comments and valuable suggestions. Below, we address each concern in detail. Overall, we emphasize that our method compares fairly with LLM-based recommenders using equivalent architectures, maintains high efficiency, and introduces minimal computational overhead. We have also added further explanation and data to clarify the role of token size, META ID contributions, and preprocessing costs.

Q1: Comparisons with LLM-based recommenders

A1: We would like to clarify that our method does compare against LLM-based recommenders. Specifically, in Lines 300-303, we compare with TIGER [1] (based on the T5X architecture), as well as RID, SID [2], and CID [3] (all based on the T5 architecture). For fairness, META ID also uses the T5 architecture, with only a 0.7% parameter increase over vanilla T5, and achieves superior performance over these LLM-based recommenders.

To further evaluate the feasibility and utility of META ID, we also include experiments using a mid-sized LLM (LLaMA2-7B), which strikes a balance between performance and training cost.

Additionally, tasks such as explanation generation and review understanding (Tables 2 and 3) rely heavily on language comprehension, which non-LLM models typically cannot handle. By leveraging the LLM, META ID effectively bridges structured and unstructured semantics, enabling multi-task capability beyond traditional non-LLM models.

 

Q2: Training and inference efficiency

A2: We appreciate the reviewer’s suggestion regarding efficiency. While our primary focus is on inference (as it is critical for deployment), we report inference FLOPs for all LLM-based methods under the same input conditions:

  • RID: 7.74 GFLOPs
  • SID: 7.74 GFLOPs
  • CID: 7.77 GFLOPs
  • TIGER: 8.02 GFLOPs
  • META ID (with T5 backbone): 7.83 GFLOPs

Regarding training efficiency, we note that the computational graph is structurally similar between training and inference. As such, the relative efficiency trends are preserved during training.

 

Q3: Performance difference with and without the Linear Transformation layer

A3: As shown in Figure 6, removing the Linear Transformation layer (i.e., using random initialization) results in a performance drop. This confirms that the transformation layer contributes to more effective representation learning.
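For concreteness, the following minimal PyTorch sketch shows what this kind of initialization can look like (our illustrative reconstruction, not the authors' exact implementation; all names and dimensions are placeholders): cluster centroids from the meta-path space are passed through a learnable linear layer to seed the OOV token embeddings, while the ablated variant would initialize the same rows randomly.

```python
import torch
import torch.nn as nn

# Placeholder dimensions (illustrative, not from the paper).
meta_dim, llm_dim, num_oov = 128, 512, 1319

# Cluster centroids computed from meta-path representations (dummy data here).
centroids = torch.randn(num_oov, meta_dim)

# Learnable linear transformation into the LLM embedding space.
proj = nn.Linear(meta_dim, llm_dim)

oov_embeddings = nn.Embedding(num_oov, llm_dim)
with torch.no_grad():
    # "w/ Linear Transformation": seed OOV rows with projected centroids.
    oov_embeddings.weight.copy_(proj(centroids))
    # "w/o" (random init) would instead be: nn.init.normal_(oov_embeddings.weight)
```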

 

Q4: Token size, distinct ID space and scalability

A4: We agree that token size is influenced by the number of distinct IDs. As the ID space expands, the representational demand increases, typically requiring more clusters. However, this relationship is not strictly linear, as token size also depends on:

  • The distributional structure of the ID space (e.g., density, sparsity)
  • The clustering resolution, which controls semantic separation granularity

Table 1: Token size as a function of ID count (on the Beauty dataset)

| Portion | ID count | OOV token size |
| ------- | -------- | -------------- |
| 30%     | 10,339   | 711            |
| 50%     | 17,232   | 925            |
| 80%     | 27,571   | 1,156          |
| 100%    | 34,464   | 1,319          |

Regarding testing the method under more complex ID spaces (e.g., mixed categories): our current setup isolates categories in order to study tokenization behavior in a controlled fashion. We agree that mixed-category evaluation is a valuable direction for assessing generalizability. Although such settings are not included here due to scope constraints, we plan to incorporate them in future work, and our method is structurally capable of adapting via its unsupervised clustering mechanism.

 

Q5: Computational overhead of meta-path generation and clustering

A5: We appreciate the reviewer's concern about scalability. The proposed meta-path generation and clustering are offline preprocessing steps, performed once before training or inference, and thus do not affect runtime performance.

On the Beauty dataset (~34K unique IDs):

  • Meta-path generation (using 32 walks × length 64) takes under 5 minutes (a sketch of such walks follows this list)
  • Clustering completes in 8.8 seconds using standard CPU implementations
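The sketch below illustrates how such alternating user-item random walks can be sampled from the interaction history; it is a simplified illustration, not necessarily the exact procedure used in the paper.

```python
import random
from collections import defaultdict

def meta_path_walks(interactions, num_walks=32, walk_len=64, seed=0):
    """Sample user-item-user-... meta-paths by alternating random walks
    over the bipartite interaction graph (simplified illustration)."""
    rng = random.Random(seed)
    user_items, item_users = defaultdict(list), defaultdict(list)
    for u, i in interactions:
        user_items[u].append(i)
        item_users[i].append(u)

    walks = []
    for start in user_items:             # walks per user are independent,
        for _ in range(num_walks):       # hence trivially parallelizable
            walk, node, on_user = [start], start, True
            for _ in range(walk_len - 1):
                nbrs = user_items[node] if on_user else item_users[node]
                if not nbrs:
                    break
                node = rng.choice(nbrs)
                walk.append(node)
                on_user = not on_user
            walks.append(walk)
    return walks

# Toy usage with two users and three items.
walks = meta_path_walks([("u1", "i1"), ("u1", "i2"), ("u2", "i2"), ("u2", "i3")])
```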

For larger datasets, our method is scalable via:

  • Parallel meta-path extraction (user/item paths are independent)

  • Mini-batch or streaming clustering (e.g., MiniBatchKMeans; a clustering sketch follows below)

Given the one-time nature and low cost, we believe the computational overhead is both acceptable and controllable, even for large-scale deployment.
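As an illustration of the mini-batch option above, a minimal scikit-learn sketch follows; the embedding matrix is a placeholder (in practice it would come from the meta-path representations), and the sizes mirror the Beauty numbers quoted earlier.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Placeholder meta-path embeddings: one row per user/item ID.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((34_464, 128)).astype(np.float32)

# K clusters -> K OOV tokens; mini-batches bound memory use on large ID spaces.
kmeans = MiniBatchKMeans(n_clusters=1319, batch_size=4096, random_state=0)
labels = kmeans.fit_predict(embeddings)   # cluster index per ID
centroids = kmeans.cluster_centers_       # one centroid per OOV token
```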

 

[1] Recommender Systems with Generative Retrieval. NeurIPS 2023.

[2] OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems. SIGIR 2024.

[3] How to Index Item IDs for Recommendation Foundation Models. SIGIR-AP 2023.

Review (Rating: 4)

This paper focuses on a shortcoming of LLM-based sequential recommendation and proposes to enhance tokenizers by introducing OOV tokens. A token-generation method, META ID, is proposed to characterize users and items and to provide token initializations for the subsequent finetuning process. Experimental results illustrate the effectiveness of META ID and of introducing OOV tokens.

Questions for Authors

Please refer to the weaknesses.

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The chosen evaluation protocol and criteria are reasonable and follow the common practice.

Theoretical Claims

NA.

Experimental Design and Analysis

The experimental designs and analyses are valid and sound.

Supplementary Material

NA.

Relation to Prior Work

This paper proposes a fundamental OOV token technique that could be applied in a wide range of LLM-based recommender systems.

Essential References Not Discussed

NA.

Other Strengths and Weaknesses

Strengths:

  • A simple yet effective token initialization method that is agnostic to model structures.
  • Comprehensive experiments support the effectiveness and necessity of introducing specifically-designed OOV tokens.

Weaknesses:

  • The finetuning strategy for the LLM is overly simplified and therefore might not be as effective on larger-scale LLMs. The authors could consider using other instruction-tuning techniques to enhance the framework.
  • The META ID method is only tested on older models with at most 7B parameters. The effectiveness of META ID on larger, more recent LLMs (e.g., LLaMA3-42B) remains unexplored.

Other Comments or Suggestions

NA.

Author Response

We sincerely thank the reviewer for the constructive and encouraging feedback. We are glad that the reviewer finds our method simple yet effective, the evaluation sound, and the overall contribution valuable. Below, we address the concerns:

Q1: On the finetuning strategy being overly simplified

A1: We appreciate the reviewer's thoughtful feedback. We would like to clarify that our method does adopt LoRA (Low-Rank Adaptation) for finetuning (Lines 312-314), a widely used and effective parameter-efficient finetuning strategy designed especially for large-scale LLMs. LoRA significantly reduces the number of trainable parameters while maintaining strong performance, and has been successfully applied to models with tens or even hundreds of billions of parameters, such as GPT-3, LLaMA3, and Mixtral. Therefore, our finetuning strategy is not limited to small-scale models, and the proposed META ID token design is fully compatible with larger models under the same LoRA-based training paradigm. While our current experiments focus on models up to 7B due to resource constraints, we believe the scalability of both LoRA and META ID ensures that our approach can be extended to larger models. We are actively working on testing our method on models such as LLaMA3-13B and LLaMA3-42B.
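For reference, a minimal LoRA setup with the PEFT library might look like the following; the hyperparameters and target modules are illustrative defaults, not the settings reported in the paper.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Illustrative LoRA hyperparameters (the paper's exact settings may differ).
config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```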

Q2: On testing with larger LLMs

A2: As our primary goal was to evaluate the feasibility and utility of introducing META IDs, we chose small- to mid-sized models to ensure manageable training costs and reproducibility. We agree that testing on larger models like LLaMA3-42B or Mixtral-8x7B is important and could better demonstrate scalability. We are currently working on extending our framework to these larger models and will include these results in future versions or follow-up work.

Review (Rating: 2)

This paper introduces META ID, a framework that improves LLM-based recommender systems using out-of-vocabulary tokens. The authors demonstrate that in-vocabulary tokens lack diversity when representing users/items and propose constructing OOV tokens from meta-path features extracted from user-item interaction histories. The approach uses clustering to create hierarchical tokens that capture relationship patterns while maintaining distinctiveness.

Questions for Authors

Please see the above comments and suggestions.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

The diversity score is quite unstable as a metric, since each item essentially requires more than 10^4 calculations, resulting in very high computational complexity.

Theoretical Claims

Yes.

Experimental Design and Analysis

Yes.

Supplementary Material

No supplementary material was submitted.

Relation to Prior Work

Close to Generative Recommendation.

Essential References Not Discussed


[1] Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. ICML 2024.

[2] HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling. AAAI 2024.

Other Strengths and Weaknesses

Strengths:

  1. The topic of generative recommendation is interesting, and the optimization from an ID-based perspective is impressive.
  2. The paper is well-written and easy to understand.

Weaknesses:

  1. The motivation lacks sufficient justification.
  2. Some of the latest baselines are missing.
  3. There is no discussion of scaling laws, which are very important in GR.

Other Comments or Suggestions

  1. The motivation seems unreasonable. First, ID-based approaches are not essential in GR. While some GR models like HSTU and SASRec still construct input sequences based on IDs, other methods such as TIGER and EAGER-LLM have proven the feasibility and effectiveness of token-based approaches. This paper only discusses ID-based methods without analyzing their advantages and disadvantages relative to token-based approaches, or explaining in which scenarios ID-based approaches are more advantageous than token-based ones. Second, existing LLM vocabularies are already very large; for example, DeepSeek-V3's vocabulary has 129,280 tokens, which is sufficient to represent clustered items. Why add separate OOV tokens? It seems somewhat redundant.

  2. Some important baselines are missing. For example:

     [1] Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. ICML 2024.

     [2] HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling. AAAI 2024.

  3. What is the difference between your approach and traditional cold-start item solutions? The meta-path method is similar to traditional solutions, i.e., adding neighbors through meta-paths. The novelty is insufficient.

  4. Why choose T5 and LLaMA-7B? The authors need to conduct a rigorous discussion of scaling laws, which are extremely important in GR.

  5. How is the appropriate number of OOV tokens selected for each dataset? From Figure 4, this selection appears quite random.

Author Response

We thank all reviewers for their constructive feedback. Below, we provide a point-by-point response addressing key concerns.

Q1: Stability of DS.

A1: Although the diversity score (DS) involves KL divergence, it is computationally efficient in practice. We avoid full pairwise comparison (O(N²)) by sampling a fixed number S of pairs (e.g., 1,000 or 10,000), reducing the complexity to O(S). Each KL divergence is fast and vectorized (via NumPy/PyTorch). For example, on the Sports dataset, computing 10,000 samples takes only 0.81 seconds on a CPU, demonstrating that DS is lightweight and scalable.
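The rebuttal does not restate the exact DS formula, so the sketch below only illustrates the sampling idea, under the assumption that DS averages a symmetric KL divergence over sampled pairs of per-item probability distributions:

```python
import numpy as np

def sampled_diversity_score(dists, num_samples=10_000, eps=1e-12, seed=0):
    """Estimate a pairwise-KL diversity score from sampled pairs.

    `dists` is an (N, V) array of per-item probability distributions.
    Assumed form: mean symmetric KL over sampled pairs (illustrative;
    the paper's exact DS definition may differ).
    """
    rng = np.random.default_rng(seed)
    n = dists.shape[0]
    i = rng.integers(0, n, size=num_samples)
    j = rng.integers(0, n, size=num_samples)
    p, q = dists[i] + eps, dists[j] + eps          # smooth to avoid log(0)
    kl_pq = np.sum(p * np.log(p / q), axis=1)
    kl_qp = np.sum(q * np.log(q / p), axis=1)
    return float(np.mean(0.5 * (kl_pq + kl_qp)))   # O(S), not O(N^2)
```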

Q2: About the motivation

A2: We thank the reviewer for raising this point. We would like to clarify a potential misunderstanding regarding the term "ID-based." Our method does not fully rely on traditional one-hot or raw ID embeddings, but instead tokenizes each ID into a unique, learnable sequence. This makes it fundamentally token-based in form while still preserving ID-specific semantics, similar to TIGER's subword compositions.

Our method extends the token-based line by enabling OOV-aware, compositional ID encoding that bridges structured semantics with LLM-compatible inputs. Regarding vocabulary-size concerns (e.g., DeepSeek-V3's 129K tokens), general vocabularies lack domain-specific tokens and may induce hallucination [3,4]. Our added OOV tokens (a +0.7% parameter increase) cover this gap and yield measurable performance gains (Table 1).
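To make the vocabulary-extension step concrete, here is one way cluster-derived OOV tokens could be registered with a T5 tokenizer in Hugging Face transformers; the `<oov_k>` naming scheme is our illustrative assumption, not necessarily the paper's.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# One new token per cluster; the naming scheme is hypothetical.
oov_tokens = [f"<oov_{k}>" for k in range(1319)]
num_added = tokenizer.add_tokens(oov_tokens)

# Grow the embedding matrix to cover the new tokens; only these new rows
# add parameters on top of the vanilla T5 weights.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} OOV tokens; vocab size is now {len(tokenizer)}")
```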

 

Q3: On missing strong baselines such as EAGER [1]

A3: Thank you for highlighting recent LLM-based methods. We have added direct comparisons with EAGER on the Sports and Toys datasets:

Table 1: Comparison with EAGER

| Dataset | Method | HR@10  | NDCG@10 |
| ------- | ------ | ------ | ------- |
| Sports  | EAGER  | 0.0441 | 0.0236  |
| Sports  | Ours   | 0.0487 | 0.0277  |
| Toys    | EAGER  | 0.0714 | 0.0505  |
| Toys    | Ours   | 0.0761 | 0.0441  |

META ID achieves consistent gains in HR@10 on both datasets, and comparable NDCG@10.

Regarding HSTU [2], its trillion-parameter-scale architecture requires substantial resources beyond our scope. Instead of scaling model size, our method emphasizes lightweight, structured ID-level representation (a +0.7% parameter increase), benefiting OOV handling and multi-task generalization. We will include these methods in the related-work section to acknowledge their contributions and clarify our differences.

 

Q4: Comparison with traditional cold-start methods

A4: We appreciate the reviewer’s comment and agree that cold-start has been a long-standing challenge in recommendation. While our method does leverage graph-based neighbor information via meta-paths, it differs fundamentally from traditional cold-start solutions in several key aspects:

  • Tokenization instead of embedding fusion: instead of fusing neighbor features or adding graph edges, we tokenize meta-paths into ID-specific sequences and feed them into an LLM, enabling learnable, compositional representations.

  • Unified multi-task modeling: OOV tokens allow the model to perform both recommendation and text generation (e.g., explanation) in a unified token space, which traditional methods cannot handle.

Thus, our approach introduces a novel token-level representation strategy that integrates graph semantics into LLM-friendly inputs, a direction not explored in prior cold-start literature.

 

Q5: Backbones and scaling law

A5: We thank the reviewer and agree that scaling laws are important in GR. We selected T5 (small-scale) and LLaMA2-7B (mid-scale) as representative backbones for the following reasons:

  • Architectural diversity: T5 is encoder-decoder; LLaMA2 is decoder-only, covering both major LLM paradigms in GR.

  • Compute trade-off: Both support strong performance under realistic compute budgets.

  • Baseline alignment: Baselines (e.g., TIGER, RID, SID) are also built on T5, enabling controlled comparison.

While we do not conduct a full scale-sweep, META ID is scaling-agnostic and can plug into larger LLMs (e.g., GPT-3, LLaMA2-13B) without modification.

 

Q6: Number of OOV tokens

A6: The number of OOV tokens is explicitly controlled by the clustering hyperparameter K, where each cluster maps to a token. In Figure 4, we vary K to analyze how token granularity affects performance. This mechanism offers a flexible way to adjust vocabulary size based on data complexity or computational budget: a smaller K yields a compact vocabulary, while a larger K allows finer-grained modeling. [See also Q4 of Reviewer 8KDs]
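As an aside, one generic unsupervised heuristic for narrowing down K before any downstream sweep (our illustration, distinct from the Figure 4 analysis) is to compare silhouette scores across candidate values:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((5_000, 128)).astype(np.float32)  # placeholder

# Sweep candidate K values; higher silhouette = better-separated clusters.
for k in (256, 512, 1024, 2048):
    labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels, sample_size=2_000, random_state=0)
    print(f"K={k}: silhouette={score:.3f}")
```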

 

[1] EAGER-LLM: Efficient Adaptive Generation for Recommendation. AAAI 2024.

[2] Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. ICML 2024.

[3] Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023.

[4] How to Index Item IDs for Recommendation Foundation Models. SIGIR-AP 2023.

Final Decision

This paper proposes META ID, a novel framework for improving LLM-based recommendation systems by introducing a principled approach to constructing Out-of-Vocabulary (OOV) tokens for user and item representations. The authors argue that conventional in-vocabulary tokenization schemes are semantically inadequate for capturing recommendation-specific ID semantics, and they offer a clustering-based token generation method leveraging historical user-item interactions.

The reviewers broadly agree that the paper addresses an important and emerging challenge in generative recommendation, particularly in bridging structured user-item semantics with LLM-compatible token inputs. One reviewer (Psxm) gives a clear accept, highlighting the effectiveness and general applicability of META ID across recommendation and generation tasks, supported by comprehensive experiments and a simple yet flexible tokenization mechanism. The other two reviewers (8KDs and j313) raise some concerns, especially regarding the motivation for using OOV tokens, the lack of certain recent baselines, and insufficient exploration of scaling behaviors or efficiency trade-offs.

That said, the authors provide a detailed and convincing rebuttal that directly addresses these points. They clarify misconceptions about the nature of their ID encoding (not raw one-hot), provide new comparative results with LLM-based baselines like EAGER, and demonstrate that their method adds minimal computational overhead. They also give thoughtful explanations on the token count selection, scalability, and preprocessing cost, all of which seem reasonable given the experimental scope.

While some limitations remain (notably, the evaluation is restricted to relatively small and mid-sized models, and the paper could benefit from a broader comparison set), the proposed method is innovative, practically motivated, and supported by strong empirical results. Given the current level of novelty and the field's interest in enhancing LLM architectures for recommendation tasks, I lean toward a Weak Accept. This work is likely to spark follow-up research and can be a valuable contribution to the ICML community.