CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
Reviews and Discussion
This paper introduces CITER, a collaborative inference framework based on token-level routing to accelerate LLM inference. The framework employs a reinforcement learning-trained router to assess token importance, enabling efficient task allocation between small language models (SLMs) and LLMs.
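For readers less familiar with the setup, the core decoding loop behind such token-level routing can be pictured as follows. This is a minimal illustrative sketch, not the authors' implementation; the model wrappers (`slm_next_token`, `llm_next_token`) and the `router_score` function are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

def routed_decode(
    prompt_ids: List[int],
    slm_next_token: Callable[[List[int]], int],   # small model: context ids -> next token id
    llm_next_token: Callable[[List[int]], int],   # large model: context ids -> next token id
    router_score: Callable[[List[int]], float],   # router: context ids -> P(SLM token is acceptable)
    threshold: float = 0.5,
    eos_id: int = 2,
    max_new_tokens: int = 128,
) -> Tuple[List[int], float]:
    """Generate token by token, sending each decoding step to the SLM or the LLM."""
    ids = list(prompt_ids)
    llm_calls = 0
    for _ in range(max_new_tokens):
        if router_score(ids) >= threshold:
            tok = slm_next_token(ids)      # cheap path for "unimportant" tokens
        else:
            tok = llm_next_token(ids)      # expensive path for critical tokens
            llm_calls += 1
        ids.append(tok)
        if tok == eos_id:
            break
    new_tokens = max(len(ids) - len(prompt_ids), 1)
    return ids, llm_calls / new_tokens     # second value is a "% Call to LLM"-style ratio
```

Raising `threshold` sends more tokens to the LLM (higher quality, higher cost) and lowering it does the opposite; this is the accuracy-versus-cost trade-off the reviews below discuss in terms of "% Call to LLM".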
Strengths
The study addresses the critical practical issue of reducing LLM inference costs, which is vital for real-world deployment. CITER achieves equivalent performance with 30% fewer LLM calls and improves performance by up to 25% with the same number of LLM calls. Technical innovation is notable, featuring fine-grained control through token-level routing, systematic router training via reinforcement learning, and enhanced training efficiency using shortcut reward estimation. The validation is thorough, with extensive experiments conducted across four benchmarks, a detailed ablation study, and verification across various model sizes.
Weaknesses
The paper lacks sufficient overhead analysis: there is no evaluation of the router's computation and memory costs, the potential latency from model switching, or end-to-end performance. In particular, overhead is not analyzed for either training or inference.
Its generalizability remains uncertain, as evaluations are limited to QA tasks and a single model series (Qwen2), without verification in multilingual or long-form generation contexts.
The theoretical justification is also limited, with insufficient rationale for the router's architectural choices, no convergence analysis for the iterative training, and inadequate verification of the shortcut reward estimation's accuracy.
Questions
What is the additional latency introduced by the router? Once training and fine-tuning costs are included, I strongly doubt that your approach is more efficient than a single large-model call.
Do similar benefits hold for larger models? How does the framework perform with inputs of varying lengths? How does it perform on tasks with high token dependency?
What are the anticipated challenges in real-world deployment?
The paper introduces Collaborative Inference with Token-level Routing (CITER), a framework designed to enhance the efficiency of large language model (LLM) inference while maintaining output quality. By implementing a token-level router that predicts the importance of individual tokens, CITER enables smaller language models (SLMs) to handle less critical tokens, reserving the LLM for essential ones. The approach formulates router training as a reinforcement learning (RL) problem to minimize inference costs and introduces a shortcut for reward estimation, significantly accelerating training. Experiments on four benchmark datasets show that CITER can reduce LLM calls by up to 30% while preserving high accuracy, or improve accuracy by 25% at the same call ratio. Additionally, ablation studies reveal that token-level routing is more flexible and effective than query-level routing, highlighting the benefits of considering long-term routing impacts.
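The "shortcut for reward estimation" mentioned above is what lets the router be trained without rolling out a full generation for every routing decision. As a loudly hedged illustration only, the sketch below labels routing targets with a simple agreement heuristic; this heuristic is an assumption made for clarity, not the paper's actual reward definition.

```python
from typing import Callable, List

def label_routing_targets(
    prompt_ids: List[int],
    reference_ids: List[int],                     # tokens of a reference (gold) continuation
    slm_next_token: Callable[[List[int]], int],   # small model: context ids -> next token id
) -> List[int]:
    """Label each position 1 if the SLM alone would produce the reference token
    (cheap to route to the SLM), else 0 (route to the LLM).

    Agreement-based labeling is an illustrative stand-in for the paper's
    shortcut reward; it avoids per-decision rollouts entirely.
    """
    labels: List[int] = []
    ids = list(prompt_ids)
    for ref_tok in reference_ids:
        slm_tok = slm_next_token(ids)
        labels.append(1 if slm_tok == ref_tok else 0)
        ids.append(ref_tok)   # teacher-force the reference token regardless of the label
    return labels
```

A router could then be fit as a binary classifier on such labels; CITER instead trains its router with reinforcement learning so that each decision accounts for its long-term effect on the rest of the generation, which is precisely the advantage over purely local heuristics that the summary above highlights.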
Strengths
- The paper is well written and easy to follow.
- The RL-based router training method is novel, and a shortcut to the reward function is proposed to make training easier.
- Experimental results show that the proposed method can achieve better performance under the same number of LLM calls.
Weaknesses
- While the framework introduces a shortcut for estimating the reward function, the initial training of the token-level router still requires significant computational resources due to the need for reinforcement learning, which can be a barrier for practical implementation.
- The effectiveness of CITER heavily relies on the accuracy of token importance predictions. If the router fails to accurately assess which tokens are critical, it could lead to suboptimal routing decisions, potentially compromising the quality of the generated outputs. More analytical experiments should be conducted to prove this point.
- The main experiment in Figure 2 involves only a few baselines. Adding comparisons with existing LLM inference acceleration methods would make the effectiveness of the proposed method more prominent.
- Most experimental results report "% Call to LLM". I hope to see more intuitive metrics, such as the amount of computation (FLOPs) or inference time/speed, which would give readers a more concrete sense of the efficiency gains.
Questions
Refer to weaknesses.
Large language models (LLMs) perform exceptionally well in natural language processing tasks; however, their computational costs during the inference phase are extremely high, especially in real-time applications. Existing approaches primarily address this issue by routing entire user queries to different models, a method that lacks flexibility and often results in inefficiency. To tackle this problem, the authors propose a novel framework—Collaborative Inference with Token-level Routing (CITER)—which achieves a balance between efficiency and accuracy by predicting token importance and routing tokens to the appropriate model. The authors formalize the training of the router as a reinforcement learning problem and introduce a shortcut for reward function estimation to accelerate the training process.
Strengths
- The token-level routing framework for collaborative inference is quite novel. The idea of having small language models collaboratively generate tokens to reduce the large language model's generation workload is an interesting way to accelerate inference.
- The experimental design for evaluating the CITER framework's inference acceleration is comprehensive, with thorough evaluations conducted across multiple benchmark datasets.
Weaknesses
- The paper only conducts experiments with the Qwen series of models. If the models were switched to the Llama 3 series, would the CITER architecture still achieve fast inference with large models?
- The generality of the iterative training process's rapid convergence is not supported by detailed evidence, which undermines the validity of the iterative training approach.
Questions
Since I do not have much knowledge about inference acceleration methods, I am curious how the method proposed in this paper compares with existing non-token-level inference acceleration frameworks.
This paper proposes a routing process that combines an LLM and an SLM to improve efficiency without sacrificing accuracy.
Strengths
The proposed method is sound, and positive results are demonstrated across multiple benchmarks.
Weaknesses
- At least in my opinion, I don't see significant advantages (or differences) over existing collaborative decoding methods. For example, this paper cites Co-LLM; what is the core difference between this work and the work by UW Yejin's team?
- Only one policy is used (Qwen).
- The paper is not easy to read; you do not need so many equations for sections like 2.1.2.
I'm open to upgrading my score if the paper is significantly improved.
Questions
- What is the core difference between this work and the work by UW Yejin's team?
- What is the situation for Llama 3?
- What is a more specific reason for iteratively updating the router, given that your method is an inference-time method in which the LLM and SLM are fixed?
- Do you train a specific router for each benchmark tested?
The paper proposes CITER, a collaborative inference framework designed to speed up Large Language Model (LLM) inference without sacrificing much accuracy. It does this by routing individual tokens to either a smaller, faster language model (SLM) or the full LLM. Experiments show that CITER can reduce LLM inference cost. The token-level routing approach is a fresh take on collaborative inference, offering potentially finer-grained control over the efficiency vs. accuracy trade-off. The biggest weakness highlighted by multiple reviewers is the lack of a thorough comparison to existing collaborative decoding methods and prior LLM inference acceleration techniques. The reliance on only the Qwen series of models limits the generalizability of the findings: reviewers question how CITER would perform on other model architectures and under a broader evaluation covering various NLU tasks and long-form text generation scenarios. The overall efficiency improvement also needs to be assessed in a complete end-to-end analysis.
Additional Comments from Reviewer Discussion
There is no author response.
Reject