RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference
Abstract
Reviews and Discussion
The paper proposes a new early exit approach that predicts a probability distribution over the model's layers, where the distribution is approximated using similar data. Specifically, it uses the embeddings of incoming samples to create an embedding space in which the number of subspaces equals the number of layers in the model. The decision of which subspace an incoming sample belongs to is made based on the possible layers it can exit from. During inference, an incoming sample is first matched against its top-k nearest neighbors, and the sample is assigned an exit layer based on the probability distribution estimated from those neighbors.
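For concreteness, the inference-time decision can be summarized in a minimal numpy sketch; the brute-force L2 search and inverse-distance weighting here are illustrative assumptions, not necessarily the paper's exact estimator:

```python
import numpy as np

def predict_exit_layer(query_emb, db_embs, db_exit_layers, num_layers, k=8):
    """Approximate the exit-layer distribution from the query's top-k neighbors."""
    dists = np.linalg.norm(db_embs - query_emb, axis=1)      # brute-force L2 search
    nn = np.argsort(dists)[:k]                               # top-k nearest neighbors
    probs = np.zeros(num_layers)
    for i in nn:                                             # accumulate neighbor votes
        probs[db_exit_layers[i]] += 1.0 / (dists[i] + 1e-8)  # inverse-distance weight
    probs /= probs.sum()
    # np.argmax returns the first maximum, i.e., ties break toward earlier layers
    return int(np.argmax(probs)), probs
```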
Strengths
- The paper is well written.
- The paper uses multiple backbone models, which strengthens its claims.
- The paper claims to reduce latency overheads, which is an important problem for the community.
Weaknesses
- Novelty: The paper lacks novelty, as its main claim, learning the distribution of incoming samples from their embeddings, is already realized in the DIMEE [1] work. Although the objectives of the two works differ, the methods used to solve the problem are very similar.
- Lack of proper baselines: Multiple existing works [2], [3], [4], [5] also learn a probability distribution over exit points, but the paper has neither cited them nor compared against them, which reduces its overall impact. Since the paper's final objective is to learn a distribution, there should be a comparison with existing distribution-predicting methods.
- Overclaim: The paper makes a major claim that it can outperform under the zero-shot setting, yet it requires labels to create the retrieval database. Also, a change in the domain of the test dataset would have a large impact, which is not explored in this paper.
- Lack of explanation: As far as I can tell, the better results of RAEE stem from its use of the learned final-layer classifier to map hidden representations from intermediate layers to class probabilities, similar to [5], whereas the other baselines use different, randomly initialized classifiers at each layer, hence their loss in performance. This is an apples-to-oranges comparison; the baselines should also be tested with the learned final-layer classifier for a fair comparison.
- Clustering figure: I believe the paper should include an additional t-SNE plot showing how the clusters form based on exit points, as shown in the DIMEE paper.
Missing references: There are a lot of missing references:
- JEI-DNN: https://openreview.net/pdf?id=jX2DT7qDam (ICLR 2024)
- ZTW: https://proceedings.neurips.cc/paper/2021/file/149ef6419512be56a93169cd5e6fa8fd-Paper.pdf (NeurIPS 2021)
- PALBERT: https://proceedings.neurips.cc/paper_files/paper/2022/file/5a9c1af5f76da0bd37903b6f23e96c74-Paper-Conference.pdf (NeurIPS 2022)
- MSDNet: https://arxiv.org/abs/1703.09844
- CeeBERT: https://aclanthology.org/2024.findings-acl.101/ (ACL 2024)
- ETFEE: Yixin Ji, Jikai Wang, Juntao Li, Qiang Chen, Wenliang Chen, and Min Zhang. 2023. Early Exit with Disentangled Representation and Equiangular Tight Frame. In Findings of the Association for Computational Linguistics: ACL 2023, pages 14128–14142.
Questions
See weaknesses.
Also, early exits are not pruning-based methods; instead, they fall into the class of dynamic inference methods, since pruning reduces the weights. Here none of the weights are removed; instead, the model decides which layers it should not use, although it retains the option to use them, which is not the case with pruning.
W1: Thanks for your comments. Unfortunately, the online date of this paper on arXiv is October 6, which is later than our submission to ICLR 2025 (October 1, AoE). The techniques proposed in RAEE are novel, and DIMEE can only be treated as follow-up work to RAEE.
W2: Thanks for your comments. We carefully reviewed all the listed works and will discuss them in the related work. Specifically, ZTW [3] and MSDNet [5] are specifically designed for CNN-based models, while JEI-DNN [2], PALBERT [4], CeeBERT [6], and ETFEE [7] all require fine-tuning the whole backbone model. In our paper, we target scenarios with no updates to the backbone model's parameters.
[2] Florence Regol, Joud Chataoui, Mark Coates. Jointly-Learned Exit and Inference for a Dynamic Neural Network: JEI-DNN. ICLR 2024
[3] Maciej Wolczyk, Bartosz Wójcik, Klaudia Balazy, Igor T. Podolak, Jacek Tabor, Marek Smieja, Tomasz Trzcinski. Zero Time Waste: Recycling Predictions in Early Exit Neural Networks. NeurIPS 2021: 2516-2528
[4] Nikita Balagansky, Daniil Gavrilov. PALBERT: Teaching ALBERT to Ponder. NeurIPS 2022
[5] Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, Kilian Q. Weinberger. Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR 2018
[6] Divya Jyoti Bajpai, Manjesh K. Hanawal. CeeBERT: Cross-Domain Inference in Early Exit BERT. ACL (Findings) 2024: 1736-1748
[7] Yixin Ji, Jikai Wang, Juntao Li, Qiang Chen, Wenliang Chen, Min Zhang. Early Exit with Disentangled Representation and Equiangular Tight Frame. ACL (Findings) 2023: 14128-14142
W3: Thanks for your comments. Please see the General Response to Out-of-Domain Issues.
W4: Thanks for your comments. There seem to be several misunderstandings. First, RAEE doesn't have any classifier and doesn't tune any parameters of the backbone model, which is totally different from [5] as you mentioned. The main idea of RAEE is to retrieve exit information from a pre-built retrieval database and then compute the exit layer according to the top-k nearest neighbors' exit information. There is no "final layer classifier" in RAEE. In contrast, RAEE doesn't add any extra components to the backbone model and only passes an exit-layer parameter to stop the inner loop early. For more on the novelty claims, please refer to the response to Q1 of reviewer Jjgv.
The comparisons to baselines are also fair. For RAEE, we only use the training dataset to create the retrieval database. For baselines containing classifiers (HashEE, DeeBERT, AdaInfer), we only tune the classifiers on the training dataset. For CALM, which requires training the backbone model, we follow its default threshold settings and only perform its inference. For SLEB, we follow its settings to prune the backbone models.
W5: Thanks for your suggestions. However, we believe it is unnecessary to draw such a clustering figure, for the following reasons:
- Data with similar representations would exit at similar layers, but data with different representations may also exit at similar layers.
- In RAEE, there are multiple possible exit layers for each input, which makes it difficult to draw all of them in one figure.
Besides, Figure 2 in the paper can better describe the distribution approximation with top-k nearest neighbors’ exit information.
Thanks for the rebuttal.
Please note that you have already compared against DeeBERT, and all the suggested baselines such as CeeBERT, JEI-DNN, and ZTW require similar fine-tuning. Also, the ZTW idea can easily be extended to LLMs as well. Regarding the figure, it would have been better if there were a pattern where samples from some embedding subspace chose a particular subset of layers.
I will keep my score as the rebuttal partially solves my issues.
Thanks.
This paper addresses the challenge of high computational demands in large language model inference. It focuses on reducing the number of inference layers required by exploiting early exits. Instead of early-exiting by training internal classifiers to decide if the model can exit after fewer layers, the authors propose RAEE, a Retrieval-Augmented Early Exit framework. RAEE treats early exit as a distribution prediction problem, where exit decisions are informed by similar data examples stored in a retrieval database (Cache). This approach allows the model to decide on exiting based on prior information from similar data, leading to faster and more efficient inference.
Strengths
Method
- The method is interesting, and there is a clear performance improvement. Creating a database to decide what layer to use sounds promising. I would push that idea to create a more general dataset rather than tailor it to each specific target dataset.
Weaknesses
Motivation.
- Although I understand that inference time can be reduced by early exiting, one of the main reasons for the high computational demand of Transformer LLMs is the KV cache. In this regard, solutions like PagedAttention [1] show 3.5x and 24x higher throughput using LLaMA. My concern is that this paper claims to accelerate model inference but doesn't compare with those types of approaches. If the motivation of this work is to deploy LLMs on resource-constrained devices, the memory aspect is key. Moreover, Figure 3 shows that the method does not always achieve better inference latency compared to the base models T5-L and RoBERTa/ElasticBERT or the selected baselines (AdaInfer, SLEB).
- In my opinion, the work is more interesting in terms of performance improvement than latency reduction (Table 2). The results show clear diversity in the exit layers: some layers seem better than others for some datasets. Thus, Section 4.3 needs to be explored more deeply. I expected to see statistics about the layers used most for each dataset and what types of tasks/questions/data are better answered at earlier or later layers. This raises the question of whether the retrieval database could be more general with respect to the input data type.
Experiments.
- Even though the GLUE benchmark is well known in the NLP community, it would be good to explain it further, emphasizing the evaluation metrics and clarifying whether the tasks are classification or generation.
- To better contextualize the RAEE method with the LLMs used in the comparison, it would be great to see the performance of HashEE/CALM/SLEB on each LLM (Figure 3).
- Section 4.3 requires much more analysis to better understand the sentence at L407-408.
- What layers are used the most?
- Is there some correlation between the type of data and early exit, or the type of task and early exit?
Presentation.
- Some parts of the text are difficult to follow.
- Figures and plots require better captions. Figure 1 is quite intuitive, but explaining each component in the caption would be ideal. Figure 2 is confusing: the caption doesn't explain the difference between (a) and (b). Also, the explanation between L131 and L142 is confusing. In general, captions for figures and tables are very superficial.
- Citations: Line 124 could cite Tip-Adapter as a training-free adaptation that uses a cache for image classification. In the same line (124), what are the existing retrieval databases that use clustering and/or product quantization?
[1] Woosuk Kwon et al. Efficient Memory Management for Large Language Model Serving with PagedAttention, 2023
Questions
- If you could assess the concerns above.
- When two or more layers have the same maximal probability, RAEE selects the earliest (L255). I wonder whether this is a good option. How many cases with the same maximal probability exist in each dataset? How many times does RAEE choose the earliest and make a mistake? If it used a later layer, could it make a correct prediction?
- How much memory is added if we want to deploy this method on resource-constrained devices?
W1: Thanks for your great suggestions. The main focus of PagedAttention is designing a new attention algorithm that optimizes memory usage and inference efficiency. This type of work is orthogonal to ours. For example, with RAEE's early exit, PagedAttention can dynamically save the computation of the exited layers' KV cache.
For the comparable inference latency on T5-L and RoBERTa/ElasticBERT, the reason is that the extra encoding and retrieval time offsets the benefits of early exit. AdaInfer has a poor exit classifier and always exits very early with wrong predictions. For SLEB, there is a hyperparameter that decides how many layers are removed from the backbone model; for a fair comparison, we chose similar exit layers for SLEB. Although SLEB can achieve faster inference, it performs poorly compared to RAEE.
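In rough accounting terms (our notation, not the authors'), with per-layer forward time $t_{\mathrm{layer}}$, query encoding time $t_{\mathrm{enc}}$, and retrieval time $t_{\mathrm{ret}}$, exiting at layer $\ell$ of $L$ only reduces end-to-end latency when the skipped layers outweigh the overhead:

$$
t_{\mathrm{enc}} + t_{\mathrm{ret}} + \ell\, t_{\mathrm{layer}} \;<\; L\, t_{\mathrm{layer}}
\quad\Longleftrightarrow\quad
(L-\ell)\, t_{\mathrm{layer}} \;>\; t_{\mathrm{enc}} + t_{\mathrm{ret}},
$$

which is why the gains can vanish when only a few layers are skipped.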
W3: Thanks for your suggestions. We evaluated the proposed RAEE on representative generation tasks, namely CNN/DailyMail and XSum. Rebuttal-Table 11 shows the results of applying RAEE to these two summarization tasks. The experimental results again demonstrate the efficacy of the proposed RAEE.
Rebuttal-Table 11. Performance of the Llama-3-8b and RAEE (Llama) on generation tasks.
| | ROUGE-L | Layers |
|---|---|---|
| CNN/DailyMail Llama-3-8B | 8.95 | 32.00 |
| CNN/DailyMail RAEE (Llama) | 19.75 | 30.43 |
| XSum Llama-3-8B | 5.22 | 32.00 |
| XSum RAEE (Llama) | 7.31 | 30.21 |
W4: Thanks for your comments. The early exit frameworks compared in this work are not suitable for all backbone models: HashEE and DeeBERT are specifically designed for BERT-like models, CALM is specifically designed for T5-based models, and SLEB is specifically designed for decoder-only models. We have tried our best to implement AdaInfer, whose design is least tied to the architecture of the backbone model, over all kinds of backbone models.
W2 & W6: Thanks for your insightful suggestions. We collect the statistics of the most-used exit layers in Rebuttal-Table 12. Table 8 in the appendix also shows the correlation between the type of task and exit layers. These results show the diversity of exit layers across different tasks and different inputs.
Rebuttal-Table 12. Most-used exit layers of RAEE (Llama) across eight classification tasks.
| Layers | SST-2 | SST-5 | MR | CR | MPQA | Subj | TREC | CoLA | Avg |
|---|---|---|---|---|---|---|---|---|---|
| RAEE (Llama) | 5.00 | 3.00 | 5.00 | 5.00 | 15.00 | 11.00 | 27.00 | 23.00 | 11.75 |
W7: Thanks for your suggestions. We have improved all figure captions.
W8: Thanks for your suggestions. Tip-Adapter is a training-free method for learning better representations for CLIP; there is no clear relationship with early exit frameworks, except for the training-free setting.
Several vector retrieval libraries use clustering and product quantization techniques, for example FAISS.
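For instance, a FAISS IVF-PQ index combines k-means clustering (the coarse quantizer) with product quantization of the stored vectors; a minimal sketch with illustrative sizes, not the paper's configuration:

```python
import faiss
import numpy as np

d, nlist, m = 768, 256, 8                            # dim, #clusters, #PQ sub-vectors
xb = np.random.rand(100000, d).astype("float32")     # stand-in database embeddings

quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer for clustering
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per PQ sub-code
index.train(xb)                                      # learns clusters and PQ codebooks
index.add(xb)
index.nprobe = 16                                    # clusters scanned per query
D, I = index.search(xb[:1], 8)                       # distances/ids of top-8 neighbors
```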
Q1: Thanks for your comments. Theoretically, there could be cases where two or more exit layers have the same probability. However, we collected all exit layers and corresponding probabilities in all experiments, and no exit layers exhibited the same exiting probability, which shows that this is a very low-probability case.
Q2: Thanks for your comments. It requires less than 6 MB. For more details, please refer to the response to Q2 of reviewer Jjgv.
Thank you for the rebuttal and for addressing the concerns raised during the review process. I want to acknowledge and commend the authors' effort and commitment to clarifying the questions and addressing the weaknesses identified in the initial review.
I understand that PagedAttention is orthogonal to your method. However, I am unsure if they are complementary and more experiments in that direction are needed if you want to push the motivation to reduce the inference time.
The results and analysis provided in the rebuttal are appreciated, but they do not substantially change the overall assessment of the submission. While the work presents a compelling and innovative approach with significant potential, the current version still requires substantial revisions to improve its overall quality and impact. Specifically, further attention is needed to refine the motivation, provide additional experimental evidence, and enhance the clarity of the presentation.
Therefore, I maintain my original score. I encourage the authors to continue refining this work, as it has the potential to make a valuable impact with further development, especially in the diversity of predictions.
This paper aims to enhance the efficiency of large language model inference by adaptively exiting the model at earlier layers. The authors model the early exit problem as a distribution prediction problem and then use exit information from similar data to approximate the distribution. They describe how to construct a retrieval database with exit information and propose the RAEE framework, which leverages the pre-built retrieval database to predict the exit layer based on the exit information of the top-k nearest neighbors. Experimental results across eight downstream tasks demonstrate that RAEE improves inference speed while maintaining robust zero-shot performance, outperforming other early exit frameworks.
优点
- The paper provides a novel combined method to decide the exit layer.
- The paper is well written and easy to follow.
- The proposed method is reasonable. Experiments support their claims. The performance can be improved in most scenarios.
Weaknesses
- The paper does not clearly position itself with respect to existing retrieval-augmented methods that are used to accelerate model inference. A more thorough literature review is needed to highlight how RAEE differs from and improves upon prior work.
- While the data presented in Figure 3 is comprehensive, the visual presentation, specifically the subscripts, could be enhanced for better readability and aesthetic appeal.
Questions
- It has been observed that modeling the early exit problem as a distribution prediction problem is not a novel approach, as similar concepts have been explored in prior works. Could the authors elaborate on the specific novelties of their proposed RAEE framework compared to existing methods (e.g., Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference)?
- Table 4 indicates that the performance of RAEE improves with a larger retrieval database. How do the authors plan to balance the trade-off between database size, storage requirements, and inference efficiency, especially in resource-constrained environments?
- In Table 1, why were only 2-3 other methods compared for a specific single backbone (why weren't some methods compared)?
- The paper uses a threshold of 0.9, but it's unclear how sensitive RAEE is to it. An analysis of how changes in this parameter affect performance would be useful.
W1. Thanks for your comments. We have revised the related work on retrieval-based augmentations.
W2. Thanks for your suggestions. We have revised Figure 3 with new subscripts for better readability and aesthetic appeal.
Q1: Thanks for your comments. There are clear differences between the proposed RAEE and existing mainstream early exit frameworks, including the referred Predictive Exit.
- The referred Predictive Exit is currently only suitable for CNN-based neural networks, such as VGG-19 and ResNet-34, and applying it to general transformer-based large language models remains unexplored. The implicit patterns of CNN-based models and transformer-based models are significantly different.
- Although the referred Predictive Exit and the proposed RAEE both model the early exit problem as a distribution prediction problem, it is worth noting that the proposed RAEE requires no parameter updates to the backbone model, while most existing early exit frameworks [1, 2, 3] require jointly fine-tuning the backbone model and the early exit classifiers. Fine-tuning large language models is quite costly, even when using LoRA techniques.
- Although the proposed RAEE and the referred Predictive Exit both introduce hyperparameters, our ablation studies show that RAEE is not sensitive to them, which makes deployment much easier. In contrast, Predictive Exit requires setting the starting layer that determines the next exit layer, which significantly impacts model performance according to their paper.
- More importantly, the proposed RAEE is more interpretable than early exit frameworks that use learned classifiers. As shown in Figure 2, we can draw the exit distribution from the retrieved examples' exit information and make the exit predictions.
In summary, the proposed RAEE is suitable for existing state-of-the-art large language models, requires no parameter updates to the backbone model, is less sensitive to hyperparameters, and is more interpretable.
[1] Xiangjie Li, Chenfei Lou, Yuchi Chen, Zhengping Zhu, Yingtao Shen, Yehan Ma, An Zou, Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference. AAAI 2023: 8657-8665.
[2] Florence Regol, Joud Chataoui, Mark Coates. Jointly-Learned Exit and Inference for a Dynamic Neural Network : JEI-DNN. ICLR 2024.
[3] Divya Jyoti Bajpai, Manjesh K. Hanawal. CeeBERT: Cross-Domain Inference in Early Exit BERT. ACL (Findings) 2024: 1736-1748.
Q2: Thanks for your comments. As shown in Table 5 in the manuscript, the index size and database size are relatively small compared to the backbone model: less than 6 MB in total. Table 4 also shows that increasing the amount of data in the retrieval database enhances RAEE's performance. So, when deploying such LLMs in resource-constrained scenarios, the resource requirements of the retrieval database should be the last consideration; for example, memory should preferentially be allocated to the model weights and KV cache, and then to the retrieval database.
Q3: Thanks for your comments. The early exit frameworks compared in this work are not suitable for all backbone models: HashEE and DeeBERT are specifically designed for BERT-like models, CALM is specifically designed for T5-based models, and SLEB is specifically designed for decoder-only models. We have tried our best to implement AdaInfer, whose design is least tied to the architecture of the backbone model, over all kinds of backbone models.
Q4: Thanks for your comments. We have conducted an ablation study on different thresholds in Rebuttal-Table 10. Experimental results show that the performance drop of the proposed RAEE is quite small on average.
Rebuttal-Table 10. Performance of RAEE using different thresholds across eight classification tasks with RoBERTa-Large.
| Metrics | SST-2(acc) | SST-5(acc) | MR(acc) | CR(acc) | MPQA(acc) | Subj(acc) | TREC(acc) | CoLA(mcc) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| RAEE (RB-L) Thres-0.1 | 86.35 | 34.84 | 83.55 | 72.55 | 79.30 | 85.45 | 63.40 | 13.95 | 64.92 |
| RAEE (RB-L) Thres-0.2 | 86.35 | 34.84 | 83.55 | 72.55 | 79.30 | 85.45 | 63.40 | 13.95 | 64.92 |
| RAEE (RB-L) Thres-0.3 | 86.35 | 34.75 | 83.55 | 72.55 | 79.30 | 85.45 | 63.40 | 13.95 | 64.91 |
| RAEE (RB-L) Thres-0.4 | 86.35 | 34.57 | 83.55 | 72.55 | 79.30 | 85.45 | 62.60 | 13.95 | 64.79 |
| RAEE (RB-L) Thres-0.5 | 86.35 | 34.43 | 83.55 | 72.55 | 79.30 | 85.45 | 62.80 | 13.95 | 64.80 |
| RAEE (RB-L) Thres-0.6 | 86.12 | 33.67 | 83.40 | 72.40 | 78.90 | 85.20 | 62.40 | 12.55 | 64.33 |
| RAEE (RB-L) Thres-0.7 | 86.01 | 33.53 | 83.20 | 72.15 | 78.20 | 84.80 | 61.60 | 14.65 | 64.27 |
| RAEE (RB-L) Thres-0.8 | 85.55 | 33.12 | 82.95 | 69.90 | 78.95 | 84.60 | 62.00 | 14.63 | 63.96 |
| RAEE (RB-L) Thres-0.9 | 84.63 | 33.57 | 81.55 | 68.05 | 78.55 | 84.05 | 62.40 | 14.48 | 63.41 |
This work uses retrieval to improve the performance of training-free early exit frameworks. The motivation is the observation that similar data should have similar early exit patterns. Experimental results show that this method is significantly better than existing training-free early exit methods.
Strengths
- Good empirical results on downstream tasks when compared to prior work (table 1)
- Compares inference times as well (figure 3)
- Provides a clear overview and motivation for the problem
Weaknesses
- While the paper mentions that out-of-domain performance is out of scope, I think this is a very important problem because many models today train on non-public data, and in the real world we do not always have training sets that accompany user inference. While it may be out of scope to completely characterize out-of-domain performance, I would like to see the authors do some analysis, such as examining performance changes as a function of the distance between a test example and its nearest neighbors.
- Another interesting experiment for the above is using the LM training set (e.g., C4 for T5) instead of the GLUE training sets for the database.
- Given that you have the test labels, can you do additional analysis comparing RAEE with other methods to see how often it exits at the correct layer? This could also clarify questions about inference times.
- Figure captions could be improved.
Questions
See potential experiments or analyses mentioned in weaknesses
W1: Thanks for your insightful suggestions. We rebuilt the database with different distance metrics, such as inner product, and retrieved the top-k nearest neighbors based on the corresponding distance metric. As shown in Rebuttal-Tables 7-8, although RAEE with different backbones based on the inner product achieves slightly poorer performance, the difference is quite small (less than one point on average), demonstrating the robustness of the proposed RAEE to the choice of distance metric.
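For reference, switching the metric amounts to swapping the index type in a library like FAISS (a sketch under our assumptions, not the authors' exact pipeline):

```python
import faiss
import numpy as np

d = 1024                                        # embedding dim (illustrative)
xb = np.random.rand(50000, d).astype("float32") # stand-in database embeddings

index_l2 = faiss.IndexFlatL2(d)                 # Euclidean (L2) distance
index_ip = faiss.IndexFlatIP(d)                 # inner product
index_l2.add(xb)
index_ip.add(xb)

# Same query under both metrics; differing neighbor sets shift the estimated
# exit distribution, which is what Rebuttal-Tables 7-8 quantify.
D_l2, I_l2 = index_l2.search(xb[:1], 8)
D_ip, I_ip = index_ip.search(xb[:1], 8)
```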
Rebuttal-Table 7. Performance of RAEE over different distance metrics.
| Metrics | SST-2(acc) | SST-5(acc) | MR(acc) | CR(acc) | MPQA(acc) | Subj(acc) | TREC(acc) | CoLA(mcc) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| RAEE (RB-L) L2 | 84.63 | 33.57 | 81.55 | 68.05 | 78.55 | 84.05 | 62.40 | 14.48 | 63.41 |
| RAEE (RB-L) IP | 83.37 | 32.76 | 81.90 | 69.00 | 77.65 | 85.05 | 61.80 | 5.91 | 62.18 |
| RAEE (T5-L) L2 | 52.98 | 26.56 | 50.80 | 51.60 | 55.65 | 49.90 | 39.80 | 12.20 | 42.44 |
| RAEE (T5-L) IP | 52.87 | 27.56 | 51.80 | 51.40 | 55.85 | 50.10 | 38.60 | 9.45 | 42.20 |
| RAEE (Llama) L2 | 73.05 | 35.25 | 66.45 | 57.95 | 75.05 | 90.05 | 51.80 | 9.55 | 57.39 |
| RAEE (Llama) IP | 70.99 | 33.94 | 64.60 | 57.75 | 74.05 | 89.05 | 48.20 | 10.61 | 56.15 |
| RAEE (Gemma) L2 | 73.17 | 32.40 | 66.75 | 56.75 | 75.60 | 90.15 | 40.00 | 10.46 | 55.66 |
| RAEE (Gemma) IP | 70.76 | 30.27 | 64.50 | 57.25 | 75.20 | 89.15 | 38.80 | 13.93 | 54.98 |
Rebuttal-Table 8. Exit layers of RAEE over different distance metrics.
| Layers | SST-2 | SST-5 | MR | CR | MPQA | Subj | TREC | CoLA | Avg |
|---|---|---|---|---|---|---|---|---|---|
| RAEE (RB-L) L2 | 18.55 | 13.93 | 18.71 | 15.35 | 17.20 | 13.59 | 12.82 | 12.48 | 15.33 |
| RAEE (RB-L) IP | 18.45 | 14.32 | 18.72 | 15.36 | 17.18 | 13.25 | 13.31 | 13.47 | 15.51 |
| RAEE (T5-L) L2 | 22.27 | 18.74 | 21.88 | 26.84 | 18.05 | 19.06 | 27.29 | 18.55 | 21.59 |
| RAEE (T5-L) IP | 21.42 | 17.88 | 21.95 | 26.46 | 17.41 | 19.23 | 27.29 | 17.91 | 21.19 |
| RAEE (Llama) L2 | 11.77 | 15.70 | 12.43 | 7.04 | 12.83 | 6.58 | 20.06 | 21.04 | 13.43 |
| RAEE (Llama) IP | 11.68 | 14.38 | 12.09 | 6.95 | 13.22 | 6.55 | 20.43 | 21.58 | 13.36 |
| RAEE (Gemma) L2 | 11.00 | 17.62 | 11.70 | 3.29 | 14.72 | 0.51 | 9.50 | 20.06 | 11.05 |
| RAEE (Gemma) IP | 11.81 | 17.91 | 11.92 | 3.38 | 15.21 | 0.52 | 8.76 | 20.53 | 11.25 |
W2: Thanks for your comments. Please refer to the General Response to Out-of-Domain Issues.
W3: Thanks for your suggestions. We collected statistics on RAEE's and AdaInfer's exit layers for correct predictions to show that the reduction in inference time comes from correct early exits rather than from exiting very early with wrong predictions. As shown in Rebuttal-Table 9, RAEE has similar exit layers for all data and for the correctly predicted data. This demonstrates that RAEE effectively accelerates inference without sacrificing performance.
We also collected the exit layers of AdaInfer on correct predictions, which show similar patterns. However, due to its poor exit-layer classification and poor model performance, AdaInfer makes many early exits with wrong predictions, so it achieves a very small inference latency but poor performance.
Rebuttal-Table 9. Exit layers of RAEE and AdaInfer on all predictions and correct predictions.
| Layers | SST-2 | SST-5 | MR | CR | MPQA | Subj | TREC | CoLA | Avg |
|---|---|---|---|---|---|---|---|---|---|
| RAEE (RB-L) | 18.55 | 13.93 | 18.71 | 15.35 | 17.20 | 13.59 | 12.82 | 12.48 | 15.33 |
| RAEE (RB-L) Corr. | 18.76 | 14.56 | 18.97 | 16.17 | 17.08 | 13.48 | 12.57 | 13.68 | 15.66 |
| AdaInfer (RB-L) | 1.00 | 0.00 | 1.46 | 1.00 | 18.00 | 1.10 | 0.00 | 4.00 | 3.32 |
| AdaInfer (RB-L) Corr. | 1.00 | 0.00 | 1.45 | 1.00 | 18.00 | 1.11 | 0.00 | 4.00 | 3.32 |
| RAEE (T5-L) | 22.27 | 18.74 | 21.88 | 26.84 | 18.05 | 19.06 | 27.29 | 18.55 | 21.59 |
| RAEE (T5-L) Corr. | 24.96 | 17.82 | 23.87 | 27.59 | 21.23 | 34.53 | 35.66 | 17.88 | 25.44 |
| AdaInfer (T5-L) | 6.34 | 0.00 | 7.72 | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | 2.13 |
| AdaInfer (T5-L) Corr. | 6.77 | 0.00 | 7.73 | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | 2.19 |
| RAEE (Llama) | 11.77 | 15.70 | 12.43 | 7.04 | 12.83 | 6.58 | 20.06 | 21.04 | 13.43 |
| RAEE (Llama) Corr. | 11.58 | 15.56 | 12.29 | 7.57 | 12.48 | 6.54 | 18.65 | 21.72 | 13.30 |
| AdaInfer (Llama) | 4.00 | 0.00 | 3.18 | 3.00 | 1.00 | 4.71 | 0.00 | 2.00 | 2.24 |
| AdaInfer (Llama) Corr. | 4.00 | 0.00 | 3.17 | 3.00 | 1.00 | 4.74 | 0.00 | 2.00 | 2.24 |
| RAEE (Gemma) | 11.00 | 17.62 | 11.70 | 3.29 | 14.72 | 0.51 | 9.50 | 20.06 | 11.05 |
| RAEE (Gemma) Corr. | 11.03 | 18.02 | 11.58 | 4.26 | 13.59 | 0.51 | 8.41 | 21.17 | 11.07 |
| AdaInfer (Gemma) | 1.00 | 0.00 | 1.04 | 1.00 | 3.00 | 1.00 | 0.00 | 2.00 | 1.13 |
| AdaInfer (Gemma) Corr. | 1.00 | 0.00 | 1.03 | 1.00 | 3.00 | 1.00 | 0.00 | 2.00 | 1.13 |
W4. Thanks for your comments. We have improved all figure captions with detailed descriptions.
Thank you for your response and additional experiments. I have adjusted my score to account for the updates.
However, I still have concerns regarding the general response to out-of-domain issues. Could the authors clarify how they would propose using RAEE in a more realistic setting where academic benchmarks are not the target? Also, "data quality" is vague, as some would claim that Wikipedia and wikitext are "high quality." Point 2 defines quality relative to whether RAEE can exit early, so perhaps it would help to have experiments that use a model and its actual training data, which may be able to exit early and then be used in the retrieval portion.
- The data domain also matters but is not strictly necessary. Since using the GLUE tasks' training datasets to build the retrieval database has demonstrated that their data quality is better than wikitext's, we conducted experiments using different GLUE tasks' training datasets to build the retrieval database. As shown in Rebuttal-Table 5, RAEE can achieve better performance with out-of-domain databases on the SST-2, SST-5, MR, and CR tasks. For the remaining tasks, RAEE achieves the best performance only with in-domain databases. For the exit layers in Rebuttal-Table 6, a similar conclusion can be drawn: in-domain databases are not always the best choice.
Rebuttal-Table 5. Performance of RAEE using different domain retrieval databases across eight classification tasks.
| Metrics | SST-2(acc) | SST-5(acc) | MR(acc) | CR(acc) | MPQA(acc) | Subj(acc) | TREC(acc) | CoLA(mcc) |
|---|---|---|---|---|---|---|---|---|
| RAEE (RB-L-'SST-2') | 84.63 | 34.39 | 84.80 | 80.75 | 59.60 | 50.65 | 28.20 | 7.86 |
| RAEE (RB-L-'SST-5') | 75.34 | 33.57 | 73.85 | 69.20 | 62.85 | 50.35 | 18.60 | 3.22 |
| RAEE (RB-L-'MR') | 90.48 | 36.11 | 81.55 | 81.80 | 66.05 | 51.10 | 24.60 | -3.83 |
| RAEE (RB-L-'CR') | 74.66 | 32.67 | 75.40 | 68.05 | 55.60 | 49.00 | 16.60 | 2.58 |
| RAEE (RB-L-'MPQA') | 86.12 | 34.16 | 83.30 | 83.50 | 78.55 | 50.00 | 29.40 | -1.56 |
| RAEE (RB-L-'Subj') | 79.70 | 32.22 | 75.90 | 72.55 | 54.70 | 84.05 | 11.80 | 8.06 |
| RAEE (RB-L-'TREC') | 61.01 | 26.97 | 56.55 | 75.50 | 56.30 | 49.65 | 62.40 | -1.39 |
| RAEE (RB-L-'CoLA') | 76.72 | 33.85 | 73.95 | 68.05 | 53.25 | 57.20 | 18.00 | 14.48 |
Rebuttal-Table 6. Exit layers of RAEE using different domain retrieval databases across eight classification tasks.
| Layers | SST-2 | SST-5 | MR | CR | MPQA | Subj | TREC | CoLA |
|---|---|---|---|---|---|---|---|---|
| RAEE (RB-L-'SST-2') | 18.55 | 18.57 | 18.60 | 19.09 | 17.90 | 17.95 | 19.41 | 18.56 |
| RAEE (RB-L-'SST-5') | 14.23 | 13.93 | 13.67 | 14.09 | 18.03 | 12.37 | 15.34 | 15.47 |
| RAEE (RB-L-'MR') | 18.52 | 19.06 | 18.71 | 19.19 | 17.39 | 18.20 | 18.25 | 19.39 |
| RAEE (RB-L-'CR') | 17.10 | 17.24 | 17.59 | 15.35 | 13.18 | 18.36 | 18.38 | 16.32 |
| RAEE (RB-L-'MPQA') | 21.71 | 21.64 | 21.66 | 21.45 | 17.20 | 22.09 | 21.79 | 20.34 |
| RAEE (RB-L-'Subj') | 18.48 | 18.02 | 18.06 | 16.75 | 7.67 | 13.59 | 6.94 | 11.09 |
| RAEE (RB-L-'TREC') | 12.29 | 12.39 | 11.74 | 16.37 | 13.60 | 12.04 | 12.82 | 16.12 |
| RAEE (RB-L-'CoLA') | 16.36 | 15.73 | 16.00 | 14.47 | 5.28 | 14.72 | 8.35 | 12.48 |
In conclusion, although RAEE with the wikitext-based retrieval database cannot achieve as good performance as that with the in-domain retrieval database, the above analysis still demonstrates the efficacy of the proposed RAEE framework.
Thanks for all the reviewers' insightful suggestions. The main concern about this paper is that RAEE cannot achieve good performance with an out-of-domain retrieval database. To analyze the out-of-domain issue, we conducted experiments using a retrieval database built on wikitext-2-v1. Since this text dataset has no gold labels, we follow the next-token prediction task setting, where the input sentence's next token is treated as the gold label.
Specifically, we first split the whole text dataset into sentences, avoiding breaking semantics. Then, according to the backbone model's maximum input length, for each sentence shorter than the maximum input length, we regard its last meaningful token as the gold label; for the other sentences, we use a sliding window of the backbone model's maximum input length and regard the last meaningful token of each window as the gold label. Finally, we collect the exit information in the way described in the paper. To best demonstrate the efficacy, we choose Llama-3-8B as the backbone model, which is pre-trained with the next-token prediction task.
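A hedged sketch of this label construction; the tokenizer interface and the non-overlapping window stride are our assumptions, not the authors' exact code:

```python
def build_next_token_examples(sentences, tokenizer, max_len):
    """Pair each (windowed) sentence prefix with its last token as the gold label."""
    examples = []
    for sent in sentences:
        ids = tokenizer.encode(sent)
        if len(ids) < 2:
            continue                                   # nothing to predict
        if len(ids) <= max_len:
            examples.append((ids[:-1], ids[-1]))       # last token is the gold label
        else:
            for start in range(0, len(ids) - max_len + 1, max_len):
                window = ids[start:start + max_len]    # sliding window of max_len
                examples.append((window[:-1], window[-1]))
    return examples
```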
Rebuttal-Table 1. Performance of the Llama-3-8b and RAEE (Llama-wiki) across eight classification tasks.
| Metrics | SST-2(acc) | SST-5(acc) | MR(acc) | CR(acc) | MPQA(acc) | Subj(acc) | TREC(acc) | CoLA(mcc) | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | 62.84 | 26.06 | 59.65 | 72.90 | 51.75 | 52.80 | 8.40 | 0.00 | 41.80 |
| RAEE (Llama-wiki) | 55.50 | 21.40 | 54.30 | 61.60 | 57.15 | 51.55 | 13.00 | 0.00 | 39.31 |
Rebuttal-Table 2. Exit layers of the Llama-3-8b and RAEE (Llama-wiki) across eight classification tasks.
| Layers | SST-2 | SST-5 | MR | CR | MPQA | Subj | TREC | CoLA | Avg |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | 32.00 | 32.00 | 32.00 | 32.00 | 32.00 | 32.00 | 32.00 | 32.00 | 32.00 |
| RAEE (Llama-wiki) | 29.41 | 28.83 | 29.30 | 28.75 | 30.13 | 29.31 | 30.34 | 30.56 | 29.58 |
As shown in Rebuttal-Tables 1-2, we evaluate RAEE (Llama) with the wikitext-based retrieval database, termed RAEE (Llama-wiki). Unsurprisingly, RAEE (Llama-wiki) performs poorly, although it can exit earlier. However, this performance drop cannot simply be attributed to the out-of-domain dataset. Three key points jointly impact the model performance as well as the inference efficiency.
- The task type used when building the retrieval database and the task type at inference with early exit should be aligned. To verify this point, we conducted an experiment on summarization tasks (CNN/DailyMail and XSum) while using the wikitext-based retrieval database, as shown in Rebuttal-Table 3. Experimental results demonstrate that RAEE can improve performance and accelerate inference even though the domain of the retrieval database is out of distribution.
Rebuttal-Table 3. Performance of the Llama-3-8b and RAEE (Llama-wiki) on generation tasks.
| | ROUGE-L | Layers |
|---|---|---|
| CNN/DailyMail Llama-3-8B | 8.95 | 32.00 |
| CNN/DailyMail RAEE (Llama-wiki) | 14.01 | 29.60 |
| XSum Llama-3-8B | 5.22 | 32.00 |
| XSum RAEE (Llama-wiki) | 7.15 | 28.82 |
- Data quality determines the generalization of RAEE's retrieval database of exit information, impacting the quality of the exit-distribution approximation through neighbors' exit information. We also conducted experiments to show why RAEE achieves poor performance with the wikitext-based retrieval database. As shown in Rebuttal-Table 4, we evaluate the next-token prediction accuracy of the backbone model and RAEE on wikitext-2-v1, where the token with the maximal probability is chosen as the next token (accuracy explains this claim better than perplexity; see the sketch after Rebuttal-Table 4). Experimental results show that the backbone can only achieve an accuracy of 53.90 at predicting the next token. Even with RAEE, which corrects some predictions via early exit, the accuracy stays at a low 57.20. These results demonstrate that there is a considerable volume of sentences in the wikitext dataset for which the backbone model cannot make correct predictions even when RAEE corrects some of them. Those cases also prevent RAEE from exiting early.
Rebuttal-Table 4. Performance and exit layers of Llama-3-8b and RAEE (Llama-wiki) on wikitext training data.
| | acc | Layers |
|---|---|---|
| Llama-3-8B | 53.90 | 32.00 |
| RAEE (Llama-wiki) | 57.20 | 30.00 |
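The greedy next-token accuracy reported above could be computed along the following lines (a sketch assuming a Hugging-Face-style causal LM whose forward pass returns `.logits`):

```python
import torch

@torch.no_grad()
def next_token_accuracy(model, batches):
    """Greedy accuracy: the maximal-probability token is compared to the gold label."""
    correct, total = 0, 0
    for input_ids, gold in batches:                 # gold: held-out last-token ids
        logits = model(input_ids).logits[:, -1, :]  # distribution at the last position
        pred = logits.argmax(dim=-1)                # token with maximal probability
        correct += (pred == gold).sum().item()
        total += gold.numel()
    return 100.0 * correct / total
```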
This paper received mixed reviews. The reviewers recognized the new, interesting, and reasonable method, its strong performance, and extensive experiments. At the same time, they raised concerns with unclear motivation and inappropriate positioning of the paper (qFBa), no comparisons with other approaches aiming at inference acceleration, e.g., PagedAttention (qFBa), lack of essential analysis supporting the main arguments (qFBa), experiments limited to in-distribution retrieval (xZ1Y, LKLV), space-time complexity caused by the use of retrieval database (Jjgv, LKLV), lack of proper baselines (LKLV), missing analysis on the sensitivity to the threshold hyperparameter (Jjgv), lack of discussion about existing retrieval-augmented methods (Jjgv), and presentation issues (xZ1Y, Jjgv, qFBa).
The authors' rebuttal and subsequent responses in the discussion period address some of these concerns but failed to fully assuage all of them: after the discussion period, the reviewers still pointed out issues with the motivation and positioning of the paper (qFBa), missing comparisons with PagedAttention (qFBa), concerns with the out-of-distribution experiments (xZ1Y), the complexity issue (Jjgv, LKLV), and missing comparisons with relevant baselines (LKLV). As a result, two reviewers voted to reject, and the reviewer who leaned borderline toward accept still has concerns about the OOD experiment.
Putting these together, the AC finds that the remaining concerns outweigh the positive comments and the rebuttal, and thus regretfully recommends rejection. The authors are encouraged to revise the paper following the comments of the reviewers and the AC and submit to an upcoming conference.
Additional Comments on Reviewer Discussion
The rebuttal failed to assuage the reviewers' major concerns; thus two reviewers voted to reject, and even the reviewer who leaned borderline toward accept still raised concerns in his or her final comment. The AC carefully read the confidential message from the authors and disregarded the novelty issue raised by Reviewer LKLV as the authors requested, but a number of serious concerns remain that are sufficient reason to recommend rejection. Below I summarize the major concerns of the reviewers and how they were addressed.
- Experiments limited to in-distribution retrieval (xZ1Y, LKLV): The AC agrees with the reviewers that OOD robustness has to be guaranteed for the deployment of the proposed method in the wild, i.e., the assumption that test data will be sampled from the training distribution will not hold in many real-world applications. According to Reviewer xZ1Y, this concern has not been fully assuaged. It is one of the reasons for recommending rejection, though not the most important factor.
- Lack of discussion about existing retrieval-augmented methods (Jjgv): Well addressed by the revision. No concern remaining.
- Space-time complexity caused by the use of a retrieval database (Jjgv, LKLV): The AC considers this concern not well addressed. The AC agrees with the authors that the database demands a much smaller amount of memory than LLMs, but as the main target of this paper is inference acceleration, the time complexity of retrieval should also be considered, which could be non-trivial even with the latest NN search libraries like FAISS if the database is large. Also, building a retrieval database requires computational resources and time. It is one of the reasons for recommending rejection, though not the most important factor.
- Missing analysis on the sensitivity to the threshold hyperparameter (Jjgv): This concern has been well resolved by additional experimental results reported in the rebuttal.
- No comparison with other approaches accelerating inference, e.g., PagedAttention (qFBa): This concern has not been well assuaged due to the absence of experiments for the comparison. Reviewer qFBa believes that the comparison is very important, especially if the authors want to claim inference acceleration as the main contribution of this work, with which the AC agrees. It is one of the main reasons for recommending rejection.
- Unclear motivation and inappropriate positioning of the paper (qFBa): Reviewer qFBa considered that the improvement in inference latency by the proposed method is limited compared with some of the latest methods, whereas its performance improvement is intriguing, so the motivation of this paper is unclear and its positioning (i.e., inference acceleration) could be inappropriate. The authors' response to this comment sounds reasonable to some extent: the latest methods improve inference speed a lot but substantially degrade performance, while the proposed one achieves decent performance. However, due to the absence of relevant experiments (as mentioned in the item above), the reviewer still has doubts about the motivation and positioning of the paper. It is one of the main reasons for recommending rejection.
- Limited novelty (LKLV): The AC did not consider this concern at all when making the final decision, since the authors' rebuttal on this comment in the confidential message sounds reasonable. (The reviewer compared this submission with his or her own work that is not officially published yet, which the AC believes is inappropriate.)
- Lack of proper baselines (LKLV): The reviewer was not satisfied by the response, as the authors did not conduct the experiments the reviewer asked for. However, the AC thinks this is not a serious issue, as the baselines suggested by the reviewer have inherent limitations, namely a lack of versatility. Of course, it would have been nice if the requested experiments and comparisons had been made.
- Potentially unfair comparisons with prior work (LKLV): This seems to have been caused by a misunderstanding on the reviewer's part; the rebuttal clearly resolved this issue.
Reject