CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation
Abstract
Reviews and Discussion
This paper proposes an MLP-free adaptive sparse activation inference method based on sentence-level prediction.
Strengths
- The paper is easy to follow, with clear writing and presentation.
- Evaluation results are comprehensive.
Weaknesses
- In lines 93-94, it should be 'nearly lossless generation'
- How would the method scale to larger models such as LLaMA-70B in multi-GPU settings?
- Would this method be compatible with quantization methods?
- The authors should also discuss related works on sparse KV cache [1-4], another important direction to accelerate LLM inference.
[1] Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, MLSys 2024.
[2] Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, MLSys 2024.
[3] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, ISCA 2024.
[4] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, OSDI 2024.
Questions
Please see the weaknesses above.
Dear Reviewer,
We would like to sincerely thank you for taking the time to review our paper and provide valuable feedback. We are glad that you think our paper is well presented. We are thankful to have the chance to address your questions and concerns.
Q1. In lines 93-94, it should be 'nearly lossless generation'.
A1. We sincerely appreciate you for this thoughtful suggestion. In the updated version, we have revised this term.
Q2. How would the method scale to larger models such as LLaMA-70B in multi-GPU settings?
A2. Thank you for raising this meaningful question. We are very happy to discuss it with you. In fact, we think that CoreInfer can easily adapt to multi-GPU settings. This is because the computation of core neurons is done at the layer level, and different layers do not affect each other. In multi-GPU settings, a large model is typically split by assigning certain layers to individual GPUs, meaning that computations within the same layer still occur on a single device. Therefore, the calculation of core neurons is unaffected by multi-GPU settings, making it easy to scale and deploy.
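To make the layer-level independence concrete, below is a minimal PyTorch sketch (the helper name and scoring rule are illustrative, not our released implementation): core-neuron selection is a per-layer top-k over token-pooled activations, so it only touches tensors that already reside on whichever device holds that layer.

```python
import torch

def select_core_neurons(ffn_acts: torch.Tensor, beta: float = 0.2) -> torch.Tensor:
    """Pick the top-beta fraction of FFN neurons for one layer.

    ffn_acts: [num_prefill_tokens, ffn_dim] activations of a single layer,
    resident on whichever device holds that layer.
    """
    pooled = ffn_acts.abs().sum(dim=0)        # pool over the prefill tokens -> [ffn_dim]
    k = max(1, int(beta * pooled.numel()))
    return torch.topk(pooled, k).indices      # indices stay on the layer's device

# Toy example: two layers placed on two devices (falls back to CPU if fewer than 2 GPUs).
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
layer_acts = {i: torch.randn(128, 4096, device=d) for i, d in enumerate(devices)}

# Each layer's core-neuron set is computed locally, with no cross-device traffic.
core_neurons = {i: select_core_neurons(a) for i, a in layer_acts.items()}
print({i: len(idx) for i, idx in core_neurons.items()})
```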
Q3. Would this method be compatible with quantization methods?
A3. We sincerely appreciate this constructive suggestion. We deployed CoreInfer with 4-bit quantization using the Bitsandbytes library and tested its performance on four commonsense reasoning tasks. We found that CoreInfer achieved lossless sparse inference even when applying 4-bit quantization, as the activation of neurons was not affected by quantization. After combining CoreInfer with the FP4 and NF4 quantization formats, the average accuracy improved from 69.98 and 68.83 to 69.99 and 68.89, respectively. Therefore, CoreInfer can be combined with state-of-the-art quantization methods for further acceleration. We have added these experimental results in Appendix A.4.3.
| Method | PIQA | Wino | RTE | COPA | Avg. |
|---|---|---|---|---|---|
| FP4 | 75.79 | 63.54 | 55.59 | 81.00 | 69.98 |
| FP4+CoreInfer | 75.79 | 63.61 | 55.59 | 81.00 | 69.99 |
| NF4 | 76.11 | 64.32 | 54.87 | 78.00 | 68.83 |
| NF4+CoreInfer | 76.11 | 64.25 | 55.23 | 78.00 | 68.89 |
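For reference, a minimal sketch of the 4-bit setup combined with CoreInfer is shown below. The model loading uses the standard transformers/bitsandbytes path; the CoreInfer-specific step is indicated only by a hypothetical helper, since the masking code is part of our implementation rather than a library API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Standard 4-bit loading; set bnb_4bit_quant_type to "fp4" for the FP4 format.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "facebook/opt-6.7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# Hypothetical CoreInfer step: after the pre-fill pass, restrict each FFN to its
# core neurons (e.g., zero out the non-core rows/columns of the FFN projections).
# Quantization does not change which neurons fire, so the selection is unaffected.
# apply_core_neuron_mask(model, core_neurons)  # hypothetical helper, not a real API
```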
Q4. The authors should also discuss related works on sparse KV cache [1-4], another important direction to accelerate LLM inference.
A4. Thank you for providing this valuable suggestion and these references. Activation sparsity-based inference can indeed be combined with sparse KV cache techniques to further accelerate the model. In the updated version of the paper, we have added a discussion on KV cache methods in the Background and explored the references provided.
We would like to once again express our sincere gratitude for your valuable suggestions. Your feedback has made our work more complete and thorough. If you have any further comments, we would be more than happy to discuss them and address any questions you may have.
Best regards,
Authors
Thank you for the detailed response. Most of my concerns have been addressed. I am increasing my score from 6 to 8.
Dear Reviewer,
Thank you very much for taking the time to read our rebuttal and for providing positive feedback! It is gratifying to know that we addressed most of your concerns. Your insightful feedback is greatly appreciated, and we will ensure all necessary modifications are made accordingly. We are committed to continuously improving our work based on your valuable suggestions.
Thank you once again for your time and effort.
Sincerely,
Authors
This work proposes an activation sparsity method to accelerate LLM inference. Specifically, the method identifies core neurons at the sentence level during the pre-fill stage when the input sentence exhibits stable semantics. Furthermore, by analyzing the relationship between core neurons and semantics, the method can be adapted to tasks with variable sentences by clustering samples into distinct groups. Experimental results demonstrate a significant inference speedup with an acceptable accuracy drop.
Strengths
- This work has a clear motivation. In related work, token-wise sparsification and the inclusion of an additional learnable predictor both present challenges, as they lead to extra inference overhead.
- The proposed concept of sentence-wise core neurons is novel. It is intriguing that core neurons can be directly identified in the pre-fill phase without the need for an additional predictor.
- In the experiments, the proposed method shows significant inference speedup with an acceptable accuracy drop.
Weaknesses
- The proposed method involves three manually tuned hyper-parameters (alpha, beta, gamma) to determine the final sentence-wise core neurons, which may limit its practical usage.
- The claim for Fig. 2(C) needs further justification. I agree with the authors that Fig. 2(C) shows some of the activations are close to each other for the same sentence. However, Fig. 2(C) also shows some groups of neurons with the same color (i.e., the same sentence) located far apart. How can the authors claim that "the distribution of core neurons of tokens in the same sentence is always closer"?
- In the experiments, some of the results are unclear. (Please see the detailed comments in the Questions below.)
Others:
A. It may be inappropriate to refer to the proposed method as "zero-cost sparse inference" in the Abstract, as sparse inference inherently involves some cost.
Questions
Overall, I find the proposed method well-motivated and innovative. However, the current version has several areas that require further justification. Specifically, please address the following questions:
- Weakness 2, as shown above.
- Is the speedup gain consistent across OPT-6.7B and other LLMs? In Table 3, the proposed method achieves a 10.33x speedup over the original transformer on OPT-6.7B; however, the speedup is notably lower on other LLMs, as shown in Fig. 7. For instance, on Llama2-7B with a sequence length of 256, the speedup is around 1.6x.
- The system memory usage needs further clarification. According to Fig. 7, the proposed method shows no system memory consumption, while the original transformer uses substantial system memory. My understanding is that the proposed method uses activation sparsity to accelerate inference in the original transformer without any system-level modifications. Could you clarify this aspect?
- The work lacks a comparison of perplexity (PPL) or zero/few-shot accuracy. The authors focus only on hardware performance comparisons with previous works, such as Deja Vu and PowerInfer. How does the accuracy of the proposed method compare to these works? Additionally, what is the sparsity level of these other methods?
Ethics Concerns Details
No Ethics Concerns found.
Q6. Task performance comparison with predictor-based methods
A6. We sincerely appreciate you for this valuable suggestion.
We compared the performance of CoreInfer and PowerInfer on four commonsense reasoning tasks, with the experimental results provided in Appendix A.4.2. The experimental results for PowerInfer are taken from the original paper. We observed that both CoreInfer and PowerInfer achieved near-lossless performance on these tasks. However, PowerInfer requires additional predictors for training and inference, leading to increased overall cost. For example, for OPT-30b, PowerInfer requires approximately 10.45GB for the predictors, which accounts for about one-sixth of the total model size.
Regarding the sparsity of other methods, previous predictor-based approaches train an MLP predictor for each layer of the model to predict which neurons will be activated. The level of sparsity in these methods depends on the model, the number of layers, and the input content. In general, assuming a prediction accuracy close to 100%, the sparsity can be estimated as the ratio of activated neurons to the total number of neurons. In the OPT model series, the proportion of activated neurons is around 10% to 15%. Additionally, the memory size of the MLP predictors introduced by these methods is approximately 10% - 20% of the overall model size.
| Model | Method | PIQA | Wino | RTE | COPA | Avg | Predictor-free | Predictor Mem |
|---|---|---|---|---|---|---|---|---|
| Opt-6.7b | Original | 76.28 | 65.19 | 55.23 | 81.00 | 69.43 | - | - |
| | PowerInfer | 75.67 | 65.51 | 55.96 | 81.00 | 69.53 | ✗ | 3.36 GB |
| | CoreInfer | 76.27 | 65.27 | 55.23 | 81.00 | 69.44 | ✓ | 0 GB |
| Opt-13b | Original | 76.01 | 64.96 | 58.12 | 85.00 | 71.02 | - | - |
| | PowerInfer | 76.28 | 65.98 | 56.32 | 84.00 | 70.65 | ✗ | 4.58 GB |
| | CoreInfer | 76.17 | 65.35 | 57.76 | 85.00 | 71.07 | ✓ | 0 GB |
| Opt-30b | Original | 77.58 | 68.82 | 58.40 | 82.00 | 71.69 | - | - |
| | PowerInfer | 77.48 | 67.56 | 59.93 | 82.00 | 71.53 | ✗ | 10.45 GB |
| | CoreInfer | 77.58 | 68.12 | 58.40 | 82.00 | 71.53 | ✓ | 0 GB |
We once again thank you for taking the time and effort to thoughtfully consider our work! We truly appreciate your recognition of our efforts. If you have any further questions, do not hesitate to let us know. We are more than happy to continue the discussion and address any concerns you may have.
Best regards,
Authors
Dear Reviewer,
We would like to sincerely thank you for taking the time to review our paper and provide valuable feedback. We are glad that you think our method has a clear motivation and is novel. We are thankful to have the chance to address your questions and concerns.
Q1. Further justification of Fig. 2(C).
A1. We are grateful for the time and effort you have taken to carefully review our work and raise this concern. We sincerely apologize for the misunderstanding caused. In fact, this misunderstanding was caused by the inappropriate use of colors in Fig. 2(C). We initially displayed the core neurons of 50 sentences using a gradient color to represent different sentences, which made it appear as though some sentences had the same color (though they were actually different sentences). We have updated Fig. 2(C), reducing the number of sentences to 20 and using more distinct colors. We hope this new version helps to minimize any confusion.
Q2. Inappropriate term "zero-cost sparse inference" used in the abstract.
A2. We appreciate your thoughtful suggestions. In the new version, we have revised it to "fast sparse inference".
Q3. Explanation of the acceleration in Figure 7.
A3. We appreciate you raising this question, and we are glad to address your question. In Tab. 3, we presented the experimental results on a low-performance GPU (12GB NVIDIA TITAN XP), while in Fig. 7, we showed the results on a high-performance GPU (100GB NVIDIA A100). For the OPT-6.7b model, the low-performance GPU was memory-bound, and our speedup was achieved by reducing the computational requirements during inference and addressing the memory-bound limitation (similar to what we observed with the 70b model on the high-performance GPU). On the high-performance GPU, the OPT-6.7b model was not memory-bound, so the speedup was purely due to the reduction in computational requirements.
Q4. Clarification of system memory usage.
A4. We thank you for raising this question and are more than happy to provide clarification. In Fig. 7, the term "system memory" refers to the CPU memory required during model inference. For larger models such as OPT-66b and Llama2-70b, inference can be memory-bound, meaning that some of the model weights need to be offloaded to the CPU. Larger CPU memory usage implies greater data transfer between the CPU and GPU, leading to increased latency from this memory bottleneck. We apologize for the confusion caused by our wording. In the revised version, we have updated it to "CPU Memory".
Q5. The use of hyperparameters may limit practical usefulness.
A5. We understand your concern, and we are happy to provide further clarification regarding the hyperparameters. In Fig. 6, we performed an ablation study on the hyperparameters, and it can be seen that CoreInfer is not particularly sensitive to hyperparameter settings. Good performance can be achieved with a range of hyperparameter values, indicating that our method is robust to hyperparameter selection. Additionally, during the experiments, we used the same hyperparameters for tasks of the same type, and the model performed well across all tasks. This suggests that in practical applications, hyperparameters could potentially be set based on the task type. We acknowledge the potential challenges posed by hyperparameters in practical applications, and we plan to further explore and address this issue in future work. Thanks for your concern.
Dear Reviewer 2DA3,
We sincerely appreciate the time and effort you have dedicated to reviewing our paper. In our updated version, we have addressed the areas you highlighted that could cause confusion and have added a comparison with predictor-based methods, as per your suggestion.
We are truly grateful for your valuable suggestions and positive feedback. If you have any further questions, we would be more than happy to provide additional information!
Thank you again for your support.
Sincerely,
The Authors
I appreciate the authors' detailed response. Considering the writing quality, technical contributions, and scope of the problem, I will maintain my current score.
CoreInfer determines the top-beta neurons activated (pooling across tokens in the prefill) and then uses just these for decoding, provided they are found to be stable (i.e., this group stays approximately the same). If they are not found to be stable, CoreInfer defers to a "similarity-guided" prediction that takes core neurons found to be semantically similar (available in some datasets like ag_news, or found via K-means clustering).
The result is an MLP-free method of actively reducing the effective size of the FFN, reusing only these "core neurons", yielding speed-ups of up to around 10x for decoding from saved calculations and a reduced need for memory movement. Similar results and accuracy are observed in several cases.
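A rough sketch of that control flow, as described above (the helper names, scoring rule, and thresholds are illustrative rather than the paper's exact implementation):

```python
import torch

def core_neurons_from_prefill(acts: torch.Tensor, beta: float) -> set:
    """acts: [num_tokens, ffn_dim] prefill activations of one FFN layer."""
    pooled = acts.abs().sum(dim=0)
    k = max(1, int(beta * pooled.numel()))
    return set(torch.topk(pooled, k).indices.tolist())

def choose_decode_neurons(acts, beta=0.2, stability_thresh=0.9, groups=None, sent_emb=None):
    """Stability-guided prediction with a similarity-guided fallback."""
    half = acts.shape[0] // 2
    early = core_neurons_from_prefill(acts[:half], beta)
    late = core_neurons_from_prefill(acts[half:], beta)
    overlap = len(early & late) / max(1, len(late))

    if overlap >= stability_thresh or groups is None:
        # Semantics look stable: reuse the prefill core neurons for the whole decode.
        return core_neurons_from_prefill(acts, beta)

    # Otherwise fall back to the precomputed core-neuron set of the closest
    # semantic group (e.g., K-means clusters over sentence embeddings).
    sims = {g: torch.cosine_similarity(sent_emb, info["centroid"], dim=0).item()
            for g, info in groups.items()}
    return groups[max(sims, key=sims.get)]["neurons"]
```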
Strengths
This paper investigates prediction of which neurons will be activated, stemming from an observation that similar semantics (especially per sentence) tend to have a similar set of neurons activated. The amount of savings in inference time is quite large, suggesting a high degree of sparsity in the FFN -- especially when using ReLU activation functions -- which unblocks running large language models efficiently on smaller GPUs and edge devices.
Significance:
With the FFN holding the greatest number of weights as the model increases in size, and with hardware needing to begin loading early (to prevent bubbles in pipelining), this provides an excellent recipe for efficient hardware decoding of LLMs.
With the majority of energy being spent on off-chip DRAM fetching (see the "Melting Point" and "MobileLLM" papers), this also provides an excellent route for running LLMs on constrained hardware. It might be interesting to check what the average performance would be when omitting this fall-back in the unstable case, while increasing the size of the accepted core-neuron group.
Clarity and Originality
The paper is very clear in its motivations and methods, and in how it differs from related works. Prior works tend to use MLPs and learned parameters, which this paper claims add overhead that reduces the inference savings. This work focuses on two techniques for on-the-fly decoding acceleration without additional learned parameters (maximizing the speed-ups and energy/memory savings of the technique in edge inference).
Weaknesses
The semantic stability is noted by the paper to sometimes be not strong enough to warrant restricting to neurons characterized in the prefill stage.
When this happens, the authors explored finding semantic similarity to core-neuron sets already discovered, allowing the time spent studying the LLM's "core-neuron sets" to translate into gains in inference speed.
Clearer labelling would benefit a few of the figures. Perhaps asterisks on Table 2 indicating which benchmarks used the stable vs. unstable methods. Figure 6 would preferably have titles to clarify which graph refers to which method, and should either place the curves on the same graph or use the same x and y scales to make comparison easier.
Questions
Sentence length was found to be a distinguishing factor for when the semantics are stable enough to dynamically determine the core-neurons, would there be any data for how this transfers across very different looking tasks (e.g. programming where sentences are not as well defined) and languages with different average sentence lengths and syntax?
How does this correlate with tokenization? Namely, if the language is caught in the byte-fallback of the tokenizer (e.g. Mandarin characters for r50k Tiktoken) would the number of core neurons increase?
Q3. Correlate with tokenization.
A3. We greatly appreciate you raising this meaningful discussion. As mentioned in A2, we conducted experiments and analyses on Mandarin tasks that might be "caught in the byte-fallback of the tokenizer." In these scenarios, we observed the following:
- The Number of Core Neurons Required Does Not Increase: In fact, once stability is reached, retaining only 20% of the neurons still allows the model to achieve good generative performance. For instance, in the MultiFieldQA-zh Chinese QA task, CoreInfer retained only 20% of the core neurons, yet the model's F1 score increased from 9.21 to 12.86.
- The Token Length Required for Stability Increases: Due to the byte-fallback phenomenon, the model requires a longer sequence of tokens to understand the semantics. As shown in Fig. 3, the model requires approximately 500 tokens to achieve stability.
Based on these findings, we conclude that as long as the semantics are well defined, the model only requires a small number of core neurons for inference. However, different tokenization strategies may impact the number of tokens needed to establish clear semantics. We have added a discussion on this interesting phenomenon in Appendix A.2.3.
Again, we would like to thank you for your deep thoughts and support for our work! We truly enjoy discussing these constructive and meaningful questions with you, as it makes our work more complete and robust. If you have any further questions, we would be more than happy to discuss and address them!
Best regards,
Authors
Dear Reviewer,
We would like to sincerely thank you for taking the time to review our paper and provide valuable feedback. We are glad that you think our approach is innovative and has great potential for application on edge devices. We are thankful to have the chance to address your questions.
Q1. Corrections to figures and titles.
A1. We sincerely appreciate your thorough reading of our paper and the valuable suggestions provided. We have revised Tab. 2 and Fig. 6 as suggested, and hope that the updated figures provide clearer insights for the readers.
Q2. Analysis of stability on different tasks.
A2. Thank you for providing this very constructive suggestion! Multilingual and other input formats are an important situation that is well worth exploring. For a deeper analysis, we first examined the stability of core neurons as the input length increased for both Mandarin characters and code inputs. Our analysis results are presented in Appendix A.2.3 and Fig. 11. Furthermore, to verify the effectiveness of CoreInfer in this context, we conducted analyses and tests on the LCC/RepoBench-P datasets (code prediction tasks where the goal is to predict the next line of code given a longer code snippet) and on Chinese datasets including DuReader, MultiFieldQA-zh, VCSUM, LSHT, and PassageRetrieval-zh. We report our experimental results in Appendix A.4.1. We also provide examples of the responses in Appendix A.5.2.
Based on the experimental results, we have the following findings:
- Gradual Stabilization of Core Neurons Across Different Tasks. For different tasks, as the input length increases, the core neurons exhibit a gradual stabilization phenomenon. Even for programming tasks with less clearly defined semantics (such as Python, C#, and Java), core neurons gradually stabilize as the length of the input code increases. We speculate that this is because, through training, the model learns to understand the intrinsic semantics of these languages during inference. For these tasks, using stability-guided prediction, CoreInfer demonstrates excellent generative capabilities. For example, on the Lcc and RepoBench-P datasets, CoreInfer experienced less than 3% performance loss even with a 20% neuron sparsity level.
- Variation in Stabilization Length Across Tasks. Different tasks exhibit varying input lengths required to reach stability. In Appendix Figure 11, we visualize the stabilization process for code and Chinese characters. We found that, compared to typical English text, both code and Chinese characters require a longer token length to achieve stabilization (around 300 tokens for code and 500 for Chinese characters). We speculate that this is due to the unique formatting of code, which often requires more complete code blocks to represent semantics. Meanwhile, Chinese characters, potentially due to byte-fallback effects, require longer token lengths to convey the same meaning as English text.
| Task | Task Type | Eval metric | Avg len | Language | Original | CoreInfer |
|---|---|---|---|---|---|---|
| DuReader | Multi-doc QA | Rouge-L | 15768 | ZH | 6.59 | 6.29 |
| MultiFieldQA-zh | Single-doc QA | F1 | 6701 | ZH | 9.21 | 12.86 |
| VCSUM | Summarization | Rouge-L | 15380 | ZH | 0.16 | 0.17 |
| LSHT | Few shot | Accuracy | 22337 | ZH | 18.25 | 16.00 |
| PassageRetrieval-zh | Synthetic | Accuracy | 6745 | ZH | 10.12 | 9.87 |
| LCC | Code | Edit Sim | 1235 | Python/C#/Java | 58.25 | 56.57 |
| RepoBench-P | Code | Edit Sim | 4206 | Python/Java | 52.20 | 50.19 |
Thanks for the detailed response to each of the questions. I have a question about the general overlap of sparsity across single-language tasks.
Specifically, what would the typical overlap be between the sparsity of subsequent summaries in a single language? I am curious whether, for a specific language or task, there may be sections which are consistently inactive and pruned via this method.
Dear Reviewer,
Thank you for your question. We want to make sure we fully understand your query about the overlap of inactive neurons. Could you please confirm if our interpretation matches your intended meaning?
- When you refer to "general overlap of sparsity," do you mean the overlap of inactive neurons?
- Based on this and your mention of "subsequent summaries," are you suggesting we explore the overlap of inactive neurons across different data points within the same English summarization dataset, like XSum?
Once we receive your clarification, we would be happy to provide additional results and discuss further. Thank you again for your valuable feedback. We look forward to hearing from you.
Best regards,
The Authors
Dear Reviewer,
We greatly appreciate your constructive suggestions and insightful feedback on our work. We understand that you are very busy. With the discussion period drawing to a close, we kindly request your permission to allow us to proceed with our investigation based on the interpretation outlined above. Thank you so much for your understanding! We hope that our results and discussions could address your questions.
To explore the overlap of core neurons across different samples within the same language task, we conducted validations on the XSum summarization task. The experiments were performed on the OPT-6.7B model, selecting 100 samples for each dataset. The table below illustrates the proportion of inactive neurons across all samples at various overlap ratios.
| Inputs | 100% | 80% | 60% | 40% | 20% |
|---|---|---|---|---|---|
| Neurons inactivated | 0.56% | 23.56% | 59.81% | 83.36% | 94.96% |
Table Data Explanation: For example, 23.56% indicates that 23.56% of neurons remain inactive in over 80% of the samples.
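A short sketch of how such overlap statistics can be computed (the extraction of per-sample activation masks is omitted, and the mask below is synthetic):

```python
import numpy as np

def inactive_neuron_overlap(active_masks: np.ndarray, ratios=(1.0, 0.8, 0.6, 0.4, 0.2)):
    """active_masks: boolean [num_samples, num_neurons]; True if the neuron is
    in the core (activated) set for that sample."""
    frac_inactive = (~active_masks).mean(axis=0)   # per-neuron fraction of samples inactive
    # For each threshold r, report the fraction of neurons inactive in >= r of the samples.
    return {r: float((frac_inactive >= r).mean()) for r in ratios}

# Synthetic example: 100 samples, 16384 FFN neurons, ~20% of neurons active per sample.
rng = np.random.default_rng(0)
masks = rng.random((100, 16384)) < 0.2
print(inactive_neuron_overlap(masks))
```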
Based on the experimental results, we have the following observation:
Static Pruning is Challenging. As shown in the table, only 0.56% of neurons remain inactive across 100% of the samples, indicating minimal overlap of inactive neurons. This suggests that the traditional static pruning methods struggle to achieve lossless compression. In contrast, our CoreInfer method utilizes dynamic activation sparsity inference. Unlike static pruning, dynamic activation sparsity inference methods such as PowerInfer [1], LLM in Flash [2], and Deja Vu [3] selectively activate different neurons based on the specific inputs. This new type of method is motivated by the insight that LLMs inherently exhibit high activation sparsity for individual inputs. The effectiveness and hardware deployment friendliness have been validated in previous works like [1,2,3]. Previous dynamic activation sparsity inference methods relied on MLP-based predictors, whereas our CoreInfer is the first method to achieve approximate lossless compression under high sparsity without the need for training predictors.
We are very grateful for this meaningful question, as it has prompted us to reflect more deeply on our paper. We will incorporate all the discussions and your suggestions during the rebuttal into the revised version of the manuscript. Once again, we appreciate your support for this work and your constructive and helpful suggestions. Thank you so much for your time and efforts for reviewing our paper.
Sincerely,
The Authors
Reference:
[1] Song, Yixin, et al. "Powerinfer: Fast large language model serving with a consumer-grade gpu." Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 2024.
[2] Alizadeh, Keivan, et al. "Llm in a flash: Efficient large language model inference with limited memory." arXiv preprint arXiv:2312.11514 (2023).
[3] Liu, Zichang, et al. "Deja vu: Contextual sparsity for efficient llms at inference time." International Conference on Machine Learning. PMLR, 2023.
Dear Reviewer aw95,
We sincerely appreciate your valuable feedback and insightful suggestions on our work. In our updated version, we have incorporated your suggestions by adding evaluations of our method on tasks with different input formats, and we have further explored its relationship with tokenization.
We are truly grateful for your positive feedback, recognizing our approach as innovative. If you have any further questions, we would be more than happy to provide additional information!
Thank you once again for your valuable perspectives.
Sincerely,
The Authors
This work proposes a new sparse activation inference method based on the semantics of input sentences. It finds that the activated neurons remain relatively stable for semantically consistent sentences and thus proposes to only activate the frequently used neurons based on the input context. Experiments validate the effectiveness of the proposed method in improving throughput compared to previous predictor-based sparse activation methods.
Strengths
- The paper is clearly written and easy to follow.
- The proposed method is motivated by an analysis of core neurons and semantics.
- Experiments demonstrate that the proposed method achieves improved throughput compared to previous predictor-based methods.
Weaknesses
- The major concern with this paper is the general applicability of the proposed method. For user input contexts and questions, especially for long-context reasoning, it is difficult to ensure and inaccurate to assume that all inputs are semantically stable and full of relevant information. Thus, the assumption underlying the proposed method is too strong and may not be applicable in real-world applications or more challenging questions.
Additionally, for similarity-guided prediction, core neurons are identified by finding the closest semantic groups. If this is the case, the insight of this work generalizes to the following: similar topics will always activate the same set of core neurons, which is also too strong an assumption. Similarly, the input context may not contain only one topic.
- As a follow-up to the above concern, the proposed method is evaluated on a limited number of tasks. It would be highly desirable for it to be evaluated on a larger set of commonsense reasoning tasks and long-context tasks, such as LongBench, to validate whether the sparsified neurons can handle more complex input contexts.
- The proposed method is not benchmarked against predictor-based sparse activation methods in terms of task accuracy.
- Some design aspects of the proposed method need more clarification: (1) it is unclear how semantic similarity is computed; (2) it also remains unclear how to determine whether the input is stable.
Questions
My questions are listed in the weakness section.
Q2. Performance comparison with predictor-based methods.
A2. We sincerely appreciate your suggestion. We compared the performance of CoreInfer and PowerInfer on four commonsense reasoning tasks, and included the experimental results in Appendix A.4.2. The experimental results for PowerInfer are taken from the original paper. We observed that both CoreInfer and PowerInfer achieved near-lossless performance on these four tasks. However, PowerInfer requires additional predictors for training and inference, which increases the overall cost. For example, for OPT-30b, PowerInfer requires approximately 10.45GB for the predictors, which accounts for about one-sixth of the total model size.
| Model | Method | PIQA | Wino | RTE | COPA | Avg | Predictor-free | Predictor Mem |
|---|---|---|---|---|---|---|---|---|
| Opt-6.7b | Original | 76.28 | 65.19 | 55.23 | 81.00 | 69.43 | - | - |
| | PowerInfer | 75.67 | 65.51 | 55.96 | 81.00 | 69.53 | ✗ | 3.36 GB |
| | CoreInfer | 76.27 | 65.27 | 55.23 | 81.00 | 69.44 | ✓ | 0 GB |
| Opt-13b | Original | 76.01 | 64.96 | 58.12 | 85.00 | 71.02 | - | - |
| | PowerInfer | 76.28 | 65.98 | 56.32 | 84.00 | 70.65 | ✗ | 4.58 GB |
| | CoreInfer | 76.17 | 65.35 | 57.76 | 85.00 | 71.07 | ✓ | 0 GB |
| Opt-30b | Original | 77.58 | 68.82 | 58.40 | 82.00 | 71.69 | - | - |
| | PowerInfer | 77.48 | 67.56 | 59.93 | 82.00 | 71.53 | ✗ | 10.45 GB |
| | CoreInfer | 77.58 | 68.12 | 58.40 | 82.00 | 71.53 | ✓ | 0 GB |
Q3. Clarification of experimental details.
A3. Calculation of Semantic Similarity. In Fig. 3, the semantic similarity presented is calculated using Sentence-BERT. Specifically, for two sentences, we encode each sentence using Sentence-BERT to obtain their embeddings, and then calculate the cosine similarity between these embeddings as their semantic similarity.
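A minimal sketch of this computation is shown below; the specific Sentence-BERT checkpoint named here is only illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative checkpoint; any Sentence-BERT encoder works the same way.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(sent_a: str, sent_b: str) -> float:
    # Encode both sentences and return the cosine similarity of their embeddings.
    emb_a, emb_b = encoder.encode([sent_a, sent_b], convert_to_tensor=True)
    return util.cos_sim(emb_a, emb_b).item()

print(semantic_similarity("The cat sat on the mat.", "A cat is sitting on a rug."))
```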
Determining Input Stability. In our experiments, we determined stability based on the task type: for short prompt zero-shot QA and translation tasks, we used similarity-guided prediction, while for other tasks, we used stability-guided prediction. In practical usage, we determine the input stability based on the similarity of the core neurons activated by the last few tokens of the input. If there is a high overlap, the input is considered stable. As shown in Fig. 3(c), core neurons remain almost unchanged when stability is achieved.
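A minimal sketch of this stability check is shown below (the per-token selection rule and the overlap threshold are illustrative):

```python
import torch

def token_core_neurons(token_acts: torch.Tensor, beta: float = 0.2) -> set:
    """token_acts: [ffn_dim] activations of a single token in one layer."""
    k = max(1, int(beta * token_acts.numel()))
    return set(torch.topk(token_acts.abs(), k).indices.tolist())

def input_is_stable(last_token_acts: torch.Tensor, overlap_thresh: float = 0.9) -> bool:
    """last_token_acts: [num_last_tokens, ffn_dim] activations of the last few input tokens."""
    sets = [token_core_neurons(t) for t in last_token_acts]
    ref = sets[-1]
    overlaps = [len(s & ref) / max(1, len(ref)) for s in sets[:-1]]
    # High overlap across the last tokens -> the input is considered stable.
    return not overlaps or min(overlaps) >= overlap_thresh
```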
We have added additional clarifications to these two sections in the paper to ensure clearer expression.
Once again, we would like to express our sincere gratitude for your valuable suggestions and insightful comments! It was a pleasure discussing our work with you. If you have any other concerns or questions, do not hesitate to let us know. We look forward to your response and are more than happy to address any additional questions you may have.
Best regards,
Authors
Dear Reviewer,
We would like to sincerely thank you for taking the time to review our paper and provide valuable feedback. We are thankful to have the chance to address your questions and concerns.
Concern 1. The general applicability of the proposed method.
Response 1. Thank you so much for raising this concern and providing your valuable suggestions. We appreciate this opportunity to provide more explanations and address your concerns:
1. We think that the techniques commonly used in real-world LLM applications, such as prefix prompting and input preprocessing, would make user inputs more informative and semantically stable, thereby improving the effectiveness of our method. In practice, large-scale LLM-based systems like ChatGPT often add specific prefix prompts and preprocess user inputs to guide the model in generating more aligned responses [1, 2]. These prefix prompts and preprocessing steps are usually carefully designed to provide information such as context, persona, and requirements, as well as to correct grammatical errors and eliminate semantic redundancies. These techniques make the input semantics more stable and informative. In such cases, CoreInfer can more effectively predict the activated neurons. Thus, we think that the stability of core neurons identified by CoreInfer is consistent with the effectiveness of these widely used techniques: the gradual stabilization of core neurons explains why adding prefix prompts and preprocessing inputs helps clarify semantics, which in turn allows CoreInfer to work more effectively in real-world scenarios.
2. We further explore CoreInfer in challenging scenarios to verify its applicability in real-world applications or more challenging questions. To further explore and validate CoreInfer's performance in challenging scenarios, we added experimental results on the LongBench dataset, as shown in Appendix A.4.1 and Table 5. We found that CoreInfer performed excellently across key long-text application scenarios, including single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. This demonstrates that CoreInfer is capable of handling challenging and complex real-world tasks. We provide the experimental results in part 3.
3. In real-world deployment scenarios, we can also implement additional strategies to manage extreme inputs. For instance, we can restart the process or enhance monitoring if significant semantic shifts are detected. To enhance the robustness of our method when applied in real-world settings, we can consider adding extra components or checks to handle extreme input scenarios. For instance, we could adopt a monitor component to track semantic changes and recompute core neurons if there are significant changes in semantics. These approaches can make our method more feasible in engineering applications. In the revised version, we added a discussion on this topic in Appendix A.1 under future work. Thank you for raising this concern.
4. In practical scenarios, certain user habits could also contribute to stabilizing input semantics. For example, users tend to ask questions that are closely related within a single conversation. During the daily use of LLMs, users typically provide context and then ask specific questions based on that context to ensure that LLMs can effectively address their queries. More informative and highly relevant inputs make it more likely for the model to provide satisfactory answers. In practice, this usage habit also leads to more stable input semantics, making CoreInfer more effective.
Reference:
[1] Achiam, Josh, et al. "Gpt-4 technical report." arXiv preprint arXiv:2303.08774 (2023).
[2] Brown, Tom B. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
Concern 2. Too strong assumption of similarity-guided prediction.
Response 2. We greatly appreciate you for your valuable suggestions and comments. We appreciate this opportunity to provide more explanations and address your concerns:
- Activating the same set of neurons is not our assumption, but a practical solution. We apologize for the misunderstanding, and we are glad to address your concerns. In fact, we did not assume that similar topics always activate the same set of core neurons. Our finding is that similar topics tend to activate similar sets of core neurons. As shown in Table 1, the correlation coefficient between semantic similarity and core neuron similarity on the Llama 3.1-8b model is above 0.6. Therefore, using the same set of core neurons through clustering is just a pragmatic approximation with good performance. We also verify the effectiveness of this approximation method. For example, in Section 5.2, we demonstrated that similarity-guided prediction improves the model's generative ability in common short-prompt tasks, such as QA and translation. As shown in Table 2, without requiring any MLP predictors, it helps improve the overall generative performance of the model (e.g., increasing the BLEU score on the TruthfulQA task from 7.88 to 9.12 on the Opt6.7b model). Therefore, although this approach is an approximation, we think that this insight is valuable for exploration and utilization, and our experiments have validated its practicality and effectiveness. We are very grateful for your suggestion. Refining this approximation for increased effectiveness is also a very good direction for investigation, and we are committed to continuous improvement in it. Thank you so much!
- Further discussion on the principles behind similarity-guided prediction. In the latest version, we have added a discussion on this question in Appendix A.1. We speculate that this phenomenon may be related to the functional partitioning of neurons, as mentioned in [1, 2], where neurons in different regions tend to be responsible for different tasks and functions. We will further explore the underlying principles of this interesting phenomenon in future work and use it to refine our approach. We appreciate your valuable suggestions.
Reference:
[1] Xiao, Chaojun, et al. "Configurable foundation models: Building llms from a modular perspective." arXiv preprint arXiv:2409.02877 (2024).
[2] Zhang, Zhengyan, et al. "Emergent modularity in pre-trained transformers." arXiv preprint arXiv:2305.18390 (2023).
Q1. More experiments on commonsense reasoning tasks and long-context tasks.
A1. We sincerely appreciate you for this valuable suggestion. We conducted a comprehensive evaluation of all tasks on the LongBench dataset using stability-guided prediction with 20% sparsity on the Llama2-7B-chat-4k model (as LongBench consists of complex and lengthy inputs). We were pleasantly surprised to find that CoreInfer achieved near-lossless performance across all tasks. Notably, CoreInfer led to improved performance on QA tasks, increasing the F1 score from 25.38 to 29.36 on MultiFieldQA-en, and from 9.21 to 12.86 on MultiFieldQA-zh. Additionally, CoreInfer demonstrated excellent performance on code prediction and Chinese text-based QA tasks, despite the fact that code semantics are generally more challenging for humans to interpret.
We have included our complete experimental results in Appendix A.4.1 of the updated version. Once again, we are grateful to the reviewer for suggesting this dataset, which has allowed us to better validate our method's performance with complex inputs.
| Task | Task Type | Eval metric | Avg len | Language | Original | CoreInfer |
|---|---|---|---|---|---|---|
| HotpotQA | Multi-doc QA | F1 | 9151 | EN | 24.31 | 23.72 |
| 2WikiMultihopQA | Multi-doc QA | F1 | 4887 | EN | 31.69 | 30.18 |
| MuSiQue | Multi-doc QA | F1 | 11214 | EN | 7.76 | 6.82 |
| DuReader | Multi-doc QA | Rouge-L | 15768 | ZH | 6.59 | 6.29 |
| MultiFieldQA-en | Single-doc QA | F1 | 4559 | EN | 25.38 | 29.36 |
| MultiFieldQA-zh | Single-doc QA | F1 | 6701 | ZH | 9.21 | 12.86 |
| NarrativeQA | Single-doc QA | F1 | 18409 | EN | 17.78 | 15.71 |
| Qasper | Single-doc QA | F1 | 3619 | EN | 17.75 | 19.87 |
| GovReport | Summarization | Rouge-L | 8734 | EN | 26.95 | 25.06 |
| QMSum | Summarization | Rouge-L | 10614 | EN | 20.88 | 19.57 |
| MultiNews | Summarization | Rouge-L | 2113 | EN | 26.22 | 26.01 |
| VCSUM | Summarization | Rouge-L | 15380 | ZH | 0.16 | 0.17 |
| TriviaQA | Few shot | F1 | 8209 | EN | 83.01 | 78.08 |
| SAMSum | Few shot | Rouge-L | 6258 | EN | 41.24 | 41.53 |
| TREC | Few shot | Accuracy | 5177 | EN | 64.50 | 63.00 |
| LSHT | Few shot | Accuracy | 22337 | ZH | 18.25 | 16.00 |
| PassageRetrieval-en | Synthetic | Accuracy | 9289 | EN | 8.00 | 7.70 |
| PassageCount | Synthetic | Accuracy | 11141 | EN | 2.85 | 2.49 |
| PassageRetrieval-zh | Synthetic | Accuracy | 6745 | ZH | 10.12 | 9.87 |
| LCC | Code | Edit Sim | 1235 | Python/C#/Java | 58.25 | 56.57 |
| RepoBench-P | Code | Edit Sim | 4206 | Python/Java | 52.20 | 50.19 |
Dear Reviewer 55rr,
We hope this message finds you well. We sincerely appreciate your valuable feedback and insightful suggestions on our work! Given the limited time available for author-reviewer discussions, we would be very grateful if you could let us know whether our responses have sufficiently addressed the issues identified in your second review.
You raised four major questions: (1) the general applicability of the proposed method, (2) the assumption of similarity-guided prediction being too strong, (3) the need for more experiments on long-context tasks, and (4) the need to compare with predictor-based methods.
In our response, we have provided detailed explanations and clarifications regarding your first two concerns. Additionally, we have verified the performance of our method on 21 tasks from the LongBench dataset you suggested, and we have compared our method's performance with other predictor-based methods.
We deeply appreciate the significant time and effort you have devoted to reviewing our work and are grateful for your additional insights. Your comments have been immensely helpful in refining our paper.
Thank you once again for your valuable perspectives. We look forward to any further guidance you may have.
Sincerely,
The Authors
Thank you to the authors for preparing the experiments and explanations. Although some of my questions have been addressed, my major concern remains: given the complex nature of real-world languages, it is difficult to assume that the input context will have a stable semantic meaning. Consequently, it is also challenging to accurately assign each input context to a single pre-clustered topic. This issue becomes even more complex in the case of multi-round conversations, making this assumption and the proposed method harder to generalize.
The authors justified that prefix prompts could enhance semantic stability; however, I did not understand the rationale behind this or how these prefix prompts can make an arbitrary user-provided context semantically more stable. It would be helpful if the authors could elaborate on this point.
Additionally, regarding the results on LongBench in Q1, 20% sparsity appears somewhat insufficient, as static structured pruning for LLMs (e.g., FLAP/Wanda) with 20% sparsity can also maintain comparable performance. Results under higher sparsity levels, along with comparisons to other methods (e.g., static or dynamic pruning methods) under the same budget, are necessary to justify the effectiveness of the proposed method.
Furthermore, as mentioned in Weakness-2, results on more commonsense reasoning tasks (e.g., ARCE, ARCC, Winogrande, and Hellaswag) and MMLU are required to demonstrate the usefulness of the method for sentences with diverse semantics. In particular, results on MMLU are highly desirable, as it includes contexts with diverse factual knowledge.
Dear Reviewer,
We appreciate the time and effort you have taken to review our work. We are glad that our previous response addressed some of your concerns, and we hope to resolve the remaining issues through the following responses.
1. Concern about stability.
A1. Based on your feedback, we recognize that your primary concern is stability. Since you expressed the worry that, in real-world multi-turn settings, both semantic similarity and core neurons might become unstable, thereby affecting the effectiveness of CoreInfer, we address this concern from two perspectives. First, we validate that both semantics and core neurons exhibit stable trends even in real multi-turn conversations. Second, we confirm that CoreInfer can be even more effective when commonly used prefix prompts are employed in real-world scenarios.
- Our method is effective in real multi-turn conversations. To verify this, we first randomly selected 5k real multi-turn conversations from the ShareGPT-Chinese-English-90k [1] dataset, a high-quality Chinese-English parallel bilingual human-machine multi-turn QA dataset covering user questions in real and complex scenarios. We validated that both semantics and core neurons exhibit a clear trend of increasing stability as the number of turns increases. For semantic stability, we calculated the cosine similarity between the semantic embeddings of the current turn and those of the previous turns. For core neuron stability, we calculated the overlap ratio between the core neurons of the current turn and those of the previous turns. As shown in the following table, both semantic stability and core neuron stability progressively increase with each additional turn. For example, by the third turn, both the semantics and the core neurons are almost stable. This indicates that CoreInfer is capable of making effective predictions of core neurons in real multi-turn conversation scenarios.
| Conversation turn | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| Semantic | 0.753 | 0.952 | 0.999 | 0.999 | 0.999 |
| Core Neuron | 76.9% | 88.5% | 98.3% | 100% | 100% |
Furthermore, we conducted experiments on the multi-turn conversation dataset MT-Eval [2] across four tasks. The table below shows CoreInfer's scores across the different tasks, highlighting that CoreInfer remains effective in the multi-turn setting.
| Model | Recollection | Expansion | Refinement | Follow-up | Avg. |
|---|---|---|---|---|---|
| original | 3.86 | 5.89 | 6.20 | 8.53 | 6.11 |
| CoreInfer | 3.83 | 5.40 | 6.00 | 8.41 | 5.91 |
Therefore, we believe that for normal inputs and tasks encountered in daily life, even for multi-turn setting, CoreInfer consistently performs well. For example, in Table 2 of our paper, we demonstrate the effectiveness of our method in common tasks such as question answering, translation, and summarization.
- In commonly used LLMs like ChatGPT, prefix prompts usually contain additional information or valid instructions, which can make the semantics of inputs clearer and benefit most user-provided inputs. For instance, the following is an example of a prefix prompt [3]: "You are an AI assistant. The user is currently having a conversation with you. Please do your best to answer the user's questions and provide helpful and accurate information." It clearly incorporates the necessary role and responsibility information into the input.
To further verify this, we tested and compared the semantic stability and core neuron stability of inputs with and without a prefix prompt. For semantic stability, we calculated the cosine similarity between the semantic embeddings of the output and the input. For core neuron stability, we measured the overlap ratio between the core neurons of the output and the input. For each dataset, we report the average values. As shown below, adding the prefix prompt significantly increased both the semantic similarity and the core neuron similarity, indicating that the inputs became much more stable when prefix prompts were added.
| | Truthful_qa | Trivia_qa | wmt16-de-en | wmt16-ro-en |
|---|---|---|---|---|
| Semantic (w/o prompt) | 0.887 | 0.854 | 0.772 | 0.759 |
| Semantic (w/ prompt) | 0.998 | 0.996 | 0.999 | 0.997 |
| | Truthful_qa | Trivia_qa | wmt16-de-en | wmt16-ro-en |
|---|---|---|---|---|
| Core Neuron (w/o prompt) | 63.7% | 51.2% | 62.4% | 52.1% |
| Core Neuron (w/ prompt) | 94.5% | 97.8% | 95.1% | 91.5% |
We hope that the above responses address your concerns regarding stability.
2. Clarification for 20% sparse.
A2. We are afraid that there might be some misunderstanding. The 20% sparsity mentioned in our previous response means that only 20% of the neurons are retained; that is, 80% of the MLP weights are pruned, resulting in a pruning ratio of 80%.
CoreInfer is a method for dynamic activation sparsity inference. Unlike static pruning, dynamic activation sparsity inference such as PowerInfer [4], LLM in Flash [5], and Deja Vu [6] selectively activates different neurons based on different inputs. Since LLMs inherently exhibit high activation sparsity for individual inputs, it is possible to achieve good performance under high sparsity conditions.
Furthermore, the effectiveness and hardware deployment friendliness of dynamic activation sparsity inference methods have been thoroughly validated in previous works such as PowerInfer [4], LLM in Flash [5], and Deja Vu [6]. Thanks to the flexibility of pruning, dynamic activation sparsity inference methods typically offer significant performance improvements over static pruning and have recently garnered increasing attention. It is also noteworthy that previous dynamic activation sparsity inference methods relied on MLP-based predictors, whereas CoreInfer is the first method to achieve approximate lossless compression under high sparsity without the need for training predictors.
Reference:
[1] https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k
[2] Kwan, Wai-Chung, et al. "MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models." arXiv preprint arXiv:2401.16745 (2024)
[3] Liu, Haotian, et al. "Improved baselines with visual instruction tuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[4] Song, Yixin, et al. "Powerinfer: Fast large language model serving with a consumer-grade gpu." Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles. 2024.
[5] Alizadeh, Keivan, et al. "Llm in a flash: Efficient large language model inference with limited memory." arXiv preprint arXiv:2312.11514 (2023).
[6] Liu, Zichang, et al. "Deja vu: Contextual sparsity for efficient llms at inference time." International Conference on Machine Learning. PMLR, 2023.
3. More experimental results on the commonsense reasoning dataset and MMLU
A3. We appreciate your suggestion to include additional commonsense reasoning datasets. To validate our method's performance on these datasets, we evaluated CoreInfer under the condition where only 20% of the neurons were activated (pruning ratio = 80%), both on the suggested commonsense reasoning tasks and across various domains of MMLU. The results, presented in the tables below, demonstrate that CoreInfer achieved nearly lossless performance across different datasets.
| Method | PIQA | HellaSwag | Wino | Arc-e | Arc-c |
|---|---|---|---|---|---|
| Original | 78.7 | 76.2 | 70.1 | 75.3 | 41.9 |
| CoreInfer | 76.0 | 75.3 | 70.1 | 65.7 | 36.3 |
| Task Name | Original | CoreInfer |
|---|---|---|
| mmlu_average | 0.3017 | 0.2885 |
| mmlu_humanities | 0.2971 | 0.2555 |
| mmlu_formal_logic | 0.2937 | 0.2857 |
| mmlu_high_school_european_history | 0.3394 | 0.2964 |
| mmlu_high_school_us_history | 0.3529 | 0.3196 |
| mmlu_high_school_world_history | 0.2996 | 0.2700 |
| mmlu_international_law | 0.4545 | 0.3862 |
| mmlu_jurisprudence | 0.2963 | 0.3241 |
| mmlu_logical_fallacies | 0.3252 | 0.2954 |
| mmlu_moral_disputes | 0.3353 | 0.3128 |
| mmlu_moral_scenarios | 0.2324 | 0.2380 |
| mmlu_philosophy | 0.3794 | 0.3186 |
| mmlu_prehistory | 0.3642 | 0.3932 |
| mmlu_professional_law | 0.2588 | 0.2458 |
| mmlu_world_religions | 0.3801 | 0.3801 |
| mmlu_other | 0.3334 | 0.3136 |
| mmlu_business_ethics | 0.3600 | 0.3200 |
| mmlu_clinical_knowledge | 0.3585 | 0.3226 |
| mmlu_college_medicine | 0.2890 | 0.2486 |
| mmlu_global_facts | 0.3100 | 0.2600 |
| mmlu_human_aging | 0.3722 | 0.3460 |
| mmlu_management | 0.2718 | 0.2542 |
| mmlu_marketing | 0.3547 | 0.3778 |
| mmlu_medical_genetics | 0.4200 | 0.4400 |
| mmlu_miscellaneous | 0.3934 | 0.3950 |
| mmlu_nutrition | 0.2974 | 0.2255 |
| mmlu_professional_accounting | 0.2589 | 0.3315 |
| mmlu_professional_medicine | 0.2169 | 0.2096 |
| mmlu_virology | 0.3434 | 0.3349 |
| mmlu_social_sciences | 0.2941 | 0.2863 |
| mmlu_econometrics | 0.1930 | 0.2018 |
| mmlu_high_school_geography | 0.2475 | 0.2172 |
| mmlu_high_school_government_and_politics | 0.3005 | 0.2639 |
| mmlu_high_school_macroeconomics | 0.2821 | 0.2359 |
| mmlu_high_school_microeconomics | 0.2563 | 0.2469 |
| mmlu_high_school_psychology | 0.2899 | 0.2583 |
| mmlu_human_sexuality | 0.2901 | 0.3130 |
| mmlu_professional_psychology | 0.3186 | 0.2712 |
| mmlu_public_relations | 0.3818 | 0.3364 |
| mmlu_security_studies | 0.2367 | 0.2367 |
| mmlu_sociology | 0.3731 | 0.3687 |
| mmlu_us_foreign_policy | 0.3900 | 0.3300 |
| mmlu_stem | 0.2651 | 0.2528 |
| mmlu_abstract_algebra | 0.3000 | 0.3000 |
| mmlu_anatomy | 0.3704 | 0.3370 |
| mmlu_astronomy | 0.3158 | 0.2303 |
| mmlu_college_biology | 0.3125 | 0.3194 |
| mmlu_college_chemistry | 0.2300 | 0.2700 |
| mmlu_college_computer_science | 0.2100 | 0.2600 |
| mmlu_college_mathematics | 0.2800 | 0.2500 |
| mmlu_college_physics | 0.1569 | 0.2843 |
| mmlu_computer_security | 0.3100 | 0.3100 |
| mmlu_conceptual_physics | 0.2851 | 0.2553 |
| mmlu_electrical_engineering | 0.2690 | 0.2966 |
| mmlu_elementary_mathematics | 0.2513 | 0.2275 |
| mmlu_high_school_biology | 0.2742 | 0.2935 |
| mmlu_high_school_chemistry | 0.2660 | 0.2266 |
| mmlu_high_school_computer_science | 0.2600 | 0.2800 |
| mmlu_high_school_mathematics | 0.2185 | 0.2407 |
| mmlu_high_school_physics | 0.2649 | 0.2053 |
| mmlu_high_school_statistics | 0.1991 | 0.1898 |
| mmlu_machine_learning | 0.3214 | 0.3036 |
We would like to express our sincere gratitude for the time and effort you have dedicated to reviewing our work. Your invaluable guidance and suggestions have enabled us to explore our method more deeply, thereby enriching and enhancing this study overall. All of your comments have made our work more comprehensive, and we are truly appreciative. We will incorporate all the additional experiments and discussions into our revision. Thank you once again for your thorough review and feedback.
Sincerely,
The Authors
Dear Reviewers,
We sincerely appreciate all of you for your valuable comments and suggestions. We are grateful for the opportunity to address some concerns and improve the manuscript, with changes highlighted in blue (text) and orange (figures and tables). Specifically, we made the following major changes:
1. Testing Performance on Additional Long-Context Datasets. Based on the reviewers' suggestions, we evaluated our method on more challenging tasks from the LongBench dataset using the Llama2-7B-chat-4k model. We tested 21 tasks covering key long-text applications such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. We found that CoreInfer achieved near-lossless performance across different tasks while activating only 20% of neurons. Notably, CoreInfer led to improved performance on QA tasks, increasing the F1 score from 25.38 to 29.36 on MultiFieldQA-en, and from 9.21 to 12.86 on MultiFieldQA-zh. Detailed results can be found in Appendix A.4.1.
2. Verifying Applicability to Diverse Inputs (Including Chinese and Code Programming). Per the reviewer’s suggestions, we explored the behavior of core neurons when the input was in different languages. We found that core neurons tended to stabilize as the input length increased, regardless of the language. However, the token length required to reach stability varied depending on the language, as different languages require different numbers of tokens to express semantics. We verified that CoreInfer performs well on Chinese text QA and code prediction (Java/C#/Python) tasks. For example, on the Lcc and RepoBench-P datasets, CoreInfer experienced less than 3% performance loss even with a 20% neuron sparsity level. Detailed exploration and experimental results can be found in Appendices A.2.3 and A.4.1.
3. Adding Comparisons with Predictor-Based Methods. Based on the reviewers' suggestions, we added a comparison of CoreInfer's performance with the state-of-the-art predictor-based method PowerInfer on three different models and four classic commonsense reasoning tasks. We found that both CoreInfer and PowerInfer achieved near-lossless performance across all models and tasks. However, the size of the predictor required by PowerInfer increased with the model size, leading to higher training and inference costs. For example, for OPT-30b, PowerInfer requires approximately 10.45GB for the predictors, which accounts for about one-sixth of the total model size. Detailed results are presented in Appendix A.4.2.
4. Verifying Adaptability to Quantization. As suggested by the reviewers, we combined CoreInfer with 4-bit quantization (including NF4 and FP4). We found that CoreInfer achieved lossless sparse inference even with applying 4-bit quantization, as the activation of neurons was not affected by quantization. After combining CoreInfer with the FP4 and NF4 quantization formats, the average accuracy improved from 69.98 and 68.83 to 69.99 and 68.89, respectively. Therefore, CoreInfer can be combined with state-of-the-art quantization methods for further acceleration. Detailed results are presented in Appendix A.4.3.
We sincerely appreciate all of your suggestions, which have led us to a deeper thinking. We welcome any further feedback from you and are more than willing to address any questions or concerns you may have.
Best regards,
Authors
Summary: The paper introduces CoreInfer, a method for accelerating LLM inference through semantics-inspired adaptive sparse activation. CoreInfer predicts core neurons critical for sentence semantics during the pre-filling stage and removes other neurons during subsequent inference, thus improving efficiency without relying on on-the-fly predictors. Experimental results across various tasks validate the effectiveness of the strategy.
Strengths:
- Intriguing observations regarding sentence semantics: The observations regarding the stability and similarity of core neurons with respect to sentence semantics are insightful.
- Boosted efficiency: CoreInfer can achieve non-trivial inference speedups on real devices while maintaining accuracy on most tasks.
Weaknesses:
- Too strong assumptions: The method assumes stable semantics and neuron activation patterns across the entire input context, which may not generalize to complex inputs.
- Insufficient experiments: Limited benchmarks against state-of-the-art static and dynamic pruning methods (e.g., Deja Vu) reduce confidence in its comparative advantages.
- Accuracy drops on key tasks: CoreInfer showed notable accuracy declines on tasks like ARC-E and MMLU, raising concerns about its robustness under high sparsity.
Reasons for the decision:
While CoreInfer presents a novel approach to sparse activation and demonstrates efficiency gains, its assumptions about semantic stability and the lack of sufficient comparisons with key baselines limit its generalizability. Additionally, significant accuracy drops on critical tasks under high sparsity weaken its overall contribution.
Although only Reviewer 55rr provided a negative rating initially, during the reviewer discussion stage, they further emphasized the above concerns, which were agreed upon by Reviewer 7ifw. Both reviewers leaned towards rejecting the submission, and the other two reviewers did not strongly advocate for the work. Therefore, my overall recommendation is to reject.
Additional Comments from the Reviewer Discussion
During the rebuttal period and the reviewer discussion stage, the most critical concerns were primarily raised by Reviewer 55rr, focusing on the rationale of the proposed method, evaluation completeness, and generalizability. These concerns convinced Reviewer 7ifw during the reviewer discussion stage, leading both reviewers to lean toward rejecting the paper. Below, I summarize these critical concerns, which significantly influenced the decision on this submission, along with the authors' responses:
1. Too strong assumptions of semantic stability and core neurons activation
Reviewer Concern: Reviewer 55rr questioned the assumption that each input corresponds to a single stable semantic topic and that the same topic always activates the same set of core neurons. This was considered unrealistic for real-world applications. The reviewer also argued that the proposed method could degenerate into static pruning rather than truly dynamic sparsity.
Author Response: The authors justified their approach by arguing that prepending a consistent system prompt helps stabilize semantics. They provided results on semantic and neuron stability during multi-turn conversations, claiming their method works in these scenarios.
2. Insufficient experimental validation
Reviewer Concern: Despite adding more commonsense reasoning tasks and MMLU benchmarks, the authors' method showed significant accuracy drops, e.g., -9.6% on ARC-E, -5.6% on ARC-C, and -2% on MMLU (only +3% above random guessing).
Author Response: The authors emphasized that their method maintains reasonable performance under high sparsity but only provided the results of their method without providing comparative results against other baselines.
3. Lack of comparisons with key baselines
Reviewer Concern: The method was not benchmarked against critical baselines such as Deja Vu, which use more reliable input-adaptive sparsity strategies. The absence of these comparisons limits the paper's claims of superiority.
Author Response: The authors added results on additional tasks but did not include direct comparisons with the mentioned baselines. They argued that their approach avoids training additional predictors, a potential advantage over baselines like Deja Vu.
The discussion highlighted unresolved concerns about the method's assumptions, lack of baseline comparisons, and insufficient evaluation. While the authors added additional results and clarifications, I agree with Reviewers 55rr and 7ifw that these were insufficient to address the aforementioned core weaknesses. As such, these concerns weighed heavily in the decision to recommend rejection.
Reject