Mixture of In-Context Experts Enhance LLMs' Long Context Awareness
We propose MoICE, which enhances LLMs' context awareness. MoICE introduces a router in each attention head within LLMs, which dynamically directs the head's attention to contextual positions crucial for completing the head's function well.
Abstract
Reviews and Discussion
Large language models (LLMs) have shown promise in various NLP tasks but often fall short in tasks requiring deep contextual understanding, such as coherent long-text generation and Retrieval-Augmented Generation (RAG). Challenges like the "lost-in-the-middle" phenomenon, where LLMs struggle with information in the middle of the context, and limitations of the widely used Rotary Position Embedding (RoPE) significantly impact performance. This work introduces the Mixture of In-Context Experts (MoICE), which dynamically selects optimal RoPE angles within each attention head to direct the head's attention to specific contextual positions. Experiments are conducted on open-source models such as Mistral by freezing LLM parameters and exclusively updating the routers for only a few steps.
Strengths
The paper is very well written and easy to follow. The claims are mostly well-substantiated, with extensive experimentation supporting them. Background information is provided as needed without overwhelming the reader. The paper provides details on hyperparameters and compute to ensure reproducibility. The ablations, especially the visualization of dynamic routing states, are very interesting.
Weaknesses
- It seems that the main weakness of the paper is in the evaluation section. Firstly, in Table 1, the gains in performance from using MoICE are minimal. For instance, the gains on the majority of the datasets are less than 1%. This raises the question of the actual significance and practical implications of this approach. It would be great if the authors could report the mean and standard deviation of their results.
- MoICE seems promising for endowing LLMs with improved context awareness even at pretraining. While all experiments are currently conducted using pretrained LLMs, it would be interesting to see whether one could pretrain LLMs with MoICE (maybe 2B size) on datasets such as C4 and then test on standard benchmarks.
Questions
No specific questions. I would appreciate a response with respect to the weakness stated above.
Limitations
N/A
We sincerely appreciate your valuable feedback and suggestions! We hope our response addresses your concerns.
1. Mean and standard deviation of Table 1
Thanks for your valuable comment. We report the t-test results for MoICE in Table 1: the p-values are both less than 0.02, which demonstrates the significance of our method's improvement. In addition, we set different random seeds and repeated the L-Eval experiments 5 times; the means and standard deviations of MoICE are reported below.
| Method | Coursera | QuALITY | TOEFL | SFiction | Average | wins | ties | win-rate% |
|---|---|---|---|---|---|---|---|---|
| Llama-2-Chat | 36.77 ± 0.00 | 38.12 ± 0.00 | 55.02 ± 0.00 | 60.16 ± 0.00 | 47.52 ± 0.00 | 68.00 ± 0.00 | 117.00 ± 0.00 | 34.94 ± 0.00 |
| + MoICE | 39.65 ± 0.32 | 41.88 ± 0.27 | 56.28 ± 0.21 | 64.84 ± 0.00 | 50.66 ± 0.05 | 89.00 ± 1.00 | 117.20 ± 1.48 | 40.77 ± 0.20 |
| Method | Coursera | QuALITY | TOEFL | SFiction | Average | wins | ties | win-rate% |
|---|---|---|---|---|---|---|---|---|
| Mistral-7B-Ins. | 45.20 ± 0.00 | 44.06 ± 0.00 | 62.08 ± 0.00 | 61.72 ± 0.00 | 53.27 ± 0.00 | 71.00 ± 0.00 | 105.00 ± 0.00 | 34.11 ± 0.00 |
| + MoICE | 48.08 ± 0.24 | 46.73 ± 0.27 | 65.35 ± 0.81 | 62.18 ± 1.19 | 55.59 ± 0.16 | 85.00 ± 1.10 | 115.20 ± 2.05 | 39.39 ± 0.21 |
All methods in our paper use greedy decoding, which is deterministic; the randomness in MoICE's results comes from the initialization of the MoICE router during training.
2. Applying MoICE to the pre-training stage (Weakness 2)
Thanks for your valuable suggestions. Due to limited time and computing resources, it is not feasible to train such a large model. Therefore, we pre-train a small model from scratch and observe the effectiveness of MoICE, which demonstrates the potential of scaling up our method.
Specifically, we train a language model with a Llama architecture of 49M parameters, with and without MoICE respectively. The model has 4 layers, 6 heads per layer, and a hidden dimension of 512. We train the model on the OpenWebText dataset [1].
We use 4 A800-80G GPUs to train for 600k steps with a context window of 512, which takes 96 hours. (Given the limited time, this is the most extensive scenario we are able to test. We appreciate your understanding regarding these limitations.)
We measure the model's context awareness on the Key-Value Retrieval task [2]. The prompt for key-value retrieval is shown below:
"eb098018-bdb5": "970cbed8-3665",
"0a9d957f-2256": "be09fd63-4dfa",
"e2b49af9-d0e3": "c5ed6251-085d",
"8ece1451-05e1": "2d5932f7-acd8",
"eb2f4a8d-e0b7": "e0acbc2c-d478",
"0c8c0695-dd3c": "086d71cb-35c0",
"79a1c002-4ba6": "e69f5f62-250e",
"b0c1c9df-c13f": "3ce6b12e-6223",
"ee17cc77-6342": "41c410e1-776c",
"483f6a4d-9aa4": "3711356c-6df1",
"ee17cc77-6342": "41c
We use 10 key-value pairs as in-context examples, one of which contains the query key. We insert the query key-value pair at different positions among the examples (in the prompt example above, the query pair is at the 9th position). The model's task is to find the value corresponding to the query key and output it, which evaluates its context awareness.
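For reference, the sketch below shows one way such prompts could be constructed programmatically; the key/value format, helper names, and evaluation criterion are illustrative assumptions, not the exact setup used in our experiments.

```python
# Hypothetical sketch of how such a key-value retrieval prompt could be built;
# the key/value format and helper names are illustrative, not our exact code.
import random
import uuid

def build_kv_prompt(num_pairs: int = 10, query_position: int = 9, seed: int = 0):
    """Return a JSON-like key-value prompt and the full value the model should output.

    The pair at `query_position` (1-indexed) is the one the model must retrieve.
    """
    rng = random.Random(seed)
    # Shortened UUID-style keys and values, mimicking the example above.
    pairs = [(str(uuid.UUID(int=rng.getrandbits(128)))[:13],
              str(uuid.UUID(int=rng.getrandbits(128)))[:13]) for _ in range(num_pairs)]
    query_key, query_value = pairs[query_position - 1]

    lines = [f'"{k}": "{v}",' for k, v in pairs]
    # The query key is repeated at the end with only a short value prefix,
    # so the model must copy the matching value from earlier in the context.
    prompt = "\n".join(lines) + f'\n"{query_key}": "{query_value[:3]}'
    return prompt, query_value

prompt, answer = build_kv_prompt()
# Accuracy: the fraction of prompts whose completion reproduces `answer`.
```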
The performance of the pre-trained Llama model without and with MoICE is shown below:
| Query position | 1 | 3 | 5 | 7 | 9 |
|---|---|---|---|---|---|
| Baseline | 0.476 | 0.324 | 0.328 | 0.344 | 0.502 |
| + MoICE | 0.652 | 0.762 | 0.634 | 0.622 | 0.814 |
From the results, we can see that our method significantly increases the context awareness of the pre-trained language model.
Once again, we appreciate your thoughtful review and feedback on our paper. Please let us know if you have any additional questions or suggestions.
References
[1] Peterson J, Meylan S, Bourgin D. Open clone of openai’s unreleased webtext dataset scraper[J]. 2019.
[2] Liu N F, Lin K, Hewitt J, et al. Lost in the middle: How language models use long contexts[J]. Transactions of the Association for Computational Linguistics, 2024, 12: 157-173.
I appreciate the authors' detailed rebuttal, which addresses many of the weaknesses identified and questions raised. I emphasize that all additional experiments and clarifications made during this rebuttal should be included in any revised manuscript to improve the clarity of the work. Given my already positive review, I maintain my score.
We thank you for your recognition and active engagement. We will definitely include the additional experimental results in a future revision as you suggested.
This paper presents an approach, Mixture of In-Context Experts (MoICE) for enhancing the long-context awareness of LLMs with RoPE. Specifically, the authors use a router to dynamically select multiple RoPE angles for each attention head and token. They also use a lightweight router-only training strategy and freeze LLM parameters to only update the routers. Empirical evaluation shows that MoICE outperforms existing methods on long context understanding and generation tasks while maintaining efficiency.
Strengths
- The proposed MoICE approach deals with the challenge of limited context awareness in LLMs. The idea of dynamically selecting RoPE angles is novel and effectively addresses limitations of the original RoPE technique.
- The authors conduct extensive experiments across multiple tasks and datasets with LLaMA2-7B and Mistral-7B, demonstrating comparable performance of MoICE with competitive baselines while maintaining inference efficiency.
- The paper also includes detailed ablation studies and analyses of different hyperparameters: the total number of experts N, the number of selected experts K, and different training data, showing that the method is robust.
Weaknesses
- There is a lack of open-ended tasks in the experiments. The authors use a very small open-ended task set which contains only 181 questions from 29 long documents. This is far from enough to show that the method works well on general open-ended tasks. They should run more experiments on open-ended tasks, such as TriviaQA.
- The proposed approach slightly modifies the language model architecture by adding a router layer and trains it for long-context awareness. In fact, it would be more natural to apply this technique at the pre-training stage to enhance the model's original ability to understand long contexts. The authors should discuss this more and, if possible, show whether their method generalizes to pre-training (even on smaller models, such as GPT-2).
Questions
Please refer to the section above.
Limitations
Yes, the authors have discussed limitations.
We sincerely appreciate your valuable feedback and suggestions! We hope our response addresses your concerns.
1. Performance on general open-ended tasks (Weakness 1)
Thanks for your valuable suggestions. We have added an additional benchmark, LongBench [1], a bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models. We evaluate 16 tasks across 5 scenarios and report the average value for each scenario. All experiments are conducted on one A800-80G GPU. TriviaQA is included in the few-shot learning scenario, and we report its results below.
| Method | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Tasks | Average |
|---|---|---|---|---|---|---|
| Llama2-7B-chat | 25.54 | 18.47 | 23.37 | 51.78 | 3.94 | 29.85 |
| + PI | 23.42 | 23.73 | 25.34 | 51.63 | 7.63 | 31.30 |
| + NTK | 24.73 | 23.67 | 25.41 | 51.97 | 8.33 | 31.58 |
| + Ms-PoE | 23.68 | 24.59 | 25.33 | 51.66 | 8.04 | 31.75 |
| + AB | 27.06 | 22.94 | 25.52 | 52.84 | 8.62 | 32.21 |
| + MoICE | 26.31 | 23.70 | 25.60 | 52.34 | 9.71 | 32.25 |
| Method | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Tasks | Average |
|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-8k | 27.20 | 19.89 | 24.22 | 52.41 | 5.06 | 25.76 |
| + PI | 30.94 | 24.94 | 26.24 | 49.34 | 9.35 | 28.16 |
| + NTK | 30.46 | 21.21 | 23.89 | 52.41 | 8.44 | 27.28 |
| + Ms-PoE | 27.90 | 17.89 | 20.28 | 48.59 | 8.95 | 24.72 |
| + AB | 29.81 | 21.95 | 25.58 | 54.42 | 7.89 | 27.93 |
| + MoICE | 31.09 | 22.98 | 26.69 | 55.76 | 8.02 | 28.91 |
| Method | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Tasks | Average |
|---|---|---|---|---|---|---|
| Qwen2-7B-Instruct-32k | 34.66 | 35.91 | 25.77 | 56.89 | 33.83 | 37.41 |
| + PI | 28.28 | 17.08 | 24.60 | 57.51 | 32.67 | 32.03 |
| + NTK | 31.35 | 23.98 | 24.95 | 56.64 | 32.50 | 33.88 |
| + Ms-PoE | OOM | OOM | OOM | OOM | OOM | N/A |
| + AB | OOM | OOM | OOM | OOM | OOM | N/A |
| + MoICE | 39.37 | 37.35 | 25.81 | 57.29 | 34.83 | 38.93 |
| TriviaQA | Llama2-7B-chat | Mistral-7B-Instruct-8k | Qwen2-7B-Instruct-32k |
|---|---|---|---|
| Origin | 84.44 | 85.61 | 86.26 |
| + PI | 84.75 | 84.39 | 85.26 |
| + NTK | 86.16 | 85.61 | 86.56 |
| + Ms-PoE | 85.65 | 84.39 | OOM |
| + AB | 85.82 | 85.9 | OOM |
| + MoICE | 86.01 | 86.44 | 87.14 |
"OOM" indicates that due to the extra memory cost required by Ms-PoE and AB, the inference on the long context failed due to out of memory.
On LLMs with 4k, 8k, and 32k context windows, MoICE consistently improves performance on various language tasks, including general open-ended tasks. We will add these results in the revision.
2. Applying MoICE to the pre-training stage (Weakness 2)
Thanks for your valuable suggestions. Due to limited time and computing resources, it is not feasible to train such a large model. Therefore, we pre-train a small model from scratch and observe the effectiveness of MoICE, which demonstrates the potential of scaling up our method.
Specifically, we train a language model with a Llama architecture of 49M parameters, with and without MoICE respectively. The model has 4 layers, 6 heads per layer, and a hidden dimension of 512. We train the model on the OpenWebText dataset [2].
We use 4 A800-80G GPUs to train for 600k steps with a context window of 512, which takes 96 hours. (Given the limited time, this is the most extensive scenario we are able to test. We appreciate your understanding regarding these limitations.)
We measure the model's context awareness on the Key-Value Retrieval task [3]. The prompt for key-value retrieval is shown below:
"eb098018-bdb5": "970cbed8-3665",
"0a9d957f-2256": "be09fd63-4dfa",
"e2b49af9-d0e3": "c5ed6251-085d",
"8ece1451-05e1": "2d5932f7-acd8",
"eb2f4a8d-e0b7": "e0acbc2c-d478",
"0c8c0695-dd3c": "086d71cb-35c0",
"79a1c002-4ba6": "e69f5f62-250e",
"b0c1c9df-c13f": "3ce6b12e-6223",
"ee17cc77-6342": "41c410e1-776c",
"483f6a4d-9aa4": "3711356c-6df1",
"ee17cc77-6342": "41c
We use 10 key-value pairs as in-context examples, one of which contains the query key. We insert the query key-value pair at different positions among the examples (in the prompt example above, the query pair is at the 9th position). The model's task is to find the value corresponding to the query key and output it, which evaluates its context awareness.
The performance of the pre-trained Llama model without and with MoICE is shown below:
| Query position | 1 | 3 | 5 | 7 | 9 |
|---|---|---|---|---|---|
| Baseline | 0.476 | 0.324 | 0.328 | 0.344 | 0.502 |
| + MoICE | 0.652 | 0.762 | 0.634 | 0.622 | 0.814 |
From the results, we can see that our method significantly increases the context awareness of the pre-trained language model.
Once again, we appreciate your thoughtful review and feedback on our paper. Please let us know if you have any additional questions or suggestions.
References
[1] Bai Y, Lv X, Zhang J, et al. Longbench: A bilingual, multitask benchmark for long context understanding[J]. arXiv preprint arXiv:2308.14508, 2023.
[2] Peterson J, Meylan S, Bourgin D. Open clone of openai’s unreleased webtext dataset scraper[J]. 2019.
[3] Liu N F, Lin K, Hewitt J, et al. Lost in the middle: How language models use long contexts[J]. Transactions of the Association for Computational Linguistics, 2024, 12: 157-173.
The authors' rebuttal has addressed my concerns, and I have raised my score accordingly.
We sincerely appreciate your positive feedback! We will surely add the additional experiments to a future revision.
The paper introduces the "Mixture of In-Context Experts" (MoICE) method to address uneven context awareness in large language models (LLMs) using Rotary Position Embedding (RoPE). The central element of MoICE is a router that selects different RoPE angles. The authors propose a loss function that learns to select RoPE angles for each head based on context information and encourages diverse RoPE angles among attention heads. MoICE is evaluated on two representative models—one with full attention and the other with sliding window attention—to demonstrate its effectiveness in both open-ended and close-ended long context evaluation tasks.
Strengths
- The concept of mixing multiple RoPE angles within each head is innovative.
- MoICE achieves state-of-the-art results on multiple benchmarks.
- The paper includes sanity checks and analyses to elucidate the MoICE mechanism.
Weaknesses
- The auxiliary loss definition (Equations 8–10) appears to be ad hoc.
- The method cannot be adapted to non-RoPE models, such as those using Alibi.
Questions
- How does training data impact MoICE's performance? It appears that MoICE uses additional data to learn its parameters. Even if the base model is frozen, this extra training data could positively affect benchmark performance.
- Why does Table 3 show that MoICE performs better with Llama2 than with Mistral?
- From Tables 4 and 5, should we always choose larger values for N and K? What are the cost implications of using larger N and K?
Limitations
I have not observed any red flag in terms of potential negative societal impact.
We sincerely appreciate your valuable feedback and suggestions! We hope our response addresses your concerns.
1. The auxiliary loss appears to be ad hoc (Weakness 1)
Thanks for your valuable feedback. The auxiliary loss (Eq. 8-10) is a widely adopted practice in MoE systems to address imbalanced routing, i.e., a router that shows too much preference for specific experts [1,2,3,4,5].
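For concreteness, the snippet below sketches a generic Switch-style load-balancing term of this kind; it only illustrates the general practice, and the exact form of Eq. 8-10 in our paper may differ.

```python
# A generic Switch-Transformer-style load-balancing term, shown only for illustration;
# the exact form of Eq. 8-10 in the paper may differ.
import torch

def load_balancing_loss(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts] pre-softmax routing scores."""
    num_experts = router_logits.shape[-1]
    probs = torch.softmax(router_logits, dim=-1)            # soft routing probabilities
    top_idx = probs.topk(top_k, dim=-1).indices             # experts actually selected
    mask = torch.zeros_like(probs).scatter_(-1, top_idx, 1.0)
    f = mask.mean(dim=0)                                    # f_i: fraction of tokens routed to expert i
    P = probs.mean(dim=0)                                   # P_i: mean routing probability of expert i
    # Minimized when load is uniform across experts, penalizing routers
    # that consistently favor a few experts.
    return num_experts * torch.sum(f * P)
```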
In our method, this loss plays a crucial role in alleviating such issues. Without it, the model's performance decreases because the router might consistently choose specific RoPE bases without considering alternatives. To demonstrate the impact of this loss term, we conducted an ablation study on Llama2-7b-Chat and Mistral-7B-Instruct-8k by removing it from Eq. 10. The results, shown below, indicate a decline in performance:
| Llama2-7b-Chat | Coursera | QuALITY | TOEFL | SFiction | Average |
|---|---|---|---|---|---|
| w/o aux loss | 39.83 | 41.58 | 56.13 | 62.50 | 50.01 |
| w/ aux loss | 39.83 | 42.08 | 56.13 | 64.84 | 50.72 |
| Mistral-7B-Instruct-8k | Coursera | QuALITY | TOEFL | SFiction | Average |
|---|---|---|---|---|---|
| w/o aux loss | 47.67 | 46.04 | 64.68 | 58.59 | 54.25 |
| w/ aux loss | 47.82 | 46.53 | 64.68 | 62.50 | 55.38 |
We will include this discussion and the results in the revision. Thank you for your valuable feedback!
2. Adaptation to non-RoPE models (Weakness 2)
Thank you for pointing this out. We discussed this in detail in Appendix B. Specifically, our method mainly focuses on resolving the issues brought by the wave pattern inherent in RoPE. Compared with non-RoPE encodings, RoPE is more prevalent in modern LLMs such as Llama, Mistral, and Qwen [6]. We believe the study of RoPE's shortcomings will also help push the development of advanced LLMs.
3. The impact of training data on MoICE's performance (Question 1)
The MoICE router assigns dynamic routing weights to each (predefined and non-trainable) RoPE angle, which are used to calculate a weighted sum of attention scores. As a result, no extra knowledge or ability from the additional data is introduced into the base model.
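For intuition, the sketch below illustrates this kind of per-head routing over a set of frozen RoPE bases; the names, shapes, and top-K gating details are hypothetical assumptions, not our exact implementation.

```python
# A rough, hypothetical sketch of per-head routing over frozen RoPE bases:
# attention scores are computed under several fixed bases and mixed with
# router weights. Names and shapes are illustrative only.
import torch

def routed_attention_scores(q, k, rope_bases, router, top_k=2):
    """q, k: [seq_len, head_dim] for one head; rope_bases: list of callables applying
    RoPE with different frozen base angles; router: maps head_dim -> len(rope_bases)."""
    weights = torch.softmax(router(q), dim=-1)        # [seq_len, N] per-token routing weights
    top_w, top_i = weights.topk(top_k, dim=-1)        # keep the K most relevant bases
    top_w = top_w / top_w.sum(dim=-1, keepdim=True)   # renormalize over selected bases
    scale = q.shape[-1] ** -0.5

    scores = torch.zeros(q.shape[0], k.shape[0])
    for n, apply_rope in enumerate(rope_bases):
        q_n, k_n = apply_rope(q), apply_rope(k)       # rotate q/k with the n-th base
        s_n = (q_n @ k_n.T) * scale                   # attention scores under this base
        gate = torch.where(top_i == n, top_w, torch.zeros_like(top_w)).sum(-1, keepdim=True)
        scores = scores + gate * s_n                  # weighted sum of attention scores
    return scores  # the usual softmax over keys (and masking) would follow
```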
In Table 6, we observe very similar performance improvements when MoICE is trained on varied datasets. This ablation study demonstrates that the improvements come from the routing strategies the router learns, not from any supplementary knowledge derived from the extra training data.
4. Why does MoICE perform better with Llama2 than with Mistral? (Question 2)
We implemented the MDQA tasks with a 30-doc context for Mistral, in line with its longer context window. MDQA tasks with a 30-doc context are more challenging than those with a 10-doc context, resulting in Mistral's weaker performance compared to Llama2.
5. The effect of N and K Values (Question 3)
We should not always choose larger values for N and K.
For N, overly large values are unnecessary. As shown in Table 4, performance improves with increasing N up to N=7, after which it plateaus at N=9. When testing even larger N values, we observed no significant improvement in performance, while incurring additional costs in both training and inference.
Regarding K, Table 5 demonstrates that the best performance is achieved when K equals N. Although increasing K raises inference costs, our method remains the most efficient among the baselines, even when the cost is at its maximum for the fixed N (K=N=7, as seen in Table 2).
Once again, we appreciate your thoughtful review and feedback on our paper. Please let us know if you have any additional questions or suggestions.
References
[1] Fedus W, Zoph B, Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[J]. Journal of Machine Learning Research, 2022, 23(120): 1-39.
[2] Zeng Z, Miao Y, Gao H, et al. AdaMoE: Token-Adaptive Routing with Null Experts for Mixture-of-Experts Language Models[J]. arXiv preprint arXiv:2406.13233, 2024.
[3] Xue F, Zheng Z, Fu Y, et al. Openmoe: An early effort on open mixture-of-experts language models[J]. arXiv preprint arXiv:2402.01739, 2024.
[4] Dai D, Deng C, Zhao C, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models[J]. arXiv preprint arXiv:2401.06066, 2024.
[5] Zoph B, Bello I, Kumar S, et al. St-moe: Designing stable and transferable sparse expert models[J]. arXiv preprint arXiv:2202.08906, 2022.
[6] Zhang Z, Chen R, Liu S, et al. Found in the middle: How language models use long contexts better via plug-and-play positional encoding[J]. arXiv preprint arXiv:2403.04797, 2024.
After rebuttal: raised score by 1 point after discussion.
The paper proposes a new strategy Mixture of In-Context Experts (MoICE) to increase the input context length of LLMs while allowing the model to function effectively on longer context inputs. Their key idea is to introduce a routing mechanism at each attention head of the transformer that allows selection of multiple positions (RoPE angles) dynamically to effectively process tokens at different parts of the input context. They implement the proposed MoICE strategy on Llama-2-7B-chat and Mistral-7B-instruct-8k, and evaluate it on tasks in the L-Eval benchmark which consists of 4 close-ended tasks (Multiple choice questions, classification etc.) and ~181 questions on open-ended generation tasks.
Strengths
- The paper proposes an interesting idea and explains it reasonably well.
- The implementation on two open-source LLMs (Llama-2-7B-chat and Mistral-7B-instruct-8k) and the accompanying analysis are valid.
Weaknesses
- A major weakness is that the router is trained on context lengths of just 8k. While increasing from 4k to 8k is valuable, an experiment or model with larger input context lengths (perhaps at least 16k) would be of great value.
- Evaluations of long-context abilities on other benchmarks: while the evaluations on L-Eval are reasonable, it would have been valuable to report on at least one other popular benchmark, such as ZeroScrolls [1].
[1] Shaham, Uri, Maor Ivgi, Avia Efrat, Jonathan Berant, and Omer Levy. "Zeroscrolls: A zero-shot benchmark for long text understanding." arXiv preprint arXiv:2305.14196 (2023) -- EMNLP 2023.
Questions
Similar to weaknesses.
- Have you tried tuning and evaluating on context length greater than 8k?
- Have you considered evaluations on other benchmark tasks for long context?
Limitations
No limitations have been listed.
Clearly stating limitations in terms of memory usage, implementation details, training challenges, and the datasets used for training would be of value to the community.
We sincerely appreciate your valuable feedback and suggestions! We hope our response addresses your concerns.
1. To test MoICE with LLMs whose context length is greater than 8k. (Weakness 1 & Question 1)
Thanks for your valuable suggestions. We have implemented MoICE on Qwen2-7B-Instruct, whose pre-training context length is 32k. The results on LongBench [1] and L-Eval are reported below, respectively. All experiments are conducted on one A800-80G GPU:
| Method | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Tasks | Average |
|---|---|---|---|---|---|---|
| Qwen2-7B-Instruct-32k | 34.66 | 35.91 | 25.77 | 56.89 | 33.83 | 37.41 |
| + PI | 28.28 | 17.08 | 24.60 | 57.51 | 32.67 | 32.03 |
| + NTK | 31.35 | 23.98 | 24.95 | 56.64 | 32.50 | 33.88 |
| + Ms-PoE | OOM | OOM | OOM | OOM | OOM | N/A |
| + AB | OOM | OOM | OOM | OOM | OOM | N/A |
| + MoICE | 39.37 | 37.35 | 25.81 | 57.29 | 34.83 | 38.93 |
| Method | Coursera | QuALITY | TOEFL | SFiction | Average | wins | ties | win-rate% |
|---|---|---|---|---|---|---|---|---|
| Qwen2-7B-Instruct-32k | 78.44 | 61.88 | 61.19 | 69.53 | 67.76 | 83 | 119 | 40.83 |
| +PI | 76.58 | 61.88 | 60.32 | 70.31 | 67.27 | 83 | 107 | 39.11 |
| +NTK | 78.07 | 62.38 | 60.32 | 70.31 | 67.77 | 84 | 111 | 40.20 |
| +Ms-PoE | 75.47 | 60.89 | 60.47 | 71.88 | 67.18 | OOM | OOM | OOM |
| +AB | 78.44 | OOM | OOM | OOM | N/A | OOM | OOM | OOM |
| +MoICE | 78.44 | 62.87 | 61.77 | 71.09 | 68.54 | 91 | 105 | 41.59 |
"OOM" indicates that due to the extra memory cost required by Ms-PoE and AB, the inference on the long context failed due to out of memory.
On LLMs with a 32k context window, MoICE still brings improvements in context awareness. We will add these results in the revision.
2. Evaluations on benchmarks beyond L-Eval (Weakness 2 & Question 2)
Thanks for your valuable suggestions on the generalization of MoICE. We have added an additional benchmark, LongBench [1], a bilingual, multitask, and comprehensive assessment of the long-context understanding capabilities of large language models. We evaluate 16 tasks across 5 scenarios and report the average value for each scenario. All experiments are conducted on one A800-80G GPU.
| Method | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Tasks | Average |
|---|---|---|---|---|---|---|
| Llama2-7B-chat | 25.54 | 18.47 | 23.37 | 51.78 | 3.94 | 29.85 |
| + PI | 23.42 | 23.73 | 25.34 | 51.63 | 7.63 | 31.30 |
| + NTK | 24.73 | 23.67 | 25.41 | 51.97 | 8.33 | 31.58 |
| + Ms-PoE | 23.68 | 24.59 | 25.33 | 51.66 | 8.04 | 31.75 |
| + AB | 27.06 | 22.94 | 25.52 | 52.84 | 8.62 | 32.21 |
| + MoICE | 26.31 | 23.70 | 25.60 | 52.34 | 9.71 | 32.25 |
| Method | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Tasks | Average |
|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-8k | 27.20 | 19.89 | 24.22 | 52.41 | 5.06 | 25.76 |
| + PI | 30.94 | 24.94 | 26.24 | 49.34 | 9.35 | 28.16 |
| + NTK | 30.46 | 21.21 | 23.89 | 52.41 | 8.44 | 27.28 |
| + Ms-PoE | 27.90 | 17.89 | 20.28 | 48.59 | 8.95 | 24.72 |
| + AB | 29.81 | 21.95 | 25.58 | 54.42 | 7.89 | 27.93 |
| + MoICE | 31.09 | 22.98 | 26.69 | 55.76 | 8.02 | 28.91 |
On LLMs with 4k, 8k, and 32k context windows, MoICE consistently brings improvements in context awareness. We will add these results in the revision.
Once again, we appreciate your thoughtful review and feedback on our paper. Please let us know if you have any additional questions or suggestions.
References
[1] Bai Y, Lv X, Zhang J, et al. Longbench: A bilingual, multitask benchmark for long context understanding[J]. arXiv preprint arXiv:2308.14508, 2023.
Thanks for adding these additional evaluations. I will let the ACs decide whether additional experiments are acceptable at this time. The additional experiments do address the weaknesses I noted in the submitted paper. Please make sure to include these in the main paper in future versions.
If the new experiments are acceptable then I can increase my score by a point. I'll keep my score as is until we get a clarification.
We sincerely appreciate your valuable suggestions and are glad to know that our rebuttal and new experiments have addressed all of your concerns. We will definitely include the additional experimental results in a future revision as you suggested.
This paper proposes a new, effective method for long-context LLMs. All reviewers acknowledge the contributions of this paper, with all positive scores.
The AC also agrees with the reviewers and recommends accepting this paper.