Learn To be Efficient: Build Structured Sparsity in Large Language Models
We propose a novel training algorithm that produces efficiency-aware LLMs with more structured contextual sparsity for fast inference.
Abstract
Reviews and Discussion
This paper proposes a method to increase the structured sparsity of models through training, called Learning-to-be-efficient (LTE). It introduces a new training loss that guides the model to activate fewer neurons while maintaining original performance. Simultaneously, LTE employs a threshold-based sigmoid routing strategy, allowing flexible expert selection instead of a predefined fixed number. To achieve acceleration, the authors further implement an efficient version based on Triton. Compared to previous sparse acceleration methods, LTE can be applied to activation functions other than ReLU, offering better generality.
Strengths
- The method demonstrates good generality and can be applied to various activation functions.
- The designed separability loss is intuitive and effective.
- Experiments on diverse types of tasks prove the effectiveness of the method.
- Implementation of a Triton kernel achieves computational time acceleration.
Weaknesses
- Because mainstream dense models have an FFN-to-attention compute ratio of approximately 2:1, the overall speedup is not high even though good sparse acceleration is achieved in the FFN layers. If experiments were conducted on models with a higher proportion of FFN, the acceleration effect would likely be better, potentially enhancing the impact of the work.
- Some experimental and methodological details lack sufficient discussion. See the questions section for more details.
Questions
- Regarding the experimental analysis in Line 135: Since MoEfication approximates dense model computation, expert scores are only used for expert selection and do not scale the expert outputs. This ensures that when all experts are selected, the computation is entirely equivalent to the original dense model. Therefore, it should not be possible to have a situation where two experts are selected but only one contributes. Could the authors further elaborate on this phenomenon?
- Concerning the expert grouping method, is there room for further improvement in clustering W_1? LLaMA uses GLU, where the intermediate representation is obtained by element-wise multiplication of two vectors, which is different from the vanilla FFN studied in the original MoEfication.
- What principles guide the selection of thresholds, and is there any transferability to other LLMs?
Limitations
Yes, the authors have adequately addressed the limitations.
We are glad that the reviewer found that our work has extensive evaluation and shows strong empirical performance. We thank the reviewer for the constructive feedback and appreciate the opportunity to address the points you have raised.
Q1: Because mainstream dense models have an FFN-to-attention compute ratio of approximately 2:1, the overall speedup is not high even though good sparse acceleration is achieved in the FFN layers. If experiments were conducted on models with a higher proportion of FFN, the acceleration effect would likely be better, potentially enhancing the impact of the work.
A1: We agree with the reviewer that improving sparsity only in FFN layers limits the achievable speed-up. Applying LTE to a more FFN-intensive model would yield a larger speed-up. Another potential solution is to apply MoEfication to attention layers. We will leave this for future exploration.
Q2: Regarding the experimental analysis in Line 135: Since MoEfication approximates dense model computation, expert scores are only used for expert selection and do not scale the expert outputs. This ensures that when all experts are selected, the computation is entirely equivalent to the original dense model. Therefore, it should not be possible to have a situation where two experts are selected but only one contributes. Could the authors further elaborate on this phenomenon?
A2: The reason multiple experts are selected but only one contributes (Line 135) is the adoption of the noisy top-K softmax routing strategy in Section 3.2. With this routing strategy, the expert outputs are multiplied by their corresponding expert scores. Given the large number of experts in MoEfication and the fact that the softmax expert scores sum to 1, many of the selected expert scores are nearly zero. When those expert outputs are multiplied by the nearly zero scores, they are scaled to zero, resulting in no contribution to the inference. We hope this explanation addresses your concern; we will polish the writing of Section 3.2 to make this point clearer.
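To make the score-dilution effect concrete, here is a small illustrative snippet (not code from the paper; the number of experts, the top-K value, and the random logits are all hypothetical) showing that when a softmax is taken over many experts, most selected experts receive near-zero scores:

```python
import torch

torch.manual_seed(0)
num_experts = 128                          # hypothetical expert count after MoEfication
logits = torch.randn(num_experts)          # hypothetical router logits for one token

scores = torch.softmax(logits, dim=-1)     # scores are forced to sum to 1
topk = torch.topk(scores, k=32)            # select a large fraction of the experts

print(scores.sum().item())                 # 1.0
print(topk.values[:3])                     # the few largest scores (a few percent each)
print(topk.values[-3:])                    # the smallest selected scores, close to 0
# Because expert outputs are multiplied by these scores, experts with near-zero
# scores are effectively silenced even though they were "selected".
```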
Q3: Concerning the expert grouping method, is there room for further improvement in clustering W_1? LLaMA uses GLU, where the intermediate representation is obtained by element-wise multiplication of two vectors, which is different from the vanilla FFN studied in the original MoEfication.
A3: For LLaMA, our current design uses parameter clustering on the gate matrix (whose output feeds into the activation function) to group the neurons. In our early evaluation, we also tried using the other matrix (the up-matrix) to group the neurons. However, we observed no significant difference in performance between the two approaches. As we show in Figure 9 in the paper, another strategy (co-activation) also yields similar performance. Our hypothesis is that, since LTE updates both the model weights and the routers, the model can adjust the grouping to some extent even if the initial grouping is not optimal. However, if the initial grouping is too poor (e.g., random), the model may not be able to make such a large adjustment.
Q4: What principles guide the selection of thresholds, and is there any transferability to other LLMs?
A4: In our evaluation, we set the threshold to 0.5 for all models and tasks, as 0.5 represents the midpoint of the sigmoid output. As shown in Figure 3 in the paper, the separability loss creates a significant gap around the pre-set threshold. For this reason, we believe a threshold around 0.5 can also work well for other LLMs.
Thanks for the attentive reading of the manuscript and constructive feedback. We will incorporate these changes into our final version. We hope our response addresses all the concerns and that the reviewer will consider raising the rating accordingly. We are more than glad to answer any further questions.
Thank you for your detailed reply. I'll keep my original score.
We thank the reviewer for the response. We hope these explanations and discussions have answered your questions.
Thanks again for your insightful and constructive comments, which indeed help improve our paper. We are happy to answer further questions if you have any in the future.
The paper presents a new approach (LTE) aimed at improving the inference efficiency of large language models by developing structured activation sparsity. The method trains LLMs to activate fewer neurons in FFN layers while attempting to maintain task performance. The approach works by grouping neurons into experts and using a routing strategy (based on Sigmoid instead of Softmax) to select experts adaptively. The authors evaluate the method on RoBERTa, GPT2, and LLaMA models across various NLP tasks. They report that LTE outperforms existing baselines and reduces FLOPs and latency by exploiting the sparsity through a custom CUDA implementation.
Strengths
- The paper presents a new method for inducing sparsity in LLMs. It is based on several MoE concepts but uses Sigmoid-based routing instead of the traditional Softmax.
- The authors tested their method on multiple models, datasets, and task types.
- The paper includes a custom CUDA kernel implementation to strengthen the applicability of the method.
Weaknesses
- The two-step training increases the complexity of applying the approach.
- Dependency on multiple hyperparameters/thresholds.
- The paper keeps claiming that existing methods focus on existing sparsity in pre-trained models, but inducing sparsity in training (even in LLMs) is not new and has been around for a while.
Questions
- Since you are introducing a different training approach, how does the total training time compare to the baseline?
- Can you provide more insights into the limitations of Softmax routing?
Limitations
- Complexity overhead of the training.
- Many of the presented insights are already in the MoE literature, except for the Sigmoid routing.
- Limited number of baselines (using only two baselines, there is a lot of work about sparsity in LLMs). Imo, the paper would be much stronger if it compares against more (sparsity) baselines.
We are glad that the reviewer found that our work has extensive evaluation and our custom kernel increases the applicability. We thank the reviewer for the constructive feedback and appreciate the opportunity to address the points you have raised.
Q1: The two-step training increases the complexity of applying the approach.
Q4: Complexity overhead of the training. Since you are introducing a different training approach, how does the total training time compare to the baseline?
A1&A4: Compared to the two post-training baselines, even though LTE introduces additional training overhead, we argue that LTE training is a one-time effort. Since LTE significantly reduces the inference overhead, this approach is beneficial in the long run for serving LLMs more efficiently.
Moreover, as we discussed in Q4 of the General Response, even if we further fine-tune the entire model MoEfied with Deja Vu for the same training time as LTE, the Deja Vu models fail to achieve performance comparable to LTE models.
Q2: Dependency on multiple hyperparameters/thresholds.
A2: We agree with the reviewer that LTE training introduces additional thresholds and hyperparameters. However, we argue that those additional thresholds and hyperparameters do not increase the training complexity of LTE.
1) To avoid manually selecting thresholds, we design the separability loss (Eq. 5 in the paper) to make models threshold-aware. This separability loss encourages router outputs to diverge from a predefined threshold (Figure 3 in the paper), which enables us to use the same threshold for all evaluations. In our paper, we set the threshold to 0.5 for all models and tasks, as 0.5 represents the midpoint of the sigmoid output. (An illustrative sketch of this idea is given after this list.)
2) For the weighting hyperparameter of the separability loss in Eq. 6, our ablation study (Figure 11 in the paper) shows that performance is not sensitive to this parameter: once it is increased beyond a certain value, further increases do not bring performance gains. We set this hyperparameter to 0.5 for all models and tasks.
3) The only sensitive hyperparameter is the one in Eq. 6 that controls the sparsity level of the model (analogous to the number of selected experts in traditional MoE models).
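As a rough illustration of the idea behind the separability loss, the sketch below uses a simple quadratic penalty that pushes sigmoid router outputs away from the 0.5 threshold. This is only an assumed form for illustration; the paper's Eq. 5 may be defined differently.

```python
import torch

def separability_loss(router_probs: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Hypothetical form (not necessarily the paper's Eq. 5): minimizing this term
    # maximizes the squared distance of each router output from the threshold,
    # so thresholding at 0.5 becomes an unambiguous decision after training.
    return -((router_probs - threshold) ** 2).mean()

router_probs = torch.sigmoid(torch.randn(16, 128))  # hypothetical router outputs (batch x experts)
print(separability_loss(router_probs))              # added to the task loss with a weighting factor
```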
Q3: The paper keeps claiming that existing methods focus on existing sparsity in pre-trained models, but inducing sparsity in training (even in LLMs) is not new and has been around for a while.
A3: We thank the reviewer for pointing out the confusion in our claim. The existing methods discussed in our paper refer to MoEfication methods that transform a pretrained dense model into an MoE model, rather than traditional MoE training.
As we discussed in the General Response Q1, MoEfication presents unique challenges, and our evaluations show that the MoE training baselines (such as noisy top-K softmax, Sigmoid-MoE, and fine-tuning Deja Vu models) all underperform LTE. To the best of our knowledge, our paper is the first work to explore sparsity-inducing training on top of a pretrained model.
Q5: Can you provide more insights into the limitations of Softmax routing?
A5: The limitations of Softmax routing in MoEfication come from the fact that the sum of the softmax output is one. In traditional MoE models, only a few experts are selected (typically fewer than four), resulting in expert scores that are not too small. However, in MoEfication, a much larger number of experts can be chosen. When the sum of the expert scores is distributed among all experts, each expert may receive a very small score (almost zero in many cases). Since the expert outputs are multiplied by these scores, extremely small scores will scale the expert outputs to very small values, affecting inference performance.
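The contrast below is a small hypothetical illustration (not the paper's router implementation): softmax scores shrink as the number of experts grows because they must sum to 1, while sigmoid scores are computed independently per expert and do not shrink.

```python
import torch

torch.manual_seed(0)
for num_experts in (4, 32, 128):
    logits = torch.randn(num_experts)
    max_softmax = torch.softmax(logits, dim=-1).max().item()   # shrinks as experts grow
    sig = torch.sigmoid(logits)
    mean_selected = sig[sig > 0.5].mean().item()               # stays well above 0.5
    print(f"{num_experts:4d} experts | max softmax score {max_softmax:.3f} | "
          f"mean selected sigmoid score {mean_selected:.3f}")
```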
Q6: Many of the presented insights are already in the MoE literature, except for the Sigmoid routing.
A6: Although some aspects of LTE design have been discussed in previous work, MoEfication introduces unique challenges (as detailed in General Response Q1). Directly applying existing MoE techniques to the MoEfication scenario is non-trivial.
Our evaluation shows that directly applying MoE training methods, such as noisy-top K softmax, sigmoid-MoE, and fine-tuning Deja Vu models, does not outperform the proposed LTE methods. This result indicates that LTE is more effective for the MoEfication scenario and provides a strong baseline for this area.
Q7: Limited number of baselines (using only two baselines, there is a lot of work about sparsity in LLMs). Imo, the paper would be much stronger if it compares against more (sparsity) baselines.
A7: In addition to the Deja Vu and MoEfication baselines reported in Section 5, we also compare the performance of the noisy top-k softmax routing, which is the most common MoE training strategy, in Section 3.2. Due to the training collapse issue of the softmax router, we did not conduct large-scale evaluations on other tasks.
Moreover, as suggested by Reviewer #1 (zH4X), we evaluated and compared three additional baselines: Sigmoid-MoE and Deja Vu with fine-tuning (Figure 1 in the rebuttal PDF), and the model pruning method Wanda (Table 1 in the rebuttal PDF). The evaluation results show that LTE still outperforms these baselines. A potential reason for LTE's better performance is that MoEfication presents unique challenges compared to traditional MoE training, which are better addressed by LTE. (We kindly refer to the General Response Q1 for more details.)
We hope those new baselines can address your concerns.
Thanks for the attentive reading of the manuscript and constructive feedback. We will incorporate these changes into our final version.
We hope our response addresses all the concerns and that the reviewer will consider raising the rating accordingly. We are more than glad to answer any further questions.
Thanks to the author for providing further insights and clarifications (through this comment or the other comments, especially with the added baseline). I will consider them when adjusting the final score.
We thank the reviewer for the response. We are happy to see the reviewer’s acknowledgment of our efforts on further insights, clarifications, and new baseline evaluations in our response. We hope these discussions and evaluations have effectively addressed your concerns.
Please let us know if you have any further questions, concerns, or points that require clarification. We are more than happy to answer or discuss them. We look forward to your final score.
This article introduces a novel training algorithm, LTE, designed to train large language models (LLMs) to achieve more structured activation sparsity during inference. It thus enhances their efficiency without compromising performance until the sparsity becomes very high.
Strengths
- LTE performs excellently across all datasets: in multiple natural language understanding (NLU) tasks, LTE shows no significant performance degradation at high sparsity levels (80-95%) and maintains good performance even at FFN sparsity levels exceeding 90%.
- This article develops a CUDA kernel to speed up inference by reducing memory and computational overheads.
Weaknesses
- Stage 1 of LTE will train all the model's parameters. How do you implement Dejavu? As far as I know, Dejavu is a post-training method that freezes the model's parameters. I'm not sure whether this is a fair comparison.
- The LTE algorithm introduces a two-stage training process, which might be complex and computationally intensive to implement. This can be a barrier to practical adoption in resource-constrained environments.
Questions
- Have you tried LTE also on the attention layers?
Limitations
The article includes a limitation section.
We are glad that the reviewer found that our work has excellent performance. We thank the reviewer for the constructive feedback and appreciate the opportunity to address the points you have raised.
Q1: Stage 1 of LTE will train all the model's parameters. How do you implement Dejavu? As far as I know, Dejavu is a post-training method that freezes the model's parameters. I'm not sure whether this is a fair comparison.
A1: Deja Vu is proposed as a post-training method in its paper. Following this, we implemented Deja Vu in a post-training manner in our paper: we first fine-tuned the model on the specific datasets and then applied Deja Vu to the fine-tuned model to MoEfy it.
However, we understand the reviewer's concern that fine-tuning those MoEfied models (all parameters) can further improve performance. We conduct an evaluation on fine-tuning Deja Vu models. We first fine-tuned the model with Wikitext-103 and applied Deja Vu. Subsequently, we further fine-tuned the Deja Vu model with Wikitext-103 (for the same training time as LTE training). Note that the first fine-tuning is necessary for Deja Vu to collect data to train predictors.
The evaluation results are reported in Figure 1 of the rebuttal PDF. We find that while the additional fine-tuning does improve the performance of Deja Vu, LTE still outperforms Deja Vu with additional fine-tuning.
Q2: The LTE algorithm introduces a two-stage training process, which might be complex and computationally intensive to implement. This can be a barrier to practical adoption in resource-constrained environments.
A2: Compared to the two post-training baselines, even though LTE introduces additional training overhead, we argue that LTE training is a one-time effort. Since LTE significantly reduces the inference overhead, this approach is beneficial in the long run for serving LLMs more efficiently.
Moreover, as we discussed in Q1, even if we further fine-tune the entire model MoEfied with Deja Vu for the same training time as LTE, the Deja Vu models fail to achieve performance comparable to LTE models, which indicates that, besides training, other LTE components also contribute to the performance.
Q3: Have you tried LTE also on the attention layers?
A3: We focus on MoEfication for FFN layers in this work, but we do think applying MoEfication to attention layers is an interesting and promising direction. A potential solution is to treat each attention head as an expert and use sigmoid routing to decide whether a head in the attention layer should be used. We believe MoEfication on attention layers can further increase the contextual sparsity of the model, and we plan to leave this for future study.
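For illustration only, a hypothetical sketch of what such per-head sigmoid gating could look like is given below. None of this is part of the paper's implementation; the module structure, the sequence-level routing decision, and the 0.5 threshold are assumptions chosen to mirror the FFN-side design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadGatedSelfAttention(nn.Module):
    """Hypothetical sketch: each attention head is treated as an expert, and a
    threshold-based sigmoid router decides which heads to use for an input."""
    def __init__(self, hidden_size: int, num_heads: int, threshold: float = 0.5):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads, self.head_dim = num_heads, hidden_size // num_heads
        self.qkv = nn.Linear(hidden_size, 3 * hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)
        self.router = nn.Linear(hidden_size, num_heads)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # One routing decision per sequence for simplicity; per-token routing is also possible.
        head_gate = (torch.sigmoid(self.router(x.mean(dim=1))) > self.threshold).float()  # (b, h)

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        out = F.scaled_dot_product_attention(q, k, v)      # (b, h, t, head_dim)
        # Mask inactive heads before the output projection; a real kernel would
        # skip computing them entirely to realize the speed-up.
        out = out * head_gate[:, :, None, None]
        return self.out_proj(out.transpose(1, 2).reshape(b, t, d))
```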
Thanks for the attentive reading of the manuscript and constructive feedback. We will incorporate these changes into our final version.
We hope our response addresses all the concerns and that the reviewer will consider raising the rating accordingly. We are more than glad to answer any further questions.
Thanks for the response of the authors. Most of my questions have been addressed. I will raise my score.
We thank the reviewer for the response and for raising the score. We are glad to see that our responses have addressed your concerns!
Thanks again for your insightful and constructive comments, which indeed help improve our paper. We are happy to answer further questions if you have any in the future.
This work aims to introduce structured sparsity to large language models (LLMs) to improve their execution efficiency. To achieve this, it enhances previous MoEfication methods by employing a sigmoid-based non-competitive routing function and a threshold-based expert selection, allowing for adaptive expert numbers. Experiments across various models and different language understanding and generation tasks validate the effectiveness of the proposed method.
Strengths
- The paper is well-motivated and easy to follow.
- The proposed method exhibits good soundness and can achieve real-device speed-up with the developed CUDA kernel.
- The proposed method has been evaluated across both encoder and decoder language models, achieving a consistently improved accuracy-efficiency trade-off.
Weaknesses
- The major concern is that the technical contribution and novelty of this work are somewhat limited. The use of MoE with sigmoid functions to avoid competition among experts has been adopted in previous works, such as [1][2], and an extension to MoEfication will intuitively work. The authors are expected to analyze the key differences that make the proposed method particularly suitable for MoEfication.
[1] "Approximating Two-Layer Feedforward Networks for Efficient Transformers," R. Csordás et al., EMNLP'23.
[2] "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention," R. Csordás et al., arXiv'23.
- One missing baseline is structured weight pruning [3][4]. Considering this work mainly targets a per-task fine-tuning setting, structured weight pruning can already achieve decent sparsity with non-trivial generation speed-up without performance dropping [3][4]. This comparison can inform the community whether weight sparsity or context sparsity is more cost-effective or if they can be applied together.
[3] "A Simple and Effective Pruning Approach for Large Language Models," M. Sun et al., ICLR'24.
[4] "Fluctuation-based Adaptive Structured Pruning for Large Language Models," Y. An et al., AAAI'24.
- Task-specific fine-tuning for each task is too costly. The authors are highly encouraged to perform continued pretraining and evaluate across different tasks to validate the generalization capability of the proposed method.
- I wonder how the task-specific fine-tuning is performed for the baseline methods. Specifically, is Dejavu/MoEfication performed on top of the fine-tuned model, or is the model fine-tuned after applying these techniques, or are there any smarter strategies? This question arises because the proposed method simultaneously updates both model weights and expert selection strategies, which may be a key reason why it outperforms the baselines.
Questions
My questions have been included in the weakness section. I'm willing to adjust my scores if my concerns are properly addressed.
Limitations
This work does not suffer from notable negative societal impacts as it aims to improve the efficiency of LLMs.
We are glad that the reviewer found that our work is well-motivated and sound. We thank the reviewer for the constructive feedback and appreciate the opportunity to address the points you have raised.
Q1: Novelty concerns: A sigmoid router was proposed in previous MoE works like [1][2]. The authors are expected to analyze the key differences that make the proposed method particularly suitable for MoEfication.
A1: We thank the reviewer for pointing out the missing references, and we are happy to clarify the key differences between LTE and MoE with a sigmoid router, and the novelty of our paper.
First, we would like to clarify that, compared to traditional MoE, the MoEfication problem has three unique challenges: 1) Router designing for a larger number of selected experts; 2) Router training on pretrained models; 3) Adaptive sparsity in different layers of the pretrained model. (We kindly refer to the General Response Q1 for more details.)
Even though Sigmoid-MoE [1] can handle the first challenge, it ignores the second and third challenges, which can lead to inferior performance. LTE, in contrast, addresses the other two challenges as well: for the second challenge, LTE adopts an indicator function (Eq. 2) to avoid scaling the expert outputs, together with a two-stage training algorithm to solve the non-differentiability of the indicator function; for the third challenge, LTE employs an efficiency loss (Eq. 4) to introduce competition across different layers, leading to more adaptive sparsity in different layers (Figure 15 in the paper). These novel designs make LTE better suited to the challenges of MoEfication. (We kindly refer to the General Response Q2 for more details.)
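To make these mechanisms easier to picture, here is a minimal, assumption-laden sketch of a threshold-based sigmoid router and an efficiency-style loss. It is only meant to convey the general shape of the design; the paper's Eq. 2 and Eq. 4 may differ in their exact formulation.

```python
import torch
import torch.nn as nn

class ThresholdSigmoidRouter(nn.Module):
    """Hypothetical sketch of the routing idea: a sigmoid router whose output is
    binarized by a threshold, so selected experts are used with weight 1 rather
    than being scaled by their scores."""
    def __init__(self, hidden_size: int, num_experts: int, threshold: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts)
        self.threshold = threshold

    def forward(self, x: torch.Tensor, hard: bool) -> torch.Tensor:
        probs = torch.sigmoid(self.gate(x))                 # (batch, num_experts)
        if hard:
            # Stage 2 / inference: non-differentiable indicator; experts whose
            # score clears the threshold are kept as-is (weight exactly 1).
            return (probs > self.threshold).float()
        # Stage 1: keep the soft scores so gradients can flow into the router.
        return probs

def efficiency_loss(probs_per_layer):
    # Hypothetical efficiency objective: penalize the expected fraction of
    # activated experts averaged over layers, so layers compete for a limited
    # activation budget and can learn different sparsity levels.
    return sum(p.mean() for p in probs_per_layer) / len(probs_per_layer)
```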
Sigmoid-MoE Evaluation. To better understand how Sigmoid-MoE works on MoEfication tasks, we implemented Sigmoid-MoE [1] on GPT2-Medium and fine-tuned it on WikiText-103. We trained the Sigmoid-MoE models for the same training time as the LTE models. The comparison results are in Figure 1 in the rebuttal PDF: Sigmoid-MoE overcomes the collapse issue of the softmax router but still underperforms LTE.
Q2: One missing baseline is structured weight pruning [3][4]…
A2: We thank the reviewer for bringing the model pruning for discussion. As discussed in lines 94-99 of our paper, while both LTE and structured weight pruning provide structured sparsity for inference acceleration, they provide two different types of sparsity. The contextual sparsity offered by LTE is more flexible and adaptive compared to the static sparsity offered by model pruning.
To provide a clearer comparison between these methods, we apply Wanda to a Wikitext-103 fine-tuned LLaMA-7B model and report the results in Table 1 in the rebuttal PDF. The evaluation results show that LTE achieves better performance than Wanda given the same level of sparsity.
Q3: Task-specific fine-tuning for each task is too costly. The authors are highly encouraged to perform continued pretraining and evaluate across different tasks to validate the generalization capability of the proposed method.
A3: We agree with the reviewer that validating the generalization capability can better demonstrate the effectiveness of the proposed method. In our submission, we already included this evaluation in Figure 7 (Section 5.2). We use the Tulu dataset for supervised fine-tuning of the LTE models and evaluate the few-shot performance on MMLU (a large comprehensive benchmark consisting of 15,908 questions from 57 distinct tasks). The evaluation results show that LTE still outperforms the other baseline methods in this supervised fine-tuning setting, which demonstrates the generalization capability of LTE.
Q4: I wonder how the task-specific fine-tuning is performed for the baseline methods. Specifically, is Dejavu/MoEfication performed on top of the fine-tuned model, or is the model fine-tuned after applying these techniques, or are there any smarter strategies?
A4: Deja Vu and MoEfication are proposed as post-training methods in their papers. Following this, we implemented Deja Vu and MoEfication in a post-training manner in our paper: we first fine-tuned the model on the specific datasets and then applied Deja Vu or MoEfication to the fine-tuned model to MoEfy it. (This is also the implementation suggested in the MoEfication paper.)
However, we understand the reviewer's concern that further fine-tuning those MoEfied models (all parameters) can further improve performance. We conduct an evaluation on fine-tuning Deja Vu models. We first fine-tuned the model with Wikitext-103 and applied Deja Vu. Subsequently, we further fine-tuned the MoEfied Deja Vu model with Wikitext-103 (for the same training time as LTE training). Note that the first fine-tuning is necessary for Deja Vu to collect data to train predictors.
The evaluation results are reported in Figure 1 of the rebuttal PDF. We find that while the additional fine-tuning does improve the performance of Deja Vu, LTE still outperforms Deja Vu with additional fine-tuning.
Thanks for the attentive reading of the manuscript and constructive feedback. We will incorporate these changes into our final version.
We hope our response addresses all the concerns and that the reviewer will consider raising the rating accordingly. We are more than glad to answer any further questions.
[1] "Approximating Two-Layer Feedforward Networks for Efficient Transformers," R. Csordás et al., EMNLP'23.
[2] "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention," R. Csordás et al., arXiv'23.
[3] "A Simple and Effective Pruning Approach for Large Language Models," M. Sun et al., ICLR'24.
[4] "Fluctuation-based Adaptive Structured Pruning for Large Language Models," Y. An et al., AAAI'24.
Thank you to the authors for providing the response and addressing most of my concerns. I will raise my score and further listen to other reviewers' opinions.
We thank the reviewer for the response and for raising the score. We are indeed glad that our responses address your concerns!
Thanks again for your insightful and constructive comments, which indeed help improve our paper. We are happy to answer further questions if you have any in the future.
General Response
Dear reviewers,
We thank all the reviewers for their constructive reviews towards improving our work.
We are pleased that reviewers found our paper’s advantages: “LTE constantly achieves better performance-sparsity trade-off across multiple models, datasets, and task types.” (All reviewers); “The customized kernel improves the soundness of the work.” (All reviewers).
For the comments and concerns discussed in the reviews, we write this general response to address some common concerns and separate responses to each individual review.
Q1: The challenges of MoEfication compared to traditional MoE training.
A1: MoEfication presents three unique challenges:
1) Router design for a larger number of selected experts: Unlike traditional MoE designs, which typically select a small number of experts, MoEfication can choose a much larger number of experts, which leads the commonly used softmax routing to collapse in our empirical evaluation (as discussed in Section 3.2).
2) Router training on pretrained models: MoEfication is based on a pretrained model, unlike traditional MoE models that are trained from scratch. Before MoEfication, the outputs of the grouped experts are not scaled in the pretrained model, but typical MoE designs need to scale outputs with expert scores to make the router differentiable. The scaling of expert outputs in a pretrained model can hurt performance, even with fine-tuning.
3) Adaptive sparsity in different layers of the pretrained model. Recent work [2] shows that different layers in a pretrained model have different levels of sparsity. Traditional MoE designs use a predefined k to set the sparsity for each layer, which prevents adaptive sparsity across different layers.
Those challenges make it non-trivial to directly apply MoE training for the MoEfication problem.
Q2: The novelty and contribution of our paper.
A2: While some MoE-related concepts have been discussed in the MoE literature, as far as we know, LTE is the first method to address all three aforementioned challenges in MoEfication.
1) Efficiency-aware sigmoid router: We adopt a sigmoid router and use an efficiency-aware loss to introduce competition among experts, thereby avoiding the collapse issue associated with softmax routing; this addresses the first challenge.
2) Indicator function with two-stage training: We use an indicator function to select experts without scaling expert outputs. To address the non-differentiability of the indicator function, we propose a two-stage training algorithm to jointly train the model and routers. This addresses the second challenge.
3) Threshold-based router with efficiency loss: We utilize a threshold-based router, which allows for a more adaptive selection of experts in each layer. The efficiency loss introduces competition across different layers, leading to adaptive sparsity in different layers (see the ablation study in Figure 15 in the Appendix), which addresses the third challenge.
Beyond handling these three challenges, another novelty of LTE is the introduction of inference efficiency as an optimization goal. This encourages models, through training, to use only the parameters necessary for inference, which differs from the common practice of predefining a sparsity level in typical MoE.
Q3: Additional baseline comparison: Sigmoid-MoE[1]
A3: As suggested by Reviewer #zH4X, we implemented the Sigmoid-MoE to test its performance on MoEfication tasks. We applied Sigmoid-MoE to the GPT-2 Medium model and fine-tuned it with the WikiText-103 dataset. The Sigmoid-MoE model was trained for the same training time as the LTE models. The comparison results are presented in Figure 1 in the rebuttal PDF. While Sigmoid-MoE overcomes the collapse issue of the softmax router, it still underperforms LTE.
Q4: Additional baseline comparison: Deja Vu fine-tuning
A4: Deja Vu and MoEfication are proposed as post-training methods in their papers. Following this, we implemented Deja Vu and MoEfication in a post-training manner in our paper: we first fine-tuned models with the specific datasets and then applied Deja Vu or MoEfication to the fine-tuned models to evaluate performance.
However, we understand the reviewers’ concern that fine-tuning those MoEfied models (all parameters) could further improve the performance of the baselines. To better study this problem, we conduct an evaluation on fine-tuning the Deja Vu model (all parameters) to compare the performance. We first fine-tuned the model with Wikitext-103 and applied Deja Vu. Subsequently, we further fine-tuned the Deja Vu model with Wikitext-103 for the same training time as LTE training. Note that the first fine-tuning is necessary for Deja Vu to collect data to train predictors.
The evaluation results are reported in Figure 1 of the rebuttal PDF. We find that while the additional fine-tuning does improve the performance of Deja Vu, LTE still outperforms Deja Vu with fine-tuning.
To sum up, MoEfication presents unique challenges, and it is non-trivial to directly apply MoE techniques to MoEfication. Considering the novel design and strong empirical performance of LTE, we believe that LTE presents a strong baseline for future study.
We agree with the reviewers that the paper writing can be further improved, and we believe that those issues can be well addressed in our final version.
[1] "Approximating Two-Layer Feedforward Networks for Efficient Transformers," R. Csordás et al., EMNLP'23.
[2] "The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers." Li, Zonglin, et al., ICLR’23.
Reviewers agree that the paper delivers a significant real-world improvement in LLM inference efficiency by training structured sparsity, which means that even relatively low sparsity (say 50%) is beneficial and leads to wall-clock improvements in latency and runtime.
The approach combines existing ideas (sigmoid MoE, MoEfication, sparsity-encouraging losses) but in a novel way, and has the potential to open new areas of related work.
Reviewers are concerned with complexity of the training process, but the rebuttal argues effectively that this complexity is justified by the improvement in inference efficiency.
Reviewers also note that applying to the FFN layers only does limit the opportunity for end-to-end improvement, but as the rebuttal notes, this is potential future work. Publishing the paper now will allow other researchers to benefit from the ideas.