No rating data available
ICLR 2025
StructMoE: Augmenting MoEs with Hierarchically Routed Low Rank Experts
TL;DR
A method to scale MoE models using structured modules
Abstract
The traditional approach to scaling Mixture of Experts for transformer models has been to increase the total number of experts. While performance improves with more experts, the gains are diminishing, whereas
memory scales linearly with the number of experts. We introduce $StructMoE$, a scaling approach for Mixture of Experts which augments experts with additional dynamic capacity using routed structured matrices, which we refer
to as Low Rank Experts ($LoRE$). At a high level, we introduce hierarchical MoEs where the first level of routing decides which expert each token
should be routed to, and the second level of routing decides which $LoRE$ each token should be routed through. The outputs of the expert and the $LoRE$ are then entangled together to produce
the final output. This introduces more dynamism into the model, which has empirically been demonstrated to improve model performance. We find this scaling approach to outperform a standard MoE baseline in terms of loss on a held-out validation set. Thus, we propose this as an effective scaling technique for MoEs compared to the standard approach of adding more
experts to the model.
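
The sketch below illustrates the hierarchical routing described in the abstract: a first-level router assigns each token to an expert, a second-level router assigns it to a $LoRE$, and the two outputs are combined. It is a minimal PyTorch-style approximation, assuming top-1 routing, standard FFN experts, rank-$r$ down/up projections for the $LoRE$s, and an additive combination of the two outputs; these choices, and all module names, are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a StructMoE-style layer (assumptions noted above, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRE(nn.Module):
    """Low Rank Expert: assumed here to be a rank-r down-projection followed by an up-projection."""

    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class StructMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, num_lores: int, rank: int):
        super().__init__()
        # First-level router: token -> expert.
        self.expert_router = nn.Linear(d_model, num_experts, bias=False)
        # Second-level router: token -> LoRE.
        self.lore_router = nn.Linear(d_model, num_lores, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.lores = nn.ModuleList([LoRE(d_model, rank) for _ in range(num_lores)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        # Level 1: route each token to one expert (top-1 routing assumed).
        expert_probs = F.softmax(self.expert_router(x), dim=-1)
        expert_idx = expert_probs.argmax(dim=-1)
        # Level 2: route each token to one LoRE (top-1 routing assumed).
        lore_probs = F.softmax(self.lore_router(x), dim=-1)
        lore_idx = lore_probs.argmax(dim=-1)

        expert_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                expert_out[mask] = expert(x[mask]) * expert_probs[mask, e].unsqueeze(-1)

        lore_out = torch.zeros_like(x)
        for l, lore in enumerate(self.lores):
            mask = lore_idx == l
            if mask.any():
                lore_out[mask] = lore(x[mask]) * lore_probs[mask, l].unsqueeze(-1)

        # The abstract says the expert and LoRE outputs are "entangled"; an elementwise
        # sum is assumed here as the simplest combination.
        return expert_out + lore_out


# Example usage with arbitrary sizes:
# layer = StructMoELayer(d_model=64, d_ff=256, num_experts=8, num_lores=4, rank=8)
# y = layer(torch.randn(16, 64))
```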
Keywords
moe, mixture of experts, LLM, transformer
Reviews and Discussion
Withdrawal Notice by Authors
I submitted the wrong manuscript for the final version.