No rating data available
ICLR 2025
StructMoE: Augmenting MoEs with Hierarchically Routed Low Rank Experts
TL;DR
A method to scale MoE models using structured modules
Abstract
The traditional approach to scaling Mixture of Experts for transformer models has been to increase the total number of experts. While performance improves with more experts, the gains are diminishing, whereas
memory scales linearly with the number of experts. We introduce $StructMoE$, a scaling approach for Mixture of Experts which augments experts with additional dynamic capacity using routed structured matrices, which we refer
to as Low Rank Experts ($LoRE$). At a high level, we introduce hierarchical MoEs where the first level of routing decides which expert each token
should be routed to, and the second level of routing decides which $LoRE$ each token should be routed through. The outputs of the expert and the $LoRE$ are then entangled together to produce
the final output. This introduces more dynamism into the model, which has empirically been demonstrated to improve model performance. We find this scaling approach to outperform a standard MoE baseline in terms of loss on a held-out validation set. Thus, we propose this as an effective scaling technique for MoEs compared to the standard approach of adding more
experts to the model.
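
The sketch below illustrates the hierarchical routing described in the abstract: a first-level router assigns each token to an expert, a second-level router assigns it to a $LoRE$, and the two outputs are combined. It is a minimal PyTorch-style approximation, assuming top-1 routing, standard FFN experts, rank-$r$ down/up projections for the $LoRE$s, and an additive combination of the two outputs; these choices, and all module names, are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a StructMoE-style layer (assumptions noted above, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRE(nn.Module):
    """Low Rank Expert: assumed here to be a rank-r down-projection followed by an up-projection."""

    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class StructMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, num_lores: int, rank: int):
        super().__init__()
        # First-level router: token -> expert.
        self.expert_router = nn.Linear(d_model, num_experts, bias=False)
        # Second-level router: token -> LoRE.
        self.lore_router = nn.Linear(d_model, num_lores, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.lores = nn.ModuleList([LoRE(d_model, rank) for _ in range(num_lores)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        # Level 1: route each token to one expert (top-1 routing assumed).
        expert_probs = F.softmax(self.expert_router(x), dim=-1)
        expert_idx = expert_probs.argmax(dim=-1)
        # Level 2: route each token to one LoRE (top-1 routing assumed).
        lore_probs = F.softmax(self.lore_router(x), dim=-1)
        lore_idx = lore_probs.argmax(dim=-1)

        expert_out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                expert_out[mask] = expert(x[mask]) * expert_probs[mask, e].unsqueeze(-1)

        lore_out = torch.zeros_like(x)
        for l, lore in enumerate(self.lores):
            mask = lore_idx == l
            if mask.any():
                lore_out[mask] = lore(x[mask]) * lore_probs[mask, l].unsqueeze(-1)

        # The abstract says the expert and LoRE outputs are "entangled"; an elementwise
        # sum is assumed here as the simplest combination.
        return expert_out + lore_out


# Example usage with arbitrary sizes:
# layer = StructMoELayer(d_model=64, d_ff=256, num_experts=8, num_lores=4, rank=8)
# y = layer(torch.randn(16, 64))
```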
Keywords
moe, mixture of experts, LLM, transformer
Reviews and Discussion
Withdrawal Notice by Authors
I submitted the wrong manuscript for the final version.