PaperHub
5.5 / 10
Poster · 4 reviewers
Ratings: 3, 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Autonomy-of-Experts Models

OpenReview · PDF
Submitted: 2025-01-18 · Updated: 2025-07-24

Abstract

Keywords
Mixture-of-Experts · Autonomy-of-Experts · Language models

Reviews and Discussion

Review
Rating: 3

This work proposes Autonomy-of-Experts (AoE) to replace the traditional MoE model. Instead of employing a router to select the experts, AoE computes the internal activations of all experts and selects the best ones to proceed. The authors conduct pre-training experiments to investigate various properties of AoE and demonstrate promising results compared to conventional MoE.

Questions for Authors

  • Please clarify the key contribution in the expert selection strategy in AoE compared to previous studies such as CompeteSMoE.
  • Please clarify the advantages of using the intermediate norm for expert selection and the training dynamic of AoE, as elaborated in the Other Strength and Weaknesses section.

Claims and Evidence

The claims are generally supported.

Methods and Evaluation Criteria

Yes

Theoretical Claims

  • The paper does not contain any theoretical claims

Experimental Design and Analysis

The experimental designs are sound. The authors conducted pre-training experiments to train AoE and SMoE from scratch and evaluated them on several downstream tasks. This is a standard experimental setting.

Supplementary Material

Yes, I read the code in the supplementary materials and the appendices.

Relation to Broader Literature

The scope of this work is quite limited as it only focuses on the design of the MoE expert selection mechanism. The paper did not discuss how this idea can impact related research disciplines.

Essential References Not Discussed

While the AoE architecture design is relatively new, its expert selection mechanism has been observed in a previous study, CompeteSMoE [1], which also proposed selecting the best experts via their activation norms.

[1] Pham, Quang, et al. "CompeteSMoE--Effective Training of Sparse Mixture of Experts via Competition." arXiv preprint arXiv:2402.02526 (2024).

Other Strengths and Weaknesses

I have two major concerns regarding AoE:

  • First, while using the expert's activation norm for selection has been explored, AoE instead uses the intermediate activation norm. Could the authors provide more justification for why this design achieves better results? Specifically, it is not guaranteed that the expert with the highest intermediate norm will also have the highest output norm.
  • Second, the training pipeline for AoE is not clear. Since AoE requires decomposing the expert weight into up and down projections, how are the parameters updated throughout training, and does AoE decompose the expert weight after each update?

Other Comments or Suggestions

N/A

Author Response

We sincerely thank you for your valuable comments! We hope our rebuttal helps address your concerns. If so, we would be grateful if you could consider increasing the overall recommendation.


 

Contribution and Novelty

Thank you for listing [Pham et al.]. Our paper fundamentally differs from it. We believe our AoE presents a solid novelty and technical contribution. Here are the differences.

1. Motivation, Architecture, and Method

  • AoE identifies and addresses a widely overlooked issue, i.e., the separation between the router's decision-making and the execution of experts. AoE is a novel MoE paradigm where the router is removed, allowing the experts to autonomously select themselves.

  • In contrast, Pham et al. aim to enhance the router. They take the experts' outputs as additional supervision signals for the router logits, which requires a complex training objective. Additionally, because they sometimes compute every expert's final output, the model is densely activated and, strictly speaking, no longer an SMoE.

2. Use of Activation Norms

  • [1] proposed that activation norms can be used to measure the knowledge of modules. Our approach is inspired by [1] and we cite them in Line 147. Pham et al. do not cite this paper or other related interpretability works. This perspective is not Pham et al.'s innovation, nor do we claim it as ours.

  • One of our contributions lies in a novel MoE paradigm that selects experts without routers, using only intermediate activations, along with architecture designs for enhanced efficiency. This is not simply a matter of selecting which nodes to use for norm calculations or improving results. AoE avoids the need to compute every expert's final output, preserving the sparse activation characteristic of SMoE and making AoE significantly more efficient and practical.

3. Efficiency

  • AoE achieves pre-training throughput similar to that of an SMoE.

  • Pham et al. did not report efficiency results. Since their model is sometimes densely activated, assuming the common 8-select-2 setting, AoE can be up to four times more efficient during pretraining.

4. Effectiveness

  • Pham et al. developed a complex training pipeline and only tested models up to 36M parameters, with evaluation limited to BPC and PPL metrics.

  • AoE supports simple end-to-end training, validated at scale with models up to 4B parameters and evaluated across various downstream tasks.

Additionally, Pham et al. publicly acknowledged that their complex router enhancement cannot be reproduced, making it impossible to use it as a baseline for comparison. This is stated in the first line of their README: https://github.com/giangdip2410/CompeteSMoE

We would be truly grateful if you could kindly re-assess our novelty and contribution.

References

[1] Transformer Feed-Forward Layers Are Key-Value Memories, EMNLP 2021

 


 

Response to Concern 1

Most of this concern has been addressed in our initial response. Here, we address your question that "the expert with the highest intermediate norm is not guaranteed to have the highest output norm."

A neural network works according to how it is trained. Since AoE is not trained like a traditional MoE, this guarantee is unnecessary. As mentioned in the last paragraph of Sec.3.1, we train AoE to represent its awareness of its capabilities through the norm of the intermediate node. To support this claim, we present a new experiment: during inference, when using Final Output Norms (FON) to select experts in a pre-trained AoE (Config.10), rather than the intermediate norms used during training, there is a performance drop across all tasks:

| Model | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| AoE | 41.16 | 58.32 | 36.80 | 53.04 | 28.37 | 32.78 | 50.61 | 54.59 | 44.46 |
| AoE (FON) | 37.33 | 56.75 | 35.41 | 51.54 | 27.77 | 32.00 | 50.16 | 54.47 | 43.18 |

This is a valuable question, as many readers who are accustomed to the traditional MoE models may also have similar concerns. We will include this response in the paper. Thank you very much.

 


 

Response to Concern 2

The word "decompose" in Line 202 might be ambiguous. We mean that we modify the model architecture, replacing the W_g in traditional MoE with two low-rank matrices. The architecture is illustrated in Fig.1(b), which does not require dynamic adjustment during runtime. We will improve the clarity in Line 202. Thank you for your feedback.
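
To make this concrete, here is a minimal, simplified sketch of the layer (a hypothetical LLaMA-style gated FFN with illustrative names and dimensions, not our actual implementation): W_g is parameterized as two low-rank factors from the beginning, so nothing is decomposed at runtime, and experts are ranked by the norm of their low-dimensional intermediate activations, which is the only part computed for every expert.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AoELayerSketch(nn.Module):
    """Simplified sketch of a router-free AoE layer (illustrative, not the paper's code)."""
    def __init__(self, d_model=1024, d_ffn=4096, d_low=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # W_g is parameterized as two low-rank factors from the start;
        # no full-rank weight is decomposed during or after training.
        self.w_g_down = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_low))
        self.w_g_up   = nn.Parameter(0.02 * torch.randn(n_experts, d_low, d_ffn))
        self.w_p      = nn.Parameter(0.02 * torch.randn(n_experts, d_model, d_ffn))
        self.w_o      = nn.Parameter(0.02 * torch.randn(n_experts, d_ffn, d_model))

    def forward(self, x):                                     # x: (tokens, d_model)
        t = x.size(0)
        # Cheap pre-computation for ALL experts: only the d_low projection.
        low = torch.einsum("td,edh->teh", x, self.w_g_down)   # (tokens, experts, d_low)
        scores = low.norm(dim=-1)                             # rank by intermediate norm
        # (the FON variant above would instead rank by each expert's final output norm)
        gate_w, top_idx = F.softmax(scores, dim=-1).topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):                           # finish the forward pass
            idx = top_idx[:, k]                               # only for selected experts
            lo = low[torch.arange(t), idx]                    # (tokens, d_low)
            h = F.silu(torch.einsum("th,thf->tf", lo, self.w_g_up[idx]))
            h = h * torch.einsum("td,tdf->tf", x, self.w_p[idx])
            out += gate_w[:, k:k + 1] * torch.einsum("tf,tfd->td", h, self.w_o[idx])
        return out
```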

 


 

Response to "Relation To Broader Literature"

MoE is the foundation of advanced LLMs, such as DeepSeek-R1, Qwen-2.5-Max, Grok, etc. A deeper understanding and innovation in MoE could help advance future LLMs. We will highlight the necessity and potential impact of studying MoE in the NLP field.

 


If you have further questions, please leave more comments. We are committed to addressing them. Your assessment is very important to us! Thank you very much!

Review
Rating: 3

The authors introduce a new Mixture of Experts (MoE) paradigm called Autonomy-of-Experts (AoE), where experts independently decide whether to process inputs. The foundation of AoE lies in the understanding that an expert can gauge its ability to effectively handle a token by observing its internal activations. In AoE, routers are eliminated; experts pre-calculate internal activations for inputs and are ranked according to their activation norms. Only the highest-ranking experts continue with the forward pass, while the others are terminated. The overhead associated with pre-computing activations is mitigated through low-rank weight factorization. This method of self-assessment followed by comparison with peers leads to enhanced expert selection and effective learning.

Questions for Authors

See weaknesses above.

Claims and Evidence

Yes, the claims appear to be substantiated by the evidence.

Methods and Evaluation Criteria

Yes

Theoretical Claims

No theoretical analysis

Experimental Design and Analysis

The experimental design and analysis appear to be valid.

Supplementary Material

Yes, skimmed through the supplementary material and code.

Relation to Broader Literature

The contributions are contextualized properly in the context of broader scientific literature. The idea of using internal activations of experts in an efficient manner for pre-training is novel in the context of Sparse MoEs.

Essential References Not Discussed

Related works that are essential to understanding the (context for) key contributions of the paper are discussed.

Other Strengths and Weaknesses

Strengths:

  • The paper is written clearly
  • The idea of using internal activations of experts in an efficient manner is novel
  • The pre-training experiments and ablation studies are thorough.

Weaknesses:

  • Given there is a 3% difference in throughput between MoE and AoE, it would be interesting to allow the baselines to pre-train for additional steps (for a fair comparison in wall-clock time) and to see whether some of the gains still hold.
  • Typically, multiplicative noise is added to the input before computing gate scores e.g., in the Switch Transformer paper, which can improve performance. Can the authors compare against that version of the gate?

Other Comments or Suggestions

None

Author Response

We sincerely thank you for your constructive suggestions and valuable comments! We hope our rebuttal helps address your concerns. If so, we would be grateful if you could consider increasing the overall recommendation.  


 

Should the baselines be allowed to pre-train for additional steps?

We would like to clarify that in our paper, we ensured fairness by using the same model size and training on the same number of tokens. The difference in throughput (tokens processed per second) results in slightly longer training times for AoE models but does not indicate that AoE has learned additional knowledge. If the baselines were trained for additional steps, they would acquire knowledge that AoE does not. If you have any further experimental suggestions, we would be happy to discuss or explore them!

 


 

To train Switch Transformer with multiplicative noise

Thank you for your valuable advice. We trained a traditional MoE model with multiplicative jitter noise (0.01) applied to the layer inputs, using 100B tokens with L_aux. As shown by the Switch Transformer, jitter noise encourages expert exploration, which is beneficial for MoE models, but it still cannot outperform AoE models with L_aux.

| Model | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| MoE (Config.2) | 40.74 | 58.49 | 36.13 | 51.30 | 28.11 | 32.67 | 50.23 | 51.83 | 43.68 |
| MoE (Config.2 + noise) | 40.95 | 58.43 | 36.75 | 51.07 | 28.34 | 32.96 | 50.01 | 51.72 | 43.78 |
| AoE (Config.10) | 41.16 | 58.32 | 36.80 | 53.04 | 28.37 | 32.78 | 50.61 | 54.59 | 44.46 |

We will include these experiments in our paper. Thank you again for your helpful suggestions!
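
For reference, a minimal sketch of multiplicative jitter noise on router inputs in the Switch Transformer style (the eps value and shapes are illustrative, and this is not our exact training code):

```python
import torch
import torch.nn.functional as F

def jittered_router_probs(x, w_router, eps=0.01, training=True):
    """Sketch of Switch Transformer-style multiplicative jitter on router inputs.

    x: (tokens, d_model), w_router: (d_model, n_experts), eps: jitter amplitude.
    """
    if training:
        # Scale each input element by a factor drawn uniformly from [1 - eps, 1 + eps].
        x = x * torch.empty_like(x).uniform_(1.0 - eps, 1.0 + eps)
    return F.softmax(x @ w_router, dim=-1)

# Usage sketch: probs = jittered_router_probs(h, w_router, training=model.training)
```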

Review
Rating: 3

This paper introduces Autonomy-of-Experts (AoE), a novel approach to Mixture-of-Experts (MoE) models that addresses a critical issue in traditional MoE architectures: the separation between routing decisions and expert execution. In traditional MoE, a router decides which experts process which inputs, creating potential misalignment between routing decisions and expert capabilities.

The core innovation of AoE is to eliminate the router entirely and allow experts to autonomously determine whether they should process inputs. I quite like the authors' initial findings on existing models that justify the development of AoE, and also the toy illustration of AoE behaviors in the appendix, which is very cute.

Overall, the authors justify the claims by extensive pre-training and ablation experiments.

Questions for Authors

See above.

Claims and Evidence

The authors' claims are quite clear and supported by convincing evidence.

Methods and Evaluation Criteria

The methods proposed in the paper are appropriate for addressing the identified problem:

  1. The authors first validate their core insight through preliminary experiments on pre-trained MoE models before developing their full AoE approach.
  2. The low-rank factorization to reduce computational overhead is well-motivated and effectively implemented.
  3. The evaluation criteria are appropriate.

Theoretical Claims

The paper does not present formal theoretical proofs but provides conceptual explanations for why AoE is effective.

Experimental Design and Analysis

See above.

Supplementary Material

The supplementary material includes additional experimental details on alternative expert selection metrics, pre-training with alternative expert selection strategies, and a toy example to provide additional interpretation of AoE's advantages. This material provides useful context and further validates the authors' claims.

Relation to Broader Literature

The paper appropriately situates AoE within the broader MoE literature.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Overall, I find this paper presents a solid technical improvement, though I have a few remaining questions:

  1. The computational overhead analysis focuses on throughput but could discuss training time more explicitly.

  2. While AoE outperforms traditional MoE, the performance gains on the 732M parameter models are relatively modest on some tasks. Can you explain it?

  3. The paper could benefit from more discussion on whether the insights from AoE could be incorporated into traditional MoE architectures without completely replacing the router.

  4. The paper mentions that a smaller d_low may be a lossy approximation when it is below the true rank of W_g. However, d_low seems to have no effect on the final performance of the 732M model.

  5. What about the uncertainty of the accuracies in the tables?

Other Comments or Suggestions

N/A

Author Response

We sincerely thank you for your constructive suggestions and valuable comments! We hope our rebuttal helps address your concerns. If so, we would be grateful if you could consider increasing the overall recommendation.  


 

To discuss training time more explicitly

Here are the total machine hours (1 machine = 8 GPUs):

| Model | Machine Hours |
|---|---|
| MoE | 72.73 |
| AoE (d_low=64) | 76.15 |
| AoE (d_low=128) | 76.58 |
| AoE (d_low=256) | 77.56 |
| AoE (d_low=512) | 81.34 |

We will include these results. Thank you for the feedback.

 


 

Small models' performance gains are relatively modest on some tasks

There are two key factors to consider:

  1. Larger models are stronger. With the same number of tokens, a large AoE shows more noticeable gains over MoE.

  2. The "spiral rise" in performance during pre-training. Both AoE and MoE show steady overall improvements, but task-specific performance fluctuates across checkpoints. We show results at 50B, 80B, and 100B tokens (AoE: Config.10, MoE: Config.2) to illustrate this.

    For example, at 100B tokens, MoE slightly outperforms AoE on PIQA due to a larger gain from 80B to 100B. A similar trend occurs with HELLA from 50B to 80B, though AoE regains the lead at 100B. Despite fluctuations, AoE consistently achieves higher AVG. scores across checkpoints.

Task Performance over tokens

| Tokens | Model | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|---|
| 50B | MoE | 38.76 | 56.26 | 35.67 | 50.12 | 27.34 | 33.05 | 50.23 | 50.80 | 42.78 |
| 50B | AoE | 39.27 | 57.13 | 35.98 | 51.70 | 27.40 | 33.32 | 50.87 | 50.34 | 43.25 |
| 80B | MoE | 40.45 | 57.45 | 36.39 | 50.20 | 27.93 | 33.32 | 50.28 | 50.57 | 43.32 |
| 80B | AoE | 41.79 | 57.67 | 36.44 | 51.30 | 27.83 | 34.51 | 50.38 | 51.49 | 43.93 |
| 100B | MoE | 40.74 | 58.49 | 36.13 | 51.30 | 28.11 | 32.67 | 50.23 | 51.83 | 43.68 |
| 100B | AoE | 41.16 | 58.32 | 36.80 | 53.04 | 28.37 | 32.78 | 50.61 | 54.59 | 44.46 |

We'll include detailed task accuracy development plots to highlight the consistent advantages of 732M AoEs.

 


 

Can AoE insights benefit traditional MoEs?

While AoE shows that routers can be removed, we tried using expert norms as labels for router training. The router receives gradient-detached inputs to avoid interfering with AoE learning. During inference, expert selection is performed solely by the router to reduce memory usage. This setup led to a performance drop (50B tokens, d_low = 256, with L_aux):

Task Performance

| 50B tokens | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| MoE (Config.2) | 38.76 | 56.26 | 35.67 | 50.12 | 27.34 | 33.05 | 50.23 | 50.80 | 42.78 |
| AoE w. router | 38.24 | 56.37 | 35.88 | 51.30 | 27.36 | 32.32 | 50.27 | 50.11 | 42.73 |
| AoE (Config.10) | 39.27 | 57.13 | 35.98 | 51.70 | 27.40 | 33.32 | 50.87 | 50.34 | 43.25 |

These results suggest that—even with supervision—routers struggle to learn effective expert selection, highlighting the limitation of separating routers and experts. We will discuss this in the paper and hope it inspires further research. Thank you for the thoughtful question.
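
A simplified sketch of this auxiliary-router setup (illustrative names and shapes, not our exact implementation): the router sees gradient-detached inputs and is trained to match the experts' norm-based ranking.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormSupervisedRouter(nn.Module):
    """Sketch: a router trained to imitate AoE's norm-based selection (illustrative only)."""
    def __init__(self, d_model, n_experts):
        super().__init__()
        self.proj = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x):
        return self.proj(x)                       # (tokens, n_experts) router logits

def router_distillation_loss(router, x, norm_scores):
    """norm_scores: (tokens, n_experts) intermediate-activation norms from the experts."""
    # Detach inputs and targets so router training does not interfere with AoE learning.
    logits = router(x.detach())
    target = F.softmax(norm_scores.detach(), dim=-1)
    return F.kl_div(F.log_softmax(logits, dim=-1), target, reduction="batchmean")

# At inference, expert selection in this variant uses only the router:
#   top_idx = router(x).topk(k, dim=-1).indices
```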

 


 

Effects of d_low

When d_low = 64 or 128, AoE with L_aux results in lower performance compared to d_low = 256 (Config.10). Additionally, as shown in Figure 2, d_low = 64 results in slower convergence, similar to traditional MoEs.

We also tested an extreme case where d_low = 8. In this case, the approximation of W_g is too lossy, leading to poor performance and high loss. Due to limited time, we trained with only 50B tokens. We will include this experiment and make our statement more accurate. Thank you for your valuable feedback!

Task Performance

| 50B tokens | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| MoE (Config.2) | 38.76 | 56.26 | 35.67 | 50.12 | 27.34 | 33.05 | 50.23 | 50.80 | 42.78 |
| AoE (d_low=8) | 37.67 | 56.31 | 34.80 | 50.36 | 27.21 | 32.27 | 50.08 | 50.34 | 42.38 |
| AoE (Config.10) | 39.27 | 57.13 | 35.98 | 51.70 | 27.40 | 33.32 | 50.87 | 50.34 | 43.25 |

Loss over Tokens

| Tokens | MoE | AoE (d_low=8) | AoE (Config.10) |
|---|---|---|---|
| 10B | 3.59 | 3.70 | 3.59 |
| 20B | 3.09 | 3.17 | 3.08 |
| 30B | 2.92 | 3.02 | 2.91 |
| 40B | 2.84 | 2.91 | 2.82 |
| 50B | 2.77 | 2.86 | 2.75 |

 

The uncertainty of the accuracies

First, we note that AoE outperforms MoE across various setups presented in our paper, with the testing process using greedy decoding to eliminate generation randomness.

For each configuration, we aggregated the outputs from all tasks into a sequence of 1s (correct) and 0s (incorrect), paired with the outputs of traditional MoE (Config. 2), and performed a McNemar test. The improvement of AoE over MoE is significant (p < 0.05) across all configurations in Tables 2, 3, and 4. We will highlight the significance of this improvement.
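
A small sketch of this paired test (using the chi-square approximation of McNemar's test with continuity correction; variable names are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_pvalue(correct_a, correct_b):
    """McNemar test on paired 0/1 correctness vectors (chi-square approximation)."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n01 = int(np.sum(a & ~b))              # model A correct, model B wrong
    n10 = int(np.sum(~a & b))              # model A wrong, model B correct
    if n01 + n10 == 0:
        return 1.0                         # no discordant pairs: no evidence of a difference
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)   # continuity-corrected statistic
    return float(chi2.sf(stat, df=1))

# Usage sketch: p = mcnemar_pvalue(aoe_correct, moe_correct); significant if p < 0.05
```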

Review
Rating: 3

This paper proposes a new scheme to select expert sublayers in a mixture-of-experts (MoE) language model. Rather than employing a router layer to choose the expert that processes each incoming embedding, the method utilizes a factorized subcomponent of the feed-forward layer to calculate an importance score (the "norm" in the paper) that is fed to a softmax and top-k to choose the experts to be activated. Experiments show that the proposed method outperforms the traditional MoE with a router sublayer, with better outcomes when a load-balancing loss is used.

update after rebuttal:

Thank you for your comments on my review! The additional results would be useful for discussing this study in more detail. However, I don't have a strong reason to revise my overall judgment, so I will leave the overall score as it is.

Questions for Authors

(1) Please add experiments comparing other expert selection mechanisms. (2) The proposed method introduces factorization of the gate layer in the expert. Could you show some results when another subpart is factorized (especially W_p), or with other trivial tweaks (e.g., using only the first n elements of the intermediate layer to calculate norms)?

Claims and Evidence

The claim of the paper is based on the assumption that the separation between each expert and the router is harmful. The paper says this characteristic is "overlooked", but it contains essentially only empirical results comparing downstream scores for the proposed model, so it is not clear to me whether the issue the paper focuses on is actually important.

Methods and Evaluation Criteria

The proposed method is very intuitive, and it should work once a large number of training iterations have been processed (as with other MoE models). Factorizing the gate layer in the expert loses some information, but it is an acceptable way to obtain a low-dimensional vector for score calculation if we rely on the references cited at the top right of p.4. Table 2 also shows its impact is negligible.

Theoretical Claims

The proposed method is purely empirical, based on the hypothesis that the expert sublayer itself is better at selecting which tokens to process.

Experimental Design and Analysis

Evaluations only compare the proposed method with traditional MoE. Though the results favor the proposed method, it is still unclear whether it also outperforms other strategies for selecting experts (e.g., those cited in the last paragraph of Section 2).

Supplementary Material

I briefly checked whether the appendices contain comparisons with other methods, but found none.

Relation to Broader Literature

Not sure

Essential References Not Discussed

Not sure

Other Strengths and Weaknesses

NA

Other Comments or Suggestions

NA

Author Response

We sincerely thank you for your constructive suggestions and valuable comments! We hope our rebuttal helps address your concerns. If so, we would be grateful if you could consider increasing the overall recommendation.  


 

Comparison with more MoE works

We trained the Hash-Layer model (Roller et al.) on 100B tokens, using the same hyperparameters and model setup as our other models. Unlike other methods cited in the last paragraph of Section 2, which require domain-labeled data, Hash-Layer does not—hence our decision to include only this method for comparison. Here are the results:

Task Performance

| 100B tokens | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| MoE (Config.2) | 40.74 | 58.49 | 36.13 | 51.30 | 28.11 | 32.67 | 50.23 | 51.83 | 43.68 |
| Hash-Layer | 41.12 | 57.62 | 36.18 | 50.91 | 28.82 | 32.54 | 50.30 | 51.03 | 43.57 |
| AoE (Config.10) | 41.16 | 58.32 | 36.80 | 53.04 | 28.37 | 32.78 | 50.61 | 54.59 | 44.46 |

Our results align with Roller et al. (p.7), who note that "Switch Transformer and Hash perform similarly in multi-layer routing." Hash-Layer's key advantage lies in its balanced load distribution.

Our AoE model not only outperforms both Hash-Layer and traditional MoE in downstream tasks, but also demonstrates improved load balancing compared to traditional MoE.
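
For context, a rough sketch of the Hash-Layer routing idea from Roller et al. (a fixed, non-learned hash of the token id picks the expert; the vocabulary size and seed below are illustrative, not the exact configuration we trained):

```python
import random

def build_hash_table(vocab_size=32000, n_experts=8, seed=0):
    """Fixed, random-but-deterministic token-id -> expert lookup (Hash-Layer style sketch)."""
    rng = random.Random(seed)
    return [rng.randrange(n_experts) for _ in range(vocab_size)]

def hash_route(token_ids, table):
    # No routing parameters are learned; each token id always maps to the same expert.
    return [table[t] for t in token_ids]

# Usage sketch:
# table = build_hash_table()
# expert_ids = hash_route([17, 502, 17], table)   # both occurrences of 17 go to one expert
```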

 


 

Please compare other expert selection mechanisms

Thank you for the suggestion. We have added a new baseline—the Hash-Layer—as noted above. In Table 2, we report results using the top-K token choice, while Table 3 includes experiments with top-P token choice and top-K expert choice selection mechanisms. We hope these results address your concern.

If you have further suggestions for additional baselines, we’d be glad to explore them or discuss more!

 


 

Try factorizing W_p or other trivial tweaks

Thank you for your insightful question. We trained new models by factorizing W_p instead of W_g (d_low = 256, using 100B tokens with L_aux), but observed a performance drop.

Task Performance

| 100B tokens | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| MoE (Config.2) | 40.74 | 58.49 | 36.13 | 51.30 | 28.11 | 32.67 | 50.23 | 51.83 | 43.68 |
| AoE (Config.10, factorized W_p) | 40.95 | 58.22 | 36.54 | 50.67 | 28.08 | 32.97 | 50.21 | 53.67 | 43.91 |
| AoE (Config.10) | 41.16 | 58.32 | 36.80 | 53.04 | 28.37 | 32.78 | 50.61 | 54.59 | 44.46 |

We have a few hypotheses regarding the performance drop:

  1. Factorizing W_p might create a bottleneck in this branch, while the activation function in the W_g branch may already act as a bottleneck. Since both pathways in the FFN are constrained in this case, it could lead to the performance drop. If this is the case, we would suggest always factorizing W_g in AoE or encourage future work to further develop expert architectures.

  2. Alternatively, the optimal value for d_low might change when factorizing W_p.

Due to time constraints, we cannot provide a definitive explanation at this moment. However, these aspects of model architecture are valuable areas for future research, and we will discuss this point in the paper.

We also experimented with computing norms using only the first n elements of the intermediate layer (n = 256), training with 50B tokens due to time constraints. Compared to standard AoE (d_low = 256) and MoE, this method underperformed because of insufficient activation information for expert selection:

Task Performance

| 50B tokens | ARC-E | PIQA | SIQA | WINO | HELLA | MNLI | QNLI | SST2 | AVG. |
|---|---|---|---|---|---|---|---|---|---|
| MoE (Config.2) | 38.76 | 56.26 | 35.67 | 50.12 | 27.34 | 33.05 | 50.23 | 50.80 | 42.78 |
| AoE (Split, n=256) | 39.14 | 55.71 | 36.03 | 50.75 | 27.55 | 33.05 | 50.27 | 49.31 | 42.72 |
| AoE (Config.10) | 39.27 | 57.13 | 35.98 | 51.70 | 27.40 | 33.32 | 50.87 | 50.34 | 43.25 |

We will include these points in the paper, as they offer insights that may be useful for future research. Thank you again for your insightful question!
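
For clarity, one possible reading of the "split" variant above as a sketch: experts are ranked by the norm of only the first n channels of the (unfactorized) W_g activation, rather than by a learned d_low projection. Names and shapes are illustrative.

```python
import torch

def split_selection_scores(x, w_g_all, n=256):
    """Sketch: rank experts by the norm of the first n intermediate channels.

    x: (tokens, d_model); w_g_all: (n_experts, d_model, d_ffn), the full gate weights.
    """
    low = torch.einsum("td,edh->teh", x, w_g_all[:, :, :n])   # (tokens, experts, n)
    return low.norm(dim=-1)                                    # (tokens, experts) scores
```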

Final Decision

Paper #2647. This paper proposes a new scheme to select expert sublayers in the mixture-of-experts (MoE) language model. Rather than employing a router layer to choose the expert that processes each incoming embedding, the method utilizes a factorized subcomponent of the feed-forward layer to calculate an importance score (the "norm" in the paper) that is fed to a softmax and top-k to choose the experts to be activated. [Reviewer yvWh] The core innovation of AoE is to eliminate the router entirely and allow experts to autonomously determine whether they should process inputs. I quite like the authors' initial findings on existing models that justify the development of AoE, and also the toy illustration of AoE behaviors in the appendix. [Reviewer 44vv] In AoE, routers are eliminated; experts pre-calculate internal activations for inputs and are ranked according to their activation norms. Only the highest-ranking experts continue with the forward pass, while the others are terminated. [Reviewer BPoe] The authors conduct pre-training experiments to investigate various properties of AoE and demonstrate promising results compared to conventional MoE. [Reviewer JD26]

Though the results favor the proposed method, it is still unclear whether it is also better than other strategies for selecting experts (e.g., those cited in the last paragraph of Section 2). [Reviewer yvWh] The paper does not present formal theoretical proofs but provides conceptual explanations for why AoE is effective. [Reviewer 44vv]

All reviewers acknowledged the authors' rebuttals.