PaperHub
Overall: 7.3/10
Poster · 4 reviewers (min 4, max 5, std 0.5)
Ratings: 5, 4, 4, 5
Confidence: 4.5
Novelty: 3.3 · Quality: 3.3 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

OpenReview · PDF
Submitted: 2025-05-08 · Updated: 2025-10-29

Abstract

Keywords
moe · mixture of experts · router

Reviews and Discussion

Review
Rating: 5

This paper presents an improvement for training MoE routers by including the moving-average output of non-selected experts in the router gradient calculation. This better approximates the dense gradients, particularly at lower layers, and leads to improved convergence and/or test accuracy. Validation is performed on MoE models of around 2B params (500M active params) using FineWeb and a suite of language-model benchmark tasks, finding good improvements in accuracy and perplexity across training lengths.

Strengths and Weaknesses

This is a simple idea that is explained well and demonstrated to be effective. It's only compared to Top-K in an apples-to-apples comparison, but it is quite effective at boosting performance there (they mention the unrelated globally reduced load balancing loss as the reason comparison systems underperform, but I appreciate the candidness of the explanation here). Similarity plots in the ablations between expert averages and the approximated vs. true dense gradient show some of the method's behavior nicely.

There are still lots of questions I have around the method's behavior and its potential interactions with other selection or balancing methods when used in combination. For example, what happens if additional load balancing constraints are imposed while this gradient approximation is used? Do they work together, or does one lessen the need for the other? And since gradient error is introduced as a problem, is it possible to know more about its impact, and the impact of non-zero-mean vs. zero-mean error in particular, other than just better/worse performance? I've listed these and a few more in the questions section below. While some aspects of the method's performance and behavior were demonstrated well, I think many others remain unexplored.

Questions

  • While distances between the means are shown in the paper, I'm curious about the distribution of distances of each actual output to each mean (including the distance to the expert that produced that output) --- how compact or diffuse are the clusters? What are the distances of each expert's outputs to each avg?

  • This zero-centers the error of router gradients for non-selected experts. But what are the impacts of non-zero-centered error, other than just worse performance? In a case where all expert outputs are near their means, as a possibly illustrative case, what does the error from zeroing experts do to the update (not just the grad), and what behaviors can this contribute to in terms of future expert selection, selection collapse, etc?

  • How does this interact, if at all, with other expert selection tricks like imposed balancing constraints or stochastic selection/gradient estimators?

  • The average is computed only over x for which the expert is selected. But then it's being used in the grad approximation for inputs in which it is not selected --- and its behavior could be very different for these x. I'm not sure how much this matters, other than maybe the possibility to tend to flatten the router distribution, since the inputs that are most affected are those closest to being selected by the expert anyway, and thus closest to its average expectation's distribution. It's a bit of a fine point, though, and I wonder what you think about this seeming inconsistency?

  • Relatedly, for x that are close to the boundary of enabling an expert from its mean comparison, if the new expert is enabled, how close are the actual outputs to the mean? In particular, if we calculate the distance between the output x would have had for expert K+1 (sorted by softmax score), how does it compare to distance for experts 1..K, and in particular expert K, which K+1 will replace if selected? Are these close, so that there could be oscillation at the edges? And what are the effects of this --- it could smooth expert selection, but at some cost to specialization, or it could induce more clustering around the means to improve specialization, but right now the precise behavior is unclear.

  • As mentioned above, to me it seems this has the potential to flatten the router distribution. Though this doesn't affect top-k, it might impact other selection methods (though temperature could be adjusted). Have you seen anything like this?

  • l.187 "we can actually automatically account for the impact of sparsity and 188 granularity on the default vector update by weighting the updates to the default vector by the router 189 logits." --- This seems important and interesting enough to go in the main text in method description. I can imagine some sort of heuristic that uses a different parameter indicating the weight of the ema update per overall training token, then calculates beta depending on how often the expert is selected. I actually haven't found a description of this yet looking over the appendix, either -- how does this work?

Smaller questions/comments:

  • Paragraph at l.240 about globally reduced load balancing loss: This paragraph didn't make any sense to me, but seems important. I think it could be more clearly written, with more concrete descriptions of what distributions or behaviors change and in which directions, with a small summary of the prior work and its effects.

  • Fig 2d: 1/64 sparsity --- this jumped out as strange, but it is explained in the caption that it's for a different smaller model. If it isn't a comparable point, I think it would make sense to either leave it out or at least separate it and better indicate in the figure that it's for a different model

  • Table 2: in row for PubMedQA, 54.8_2.2 appears exactly in 3 of the 4 score cols --- is this correct or is there an error here building the table?

Limitations

Final Justification

This is a simple idea with demonstrated effectiveness. Most of the questions in my initial review were answered well.

Formatting Concerns

Author Response

Thank you for your detailed review and insightful questions. We address each point below:

How compact or diffuse are the clusters? What are the distances of each expert's outputs to each avg?

Since the router's conditional entropy decreases over time, we know that expert clusters become more compact (since each token's distance to other experts decreases). We will measure the exact distances over the course of training and include a plot of this trend in the camera ready revision.

But what are the impacts of non-zero-centered error, other than just worse performance?

When the gradient error has a nonzero expected value, the router only receives feedback from the expert(s) it selected, regardless of whether that was the optimal decision or other expert(s) were better choices. As an example, assume the router selects expert 2 and receives a positive gradient signal. This will lead the router to increase expert 2's weight so as to reinforce that selection. Assume also that had the router selected expert 1, it would have received a positive gradient signal twice as strong. In fact, expert 1 would be the optimal expert for the token, and if the router received this additional information it would adjust its decision accordingly.

In this example, the absence of gradient signal from expert 1 is a component of the error. The TopK router would promote only expert 2 because it received a positive gradient signal. But this error prevents the router from seeing that expert 1 is in fact a better option. When we adjust the gradient to be unbiased, we remove this pitfall of the standard TopK router.

How does this interact, if at all, with other expert selection tricks like imposed balancing constraints or stochastic selection/gradient estimators?

Our implementation is designed to be flexible and to add on to any existing MoE routing method. For example, we experimented with combining Default MoE with SparseMixer. We were able to run this experiment successfully, although we did not evaluate this or any other combination of methods in a large-scale training run.

The average is computed only over x for which the expert is selected. But then it's being used in the grad approximation for inputs in which it is not selected. I wonder what you think about this seeming inconsistency?

This is an interesting point and we agree with your intuition regarding the closeness of inputs to an expert. The gradient signal supplied by the default vector has the strongest effect for inputs that have a high probability for that given expert. In other words, the effect of the default vector is proportional to how "close" that input is to the expert's domain. While there is a slight inconsistency in also applying this default vector to "farther" inputs, the effect there is much less pronounced.

As for flattening the router distribution, we don't believe this is a strong cause for concern because we observe empirically that conditional entropy gradually decreases (i.e. this distribution sharpens) over the course of training.

If the new expert is enabled, how close are the actual outputs to the mean? Could there be oscillation at the edges?

Related to your first comment, the closeness of outputs to the mean depends on the time step. Expert clusters become compact over time as experts specialize, so the hypothetical output if a token were routed to an alternate expert grows farther from the true mean expert output. But this is not an issue in practice because the default vector will have less of an effect on the input's routing decision as it grows farther away. In the camera ready, we will include a figure depicting the measured distances between the expert output of an input on the boundary and the true mean expert output.

We do not observe significant oscillation effects in routing; see our response to Reviewer 3 regarding router fluctuation.

We also evaluated expert specialization (in the Appendix, Figs 14-25) between TopK and Default MoE and found no significant difference. So, Default MoE does not harm expert specialization.

As mentioned above, to me it seems this has the potential to flatten the router distribution. Have you seen anything like this?

We tracked conditional router entropy, which is positively associated with a flat router distribution. In all of our experiments we observed a monotonically decreasing router entropy, and did not see any signs of a flattened router distribution.
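For concreteness, a small sketch of how per-token (conditional) router entropy could be tracked from the router's softmax outputs; the function name and tensor layout are illustrative assumptions, not the paper's exact instrumentation:

```python
import torch

def conditional_router_entropy(router_probs: torch.Tensor) -> torch.Tensor:
    """Average entropy of the router distribution, conditioned on the token.

    router_probs: (num_tokens, num_experts) softmax outputs of the router.
    Lower values indicate a sharper (less flat) routing distribution.
    """
    entropy = -(router_probs * torch.log(router_probs.clamp_min(1e-9))).sum(dim=-1)
    return entropy.mean()
```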

"we can actually automatically account for the impact of sparsity and granularity on the default vector update by weighting the updates to the default vector by the router logits." I actually haven't found a description of this yet looking over the appendix, either -- how does this work?

The following equation expresses how we weight the default vector update using the router logits. In doing so, we reduce the method's sensitivity to the $\beta$ hyperparameter, as shown in Figure 8 in the paper. We recognize that this is an important detail regarding the practicality of our method, and we will add an additional explanation in Section 3 to clarify this.

$$\hat E_i^{(t)} = \beta \hat E_i^{(t-1)} + (1-\beta)\,\frac{1}{K}\sum_{j\in R_i^{(t)}} S_{ij}\, E_i(x_j)$$

where $R^{(t)}\in\mathbb{N}^{B\times K}$ denotes the indices of the $K$ routed experts for each of the $B$ batch tokens at training step $t$, and $S\in\mathbb{R}^{B\times N}$ denotes the router weights corresponding to all $N$ experts for each token.
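A minimal PyTorch-style sketch of this weighted EMA update for a single expert $i$, using hypothetical tensor names (`expert_out` for the routed tokens' outputs $E_i(x_j)$, `router_weight` for the corresponding entries of $S$); this is an illustration of the equation above rather than the authors' exact code:

```python
import torch

def update_default_vector(default_vec, expert_out, router_weight, beta, K):
    """EMA update for one expert's default vector, weighted by router weights.

    default_vec:   (d,)   current EMA estimate \\hat{E}_i^{(t-1)}
    expert_out:    (m, d) outputs E_i(x_j) for the m tokens routed to expert i this step
    router_weight: (m,)   router weights S_{ij} for those tokens
    beta:          EMA decay
    K:             number of experts activated per token
    """
    # (1/K) * sum_j S_ij * E_i(x_j), as in the equation above
    weighted_sum = (router_weight.unsqueeze(-1) * expert_out).sum(dim=0) / K
    return beta * default_vec + (1 - beta) * weighted_sum
```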

Paragraph at l.240 about globally reduced load balancing loss: This paragraph didn't make any sense to me, but seems important.

Traditionally, the load balancing loss is computed on each individual GPU's microbatch, using the microbatch's routing counts. The gradients from the load balancing loss are then all-reduced at the optimizer step. Alternatively, the globally reduced load balancing loss first all-reduces the routing counts, then applies the same "global" gradient to each microbatch. To see why this is important we present a motivating example:

Assume we are training with 32 GPUs and 32 experts. Also assume that on each GPU $i$, expert $i$ receives 131 tokens and all other experts receive 99 tokens. For example, the microbatch on GPU 0 has local routing counts $[131, 99, 99, \dots, 99]$.

The global routing counts are $[3200, 3200, \dots, 3200]$, which means the distribution is perfectly uniform. So, the global load balancing loss should be zero. But if we calculate the load balancing loss on each GPU individually, the total loss will be nonzero because each individual microbatch is imbalanced.

In practice, this is problematic because individual microbatches' routing counts are noisy due to the small sample size. At a large enough scale, we end up overestimating the penalty applied for load balancing. The globally reduced load balancing loss avoids this by first all-reducing the routing counts and then applying the loss. See Section 3.3 of GRIN MoE (arXiv:2409.12136) for more details. We will include a similar explanation in the camera ready revision.
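A small stand-alone simulation of this example, using a simplified imbalance penalty (squared deviation of per-expert token fractions from uniform, not necessarily the exact auxiliary loss in the paper), shows why the per-microbatch losses are nonzero even though the global distribution is uniform:

```python
import torch

num_gpus, num_experts = 32, 32

# Local routing counts: on GPU i, expert i gets 131 tokens, all others get 99.
local_counts = torch.full((num_gpus, num_experts), 99.0)
local_counts[torch.arange(num_gpus), torch.arange(num_gpus)] = 131.0

def imbalance(counts):
    # Simplified penalty: squared deviation of per-expert token fractions
    # from the uniform fraction 1/N (not the paper's exact auxiliary loss).
    frac = counts / counts.sum()
    uniform = torch.full_like(frac, 1.0 / frac.numel())
    return ((frac - uniform) ** 2).sum()

# Per-microbatch (per-GPU) losses are all nonzero ...
local_loss = torch.stack([imbalance(c) for c in local_counts]).mean()

# ... but the globally reduced counts are perfectly uniform, so the penalty vanishes.
global_counts = local_counts.sum(dim=0)  # [3200, 3200, ..., 3200]
global_loss = imbalance(global_counts)

print(local_loss.item() > 0, global_loss.item() == 0)  # True True
```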

Fig 2d: 1/64 sparsity --- I think it would make sense to either leave out or at least separate and better indicate in the figure it's for a different model

We agree, and thank you for bringing up this point of confusion. To clarify, this data point is for a larger (7 billion parameter) model compared to the other points corresponding to 2 billion parameter models. We will separate this individual result from the rest of the figure for a clearer comparison.

Table 2: in row for PubMedQA, 54.8_2.2 appears exactly in 3 of the 4 score cols --- is this correct or is there an error here building the table?

We checked the OpenReview submission; actually, in the first column, the PubMedQA score is 54.8_2.2, but in the third and fourth columns, it is 58.4_2.2. In this case, the DefaultMoE and TopK MoE got the same score for the 32c4 MoE configuration. (We went back and checked the LM-Eval-Harness outputs, and the stderrs are actually different, 2.2227 for the first column and 2.2064 for the third and fourth column) Please let us know if this is what you see when you look at the PDF on OpenReview.

Thank you very much for your thoughtful read of our paper. Your questions will make great additions to the paper, and we look forward to reading your response.

Comment

Thanks for the responses. They address most of my questions well.

The globally reduced load balancing loss explanation in your comment here is much clearer. Also for Table 2, thanks for checking and including the additional error numbers. For means, 54.8, 58.4, 58.4 is also what I see, I was mistaken about the first number in my original comment.

Review
Rating: 4

This work investigates the problem of ineffective learning of non-selected experts in Sparse Mixture-of-Experts (SMoE) training. The authors note that in traditional SMoE, the gradients of non-selected experts are zero, and propose to replace them with a running average of the experts' activations, which yields an unbiased estimator. This results in DefaultMoE, which is shown to achieve better performance than traditional TopK routing on the pre-training task with two architectural choices.

Strengths and Weaknesses

Strengths

  • Overall this paper is easy to follow, the method is well-motivated and justified. The empirical study is also comprehensive and convincing.

Weaknesses

  • Cross-validating $\beta$ could be costly.
  • Although it is nice to see the authors compared 2 architectural variants, I would prefer to see a few more routing strategy baselines such as XMoE [A], MoEUT [B], or AoE [C]

[A] Chi, Zewen, et al. "On the representation collapse of sparse mixture of experts." Advances in Neural Information Processing Systems 35 (2022): 34600-34613.

[B] Csordás, Róbert, et al. "MoEUT: Mixture-of-experts universal transformers." NeurIPS 2024

[C] Lv, Ang, et al. "Autonomy-of-Experts Models." ICML 2025

Questions

Suggestion: Together with Figure 11, it would be nice to plot the average zero-shot evaluation scores throughout training.

Limitations

Yes.

Final Justification

Overall a solid work, and the author addressed my concerns in the rebuttal. Therefore, I recommend acceptance.

Formatting Concerns

N/A

Author Response

Thank you for providing these comments and suggestions. We are glad that you found our paper convincing and easy to follow.

Tuning $\beta$ could be costly

As we note in Line 184, we don't actually need to tune $\beta$, because when the EMA update includes the router weights the method becomes insensitive to its value. The following equation expresses how we weight the default vector update using the router logits. In doing so, we reduce the method's sensitivity to the $\beta$ hyperparameter, as shown in Figure 8 in the paper. We recognize that this is an important detail regarding the practicality of our method, and we will add an additional explanation in Section 3 to clarify this.

$$\hat E_i^{(t)} = \beta \hat E_i^{(t-1)} + (1-\beta)\,\frac{1}{K}\sum_{j\in R_i^{(t)}} S_{ij}\, E_i(x_j)$$

where $R^{(t)}\in\mathbb{N}^{B\times K}$ denotes the indices of the $K$ routed experts for each of the $B$ batch tokens at training step $t$, and $S\in\mathbb{R}^{B\times N}$ denotes the router weights corresponding to all $N$ experts for each token.

Compare to XMoE, MoEUT, and AoE

Thank you for the useful related works; we have started the process of comparing against these methods, and will include the results in the camera ready, along with updating our open-source codebase to include the code of our reimplementations.

Plot eval scores throughout training

We agree with your suggestion to plot the evaluation scores throughout training. We have all the training checkpoints available, and will add the eval benchmarks plotted over time in the camera ready revision. (Please note that we can’t add figures in the rebuttal)

Comment

Thank you for the clarification. I have no further questions.

Review
Rating: 4

Default-MoE addresses the lack of gradient flow through the MoE router's unactivated paths by backpropagating a proxy output for such paths/experts. That default expert output, used for backpropagation because the actual output is unavailable when tokens are not routed to the expert, is a weighted average of historical expert outputs.

Strengths and Weaknesses

Strengths:

  • Clearly written paper with an easy-to-follow flow.
  • Experiments are done in a more realistic scale with FineWeb.
  • In-depth ablation studies and method specifications.

Weaknesses:

  • It would perhaps be clearer to pose the motivation as approximating the truncation error that Sparse MoE incurs with respect to Dense MoE (Eq. 7), in order to achieve Dense MoE's level of performance.
    • Personally, I find the current argument to imply that SMoE is a simple practical truncation of MoE born from an engineering standpoint, while in reality all of the theory (e.g. equations in Sec. 3.1) can treat SMoE as a formal router option.
    • If the goal is to estimate the gradient of the full Dense MoE (L266), a comparison with the Dense MoE variant on perplexity/computational overhead would be relevant.
  • While experiment size is relatively large, diversity is rather lacking.
    • Default-MoE is only compared with Sparse MoE formally.
    • For SparseMixer, only the training perplexity is reported, and only for the first 10B tokens. While it does seem like the test set perplexity would follow the trend of the training curve, actual numbers for a converged training would greatly improve persuasiveness.
    • The same applies with Loss-Free Balancing, but with no plot this time and only a brief mention in L239. While I understand that the wasted resources in reproducing ReMoE would have prevented the replications of other baselines, the paper still needs to be a complete evaluation instead of a list of individual experiments.
  • As router entropy is considered, it would be great if there is also additional analysis on router fluctuation.
  • "Default MoE" naming is easy to be confused with Dense MoE.

Overall I think the paper would be great if all of these concerns are addressed.

Questions

See Weaknesses.

Limitations

Efficiency concerns are listed in Sec 4.6. No societal impact was addressed.

Final Justification

While the paper shows promise and the rebuttal does address some of the concerns, they do not provide any substantial improvement to the original version of the paper. The arguments are very handwavy, with only a few numbers included. For that I believe my evaluation of the paper remains unchanged.

Formatting Concerns

The structure of the paper is rather unconventional, with a lack of emphasis on motivation and no conclusion. Regardless, it is not a dealbreaker.

Author Response

Thank you for your review and constructive feedback. We appreciate you pointing out the scale of our experiments and depth of our ablation studies.

Default MoE should be compared to Dense MoE

We agree with your suggestion to compare Default MoE against the Dense MoE as a baseline. After 10 billion tokens of training, a TopK 8c1 MoE achieves a perplexity of 23.786. A Dense 8c8 MoE achieves a perplexity of 21.499. Our Default MoE (8c1) achieves a perplexity of 23.192. So, after 10 billion tokens, we recover $1 - \frac{23.192 - 21.499}{23.786 - 21.499} \approx 26\%$ of the performance gap between standard TopK and Dense MoE. We again emphasize that the computational cost imposed by this improvement is negligible. We will include this result and a loss curve comparison in the camera ready revision. We note that while 26% of the performance gap may not seem like a lot, approximating Dense MoE's gradient 100% faithfully would not reproduce the same performance as the Dense 8c8 MoE unless all 8 experts were activated, which would incur significant computational overhead.

Experiment diversity

We appreciate your suggestion to include more extensive results with SparseMixer and Loss-Free Balancing. The SparseMixer authors note that their method requires at least 500 billion tokens of training to surpass the TopK baseline (arXiv:2310.00811). While this scale is outside the scope of our compute budget, we can conclude that prior to 500 billion tokens Default MoE will outperform SparseMixer since it already outperforms TopK. We will provide extended training results with Loss-Free Balancing to compare its performance over many more tokens. These results will be included in the camera ready version of our paper. Additionally, we will include a comparison to the related works suggested by Reviewer 2 to further expand the breadth of our evaluation.

Router fluctuation

To address your point about router fluctuation, we empirically measured the similar metric of router saturation proposed in OLMoE (arXiv:2409.02060). Router saturation compares the router decisions for identical token sequences at various training checkpoints, and measures the rate at which these decisions match those of the final trained model. For example, a router saturation of 0.8 at a checkpoint signals that the router's decisions will remain the same for roughly 80% of the tokens throughout training, and will change/fluctuate for the other 20%. When observing router saturation over time, we observed no significant difference between TopK and Default MoE after the earliest steps of training. In other words, the routers in TopK and Default MoE "fluctuate" at similar rates.
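A minimal sketch of how such a saturation metric could be computed, assuming hypothetical tensors `ckpt_topk` and `final_topk` holding the top-K expert indices chosen for the same tokens at a checkpoint and by the final model (an illustration, not OLMoE's exact procedure):

```python
import torch

def router_saturation(ckpt_topk: torch.Tensor, final_topk: torch.Tensor) -> float:
    """Fraction of (token, slot) routing decisions at a checkpoint that agree
    with the final model's decisions, ignoring the order of the K experts.

    ckpt_topk, final_topk: (num_tokens, K) integer expert indices.
    """
    # For each token, check whether each of the checkpoint's chosen experts
    # also appears in the final model's top-K set.
    matches = (ckpt_topk.unsqueeze(-1) == final_topk.unsqueeze(-2)).any(dim=-1)
    return matches.float().mean().item()
```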

Thank you for your detailed review. We look forward to reading your response.

Comment

Thank you for the clarification. I have no further questions.

Review
Rating: 5

One of the problems of standard Sparse Mixture-of-Experts (MoE) models is that they are discontinuous functions and the gradients of the router are very sparse, which makes training difficult. The paper suggests learning a "default output" for every expert (i.e. the average output among the tokens that selected it) in a Sparse MoE model, and using this as the output of an expert when it is not selected. This way, the router parameters corresponding to the inactive experts still receive non-zero gradients, due to the (non-zero) default output. The proposed method is simple to implement, and does not add any significant overhead (in either time or memory) to the standard MoE model.
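For concreteness, here is a rough PyTorch-style sketch of the mechanism as described in this summary. The class name `DefaultMoESketch`, the expert MLP shapes, and the choice to add the default term directly to the layer output are illustrative assumptions rather than the authors' actual implementation; the EMA update of the default vectors (discussed elsewhere in this thread) is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefaultMoESketch(nn.Module):
    """Sketch: top-K MoE where non-selected experts contribute a stored
    per-expert "default" vector (e.g. an EMA of past outputs), so the router
    weights of inactive experts still receive gradients."""

    def __init__(self, d_model: int, n_experts: int, k: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        # EMA estimates \hat{E}_i of each expert's mean output (a buffer, not trained by SGD).
        self.register_buffer("default_vec", torch.zeros(n_experts, d_model))
        self.k = k

    def forward(self, x):  # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts)
        topk_idx = probs.topk(self.k, dim=-1).indices         # (tokens, k)

        selected = torch.zeros_like(probs, dtype=torch.bool)
        selected.scatter_(1, topk_idx, torch.ones_like(topk_idx, dtype=torch.bool))

        # Selected experts: compute their true outputs (sparse compute).
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = selected[:, i]
            if mask.any():
                out[mask] += probs[mask, i:i + 1] * expert(x[mask])

        # Non-selected experts: substitute the stored default vector, so the
        # corresponding router probabilities still receive gradient signal.
        default_term = (probs * (~selected)) @ self.default_vec
        return out + default_term
```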

The paper experiments with several models (with 8 and 32 experts, and different numbers of experts per token), and presents results on a variety of well-established benchmarks (e.g. MMLU, OpenBookQA, Lambada, ...), showing that it typically (but not always) outperforms the scores of standard TopK routing.

Strengths and Weaknesses

Strengths

  • The paper is very well motivated, structured, and written. It is (generally speaking) very easy to read and understand.
  • The proposed method seems very simple to implement (and the authors have released the source code), which allows other researchers to quickly adopt the idea, re-evaluate it, and build on top to improve it.
  • The method improves the standard TopK routing, which is still essentially the most used routing algorithm among state-of-the-art MoE models, over a wide range of datasets and models (with some caveats).
  • The proposed method seems very robust to different hyperparameters (learning rate, choice of the EMA's $\beta$), which also increases the chances of successfully adopting it for other models & datasets.
  • The paper includes lots of ablation and analysis experiments, measuring the importance of different design choices and empirical observations. I want to highlight, in particular, the experiment in Section 4.5, showing that the proposed method provides gradients that are more similar to those of a dense MoE than the ones obtained when using standard TopK (see Figure 4, left). This shows that the proposed approach is indeed a good method for approximating the dense gradients of a MoE.

Weaknesses

  • My main concern relates to the observation detailed in lines 196-199 and Figure 2a. It seems that the proposed method achieves better results, compared to standard TopK, when the sparsity is not too high. However, given that when $K \rightarrow N$, both methods are equivalent (because no default vectors are used), one would expect the proposed method to be more beneficial with sparser models, given that TopK does a very bad job approximating the gradients of the full (dense) MoE, as shown in Figure 4a. This raises concerns about the ability of the proposed method to generalize to larger and sparser models, and begs the question: how do the authors explain this?
  • Some details of the proposed method are not clear. For instance, it was observed that after "weighting the updates to the default vector by the router logits", the value of $\beta$ is not as important (see lines 187-189, and Figure 8). However, I couldn't find any equation or detailed explanation of the actual "weighting" being used.
  • The paper uses EMA to compute a running average of the output of every expert based on the tokens that activated it. In Lines 130-132, the paper states: "our method ensures that computed expert outputs at the current step are factored into the EMA update before applying the EMA as a substitute for other tokens that did not activate the expert". This is problematic with auto-regressive models, since the natural dependence of tokens may be broken. The proposed method still shows superior performance than TopK after pretraining, so it seems that it didn't learn to exploit this. However, one wonders if models with bigger capacity could eventually learn to exploit this, and why not simply preventing it by using the previous EMA state as the experts' output of a given step, before updating it.
  • The statement in lines 262-264 is not obvious to this reviewer, one might think that the opposite is actually true. Let's suppose that the (conditional) entropy of the router is 0, then every expert will process a very distinct subset of tokens, so the different default vectors will be different. If the entropy of the router is high, very tiny differences in a given input token may cause the router to choose a different expert, which implies that all the experts may be processing relatively similar tokens, and then all default vectors would be similar too. However, it is also true that if the conditional entropy is close to 0, the importance/contribution of the default vectors to the output of the MoE is also 0 (because only the output of the top1 expert matters).
  • The "improvement (%)" columns in Table 2 are not clear at all. What is the baseline for this "improvement"? 8c1 results on Lambada show % of almost 4000, is this correct?
  • Figure 2d is a bit confusing without reading the caption. Perhaps labeling each point with the number of activated & total parameters of the corresponding model would be useful.

Questions

  • You experimented a lot with EMA to learn the default expert outputs. Did you ever consider using the same optimizer as the rest of the parameters by adding an auxiliary term to the NLL loss? For instance, something like this: $\mathcal{L} = \mathcal{L}_{\text{NLL}} + \beta \sum_i (\hat{E}_i - \text{stop-gradient}(\mathbb{E}_x[E_i(x)]))^2$
  • Typo in line 240? "dMoEs" -> "MoEs"?
  • Typo in Figure 4's caption: "dense router gradient (K=8)" should be "(N=8)", since it's the total number of experts, I assume. No?

Limitations

Yes.

Final Justification

The authors have successfully addressed my concerns and questions. I don't think the method or the results achieved can be qualified as a "groundbreaking impact", but it is a very solid paper. Thus, I definitely recommend the AC accepting the paper.

Formatting Concerns

No concerns.

Author Response

Thank you for your detailed review and valuable feedback. We are glad you found our paper easy to read, and appreciate you recognizing the simplicity and effectiveness of our method. We address each of your questions and concerns below.

It seems that the proposed method achieves better results, compared to standard TopK, when the sparsity is not too high. How do the authors explain this?

Our method outperforms standard TopK for the sparsest and largest model (the 1/64 sparsity 7B MoE).

You are correct in suggesting that higher sparsity presents more potential for improvement with Default MoE. In Section 4.1 (line 151) we point out that in our 2B parameter MoE, roughly 1.6 billion of these parameters are allocated to the expert MLPs. In an MoE with $N=32$ experts and $K=1$, only 50 million of these MLP parameters are active (in contrast to 200M active expert parameters for $N=8$). We believe it is difficult for our method to offer much of an improvement at such a limited scale.

To further demonstrate this point, we note that our 7 billion parameter Default MoE with a sparsity of 1/64 significantly improves over TopK, with a much wider performance gap than the 2 billion parameter 32c1 MoE (Figure 2d). In other words, unless the model size is prohibitively small, our method does outperform TopK at higher sparsity levels.

I couldn't find any equation or detailed explanation of the actual "weighting" being used.

You’re right, and we will update Section 3.2 to explain this weighting method more clearly. The following equation expresses how we weight the default vector update using the router logits. In doing so, we reduce the method's sensitivity to the $\beta$ hyperparameter.

$$\hat E_i^{(t)} = \beta \hat E_i^{(t-1)} + (1-\beta)\,\frac{1}{K}\sum_{j\in R_i^{(t)}} S_{ij}\, E_i(x_j)$$

where $R^{(t)}\in\mathbb{N}^{B\times K}$ denotes the indices of the $K$ routed experts for each of the $B$ batch tokens at training step $t$, and $S\in\mathbb{R}^{B\times N}$ denotes the router weights corresponding to all $N$ experts for each token.

Why apply the EMA update before expert forward pass if the natural dependence of tokens may be broken? Why not update after?

You are right to bring up this potential issue. We update the EMA before the forward pass in the submitted paper, but after the submission we also experimented with applying it after the forward pass and found no difference in performance across the model scales we evaluate. We have updated the method to apply the EMA after the forward pass, and will edit lines 130-132 to reflect this in the camera ready. Thank you for this great suggestion!

If the entropy of the router is high, very tiny differences in a given input token may cause the router to choose a different expert, which implies that all the experts may be processing relatively similar tokens, and then all default vectors would be similar too. However, it is also true that if the conditional entropy is close to 0, the importance/contribution of the default vectors to the output of the MoE is also 0

These are interesting points, and we would like to clarify a few details:

  • When the router entropy conditioned on an input is high, it is true that slight changes in the input can lead to a different routing decision. But if the inputs are similar, this does not necessarily mean the expert outputs will also be similar, as the expert MLP is nonlinear.
  • In the high-entropy setting, the router may choose an expert based on a marginal difference in the router weights. This may not be the optimal expert choice, yet the TopK router will not receive feedback from other experts. We claim that the default vectors are more useful when entropy is high because they provide information from all experts when the router's uncertainty is high.
  • We agree with your intuition on what would happen when router entropy is zero.

8c1 results on Lambada show % of almost 4000, is this correct?

We measure the normalized improvement using the expected random score as a baseline. For example, our MMLU score is 32.5 compared to 31.8 for TopK MoE. This seems like a 2% improvement. But MMLU is multiple choice, so a random guess would get 25%. So we are $\frac{32.5-25}{25}=30\%$ better than the random baseline on this metric, compared to the 27% normalized improvement of TopK. The normalized LAMBADA scores are unexpectedly high because the random baseline is 1%, much lower than that of the other benchmarks. Compared to 1%, our score of 41.0 is a 40x improvement. We realize this seems out-of-place so we have a footnote in the paper addressing it; if you can point us to a better random baseline for LAMBADA, we are happy to use it.
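As a small illustration, this normalization can be written as a one-line helper (hypothetical name, using the random-baseline values quoted above):

```python
def normalized_improvement(score: float, random_baseline: float) -> float:
    """Improvement over the expected random score, as a fraction of that baseline."""
    return (score - random_baseline) / random_baseline

print(normalized_improvement(32.5, 25.0))  # 0.30 -> 30% for MMLU
print(normalized_improvement(41.0, 1.0))   # 40.0 -> the large LAMBADA number
```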

Figure 2d is a bit confusing without reading the caption.

We agree, and thank you for pointing this out. All but the rightmost data point in Fig. 2d correspond to models with 2 billion active parameters. We will separate the rightmost data point (from a 7B model) into its own result in the camera ready revision.

Did you ever consider using the same optimizer as the rest of the parameters by adding an auxiliary term to the NLL loss?

Yes, specifically we experimented with training the default vector $\hat E_i$ to be a learnable parameter. In an ablation experiment we found that this underperforms our proposed method. The core idea of our method is to approximate the expected expert output, but directly optimizing the default vector does not account for this heuristic. We'll add more detail on what we tried here in the camera ready.

Typo in line 240? "dMoEs" -> "MoEs"?

Yes, thanks for catching this!

Typo in Figure 4's caption: "dense router gradient (K=8)" should be "(N=8)", since it's the total number of experts, I assume. No?

Yes, $N=8$ is the total number of experts. Note that we are specifically comparing to a dense model as an upper bound. We vary $K$ between $1$ and $8$ to evaluate all possible sparsity levels. In a dense model all of the experts are active, so $K=N=8$ and there is no sparsity. We will edit the figure caption to clarify this further.

Thank you for your detailed review; we genuinely enjoyed thinking about the questions you asked and look forward to reading your response.

Final Decision

This paper presents a method that improves sparse MoE training by substituting missing expert activations with exponential moving averages of previously seen expert outputs, enabling dense gradient updates to routers while maintaining sparse computation. The approach consistently outperforms standard TopK routing across multiple model scales and benchmarks with negligible computational overhead.

While reviewers appreciated the method's simplicity, clear motivation, and comprehensive ablation studies, they noted limitations including incomplete comparisons with other routing methods and evaluation primarily at smaller scales (up to 7B parameters). On the other side, during the rebuttal, authors successfully addressed most technical concerns by clarifying implementation details, providing additional experimental results, and promising expanded baseline comparisons for the camera-ready version. Despite these limitations, the paper received unanimous support for acceptance due to its solid technical contribution and practical applicability.