6.6 / 10
Poster · 4 reviewers
Ratings: 3, 4, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

Joint MoE Scaling Laws: Mixture of Experts Can Be Memory Efficient

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

We derive scaling laws for memory- and compute-constrained Mixture of Experts language models.

Abstract

Keywords
mixture of experts · scaling laws · llm

Reviews and Discussion

Official Review (Rating: 3)

This paper introduces joint scaling laws for Mixture of Experts (MoE) and dense models, incorporating factors such as the number of active parameters, dataset size, and number of experts. The proposed scaling law captures interactions between these variables, enabling principled optimization of MoE configurations under compute and memory constraints.

Questions for Authors

  • How do the authors ensure that the comparison between dense and MoE models is fair?
  • Would the scaling laws hold for MoE variants with different design choices, e.g., routing policy?
  • How does varying expert size (coarse-grained vs. fine-grained experts) affect the proposed scaling law?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Design and Analysis

Yes

Supplementary Material

NA

Relation to Broader Scientific Literature

Related to LLM training methods.

Essential References Not Discussed

Yes

Other Strengths and Weaknesses

Strengths:

  • Provides actionable insights for MoE under different budgets.
  • Extensive empirical validation (280 models, up to 5B parameters) strengthens confidence in the scaling law.

Weaknesses:

  • The analysis assumes dataset size can scale freely, which may not hold in real-world scenarios with limited data.
  • Limited discussion of challenges in training MoE models (e.g., expert imbalance, routing instability) that could affect scalability.

Other Comments or Suggestions

It would be better to state more clearly whether the comparison is based on the MoE's total parameters or its activated parameters.

Author Response

We thank the reviewer for their helpful feedback and comments. We also appreciate the recognition of the extensive empirical validation and actionable insights from our work. Below, we specifically address questions and weaknesses mentioned in the review. If our answers address the reviewer's concerns, we would like to kindly ask for the reconsideration of the rating.

Expert Imbalance To address the reviewer’s concern, we have performed an analysis of our training logs in the context of expert imbalance/token dropping. In general, we observe that the load balancing loss quickly induces balance between experts. Overall, the percentage of dropped tokens is low and does not exceed 10%, and therefore does not significantly affect the training efficiency. Here (https://anonymous.4open.science/api/repo/3412679821-1326/file/dropped_toks.png?v=09990c3d) we present a plot of the per-layer average number of dropped tokens (excluding the first 10% of training), for two selected active-parameter counts and expert counts varying from 2 to 32.
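
For context, the load-balancing loss referred to above is, in the Switch Transformer formulation, num_experts * sum_i f_i * P_i (up to a small coefficient), where f_i is the fraction of tokens routed to expert i and P_i is the mean router probability for expert i. A minimal sketch of this quantity, not the authors' implementation:

    import torch
    import torch.nn.functional as F

    def load_balancing_loss(router_logits: torch.Tensor) -> torch.Tensor:
        # router_logits: (num_tokens, num_experts)
        num_tokens, num_experts = router_logits.shape
        probs = F.softmax(router_logits, dim=-1)                     # router probabilities
        top1 = F.one_hot(probs.argmax(dim=-1), num_experts).float()  # hard top-1 routing decisions
        f = top1.mean(dim=0)   # fraction of tokens dispatched to each expert
        p = probs.mean(dim=0)  # mean router probability per expert
        return num_experts * torch.sum(f * p)

    logits = torch.randn(1024, 8)       # 1024 tokens, 8 experts
    print(load_balancing_loss(logits))  # ~1.0 when routing is balanced; larger when experts are overloaded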

We will include an extended discussion of the token dropping in the final version of the manuscript. We thank the reviewer for this suggestion and believe this will be a valuable addition to our paper.

Fair Comparison Between Dense and MoE Models This is indeed a critical question. We took several steps when designing the experiments to ensure a fair comparison:

  • During training, we use Switch MoE models with capacity factor C=1, i.e., tokens exceeding the capacity due to imbalance between experts are dropped (see the sketch after this list). This is the most conservative setup for MoE; many papers [1, 2] and implementations [3, 4] actually use higher capacity factors or even dropless MoE. If we used such variants, we expect the benefits of MoE would be larger.
  • We carefully adjust the learning rate depending on the number of experts and additionally tune the batch size.
  • Finally, our work is the first to scale MoE models considering the total memory usage of the model, both during training and inference. This important comparison point was previously missing in the literature, which usually didn't consider the additional memory needed by MoE.
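
As referenced in the first bullet above, a minimal sketch of top-1 (Switch-style) dispatch with a capacity factor, purely illustrative and not the authors' training code:

    import torch

    def switch_dispatch(router_logits: torch.Tensor, capacity_factor: float = 1.0):
        # router_logits: (num_tokens, num_experts); returns per-token expert index and a keep mask.
        num_tokens, num_experts = router_logits.shape
        capacity = int(capacity_factor * num_tokens / num_experts)  # max tokens each expert may process
        expert_idx = router_logits.argmax(dim=-1)                   # top-1 routing
        keep = torch.zeros(num_tokens, dtype=torch.bool)
        load = torch.zeros(num_experts, dtype=torch.long)
        for t in range(num_tokens):                                 # fill experts in token order
            e = expert_idx[t]
            if load[e] < capacity:
                keep[t] = True
                load[e] += 1
        return expert_idx, keep  # tokens with keep == False are dropped and only pass through the residual stream

With C = 1, raising the capacity factor (or using a dropless implementation) simply increases `capacity`, which is why the response describes C = 1 as the most conservative setup.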

We hope that this explanation is convincing. Should the reviewer have further concerns or suggestions regarding this topic, we would be happy to address them.

Discussion on Scenarios With Limited Data Thank you for this important comment. Please note that since our analysis already considers different dataset sizes, our scaling law naturally applies to cases where the dataset size is fixed (by plugging the value of the constrained dataset size D into Eq. 6). Please refer to our response to reviewer njYw for a longer comment on this topic. We will also include a further discussion in the final version of the paper.

Routing Instability With the correctly tuned learning rate and batch size, as described in Sec. 5.1., we didn’t observe any instabilities caused by routing.

MoE Variants With Different Design Choices Thank you for the thoughtful question. Available literature shows robustness of scaling laws to changes in architecture/setup. This is documented for routing algorithms [6], training datasets [7] or depth-to-width ratio [5]. Therefore, while we agree that variations on the MoE design - such as the routing policy - form an important axis of model architecture, we expect our conclusions to be robust to these changes.

Fine-Grained/Coarse-Grained Experts We didn’t explicitly model expert granularity, since we would need vastly more resources to consider another variable in our experiment grid. However, based on the available literature [9, 10], we can expect that using fine-grained experts would further improve the efficiency gains we observe when using MoE. We leave further quantification of these gains for future work. We also hope to make it easy for the community to extend our research, since we will release the model checkpoints and code upon the end of the review period.

Regarding Other Comments In the paper, we always use the notation N_act to refer to active parameters, and N_total to refer to total parameters.
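
To make this notation concrete for a top-1 (Switch) MoE, the two counts relate roughly as follows; the helper and the numbers are hypothetical, with "shared" covering attention, embeddings, and other non-expert weights:

    def param_counts(shared_params: int, params_per_expert: int, num_experts: int, top_k: int = 1):
        n_total = shared_params + num_experts * params_per_expert  # what must fit in memory
        n_act = shared_params + top_k * params_per_expert          # what each token actually uses
        return n_act, n_total

    # e.g. 0.6B shared + 8 experts of 0.3B each -> N_act = 0.9B, N_total = 3.0B
    print(param_counts(shared_params=600_000_000, params_per_expert=300_000_000, num_experts=8))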

References:

[1] Muennighoff et al., OLMoE: Open Mixture-of-Experts Language Models

[2] Vavre et al., Llama 3 Meets MoE: Efficient Upcycling

[3] Gale et al., MegaBlocks: Efficient Sparse Training with Mixture-of-Experts

[4] Tan et al., Scattered Mixture-of-Experts Implementation

[5] Kaplan et al., Scaling Laws for Neural Language Models

[6] Clark et al., Unified Scaling Laws for Routed Language Models

[7] Hoffmann et al., Training Compute-Optimal Large Language Models

[8] Frantar et al., Scaling Laws for Sparsely-Connected Foundation Models

[9] Ludziejewski et al., Scaling Laws for Fine-Grained MoE

[10] Dai et al., DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Official Review (Rating: 4)

The authors, motivated by the Chinchilla scaling laws for large transformers and the popularity of Mixture of Experts (MoE) architectures, investigate a joint scaling law that can be applied to both MoE and dense models (the latter being the case where the number of experts = 1). The loss of the model is related to the number of parameters, the number of training tokens, and the number of experts. The optimization involves minimizing this loss given a fixed compute budget F (assuming the limiting factor is accelerator memory).

This work considers a standard MoE variant: a Switch MoE. Tokens are routed to experts and a load-balancing loss is used. The Chinchilla-based optimality analysis assumes that computational efficiency can be measured by FLOPs, approximated by 2ND + 4ND for the forward and backward pass respectively, where N is the number of activated parameters and D is the number of training tokens.
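
For readers checking the arithmetic, this is the standard 6ND approximation; a minimal numeric sketch with hypothetical values:

    def training_flops(n_active: float, n_tokens: float) -> float:
        forward = 2 * n_active * n_tokens   # ~2 FLOPs per active parameter per token
        backward = 4 * n_active * n_tokens  # backward pass costs roughly twice the forward pass
        return forward + backward           # ~= 6 * N * D

    print(f"{training_flops(1e9, 20e9):.2e}")  # 1B active params, 20B tokens -> ~1.2e+20 FLOPs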

Running over 280 experiments, the authors fit the scaling law using least squares and share their four findings and rule of thumb.

Edit: Given the authors' responses I have increased my score to an accept.

Questions for Authors

  • Have the authors analysed whether the load-balancing loss fully mitigates expert imbalance at scale? Does it help avoid the problem where many tokens in a batch are routed to the same expert, causing poor hardware utilisation that the theoretical analysis misses? I am asking whether there is a chance that real-world inference bottlenecks render the theoretical analysis less useful.
  • Do the authors believe these results will hold for distributed training?
  • Have authors verified for a few experiments if estimated FLOPs match actual FLOPs?
  • Would optimisations like FlashAttention and MLA influence the observed scaling laws?
  • Do the authors believe any modifications should be made to the Chinchilla scaling laws that are MoE specific? I think the paper suggests no?

Claims and Evidence

Yes, the authors do a good job at gathering convincing evidence from comprehensive experiments

Methods and Evaluation Criteria

Yes

Theoretical Claims

N/A

Experimental Design and Analysis

Checked the experiment settings in the appendix; generally looks good, i.e. optimised params

Supplementary Material

Appendix.

Relation to Broader Scientific Literature

This paper covers the broader scientific literature on MoE architectures (efficient/conditional compute) along with scaling laws for transformers (Hoffmann et al., 2022). The key contribution is successfully extending the Chinchilla scaling law to cover MoEs.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths:

  • The authors do a great job at visualising their results through very well-done figures in a way that effectively communicates the results
  • The summaries of the findings and the rule of thumb are also very well communicated and make the work easy to digest
  • Focusing on inference memory, not just training FLOPs, is important
  • Range of experiments is vast and rigorous
  • Limitations section is well done

Weaknesses:

  • I find the assumption that dataset size is a free variable unrealistic; empirically, most practitioners will take the largest dataset available to them and treat it as a constant, varying model size and training steps
  • It is unclear how this works across different datasets; or is the scaling law dataset-independent, and if so, can we reach that conclusion from some of the results?
  • I feel that there is an over-reliance on the Chinchilla-derived scaling law without modification (although I guess the fit has a low MAE)

Other Comments or Suggestions

  • Sec 4.1 is missing an equation number for the optimization equation
  • A footnote could be added to explain the 6=4+2 weights for FLOPS for readers not very familiar with the literature (I had to check this myself in Hoffmann 2022)

Author Response

We thank the reviewer for the insightful feedback and comments. We appreciate the recognition of the vast scale of our experiments, clear visualizations, and good communication of findings. Below we address the reviewer’s questions in detail. If our answers address the reviewer's concerns, we would like to kindly ask for the reconsideration of the rating.

Variable Dataset Size Thank you for this valuable insight. Please note that in the scaling laws literature, it is common to treat the dataset size as a variable [1,2,3,4]. These results can be crucial in the long run, as shown by [2], which shifted the community’s focus to collecting larger datasets rather than just scaling model sizes. Furthermore, typically a one-epoch setting is assumed, i.e. the models are trained on the whole dataset. However, we agree that the data-constrained scenario is important from the practical perspective, and including such discussion can provide a broader context to our paper. Since our analysis already considers different dataset sizes, our scaling law naturally applies to cases where the dataset size is fixed (e.g., in Fig. 2 (a) in the paper each vertical line represents a constant dataset size). We will add a more detailed section concerning the dataset constraints in the camera-ready version of the paper.

Different Datasets [2] note that their scaling laws are robust to changes in the dataset (they consider 3 different datasets). Similarly, we expect that our qualitative conclusions will hold for different datasets, even though the numerical coefficients will likely be different.

Reliance on Chinchilla / Changes of Chinchilla for MoE As Chinchilla is an established result and is proven to reliably model scaling in various scenarios [3,4], our scaling law needed to be reducible to Chinchilla in the case of E=1 (dense). Our experiments have shown that our scaling law works well for different E's via low MAE on a held-out extrapolation validation set, both for dense and MoE models. Note that we do modify the Chinchilla scaling laws - a scaling law that generalizes [2] for a given E is a major contribution of our work. As a further piece of evidence for the reliability of our approach we will include a bootstrapped version of our results in the final version of the paper.
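
As a purely illustrative aid (the paper's joint (N, D, E) functional form and fitted coefficients are not reproduced here), a Chinchilla-style least-squares fit with a held-out extrapolation check could be sketched as follows; note that Hoffmann et al. actually fit in log space with a Huber loss and L-BFGS, so this only conveys the structure of the procedure:

    import numpy as np
    from scipy.optimize import least_squares

    def predicted_loss(params, N, D):
        E0, A, B, alpha, beta = params
        return E0 + A / N**alpha + B / D**beta

    def residuals(params, N, D, observed):
        return predicted_loss(params, N, D) - observed

    # Hypothetical grid of runs: active params N, training tokens D, and final losses.
    N = np.array([1e8, 1e8, 3e8, 3e8, 1e9, 1e9])
    D = np.array([2e9, 6e9, 2e9, 6e9, 6e9, 2e10])
    observed = np.array([3.4, 3.2, 3.1, 2.9, 2.8, 2.6])

    fit = least_squares(residuals, x0=[1.5, 400.0, 1000.0, 0.3, 0.3], args=(N, D, observed))
    # Extrapolation error on a (hypothetical) larger held-out run with observed loss 2.3.
    heldout_mae = np.abs(predicted_loss(fit.x, np.array([5e9]), np.array([1e11])) - 2.3).mean()
    print(fit.x, heldout_mae)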

Expert Imbalance Thank you for raising this important question. To avoid exceeding the response length we refer to the answer given to reviewer V5XV.

Influence of Architecture/Training Optimizations Introducing changes (such as MLA) to the models’ architecture would change the scaling law coefficients; however, we expect the functional form to remain the same (similarly to [5], whose laws’ form did not change when they modified their routing mechanism). As changes to the attention mechanism would impact both dense and MoE models, we would not expect them to favor either architecture, and we would expect the general conclusions to stay the same. Simultaneously, the Switch MoE layer could be improved, e.g., via the use of fine-grained experts, yielding better scaling behavior in MoE models.

FLOPs and Efficiency To compare our FLOPs estimates with actual numbers, we perform an experiment for 3 model sizes and E's using torch.utils.flop_counter.FlopCounterMode, measuring MFLOPs per token in the forward pass. The discrepancy between our estimate and the actual numbers stems mostly from the implementation details of the embedding layer and becomes relatively smaller with larger models. It is also of the same size in MoE and dense models with the same number of active parameters. Note as well that our FLOPs estimation method is standard in the literature [2,6].

MFLOPs per token in the forward pass, our estimate vs. measured with torch:

Model size | Our estimate | torch (E=1) | torch (E=8) | torch (E=32)
370M       | 642.1        | 572.7       | 573.0       | 573.7
890M       | 1781.0       | 1702.2      | 1702.7      | 1704.4
1.6B       | 3261.4       | 3186.5      | 3187.3      | 3190.0
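
A minimal sketch of this kind of measurement with torch.utils.flop_counter.FlopCounterMode, using a placeholder layer and shapes rather than the authors' models:

    import torch
    from torch.utils.flop_counter import FlopCounterMode

    # Stand-in for one transformer block; the real measurement would run the full model.
    layer = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096, batch_first=True)
    x = torch.randn(1, 2048, 1024)  # (batch, sequence length, hidden size)

    counter = FlopCounterMode(display=False)
    with counter, torch.no_grad():
        layer(x)

    mflops_per_token = counter.get_total_flops() / x.shape[1] / 1e6
    print(f"forward MFLOPs per token: {mflops_per_token:.1f}")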

In a specific setup, MoE efficiency depends on the implementation, with many efficient ones [7,8] speeding up training and inference.

Distributed Training Since our method is implementation-agnostic, the distributed setting should not have any impact on our conclusions.

Other Comments Thank you for your comments regarding equation number and clarification on the FLOPs counting. We will fix them in the final version of the paper.

References:

[1] Kaplan et al., Scaling Laws for Neural Language Models

[2] Hoffmann et al., Training Compute-Optimal Large Language Models

[3] Ludziejewski et al., Scaling Laws for Fine-Grained Mixture of Experts

[4] Kumar et al., Scaling Laws for Precision

[5] Clark et al., Unified Scaling Laws for Routed Language Models

[6] Gadre et al., Language Models Scale Reliably with Over-Training and on Downstream Tasks

[7] Tan et al., Scattered Mixture-of-Experts Implementation

[8] Zhao et al., DeepEP: an efficient expert-parallel communication library

Official Review (Rating: 3)

This work balances computational and memory constraints by deriving joint scaling laws for both Mixture-of-Experts (MoE) and dense models. The analysis shows that the optimal number of experts is closely tied to the available memory and compute budgets. Furthermore, experimental results suggest that MoE models can often outperform dense models. By transferring scaling laws from dense to MoE models, this study provides valuable insights for designing and deploying MoE architectures in large-scale training scenarios.

Questions for Authors

The paper employs a standard Switch MoE layer that routes each token to a single expert, which seems inconsistent with mainstream MoE architectures like DeepSeek that activate multiple experts per token. Can the methods presented in the paper be adapted to scenarios where K > 1 experts are activated per token?

Claims and Evidence

The claims are well-supported.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria mostly make sense. However, here are some questions:

  1. I did not find a dedicated section in the manuscript that explicitly details the dataset selection process. Specifically, I wonder if the authors utilized established benchmarks such as the HumanEval dataset [1], which assesses the capability to generate Python functions of varying complexity. Could you please provide specific details about your dataset choices and experimental setups?

  2. I would like to discuss a point regarding performance metrics. While traditional scaling laws typically use final loss as the primary performance indicator, I am interested in your perspective on how loss values correlate with other metrics such as accuracy. Specifically, I would appreciate your thoughts on why lower loss values provide stronger evidence for conclusions such as "MoE can often be the preferred alternative to dense models" compared to alternative evaluation metrics.

[1]. Evaluating large language models trained on code. 2021.

Theoretical Claims

The work does not include any theoretical proofs. I believe the article would be significantly strengthened if the authors provided derivations of the optimal N and D (as shown in equation (7)) by formulating and solving the joint MoE scaling law objective function. This theoretical foundation would complement the empirical results and make the content more complete.

Experimental Design and Analysis

The experimental designs effectively validate the performance of the proposed principled framework for selecting the optimal MoE configuration, as well as some interesting findings from the comparison between dense and MoE models.

Supplementary Material

Yes, I reviewed the supplementary materials, where the authors provided further explanations on experimental settings and implementation details.

Relation to Broader Scientific Literature

Recent works have shown that:

  1. For a fixed dataset size, as model size increases, the benefit of using an MoE diminishes.

  2. For a fixed model size, as the number of training tokens increases, the benefit of an MoE grows.

This paper further discusses trade-offs between the computational and memory costs of MoE models through novel joint MoE scaling laws, offering valuable insights for their design and deployment in large-scale training settings.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Strengths:

  1. The paper proposes a novel joint scaling law for MoE models to find the optimal expert configuration, providing an interesting perspective.

  2. The precise trade-offs between compute and memory efficiency are an important real-world topic, and this work offers actionable insights for deploying MoE models in practice.

Weaknesses:

The innovation of the proposed Joint MoE Scaling Law appears to be incremental. It builds upon previous work that established power-law relationships between final loss, model size, and dataset size [1], as well as recent MoE-related studies that treated the number of experts E and model size N as variables in the formulas [2]. The main contribution seems to be incorporating dataset size D as an additional variable to establish the equation between L, E, N, and D.

Could you clarify the core innovations of your proposed method, as well as the differences and challenges in advancing beyond the aforementioned research? (If the authors address this concern, I would consider increasing my score.)

[1]. Training compute-optimal large language models, 2022.

[2]. Unified scaling laws for routed language models, 2022.

Other Comments or Suggestions

In the Related Work section, equation (2) appears to lack the constraint C ≈ 6ND, where C denotes the floating-point operation count (FLOPs).

Author Response

We thank the reviewer for the detailed comments and suggestions. We also appreciate the recognition of the practical importance of our findings, the actionable insights they provide, and the confirmation that our claims are well-supported. We hope that the answers below adequately address the reviewer's questions and concerns. If that is the case, we kindly ask for a reconsideration of the paper score.

Regarding the innovations, differences and challenges compared to previous work We believe that our paper delivers significant innovations not present in existing works and provides important value to the scientific community:

  • Crucially, we derive and fit a new scaling law, which allows us to develop novel, practical insights for MoE training. In particular, we are the first to consider scaling of MoE models in a memory-constrained regime. We deliver a new, unexpected result that MoE models can be optimal in such a scenario. We are the first to obtain the optimal token-to-parameter ratio based on the number of experts.
  • The inclusion of D allows us to reach qualitatively different conclusions than [5]. Furthermore, we consider the compatibility of our formula with [6] a strong point of our paper.
  • We have performed a careful and meticulous work of collecting empirical evidence (>280 training runs, spanning model sizes with up to 5B parameters, multiple distinct training durations and MoE expert counts). These results set a substantial basis for deriving conclusions. Moreover, we open the results to the scientific community for further analysis, by releasing model checkpoints and training code upon the end of the review period.
  • Furthermore, we put a lot of attention towards details to ensure our results are robust. We utilize a dynamic batch size (Sec. 5.1.1) to ensure the reliability of our training runs regardless of their token count. Additionally, we derive a joint scaling law for the learning rate (Sec. 5.1.2, App. C), which is a novel contribution of our work. Based on the fitted coefficients, it justifies that a larger E necessitates a lower learning rate, a result not present in the literature to this point.

Evaluation As we are interested in comparing general trends, we focus on modeling perplexity. This metric has been shown to predict downstream performance well, even if the architecture details (e.g. the model size) differ [1]. Existing literature [2] suggests that when perplexity is fixed, MoE outperforms dense models at world-knowledge tasks while matching their performance in reasoning. Notwithstanding, we agree that an analysis of the downstream performance would be a valuable addition. We have performed such experiments (Results: https://anonymous.4open.science/api/repo/3412679821-1326/file/all_benchmarks_grid.png?v=70a33096) using dropless MoE during evaluation. We find that perplexity strongly dictates the overall downstream performance; however, there seems to be a slight advantage of either dense or MoE models in selected benchmarks (LAMBADA, OpenBookQA). We will add the analysis of downstream performance in the camera-ready version of the paper.

Derivation of the Optimal N and D Here (https://anonymous.4open.science/api/repo/3412679821-1326/file/optimal_n_d.png?v=331aaf30) we present a sketch of the derivation of the optimal N_act (with the optimal D being analogous). We will provide full details in the final version of the manuscript. We thank the reviewer for this suggestion and believe it will contribute to the completeness of the analysis.
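
For context, the analogous derivation in the dense Chinchilla setting (a sketch only; the paper's Eq. (7) additionally involves E) minimizes L(N, D) = E_0 + A / N^alpha + B / D^beta subject to the compute constraint 6ND = C. Substituting D = C / (6N) and setting dL/dN = 0 gives

    N_opt = G * (C/6)^(beta / (alpha + beta)),
    D_opt = (1/G) * (C/6)^(alpha / (alpha + beta)),
    where G = (alpha * A / (beta * B))^(1 / (alpha + beta)).

The joint law presumably follows the same substitution pattern, with the coefficients depending on E.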

Dataset Selection We train our models using FineWeb-Edu, a 1.3T subset of the FineWeb dataset - a large, openly available, high-quality LLM pretraining dataset. The data curation process of the FineWeb was guided using popular benchmarks (CommonSenseQA, HellaSwag, etc.). FineWeb-Edu is selected using a filter for highly educational content. It “outperforms all openly accessible web-datasets on a number of educational benchmarks” [3]. We will clarify our choice of the dataset in the camera-ready version.

Regarding “Questions for the Authors” Although we focus on the standard MoE variant, we believe that our main conclusions will hold for other MoE versions. We can form this assumption based on related work, where scaling laws are shown to be consistent across routing algorithms [5] or datasets [6]. Based on the literature [4], we can expect changes like fine-grained experts to further improve efficiency gains from using MoE.

References:

[1] Du et al., Understanding Emergent Abilities of Language Models from the Loss Perspective

[2] Jelassi et al., Mixture of Parrots: Experts improve memorization more than reasoning

[3] Penedo et al., The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

[4] Ludziejewski et al., Scaling Laws for Fine-Grained Mixture of Experts

[5] Clark et al., Unified Scaling Laws for Routed Language Models

[6] Hoffmann et al., Training Compute-Optimal Large Language Models

Reviewer Comment

Thank you for your answer, it has resolved most of my confusion. I have already improved my score.

Official Review (Rating: 4)

The paper proposes a scaling law for mixture-of-experts and dense models, similar to the one used by Chinchilla (Hoffmann et al., 2022), but that incorporates the number of experts in the equations.

The proposed equation describing the scaling laws is essentially the combination of Chinchilla with Clark et al. (2022), which explores the scaling laws of MoE models but does not take into account the total number of training tokens D, only the number of active parameters N. The paper also presents different variations of the scaling laws that take into account not only total training cost, but also (expected) total inference cost, and memory constraints.

The authors fit the model running several experiments, with a wide range of training FLOPs (different numbers of tokens, different numbers of experts, and different model backbone sizes), and analyze the results. There are several key observations from the experimental results which are highlighted throughout the paper. Perhaps the most interesting one is that MoE models can also be parameter efficient compared to dense models, which is commonly assumed to be false in the community.

Update after the rebuttal

I thank the authors for clarifying my questions and addressing the typos that I highlighted during my review. I will keep my score and recommend the acceptance of the paper.

Questions for Authors

None.

Claims and Evidence

The authors propose a unifying scaling law for dense and MoE models, that accurately predicts the training loss of a model given the total training budget, the total number of activated parameters, and the number of experts.

The experiments indeed show that the optimal number of experts for a given training budget depends on the memory constraints. When the training budget is high but memory is highly constrained, it is often better to use a dense model than an MoE one. But if the training budget is low, an MoE model may be optimal, even with the same memory constraints as a dense model (see Figures 3 and 4 and Table 2).

Methods and Evaluation Criteria

The paper follows standard practices when studying the optimal scaling of Transformer models for language modeling. The evaluation criteria (fitting the scaling formula and comparing the interpolation and extrapolation behaviour) is also standard.

Theoretical Claims

There are no theoretical claims.

Experimental Design and Analysis

The experiments are sound: the authors train a wide range of models on a large collection of data (FineWeb-Edu) and perform a reasonable tuning of hyperparameters (learning rate and batch size). They use standard practices for fitting the scaling laws (L-BFGS, as in Hoffmann et al., 2022).

Supplementary Material

I've read all the supplementary material, which mainly contains details about the fitting of the scaling laws and their resulting coefficients, and the list of models explored during experimentation.

Relation to Broader Scientific Literature

The key contributions of the paper are of high importance to the community of language modeling, deep learning architectures, and mixture-of-experts. The paper includes a quite comprehensive literature review, citing relevant papers in the context of MoEs and scaling laws for language modeling.

Essential References Not Discussed

The literature review focuses only on MoEs and scaling laws for language modeling, while the same techniques can be applied to other areas, such as computer vision. For instance, "Scaling Vision with Sparse Mixture of Experts" by Riquelme et al. (2021) presents MoEs for Vision Transformers (ViTs), and "Scaling Vision Transformers" by Zhai et al. (2021) presents scaling laws for ViTs.

Other Strengths and Weaknesses

I really appreciate that the authors highlight all the take-aways from the different sections, and give a general rule of thumb at the end of the paper. The paper is of great quality in my opinion. Congratulations to the authors!

Other Comments or Suggestions

  • In Table 2, I would suggest labeling the rows as "Training FLOPs", and the columns as "Maximum Memory". This can be inferred from reading the table's caption, but it will be clearer to see if it was directly on the table, and it seems to be enough space to add these in it.

  • I think there's a typo in the caption of Figure 5. L346 reads "(D/N < 1 -- more tokens than parameters)". It is the other way around: N is the number of parameters, and D the number of tokens!

  • Could you add the legend in Figure 5b? I'm not sure what the brown curve represents (is it 16B training tokens?).

Author Response

Thank you for your thoughtful and encouraging review. We are especially grateful for the reviewer’s recognition that "the key contributions of the paper are of high importance to the community of language modeling" and that the work is of "great quality". We are also thankful for the detailed comments and suggestions, which we will apply in the revised manuscript version.

Scaling Laws for Other Modalities Thank you for pointing this out. We agree that the techniques presented in our work could be extended to other domains such as computer vision, and we appreciate the references to Riquelme et al. (2021) and Zhai et al. (2021). We will incorporate these works in the revised version and expand the discussion to highlight the broader applicability of our scaling laws beyond language modeling.

Regarding "Other Comments" If we are able to fit the labels within Table 2, we will do so in the camera-ready version.

Regarding the typo in Fig. 5 - you are absolutely right, the caption should indicate that D/N < 1 stands for fewer tokens than parameters and we will fix the caption.

Thank you for pointing out the missing label in Fig. 5b. The brown curve indeed corresponds to 16B training tokens - we will fix the plot accordingly.

Final Decision

In the paper, the authors propose unified scaling laws for both dense and Mixture-of-Experts (MoE) models, taking into account important variables such as the number of active parameters, training dataset size, and the count of experts involved.

All the reviewers are positive about the contributions of the paper, including: (1) the novelty of the scaling laws framework that predicts a model’s training loss based on three core elements: the overall training budget, the number of active parameters during training, and the number of experts in the model; (2) the experiments are sufficiently extensive and appropriate to evaluate several aspects of the proposed scaling laws; (3) the writing and presentation of the paper are good. After the rebuttal, most of the concerns of the reviewers were addressed, and all the reviewers are happy with the current state of the paper.

In my opinion, the contributions and originality of the proposed scaling laws are sufficient for acceptance at ICML. Therefore, I recommend accepting it in its current form. However, I encourage the authors to address the reviewers’ suggestions and integrate their feedback into the camera-ready version of their paper.