Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer
Abstract
Reviews and Discussion
This paper presents a novel Transformer architecture to address the challenges associated with scaling hidden dimensions. The proposed MoHD (Mixture of Hidden Dimensions) leverages the observation of high hidden-dimension sparsity to enhance computational efficiency and model performance. In experiments, MoHD outperforms vanilla Transformers and MoE models in efficiency and performance. Several techniques, such as shared sub-dimensions, are introduced to dynamically activate sub-dimensions through a routing mechanism, ensuring that both common and specialized tokens are effectively modeled.
update after rebuttal
The rebuttal addresses my questions well. I am pleased to keep my positive scores. I also note that all the reviewers have given positive ratings, so I believe we have reached a consensus on this submission.
Questions for Authors
See Weaknesses.
Claims and Evidence
MoHD reportedly achieves a 1.7% performance gain with 50% fewer active parameters and a 3.7% improvement at 3× parameter scaling. These claims are empirically supported through rigorous testing. Results are convincing.
Methods and Evaluation Criteria
The MoHD architecture integrates shared/specialized sub-dimensions and dynamic routing, a well-grounded approach for intermediate dimensions. The proposed method aligns well with issues identified in the observational studies. Evaluations on NLP tasks are convincing. Parameter efficiency and task performance as metrics are practical choices.
Theoretical Claims
The method is built upon the observations. Experimental results demonstrate the effectiveness of the proposed method.
Experimental Design and Analysis
Experiments are comprehensive across many baselines to validate MoHD.
Supplementary Material
Terms, an extended literature review, and more observations are provided, which helps in understanding the study -- but I did not check them carefully.
Relation to Prior Work
The work is related to the design of efficient LLM architectures:
- Sparsity across dimensions, heads, activations, etc.
- Conditional computation such as MoE.
Essential References Not Discussed
While key sparse and adaptive architecture studies are cited, recent MoE advancements require further discussion (1).
- R. Cai et al. Flextron: Many-in-One Flexible Large Language Model.
Other Strengths and Weaknesses
Strengths:
- The paper is mostly well-written and easy to follow.
- The proposed method is technically sound with well-visualized observations.
- Experiments are conducted with numerous model sizes. The ablation studies effectively validate the components of MoHD.
Weaknesses:
- The experiments were conducted on models with up to 1B activated parameters. It could be helpful to show the sparsity patterns in larger models, which may be different.
- Exploring the combination of MoE and MoHD could help unlock sparsity benefits across multiple dimensions.
- Further ablation studies to isolate the effects of shared vs. specialized sub-dimensions and routing mechanisms would clarify why finer-grained routing harms performance while higher expert specialization improves outcomes.
- How sensitive is MoHD to routing variations (e.g., K-value selection, thresholds)? Was stability tested under different routing configurations?
- Can MoHD adapt to non-NLP tasks without architectural modifications? Or are design adjustments necessary?
Other Comments or Suggestions
The paper can be improved by:
- Providing theoretical grounding for MoHD’s sparsity and routing mechanisms.
- Extending evaluations to other domains/larger models and real-world applications.
- Clarifying limitations/failure modes when deploying MoHD across tasks/architectures.
Thank you very much for your high evaluation of our work!
Q1: Providing theoretical grounding for MoHD’s sparsity and routing mechanisms.
R1: We refer to our response to Reviewer fdJw Q4, where we discuss how sparse mixed activation expands effective width, reduces complexity, and improves loss. Due to space constraints, we will include the full theoretical derivation in the final version.
Q2: Extending evaluations to other domains/larger models and real-world applications.
R2: As noted in h4Rw Q3, MoHD can scale to larger models, particularly those based on Transformer architectures. For cold-started models, MoHD is expected to apply similarly. We plan to extend evaluation to more domains in future work.
Q3: Why finer-grained routing harms performance while higher expert specialization improves outcomes.
R3: Our experiments show that performance peaks when hidden representations are divided into 16 sub-dimensions (each of size 256). Finer routing granularity leads to degradation due to:
- Redundancy: Over-partitioning causes sub-dimensions to capture overlapping or low-importance features, reducing capacity efficiency.
- Routing instability: Finer granularity makes routing decisions more sensitive and harder to stabilize during training.
- Fusion mismatch: Our Group Fusion Layer re-integrates sparsely activated sub-dimensions. A moderate number of active groups is easier to optimize, while excessive fragmentation hinders training.
As for expert specialization, we interpret it as each sub-dimension focusing on a narrower subspace or data distribution. This expands representational diversity and enables the model to better capture fine-grained patterns, improving generalization across tasks.
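To make the sub-dimension granularity concrete (16 groups of size 256, with a few specialized groups routed per token on top of always-on shared groups), here is a minimal PyTorch sketch. The class and argument names (`SubDimRouter`, `num_shared`, `k`) and the exact rescaling rule are our own illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubDimRouter(nn.Module):
    """Illustrative token-wise routing over hidden sub-dimensions (not the paper's code).

    The hidden size is split into `num_subdims` contiguous groups of equal size;
    shared groups are always active, and each token additionally activates `k`
    specialized groups chosen by a linear gate.
    """
    def __init__(self, hidden_size=4096, num_subdims=16, num_shared=12, k=2):
        super().__init__()
        assert hidden_size % num_subdims == 0
        self.subdim_size = hidden_size // num_subdims      # e.g. 4096 / 16 = 256
        self.num_subdims = num_subdims
        self.num_shared = num_shared                       # e.g. 75% shared groups
        self.k = k
        self.gate = nn.Linear(hidden_size, num_subdims - num_shared, bias=False)

    def forward(self, x):                                  # x: [batch, seq, hidden]
        scores = F.softmax(self.gate(x), dim=-1)           # probs over specialized groups
        _, topk_idx = scores.topk(self.k, dim=-1)          # k specialized groups per token
        mask = torch.zeros(*x.shape[:-1], self.num_subdims, device=x.device)
        mask[..., :self.num_shared] = 1.0                  # shared sub-dimensions always on
        mask.scatter_(-1, topk_idx + self.num_shared, 1.0) # activate routed groups
        # assumed activation scaling: restore magnitude lost to deactivated groups
        scale = self.num_subdims / mask.sum(-1, keepdim=True)
        channel_mask = (mask * scale).repeat_interleave(self.subdim_size, dim=-1)
        return x * channel_mask, scores
```

With finer granularity (more, smaller groups), the gate must discriminate among many more options per token, which is one intuition for the routing instability described above.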
Q4: How sensitive is MoHD to routing variations and stability tested under different routing configurations?
R4: Thank you for the insightful question. We analyzed MoHD’s sensitivity and stability under varied routing configurations, with key findings as follows:
- Routing stability: As shown in Figure 7, sub-dimension selection probabilities mostly stay within 0.2–0.3, indicating active and balanced usage. Our Sub-Dimension Load Balance Loss further improves distribution and efficiency (an illustrative formulation is sketched after this answer).
- Shared subspaces enhance stability: Best performance occurs when 75% of sub-dimensions are shared across tokens. This mitigates instability in private subspaces. Beyond a certain partitioning level, additional sub-dimensions yield diminishing returns or harm performance due to routing instability.
- Component sensitivity: MoHD is more sensitive in Attention layers than FFN layers. Sparsification in Attention causes larger performance drops, suggesting that routing in this component requires finer, task-specific design; this is an area we aim to explore further.
We agree that routing stability is key to scaling MoHD and appreciate the reviewer’s focus on this aspect.
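For reference, a Switch-Transformer-style auxiliary balance loss adapted to specialized sub-dimension groups could look like the sketch below; the function name, the `alpha` weight, and the exact formulation are assumptions for illustration, not necessarily the loss used in the paper:

```python
import torch

def subdim_load_balance_loss(router_probs, topk_idx, num_specialized, alpha=0.01):
    """Illustrative auxiliary balance loss over specialized sub-dimension groups.

    router_probs: [batch, seq, num_specialized] softmax routing probabilities
    topk_idx:     [batch, seq, k] indices of activated specialized groups
    Follows the common f_i * P_i formulation; `alpha` is an assumed weight.
    """
    one_hot = torch.zeros_like(router_probs).scatter_(-1, topk_idx, 1.0)
    frac_routed = one_hot.mean(dim=(0, 1))     # f_i: fraction of tokens using group i
    mean_prob = router_probs.mean(dim=(0, 1))  # P_i: average routing probability
    return alpha * num_specialized * torch.sum(frac_routed * mean_prob)
```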
Q5: Can MoHD adapt to non-NLP tasks without architectural modifications?
R5: Thank you for the thoughtful question. While MoHD is developed for NLP, its core ideas may extend to other domains—particularly Vision-Language Models (VLMs), which also use Transformer architectures. Our perspective:
- Shared structure: Vision Transformers (ViTs) treat image patches as tokens, analogous to text tokens. These patches share global patterns while retaining local uniqueness, similar to linguistic semantics.
- Core applicability: If ViT hidden dimensions show both shared and token-specific activations, MoHD's principle of selective sub-dimension activation may enhance efficiency and expressiveness.
- Architectural adaptation: Despite structural parallels, visual representations differ from language. Routing and partitioning strategies would need tuning to align with visual characteristics.
- Multimodal fusion: In VLMs, MoHD-induced sparsity could affect alignment across modalities. Incorporating sparsity-aware mechanisms without harming cross-modal interaction remains an open direction.
In short, while direct transfer isn’t trivial, MoHD’s principles are generalizable and worth exploring in vision and multimodal settings.
Q6: Extend Reference
R6: We sincerely thank the reviewer for the additional references. We will include a detailed discussion of them in the next revision.
- The proposed Mixture of Hidden Dimensions (MOHD) aims to address the inefficiency of hidden-dimension scaling.
- The core insight lies in the observation that only subsets of dimensions are activated across tokens, with some dimensions shared globally and others allocated as "private" dimensions.
- The MOHD model reportedly achieves comparable or superior performance to standard Transformers while reducing activation parameters by up to 50%.
Questions for Authors
- How does MOHD compare to other sparse models (e.g., structured/activation pruning) in training efficiency and convergence?
- What is the theoretical basis for activation scaling? Why does it preserve activation flow despite reduced compute?
- How does MoHD improve performance while reducing activations?
Claims and Evidence
- Experiments support the claims for smaller models (e.g., 355M/495M parameters).
- Results are largely convincing, though extrapolation to larger models (e.g., 1.13B) raises questions.
Methods and Evaluation Criteria
- MOHD is evaluated across diverse tasks.
Theoretical Claims
The theoretical foundation requires further elaboration.
- While the shared/specialized sub-dimension concept is appealing, rigorous analysis of their interactions is lacking.
- The activation scaling mechanism, though practical, lacks theoretical justification for mitigating information loss. Deeper mathematical insights would ground the method more firmly.
Experimental Design and Analysis
- Ablations offer useful insights but could better explain design choices (e.g., group fusion, balance loss) and their specific contributions.
- Claims of MOHD superiority over MoE feel underdeveloped. A nuanced comparison of trade-offs (e.g., model scale, task complexity) would provide a more balanced perspective.
Supplementary Material
Yes
Relation to Prior Work
- MOHD contributes to LLM architecture design and efficient LLMs.
Essential References Not Discussed
Existing works also observe the sparsity of Transformers [1,2], and this paper is also related to the pruning methods of LLMs [3,4], which are recommended for citation and discussion.
[1] Li et al. The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers. ICLR 2023.
[2] Wang et al. Q-Sparse: All Large Language Models can be Fully Sparsely-Activated. CoRR abs/2407.10969 (2024).
[3] Ma et al. LLM-Pruner: On the Structural Pruning of Large Language Models. NeurIPS 2023.
[4] Dong et al. Pruner-Zero: Evolving Symbolic Pruning Metric From Scratch for Large Language Models. ICML 2024.
Other Strengths and Weaknesses
Pros:
- The writing and organization are relatively clear.
- A unique solution to inefficient hidden dimension scaling in Transformers.
- Convincing results for smaller models.
- The shared/specialized sub-dimension concept is novel and promising.
Cons:
- Insufficient theoretical grounding.
- Limited comparisons with other sparse/MoE architectures.
- Computational costs for scaling MOHD need clearer articulation.
Other Comments or Suggestions
This paper would benefit from deeper analysis of limitations.
We are greatly encouraged by the reviewer’s positive feedback. Below, we address each of your comments in detail.
Q1: Theoretical Grounding
R1: Please see our response to Reviewer fdJw (Q4) for a more detailed explanation. In summary, we theoretically ground MoHD by showing that:
- Sparse activation leads to effective width expansion;
- This expanded width results in lower empirical loss;
- A hybrid scheme of private and shared sub-dimensions yields better performance than either alone.
Q2: Computational Costs Compared to MoE and Other Methods
R2: We summarize MoHD’s computational characteristics relative to other approaches:
- Compared to MoE: MoHD sparsifies the hidden dimension across all matrices, while MoE targets only the FFN. Thus, MoHD achieves higher sparsity coverage. It does, however, introduce modest overhead from additional WTE parameters and routing operations, particularly at inference.
- Compared to pruning and quantization: MoHD avoids iterative pruning or post-training fine-tuning. It learns sparse activations during training, reducing overall training costs while maintaining performance.
- Compared to activation sparsification: Unlike post-hoc sparsification, which may degrade performance due to training-inference mismatch, MoHD maintains consistency between phases. This leads to better generalization and stability at a similar inference cost.
Q3: Scaling Costs
R3: Please refer to Reviewer h4Rw, Q4 for detailed cost analysis across model scales. In brief:
- Routing and fusion layers add minimal parameter overhead.
- The primary computational increase comes from the WTE layer, which scales with model width.
- Notably, the performance benefits of MoHD cannot be solely attributed to increased WTE size, as demonstrated in our ablation and FLOPS analysis.
Q4: How MoHD Improves Performance While Reducing Activations
R4:
- Sparsity and Redundancy in Hidden Dimensions: Empirical studies reveal that only a small subset of hidden dimensions are meaningfully activated. MoHD removes redundant activations, concentrating computation on informative subspaces and improving efficiency.
- Component-wise Sensitivity to Sparsification: Consistent with prior work, our experiments show that the FFN layers tolerate, and sometimes benefit from, sparsification due to regularization effects. In contrast, MHA layers are more sensitive. MoHD accommodates this by applying structured sparsity selectively, preserving performance.
- Grouped Fusion Mechanism: MoHD achieves efficiency via two components (a minimal illustrative sketch follows below):
  - Activation scaling: Dynamically adjusts magnitudes of active units for representational stability.
  - Grouped fusion: Aggregates sparse activations to preserve expressive capacity.
These mechanisms enable MoHD to reduce activations without compromising—and often improving—performance. Our findings suggest that well-structured sparsity improves generalization by filtering noise and reinforcing salient patterns.
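To illustrate how activation scaling and grouped fusion could interact, here is a minimal sketch. The `GroupFusion` module, its learned softmax weights, and the compensation rule are our own assumptions, not the paper's Group Fusion Layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupFusion(nn.Module):
    """Illustrative re-integration of activated sub-dimension groups (sketch only).

    Activated groups receive learned softmax weights, and an activation-scaling
    factor compensates for the groups that were switched off.
    """
    def __init__(self, num_subdims=16, subdim_size=256):
        super().__init__()
        self.num_subdims = num_subdims
        self.subdim_size = subdim_size
        self.fuse_logits = nn.Parameter(torch.zeros(num_subdims))

    def forward(self, x, group_mask):
        # x: [batch, seq, hidden], group_mask: [batch, seq, num_subdims] in {0, 1}
        b, s, _ = x.shape
        groups = x.view(b, s, self.num_subdims, self.subdim_size)
        # softmax weights restricted to the activated groups
        logits = self.fuse_logits.expand(b, s, -1).masked_fill(group_mask == 0, float("-inf"))
        weights = F.softmax(logits, dim=-1).unsqueeze(-1)        # [b, s, n, 1]
        # activation scaling: compensate for deactivated groups
        active = group_mask.float().sum(-1, keepdim=True).clamp(min=1.0).unsqueeze(-1)
        scale = self.num_subdims / active
        return (groups * weights * scale).reshape(b, s, -1)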
Q5: Limitations of MoHD
R5: We appreciate the chance to elaborate on current limitations:
- Hyperparameter Sensitivity: MoHD relies on key hyperparameters: the sparsity ratio, the shared-dimension proportion, and the total number of sub-dimensions. Tuning the balance between shared and specialized sub-dimensions is especially important for optimizing generalization vs. specialization.
- Routing Optimization Challenges: Unlike MoE's routing, MoHD activates sub-dimensions within a single matrix, making routing more complex. While effective, our current routing loss becomes less stable as the number of sub-dimensions increases, limiting scalability.
- WTE Layer Growth: As MoHD expands the effective width, the WTE layer scales proportionally, contributing to a non-negligible share of total parameters. This can offset efficiency gains at very large scales.
- Information Degradation at Scale: Despite using activation scaling and group fusion, large-scale downsampling and softmax weighting may still cause skewed distributions, suppressing useful but low-weighted sub-dimensions. This can reduce representational fidelity, especially under extreme sparsity.
Q6: Extend Reference
R6: We sincerely thank the reviewer for the additional references. We will include a detailed discussion of them in the next revision.
This paper proposes an LLM sparsification method, namely Mixture of Hidden Dimensions (MOHD), to improve the efficiency of Transformer-based LLMs. Based on the observation that only a small subset of hidden dimensions is shared and activated across tokens in given texts, MOHD selectively discerns shared and token-specific sub-dimensions. In this way, not all hidden dimensions are utilized, which improves parameter efficiency while largely retaining performance competitive with the original LLMs. Experimental results demonstrate the effectiveness of the proposed method.
Questions for Authors
What are the computational costs of training and inference with MOHD compared to MoE and other sparsification methods?
Claims and Evidence
The experiments are basically sufficient. The model is established based on the observation, which provides empirical evidence for the motivation. To achieve the sparsification, the paper further proposes a routing mechanism. Experimental results indicate the effectiveness of the proposed method.
Methods and Evaluation Criteria
The paper conducts experiments on benchmark datasets and tasks with the widely used LLaMA architecture. The method evaluation is convincing to me.
Theoretical Claims
The model is established based on the observation that not all hidden dimensions are utilized in LLMs for tasks. Though the underlying design choices of the specific modules could be further illustrated, most parts are reasonable. Experimental results validate the method designs.
Experimental Design and Analysis
The paper conducts comparison experiments, ablation studies, and parameter analyses. The appendix also provides details of the experimental design and analysis.
Supplementary Material
Yes.
Relation to Prior Work
The proposed method is related to general efficiency studies on LLMs. The research topic is important, and this paper may draw some interests in the community.
Essential References Not Discussed
Most are cited. More discussion of and connection to existing MoE studies may be helpful for understanding the respective contributions.
Other Strengths and Weaknesses
Strengths:
- The motivation is mostly clear. The preliminary experiments and observations are inspiring.
- The proposed method is somewhat interesting and mostly reasonable.
- The experiments demonstrate the effectiveness of the proposed method.
Weaknesses:
- Some details of the specific modules are not clear. The routing mechanism and activation flow could be further illustrated, and the specific motivation of each module could be better explained.
- The computational costs at different scales can be further clarified.
- It is not clear whether the proposed method can be applicable to larger LLMs.
- The writing can be further improved. The clarification of symbols can be detailed for better understanding.
Other Comments or Suggestions
The paper is well-written, but technical explanations of the routing mechanism would benefit from further clarification. A more explicit comparison with MoE and other sparsification methods in terms of scalability would also enhance the work.
Ethics Review Concerns
No.
Thank you for the valuable comments and suggestions.
Q1: Comparison with MoE and other sparsification methods
R1: We respectfully refer to our responses to Reviewer fdJw (Q2, Q3) for a detailed comparison between MoHD and MoE, especially regarding routing design and efficiency. Comparisons with other sparsity-based methods are addressed in Reviewer 3ShX, Q1.
Q2: The illustration and motivation of routing mechanism and activation flow
R2: Thank you; we will further improve the writing. A detailed explanation is provided below:
- The motivation for introducing Sub-dimension Scaling comes from our analysis of activation patterns in Transformers (Section 2.2, Figure 10). We observed that after sparse activation, activation magnitudes tend to decrease, potentially causing information loss. To address this, we introduced a scaling factor that adjusts the outputs of activated sub-dimensions to restore the original magnitude. This factor is dynamically computed based on the number of activated sub-dimensions, helping to preserve consistent activation flow (an illustrative formulation is given after this answer).
- Our Dynamic Routing mechanism is motivated by the observation that many hidden dimensions have consistently low activation values, indicating redundancy. To reduce this, we designed a routing method that dynamically selects sub-dimensions based on input token characteristics, allowing the model to adaptively adjust its activations and enhance representational capacity. We also observed shared activation patterns across tokens and introduced shared sub-dimensions to capture these common features, reducing the complexity of routing and improving training efficiency. Experiments confirm that this approach yields notable improvements.
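As a concrete illustration of the scaling factor (our own formulation; the paper's exact definition may differ), one natural choice rescales by the ratio of total to activated sub-dimensions:

```latex
% Illustrative scaling rule (assumption, not necessarily the paper's exact definition):
% N = total number of sub-dimensions, N_act = sub-dimensions activated for the token.
\alpha \;=\; \frac{N}{N_{\mathrm{act}}},
\qquad
\tilde{h} \;=\; \alpha \,\bigl(m \odot h\bigr),
\qquad
\mathbb{E}_{m}\bigl[\tilde{h}\bigr] \;=\; h
\quad\text{for a uniformly random mask } m \text{ with } \textstyle\sum_i m_i = N_{\mathrm{act}}.
```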
Q3: MoHD's applicability to larger LLMs
R3: Thank you for the suggestion. We believe applying MoHD to larger-scale language models offers several potential benefits:
- Improved parameter efficiency: Larger models typically suffer from higher parameter redundancy. This makes MoHD's sparse activation mechanism more effective, as it helps reduce inefficiencies without sacrificing expressive power.
- Accelerated performance gains with scale: As shown in Table 1, under the same proportion of activated parameters, the performance improvement of MoHD over the baseline tends to grow as the total parameter count increases. This suggests that the advantages of MoHD may scale positively with model size.
- Enhanced representation and generalization capacity: By combining shared and specialized sub-dimensions, MoHD captures both general features and fine-grained, token-specific patterns. Scaling up the model increases the size of each sub-dimension, potentially improving its ability to model complex language phenomena and enhancing generalization across tasks.
Due to the significant time and computational costs associated with pretraining larger LLMs, we were unable to include full-scale experiments in the current version. However, we plan to include these extended results in a future revision to address your suggestion more thoroughly.
Q4: The computational costs at different scales can be further clarified
R4: Thank you for this valuable suggestion. To provide a clearer view of the computational overhead, the following table presents the theoretical forward-pass FLOPS for models equipped with Word Token Embeddings (WTE), routing, and group fusion layers under different configurations:
| Model Size | MoHD 50% | MoHD 75% | Baseline 100% | 2× Width | 3× Width | 4× Width |
|---|---|---|---|---|---|---|
| 355M | 2.70E+12 | 3.63E+12 | 4.56E+12 | 5.40E+12 | 6.24E+12 | 7.07E+12 |
| 495M | 4.19E+12 | 5.59E+12 | 7.00E+12 | 8.40E+12 | 9.73E+12 | 1.11E+13 |
| 1.13B | 6.93E+12 | 9.80E+12 | 1.26E+13 | 1.40E+13 | 1.55E+13 | 1.69E+13 |
As shown above, although increasing model width leads to a proportional increase in WTE-related FLOPS, the relative impact on total FLOPS remains modest. This behavior is partly attributed to the architecture design: deeper models with a higher depth-to-width ratio benefit more from MoHD, as the ratio of activated parameters to total parameters becomes more efficient during scaling.
It’s also important to note that the performance gains of MoHD are not merely a result of increased WTE size or parameter count. For instance:
- The FLOPS of the 4×-width model is significantly higher than that of the 3×-width model, yet the performance improvement is marginal.
- The 1.13B model benefits more from MoHD in terms of performance improvement compared to the 495M model, even though its FLOPS-to-baseline ratio is actually lower.
These observations reinforce our argument that MoHD achieves efficiency primarily through structural sparsity and expert specialization, not brute-force scaling.
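For transparency, the sketch below shows how such theoretical forward-pass FLOPs estimates can be computed for a decoder-only model; the constants, the treatment of `act_ratio`, and the example configuration are our assumptions and may not match the paper's exact accounting:

```python
def forward_flops(n_layers, d_model, d_ff, vocab_size, seq_len, act_ratio=1.0):
    """Rough per-sequence forward FLOPs for a decoder-only Transformer.

    Uses the standard 2*m*n*k matmul count. `act_ratio` sparsifies one side of
    the hidden-dimension projections (as in MoHD 50%/75%); routing and fusion
    overheads are omitted as comparatively small.
    """
    d_act = d_model * act_ratio
    attn_proj = 2 * seq_len * (4 * d_act * d_model)   # Q, K, V, O projections
    attn_score = 2 * seq_len * seq_len * d_model * 2  # QK^T and attention @ V
    ffn = 2 * seq_len * (2 * d_act * d_ff)            # up and down projections
    wte = 2 * seq_len * vocab_size * d_model          # embedding / LM-head matmul
    return n_layers * (attn_proj + attn_score + ffn) + wte

# Hypothetical configuration, not the paper's exact 355M setting:
print(f"{forward_flops(24, 1024, 2816, 32000, 2048, act_ratio=0.5):.2e}")
```

The `wte` term grows only with width and vocabulary size, which matches the observation above that the WTE contribution stays modest relative to total FLOPs.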
Q5: The clarification of symbols.
R5: We apologize for the confusion and will provide clearer and more readable content in the next revision.
This paper proposes MoHD (Mixture of Hidden Dimensions), an architecture that optimizes hidden dimension usage via dynamic routing between shared and token-specific dimensions. Experiments demonstrate its superior parameter efficiency and task performance over existing models.
Questions for Authors
- Why does the 495M MoHD underperform on WinoGrande (WG)?
- Is it possible to apply the existing MoE model and Dense model to the MoHD architecture?
- Could you describe in more detail the differences between the MoHD and MoE approaches in terms of routing optimization, performance efficiency, etc.? Can these two approaches be combined?
- Can you provide some theoretical basis for MoHD?
Claims and Evidence
The paper provides empirical evidence to support claims. The authors empirically show sparse activation patterns in hidden dimensions across tokens: some shared, and others token-specific. Extensive experiments validate the performance gains with low computational overhead. Ablation studies confirm component efficacy.
Methods and Evaluation Criteria
I think the method design is mostly reasonable. Though hidden dimension sparsity is known, cross-token activation modeling is novel and justifies shared sub-dimensions. Evaluations use standard NLP datasets and metrics, aligning with community norms.
Theoretical Claims
I think the theoretical contributions are mostly based on the observations of activation patterns and routing strategies. While the motivations are intuitive, further analyses can be enhanced.
Experimental Design and Analysis
I think the experiments are mostly convincing.
Supplementary Material
NA
Relation to Prior Work
NA
Essential References Not Discussed
NA. Most related works have been cited.
Other Strengths and Weaknesses
I think this paper has strengths as:
- Novel Approach.
- Clear writing.
- Robust generalization across NLP tasks
- Persuasive ablation studies.
For the shortcomings of this paper, please refer to Questions.
Other Comments or Suggestions
There are some terminology and symbols requiring more clarification:
- Terminology clarification: Does dimension refer to embedding size, specific dimensions, or all dimensions?
- Symbol clarification is needed, e.g., in Equation 3.
Thank you for the valuable comments and suggestions.
Q1: On MoHD-495M’s performance on WinoGrande (WG)
R1: We respectfully clarify that MoHD-495M does not consistently underperform on WG. In fact, our MoHD 50%-495M model achieves 52.7%, outperforming the LLaMA2-495M baseline (51.3%). At larger scales (e.g., 1.13B), MoHD shows clear gains under multiple configurations (75% x2, x3, x4), indicating strong adaptability.
That said, performance dips in certain settings may stem from:
- WG-specific reasoning demands: WG requires fine-grained commonsense inference, which may benefit from further adaptation of MoHD’s routing and activation strategies.
- Routing sensitivity: Some configurations may have suboptimal routing for WG due to overfitting to other tasks. Ensuring robustness across specialized reasoning datasets like WG is a valuable direction for future work.
Q2: On integrating MoE into MoHD
R2: We appreciate this suggestion. Yes, integration is theoretically feasible. MoHD introduces sparsity along the hidden dimension (width), while MoE sparsifies the intermediate dimension (length). These operate orthogonally and, in principle, could be combined for multi-dimensional sparsity, potentially improving parameter efficiency.
However, such integration would introduce significant optimization and engineering complexity. Further empirical studies are needed to assess whether their combination yields synergistic or conflicting effects.
Q3: MoHD vs. MoE
R3: Key differences:
- Routing granularity: MoHD routes at the hidden dimension level, tailoring token-specific subspace activations. MoE routes at the expert (subnetwork) level in the FFN.
- Component scope: MoHD applies to both Attention and FFN, while MoE is typically confined to FFN projections.
- Capacity and interpretability: MoHD expands model width, increasing per-token representation capacity. MoE expands the FFN depth, aiding memory but not width.
- Efficiency: MoHD reduces redundant activations in both Attention and FFN, offering better scaling under fixed activation budgets (see Table 2).
- Challenges: MoHD’s routing across hidden dimensions is less explored and demands novel optimization and implementation.
Despite these challenges, MoHD shows notable improvements over MoE when trained from scratch, highlighting its promise for efficient scaling.
Q4: Theoretical Proof
R4: We prove that mixed sparse activation achieves strictly better risk bounds.
Lemma 1 (Unbiased Sparse Forward Pass).
Let $h = Wx \in \mathbb{R}^{d}$ be a hidden layer. Apply a random mask $m \in \{0,1\}^{d}$ with $\mathbb{E}[m_i] = p$, and define the sparsely activated output as $\tilde{h} = \tfrac{1}{p}\, m \odot h$. Then $\mathbb{E}[\tilde{h}] = h$.
Proof: Linearity of expectation and variance decomposition.
Corollary 1.1 (Effective Width).
For $p \in (0,1)$, training with sparse activation while activating only $pd$ of the $d$ hidden dimensions per token is equivalent (in expectation) to training a full network with width $d > pd$. Thus, sparse activation expands effective width under a fixed activation budget.
Lemma 2 (Approximation Error Decay with Width [Barron, 1993]).
Let $f^{*} \in \mathcal{B}$, where $\mathcal{B}$ is a Barron space. For a network $f_{d}$ with width $d$, the approximation error satisfies
$$\inf_{f_{d}} \|f^{*} - f_{d}\|_{L^{2}}^{2} \;\le\; \frac{C_{f^{*}}^{2}}{d},$$
where $C_{f^{*}}$ is the Barron norm of $f^{*}$.
Corollary 2.1 (Mixed Activation Lowers Error).
Define $f_{\theta}(x) = g\bigl(W_{\mathrm{sh}} x\bigr) + h_{t}\bigl(W_{t} x\bigr)$, where $W_{\mathrm{sh}}$ encodes global context, $g$ decodes shared features, and $h_{t}(W_{t}\,\cdot)$ models per-token contributions; $\theta$ denotes all parameters.
If $f^{*} = f_{\mathrm{sh}} + f_{\mathrm{tok}}$ with non-constant $f_{\mathrm{sh}}$, then:
- Token-only networks ($W_{\mathrm{sh}} = 0$) have approximation error $\varepsilon_{\mathrm{tok}} \ge C_{f^{*}}^{2} / d_{\mathrm{tok}}$.
- Mixed networks ($W_{\mathrm{sh}} \neq 0$) have approximation error $\varepsilon_{\mathrm{mix}} \le C_{f_{\mathrm{tok}}}^{2} / d_{\mathrm{tok}}$, since the shared component is captured by $g(W_{\mathrm{sh}}\,\cdot)$.
Choosing a non-zero shared proportion yields $\varepsilon_{\mathrm{mix}} < \varepsilon_{\mathrm{tok}}$ for $C_{f_{\mathrm{sh}}} > 0$.
Lemma 3 (Rademacher Complexity of Shared Dimensions).
Let $\mathcal{F}_{\mathrm{mix}}$ (mixed) and $\mathcal{F}_{\mathrm{tok}}$ (token-only) have equal total parameters. Then
$$\mathfrak{R}_{n}(\mathcal{F}_{\mathrm{mix}}) \;\le\; \mathfrak{R}_{n}(\mathcal{F}_{\mathrm{tok}}),$$
where $\mathfrak{R}_{n}$ is the empirical Rademacher complexity for $n$ samples.
Proof: Shared dimensions reduce the effective VC dimension [Bartlett, 1998]; apply the Dudley entropy integral.
Theorem (Risk Bound).
For the risk $\mathcal{R}(f) = \mathbb{E}[\ell(f(x), y)]$:
$$\mathcal{R}(f_{\mathrm{mix}}) \;\le\; \varepsilon_{\mathrm{mix}} + 2\,\mathfrak{R}_{n}(\mathcal{F}_{\mathrm{mix}}) + O\!\Bigl(\sqrt{\tfrac{\log(1/\delta)}{n}}\Bigr).$$
Sparse activation (i) expands effective width, which (ii) lowers approximation error when shared dimensions are included, while (iii) shared dimensions reduce Rademacher complexity, collectively proving that the risk bound of the mixed network is strictly smaller than that of the token-only network.
This paper introduces MoHD (Mixture of Hidden Dimensions), a novel Transformer architecture that improves parameter efficiency by leveraging hidden dimension sparsity. The key insight is that only a subset of hidden dimensions are highly activated, with some dimensions commonly activated across tokens while others are token-specific. MoHD combines shared sub-dimensions for common features with dynamically routed specialized sub-dimensions per token. The authors demonstrate strong empirical results across 10 NLP tasks, achieving 1.7% higher performance with 50% fewer activated parameters and 3.7% improvement with 3× parameter expansion at constant activation cost.
The reviewers found the paper's motivation clear and the methodology sound, with comprehensive experiments validating the approach. They particularly appreciated the novel perspective on scaling model architectures through hidden dimension sparsity. The theoretical foundations were strengthened through the authors' rebuttal, which provided detailed proofs of how mixed sparse activation achieves better risk bounds.
Several constructive suggestions were raised regarding clarification of the routing mechanism, computational costs at different scales, and comparisons with other sparsification methods. The authors provided thorough responses and committed to incorporating these improvements. The paper would benefit from expanded discussion of limitations, scalability to larger models, and potential applications beyond NLP.
The revised version is expected to fully address these points, particularly by including clearer explanations of the routing dynamics, comprehensive cost analyses across model scales, and extended comparisons with related approaches. With these additions, this work will make a valuable contribution to efficient scaling of language models.