W1. SoftMoE Formulation: The SoftMoE formulation presented in Section 2 of this paper differs from that of the original SoftMoE paper [1]. In the original formulation, each expert processes slots, making and . However, in this paper, the parameter is not included. Notably, setting in the original formulation results in an incomparability with other papers using SoftMoE, where is recognized as a significant hyperparameter. Furthermore, in line 243, the authors cite [2], claiming that this work demonstrates that even a single-expert SoftMoE can outperform traditional architectures. This comparison is inaccurate, as is not set to 1. Specifically, after reviewing the code, I observed that ; see lines 55-64 in this GitHub file: Link.
W2. Theoretical Contribution: I don't find the results presented in Theorem 1 insightful.
W2.1. Trivial Result: Ignoring (the number of slots) in the SoftMoE formulation leads to a trivial outcome for . Here, matrix becomes an all-one vector of size instead of a matrix of size , resulting in identical output tokens and thereby ignoring any temporal information in the tokens, effectively replacing them with a fixed token. This trivializes the ineffectiveness of SoftMoE with , which is not the case when .
W2.2. Notion of Approximation: The notion of approximation presented by the authors is non-standard. I encourage the authors to provide references or pointers to similar approaches in the literature to clarify this aspect.
W2.3. Proof Technique: After reviewing Appendix A, I noticed that the proof relies on a special case where a contradiction arises as matrix norms approach infinity. This is acknowledged by the authors in Section 3, where they mention that normalizing the input makes the results from Theorem 1 inapplicable.
W3. Lack of comparison for Algorithm 1: The paper lacks a comparison with the sparse routing mechanism considered in the original SoftMoE paper [1] (cf. Section 3.2). (Correct me if I am wrong)

References:
[1] Puigcerver, Joan, et al. "From sparse to soft mixtures of experts." arXiv preprint arXiv:2308.00951 (2023).
[2] Obando-Ceron, Johan, et al. "Mixtures of experts unlock parameter scaling for deep RL." arXiv preprint arXiv:2402.08609 (2024).