PaperHub
4.9 / 10
Poster · 4 reviewers (scores: 2, 4, 3, 2; min 2, max 4, std 0.8)
ICML 2025

Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
multi-task RL, offline RL, Decision Transformer, Mixture-of-Experts

Reviews and Discussion

Review (Score: 2)

This paper studies the Massive Multi-Task Reinforcement Learning problem and proposes M3DT. When the number of tasks is large, it shows great performance improvement compared with former methods.

Questions for Authors

Please check weaknesses.

Claims and Evidence

In section 3.1, the authors propose an insight "Reducing the learning task number, particularly to a sufficiently small scale can significantly enhance model performance" according to Figure 2. However, I do not feel the logic is smooth here. In my opinion, Figure 2 only shows that the more tasks there are, the more conflict and worse performance they have. This paper studies massive MTRL where the number of tasks is larger, say 160. Then Figure 2 does not tell us reducing 160 tasks to a small number of groups, say 20, can enhance the performance. First, the performance may highly depend on the grouping strategies. Second, there may still exist conflict within task groups. Thus, in terms of the presentation, this claim is not appropriate.

Methods and Evaluation Criteria

Yes, the evaluation and methods make sense to me.

Theoretical Claims

There is no theoretical analysis in this paper.

Experimental Design and Analysis

This paper studies the massive multi-task RL setting where the number of tasks could be 160, which is sound to me.

Supplementary Material

There is no attached supplementary material. However, I have checked the appendix for implementation.

Relation to Prior Literature

This paper is highly related to current literature in this area.

Missing Essential References

I am not sure whether all essential references are included.

Other Strengths and Weaknesses

  1. In Figure 4, the model architecture is drawn from bottom to top. However, the small figures for those tasks are drawn above the output layer with a "down arrow". This may raise some confusion.
  2. The description of the method/algorithm is not clear, especially for the grouping part. I have the following questions but could not find the answers in the paper: a) How is the number of groups determined? b) What is the grouping frequency, e.g., grouping every epoch? c) M3DT-G needs to calculate all gradients, so the computation cost is pretty high. d) Is there any special treatment for tasks within the same group? For example, do the authors treat them as equally important and perhaps use linear scalarization?
  3. In Figure 5 on the left, why is there a performance drop from 98M to 123M?
  4. I appreciate the ablation study in section 5.2 to support the method design. However, I do not understand why the expert training and router training are separated, because they are trained simultaneously in the original MoE with a load-balancing loss. Why don't the authors simply group tasks and train the MoE layers (router + experts)?
  5. Lastly, this paper studies the MTRL problem. Nevertheless, I feel the methods are general and can be applied in supervised MTL as well. What specific challenges in MTRL do authors find and solve? The novelty in the current version looks like using grouping and MoE in MTL.

I am happy to change my mind if these questions are answered or if I misunderstand the paper.

Other Comments or Suggestions

Please check weaknesses.

Author Response

We sincerely thank the reviewer for the thoughtful feedback and questions, which will surely help us improve the paper.

Q1. The logic in insight "Reducing ..." is not smooth. Fig.2 does not tell us reducing 160 tasks to a small number of groups can enhance the performance. First, the performance may highly depend on the grouping strategies. Second, there may still exist conflict within task groups.

A1. (1) As stated in Section 3.1 (bolded "Performance Variance"), we test on a different task set in each run. The STD reflects the model’s robustness to variations in task combinations. When the task count is reduced to 10, the model exhibits minimal STD, demonstrating that the good performance is robust to the grouping strategy.

Note: The specific criteria for constructing task subsets can be found in our response to Reviewer is3P.

(2) The gradient similarity on 10 tasks is 0.29 in Fig.2, indicating that gradient conflicts still exist within task groups. However, compared to massive task sets, the gradient conflict is significantly reduced when training on 10 tasks, and the model achieves better scores.

Therefore, by interpreting Fig.2 from the reverse perspective (from right to left), we derive our robust insight. Importantly, this insight does not advocate focusing solely on small-scale task sets. Given the predetermined number of tasks the model must solve, how to reduce the number of tasks each part of the model has to learn is precisely the problem we raise in Section 3.2 and address with our method.

Q2. In Figure 4, the model is drawn bottom-top, but the task sets appear above output layer with downward arrows, which may cause confusion.

A2. Thanks for identifying this. We have uploaded the revised figure at https://anonymous.4open.science/r/ICML_rebuttal_11746/revised_fig4.JPG and will replace it in the revised manuscript.

Q3. How to determine the number of groups?

A3. We align the number of task groups with the number of experts, with each expert dedicated to a distinct task subset. Increasing the number of groups means more experts and thus more model parameters. Through the experimental analysis in Fig.5 and Fig.7, we identify the optimal group (expert) counts for different task scales, which reflects the trade-off between task-subset learning difficulty and gating difficulty. We will add these details to the revised manuscript.

Q4. What is the grouping frequency?

A4. Task grouping is performed only once, after completing the first stage. The overall workflow is: backbone training in the first stage, task grouping, independent training of each expert on its fixed task group in the second stage, and router training in the third stage. The algorithm concludes after the third stage. Each stage executes once in the entire training process, with the detailed training iterations per stage specified in Appendix A.5. We will supplement this with pseudocode in the revised manuscript.
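For clarity, a rough Python-style pseudocode sketch of this three-stage workflow follows; the helper names train_backbone, group_tasks, train_expert, and train_router are hypothetical stand-ins for the actual training loops, so this is only an illustrative sketch, not the exact implementation.

    # Rough pseudocode of the three-stage M3DT workflow
    # (hypothetical helper names, not the actual implementation).
    def train_m3dt(datasets, num_experts):
        # Stage 1: train the shared PromptDT backbone (and predict head)
        # jointly on all tasks, then freeze them.
        backbone = train_backbone(datasets)

        # Task grouping: performed exactly once, after stage 1
        # (randomly for M3DT-R, or via K-means on task gradients for M3DT-G).
        groups = group_tasks(datasets, backbone, num_groups=num_experts)

        # Stage 2: each expert is trained independently (and in parallel)
        # on its fixed task group; backbone and predict head stay frozen.
        experts = [train_expert(backbone, [datasets[t] for t in g]) for g in groups]

        # Stage 3: train only the router to mix the frozen experts on all tasks.
        router = train_router(backbone, experts, datasets)
        return backbone, experts, router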

Q5. M3DT-G needs to calculate all gradients. Computation cost is high.

A5. Since task grouping is performed only once, M3DT-G requires merely a single additional computation of all tasks' gradients followed by K-means clustering, which is negligible compared to the overall training cost.
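For illustration, a minimal sketch of this grouping step is shown below (assuming one flattened, averaged gradient vector per task and standard Euclidean K-means via scikit-learn; any additional preprocessing, such as gradient normalization, is omitted):

    import numpy as np
    from sklearn.cluster import KMeans

    def group_tasks_by_gradient(task_grads, n_groups, seed=0):
        # task_grads: (n_tasks, n_params) array, one averaged gradient
        # vector per task, computed once after stage-1 training.
        labels = KMeans(n_clusters=n_groups, n_init=10,
                        random_state=seed).fit_predict(np.asarray(task_grads))
        # Return, for each group/expert, the indices of its assigned tasks.
        return [np.flatnonzero(labels == g) for g in range(n_groups)]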

Q6. Is there any special treatment for tasks within the same group?

A6. We do not apply any special processing or treatment to any of the tasks.

Q7. In Figure 5 (left), why is there a performance drop from 98M to 123M?

A7. As analyzed in Section 5.2 (lines 402-415) and Fig.7, the overall performance is a trade-off between expert performance and gating difficulty. When the expert count exceeds a threshold, further increasing the expert count no longer improves expert performance, while the difficulty of the router's weight assignment across experts continues to increase. Thus, the performance plateaus or even degrades.

Q8. Why do not authors simply group tasks and train MoE?

A8. Through experiments, we evaluate this ablation: group tasks and train the MoE (simultaneously training the router and the task-ID-matched expert while freezing the other experts), which achieved 71.21, compared to 77.89 in Table 2. The performance drop primarily stems from the instability caused by alternating optimization of different experts. This variant also fails to leverage the second stage, in which experts can be trained independently and in parallel, resulting in a significantly prolonged training duration.

Q9. I feel the methods are general and can be applied in supervised MTL. What specific challenges in MTRL do authors find and solve?

A9. In RL, task distinctions are more pronounced and the decision-making process is more sensitive than in regression and classification. While scaling up model parameters has proven effective for handling numerous tasks in other fields, it falls short in RL. This is the core challenge that our paper addresses. Our method can be readily adapted to other supervised MTL domains by simply replacing the backbone with other task-specific models. We hope this work can inspire new research in MTL and advance the field.

Review (Score: 4)

Transformer-based models have recently shown success in offline reinforcement learning by framing the problem as a sequence modeling problem. Moreover, offline multi-task reinforcement learning (MTRL) has benefited from the high capacity of these models for solving complex and diverse tasks. Nevertheless, as the number of tasks massively increases, the overall performance starts to drop tremendously. Naively scaling the transformer-based models doesn't counteract the drop in performance caused by scaling the number of tasks; the performance of the model eventually saturates as the model capacity increases. Accordingly, we are not gaining from the scalability of transformer-based models in MTRL as is done in the supervised learning setting. In this work, the authors propose a different way to scale the model by learning different experts for different groups of tasks. With that, scaling the model takes the form of increasing the number of experts, hence allowing a larger number of tasks to be learned. Moreover, this reduces the conflicts between tasks, since the number of tasks per expert can be lower. The proposed method, named M3DT, consists of three stages of learning: a backbone transformer model, experts, and a router model. The approach has been evaluated on 160 tasks, a mix of tasks from Meta-World, DM Control, and continuous control problems from MuJoCo, against relevant and recent offline MTRL baselines with transformer-based architectures.

Questions for Authors

  • How long is the training for M3DT compared to the other baselines?
  • The expert performance seems better than the performance with the router; why do we need a router then? Am I missing something?
  • Why do you freeze the predict head even though its input changes due to the training of the router in the third phase?

I would be happy to raise my score if my concerns are addressed.

Claims and Evidence

  • In my opinion, the claims presented in this work are quite clear and well-motivated.
  • The claims are supported by two studies (in Figures 2 and 3), which makes them very convincing.

Methods and Evaluation Criteria

  • I find the proposed approach, named M3DT, sound for tackling the problem of multi-task reinforcement learning in the offline setting.
  • The massive number of tasks used for evaluation is interesting and quite motivating for the effectiveness of the approach.

Theoretical Claims

  • There are no theoretical claims presented in this work.

Experimental Design and Analysis

  • I highly appreciate the amount and the diversity of the experiments and ablations done in this work.
  • As mentioned before, I like the two studies presented in Figures 2 and 3.

Supplementary Material

  • I checked the supplementary materials.

Relation to Prior Literature

  • I believe this work is important to close the gap between supervised learning and reinforcement learning in terms of the scaling of the transformer-based models.
  • I agree with the authors that offline MTRL has been limited to dozens of tasks without looking into more realistic scenarios that we expect from offline RL.
  • Hopefully, this will also push the online MTRL setting to handle such massive task numbers.

Missing Essential References

  • I have no recommendations for papers to be cited.

Other Strengths and Weaknesses

  • I found some weaknesses in this work:
    • The limitations of this approach are not well-discussed.
    • One important limitation, in my opinion, is the training time since M3DT requires 3 stages of training. I would advise the authors to indicate the training time for the proposed approach, compared to the baselines.
    • I believe the future works were never clearly stated in this work. It is important to understand how to fix some limitations in this work or how to extend this work to future directions.
    • In Figure 7, it seems that the performance of the proposed approach eventually saturates even when the number of experts increases. I would encourage the authors to comment on that.

Other Comments or Suggestions

  • In section 2 (Preliminaries), subsection 2 (Multi-Task RL), second line, there is a small typo in funct(u)ions.
Author Response

We sincerely appreciate your recognition of our work. We offer our responses to address your concerns as follows and will supplement these details and correct the typo in the revised manuscript.

Q1. The limitations are not well-discussed. It is important to understand how to fix some limitations or extend this work to future directions.

A1. We primarily focus on the challenges in MTRL, analyzing the scalability of task numbers and parameter size, and proposing a paradigm that increases model parameters to achieve task scalability. However:

(1) we did not focus on fine-grained network architecture design, such as optimizing the expert and router architecture, which could potentially further improve performance and reduce model size.

(2) While we leave held-out task generalization and continual learning unexplored, our method’s modular parameterization and group-wise learning naturally support solutions like held-out task adaptation via expert learning, dynamic skill composition for held-out tasks, and forgetting mitigation for the router for past tasks. Addressing these areas presents promising future work.

(3) Scaling experts with task count raises inference costs. While we prioritize performance, we experimented with activating Top-k experts with fixed k to control inference overhead, but it degrades performance. A more tailored Top-k gating mechanism, better aligned with our three-stage training paradigm, could enhance our practical applicability.

Q2. The training time for the proposed approach compared to the baselines.

A2. Our experiments were conducted on an RTX 4090. We train PromptDT-5M for 4e5 steps in the first stage, taking around 5.2 hours. In the second stage, all experts are trained independently on their dedicated task groups in parallel, taking around 1.8 hours (the time for training a single expert). In the third stage, we train the router for 4e5 steps, taking 17.2 hours. The reason for the long training time in the third stage is that our code lacks parallel computation across experts; code optimization may reduce this training time. The total training time of M3DT-174M is around 24.2 hours. For comparison, training PromptDT-173M and MTDT-173M each takes around 21.3 hours, while training HarmoDT-173M requires 95.6 hours.

Q3. In Fig.7, the performance eventually saturates even when the number of experts increases.

A3. As stated in Section 5.2 (lines 402-415), the performance plateaus for two main reasons:

(1) The router's weight allocation becomes increasingly difficult with more experts;

(2) Each expert's task load becomes small enough to achieve optimal performance, preventing further improvement in individual expert performance (dashed line).

Thus, the overall performance eventually saturates.

Q4. The expert performance seems better than the performance with the router; why do we need a router then?

A4. We apologize for the confusion. The "Expert Performance" is obtained by manually selecting each test task's corresponding expert based on its task ID and individually evaluating their performance. However, this approach is impractical for large-scale tasks in our overall method evaluation and in real-world applications: (1) the specific task IDs are usually unavailable, and (2) manual expert selection is prohibitively expensive or impossible. Therefore, a unified policy is essential. The router automatically assigns weights to each expert based on the backbone's hidden states, integrating all sub-policies into a unified policy without needing task IDs. This necessitates the router. We consider it reasonable that the framework's overall performance may be lower than the ideal case of perfect expert switching, as increasing the number of experts raises the difficulty of optimal weight allocation. We will rename "Expert Performance" to "Oracle Expert Selection Performance" in the revised manuscript and add the above explanation.

Q5. Why freeze the predict head while the corresponding input has been changed because of the training of the router?

A5. Through experiments, we evaluate the performance of M3DT-G with simultaneous training of both router and predict head, achieving a score of 76.86. This result is slightly lower than our original algorithm's 77.89, but outperforms other ablation variants. The rationale for freezing the predict head is:

(1) The predict head trained in the first stage already contains knowledge for all tasks, and continued training of the backbone parameters across all tasks would lead to overfitting due to severe gradient conflicts (as shown in Fig.6);

(2) In the second-stage expert training, we also freeze the predict head, so all learned experts' outputs remain compatible with the predict head's input. In the third stage, the router performs a weighted sum of all expert outputs (with weights normalized by Softmax to sum to 1), which means the entire MoE's final output can be viewed as equivalent to a single expert's output, thereby preserving compatibility with the predict head's expected input.
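To make this concrete, a minimal PyTorch sketch of the mixing step is given below (module names and tensor shapes are assumptions): because the router weights are softmax-normalized, the mixed output is a convex combination of expert outputs and therefore stays in the space the frozen predict head was trained on.

    import torch

    def mix_experts(hidden, experts, router):
        # hidden: (batch, seq, d_model) hidden states from the frozen backbone.
        weights = torch.softmax(router(hidden), dim=-1)           # (B, T, n_experts)
        outs = torch.stack([e(hidden) for e in experts], dim=-1)  # (B, T, d_model, n_experts)
        mixed = (outs * weights.unsqueeze(-2)).sum(dim=-1)        # (B, T, d_model)
        return mixed  # passed unchanged to the frozen predict head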

Reviewer Comment

I would like to thank the authors for addressing my concerns and answering my questions. Accordingly, I will increase my score.

Author Comment

Thanks for your reply. We appreciate your recognition of our responses. Your comments have been very helpful in refining our revised version, and we will incorporate these discussions accordingly.

Review (Score: 3)

This paper introduces M3DT, a novel mixture‐of-experts (MoE) extension of the Decision Transformer designed to tackle the scalability challenges in massive multi-task reinforcement learning. The method leverages task grouping, dedicated expert modules, and a three-stage training mechanism to reduce gradient conflicts and enable effective parameter scaling, demonstrating improved performance across up to 160 tasks on challenging benchmarks.

Update after rebuttal: I appreciate the authors’ thorough revision and the additional explanation addressing my concerns. I have decided to keep my original score.

Questions for Authors

  • Could you elaborate more on how you measured “gradient conflict”? Did you observe the frequent occurrence of gradient conflicts in cases with more tasks?
  • Can you elaborate on how the three-stage training mechanism specifically prevents overfitting to dominant tasks?
  • Could the approach be extended to online multi-task settings, and if so, what modifications would be necessary?
  • When should we use M3DT-Random? and M3DT-Gradient? What’s the criteria suggested by the authors?
  • Regarding the router component, what specific design choices (e.g., architecture of the MLP) led to its effectiveness in dynamically allocating expert weights?
  • The normalized score is a key evaluation metric across diverse tasks. Could you elaborate on the normalization process used for each benchmark?

Claims and Evidence

The authors claim that naive parameter scaling fails as task numbers increase and that their proposed MoE approach, with its task-specific grouping and staged training, significantly mitigates performance degradation. These claims are supported by extensive experiments and ablation studies comparing M3DT with existing baselines, although additional discussion on the statistical significance of improvements would be beneficial.

Methods and Evaluation Criteria

The method is well-motivated and appropriate for addressing multi-task RL scalability. The use of standard benchmarks (Meta-World, DMControl, Mujoco, though mixed!) and normalized performance metrics strengthens the experimental evaluation. But my main concern is how “massive” these benchmarks are.

Theoretical Claims

While the paper offers empirical insights into gradient conflicts and parameter scaling, it lacks rigorous formal proofs. Strengthening the theoretical analysis, particularly regarding the benefits of the three-stage training mechanism, would add depth to the work.

Experimental Design and Analysis

The experimental design is comprehensive, including comparisons with state-of-the-art DT-based baselines and detailed ablation studies. Also, I’m curious how online multi-task methods would perform in these “massive” tasks.

Supplementary Material

The supplementary material appears extensive and includes additional ablation studies, detailed experimental setups, and analyses of task grouping strategies, which complement the main text effectively.

Relation to Prior Literature

M3DT builds clearly on recent advances in DT and MoE architectures for multi-task RL. It addresses a significant gap in scaling multi-task RL, and the authors relate their work to relevant literature, although a discussion of some recent MoE approaches in reinforcement learning could provide additional context.

Missing Essential References

It would be great if the authors could discuss more about the closely related MoE-based multi-task RL approaches (i.e., Hendawy et al. (2023) and Huang et al. (2024)).

Other Strengths and Weaknesses

Strengths include a novel approach to reducing gradient conflicts, a well-structured three-stage training process, and robust experimental validations. On the downside, parts of the presentation are dense and could benefit from clearer exposition, and further theoretical justification of the methods would strengthen the paper.

Other Comments or Suggestions

  • Font size in the figures is too small.
  • Inconsistency: Mixture-of-Expert v.s. Mixture-of-Experts
  • In Table 1 (and please check the other tables as well), double-check that the values within a statistically significant range of the best are bolded.
Author Response

We are greatly appreciative of your recognition of our work. We offer our responses to address your concerns as follows and will address the formatting issues in Other Comments in the revised manuscript.

Q1. How “massive” are these benchmarks?

A1. Current standard MTRL algorithms typically handle a limited number of tasks: offline methods are generally confined to 50 Meta-World tasks, while some online methods are restricted to MT10 tasks. In contrast, our approach integrates 160 tasks from Meta-World, DMC, and MuJoCo Locomotion.

Q2. Strengthening theoretical analysis would add depth to the work.

A2. Thanks for your suggestions. However, our method is motivated by observed experimental phenomena, which makes theoretical analysis difficult; neither the DT nor the MoE architecture we use has been theoretically analyzed in this setting. While theoretical results are rare in MTRL compared to MTL, our future work may focus on developing a theoretically grounded MTRL algorithm.

Q3. How online MT methods would perform in massive tasks.

A3. Recent studies have shown that offline MTRL methods outperform online methods on the Meta-World MT50 benchmark, as seen in HarmoDT and MTDiff. This performance gap may widen significantly when scaling to larger task sets, where online methods often degrade. As the task number increases, the wall-time overhead for environment interaction grows substantially, leading to a significant expansion in training time.

Q4. Discuss more about MoE-based MTRL approaches.

A4. Thanks for your suggestion. We will supplement and discuss [1,2] in the revised manuscript.

[1] MoE-Loco: Mixture of Experts for Multitask Locomotion

[2] Multi-Task Reinforcement Learning With Attention-Based Mixture of Experts

Q5. How you measured “gradient conflict”? Did you observe the frequent gradient conflicts with more tasks?

A5. We compute the gradients for each task and calculate the mean gradient across all tasks. The gradient similarity is defined as the average cosine similarity between this mean gradient and each task's gradient. During training, we record this metric every 1e4 steps to obtain a similarity curve. After smoothing, we take the plateau value from the stabilized phase of training as the final gradient similarity. Gradient conflict is calculated as (1 - gradient similarity). Yes, our results confirm that gradient conflicts intensify with more tasks.
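For reference, a minimal NumPy sketch of this metric (the per-step recording and smoothing over training are not shown):

    import numpy as np

    def gradient_conflict(task_grads):
        # task_grads: (n_tasks, n_params) array of per-task gradients
        # recorded at a given training step.
        g = np.asarray(task_grads)
        mean_g = g.mean(axis=0)
        # Cosine similarity between each task's gradient and the mean gradient.
        cos = g @ mean_g / (np.linalg.norm(g, axis=1)
                            * np.linalg.norm(mean_g) + 1e-12)
        # Gradient conflict is the complement of the average similarity.
        return 1.0 - cos.mean()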

Q6. How the three-stage training specifically prevents overfitting to dominant tasks?

A6. In the first stage, we terminate full-task training when performance plateaus and gradient conflicts peak (Fig.6, left) to avoid overfitting to dominant tasks. In the second stage, task grouping reduces the number of tasks each expert learns, mitigating gradient conflicts and preventing overfitting. In the third stage, we use the well-trained experts to balance the router's learning across tasks.

Q7. Could the approach be extended to online MT settings, what modifications would be necessary?

A7. M3DT can be readily extended to online MTRL by simply replacing the backbone DT's training objective and optimization with those of Online DT and iterating through all tasks for parameter optimization. However, as task numbers increase, the time cost of interacting with all tasks becomes prohibitive, hence we still recommend offline as the primary paradigm.

Q8. When to use M3DT-Random or M3DT-Gradient? What’s the criteria?

A8. M3DT-G consistently outperforms M3DT-R. The latter is designed as a baseline to validate that even random task grouping improves performance by reducing the learning task load per parameter subset. M3DT-G requires only two additional steps, computing gradients across all tasks and performing K-means clustering, which account for only a small increase in training time. We recommend using M3DT-G as the default approach.

Q9. Regarding the router, what specific design choices (architecture of the MLP) led to its effectiveness?

A9. Thanks to our analysis of task quantity and the proposed algorithm, we achieve strong results using a basic 5-layer MLP with ReLU activation for the router, without any specialized design. We believe optimizing the router structure or routing mechanism could further enhance performance or extend our method to generalization and continual learning scenarios, which we leave as future work.
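For concreteness, a PyTorch sketch of such a router is shown below (the hidden width is an assumed value, since the exact layer sizes are not specified here); a softmax over the output logits yields the per-expert weights used when mixing expert outputs.

    import torch.nn as nn

    def make_router(d_model, n_experts, width=256):
        # A plain 5-layer ReLU MLP mapping backbone hidden states to one
        # logit per expert; softmax over these logits gives expert weights.
        return nn.Sequential(
            nn.Linear(d_model, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, n_experts),
        )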

Q10. Normalized score is a key evaluation metric across diverse tasks. The normalization process used for each benchmark?

A10. As stated in Appendix A.1, we perform score normalization separately for each domain's tasks. For MW, we use the success metric. For DMC, we map the original range [0, 1000] to [0, 100]. For Cheetah-vel, we map [-100, -30] to [0, 100]. For Ant-dir, we map [0, 500] to [0, 100]. For the latter two domains, scores outside the original range are clipped to 0 or 100. For overall average normalized score across all tasks, we assign equal weight to each task, computing the algorithm's final score as the uniform average over all 160 tasks.
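A minimal sketch of this per-domain normalization (assuming the Meta-World success metric is reported as a rate in [0, 1]; the overall score is then the uniform average of the normalized scores over all 160 tasks):

    import numpy as np

    SCORE_RANGES = {
        "meta-world": (0.0, 1.0),        # success rate (assumed in [0, 1])
        "dmc": (0.0, 1000.0),
        "cheetah-vel": (-100.0, -30.0),
        "ant-dir": (0.0, 500.0),
    }

    def normalized_score(raw, domain):
        lo, hi = SCORE_RANGES[domain]
        # Map the raw score to [0, 100], clipping out-of-range values.
        return float(np.clip(100.0 * (raw - lo) / (hi - lo), 0.0, 100.0))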

Review (Score: 2)

The authors study the problem of distilling a large multi-task offline RL dataset into a single policy via Prompt Decision Transformer (DT). The authors first study the scalability of Prompt DT with respect to both tasks and model size. These experiments provide a (very clear) demonstration of the theory that multi-task RL datasets diminish performance on each individual task due to conflicting gradients and show that larger models do not necessarily fix the problem. A common solution is to split parameters between tasks. This is usually done with separate output heads, but the authors take a different approach and use a Mixture of Experts model to split learning. This M3DT architecture is trained in a three-stage process (base model, experts, then router). Experiments find M3DT improves multi-task performance relative to several other DT methods and ablations.

Update After Rebuttal

The authors' explanation of some missing details and additional results breakdowns do partially address my concerns. I think it is a mistake to set up an empirical multi-task experiment like this and make half the task set solvable by meta-RL methods that do not need to address multi-task RL challenges. The results also emphasize an under-discussed method for task subset selection and score normalization.

The main conclusion from the results is that splitting the architecture of a multi-task policy into independent components improves performance, which is a well-established pattern. Mixture of Experts is a different and simple way to go about this (Fig 7), but the simplicity is somewhat undercut by a complex three-stage training process. New figures in the rebuttal confirm that MoE performs worse than the joint single-task baseline. I understand and appreciate the desire to avoid task IDs, but disagree that the original task set makes this an unachievable oracle. The lack of task IDs would be a more valid limitation if randomized variants of the Meta-World + DMC tasks were also being used, or if the domains were more difficult to distinguish from a single observation. The rebuttal task transfer results show limited zero-shot benefit from implicit task identification in MW+DMC, which is expected, but further support the idea that these tasks are not similar enough to demand ground-truth task IDs.

I am very torn between a score of 2&3 but will maintain my original score of 2 partially because my concerns are quite different from other reviews.

Questions for Authors

Please clarify whether the 80 Cheetah-Vel/Ant-Dir tasks receive equal weight in the normalized score as the 80 distinct Meta-World+DMC tasks. How would the main results change if Cheetah-Vel and Ant-Dir were completely ignored?

Could you clarify Figure 7 and the aggregate score of the 160 single-task Prompt DTs? See the last paragraph of “Other Strengths and Weaknesses” for context.

Claims and Evidence

The main claims of the paper are supported by many experiments. However, I have several questions and concerns about the conclusions from the experiments, discussed below.

Methods and Evaluation Criteria

To aggregate results across domains with different reward functions, the authors report every result as a normalized score. However, they are missing breakdowns of the normalized score by domain. To reach 160 tasks, the authors use tasks from three domains. 50 tasks are from Meta-World, and 30 are from the DM Control Suite (DMC). These choices make sense because previous work has demonstrated the key gradient conflict issue here (especially in Meta-World). In the main text, the authors state the third domain is “Mujoco Locomotion,” and I think a reasonable guess from that description would be the standard gym tasks that are similar to the DMC. Appendix A.1.3 clarifies: 80 of the 160 total tasks used are minor variations of two locomotion problems.

I am very skeptical that “40 tasks” from Cheetah-Vel and “40 tasks” from Ant-Dir should count as 80 tasks in the context of this paper. A meta-learning paper (including the Prompt DT paper the dataset is taken from) would call this “40 tasks”, but this is a formality of the meta-RL problem statement. Meta-RL is interested in identifying an unknown MDP at test-time, while many of the baselines here are provided with the task ID upfront. The bar for a unique "task" is much lower in meta-RL. In any case, these are very saturated meta-RL benchmarks: it is easy to solve all 40 of these tasks online with a small model and without task IDs. Gradient conflicts are a non-issue. When the standard of a new task is that it creates a distinct optimization objective, as in this paper, I think we should treat these "80 tasks" as two tasks. This makes the 160 task count a bit misleading. My main concern is that the normalized score is actually 50% (80 tasks / 160) reliant on the performance in these two problems. The writing does not make this totally clear, but reporting the score for 120 and 160 tasks as separate metrics suggests this is the case. If so, it significantly changes many of the takeaways upon a re-read. For example:

Conclusion: “In MTRL, increasing the model size of DT swiftly hits the performance ceiling”. Reinterpreted: yes, maybe because 50% of the eval metric relies on 2 tasks where Prompt-DT has nearly perfect performance at its original model size. This could explain why the margins are thin across the paper.

Conclusion: “In the clear trend where the performance degrades with increasing task numbers, the decline is pronounced when task number is relatively low (below 40 tasks), while it becomes much more gradual once the tasks reach a sufficiently large number (above 80 tasks).” Reinterpreted: the diminishing rate of decrease is caused by repeatedly sampling the same task. This is complicated by Appendix A.3, where the authors clarify that task sampling is not uniform. Instead, the subsets are designed such that the average performance of single-task DTs is consistent with the average performance of single-task DT on all 160 tasks. If that metric is 50% Cheetah/Ant, and PromptDT performance is nearly 100% successful on those tasks, perhaps the samples are biased by Cheetah/Ant, but too few details are given to analyze this. The subsets used to measure performance vs. task count might be skewed in some way.

Conclusion: Random task groupings work almost as well as the more sophisticated gradient-similarity task groupings (Tables 1 & 2). Reinterpreted: randomly spreading the Cheetah/Ant tasks among the experts is fine because the conflicts between them are not significant enough to stop Transformers of this size from learning with standard online RL, much less supervised DT.

Conclusion: “M3DT… shows diminishing performance gains after reaching 40 experts on 160 tasks.” Reinterpreted: By the time we reach 40 independently trained models on 82 effective tasks, gradient conflicts are no longer an issue. We are approaching the baseline of learning a switch function between single-task expert models. Ideally, MTRL might be better than that method by positive transfer, but offline methods often treat it as an oracle. From that perspective, 48 experts work the best (Fig 5) because it is closest to 1 expert per task. The authors set out to distill a multi-task dataset into a single policy, but the solution is to set up a MoE architecture and scale it up 40x until it resembles training a separate policy for every task.

It would be very useful to see the results with Ant-Dir and Cheetah-Vel removed.

My apologies for the long discussion, but this is a rare case where I felt an appendix detail may substantially change my opinion on a paper's results, and that not all reviews may comment on this.

Theoretical Claims

N/A

Experimental Design and Analysis

Content that might normally go here was discussed in a section above.

Supplementary Material

I read the Appendix.

Relation to Prior Literature

This paper relates to important topics in multi-task optimization in RL and the use of large-scale sequence models for policy learning.

Missing Essential References

Nothing critical.

Other Strengths and Weaknesses

The first few figures in this paper clearly demonstrate the core MTRL optimization challenge and would support many existing works in this area. The method is a good example of applying ideas from large-scale LLMs to RL, an emerging trend and an important area for study.

This line of work involves training single-task agents from scratch on every task and then distilling their behavior into a single policy. The main benefit of this might be generalization to unseen tasks or at least a significant reduction in parameters to handle the training set (vs. switching between the single-task experts). The paper does not discuss generalization or fine-tuning to new tasks, and the architecture almost directly scales with task count.

Figure 7 is a confusing result. The MoE experts outperform the overall model (including the router), which should be bad (why use the router then?), but the writing acts like this is falling short of an oracle upper bound. The standard upper bound is usually the aggregate result of (smaller) single-task specialist agents (e.g., Multi-Game DT, Gato). Because the experts are mostly redundant architectures to a base model that trains on its own, it is not 100% clear to me whether Figure 7 is reporting something different than this. Regardless, Appendix A.3 makes it clear that, at some point, the authors trained single-task Prompt DT models on all 160 tasks and used their scores for balancing the dataset. Can we compare the performance of M3DT against those scores?

Other Comments or Suggestions

The paper would benefit from evaluating on held-out tasks, which should be a key advantage of using Prompt DT as the base model rather than one-hot identifying every task.

Minor: The Cheetah-Vel reference is MAML, and the Ant-Dir reference is ProMP.

Author Response

We sincerely appreciate the reviewer's thorough reading of our manuscript and your valuable comments regarding our task setup and method. These comments have helped us realize that we inadvertently omitted several critical details in our original manuscript. We believe that supplementing and clarifying this information will significantly enhance the quality of our paper, and we hope it addresses your concerns.

Regarding the use of Ant-dir and Cheetah-vel as 80 tasks, our intention is not to propose a new benchmark, but to ensure sufficient task quantity to investigate how task scale affects MTRL scalability. To achieve methodological rigor in task selection, we follow three key criteria:

(1) Performance Balance: Based on single-task score, the selected task subsets maintain average scores comparable to the full 160 set, eliminating the influence of task difficulty

(2) Domain Ratio Preservation: The subset composition at different task scales approximately maintains the original domain ratio (MW:DMC:Ant:Cheetah=5:3:4:4), mitigating domain-specific biases

(3) Subset Diversity: We use distinct task subsets for each run with the same task scale to maximize task coverage, eliminating correlations between specific tasks.

These criteria eliminate potential bias caused by the relative ease of the Ant and Cheetah tasks. We have separately reported the scores for the MW+DMC and Ant+Cheetah tasks, which are consistent with the average scores across 160 tasks in the manuscript, validating the rationality of the task selection and the reliability of our findings. Please let us know if the link does not work: https://anonymous.4open.science/r/ICML_rebuttal_11746/revised_fig2.png

Q1. Calculation for Average Normalized Score

A1. We assign equal weight to each of the 160 tasks and compute the average normalized scores. Ant and Cheetah collectively account for 50% of the average score.

Q2. "In MTRL, increasing ..." maybe because PromptDT has nearly perfect performance at original size. "In the clear ..." is caused by repeatedly sampling the same task.

A2. In the separated per-domain results, both task scalability and parameter scalability exhibit the same trends as demonstrated in the original paper. For a detailed analysis, please see https://anonymous.4open.science/r/ICML_rebuttal_11746/revised_fig2.png.

Q3. "Random ..." maybe because the conflict between them are not significant enough.

A3. We conducted experiments by separately training PromptDT-1.47M on the 80 MW+DMC and the 80 Ant+Cheetah tasks to show the gradient conflicts and normalized scores; see https://anonymous.4open.science/r/ICML_rebuttal_11746/Q4.png. The results show that training on either standalone domain exhibits significantly weaker gradient conflicts.

Q4. "M3DT..." M3DT appears to learn several single-task experts and a switch function.

A4. M3DT fundamentally differs from training single-task policies plus a switch function: (1) our router dynamically assigns weights to experts at the token level, rather than switching to the single best expert at the task level (dashed line in Fig.7), and Fig.7 demonstrates that our router outperforms oracle switching when the expert count is small. (2) Reducing the tasks per expert does not guarantee better performance, as evidenced by the drop when scaling from 48 to 56 experts (175M to 200M in the Fig.7 dashed line); our contribution lies in identifying the optimal task load for a given parameter subset and striking a trade-off between expert performance and gating difficulty.

Q5. How would the main results change if Cheetah-Vel and Ant-Dir were completely ignored?

A5. M3DT shows an even larger advantage on the complex tasks (MW+DMC). See https://anonymous.4open.science/r/ICML_rebuttal_11746/domain_main_result.png

Q6. Evaluating on held-out tasks, a key advantage of Prompt DT rather than one-hot identifying task.

A6. We first emphasize that both M3DT and PromptDT operate without task IDs and only use a trajectory prompt during held-in task evaluation (see Rebuttal A8), whereas MTDT and HarmoDT require task IDs to function. For the held-out evaluation, see https://anonymous.4open.science/r/ICML_rebuttal_11746/generalization.png

Q7. At least a significant reduction in parameters (vs. switching between single-task experts).

A7. We think the core challenge in MTRL lies in the fact that simply increasing parameters fails to address large-scale task sets. Only after achieving competent performance on large-scale task sets does parameter reduction become meaningful. Parameter efficiency nevertheless remains valuable future work, particularly through refined architectural designs for the experts and router. While the single-task agents require 235.2M parameters in total, with an average score of 84.35, our method achieves a 26% reduction in parameters.
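As a quick sanity check on these figures (assuming the comparison is against the M3DT-174M configuration reported earlier in this thread):

    235.2M / 160 tasks ≈ 1.47M parameters per single-task agent,
    1 − 174M / 235.2M ≈ 0.26, i.e. roughly a 26% parameter reduction.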

Q8. Figure 7 is a confusing result. Why use the router? Compare the average single-task performance?

A8. Please see our response to Reviewer aD2s, Q4/A4. We have added the averaged single-task performance to the revised Fig.7 at https://anonymous.4open.science/r/ICML_rebuttal_11746/revised_fig7.jpg

Final Decision

This paper tackles the challenge of scaling offline multi-task reinforcement learning (MTRL) to a massive number of tasks, proposing M3DT, a Mixture-of-Experts (MoE) framework built upon the Decision Transformer (DT). The authors first demonstrate empirically that naively increasing model size offers diminishing returns as task numbers grow due to issues like gradient conflicts. M3DT addresses this by incorporating MoE layers into the DT architecture and employing a three-stage training process (backbone, experts, router) to manage task complexity and improve scalability, showing results on up to 160 tasks.  

Reviewers acknowledged the importance of the problem and appreciated the empirical studies highlighting the limitations of standard scaling. The MoE approach was seen as a relevant direction. However, significant concerns were raised, particularly regarding the experimental setup and interpretation.  

Reviewer is3p strongly criticized the task composition, noting that 80 out of the 160 tasks were minor variations of two locomotion problems (Ant-Dir, Cheetah-Vel). This reviewer argued that these should effectively count as only two tasks in the context of MTRL (as opposed to meta-RL), and their inclusion significantly skews the results and conclusions about scalability, performance saturation, and the effectiveness of grouping strategies. While the authors provided breakdowns by domain in the rebuttal, Reviewer is3p remained unconvinced, arguing the setup obscures rather than clarifies the challenges of massive MTRL and that the core finding (parameter splitting helps) is already well-established. This reviewer also found the MoE approach somewhat undermined by the complex three-stage training and maintained a weak reject score.  

Reviewer NqBs also initially recommended weak reject, questioning the logical connection between the initial experiments and the proposed solution, finding the method description (especially task grouping and training stages) unclear, and asking about the specific MTRL challenges addressed beyond general multi-task learning. The authors provided clarifications on the logic, grouping mechanism (performed once after stage 1), training rationale, and MTRL challenges. Reviewer NqBs acknowledged the rebuttal but did not provide further feedback or update their score.  

Reviewers BLnM and aD2s were more positive. Reviewer BLnM (Weak Accept) found the method well-motivated but sought clarification on gradient conflict measurement, the 3-stage training benefits, online applicability, router design, and score normalization. Reviewer aD2s praised the motivation and experiments but noted the lack of discussion on limitations and training time, also questioning the performance saturation and the expert vs. router performance discrepancy in Figure 7. The authors addressed these points, providing details on gradient similarity calculation, the 3-stage training rationale, training times, performance saturation explanations, and clarifying Figure 7 by distinguishing oracle expert selection from the router's role. These responses satisfied Reviewer BLnM (who maintained Weak Accept) and Reviewer aD2s (who raised their score to Accept).

While the paper addresses an important problem in scaling MTRL and proposes a relevant MoE-based approach, the significant concerns raised by Reviewer is3p regarding the experimental design's validity cast doubt on the central claims about mastering "massive" MTRL. The heavy reliance on numerous variations of just two locomotion tasks potentially inflates the task count and might overstate the method's scalability and effectiveness in diverse, large-scale multi-task settings. Although the authors provided additional results and clarifications that satisfied two reviewers (leading to one score increase), the fundamental critique about the task composition from Reviewer is3p was not fully resolved.