MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning
MENTOR is a highly efficient visual RL algorithm that excels in both simulation and real-world complex robotic learning tasks.
Abstract
Reviews and Discussion
This paper addresses the challenge of reinforcement learning with visual observations, where learning an efficient policy from high-dimensional image data is difficult. The authors propose a novel approach by incorporating a mixture-of-experts (MoE) architecture in the policy and applying task-oriented perturbation to optimize learning efficiency. The method, called MENTOR, is tested on several reinforcement learning benchmarks, including DeepMind Control Suite, Meta-World, and Adroit, as well as real-world experiments. MENTOR demonstrates superior performance compared to prior state-of-the-art (SOTA) methods.
Strengths
- The paper is clearly written and easy to follow.
- The proposed approach—integrating a mixture-of-experts in the policy architecture and applying task-oriented perturbation—is well-motivated and empirically supported, as demonstrated in Figures 3 and 4.
- MENTOR shows significant empirical improvements over baseline methods in both simulated environments and real-world experiments.
Weaknesses
- The paper lacks a discussion of its limitations and possible future directions for addressing them.
- Several clarifications could improve the writing and presentation of the work.
- A more detailed analysis of hyperparameter sensitivity would be beneficial. It would be helpful to understand how MENTOR's performance is affected by hyperparameters such as the number of experts, the number of top-k experts, the perturbation rate, and the size of the set S_{top}.
Questions
- Ablation study: If the method only used the MoE component and random perturbation (similar to DrM), what would the performance be? It would be valuable to analyze whether the mixture-of-experts or task-oriented perturbation contributes more to the success of MENTOR.
- Task-oriented perturbation and self-imitation learning: The task-oriented perturbation shares similar intuition with self-imitation learning (https://arxiv.org/abs/1806.05635), where agents benefit from their own past high-rewarding network weight or trajectories. Citing relevant work on self-imitation learning would strengthen the paper. Additionally, a discussion comparing the advantages and disadvantages of task-oriented perturbation versus self-imitation learning would enhance the contribution.
- Expert output architecture (Line 199): The paper mentions that expert i produces output a_i, but it is unclear how this output is derived from the latent vector z. Could you provide more details about the architecture of the feedforward network FFN_i and its role in generating the expert output?
- Clarification on MW (Line 215): The paper refers to the "Assembly task from MW," but MW is not defined in the text. Does MW refer to Meta-World? A clear definition would improve readability.
Task-oriented perturbation and self-imitation learning: The task-oriented perturbation shares similar intuition with self-imitation learning (https://arxiv.org/abs/1806.05635), where agents benefit from their own past high-rewarding network weight or trajectories. Citing relevant work on self-imitation learning would strengthen the paper. Additionally, a discussion comparing the advantages and disadvantages of task-oriented perturbation versus self-imitation learning would enhance the contribution.
Thank you for pointing out the relevance of the Self-Imitation Learning (SIL) [13] work to our task-oriented perturbation approach. We agree that both SIL and our method share a similar intuition: leveraging the agent's own high-performing history to enhance performance. We have cited and discussed SIL in our revised paper to better contextualize our contributions.
While task-oriented perturbation and SIL share a similar intuition, we believe they address distinct problems and are therefore not interchangeable. Specifically, SIL proposes an effective framework that leverages past good trajectories to enhance learning through policy gradient optimization of the agent's weights. In contrast, our method does not reuse past high-reward trajectories; it directly exploits past high-performing agents and updates the agent's weights via parameter perturbation without computing gradients, which makes it distinct from SIL.
Expert output architecture (Line 199): The paper mentions that expert i produces output a_i, but it is unclear how this output is derived from the latent vector z. Could you provide more details about the architecture of the feedforward network FFN_i and its role in generating the expert output?
Quick Answer:
- The architecture of the feedforward network: it is a two-layer MLP, Linear(256 → 256) + ReLU + Linear(256 → 256). In this paper, we set the dimension of $z$ to 256.
- $\mathrm{FFN}_i$ is exactly the $i$-th expert.
A detailed explanation follows:
- Input: latent vector $z \in \mathbb{R}^{256}$.
- Architecture:
  - $R$: Router. An MLP that maps $z$ to logits over the $n_{\text{expert}}$ experts.
  - $\mathrm{FFN}_i$: Expert. Each expert is a two-layer MLP,
    Linear(256 → 256) + ReLU + Linear(256 → 256), where the dimension of $z$ is 256.
- Deriving the output:
  - Step 1. Get the distribution over experts: $w = \mathrm{softmax}(R(z))$.
  - Step 2. Get the output of each expert: $a_i = \mathrm{FFN}_i(z)$.
  - Step 3. Combine them: $a = \sum_i w_i\, a_i$ (summed over the selected top-$k$ experts).
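For readers who prefer code, a simplified PyTorch sketch of such an MoE head is given below. It is an illustration rather than our exact implementation; the class and argument names (`MoEPolicyHead`, `n_experts`, `top_k`) are chosen for exposition, and it assumes softmax weights over the selected top-$k$ experts.

```python
# Simplified sketch of the MoE head described above (illustrative, not the exact implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEPolicyHead(nn.Module):
    def __init__(self, z_dim=256, n_experts=8, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(z_dim, n_experts)  # R: z -> expert logits
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(z_dim, z_dim), nn.ReLU(), nn.Linear(z_dim, z_dim))
            for _ in range(n_experts)  # FFN_i: two-layer MLP
        ])

    def forward(self, z):
        logits = self.router(z)                                  # (B, n_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)  # select top-k experts
        w = F.softmax(topk_logits, dim=-1)                       # weights over selected experts
        outs = torch.stack([expert(z) for expert in self.experts], dim=1)  # (B, n_experts, z_dim)
        selected = torch.gather(
            outs, 1, topk_idx.unsqueeze(-1).expand(-1, -1, outs.size(-1)))
        return (w.unsqueeze(-1) * selected).sum(dim=1)           # combined output fed to the action head
```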
Clarification on MW (Line 215): The paper refers to the "Assembly task from MW," but MW is not defined in the text. Does MW refer to Meta-World? A clear definition would improve readability.
Yes, MW refers to Meta-World. Thank you for pointing this out. We have updated the expression to improve readability.
[12] Xu, Guowei, et al. "DrM: Mastering visual reinforcement learning through dormant ratio minimization." arXiv preprint arXiv:2310.19668 (2023).
[13] Oh, Junhyuk, et al. "Self-imitation learning." International Conference on Machine Learning. PMLR, 2018.
Thank you for your helpful comments. We respond to your comments below and have added more experiments. The additional results are posted on the rebuttal website here.
The paper lacks a discussion of its limitations and possible future directions for addressing them.
Thank you for pointing this out! We will include a discussion of the limitations of the proposed approach in the final version of the paper.
While MENTOR has demonstrated outstanding performance in both simulation and real-world experiments, most of the environments studied involve a single task with a single robot embodiment. Scaling up the agent parameters to enhance its learning capacity and enable effective performance in more complex scenarios—such as learning a single policy that generalizes across hundreds of tasks or even across different robot embodiments—remains an exciting direction for future research.
A more detailed analysis of hyperparameter sensitivity would be beneficial. It would be helpful to understand how MENTOR's performance is affected by hyperparameters such as the number of experts, the number of top-k experts, the perturbation rate, and the size of the set S_{top}.
Thanks for your suggestion! Due to time limitations, we only report the ablation study for the number of experts and top_k on the Hammer (Sparse) task, as shown in Section 7 of the rebuttal website. We will include a more detailed discussion in the final version. The results indicate that the optimal choice is 8 experts with top_k = 4. When top_k is 4, there are no significant performance differences when the number of experts is set to 4, 8, or 32, which suggests that 4 experts are enough to learn the skill in this task. The ablation on top_k further validates our hypothesis, as reducing top_k (to 2) results in a worse learning curve. If the number of experts is set to 1 and top_k is also 1, the MoE degenerates into a standard MLP, resulting in the worst performance among all configurations.
Although we have not conducted an ablation study for the size of the set S_{top} due to time constraints, we can briefly discuss its influence based on extreme cases. Suppose the size equals 1; in this case, the agent would perturb using only the best-performing agent in its history, which is likely to cause the agent's weights to converge to a local minimum. On the other hand, if the size is infinite, the distribution formed by S_{top} would represent the average policy distribution across the training history, causing the perturbation to act more like random noise and fail to guide the weights toward an optimal direction. Thus, the optimal size for S_{top} should not be too small or too large.
As for the perturbation rate, in our paper, we use the exact numerical values provided in [12]. Due to time constraints, we did not explore alternative settings for these values. We plan to conduct further experiments on this aspect in the future.
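For illustration, one perturbation step can be sketched as below. This is a simplified version rather than our exact procedure: it assumes the perturbation target is drawn from a per-parameter Gaussian fitted to the agents in S_{top} and mixed into the current weights with a rate `alpha`; the helper name `perturb_toward_top` is hypothetical.

```python
# Illustrative sketch of task-oriented perturbation (simplified, not the exact procedure):
# mix the current weights toward a sample drawn from the distribution formed by S_top.
import torch

def perturb_toward_top(agent, s_top_state_dicts, alpha=0.9):
    """agent: nn.Module; s_top_state_dicts: list of state_dicts of past high-performing agents."""
    with torch.no_grad():
        for name, param in agent.named_parameters():
            stacked = torch.stack([sd[name] for sd in s_top_state_dicts])  # (|S_top|, ...)
            mean = stacked.mean(dim=0)
            std = stacked.std(dim=0, unbiased=False)
            target = mean + std * torch.randn_like(param)   # sample from the fitted distribution
            param.mul_(alpha).add_((1.0 - alpha) * target)  # soft perturbation toward the sample
```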
Ablation study: If the method only used the MoE component and random perturbation (similar to DrM), what would the performance be? It would be valuable to analyze whether the mixture of experts or task-oriented perturbation contributes more to the success of MENTOR.
Thank you for your helpful suggestion! To address your concern, we have conducted additional ablation studies on five diverse tasks. The results demonstrate that both the architectural and the optimization improvements play essential roles in the overall performance gains. The results are provided in the general response as well as in Section 1 of the rebuttal website. Please feel free to refer to them.
Dear Reviewer TuwR,
We thank you again for your valuable comments and suggestions.
In our earlier response, we provided detailed clarifications addressing your questions, added analysis of related work and limitations to the revised paper, and posted more experimental results, based on your excellent suggestions, on the rebuttal website here.
As the author-reviewer discussion stage is nearing its conclusion, we kindly request you to review our revised paper and response, and reconsider your scores if our response has adequately addressed your concerns.
If you have any additional questions, we would be happy to provide further clarifications. We sincerely look forward to your feedback.
Best regards,
The authors
Thank you for your response addressing my concerns. I will keep my score, leaning towards acceptance.
Dear reviewer TuwR,
We really appreciate the time you spent understanding our work. If you do not feel comfortable raising the score, please consider raising your confidence in our clarifications.
Best regards,
The authors
This paper proposes a sample-efficient visual reinforcement learning approach called MENTOR, which utilizes a mixture-of-experts network instead of the traditional MLP network to mitigate gradient conflicts, along with a task-oriented perturbation method to enhance exploration. Evaluation results in multiple simulation environments show that MENTOR is sample efficient. Further, MENTOR can be successfully used for real-world reinforcement learning, which facilitates the application of reinforcement learning to real-world scenarios.
Strengths
- Attempts to alleviate the burden of shared parameters by introducing MoE architectures into reinforcement learning
- A simple and effective perturbation method is proposed that can better guide the policy learning
- The proposed method achieves an improvement in sample efficiency compared to DrM
- Validates the effectiveness of the method on real-world robotics tasks, providing a valuable reference for the community
Weaknesses
- Lack of persuasion and ablation in the use of MoE. MoE has been widely used in the field of multi-task learning, and it can effectively alleviate the conflict problem due to multi-objective optimization. However, policy optimization in a single robot manipulation task often has only one optimization objective, which does not fit the context of multi-task learning. Although it is claimed in the paper that the architecture advantage can be propagated to a single task to alleviate the burden of shared parameters, there is no further analysis and ablation experiments on this.
- Lack of correlation between the two main improvements. MENTOR makes improvements in both architecture and optimization, yet there seems to be no necessary connection between the two. This makes the improvements in the paper appear as if they are just a combination of two tricks. In fact, optimization is often related to architecture, and it remains uncertain whether the use of MoE will introduce new challenges for policy optimization.
- Lack of ablation of the two improvements. The paper only provides performance curves for MENTOR in simulation tasks, lacking ablation studies on architecture and optimization, which makes the reasons for the final performance improvement unclear. Although incremental comparisons are made in real-robot experiments, comparisons in simulation tasks will be more convincing and fairer.
Questions
- Whether it can be shown that the multi-stage property of the task in single-task learning leads to the gradient conflict problem or the existence of a shared parameter burden in policy optimization?
- Is MoE more prone to dormancy than MLP or can it mitigate dormancy to some extent?
- In Figure 6, why MENTOR performs worse on hammer than on hammer (sparse)?
- Does the performance improvement in Fig. 6 arise mainly from the task-oriented perturbations?
Details of Ethics Concerns
N/A
Thank you for your helpful comments. We respond to your comments below and have added more experiments. The additional results are posted on the rebuttal website here.
Lack of persuasion and ablation in the use of MoE. ... Although it is claimed in the paper that the architecture advantage can be propagated to a single task to alleviate the burden of shared parameters, there is no further analysis and ablation experiments on this. Whether it can be shown that the multi-stage property of the task in single-task learning leads to the gradient conflict problem or the existence of a shared parameter burden in policy optimization?
Thank you for your insightful comments! We believe that even in a single-task setting, policy optimization often involves multiple objectives. For instance, as described in Meta-World[8], manipulation tasks are associated with compound reward functions that typically include components such as reaching, grasping, and placing. Conflicts between these objectives can arise, creating a burden for shared parameters.
To validate this, we analyze the gradient cosine similarities for the Assembly task, as detailed in the rebuttal website Section 5. The task can naturally be divided into four stages: Grasp, Move, Assemble, and Release.
Our findings show that the MLP agent experiences gradient conflicts between grasping and the other stages. This can occur because reaching to grasp the object may increase the distance between the robot and the target pillar, leading to competing optimization signals. In contrast, the MoE agent mitigates these conflicts, achieving consistently positive gradient cosine similarities across all stage pairs. This validates the ability of the MoE architecture to alleviate the burden of shared parameters and facilitate more efficient optimization, even in single-task scenarios.
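As a reference for how such a measurement can be implemented, the snippet below computes the cosine similarity between the policy gradients of two stage-specific losses. It is a simplified version of the analysis; the stage losses (e.g., actor losses on Grasp vs. Move transitions) are assumed inputs, and the helper names are hypothetical.

```python
# Simplified measurement of gradient cosine similarity between two stage losses.
import torch
import torch.nn.functional as F

def flat_grad(policy, loss):
    params = [p for p in policy.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    # Treat unused parameters as zero gradients so both vectors have the same length.
    return torch.cat([(g if g is not None else torch.zeros_like(p)).flatten()
                      for g, p in zip(grads, params)])

def grad_cosine_similarity(policy, loss_a, loss_b):
    # A negative value indicates conflicting gradients between the two stages.
    return F.cosine_similarity(flat_grad(policy, loss_a), flat_grad(policy, loss_b), dim=0)
```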
Lack of ablation of the two improvements. ... Although incremental comparisons are made in real-robot experiments, comparisons in simulation tasks will be more convincing and fairer. Does the performance improvement in Fig. 6 arise mainly from the task-oriented perturbations?
Thank you for pointing it out! To address your concern, we have conducted additional ablation studies on five diverse tasks. The results are posted in the general response as well as in the rebuttal website Section 1. Both MoE structure and Task-oriented Perturbation improve the agent’s learning sample efficiency. Please feel free to refer to them.
Is MoE more prone to dormancy than MLP or can it mitigate dormancy to some extent?
Empirically, we find that MoE agents tend to have lower dormancy than MLP agents, as shown in Section 6 of the rebuttal website. In both simulation and real-world environments, changing the agent structure from MLP to MoE leads to a consistently lower and smoother dormant ratio (and also better performance, as illustrated in the original paper). The explanation is as follows:
According to [9,10,11], the neural network's dormant ratio is an effective index of the agent's skill-acquisition capability: a lower dormant ratio indicates better learning ability. As illustrated in Section 3.1 of the original paper and Section 5 of the rebuttal website, the MoE structure can indeed enhance the agent's learning capability by alleviating the burden of shared parameters. Thus, it is reasonable that MoE agents have lower dormancy than MLP agents.
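For clarity, the dormant ratio follows the definition in [9]: a unit is τ-dormant if its average activation magnitude, normalized by the layer-wide mean, falls below a threshold τ. A simplified per-layer computation looks like the sketch below (the activation-collection plumbing is omitted, and the function name is illustrative).

```python
# Simplified dormant-ratio computation for one layer, following the definition in [9].
import torch

def dormant_ratio(activations: torch.Tensor, tau: float) -> float:
    """activations: (batch, n_units) post-activation outputs of a layer; tau: dormancy threshold."""
    score = activations.abs().mean(dim=0)        # per-unit average activation magnitude
    score = score / (score.mean() + 1e-9)        # normalize by the layer-wide mean
    return (score <= tau).float().mean().item()  # fraction of dormant units in the layer
```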
Lack of correlation between the two main improvements. MENTOR makes improvements in both architecture and optimization, yet there seems to be no necessary connection between the two. This makes the improvements in the paper appear as if they are just a combination of two tricks.
Thanks for your comments. We believe the effects of the architecture (MoE) and the optimization (Task-oriented Perturbation) are intrinsically correlated. The effectiveness of task-oriented perturbation relies on the sufficient learning capability provided by the MoE architecture; without such a foundation, the perturbation process may damage the policy and lead to suboptimal performance. This may explain why, in the ablation study (rebuttal website Section 1) on the Soccer environment (the most challenging task and the only one where MENTOR does not achieve a 100% success rate), Task-oriented Perturbation alone leads to worse performance than random perturbation, whereas the combination of MoE (a stronger agent structure) and Task-oriented Perturbation leads to the best performance.
In fact, optimization is often related to architecture, and it remains uncertain whether the use of MoE will introduce new challenges for policy optimization.
As shown in the ablation studies in simulation (rebuttal website Section 1) and in the real world (original paper Table 1), introducing MoE does not impose an additional optimization burden and actually improves overall performance with the same gradient optimizer. However, such issues may appear as we scale up the model and train on more challenging tasks. We will explore this direction in the future.
In Figure 6, why MENTOR performs worse on hammer than on hammer (sparse)?
Thank you for pointing this out, and we apologize for the confusion. As described in Section 4.1, Figure 6 presents experimental results from three different simulation benchmarks. The "Hammer" task in the second row is from the Adroit environment, while "Hammer (Sparse)" in the third row is from Meta-World. These two tasks have significantly different setups: the Adroit environment requires the use of a dexterous hand, while the Meta-World task involves a simpler 2-jaw gripper. Due to these differences in task settings and complexity, their results are not directly comparable.
We appreciate your observation and will revise the writing to clarify this distinction and avoid potential confusion.
[8] Yu, Tianhe, et al. "Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning." Conference on Robot Learning. PMLR, 2020.
[9] Sokar, Ghada, et al. "The dormant neuron phenomenon in deep reinforcement learning." International Conference on Machine Learning. PMLR, 2023.
[10] Xu, Guowei, et al. "DrM: Mastering visual reinforcement learning through dormant ratio minimization." arXiv preprint arXiv:2310.19668 (2023).
[11] Ji, Tianying, et al. "ACE: Off-Policy Actor-Critic with Causality-Aware Entropy Regularization." arXiv preprint arXiv:2402.14528 (2024).
Dear Reviewer 1toB,
We thank you again for your valuable comments and suggestions.
In our earlier response, we provided detailed clarifications addressing your questions, added further analysis of our method to the revised paper, and posted more experimental results, based on your excellent suggestions, on the rebuttal website here.
As the author-reviewer discussion stage is nearing its conclusion, we kindly request you to review our revised paper and response, and reconsider your scores if our response has adequately addressed your concerns.
If you have any additional questions, we would be happy to provide further clarifications. We sincerely look forward to your feedback.
Best regards,
The authors
Dear Reviewer 1toB,
We hope this message finds you well. As the author-reviewer discussion stage is nearing its final conclusion, we kindly remind you to review our revised paper and responses, including the additional analysis and experimental results provided on the rebuttal website here.
If our updates address your concerns, we would greatly appreciate it if you could reconsider your scores. Please let us know if you have any remaining questions—we would be happy to provide further clarifications.
Thank you for your time and consideration.
Best regards,
The authors
The paper presents MENTOR, an innovative approach to enhance sample efficiency in visual deep reinforcement learning (RL) for robotics. By replacing the standard multi-layer perceptron with a mixture-of-experts (MoE) architecture and introducing a task-oriented perturbation mechanism, MENTOR improves the agent's performance in complex tasks and facilitates more effective optimization. The method demonstrates superior results across three simulation domains and achieves an impressive 83% success rate on challenging real-world robotic tasks, significantly outperforming the current best model-free visual RL algorithm, which only achieves 32%.
Strengths
- MENTOR introduces a mixture-of-experts (MoE) architecture that enhances learning efficiency by dynamically allocating gradients to modular experts, effectively mitigating gradient conflicts in complex scenarios.
- The evaluation extends beyond simulations to real-world robotic manipulation tasks, demonstrating MENTOR’s practical value and sample efficiency, which are crucial for advancing reinforcement learning applications in robotics.
Weaknesses
While MENTOR demonstrates impressive performance in both simulation and real-world tasks, the paper could benefit from a more detailed analysis of the limitations of the proposed approach, particularly in terms of scalability and generalization across diverse robotic platforms and environments. This would provide a clearer understanding of the framework's applicability in broader contexts.
Questions
- Are the experimental results in the real-world obtained through sim2real transfer of models trained in simulation, or are they trained from scratch entirely in a real environment?
- Why are external disturbance experiments not conducted in the simulation environment?
Thank you for your helpful comments. We respond to your comments below and have added more experiments. The additional results are posted on the rebuttal website here.
The paper could benefit from a more detailed analysis of the limitations of the proposed approach, particularly in terms of scalability and generalization across diverse robotic platforms and environments. This would provide a clearer understanding of the framework's applicability in broader contexts.
Thanks for your suggestion! We will add a discussion of the limitations of the proposed approach in the final version.
As for scalability, we conducted a self-ablation on the Hammer task, as shown in Section 7 of the rebuttal website, comparing performance when varying the number of experts and the top_k of the MoE. The results indicate that, for a single task, over-expanding the agent's parameters (i.e., increasing the number of experts) does not efficiently improve the agent's overall performance.
As for generalization, we applied random disturbances in both the real-world experiments (original paper Section 4.2) and the simulator (rebuttal website Section 4), which demonstrate that the policies learned by MENTOR are strongly robust to the relevant disturbances.
While the current policy demonstrates strong performance within individual tasks, the potential of scaling its parameters to enable effective performance in more complex scenarios—such as learning a single policy that generalizes across hundreds of tasks or even across different embodiments—remains an exciting avenue for future exploration.
Are the experimental results in the real-world obtained through sim2real transfer of models trained in simulation, or are they trained from scratch entirely in a real environment?
The experimental results in the real world are obtained by training entirely from scratch in the real environment. This choice was made due to the absence of suitable simulation environments and the significant sim-to-real gap present in our tasks.
For example, in the Peg Insertion task, successful completion requires contact-rich interactions to accurately insert the peg into the hole. Accurately modeling such detailed contact dynamics in a simulator is challenging. Similarly, tasks like Cable Routing and Tabletop Golf involve interactions with soft objects (the cable and the grass surface, respectively), which are widely recognized as difficult for simulators to model accurately.
Moreover, our ability to train efficiently and successfully in real environments serves as strong evidence of the sample efficiency of our proposed method. This highlights its effectiveness as a model-free visual RL algorithm, outperforming the leading baseline.
Why are external disturbance experiments not conducted in the simulation environment?
Thank you for highlighting this point! The original simulation platforms do not natively support random disturbances like those applied in our real-world experiments. However, we agree that it is beneficial to examine the effects of disturbances in a simulated environment as well.
To address this, we modified the "Assembly" task from the Meta-World environment and have included both success and failure cases in the rebuttal website Section 4. The training phase remains unchanged, but during evaluation, we introduce a random disturbance: after the robot grasps the ring and moves toward the fitting area, the fitting pillar randomly changes its location (Disturbance). This forces the robot agent to adjust its trajectory to the new target position.
We conducted 10 rollouts in both the original environment and the modified environment (with disturbances during evaluation). The results show a 100% success rate in the original environment and a 90% success rate in the modified environment. These results demonstrate that the policy learned through MENTOR exhibits strong robustness to random disturbances.
Dear Reviewer 8sj6,
We thank you again for your valuable comments and suggestions.
In our earlier response, we provided detailed clarifications addressing your questions, included a more detailed analysis of limitations in the revised paper, and posted additional experimental results, based on your excellent suggestions, on the rebuttal website here.
As the author-reviewer discussion stage is nearing its conclusion, we kindly request you to review our revised paper and response, and reconsider your confidence score if our response has adequately addressed your concerns.
If you have any additional questions, we would be happy to provide further clarifications. We sincerely look forward to your feedback.
Best regards,
The authors
Thank you for the authors' response. Overall, my concerns have been addressed, and I will increase my confidence score.
We are delighted to hear that our response and the revisions to the manuscript have addressed your concerns, and we greatly appreciate your decision to raise your confidence score. Thank you once again for your time and valuable feedback!
The authors introduce MENTOR, a visual deep RL algorithm designed to improve sample efficiency in robotic tasks. MENTOR enhances RL agents by replacing traditional MLPs with a MoE architecture. Additionally, the authors introduce a task-oriented perturbation mechanism that heuristically samples task-relevant perturbations. Their experiments show MENTOR can get good performance over the diverse tasks.
Strengths
- The paper is well-structured and easy to understand, with a clear presentation of the proposed method.
- The authors conduct extensive experiments in both simulated and real environments, effectively demonstrating the method’s efficacy.
Weaknesses
- The proposed MoE architecture is not evaluated over multi-task environments, especially ones that need different strategies for the different tasks in the environments.
- The benefit of the MoE and the task-oriented exploration strategies are coupled. The authors need to decouple these two components and show the effectiveness of the MoE.
- The authors need to compare with other techniques that can handle the multi-modality like transformers, diffusion-based policy.
Questions
The authors need to address my concerns in the weakness section.
The authors need to compare with other techniques that can handle the multi-modality like transformers, diffusion-based policy.
Thanks for your comments! Multi-modality frameworks such as Transformers and diffusion models have been widely used in language-conditioned generation or guidance, but are not widely adopted in the visual deep reinforcement learning community [1,2,3]. Although multi-modality models are not well-aligned with the scope of this paper, we found a way to adapt the Transformer model to our research question.
To explore the potential of transformer-based models, we implemented a vision transformer (ViT) encoder following the setup from previous work[4]. Specifically, we replaced the CNN visual encoder in the DrM baseline with a ViT encoder. This ViT processes 84×84 images with 12×12 patches. The patch embeddings have a dimension of 128, with 4 transformer layers, each having 4 attention heads. To avoid running out of GPU memory, we set the batch size to 32. Our findings are as follows:
- Throughput: Due to the substantial number of parameters in ViT, this replacement significantly reduced the training speed. On an RTX 3090 GPU, the training throughput is reduced from 5000 to 500 (throughput = batch_size * steps per second).
- Performance: Due to the time constraint, we did not finish the whole training process in the Hammer task. However, as shown in the rebuttal website Section 3, this change did not lead to significant performance improvements compared to the baseline method.
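For reference, the ViT encoder configuration described above (84×84 input, 12×12 patches, 128-dim patch embeddings, 4 layers, 4 heads) corresponds roughly to the sketch below built from standard PyTorch modules; it is an approximation of the setup in [4], not the exact code used in our experiments, and the mean pooling at the end is a simplifying assumption.

```python
# Rough sketch of the ViT encoder configuration used to replace the CNN encoder in DrM.
import torch
import torch.nn as nn

class ViTEncoder(nn.Module):
    def __init__(self, img_size=84, patch=12, dim=128, depth=4, heads=4, in_ch=3):
        super().__init__()
        n_patches = (img_size // patch) ** 2                 # 7 * 7 = 49 patches
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                    # x: (B, 3, 84, 84)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, 49, 128)
        return self.encoder(tokens + self.pos_embed).mean(dim=1)  # pooled (B, 128) feature
```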
As for the diffusion-based policy, a concurrent work named DPPO [5] has been published recently, which fine-tunes pre-trained diffusion-based policies through policy gradient methods. Before this, diffusion-based policies and RL had not been closely integrated, as policy gradient methods were generally considered inefficient for training diffusion policies from scratch in continuous control tasks [6, 7]. We believe the use of pre-trained diffusion models and expert demonstrations is out of the scope of this paper, and due to time constraints, we are unable to include a fair comparison. We plan to conduct further experiments on this aspect in the future.
[1] Yarats, Denis, et al. "Mastering visual continuous control: Improved data-augmented reinforcement learning." arXiv preprint arXiv:2107.09645 (2021).
[2] Laskin, Michael, et al. "Curl: Contrastive unsupervised representations for reinforcement learning." International conference on machine learning. PMLR, 2020.
[3] Laskin, Misha, et al. "Reinforcement learning with augmented data." Advances in neural information processing systems 33 (2020): 19884-19895.
[4] Tao, T., et al. "Evaluating vision transformer methods for deep reinforcement learning from pixels." arXiv preprint arXiv:2204.04905 (2022).
[5] Ren, A. Z., et al. "Diffusion policy policy optimization." arXiv preprint arXiv:2409.00588 (2024).
[6] Psenka, M., et al. "Learning a diffusion model policy from rewards via q-score matching." arXiv preprint arXiv:2312.11752 (2023).
[7] Yang, L., et al. "Policy representation via diffusion probability model for reinforcement learning." arXiv preprint arXiv:2305.13122 (2023).
Thank you for your helpful comments. We respond to your comments as follows with more experimental results. The additional results are posted on the rebuttal website here.
The proposed MoE architecture needs to be evaluated over multi-task environments, especially ones that need different strategies for the different tasks in the environments.
Thank you for pointing this out. We have indeed evaluated the proposed MoE architecture in multi-task environments, both in simulation and in real-world experiments.
Simulation Results:
In simulation, we evaluate the MoE architecture on the MT5 task from the Meta-World environment. MT5 comprises five distinct tasks: Door-Open, Drawer-Open, Window-Open, Drawer-Close, and Window-Close. To assess the benefits of the MoE architecture, we trained two policies differing only in their backbone structure (MoE vs. MLP). The results, illustrated in rebuttal website Section 2, show that the MoE agent achieves nearly a 100% success rate across all five tasks, whereas the MLP agent achieves an overall success rate of 90%. As illustrated in the original paper’s Section 3.1, the MLP agent exhibits a significant performance disparity between "Open" and "Close" tasks, achieving 100% success in "Close" tasks but only 82% in "Open" tasks. To investigate the role of the agent architecture, we calculated gradient similarities and visualized them in the original paper’s Figure 3. The MLP agent exhibits negative gradient similarities in most inverse-task pairs, whereas the MoE agent does not, highlighting its ability to mitigate gradient conflicts.
Real-World Results:
In real-world experiments, we directly train an RL policy using MENTOR in the Peg Insertion task. The task involves using one policy for inserting different pegs (Star, Triangle, and Arrow) into targets with significantly different poses, requiring the agent to learn distinct policies for each peg. As shown in the original paper’s Table 1, MENTOR with the MoE backbone achieves significantly better and more balanced performance across all pegs compared to MENTOR without MoE. To further analyze the contribution of the MoE structure, we recorded the Expert Usage Heatmap (in the original paper’s Figure 11). While Expert 6 is universally activated across all pegs, other experts exhibit clear preferences for specific peg types. This demonstrates that the MoE structure enables the policy to assign different experts to specialize in different tasks, improving overall performance.
The benefit of the MoE and the task-oriented exploration strategies are coupled. The authors need to decouple these two components and show the effectiveness of the MoE.
Thank you for your suggestion! To address your concern, we have conducted additional ablation studies to decouple the two components on five diverse tasks. The results are posted in the general response as well as in the rebuttal website Section 1. Please feel free to refer to them.
Dear Reviewer 784n,
We thank you again for your valuable comments and suggestions.
In our earlier response, we provided detailed clarifications addressing your questions about our paper and included additional experimental results based on your excellent suggestions on the rebuttal website here. As the author-reviewer discussion stage is nearing its conclusion, we kindly request you to review our revised paper and response, and reconsider your scores if our response has adequately addressed your concerns.
If you have any additional questions, we would be happy to provide further clarifications. We sincerely look forward to your feedback.
Best regards,
The authors
I will maintain my score and lean towards rejecting this work, as the rebuttal does not fully address my concerns. First, from Section 1 of the rebuttal, it is evident that the performance improvements from both MENTOR w/o MoE and MENTOR w/o TP to MENTOR (Ours) are marginal, suggesting that MENTOR’s performance might benefit more from better hyperparameter tuning rather than these components. Additionally, in Section 3 of the rebuttal, it is mentioned that the ViT is trained with a batch size of 32, which is too small for continuous control tasks, making this part of the experiment unconvincing.
Dear Reviewer 784n,
Thanks for your reply.
From Section 1 of the rebuttal, it is evident that the performance improvements from both MENTOR w/o MoE and MENTOR w/o TP to MENTOR (Ours) are marginal, suggesting that MENTOR’s performance might benefit more from better hyperparameter tuning rather than these components.
We believe the improvements of MENTOR (Ours) over the two ablations, MENTOR_w/o_MoE and MENTOR_w/o_TP, are significant. To establish a baseline for comparison, we define the standard training time as follows:
Standard Training Time: Let $T_{\text{MENTOR}}$, $T_{\text{MENTOR\_w/o\_MoE}}$, and $T_{\text{MENTOR\_w/o\_TP}}$ denote the time required for the three methods to reach the same performance (the final performance of the worst method). The standard training time $T_{\text{standard}}$ is defined as the training time for the worst method to achieve this performance:

$$T_{\text{standard}} = \max\left(T_{\text{MENTOR}},\ T_{\text{MENTOR\_w/o\_MoE}},\ T_{\text{MENTOR\_w/o\_TP}}\right)$$

We define normalized sample efficiency as $T_{\text{method}} / T_{\text{standard}}$ (lower is better).
| Normalized Sample Efficiency (lower is better) | Hopper Hop | Disassemble | Coffee Push | Soccer | Hammer |
|---|---|---|---|---|---|
| MENTOR (ours) | 0.6167 | 0.7056 | 0.8066 | 0.6237 | 0.7167 |
| MENTOR_w/o_TP | 1 | 0.8505 | 0.9481 | 0.7312 | 0.875 |
| MENTOR_w/o_MoE | 0.85 | 1 | 1 | 1 | 1 |
MENTOR (Ours) requires on average 28.5% and 21.2% less training time across the 5 tasks than MENTOR_w/o_MoE and MENTOR_w/o_TP, respectively, and achieves significantly higher episode reward and success rates in the Hopper Hop and Soccer tasks.
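For transparency, the 28.5% and 21.2% figures can be recovered from the table above as the relative reduction of the average normalized sample efficiency, for example:

```python
# Reproducing the reported average training-time reductions from the table above.
mentor      = [0.6167, 0.7056, 0.8066, 0.6237, 0.7167]
without_tp  = [1.0,    0.8505, 0.9481, 0.7312, 0.8750]
without_moe = [0.8500, 1.0,    1.0,    1.0,    1.0]

avg = lambda xs: sum(xs) / len(xs)
print(f"{1 - avg(mentor) / avg(without_moe):.3f}")  # ~0.285 -> 28.5% less than MENTOR_w/o_MoE
print(f"{1 - avg(mentor) / avg(without_tp):.3f}")   # ~0.212 -> 21.2% less than MENTOR_w/o_TP
```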
As for the hyperparameters, we use the same set of hyperparameters in MENTOR and all the ablation studies as in DrM’s original code (https://github.com/XuGW-Kevin/DrM) without tuning, which suggests that the performance improvements are largely due to the proposed technical contributions.
In Section 3 of the rebuttal, it is mentioned that the ViT is trained with a batch size of 32, which is too small for continuous control tasks
We have switched to a GPU with more memory to run the ViT-based method with a larger batch size (256). We estimate this ViT experiment will take approximately 3 days. We will keep you updated with the results as training progresses.
Dear Reviewer 784n,
We have conducted experiments using the ViT-based method with the same batch size as MENTOR (bs=256) on the Hammer (Sparse) task. Please refer to the results in Section 3 of the rebuttal website here. The performance of ViT-bs256 surpasses both DrM and ViT-bs32 but remains significantly less efficient than MENTOR and even MENTOR_w/o_MoE and MENTOR_w/o_TP, as demonstrated in Section 1 of the rebuttal website. This underscores the effectiveness of our proposed techniques.
Additionally, we would like to highlight that the implementation of a Transformer-based encoder is orthogonal to the scope of our work. The same visual encoder could be directly applied to MENTOR by replacing the CNN encoder with the ViT version. We hope this response addresses your concerns. Should you have any further questions, we would be happy to provide additional clarifications. We sincerely look forward to your feedback.
Best regards,
The authors
We thank the reviewers for their insightful comments and helpful suggestions. We are pleased that the reviewers find the proposed method well-motivated and effective (Reviewers 8sj6, 1toB, TuwR), and supported by extensive experimental results validating its efficacy in both simulation and real-world RL environments (Reviewers 784n, 8sj6, 1toB, TuwR).
We provide additional clarifications, explanations and discussion in the per-reviewer responses as well as our rebuttal website here.
The main concerns from the reviewers focus on the lack of ablation studies to separately demonstrate the effectiveness of both the architectural (MoE) and optimization (Task-oriented Perturbation) improvements of MENTOR. To address the reviewers' concerns, we have conducted additional ablation studies on five diverse tasks: Hopper Hop, Disassemble, Coffee-Push (Sparse), Soccer (Sparse), and Hammer (Sparse). These studies aim to decouple the effects of the MoE architecture and the Task-oriented Perturbation (TP) mechanism proposed in our paper.
For the experiments, we evaluate four versions of MENTOR (the full model and three ablations) using the same four random seeds as in the original experiments, as shown in Section 1 of the rebuttal website:
- MENTOR: Full model with both MoE and Task-oriented Perturbation.
- MENTOR_w/o_TP: Task-oriented Perturbation is replaced with random perturbation.
- MENTOR_w/o_MoE: The policy backbone uses an MLP architecture instead of MoE.
- MENTOR_w/o_TP_MoE: Neither MoE nor Task-oriented Perturbation is used.
The results, summarized below, demonstrate the individual contributions of each component:
- MENTOR_w/o_MoE consistently outperforms MENTOR_w/o_TP_MoE, and MENTOR_w/o_TP outperforms MENTOR_w/o_TP_MoE in 4 out of 5 tasks, indicating that both the MoE architecture and Task-oriented Perturbation independently contribute to improved policy learning.
- However, the overall sample efficiency and performance of MENTOR_w/o_TP and MENTOR_w/o_MoE remain lower than the full MENTOR model. This underscores the complementary nature of these two components in enhancing the overall learning efficiency and robustness of MENTOR.
This paper introduces MENTOR, a method aimed at addressing sample efficiency and gradient conflict in Visual RL. It achieves this by employing a mixture-of-experts (MoE) network and a task-oriented perturbation mechanism. The paper demonstrates the effectiveness of this approach through experiments conducted in both simulated and real-world robotic manipulation tasks.
Strengths:
- The use of MoE to address gradient conflicts and alleviate the burden of shared parameters is well-motivated and innovative in the context of RL. (8sj6, 1toB, TuwR)
- The paper is well-written, clear, and easy to follow. (784n, TuwR)
Weaknesses:
- The contributions of individual components—MoE and task-oriented perturbation—are not well-isolated, leading to ambiguity in attributing performance gains. (784n, 1toB, TuwR)
- The hyperparameters are not sufficiently explored, and their impact on the experimental findings is unclear. (784n, TuwR)
While MENTOR introduces a compelling and innovative approach, the paper’s weaknesses in experimental rigor are notable. The lack of comprehensive ablation studies limits the clarity of the contributions. Given these concerns, I lean toward rejection. However, the foundational ideas hold significant promise, and with the outlined improvements, this work could make a substantial contribution in the future.
Additional Comments on Reviewer Discussion
The reviewers raised critical points regarding the lack of ablation studies and the incomplete analysis of parameter choices. Specifically:
- Ablation Studies (Reviewers 784n, 1toB, TuwR): Several reviewers emphasized the need for detailed experiments to isolate the contributions of MoE and task-oriented perturbation. While the authors clarified some aspects, the reviewers felt their concerns were not fully addressed.
- Hyperparameter Choices (Reviewers 784n, TuwR): Reviewers expressed concerns about various hyperparameter choices in different experiments and their potential effects on the findings. Although the authors provided some clarifications, these explanations did not fully alleviate the reviewers' concerns.
Overall, while the rebuttal addressed some ambiguities, the above concerns weighed heavily in the decision to recommend rejection.
Reject