PaperHub
Score: 6.3 / 10
Poster · 3 reviewers
Ratings: 3, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

ARS: Adaptive Reward Scaling for Multi-Task Reinforcement Learning

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-08-14
TL;DR

We propose Adaptive Reward Scaling (ARS) for multi-task RL, which balances rewards across tasks and includes periodic network resets to enhance stability and efficiency, outperforming baselines on Meta-World and excelling on complex tasks.

Abstract

Keywords
reinforcement learning, multi-task reinforcement learning, reward scaling

Reviews and Discussion

Official Review
Rating: 3

Multi-task reinforcement learning (MTRL) algorithms face challenges when tackling tasks with varying complexities and reward distributions. In this work, the authors propose a method for handling the varying reward magnitudes across tasks by adaptively scaling the reward of each task using a history-based reward scaling strategy. Furthermore, to prevent early overfitting to a few tasks, the authors adopt a resetting mechanism from the single-task deep RL literature. The proposed approach, named ARS, is benchmarked on Meta-World against related baselines and shows promising results in handling large-scale MTRL settings with varying reward magnitudes.

Questions for Authors

  • For the ablation study in Figure 4, does the ARS w/o reset baseline use the same frequency (n_reset) for updating the reward scaling factors, i.e., the same frequency at which the full method performs the factor update plus resetting?

Claims and Evidence

In this work, three main claims were presented:

  1. The introduction of reward scale variation as a challenge in multi-task RL.
  2. The importance of adaptively scaling the rewards of different tasks because of the varying reward magnitudes among them.
  3. The role of resetting, which is integrated to alleviate overfitting to early-learned tasks and to stabilize critic training, since the proposed adaptive scaling can destabilize the critic as the Q target changes frequently.

In my opinion,

  1. The problem of varying reward distributions, or reward scale variation, is a known issue that has been discussed in the literature [1]. I do not think that introducing this issue should count as a contribution of this work. Nevertheless, the fixed-reward-scaling view of the problem is interesting, as is the connection to reward scaling in single-task deep RL. In addition, the example presented in Figure 2 shows another dimension of the problem, hence motivating adaptive scaling.

  2. The method introduced in this work to adaptively scale the reward magnitude is novel.

  3. The resetting mechanism, in general, can help MTRL training, as it has been adopted in prior works [2,3]. In this work, resetting the networks is motivated by more than one reason, as stated above. Since it is an essential component of the algorithm and crucial to its performance, it is important to strongly support the claim behind adding this component to ARS with ablation studies, for example by showing how the critic suffers without resetting, given the changes in reward magnitude caused by the adaptive scaling. Otherwise, the resetting mechanism is not really a contribution of this work; it is just an adoption of an existing tool from the literature [3].

[1] Hessel, Matteo, et al. "Multi-task deep reinforcement learning with PopArt." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, No. 01, 2019.
[2] Sun, Lingfeng, et al. "PaCo: Parameter-compositional multi-task reinforcement learning." Advances in Neural Information Processing Systems 35 (2022): 21495-21507.
[3] Cho, Myungsik, et al. "Hard tasks first: Multi-task reinforcement learning through task scheduling." The 41st International Conference on Machine Learning, 2024.

Methods and Evaluation Criteria

  • I believe the authors proposed a new method for adapting the scale of the reward magnitude between tasks, which is effective when looking at the empirical results.
  • For the evaluation criteria, I believe Meta-World is a good benchmark for studying the effectiveness of MTRL algorithms, especially the MT50 scenario, which is large-scale.

Theoretical Claims

  • No theoretical claims were provided or discussed in this work.

Experimental Designs or Analyses

  • In general, all experiments are well suited to demonstrating the effectiveness of the proposed method in the MTRL setting.
  • I appreciate the teaser experiment added in Figure 2.
  • I have concerns regarding the baselines.
    1. As stated before, the problem of varying reward distributions is known in the MTRL literature. One important baseline is PopArt [1], which is, as far as I know, the first work to discuss this issue in the MTRL setting. This approach is similar to the normalization baseline in the ablation in Table 5, yet not identical. PopArt has neither been used as a baseline nor cited in this work.
    2. In addition, MOORE [2] is a recent MTRL approach that reported SOTA results on Meta-World, in particular on MT50. However, MOORE has neither been benchmarked against nor mentioned in the related work section.
  • For the Meta-World MT10 and MT50 scenarios, it is not clear to me whether this experiment considers random goal positions (MT10-rand and MT50-rand) [3] or fixed goal positions.
  • The performance of PaCo is lower in this work than in the original paper. This could be due to the different network architecture and hyperparameters, but why not follow the original hyperparameters of each method?
  • I have a concern regarding the experiment in Figure 4. I do not understand how some baselines can have a higher ESTR value as the threshold δ increases; in particular, PaCo has a higher value at δ = 0.7 than at δ = 0.5 (see the short check below).
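To make the monotonicity point concrete, a short check under an assumed reading of the metric (I take ESTR(δ) to be the fraction of tasks whose success rate reaches at least δ; the exact definition is the one given in the paper):

```python
# Assumed definition for illustration only: ESTR(delta) = fraction of tasks whose
# success rate is at least delta. Under this reading, ESTR is non-increasing in
# delta, which is why the reported PaCo values look inconsistent.
def estr(success_rates: list[float], delta: float) -> float:
    return sum(rate >= delta for rate in success_rates) / len(success_rates)

rates = [0.9, 0.8, 0.6, 0.4]  # hypothetical per-task success rates
assert estr(rates, 0.7) <= estr(rates, 0.5)  # 0.5 <= 0.75
```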

[1] Hessel, Matteo, et al. "Multi-task deep reinforcement learning with PopArt." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, No. 01, 2019.
[2] Hendawy, Ahmed, Jan Peters, and Carlo D'Eramo. "Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts." The Twelfth International Conference on Learning Representations.

Supplementary Material

  • I checked the whole supplementary material. Notably, I appreciate the illustrative diagram in Figure 14.

Relation to Existing Literature

  • I believe this work highlights the varying reward distribution issue, which has been studied previously in the literature [1]. This underscores the importance of examining reward magnitude as a cause of instability in MTRL training.
  • In addition, the proposed reset strategy shows empirical effectiveness, supporting previous claims discussed in the literature [2,3].

[1] Hessel, Matteo, et al. "Multi-task deep reinforcement learning with PopArt." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, No. 01, 2019.
[2] Sun, Lingfeng, et al. "PaCo: Parameter-compositional multi-task reinforcement learning." Advances in Neural Information Processing Systems 35 (2022): 21495-21507.
[3] Cho, Myungsik, et al. "Hard tasks first: Multi-task reinforcement learning through task scheduling." The 41st International Conference on Machine Learning, 2024.

Essential References Not Discussed

  • I believe this work should cite the following papers for the aforementioned reasons:
    [1] Hessel, Matteo, et al. "Multi-task deep reinforcement learning with PopArt." Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, No. 01, 2019.
    [2] Hendawy, Ahmed, Jan Peters, and Carlo D'Eramo. "Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts." The Twelfth International Conference on Learning Representations.

Other Strengths and Weaknesses

  • All points have been mentioned in Claims And Evidence & Experimental Designs Or Analyses.

Other Comments or Suggestions

  • In the Preliminaries section, in the Multi-Task Reinforcement Learning subsection, H was not defined. I believe it is the horizon.
Author Response

We thank Reviewer Cp7C for their thoughtful feedback and valuable suggestions, which have significantly improved our paper. We have carefully addressed each comment, strengthened our experimental results, and clarified our key contributions accordingly. Below, we respond in detail to each point raised by the reviewer.

Other Multi-Task RL Baseline (PopArt and MOORE):

We appreciate the reviewer’s suggestion regarding the inclusion of additional multi-task RL baselines, specifically PopArt [1] and MOORE [2]. In response, we conducted comparative experiments using MOORE on the MT10 and MT50 benchmarks.

Since the Meta-World v2 environments default to a horizon of 500, we used this setting across all our experiments. However, the original MOORE paper used a horizon of 150, making a direct comparison with our ARS results challenging. To address this, we attempted to run MOORE with a horizon of 500, but the official code required roughly 200 hours of computation on the MT50 benchmark alone. Consequently, we evaluated ARS using a horizon length of 150 for fairness. The ARS results are shown in Table 1 at the following anonymous link:

https://sites.google.com/view/icml25ars

Notably, the MOORE setup on MT50 (n_expert = 6) includes significantly more parameters than our default ARS (400×4). To ensure a fairer comparison, we tested MOORE against an enlarged ARS variant (800×4), which has a comparable parameter count. Despite the comparable model size, ARS consistently outperformed MOORE on both benchmarks while requiring significantly less computational time. For instance, on MT50, MOORE requires about 200 hours to train, whereas ARS (800×4) surpasses its performance in just 22 hours, an order of magnitude faster. These results demonstrate that our ARS framework achieves significant performance improvements without incurring substantial computational overhead. We will include the MOORE results in the revised version.

We also tested PopArt on MT10. Because the official implementation is unavailable, we reimplemented PopArt’s scale-invariant updates within SAC-MT, using SAC target values for critic learning. We varied the update frequency over {1, 10, 100, 500}. The outcomes are shown in Table 2 at the following anonymous link:

https://sites.google.com/view/icml25ars

Across all update frequencies, our ARS method consistently outperformed the PopArt variants on the MT10 benchmark. We will include comprehensive PopArt results for both the MT10 and MT50 benchmarks in the revised paper.
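For clarity, below is a minimal sketch of the PopArt-style scale-invariant head used in this reimplementation (the class name, the EMA coefficient, and the plain-NumPy formulation are illustrative simplifications, not our exact code):

```python
import numpy as np

class PopArtHead:
    """Sketch of a PopArt-style scale-invariant output layer (illustrative only)."""

    def __init__(self, in_dim: int, beta: float = 3e-4):
        self.w = np.random.randn(in_dim) / np.sqrt(in_dim)  # last-layer weights
        self.b = 0.0                                         # last-layer bias
        self.mu, self.nu = 0.0, 1.0                          # running 1st/2nd moments of targets
        self.beta = beta

    def _sigma(self) -> float:
        return float(np.sqrt(max(self.nu - self.mu ** 2, 1e-8)))

    def value(self, features: np.ndarray) -> float:
        # The network predicts in normalized space; un-normalize for the actual estimate.
        return self._sigma() * float(features @ self.w + self.b) + self.mu

    def update_stats(self, targets: np.ndarray) -> None:
        # Update running statistics of the (un-normalized) targets ...
        old_mu, old_sigma = self.mu, self._sigma()
        self.mu = (1 - self.beta) * self.mu + self.beta * float(np.mean(targets))
        self.nu = (1 - self.beta) * self.nu + self.beta * float(np.mean(targets ** 2))
        new_sigma = self._sigma()
        # ... and rescale the head so that un-normalized predictions are preserved
        # (PopArt's "preserving outputs precisely" step).
        self.w *= old_sigma / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma
```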

[1] Hessel, Matteo, et al. "Multi-task deep reinforcement learning with popart." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. No. 01. 2019.

[2] Hendawy, Ahmed, Jan Peters, and Carlo D'Eramo. "Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts." The Twelfth International Conference on Learning Representations.

Effectiveness of Reset Mechanism

We thank the reviewer for suggesting a more thorough analysis of the reset mechanism. Table 4 in our paper already highlights its importance in stabilizing performance. To further investigate, we analyzed per-task Q-values during training on MT10, comparing ARS with and without the reset mechanism. Results are shown in Figure 1 at the following anonymous link:

https://sites.google.com/view/icml25ars

As shown in Figure 1, training without the reset mechanism leads to significantly lower Q-values across tasks, with negative values appearing in tasks such as 'push' and 'pick-place', despite rewards always being positive. Q-values also show greater variance without reset, underscoring the reset mechanism’s role in ensuring training stability.

Random Goal Positions

Sorry for the confusion. Random goal positions are used in our setup.

Lower Performance of PaCo in MT10

We use the same hyperparameters for PaCo as in the original paper. The performance gap likely stems from the difference in horizon lengths: the original PaCo uses 150, while Meta-World v2 defaults to 500, which we follow. To isolate this effect, we also run ARS with a horizon of 150. Results are shown in Table 1 at the anonymous link below:

https://sites.google.com/view/icml25ars

ESTR Results

We sincerely apologize for the confusion regarding the ESTR results. The reported ESTR value of PaCo at δ = 0.7 was incorrect and should be revised from 0.6 to 0.5.

Definition of H

Yes, H is the horizon length.

Frequency (n_reset) for ARS w/o reset

We investigated the performance of the ARS w/o reset baseline with various values of n_reset and selected the best-performing setting for the paper. The value of n_reset used for the ARS w/o reset baseline is 40.

Reviewer Comment

I would like to thank the authors for addressing most of my concerns. Also, I appreciate adding PopArt and MOORE as baselines, and from this experiment and given it is a random goal setting, ARS is indeed performing very well.

I still have concerns regarding the novelty of the resetting mechanism. I believe the concept of resetting is not novel in MTRL [1,2]. More importantly, I believe the exact same mechanism was introduced in SMT [2]. The authors' answer regarding this point was not convincing. I am not asking whether resetting is important; I am asking how novel this mechanism is compared to [2]. In other words, what differentiates this resetting mechanism from the one introduced in SMT [2]?

I am stressing this point because I can clearly see from the results, and from the authors' response, that resetting plays an important role.

[1] Sun, Lingfeng, et al. "Paco: Parameter-compositional multi-task reinforcement learning." Advances in Neural Information Processing Systems 35 (2022): 21495-21507.

[2] Cho, Myungsik, et al. "Hard tasks first: multi-task reinforcement learning through task scheduling." The 41st International Conference on Machine Learning. 2024.

Author Comment

Thanks for your further comment.

The novelty of this paper lies in the use of 'reward scaling' combined with resetting.

As the reviewer mentioned, resetting is not new in MTRL, e.g., [2]. But in [2], the authors adopted 'scheduling' together with resetting to facilitate the learning of hard tasks, assigning more resources and earlier training time to them, because hard tasks mainly degrade overall performance while easy tasks are learned quickly. Still, [2] has a limitation: it could not solve all MT10 tasks, although the performance was much improved.

In our work, we used reward scaling to boost the performance of hard tasks and adopted an equalizer rule: our reward scaling tries to make the rewards of all MT tasks have the same magnitude, so that no single task is favored or disfavored.
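As a rough illustration of this equalizer rule (a minimal sketch only; the function name and the use of per-task mean absolute rewards from the replay buffer are simplifications, not the exact ARS formula):

```python
import numpy as np

def update_reward_scales(per_task_rewards, target_scale=1.0, eps=1e-8):
    """Equalizer-rule sketch: bring each task's typical reward magnitude,
    estimated from the rewards currently in the replay buffer, to a common level."""
    return {
        task: target_scale / (float(np.mean(np.abs(rewards))) + eps)
        for task, rewards in per_task_rewards.items()
    }

# Hypothetical usage with two tasks whose raw rewards differ by roughly 100x:
scales = update_reward_scales({"reach": [8.0, 10.0, 12.0], "pick-place": [0.05, 0.10, 0.15]})
print(scales)  # the low-reward task receives a proportionally larger scale factor
```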

The authors of [2] did not recognize that the strong bias of the MT policy towards easy tasks can simply be corrected by reward scaling. Note that this bias is basically due to the large early rewards of easy tasks. (Recall that the policy gradient is score * Q, where Q is just an expected weighted sum of discounted rewards, so larger rewards for a subtask mean a larger policy gradient towards that subtask.)
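To spell out the magnitude argument in standard policy-gradient notation (the per-task decomposition below is for illustration, not a formula from the paper):

```latex
\nabla_\theta J(\theta)
  = \sum_{i=1}^{N} \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s, i)\, Q_i^{\pi_\theta}(s, a) \right],
\qquad
Q_i^{\pi_\theta}(s, a) = \mathbb{E}\!\left[ \sum_{k \ge 0} \gamma^{k} r_{i,k} \right].
% Scaling task i's rewards by a factor c_i scales Q_i, and hence task i's term in the
% gradient, by the same c_i. The equalizer rule chooses c_i so that the scaled reward
% magnitudes, and thus the per-task gradient contributions, are comparable across tasks.
```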

Our reward scaling simplifies the overall procedure significantly and achieves a significant performance gain. To the best of our knowledge, this approach solved all MT10 tasks for the first time, which is a milestone in MTRL. In fact, during the rebuttal period, we ran further experiments in which we adopted layer normalization, which is commonly used in deep learning for stable training. In these new experiments, we achieved the following average success rates:

MT10 (hidden units 400): 97.3% (w/o layer norm) → 98.16% (w/ layer norm)
MT50 (hidden units 400): 68.5% (w/o layer norm) → 78.3% (w/ layer norm)
MT50 (hidden units 1024): 78.7% (w/o layer norm) → 88.85% (w/ layer norm)

Please note that such high performance has not been reported before.

We believe that our contribution, revealing that such simple reward scaling combined with resetting can solve MTRL effectively, is not trivial in the area of MTRL, and we believe our work is worth sharing with the MTRL community via publication.

Official Review
Rating: 4

The paper introduces Adaptive Reward Scaling (ARS), a novel framework designed to tackle the difficulties caused by varying reward distributions in multi-task reinforcement learning.

ARS employs a history-based reward scaling strategy that dynamically adjusts reward magnitudes to ensure balanced training focus across diverse tasks. Additionally, ARS incorporates a reset mechanism that mitigates biases introduced by early-learned tasks, enhancing adaptability and convergence. The framework integrates seamlessly into existing off-policy algorithms and has demonstrated state-of-the-art performance on the Meta-World benchmark.

Update after rebuttal

I don't have major concerns regarding this paper. I still recommend acceptance.

Questions for Authors

  • I assume the rewards used in this paper are dense. Would varying reward scales still pose an issue if binary sparse rewards were used instead?
  • The reward scales change dynamically during training. Could this variation in reward scales affect training stability? If not, what mechanisms ensure stability?

Claims and Evidence

I think most of the claims in this submission are supported by evidence.

Methods and Evaluation Criteria

Yes, they make sense to me.

Theoretical Claims

This submission does not include proofs.

Experimental Designs or Analyses

I went through all the experiments, and most of them make sense to me.

Supplementary Material

Yes, Sections C and D.

Relation to Existing Literature

Reward scaling has been shown to be effective in prior studies [1, 2]; however, most of these works focus on single-task settings. In contrast, this paper addresses the multi-task setting, where reward scaling poses greater challenges due to varying reward magnitudes across tasks.

[1] Wu, Yueh-Hua, et al. "ANS: adaptive network scaling for deep rectifier reinforcement learning models." arXiv preprint arXiv:1809.02112 (2018).

[2] Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI conference on artificial intelligence. Vol. 32. No. 1. 2018.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strengths

  • This paper directly addresses the challenge of varying reward distributions across tasks in multi-task RL. Improper reward scales can result in biased training and suboptimal performance. As an RL practitioner, I would say this is a critical yet frequently overlooked issue in the field of RL.

  • The ARS framework proposed in this paper includes a history-based reward scaling strategy and a reset mechanism. It's simple but very intuitive.

  • The ARS framework demonstrates strong empirical results on the Meta-World benchmark, solving the MT10 benchmark from scratch.

  • The proposed ARS framework seems to be applicable to any off-policy multi-task RL method. The authors demonstrate its applicability by integrating it into various off-policy multi-task approaches.

Weaknesses

  • The proposed framework is evaluated solely on Meta-World, a relatively simple benchmark. The conclusions would be more compelling if more challenging tasks were included, such as those in [1, 2, 3].

[1] Zhu, Yuke, et al. "robosuite: A modular simulation framework and benchmark for robot learning." arXiv preprint arXiv:2009.12293 (2020).

[2] Mu, Tongzhou, et al. "Maniskill: Generalizable manipulation skill benchmark with large-scale demonstrations." arXiv preprint arXiv:2107.14483 (2021).

[3] Chernyadev, Nikita, et al. "Bigym: A demo-driven mobile bi-manual manipulation benchmark." arXiv preprint arXiv:2407.07788 (2024).

Other Comments or Suggestions

N/A

Author Response

We thank Reviewer mXD6 for their thoughtful feedback and valuable suggestions, which have significantly improved our paper. We have carefully addressed each comment, strengthened our experimental results, and clarified our key contributions accordingly. Below, we respond in detail to each point raised by the reviewer.

Limited Environments

We appreciate the reviewer’s suggestion of three benchmarks for multi-task RL. We believe that the MT10 and MT50 benchmarks from Meta-World are widely recognized and present significant challenges in multi-task RL. Since no existing method has fully solved both, demonstrating strong performance on these benchmarks effectively highlights the strength of our approach.

That said, we agree that including the suggested benchmarks could further strengthen our claims. Benchmarks [2] and [3] are primarily designed for demonstration-driven settings, such as imitation learning and offline RL. While our current focus is on the online setting, we believe that extending our ARS method to offline multi-task RL would be an interesting direction for future work. We appreciate the reviewer for introducing these valuable benchmarks.

Reward Scale Issue with Sparse Reward Setting

Even in sparse reward settings, task difficulty can vary significantly across tasks, leading to large variations in returns during training. This can cause an uneven reward distribution in the replay buffer. Applying the ARS framework can help mitigate this issue.

Instability caused by varying reward scales

Large variations in reward scales during training can indeed destabilize the learning process. To mitigate this, we update the reward scales only during reset periods—four times in MT10 and six times in MT50. This reset mechanism helps maintain stability by preventing frequent changes in reward scaling. Additionally, since the networks are initially trained with the established reward scales, this approach further supports stable training throughout.

Reviewer Comment

Thank you to the authors for their response. I still recommend acceptance.

Official Review
Rating: 3

This paper introduces Adaptive Reward Scaling, a novel framework for multi-task reinforcement learning that dynamically adjusts reward magnitudes using a history-based scaling strategy and integrates a periodic network reset mechanism to mitigate overfitting and biases toward simpler tasks. The empirical results on the Meta-World show some improvements in success rates compared to several established baselines.

Update after rebuttal: I appreciate the authors' thorough revision and the additional experiments addressing my concerns (especially the attempt with MOORE). I hope these will be incorporated into the revised draft. I have decided to increase my score.

Questions for Authors

  • Do you have any intuition for why SAC-MT works well with ARS on MT10, while Soft Modular works well with ARS on MT50?
  • Can you provide more insight into the computational overhead and potential limitations introduced by the reset mechanism?
  • How can ARS be extended or adapted to other multi-task or multi-agent settings?

Claims and Evidence

The paper claims that adaptive reward scaling combined with resets improves training stability and overall performance in multi-task settings. These claims are supported by comprehensive experimental evidence, including detailed success-rate tables and ablation studies isolating the contributions of each component. The experimental results are convincing, though a few recent methods are missing from the baselines, and additional experiments on diverse benchmarks could further validate the generality of the method.

Methods and Evaluation Criteria

The proposed approach is well motivated and clearly explained. The use of the replay buffer to compute task-specific mean rewards for scaling and the integration of resets to counteract biases are both novel and appropriate for the challenges of multi-task RL. The evaluation criteria—success ratios, effective solvable task ratios, and ablation studies—make sense, and they demonstrate the performance gains in a straightforward way.

Theoretical Claims

The paper does not emphasize formal proofs or theoretical guarantees but rather focuses on the algorithmic innovation and empirical validation.

Experimental Designs or Analyses

The experimental design is robust, comparing ARS against several state-of-the-art baselines on standard benchmarks. The inclusion of ablation studies provides clear insight into the effectiveness of both the reward scaling mechanism and the reset strategy. It would have been better to explore beyond the Meta-World suite, though I understand the current situation; exploring additional environments could further support the claims.

Supplementary Material

The supplementary material—comprising additional experiments, hyperparameter details, and extended ablation studies—was reviewed and provides valuable context that reinforces the primary findings of the paper. However, Appendix A is an exact copy of Appendix A in another paper (Cho et al., 2024). You should not do this.

Relation to Existing Literature

ARS builds on existing work in reward scaling, modular networks, and resetting mechanisms in deep RL. The paper successfully situates its contributions within the broader context of multi-task RL research by comparing with methods such as SAC-MT, PCGrad, and Soft Modular. This connection to prior work is well articulated.

Essential References Not Discussed

The authors should include the paper "Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts" (MOORE, ICLR 2024) both in the related work section and as a baseline in the experiments. In addition, the paper might benefit from a discussion of very recent advances in reward normalization and adaptive scaling across different RL domains to highlight its broader applicability and limitations.

Other Strengths and Weaknesses

Strengths:

  • Clear and innovative formulation of an adaptive reward scaling mechanism.
  • Thorough empirical evaluation with convincing ablation studies.
  • Significant improvements on challenging benchmarks (especially hard problems).

Weaknesses:

  • Theoretical analysis is somewhat limited.
  • Evaluation is limited to Meta-World benchmarks; broader testing could enhance the claims.

Other Comments or Suggestions

  • In Table 4, you should consider statistical significance when bolding the method with the highest performance. For instance, for the easy tasks, the second column is within the range of the third column.
  • Citation mistake in line 400, right column.
  • Typo in line 423. Multi-taks → Multi-task

Ethics Review Concerns

Appendix A is an exact copy of Appendix A in "Cho, M., Park, J., Lee, S., & Sung, Y. (2024, July). Hard tasks first: multi-task reinforcement learning through task scheduling. In Forty-first International Conference on Machine Learning. https://openreview.net/forum?id=haUOhXo70o"

Author Response

We thank Reviewer SzVus for their thoughtful feedback and valuable suggestions, which have significantly improved our paper. We have carefully addressed each comment, strengthened our experimental results, and clarified our key contributions accordingly. Below, we respond in detail to each point raised by the reviewer.

Other Multi-Task RL Baseline (MOORE [1]):

We appreciate the reviewer’s suggestion to include an additional multi-task RL baseline, specifically MOORE [1]. In response, we conducted comparative experiments using MOORE on the MT10 and MT50 benchmarks.

In the default setup of the MetaWorld v2 environments, the horizon length is set to 500, which was used consistently across all our experiments. However, we noticed that the experiments reported in the original MOORE paper [1] were conducted using a horizon length of 150, making a direct comparison between the reported MOORE results and our ARS method inappropriate.

To address this, we attempted to run MOORE with the horizon length set to 500. Unfortunately, we encountered significant computational challenges, as the official implementation requires approximately 200 hours to complete on the MT50 benchmark. To ensure fairness, we evaluated ARS using a horizon length of 150. The ARS results are shown in Table 1 at the following anonymous link:

https://sites.google.com/view/icml25ars

The MOORE setup (n_expert = 6) on MT50 uses significantly more parameters than our default ARS (400×4). To ensure a fair comparison, we evaluated MOORE against a larger ARS variant (800×4) with a similar number of parameters. Despite the comparable model size, ARS consistently outperformed MOORE on both benchmarks while also requiring dramatically lower computational costs. Specifically, on MT50, MOORE requires approximately 200 hours of training, while ARS (800×4) achieves superior performance in just 22 hours—an order-of-magnitude reduction in training time. These results demonstrate that our ARS framework achieves significant performance improvements without incurring substantial computational overhead. We will include MOORE results in the final version.

[1] Hendawy, Ahmed, Jan Peters, and Carlo D'Eramo. "Multi-Task Reinforcement Learning with Mixture of Orthogonal Experts." The Twelfth International Conference on Learning Representations.

Discussion of Recent Advances in Reward Normalization and Adaptive Scaling across Different RL Domains

We thank the reviewer for recommending additional related work. We will expand our related work section accordingly to enhance the paper.

Evaluation is Limited to Meta-World Benchmarks

We acknowledge the importance of including experiments beyond Meta-World. While the MT10 and MT50 benchmarks are widely recognized and challenging in multi-task RL—with no method fully solving both—they serve as a strong demonstration of our approach's effectiveness. Nevertheless, we welcome suggestions for additional benchmarks and will gladly conduct further experiments to strengthen our claims.

Lower Performance of Soft Modular with ARS in MT10

We attribute the initially lower performance of Soft Modular with ARS on MT10 to suboptimal hyperparameter tuning. By increasing the batch size per task from 100 to 128, we significantly improved the performance to 98.8 ± 1.3.

Computational Overhead and Potential Limitations Introduced by the Reset Mechanism

The reset mechanism introduces minimal computational overhead, occurring only during reset periods. With our reset strategy (updating re-initialized networks 1,000 times per reset), the total number of updates increases only slightly from 2,000,000 to 2,006,000—an increase of just 0.3%.
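For intuition, here is a self-contained back-of-the-envelope check of this overhead (the even spacing of the resets is an assumption for illustration; the update counts are the ones quoted above, using the six resets of MT50):

```python
# Overhead of the reset mechanism: 6 resets (MT50), 1,000 warm-up updates each,
# on top of 2,000,000 regular updates. Even spacing of resets is assumed here.
TOTAL_UPDATES = 2_000_000
WARMUP_PER_RESET = 1_000
NUM_RESETS = 6

reset_steps = [round((k + 1) * TOTAL_UPDATES / (NUM_RESETS + 1)) for k in range(NUM_RESETS)]
extra_updates = NUM_RESETS * WARMUP_PER_RESET

print("reset points:", reset_steps)                              # evenly spaced reset steps
print("total updates:", TOTAL_UPDATES + extra_updates)           # 2,006,000
print(f"overhead: {100 * extra_updates / TOTAL_UPDATES:.1f}%")   # 0.3%
```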

How can ARS be extended or adapted to other multi-task or multi-agent settings?

As illustrated in Algorithm 1 and Figure 14, the ARS framework is easily adaptable to any off-policy multi-task RL method by incorporating an adaptive reward scaling factor based on the replay buffer.

In multi-agent settings, we can consider either individual rewards per agent or a global reward. With a global reward, applying ARS is challenging since it effectively becomes a single task. In contrast, individual rewards allow each agent to be treated as a separate task, making ARS more applicable. However, caution is needed—unlike multi-task learning where tasks are independent, agents in multi-agent systems are interdependent. Thus, careful design of the adaptive reward scaling factor is crucial.

Limited Theoretical Analysis

We agree with the reviewer that the theoretical analysis is limited. However, we believe the extensive experimental validation provides strong support for the effectiveness of ARS. We plan to address a comprehensive theoretical analysis in future work.

Other Comments Or Suggestions:

We thank the reviewer for identifying typos, issues with bold symbols, and citation errors. These will be corrected thoroughly in the revised paper.

Final Decision

The authors propose a mechanism to scale the rewards of each environment in a multi-task RL setting, coupled with parameter resets. They show that, compared with various reward scaling mechanisms, the proposed method performs very well.

In particular, the authors have responded to the reviewers' comments, adding two additional baselines and providing well-thought-out responses. While the work restricts itself to Meta-World, the experiments are done carefully and convincingly.

NOTE: Please take care of the duplicated text from Cho et al. (2024) in the appendix, and additionally check the other sections of the paper with the same issue in mind. It would be good not to have any issues on this front.

Therefore I'm happy to recommend accept.