PaperHub
ICLR 2025 · Withdrawn
Overall rating: 3.0 / 10 (3 reviewers; ratings 3, 3, 3; min 3, max 3, std 0.0)
Confidence: 3.7 · Correctness: 2.0 · Contribution: 1.7 · Presentation: 2.7

$f$-Divergence Policy Optimization in Fully Decentralized Cooperative MARL

Links: OpenReview · PDF
Submitted: 2024-09-24 · Updated: 2024-12-04

Abstract

Independent learning is a straightforward solution for fully decentralized learning in cooperative multi-agent reinforcement learning (MARL). The study of independent learning has a history of decades, and the representatives, such as independent Q-learning and independent PPO, can obtain good performance in some benchmarks. However, most independent learning algorithms lack convergence guarantees or theoretical support. In this paper, we propose a general formulation of independent policy optimization, $f$-divergence policy optimization. We show the generality of such a formulation and analyze its limitation. Based on this formulation, we further propose a novel independent learning algorithm, TVPO, that theoretically guarantees convergence. Empirically, we show that TVPO outperforms state-of-the-art fully decentralized learning methods in three popular cooperative MARL benchmarks, which verifies the efficacy of TVPO.
Keywords
multi-agent reinforcement learning, fully decentralized learning, policy optimization, convergence, independent learning

Reviews and Discussion

Review (Rating: 3)

This paper utilizes f-divergence, specifically the total variation, to generalize the KL divergence in independent policy optimization.

Strengths

  • The presentation is clear and easy to follow.

Weaknesses

  • The application of f-divergence in policy optimization is not new; a comprehensive analysis of various distance constraints in policy gradients has been provided in [1].

  • Extending existing single-agent analysis to the multi-agent setting is reasonable, but some assumptions are questionable. Specifically, the approach assumes full observability in MARL making the setting difficult to distinguish from single-agent reinforcement learning. Under full observability, what meaningful difference remains between centralized and decentralized control?

  • The performance improvement appears marginal. With full observability, IPPO has already demonstrated near-optimal performance on SMAC and Multi-Agent MuJoCo. Were the baseline hyperparameters tuned to achieve their optimal reported performance?

  • Why is win rate not used as the evaluation metric for SMAC-v2 tasks?

[1] Zhang, Junyu, et al. "Variational policy gradient method for reinforcement learning with general utilities." Advances in Neural Information Processing Systems 33 (2020): 4572-4583.

Questions

See Weaknesses.

Details of Ethics Concerns

N/A

Comment

The application of f-divergence in policy optimization is not new; a comprehensive analysis of various distance constraints in policy gradients has been provided in [1].

[1] discusses the convergence of policy gradient iteration for a general utility instead of the original reward function in single-agent RL. Though the general utility may cover the $f$-divergence case, our analysis and theoretical results, including the formulation of the policy iteration of $f$-divergence policy optimization, are novel in fully decentralized learning.
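For concreteness, a sketch of the general iteration being referred to, inferred from the total-variation instance quoted in the third review below (the paper's exact notation, e.g. the argument order of $D_f$, may differ):

$$
\pi^i_{t+1} = \arg\max_{\pi^i} \sum_{a_i} \pi^i(a_i \mid s)\, Q_i^{\pi_t}(s, a_i) \;-\; \omega\, D_f\!\left(\pi^i(\cdot \mid s) \,\|\, \pi^i_t(\cdot \mid s)\right),
$$

where choosing $D_f = D_{\mathrm{KL}}$ recovers KL-regularized independent policy optimization and $D_f = D_{TV}$ gives TVPO.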

Extending existing single-agent analysis to the multi-agent setting is reasonable, but some assumptions are questionable. Specifically, the approach assumes full observability in MARL making the setting difficult to distinguish from single-agent reinforcement learning. Under full observability, what meaningful difference remains between centralized and decentralized control?

For the assumption of full observability or a global state, we have a more detailed discussion in Appendix F.6. In conclusion, this assumption is necessary for meaningful theoretical analysis. Even under full observability, decentralized control still cannot observe the other agents' actions, nor can it use a centralized critic to resolve the non-stationarity issue as centralized control does. There are studies on decentralized control under full observability, including [1,2] mentioned by Reviewer vfTk.

[1] Leonardos, Stefanos, et al. "Global convergence of multi-agent policy gradient in markov potential games." arXiv preprint arXiv:2106.01969 (2021).

[2] Fox, Roy, et al. "Independent natural policy gradient always converges in markov potential games." International Conference on Artificial Intelligence and Statistics. PMLR, 2022.

The performance improvement appears marginal. With full observability, IPPO has already demonstrated near-optimal performance on SMAC and Multi-Agent MuJoCo. Were the baseline hyperparameters tuned to achieve their optimal reported performance?

The main difference lies in the experiment setting. As we mentioned in Section 5.2, all the algorithms use independent parameters to agree with the fully decentralized setting, and parameter sharing is not allowed. Moreover, the SMAC tasks and Multi-Agent MuJoCo tasks are partially observable in our experiments. In this setting, IPPO cannot perform as well as in the CTDE setting. As for the hyperparameters, we are running extra experiments on the clip parameter of IPPO and will update the empirical results as soon as possible.

Why is win rate not used as the evaluation metric for SMAC-v2 tasks?

As we mentioned in Section 5.2, these tasks are difficult for fully decentralized learning, so we also use the cumulative reward as the metric. All the algorithms can hardly win in SMAC-v2 tasks, so the win rates would not show any difference.

Comment

We have updated the additional empirical results in Appendix G. We compare the influence of the hyperparameters on IPPO's performance. We choose clip parameter values 0.1, 0.2, and 0.3 for the ablation study and select the 10_vs_10 protoss task for the experiments. The empirical results are illustrated in Figure 14. We can see that the impact of this hyperparameter is not significant. Moreover, we can observe similar results in the TMLR version of DPO [1]. Therefore, we believe that the performance of IPPO is mainly affected by the experiment setting rather than the hyperparameters.

[1] https://openreview.net/pdf?id=MppUW90uU2

Review (Rating: 3)

This paper explores independent learning in the multi-agent reinforcement learning (MARL) setting and introduces f-divergence policy optimization. The authors analyze the limitations of the method with an illustrative example and propose defining the f-divergence as the total variation distance. Theoretical and experimental results confirm the effectiveness of the proposed approach.

Strengths

  1. Detailed related work in the Fully Decentralized Learning field.
  2. The paper introduces a well-grounded technique for achieving monotonic improvement in multi-agent optimization through decentralized learning.
  3. The paper is well-structured and easy to follow.

Weaknesses

  1. The related work on CTDE is incomplete and lacks recent work, such as HASAC [a] and MAT [b].
  2. Assuming global information might limit the impact of this work.
  3. While the experiment results appear promising, the contribution is slightly insufficient compared with existing work [c,d].

a. Liu, Jiarong, et al. "Maximum Entropy Heterogeneous-Agent Reinforcement Learning." The Twelfth International Conference on Learning Representations.

b. Wen, Muning, et al. "Multi-agent reinforcement learning is a sequence modeling problem." Advances in Neural Information Processing Systems 35 (2022): 16509-16521.

c. Grudzien, Jakub, Christian A. Schroeder De Witt, and Jakob Foerster. "Mirror learning: A unifying framework of policy optimisation." International Conference on Machine Learning. PMLR, 2022.

d. Su, Kefan, and Zongqing Lu. "Decentralized policy optimization." arXiv preprint arXiv:2211.03032 (2022).

Questions

  1. Why use different metrics for SMAC (win rate) and SMACv2 (return)?
  2. Due to the assumption of the global state, I suggest using Markov games [a] as the multi-agent framework.

a. Littman, Michael L. "Markov games as a framework for multi-agent reinforcement learning." Machine learning proceedings 1994. Morgan Kaufmann, 1994. 157-163.

Comment

The relevant work of CTDE is incomplete and lacks recent work, such as HASAC[a] and MAT[b].

Thank you for providing more related works. HASAC combines the heterogeneous-agent decomposition with the entropy regularization of SAC. MAT introduces the Transformer and sequential modeling into the heterogeneous-agent decomposition. We will add these works in the revision.

Assuming global information might influence the impact of this work. Due to the assumption of the global state, I suggest using Markov games [a] as the multi-agent framework.

Thank you for your advice. We have a more detailed discussion about the global state assumption in Appendix F.6. In conclusion, it has been proven that the problem under the partially observable framework may be too difficult to yield any useful analysis. Markov games may be a good framework for sidestepping the global-state issue, but then a gap arises between Markov games and existing partially observable benchmarks (an opinion raised by previous reviewers). We will follow your advice and use the Markov game framework in the revision, but we point out that changing the framework may make the setting easier to understand while not changing the essence of the issue.

While the experiment results appear promising, the contribution is slightly insufficient compared with existing work[c,d].

Mirror learning [c] is a very good paper on the convergence of single-agent RL algorithms and has provided us with a lot of inspiration on introducing a general distance or divergence constraint into the policy iteration. However, when extending its conclusions to fully decentralized learning, we find that its core results on monotonic improvement fail because of the other agents' influence. Therefore, we need to develop novel methods for the fully decentralized setting.

As for our contribution compared with DPO [d], we first propose a general framework of policy iteration, $f$-divergence policy optimization, in fully decentralized learning. We also provide a detailed analysis and discussion of $f$-divergence policy optimization and, based on this framework, propose an algorithm, TVPO, with a convergence guarantee in fully decentralized learning, which shows its potential. For the comparison between TVPO and DPO, we provide a detailed discussion in the Remark part of Section 4.2 and in Appendix F.5. In conclusion, TVPO and DPO are based on different objectives, and the approximation of TVPO is more accurate, which means TVPO can avoid more trivial solutions.

Why use different metrics for SMAC (win rate) and SMACv2 (return)?

As we mentioned in Section 5.2, these tasks are difficult for fully decentralized learning, so we also use the cumulative reward as the metric. All the algorithms can hardly win in SMAC-v2 tasks, so the win rates would not show any difference.

Comment

Thank you for your response! However, I still think there is no significant difference from related work.

Review (Rating: 3)

This paper proposes TVPO for cooperative Markov games, with the update rule for each agent
$$\pi^i_{t+1}=\arg\max_{\pi^i} \sum_{a_i} \pi^i(a_i \mid s)\,Q_i^{\pi_t}(s,a_i)-\omega\, D_{TV}\!\left(\pi^i(\cdot\mid s)\,\|\,\pi_t^i(\cdot\mid s)\right),$$
and shows that the algorithm can converge monotonically to the NE of the game. Moreover, TVPO with the adaptive $\beta$ from PPO shows superior empirical performance over previous algorithms.
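To make the described update concrete, here is a minimal sketch, not the authors' code, of a per-agent PPO-style surrogate with a total-variation penalty and an adaptive coefficient; the network architecture, the importance-ratio surrogate, the doubling/halving adaptation rule, and the target distance `d_target` are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AgentPolicy(nn.Module):
    """Independent (non-shared) categorical policy for one agent."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def dist(self, obs: torch.Tensor) -> torch.distributions.Categorical:
        return torch.distributions.Categorical(logits=self.net(obs))


def tv_distance(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    # D_TV(p || q) = 0.5 * sum_a |p(a) - q(a)|, averaged over the batch of states
    return 0.5 * (p - q).abs().sum(dim=-1).mean()


def tvpo_style_update(policy, old_policy, optimizer, obs, actions, advantages,
                      beta: float, d_target: float = 0.01) -> float:
    with torch.no_grad():
        old_probs = old_policy.dist(obs).probs
    probs = policy.dist(obs).probs
    idx = actions.unsqueeze(-1)
    ratio = probs.gather(1, idx).squeeze(-1) / old_probs.gather(1, idx).squeeze(-1)
    d_tv = tv_distance(probs, old_probs)
    # Penalized surrogate: maximize the advantage-weighted objective minus beta * D_TV
    loss = -(ratio * advantages).mean() + beta * d_tv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # PPO-style adaptive coefficient: tighten the penalty if the policy moved too far
    if d_tv.item() > 1.5 * d_target:
        beta *= 2.0
    elif d_tv.item() < d_target / 1.5:
        beta /= 2.0
    return beta
```

In this sketch, each agent runs the update on its own parameters without parameter sharing, matching the fully decentralized setting used in the experiments.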

Strengths

  • The empirical performance of TVPO is superior to previous SOTA
  • The writing is clear except for several typos (see weaknesses)
  • The proofs are easy to follow
  • Compared to previous algorithms, TVPO is easy to implement

Weaknesses

Comparison to Related Work

My major concern is that this paper seems to miss several pieces of relevant literature. For instance, [1] and [2] both proposed algorithms for independent learning in potential Markov games, which include the cooperative Markov games investigated in this paper. Further, [1] proposed a policy gradient algorithm and [2] proposed a policy iteration algorithm, both of which are highly relevant to this paper.

Moreover, the algorithm in [2] can also use the adaptive $\beta$ from PPO. Therefore, I am wondering whether TVPO would be superior to [2] when both use an adaptive $\beta$.

Writing

  • $i$ is a superscript for $\pi$ but a subscript for $V, Q$
  • The $M$ in Proposition 4.2 and Section 4.2 differs
  • Line 152: such as...

I would be happy to raise the score if the author can resolve the issues above.

[1] Leonardos, Stefanos, et al. "Global convergence of multi-agent policy gradient in markov potential games." arXiv preprint arXiv:2106.01969 (2021).

[2] Fox, Roy, et al. "Independent natural policy gradient always converges in markov potential games." International Conference on Artificial Intelligence and Statistics. PMLR, 2022.

Questions

  • Is the $V^*$ in Theorem 4.6 the stationary point instead of the value function corresponding to the optimal policy?
  • In the second line of Eq. (23), it seems it should be $\Rightarrow$ instead of $\Leftrightarrow$, because $f$ is convex rather than strongly convex.
Comment

My major concern is that this paper seems to miss several relevant literature. For instance, [1], [2] both proposed algorithms for independent learning in potential Markov games, which include the cooperative Markov games investigated in this paper. Further, [1] proposed a policy gradient algorithm and [2] proposed a policy iteration algorithm, which is highly relevant to this paper.

Thank you for providing two more baseline algorithms. We are running additional experiments with these two algorithms and will update the empirical results as soon as possible.

Is the $V^*$ in Theorem 4.6 the stationary point instead of the value function corresponding to the optimal policy?

Yes, $V^*$ here represents the value function corresponding to the converged joint policy $\pi^*$, and $\pi^*$ is not guaranteed to be the optimal policy (it will be sub-optimal in most cases).

About the second line of Eq. (23)

Thank you for pointing out our lack of rigor. Here we can replace "$>$" with "$\ge$" in the conclusion and the proof, and then they will be correct. This change will not affect Corollary 4.3 or the counterexample we constructed. We will rewrite our statement in the revision.

Comment

We have updated the additional empirical results in Appendix G. For the comparison with the baselines IPG and INPG, we select three 10_vs_10 SMAC-v2 tasks. The empirical results are illustrated in Figure 13 in Appendix G. We find that IPG's performance is not stable and may drop as training progresses, compared with the other policy-based algorithms. We think the main reason is that IPG lacks a constraint on the step size of the policy iteration. We use the adaptive coefficient for INPG, and its performance is similar to DPO's, which is reasonable as their policy objectives are similar except for a square-root term.

Comment

Thank you for your response! Are you using the adaptive $\beta^i$ for INPG? Moreover, could you explain the difference between INPG and TVPO?

Comment

Yes, we use the same adaptive adjustment method for INPG as for TVPO. The difference between TVPO and INPG in implementation is that TVPO uses the total variation distance while INPG uses the KL divergence.
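For reference, the two penalties differ only in the divergence applied to the per-state action distributions $p = \pi^i(\cdot\mid s)$ and $q = \pi^i_t(\cdot\mid s)$ (standard definitions):

$$
D_{TV}(p \,\|\, q) = \tfrac{1}{2}\sum_{a}\bigl|p(a) - q(a)\bigr|, \qquad
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{a} p(a)\log\frac{p(a)}{q(a)}.
$$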

Comment

Thank you for your response! However, I will keep my score, since changing the distance from the KL divergence to the total variation distance seems incremental. Moreover, TVPO's practical improvement is adapted from PPO. Therefore, I think the results of this paper are mostly known, so I would suggest rejection.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.