PaperHub
6.3 / 10
Poster · 3 reviewers
Min 6 · Max 7 · Std 0.5
Ratings: 6, 6, 7
Confidence: 3.7
Correctness: 2.7
Contribution: 2.3
Presentation: 2.7
NeurIPS 2024

Scalable Constrained Policy Optimization for Safe Multi-agent Reinforcement Learning

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

We develop a novel scalable multi-agent constrained policy optimization method and prove that the safety constraints and the joint policy improvement can be met when each agent adopts a sequential update scheme to optimize a $\kappa$-hop policy.

Abstract

A challenging problem in seeking to bring multi-agent reinforcement learning (MARL) techniques into real-world applications, such as autonomous driving and drone swarms, is how to control multiple agents safely and cooperatively to accomplish tasks. Most existing safe MARL methods learn the centralized value function by introducing a global state to guide safety cooperation. However, the global coupling arising from agents’ safety constraints and the exponential growth of the state-action space size limit their applicability in instant communication or computing resource-constrained systems and larger multi-agent systems. In this paper, we develop a novel scalable and theoretically-justified multi-agent constrained policy optimization method. This method utilizes the rigorous bounds of the trust region method and the bounds of the truncated advantage function to provide a new local policy optimization objective for each agent. Also, we prove that the safety constraints and the joint policy improvement can be met when each agent adopts a sequential update scheme to optimize a $\kappa$-hop policy. Then, we propose a practical algorithm called Scalable MAPPO-Lagrangian (Scal-MAPPO-L). The proposed method’s effectiveness is verified on a collection of benchmark tasks, and the results support our theory that decentralized training with local interactions can still improve reward performance and satisfy safe constraints.
Keywords
Multi-agent reinforcement learning, policy optimization, safe learning, scalable method

Reviews and Discussion

Review (Rating: 6)

The paper studies the problem of constrained MARL in a cooperative setting and focuses on decentralized learning settings without global observability. The paper proposes a constrained policy optimization method and its practical version, Scal-MAPPO-L. Theoretical results are established for the dynamics/policy truncation and the trust-region subproblems. The effectiveness of the proposed method is validated through numerical experiments.

Strengths

The paper is well-organized and complete in structure.

  • In terms of theoretical contribution, the paper derives a monotone improvement property in the exact setting, i.e., when no parameterization is involved. The empirical algorithm, i.e., Scal-MAPPO-L, is proposed.

  • In terms of numerical experiments, the paper performs reasonable experiments to compare Scal-MAPPO-L with other PPO-family algorithms (with code provided). The experimental results demonstrate that Scal-MAPPO-L exhibits decent performance.

Weaknesses

My major concern about the paper is regarding the novelty of the paper given the existing literature [10]. The paper shares many similarities with [10] (Safe multi-agent reinforcement learning for multi-robot control), including presentation and theoretical results. Although I understand that [10] considers a centralized setting with full observation, and the setting considered here is decentralized, I am not sure about the technical contribution of the paper beyond leveraging the spatial decay of correlation assumption.

Also, the introduction of the spatial decay of correlation is somewhat abrupt. The authors do not seem to mention anything about the graph structure of the agents before introducing the assumption.

Questions

I mainly have the following questions.

  • Could the authors further clarify the technical contribution of the paper against [10]? What are the main difficulties met when extending the results in [10] to the current setting (after imposing the spatial decay of correlation assumption)?

  • Could the authors provide more justification for why the method proposed in the paper is better than the existing literature [11] and [12]? In the paper, the authors mention that [11] imposes an extra parameter-sharing constraint, which results in suboptimality. This does not sound convincing enough to me, because the method developed in this paper is also "not optimal". For the discussion about [12], I failed to understand the idea conveyed by the authors, and hope the authors could elaborate further.

  • The empirical algorithm Scal-MAPPO-L does not have any theoretical guarantee. I wonder whether it is possible to say anything about its performance guarantee (maybe under additional assumptions)?

Limitations

The authors have discussed the limitations in the appendix, which sounds reasonable to me.

Author Response

We sincerely appreciate the valuable comments from the reviewer. We hope our responses below provide further clarity.

Remark: Without further specification, we use "[number]" to refer to the corresponding reference in our paper.

W1: My major concern …… correlation assumption.

A: We re-clarify our technical contributions in the first point of our General Response and provide a more concrete account of the technical contributions of the paper relative to [1] as follows:

  • We quantify the maximum information loss caused by the advantage truncation in Proposition 3.3 by extending the theoretical results on the truncated Q-function in [2-3], and further derive the bound on the surrogate return in Corollary 3.4 under the spatial correlation decay assumptions (their proofs are reported in Appendices C.3-C.4).

  • We provide a new local policy optimization objective for each agent by integrating the rigorous bounds of the trust region method and the bounds of the truncated advantage function (refer to Proposition 3.5 and its proof in Appendix C.5). In addition, based on the upper-bound version of the trust region method, we obtain the upper bound of the safety constraints (refer to Corollary 3.6 and its proof in Appendix C.6).

  • We develop a novel scalable multi-agent constrained policy optimization method that guarantees both satisfaction of safety constraints and monotonic performance improvement in Theorem 3.7 with a sequential update scheme (its proof is reported in Appendix C.7).

W2: Also, the introduction …… introducing the assumption.

A: Thank you for the valuable suggestions. We will introduce the graph structure and reorganize some of the symbols in the new version. A preliminary modification can be seen in the third point of our General Response.

Q1: Could the author …… correlation assumption).

A: We re-clarify our technical contributions and main technical difficulties in the first point of our General Response and provide a more concrete account of the technical contributions of the paper relative to [1] in the answer to W1 above.

Q2: Could the author provide …… elaborate further.

A: We would like to provide comprehensive justifications for the comparison with the existing literature [2] and [3] as follows.

Against [2]: Safe Dec-PG [2] tackles distributed safe reinforcement learning problems, which implies the absence of a central controller to coordinate the agents. Both the rewards and constraints are only locally/privately known to each agent. Specifically, they decouple common reward functions and joint actions through a communication network to share information with neighboring agents. However, it is worth noting that their approach still assumes each agent can access the global state (which we do not have access to) and requires that the actions of all neighboring agents on that network be available (whereas we employ sequential updating).

Against [3]: The work in [3] also proposes a scalable safe MARL approach based on the spatial decay assumption on the environment dynamics. It updates the policies of agents by truncated gradient estimators, which depend on the local states and actions of the $\kappa$-hop neighboring agents. Despite this, the problem of non-stationarity within local interactions remains acute. In contrast, we adopt the multi-agent advantage decomposition and the sequential policy update scheme from [1][4] when updating local policies. Specifically, the policy update of an agent depends only on the actions of the previously updated agents in the sequence, rather than on the actions of all $\kappa$-hop neighboring agents.
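For concreteness, the multi-agent advantage decomposition referred to here can be written as below. This is our rendering of the result stated in [1][4] in generic global-state notation; the paper's $\kappa$-hop variant replaces the global state with local information.

```latex
% Multi-agent advantage decomposition (as in [1][4]; our notation):
% for any ordering i_{1:m} of agents and joint policy \pi,
A^{i_{1:m}}_{\pi}\!\left(s,\, a^{i_{1:m}}\right)
  \;=\; \sum_{j=1}^{m} A^{i_j}_{\pi}\!\left(s,\, a^{i_{1:j-1}},\, a^{i_j}\right),
% so the update of agent i_j needs only the actions already chosen by the
% preceding agents i_{1:j-1}, not the joint action of all kappa-hop neighbours.
```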

Q3: The empirical algorithm …… under additional assumptions)?

A: Thank you for the valuable comments. We admit that some approximations of the surrogate objective are employed in the practical algorithm, as clarified in line 224 of Section 3.3. Most of these approximations are standard practice in RL.

In the actual execution, Scal-MAPPO-L may not rigorously maintain the theoretical guarantees in Theorem 3.7, which is mainly due to several reasons:

  • Uncertainty in neural networks: Neural networks are inherently uncertain, and extracting useful information from many messages may lead to lower performance, especially for algorithms with rich information (the observations of agents).

  • A form of expected KL-divergence constraint: This approach is commonly used in RL to avoid computing the KL-divergence at every state. However, it introduces sampling errors; fortunately, the sampling errors are computable and controllable. (A minimal sketch of this sample-based estimate is given after this list.)
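To make the second point concrete, here is a minimal numpy sketch (ours, not the authors' code) of the sample-based expected-KL estimate that PPO/TRPO-style implementations typically substitute for a per-state KL bound. The diagonal-Gaussian policies, batch size, and perturbation below are illustrative assumptions.

```python
import numpy as np

def diag_gauss_kl(mu_p, std_p, mu_q, std_q):
    """Closed-form KL(p || q) for diagonal Gaussians, summed over action dims."""
    return np.sum(
        np.log(std_q / std_p)
        + (std_p**2 + (mu_p - mu_q) ** 2) / (2.0 * std_q**2)
        - 0.5,
        axis=-1,
    )

rng = np.random.default_rng(0)
batch, act_dim = 256, 6  # hypothetical batch of visited states
mu_old = rng.normal(size=(batch, act_dim))
std_old = np.full((batch, act_dim), 0.5)
mu_new = mu_old + 0.05 * rng.normal(size=(batch, act_dim))
std_new = std_old * 1.02

# average the per-state KL over the sampled batch instead of bounding it everywhere
expected_kl = diag_gauss_kl(mu_old, std_old, mu_new, std_new).mean()
print(f"sample-based expected KL: {expected_kl:.4f}")  # compared against a trust-region radius
```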

We will consider how to solve the optimization problem in Theorem 3.7 more precisely and try to provide theoretical guarantees in future work.

References:

[1] Shangding Gu, Jakub Grudzien Kuba, Yuanpei Chen, Yali Du, Long Yang, Alois Knoll, and Yaodong Yang. Safe multi-agent reinforcement learning for multi-robot control. Artificial Intelligence, 319:103905, 2023.

[2] Songtao Lu, Kaiqing Zhang, Tianyi Chen, Tamer Başar, and Lior Horesh. Decentralized policy gradient descent ascent for safe multi-agent reinforcement learning. In AAAI, 2021.

[3] Donghao Ying, Yunkai Zhang, Yuhao Ding, Alec Koppel, and Javad Lavaei. Scalable primal-dual actor-critic method for safe multi-agent RL with general utilities. In NeurIPS, 2023.

[4] Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang. Trust region policy optimization in multi-agent reinforcement learning. In ICLR, 2021.

Comment

I thank the author for the detailed justifications. I am happy to increase my score to 6 : )

Comment

Thank you very much for appreciating our work and raising the score! Your invaluable advice plays a pivotal role in guiding our efforts. We will meticulously revise our paper based on your suggestions and diligently strive to produce an even better version. Thank you once again for your invaluable support!

Review (Rating: 6)

The work proposes a scalable version of MAPPO-L for constrained policy optimization, taking into account that decay in the inter-dependence between agents in a Markov game translates into bounded errors while limiting the information sharing between agents. The theoretical results transfer nicely from the two starting frameworks, and the authors propose some empirical corroboration.

Strengths

originality: The overall combination of known results from scalable algorithms and MAPPO-L might not be that original. Having said that, it is not trivial either, so while not surprising, I would say it is not negligible, assuming that the authors fairly address the similarities in the related works (see weaknesses).

quality: the work reaches high quality in the analysis, and the empirical corroboration is mostly convincing, apart from some doubts to be clarified (see Questions)

clarity: the work is overall well-written, and the concepts are generally well-described and help to understand the contributions.

significance: the work addresses an important problem, that is how to build a more scalable version of algorithms from SOTA MARL algorithms.

Weaknesses

Major:

  • In the related works section, I would expect an extensive and fair discussion of the difference between this work and works on scalable MARL and MAPPO-L. The work has value in being a combination of the two, but this needs to be clarified. For example, more comments on the differences/novelties in the proofs and results from works cited as [10], [15], [18], [47].

  • The experimental corroboration does not provide any useful info about the difference in computational/communication burden introduced by Scalable-MAPPO-L compared to MAPPO-L. Additionally, from Figure 2, the fact that using k=1 (almost IPPO-L then) is more than enough to solve the problem would rather suggest that the tasks themselves are easy.

Minor:

  • Section 2.2: I would strongly suggest introducing the notion of agent graph somewhere, otherwise $S_{N_i^k}$ and the notion of distance between agents might be hard to digest from scratch.

  • Eq. 4/5: I would suggest distinguishing the constants $\beta_1$ and $\beta_2$ (in case they are not different).

  • Eq. 10: I believe $A^j_\pi$ should be $A^h_\pi$.

  • Eq. 13: I would specify over which class of policies the arg-max is taken.

  • Theorem 3.7: I would suggest refactoring the constants, they are hard to read.

Questions

Major:

  • Can the authors discuss the validity of the assumptions? When would it be the case that $\zeta \in [0, 2/\gamma]$? How would the condition on the policies be enforced while learning? Does it mean that the policies learned in the experiments are of the form in line 522 in the Appendix? Are the assumptions valid in the instances of MAMUJOCO taken into account?

  • What do lines 259/260 stand for? what information in the environment do the authors refer to?

  • Code: would the authors provide a repository for the code? otherwise, the claims about the reproducibility of the work in the checklist are not very satisfied

  • Figure 1: with which value of k were the experiments done? Line 248 is not fully clear to me. Would the authors explain further? The performance of MAPPO-L does not match the performance of the original paper; why is that?

  • Figure 2: Why is there a difference in performance between the two ManyAgent 6x1 experiments in Figures 1 and 2? The performance of MAPPO-L does not match the performance of the original paper; why is that?

  • What is the meaning of the claims in lines 327, 328 about the differences with [15], [47], [12]?

Minor:

  • What does $d^i_u$ in equation (17) stand for?

Limitations

yes

Author Response

We sincerely appreciate the valuable comments from the reviewer. We hope our responses below provide further clarity.

Remark: Without further specification, we use "[number]" to refer to the corresponding reference in our paper.

W1(Major): In the related …… cited as …….

A: We re-clarify our technical contributions in the first point of our General Response and would like to provide a fair discussion with the existing literature, covering all relevant aspects and nuances, as follows.

Qu et al. [1] introduced the spatial correlation decay property into the field of MARL and established a series of fundamental results [2-4], which broadened the research avenues of scalable MARL. However, to the best of our knowledge, all of these studies [1-4] mainly focus on (natural) policy gradient methods with average rewards or general utilities and have not yet been combined with trust region methods, which rigorously enable RL agents to learn monotonically improving policies. Furthermore, only the recent work [4] considers both safety and scalability for MARL. Our results build upon the scalable MARL family of works [1-4] and the PPO-based (TRPO-based) MARL family of works [5-6].

The main differences between this work and previous works are as follows:

  • Compared to the scalable MARL family of works [1-4], we integrate the bounds of the trust region method with the bounds of the truncated advantage function and introduce the multi-agent advantage decomposition and the sequential policy update scheme [5-6]. Each agent's policy update only depends on the actions of the previously updated agents rather than on the actions of all $\kappa$-hop neighboring agents.

  • Compared to the TRPO-based MARL family [5-6], we focus on decentralized learning settings and develop a novel scalable and theoretically-justified multi-agent constrained policy optimization method. This method utilizes the rigorous bounds of the trust region method and the bounds of the truncated advantage function to provide a new local policy optimization objective for each agent. (A toy sketch of such a sequential constrained update is given after this list.)
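As referenced above, the following toy Python sketch illustrates how a sequential, per-agent update with a Lagrangian relaxation of the cost constraints (in the spirit of Scal-MAPPO-L) might be organized. It is our illustration only: the multipliers, cost limit, learning rate, and random placeholder estimates stand in for quantities that a real implementation would compute from $\kappa$-hop rollouts and parameterized policies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents = 4
lambdas = np.ones(n_agents)   # per-agent Lagrange multipliers (illustrative)
cost_limit = 1.0              # per-agent cost budget (illustrative)
lr_lambda = 0.05

def local_estimates(agent, already_updated):
    """Placeholder for per-agent estimates computed from kappa-hop data:
    the surrogate reward advantage (conditioned on the actions of the agents
    updated earlier in the sequence) and the agent's expected episodic cost."""
    return rng.normal(0.1, 0.05), rng.normal(1.1, 0.2)

for it in range(3):
    updated = []                              # sequential update scheme
    total_surrogate = 0.0
    for agent in rng.permutation(n_agents):   # one possible update order
        adv_r, exp_cost = local_estimates(agent, updated)
        # Lagrangian surrogate that the local trust-region/PPO step would maximize
        total_surrogate += adv_r - lambdas[agent] * (exp_cost - cost_limit)
        # dual ascent on the multiplier, projected to stay non-negative
        lambdas[agent] = max(0.0, lambdas[agent] + lr_lambda * (exp_cost - cost_limit))
        updated.append(agent)
    print(f"iteration {it}: surrogate {total_surrogate:.3f}, multipliers {np.round(lambdas, 3)}")
```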

W2(Major): The experimental …… are easy.

A: We provide information about the computational complexity in Appendix D.2 and add new experimental results with more agents and more training steps in the fourth point of our General Response.

It is worth noting that the wall-clock times do not decrease significantly as $\kappa$ gradually decreases. This is because we have not yet modeled the process of sending and receiving information realistically. However, based on the successful research conducted in the field of communication [7-8], it is evident that algorithms requiring less communication undoubtedly have an advantage in terms of reducing the communication burden and enhancing applicability.

Regarding points 1-5 in the minor weaknesses.

A: Thank you for the valuable suggestions. In the next version, we will introduce the graph structure as described in the third point of our General Response and reorganize some of the symbols.

Q1(Major): Can the authors discuss …… taken into account?

A: We provide further discussion on the assumptions in the second point of our General Response.

Q2(Major): What do lines 259/260 …… refer to?

A: We are really sorry for any confusion caused by not explaining things clearly. We rephrase it as follows: Figure 2 shows the performance of Scal-MAPPO-L in different environments with varying values of $\kappa$, where MAPPO-L accesses the global state. We have noticed that the algorithm's performance is consistently the lowest, and the cost is nearly the highest, when $\kappa=1$.

Q3(Major): Code: would …… not very satisfied.

A: We have submitted our code in the supplementary material.

Q4(Major): Figure 1: with which …… why that?

A: In Figure 1, we set that each agent in Scal-MAPPO-L can access the state of about half of the agents. Specifically, $\kappa = 1$ in Safe ManyAgent Ant ($2 \times 3$), $\kappa = 2$ in Safe ManyAgent Ant ($3 \times 2$), and $\kappa = 3$ in Safe ManyAgent Ant ($6 \times 1$).

Furthermore, the discrepancy in performance between MAPPO-L and the original paper is attributed to the global state, which in our code is a combination of each agent's ID and the $\kappa$-hop information rather than a long state vector. We elaborate on this in Appendix D.

Q5(Major): Figure 2: Why is …… why that?

A: The results in Figure 1 were obtained from a different server with a single A100 GPU. Unfortunately, we missed the slight difference between them. In the next version, we are committed to presenting the results on the same computer for consistency.

The performance of MAPPO-L does not match the original paper for the same reason as the answer to Q4(Major).

Q6(Major): What is the meaning …… differences with ……?

A: An elaboration on the comparison with the existing literature [1-3] is provided in the answer to W1(Major).

Q1(Minor): What does $d^i_u$ in equation (17) stand for?

A: $d^i_u$ in equation (17) stands for the cost-constraining value.

References:

[1] Scalable reinforcement learning of localized policies for multi-agent networked systems. arXiv preprint arXiv:1912.02906, 2019.

[2] Scalable multi-agent reinforcement learning for networked systems with average reward. In NeurIPS, 2020.

[3] Multi-agent reinforcement learning in stochastic networked systems. In NeurIPS, 2021.

[4] Scalable primal-dual actor-critic method for safe multi-agent RL with general utilities. In NeurIPS, 2023.

[5] Trust region policy optimization in multi-agent reinforcement learning. In ICLR, 2021.

[6] Safe multi-agent reinforcement learning for multi-robot control. Artificial Intelligence, 319:103905, 2023.

[7] A survey of multi-agent reinforcement learning with communication. arXiv preprint arXiv:2203.08975, 2022.

[8] Learning structured communication for multi-agent reinforcement learning. Autonomous Agents and Multi-Agent Systems, 36(2), p.50, 2022.

Comment

I would like to thank the authors for taking the time to extensively answer the raised doubts. Provided that the suggested modifications are included in the revised version, I am more than positive about increasing the score.

Comment

Thank you very much for appreciating our work and the generous boost to our score! Your invaluable advice plays a pivotal role in our efforts to improve the quality of our paper. We are committed to presenting a better version. Thank you once again for your invaluable support!

Review (Rating: 7)

The paper proposes a scalable multi-agent constrained policy optimization method for safe reinforcement learning. It is an extension of two previous lines of work on safe reinforcement learning and scalable multi-agent reinforcement learning. Trust region policy updates and truncated policy/advantage functions are combined to give a theoretical performance bound. A practical algorithm based on PPO is also shown, and the empirical results verify the claimed performance.

Strengths

  1. The presentation is clear and easy to understand, the authors presented clearly the relation between this paper and the previous work.
  2. It is a very solid combination of the ideas in previous works, and extends the scalable multi-agent reinforcement learning idea to multi-agent safe reinforcement learning. It will be beneficial to the safe RL community.
  3. The theoretical results look correct to me.

Weaknesses

There are only minor weaknesses in this paper.

  1. The authors should discuss more clearly how the assumptions are related to previous works, for example, the Dobrushin conditions in (1) and Assumption 2.1. They both appear in the previous scalable multi-agent reinforcement learning paper [1] but in slightly different formulations.

  2. The experiments are only on a smaller number of agents (for 12 agents, we don't really need a scalable algorithm to handle it).

[1] Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li. Scalable multi-agent reinforcement learning for networked systems with average reward. In NeurIPS, 2020.

Questions

The theoretical results look okay to me, although I didn't check all the proofs. For the experimental results:

  1. For figure 1 and 2, what is the constraint on the average episodic cost? It will be more clear to draw a horizontal line showing that.
  2. The performance is still increasing at 1e7 steps, the algorithms might not converge. Can you explain a bit?

Limitations

The authors addressed the limitation well.

Author Response

We sincerely appreciate the valuable comments from the reviewer. We hope our responses below provide further clarity.

W1: The author should …… different formulations.

W2: The experiments …… algorithm to handle it).

Q2: The performance …… a bit?

A: We thank the reviewer for appreciating our work and kindly refer the reviewer to our General Response, where we provide the discussion for the validity and applicability of the assumptions about spatial correlation decay and add new experimental results with more agents and more training steps.

Remark: The safe MARL problem has received much attention from researchers in recent years. Unfortunately, benchmark environments are still underdeveloped. To the best of our knowledge, Safe Multi-Agent MuJoCo [1] is a popular safe MARL benchmarking environment. In addition, [2] designs an access control task with safety constraints under wireless communication, which has 25 agents, but does not provide the experimental code. In the next version, we will strive to reproduce this wireless communication environment and provide the results of our experiments.

Q1: For figure 1 and 2, …… showing that.

A: Figures 1 and 2 show the experimental results on several safe tasks in the Safe MAMUJOCO environment, which preserves the agents, physics simulator, background environment, and reward function, and adds obstacles such as walls or pitfalls. Furthermore, the environment emits a cost [1] that increases with the risk of an agent stumbling upon an obstacle.

The “average episode cost” represents the average cost per episode in a batch. We will fix this in the new version.

References:

[1] Shangding Gu, Jakub Grudzien Kuba, Yuanpei Chen, Yali Du, Long Yang, Alois Knoll, and Yaodong Yang. Safe multi-agent reinforcement learning for multi-robot control. Artificial Intelligence, 319:103905, 2023.

[2] Donghao Ying, Yunkai Zhang, Yuhao Ding, Alec Koppel, and Javad Lavaei. Scalable primal-dual actor-critic method for safe multi-agent RL with general utilities. In NeurIPS, 2023.

Comment

Thanks for the response. I now have a better understanding of the paper, and I agree that safe multi-agent RL is still underdeveloped and that the paper will be beneficial to the community. I will increase my score to 7.

Comment

Thank you very much for appreciating our work and raising the score! Your invaluable advice plays a pivotal role in our efforts to improve the quality of our paper. Thank you once again for your invaluable support!

Author Response

General Response

We would like to express our sincere gratitude to the reviewers for reading our paper and providing valuable feedback. Below, we answer some common questions raised by the reviewers, including the technical contributions, the assumptions about spatial correlation decay, the graph structure, and the new experimental results. Please find our responses to other questions in the personalized rebuttals.

Remark: Without further specification, we use "[number]" to refer to the corresponding reference in our paper.

Regarding the technical contribution of the paper.

A: Our theoretical results build upon the scalable MARL family of works [1-3] and PPO-based (TRPO-based) MARL family of works [4-5]. Their solid and complete theoretical analyses provide a good research foundation for our work. Here, we would like to re-clarify our technical contributions.

The main technical contributions are as follows:

  • First, we quantify the maximum information loss regarding the advantage truncation based on two assumptions about the transition dynamics and policies.

  • Then, by integrating the rigorous bounds of the trust region method and the truncated advantage function, we provide a new local policy optimization objective for each agent.

  • Furthermore, we develop a novel scalable multi-agent constrained policy optimization method and prove that the safety constraints and joint policy improvement can be guaranteed.

  • In addition, we parameterize each agent’s policy and propose a practical algorithm called Scalable MAPPO-Lagrangian (Scal-MAPPO-L).

The main technical difficulties are as follows:

  • How to quantify the information loss regarding the advantage truncation? (refer to Proposition 3.3 and its proof)

  • How to ensure the local policy updates are not overly conservative? (refer to Proposition 3.5 and Corollary 3.6 and their proofs)

  • How to prove that the proposed method can consistently improve rewards and adhere to safety constraints at every iteration? (refer to Theorem 3.7 and its proof)

Regarding Assumption 2.1 and Assumption 2.2.

A: We would like to provide further discussion on the assumptions about spatial correlation decay as follows:

  • The parameter $W^{ij}$ in the Dobrushin condition [6] reflects the extent to which the local transition probability of agent $i$ is affected by the state and action of agent $j$. Assumption 2.1 amounts to requiring that $W^{ij}$ decreases exponentially with the distance between any two agents $i$ and $j$, which has been used in previous works [1-3] on scalable MARL. This paper does not make it more stringent. (A schematic form of such a decay condition is sketched after this list.)

  • From a theoretical perspective, our approach can be considered for most safe MARL tasks, especially when there is a performance gap between independent learning and centralized learning.

  • We provide a mathematical example to illustrate the relationship between the two assumptions in Appendix B.2. It is evident from this mathematical example that Assumption 2.2 necessarily holds when Assumption 2.1 holds and the parameters $\xi$ and $\beta$ satisfy certain conditions. However, in order to maintain a concise presentation, we introduce Assumption 2.2.

  • When Assumption 2.1 holds, the numerical example in line 522 in the appendix can provide a reference basis for selecting the values of the parameters in Assumption 2.2. However, accurately determining the spatial decay of correlation for the dynamics remains a challenging engineering task. In this paper, we empirically adopt conservative values.
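As mentioned in the first point above, a typical way to write such an exponential spatial-decay (Dobrushin-type) condition is sketched below. The exact constants and norm used in the paper's Assumption 2.1 may differ, so this is a schematic form only, with $d_{\mathcal{G}}(i,j)$ denoting graph distance.

```latex
% Schematic exponential spatial-decay condition (not the paper's exact statement):
W^{ij} \;\le\; \xi\, \beta^{\, d_{\mathcal{G}}(i,j)},
\qquad \xi > 0,\ \beta \in (0,1),
% where W^{ij} bounds how strongly agent j's state-action pair can influence
% agent i's local transition probabilities, so the influence of far-away agents
% vanishes geometrically and truncating to kappa-hop neighbours loses little.
```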

Regarding the graph structure under a network of agents.

A: We sincerely appreciate the suggestion from the reviewers to introduce the graph structure for networked multi-agent systems, and we accept it to make the paper read more smoothly. Specifically, in the new version, we will introduce the agent graph structure in the introduction section and redescribe the safe MARL problem in Section 2.1 as follows:

Consider a safe MARL problem subject to multiple constraints, where the agents are associated with an underlying undirected graph $\mathcal{G} = (\mathcal{N}, \mathcal{E})$. Here, $\mathcal{N} = \{1, \ldots, n\}$ is the set of agents and $\mathcal{E} \subset \mathcal{N} \times \mathcal{N}$ is the set of edges. The problem can be formulated as a constrained Markov game, ……
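For illustration only (this helper is ours, not from the paper), the $\kappa$-hop neighborhood $N_i^\kappa$ on such a graph can be computed by a breadth-first search; an agent's truncated observation would then gather only the states of these neighbors.

```python
from collections import deque

def khop_neighbors(edges, i, kappa):
    """Return the set of agents within graph distance kappa of agent i."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            continue  # do not expand beyond kappa hops
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

# Example: a chain 0-1-2-3-4; the 2-hop neighborhood of agent 2 is {0, 1, 2, 3, 4}
print(khop_neighbors([(0, 1), (1, 2), (2, 3), (3, 4)], i=2, kappa=2))
```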

Regarding the new experimental results.

A: We ran new experiments to provide results with more agents (17 agents in the Safe Humanoid task) and more training steps ($1.5 \times 10^7$) and updated the results in the PDF. All results are averaged over two random seeds, and the curves are smoothed over time. We will continue our efforts to provide richer and more complete experimental results in the new version.

References:

[1] Guannan Qu, Yiheng Lin, Adam Wierman, and Na Li. Scalable multi-agent reinforcement learning for networked systems with average reward. In NeurIPS, 2020.

[2] Yiheng Lin, Guannan Qu, Longbo Huang, and Adam Wierman. Multi-agent reinforcement learning in stochastic networked systems. In NeurIPS, 2021.

[3] Donghao Ying, Yunkai Zhang, Yuhao Ding, Alec Koppel, and Javad Lavaei. Scalable primal-dual actor-critic method for safe multi-agent RL with general utilities. In NeurIPS, 2023.

[4] Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang. Trust region policy optimization in multi-agent reinforcement learning. In ICLR, 2021.

[5] Shangding Gu, Jakub Grudzien Kuba, Yuanpei Chen, Yali Du, Long Yang, Alois Knoll, and Yaodong Yang. Safe multi-agent reinforcement learning for multi-robot control. Artificial Intelligence, 319:103905, 2023.

[6] Amir Dembo and Andrea Montanari. Gibbs measures and phase transitions on sparse random graphs. arXiv preprint arXiv:0910.5460, 2009.

Final Decision

The paper addresses the important problem of a constrained MARL. The theoretical results rely heavily on the spatial correlation decay assumption, which simplifies analysis greatly but is a realistic approximation in multiple practical scenarios. The theoretical analysis is used to motivate a practical decentralized algorithm. Numerical experiments illustrate the proposed algorithm is able to satisfy constraints as well as the centralised solution, and can also achieve similar reward to the centralised solution when the agents can access the observation of enough number of neighbours.

One limitation in the experiments is the use of small networks (up to 12 agents) and the lack of details on the network topology. However, the authors provided a still small but more convincing experiment with up to 17 agents during the rebuttal period. The authors have committed to provide details on the network topology, and they should include the new experiment (with details on computation time for completeness) in the camera ready version.

Overall, the paper is clearly written, establishes a new baseline for constrained MARL, and can inspire future research.