Learning Multi-Agent Communication from Graph Modeling Perspective
Abstract
Reviews and Discussion
The paper introduces CommFormer, a novel approach for optimizing the communication architecture among multiple intelligent agents involved in collaborative tasks. By conceptualizing the architecture as a learnable graph and employing a bi-level optimization process with attention units, CommFormer enables agents to efficiently optimize their communication and adopt more coordinated strategies in a variety of scenarios, as demonstrated in experiments on StarCraft II combat games.
I have several comments and questions that need to be addressed before publication:
- What if the communication graph determined by your approach is not physically feasible, for instance due to environmental constraints such as a large physical distance? Isn't a graph communication approach that determines communication based on physical proximity better in such real-world scenarios? Maybe the best solution is a hybrid approach where environment constraints are considered and baked into the problem for determining the communication graph?
- I find the presented related work section to be weak and relatively old. Many recent SOTA graph-based multi-agent communication learning approaches are never mentioned or discussed, despite their high relevance to the proposed approach. For instance, [1]-[4] below are only a few of such works. Almost all of these works offer a distributed graph-based learned multi-agent communication method that works under POMDPs and is trained under CTDE. There are more such recent papers. I believe the authors need to perform a more comprehensive search of the recent literature.
[1] Seraj, Esmaeil, et al. "Learning efficient diverse communication for cooperative heterogeneous teaming." Proceedings of the 21st international conference on autonomous agents and multiagent systems. 2022.
[2] Niu, Yaru, Rohan R. Paleja, and Matthew C. Gombolay. "Multi-Agent Graph-Attention Communication and Teaming." AAMAS. 2021.
[3] Bettini, Matteo, Ajay Shankar, and Amanda Prorok. "Heterogeneous Multi-Robot Reinforcement Learning." Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems. 2023.
[4] Meneghetti, Douglas De Rizzo, and Reinaldo Augusto da Costa Bianchi. "Towards heterogeneous multi-agent reinforcement learning with graph neural networks." arXiv preprint arXiv:2009.13161 (2020).
- There are many existing recent, SOTA graph-based multi-agent communication learning approaches, see [1]-[4] above (which are not even mentioned in the paper), that could be competition for the proposed approach. The selected baselines do not necessarily specialize in graph-based distributed communication. The proposed learned communication graph approach should be experimented with and evaluated against other graph-based methods.
- All the evaluations are performed in SMAC domains. Is this approach specialized and designed for SMAC? If not, and the solution is in fact generalizable, other domains and different problem settings must be considered. Many such standard domains can be found in prior work. Although SMAC domains are interesting game scenarios, the point is to have a comparable baseline performance in standard domains that can also cover other multi-agent coordination and collaboration problems, social interactions, etc.
- Related to the point above, if the presented approach does not apply to other multi-agent problems and scenarios, this should be mentioned and discussed as a limitation. Otherwise, presenting results in only one domain does not suffice.
- The second contribution bullet point mentions the use of attention units for allocating credit to received messages. Doesn't TarMAC already do that?
- What are the limitations of the approach? The limitations are never discussed.
At the current state I vote for weak rejection: the algorithm seems to be sound and working, but there are some notable weaknesses in the literature review and benchmarking (methods and domains) that need to be addressed as much as possible. I'd be happy to increase my score once the authors have satisfactorily addressed my comments and questions.
Strengths
See above.
Weaknesses
See above.
Questions
See above.
Q1 & Q2
The proposed learned communication graph approach should be experimented with and evaluated against other graph-based methods.
Is this approach specialized and designed for SMAC?
Thanks for the suggestion! We have incorporated additional experiments to demonstrate the generalization of our method. Taking into account the communication domains explored in previous works, we have included the following experiments. It is worth noting that in certain domains our objective extends beyond maximizing the average success rate or cumulative reward: we also aim to minimize the average number of steps required to complete an episode, emphasizing the ability to achieve goals in the shortest possible time.
- Three additional maps in the SMAC environment: 1o10b_vs_1r and 1o2r_vs_4r, which pose challenges due to partial observability, and 5z_vs_1ul, where successful outcomes require strong coordination.
- Predator-Prey (PP) [1]: The goal is for N predator agents with limited vision to find a stationary prey and move to its location. The agents in this domain all belong to the same class (i.e., identical state, observation, and action spaces).
- Predator-Capture-Prey (PCP) [2]: There are two classes of agents: predator and capture. Agents of the predator class have the goal of finding the prey with limited vision (similar to agents in PP). Agents of the capture class have the goal of locating the prey and capturing it, with an additional capture-prey action in their action space, while not having any observation inputs (e.g., lacking scanning sensors).
- Google Research Football (GRF) [11]: We evaluate algorithms in the football academy scenario 3 vs. 2, with 3 attackers vs. 1 defender and 1 goalie. The three attacking agents are controlled by the MARL algorithm, and the two defending agents are controlled by a built-in AI. We find that the 3 vs. 2 scenario challenges the robustness of MARL algorithms to stochasticity and sparse rewards.
We include several state-of-the-art graph-based multi-agent communication learning approaches as additional baselines in our evaluation: QGNN [7], SMS [3], TarMAC [4], NDQ [5], MAGIC [6], HetNet [2], CommNet [8], IC3Net [9], and GA-Comm [10].
The performance results of these baselines are presented below. Note that, due to time constraints, we take the performance figures directly from the respective papers. Our CommFormer consistently demonstrates favorable performance across the evaluated metrics.
| Task | Metric | CommFormer(0.4) | QGNN | SMS | TarMAC | NDQ | MAGIC | QMIX |
|---|---|---|---|---|---|---|---|---|
| 1o2r_vs_4r | Success Rate | 96.9 ± 1.5 | 93.8 ± 2.6 | 76.4 | 39.1 | 77.1 | 22.3 | 51.1 |
| 1o10b_vs_1r | Success Rate | 96.9 ± 3.1 | 98.0 ± 2.9 | 86.0 | 40.1 | 78.1 | 5.8 | 51.4 |
| 5z_vs_1ul | Success Rate | 100.0 ± 1.4 | 92.2 ± 1.6 | 59.9 | 44.2 | 48.9 | 0.0 | 82.6 |
| Task | Metric | CommFormer(0.4) | MAGIC | CommNet | IC3Net | TarMAC | GA-Comm |
|---|---|---|---|---|---|---|---|
| GRF | Success Rate | 100.0 ± 0.0 | 98.2 ± 1.0 | 59.2 ± 13.7 | 70.0 ± 9.8 | 73.5 ± 8.3 | 88.8 ± 3.9 |
| GRF | Steps Taken | 25.4 ± 0.4 | 34.3 ± 1.3 | 39.3 ± 2.4 | 40.4 ± 1.2 | 41.5 ± 2.8 | 39.1 ± 3.1 |
| Task | Metric | CommFormer(0.4) | MAGIC | HetNet | CommNet | IC3Net | TarMAC |
|---|---|---|---|---|---|---|---|
| PP | Average Cumulative Reward | -0.121 ± 0.008 | -0.386 ± 0.024 | -0.232 ± 0.010 | -0.336 ± 0.012 | -0.342 ± 0.015 | -0.563 ± 0.030 |
| PP | Steps Taken | 4.99 ± 0.31 | 10.6 ± 0.50 | 8.30 ± 0.25 | 8.97 ± 0.25 | 9.69 ± 0.26 | 18.4 ± 0.46 |
| Task | Metric | CommFormer(0.4) | MAGIC | HetNet | CommNet | IC3Net | TarMAC |
|---|---|---|---|---|---|---|---|
| PCP | Average Cumulative Reward | -0.197 ± 0.019 | -0.394 ± 0.017 | -0.364 ± 0.017 | -0.394 ± 0.019 | -0.411 ± 0.019 | -0.548 ± 0.031 |
| PCP | Steps Taken | 7.61 ± 0.66 | 10.8 ± 0.45 | 9.98 ± 0.36 | 11.3 ± 0.34 | 11.5 ± 0.37 | 17.0 ± 0.80 |
Q3
What if the communication graph determined by your approach is not physically feasible, for instance due to environmental constraints such as a large physical distance? Isn't a graph communication approach that determines communication based on physical proximity better in such real-world scenarios? Maybe the best solution is a hybrid approach where environment constraints are considered and baked into the problem for determining the communication graph?
Thanks for the valuable comment.
A possible application of this study is to create an efficient communication framework tailored for enclosed, finite environments, typical of logistics warehouses. In these settings, agent movement is limited to designated zones, and communication is facilitated through overhead wires, akin to a trolleybus system.
In contrast, open environments present unique challenges, primarily due to the potentially vast distances between agents, which necessitate wireless links and may hinder effective communication. To address this, a straightforward approach could be to add bidirectional edges between agents whenever they come within close proximity, enabling communication between them [2]. However, a more effective solution may involve a hybrid approach that considers the constraint on the available bandwidth: initially segmenting agents into groups based on proximity, followed by an internal search for an optimal communication graph within each group; a sketch of this idea is given below. If agent distances vary dynamically during testing, this process is repeated as necessary to adjust the communication graph in real time, ensuring continuous adaptability to changing environmental conditions.
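To make the hybrid idea concrete, here is a minimal sketch (our illustration, not part of the paper; the function names, 2-D positions, and `comm_range` threshold are all assumptions) of segmenting agents by proximity and restricting the candidate edges available to the graph search:

```python
import numpy as np

def proximity_groups(positions, comm_range):
    """Split agents into connected components of the proximity graph."""
    n = len(positions)
    dist = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    linked = dist <= comm_range          # which pairs can physically communicate
    groups, assigned = [], set()
    for i in range(n):
        if i in assigned:
            continue
        group, frontier = set(), [i]
        while frontier:                  # flood fill over feasible links
            j = frontier.pop()
            if j in group:
                continue
            group.add(j)
            frontier.extend(k for k in range(n) if linked[j, k] and k not in group)
        assigned |= group
        groups.append(sorted(group))
    return groups

def candidate_edge_mask(positions, comm_range):
    """Boolean mask of physically feasible directed edges; the learned graph
    search is then confined to within-group pairs."""
    n = len(positions)
    mask = np.zeros((n, n), dtype=bool)
    for group in proximity_groups(positions, comm_range):
        for i in group:
            for j in group:
                mask[i, j] = (i != j)
    return mask

# Two spatial clusters yield two independent search spaces.
pos = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
print(proximity_groups(pos, comm_range=2.0))           # [[0, 1], [2, 3]]
print(candidate_edge_mask(pos, comm_range=2.0).sum())  # 4 feasible edges
```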
Q4
The second contribution bullet point mentions the use of attention units for allocating credit to received messages. Doesn't TarMAC already do that?
In our framework, two attention units are implemented within the encoder and decoder blocks. In the encoder block, the attention unit is tasked with allocating credit to observations received from other agents. This mechanism is somewhat akin to the approach used in TarMAC, which employs targeted, multi-round communication.
Conversely, within the decoder block, we introduce a specific constraint. This constraint limits attention computations to interactions between agent $i$ and its preceding agents $j$, where $j < i$. Such a restriction upholds a sequential update scheme, crucial for the decoder to generate the action sequence in an auto-regressive manner, which ensures a monotonic improvement in performance throughout the training period [12].
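As an illustration of this constraint, here is a single-head sketch (assumed shapes and function names; not the paper's implementation) of masking attention so that agent $i$ only attends to agents $j \le i$:

```python
import torch

def preceding_agent_mask(n_agents: int) -> torch.Tensor:
    # Lower-triangular mask: agent i attends to agents j <= i (the diagonal
    # keeps self-attention; strictly preceding agents are j < i).
    return torch.tril(torch.ones(n_agents, n_agents, dtype=torch.bool))

def masked_attention(q, k, v, mask):
    # q, k, v: (n_agents, d) per-agent queries, keys, and values.
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

n, d = 4, 8
q, k, v = (torch.randn(n, d) for _ in range(3))
out = masked_attention(q, k, v, preceding_agent_mask(n))
# Row i of `out` depends only on agents 0..i, enabling auto-regressive decoding.
```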
Q5
What are the limitations of the approach? The limitations are never discussed.
Thanks for this suggestion! First, our approach is not suitable for deployment in open regions where the physical distance between agents may exceed the communication range. Additionally, our method may not generalize well to environments that necessitate dynamic communication patterns, such as situations where agents need to interact with different agents at different stages to accomplish a task. Furthermore, when dealing with a large number of agents, the active edges, determined by the sparsity coefficient, can still impose a physical burden; in such cases, it may be more appropriate to fix the number of edges directly rather than relying solely on a sparsity proportion. These considerations need to be taken into account for further improvement and generalization of our method.
Q6
There are more such recent papers. I believe the authors need to perform a more comprehensive search of the recent literature [2, 6, 14, 15].
Thanks! These methods primarily employ GNNs to encode pre-defined graphs, allowing the acquisition of efficient and diverse communication models that facilitate coordination within cooperative and heterogeneous teams. These approaches emphasize graph feature learning and heterogeneous policy learning, aiming to improve performance in those respects. In contrast, our CommFormer simultaneously learns the communication graph and the heterogeneous policy from an optimization perspective, seeking to identify and exploit the optimal communication graph for enhanced performance. We will improve the related work section in the updated version.
Reference
[1] Singh, Amanpreet, et al. "Learning when to communicate at scale in multiagent cooperative and competitive tasks." arXiv preprint arXiv:1812.09755 (2018).
[2] Seraj, Esmaeil, et al. "Learning Efficient Diverse Communication for Cooperative Heterogeneous Teaming." AAMAS 2022.
[3] Xue, Di, et al. "Efficient Multi-Agent Communication via Shapley Message Value." IJCAI 2022.
[4] Das, Abhishek, et al. "TarMAC: Targeted Multi-Agent Communication." ICML 2019.
[5] Wang, Tonghan, et al. "Learning Nearly Decomposable Value Functions via Communication Minimization." arXiv preprint (2019).
[6] Niu, Yaru, et al. "Multi-Agent Graph-Attention Communication and Teaming." AAMAS 2021.
[7] Kortvelesy, Ryan, and Amanda Prorok. "QGNN: Value Function Factorisation with Graph Neural Networks." arXiv preprint arXiv:2205.13005 (2022).
[8] Sukhbaatar, Sainbayar, et al. "Learning Multiagent Communication with Backpropagation." NeurIPS 2016.
[9] Singh, Amanpreet, et al. "Learning when to communicate at scale in multiagent cooperative and competitive tasks." arXiv preprint arXiv:1812.09755 (2018).
[10] Liu, Yong, et al. "Multi-Agent Game Abstraction via Graph Attention Neural Network." AAAI 2020.
[11] Kurach, Karol, et al. "Google Research Football: A Novel Reinforcement Learning Environment." AAAI 2020.
[12] Wen, Muning, et al. "Multi-Agent Reinforcement Learning is a Sequence Modeling Problem." NeurIPS 2022.
[14] Bettini, Matteo, Ajay Shankar, and Amanda Prorok. "Heterogeneous Multi-Robot Reinforcement Learning." AAMAS 2023.
[15] Meneghetti, Douglas De Rizzo, and Reinaldo Augusto da Costa Bianchi. "Towards Heterogeneous Multi-Agent Reinforcement Learning with Graph Neural Networks." arXiv preprint arXiv:2009.13161 (2020).
Thank you to the authors for the clarifications. While I'm satisfied with most of the responses, I suggest that the authors make sure to include the above responses (mainly Q1, Q2, Q3, Q4, and Q5) in the camera-ready version upon acceptance. I believe it is critical to include the new results, domains, and discussions regarding the limitations in open areas, as well as the differences from the attention mechanism in TarMAC. I also appreciate the additional evaluations and results, which add to the value of the work.
I raise my score, contingent upon the authors applying the required revisions mentioned above. Good luck!
This paper introduces a novel approach called CommFormer, which addresses the challenge of learning multi-agent communication from a graph modeling perspective. The communication architecture among agents is modelled as a learnable graph. The problem is treated as the task of determining the communication graph while enabling the architecture parameters to update normally, thus necessitating a bi-level optimization process. By leveraging continuous relaxation of graph representation and incorporating attention mechanisms within the graph modeling framework, CommFormer enables the concurrent optimization of the communication graph and architectural parameters in an end-to-end manner.
Strengths
This paper introduces a novel approach which models the communication architecture among agents as a learnable graph. The considered problem is formulated as the task of determining the communication graph while enabling the architecture parameters to update normally, thus necessitating a bi-level optimization process.
Weaknesses
There have been some works which learn multi-agent cooperative behaviors based on learnable graphs. It would be better to illustrate the differences between this paper and them. An example is provided below.
Liu, Y., Dou, Y., Li, Y., Xu, X., & Liu, D. (2022). Temporal Dynamic Weighted Graph Convolution for Multi-agent Reinforcement Learning. Proceedings of the Annual Meeting of the Cognitive Science Society.
Questions
- The paper proposes a communication-based MARL method. In fact, the CTDE paradigm is not suited to such a method: there is still some communication among agents during the execution of policies. It seems that CTCE is better suited to the proposed method. Some CTCE-based MARL methods, for example graph-based MARL methods, should be considered for comparison in the experiments.
- In Table 1, the FC column is a little confusing. Even if there are no constraints on the communication bandwidth, the win rate is still unlikely to reach 100%, as it depends on how the opponents perform. Of course, 100% is the maximum value for the win rate, but it is a meaningless upper bound. Further, how is the value 93.8 obtained in the FC column as the upper bound?
Q2
In Table 1, the FC column is a little confusing. Even if there are no constraints on the communication bandwidth, the win rate is still unlikely to reach 100%, as it depends on how the opponents perform. Of course, 100% is the maximum value for the win rate, but it is a meaningless upper bound. Further, how is the value 93.8 obtained in the FC column as the upper bound?
In our study, "FC" refers to CommFormer with a sparsity setting of 1.0. This configuration implies that there are no restrictions on the communication graph, allowing agents to freely communicate with all other agents. Effectively, this represents the upper performance limit of the CommFormer methods. By presenting results under this setting, we aim to demonstrate that with our bi-level learning process, a sparsity of 0.4 can achieve comparable results to a full sparsity of 1.0 in most scenarios.
Given the "FC" framework, the bi-level optimization problem simplifies to the following optimization formulation:
Q3
There have been some works which learns multi-agent cooperative behaviors based on learnable graphs. It would be better to illustrate the differences of the paper compared to them.
Thanks! TWG-Q [12] primarily focuses on environments with diverse spatial-temporal information, necessitating a temporal weight learning mechanism and a weighted GCN to dynamically capture the intensities of cooperation. CDC [13] dynamically adjusts the communication graph from a diffusion-process perspective to capture the information flow on the graph. In contrast, our CommFormer learns a static communication graph from an optimization perspective, employing attention scores to automatically assign credit to received messages. We will improve the related work section in the updated version.
Reference
[1] Singh, Amanpreet, et al. "Learning when to communicate at scale in multiagent cooperative and competitive tasks." arXiv preprint arXiv:1812.09755 (2018).
[2] Seraj, Esmaeil, et al. "Learning Efficient Diverse Communication for Cooperative Heterogeneous Teaming." AAMAS 2022.
[3] Xue, Di, et al. "Efficient Multi-Agent Communication via Shapley Message Value." IJCAI 2022.
[4] Das, Abhishek, et al. "TarMAC: Targeted Multi-Agent Communication." ICML 2019.
[5] Wang, Tonghan, et al. "Learning Nearly Decomposable Value Functions via Communication Minimization." arXiv preprint (2019).
[6] Niu, Yaru, et al. "Multi-Agent Graph-Attention Communication and Teaming." AAMAS 2021.
[7] Kortvelesy, Ryan, and Amanda Prorok. "QGNN: Value Function Factorisation with Graph Neural Networks." arXiv preprint arXiv:2205.13005 (2022).
[8] Sukhbaatar, Sainbayar, et al. "Learning Multiagent Communication with Backpropagation." NeurIPS 2016.
[9] Singh, Amanpreet, et al. "Learning when to communicate at scale in multiagent cooperative and competitive tasks." arXiv preprint arXiv:1812.09755 (2018).
[10] Liu, Yong, et al. "Multi-Agent Game Abstraction via Graph Attention Neural Network." AAAI 2020.
[11] Kurach, Karol, et al. "Google Research Football: A Novel Reinforcement Learning Environment." AAAI 2020.
[12] Liu, Yuntao, et al. "Temporal Dynamic Weighted Graph Convolution for Multi-agent Reinforcement Learning." Proceedings of the Annual Meeting of the Cognitive Science Society, 2022.
[13] Pesce, Emanuele, et al. "Learning multi-agent coordination through connectivity-driven communication." Machine Learning, Springer, 2023.
Dear Reviewer WiJp,
The authors greatly appreciate your time and effort in reviewing this submission and eagerly await your response. We understand you might be quite busy; however, the discussion deadline is approaching, and we have only a few hours left.
We have provided detailed responses to every one of your concerns/questions. Please review our responses once again and kindly let us know whether they fully or partially address your concerns and whether our explanations are heading in the right direction.
Best Regards,
The authors of Submission 3301
Q1
Some CTCE-based MARL methods, for example graph-based MARL methods, should be considered for comparison in the experiments.
Thanks for the suggestion! We have incorporated additional experiments to demonstrate the generalization of our method. Taking into account the communication domains explored in previous works, we have included the following experiments. It is worth noting that in certain domains our objective extends beyond maximizing the average success rate or cumulative reward: we also aim to minimize the average number of steps required to complete an episode, emphasizing the ability to achieve goals in the shortest possible time.
- Three additional maps in the SMAC environment: 1o10b_vs_1r and 1o2r_vs_4r, which pose challenges due to partial observability, and 5z_vs_1ul, where successful outcomes require strong coordination.
- Predator-Prey (PP) [1]: The goal is for N predator agents with limited vision to find a stationary prey and move to its location. The agents in this domain all belong to the same class (i.e., identical state, observation, and action spaces).
- Predator-Capture-Prey (PCP) [2]: There are two classes of agents: predator and capture. Agents of the predator class have the goal of finding the prey with limited vision (similar to agents in PP). Agents of the capture class have the goal of locating the prey and capturing it, with an additional capture-prey action in their action space, while not having any observation inputs (e.g., lacking scanning sensors).
- Google Research Football (GRF) [11]: We evaluate algorithms in the football academy scenario 3 vs. 2, with 3 attackers vs. 1 defender and 1 goalie. The three attacking agents are controlled by the MARL algorithm, and the two defending agents are controlled by a built-in AI. We find that the 3 vs. 2 scenario challenges the robustness of MARL algorithms to stochasticity and sparse rewards.
We include several state-of-the-art graph-based multi-agent communication learning approaches as additional baselines in our evaluation: QGNN [7], SMS [3], TarMAC [4], NDQ [5], MAGIC [6], HetNet [2], CommNet [8], IC3Net [9], and GA-Comm [10].
The performance results of these baselines are presented below. Note that, due to time constraints, we take the performance figures directly from the respective papers. Our CommFormer consistently demonstrates favorable performance across the evaluated metrics.
| Task | Metric | CommFormer(0.4) | QGNN | SMS | TarMAC | NDQ | MAGIC | QMIX |
|---|---|---|---|---|---|---|---|---|
| 1o2r_vs_4r | Success Rate | 96.9 ± 1.5 | 93.8 ± 2.6 | 76.4 | 39.1 | 77.1 | 22.3 | 51.1 |
| 1o10b_vs_1r | Success Rate | 96.9 ± 3.1 | 98.0 ± 2.9 | 86.0 | 40.1 | 78.1 | 5.8 | 51.4 |
| 5z_vs_1ul | Success Rate | 100.0 ± 1.4 | 92.2 ± 1.6 | 59.9 | 44.2 | 48.9 | 0.0 | 82.6 |
| Task | Metric | CommFormer(0.4) | MAGIC | CommNet | IC3Net | TarMAC | GA-Comm |
|---|---|---|---|---|---|---|---|
| GRF | Success Rate | 100.0 ± 0.0 | 98.2 ± 1.0 | 59.2 ± 13.7 | 70.0 ± 9.8 | 73.5 ± 8.3 | 88.8 ± 3.9 |
| GRF | Steps Taken | 25.4 ± 0.4 | 34.3 ± 1.3 | 39.3 ± 2.4 | 40.4 ± 1.2 | 41.5 ± 2.8 | 39.1 ± 3.1 |
| Task | Metric | CommFormer(0.4) | MAGIC | HetNet | CommNet | IC3Net | TarMAC |
|---|---|---|---|---|---|---|---|
| PP | Average Cumulative Reward | -0.121 ± 0.008 | -0.386 ± 0.024 | -0.232 ± 0.010 | -0.336 ± 0.012 | -0.342 ± 0.015 | -0.563 ± 0.030 |
| PP | Steps Taken | 4.99 ± 0.31 | 10.6 ± 0.50 | 8.30 ± 0.25 | 8.97 ± 0.25 | 9.69 ± 0.26 | 18.4 ± 0.46 |
| Task | Metric | CommFormer(0.4) | MAGIC | HetNet | CommNet | IC3Net | TarMAC |
|---|---|---|---|---|---|---|---|
| PCP | Average Cumulative Reward | -0.197 ± 0.019 | -0.394 ± 0.017 | -0.364 ± 0.017 | -0.394 ± 0.019 | -0.411 ± 0.019 | -0.548 ± 0.031 |
| PCP | Steps Taken | 7.61 ± 0.66 | 10.8 ± 0.45 | 9.98 ± 0.36 | 11.3 ± 0.34 | 11.5 ± 0.37 | 17.0 ± 0.80 |
- learning to leverage communication in bandwidth-restricted settings with a learnable adjacency matrix
- continuous relaxation of the adjacency matrix to enable differentiable updating of the parameters and adjacency matrix with bootstrapping
优点
- Solid formalization of the communication graph problem
- novel contribution**
- impressive experimental results on SMAC compared to SOTA methods
- well written paper, a pleasure to read
** Possible related work: Learning multi-agent coordination through connectivity-driven communication, Pesce and Montana, 2022, Springer: https://link.springer.com/article/10.1007/s10994-022-06286-6
缺点
- fixed communication network after training. Despite the authors claiming that dynamic adjustments fall outside the scope of the paper, it would be interesting to see performance comparisons.
- task 8m is not in Figure 4 (as opposed to what the "Sparsity" paragraph in Section 4.3 Ablations would suggest).
- "Nevertheless, As task complexity and the number of participating agents increase, a higher level of sparsity becomes necessary to attain superior performance." — this is a very confusing way to say that the matrix needs to be less sparse.
问题
Why do dynamic adjustments fall outside the scope of the paper? It seems like this is more about considering a simplified problem setting, where the communication graph between training and execution must be similar. Did you run any experiments testing the performance of CommFormer when the nature of the communication graph changes between training and execution?
"where ̄φ is the target network’s parameter, which is non-differentiable and updated every few epochs" what does this mean?
How does the actual runtime complexity (i.e. walltime or asymptotic) compare between the different methods? S = 0.4 is still quadratic in the number of agents, which can be limiting for large numbers of agents. Rather than having a sparsity proportion, wouldn't it be more relevant to evaluate sparsity as the actual number of non-zero values in the matrix?
Doesn't this method overfit its communication graph to the task? What does a train/test split look like in such a scenario? Do I need to assume with this training method that the communication graph remains the same between training and testing?
Why do additional environment steps seem to solve the sparsity problem in 25m?
In Figure 6, any intuition as to what kind of tasks lead CommFormer to perform similarly to MAT, and under FC? Since MAT allows unrestricted communication between agents, it's weird that FC seems to massively outperform MAT on some tasks.
Q1
Despite the authors claiming that dynamic adjustments fall outside the scope of the paper, it would be interesting to see performance comparisons.
Thanks for the suggestion.
The learning process of the communication graph in our study is conducted through bi-level optimization, which necessitates backward propagation of loss. Consequently, during inference, the communication graph remains static and cannot be updated.
To adapt our method to a dynamic version, the most straightforward approach is to adjust the communication graph based on the attention scores. We conduct tests on four maps, which involve both homogeneous and heterogeneous agents. The performance results of these experiments are presented below.
| Maps | CommFormer(0.4) | Dynamic Version (0.4) |
|---|---|---|
| 8m | 100.0 ± 0.0 | 96.9 ± 3.1 |
| 25m | 100.0 ± 0.0 | 71.9 ± 9.2 |
| MMM | 100.0 ± 0.0 | 100.0 ± 3.1 |
| 3s5z | 100.0 ± 0.0 | 0.0 ± 0.1 |
As indicated, the dynamic version exhibits a decline in performance across all scenarios. In homogeneous agent environments, the dynamic version demonstrates relatively robust performance. However, in heterogeneous settings, such as in the 3s5z map, it fails to effectively learn communication relationships. This shortfall is likely due to the instability of the training process and a self-boosting phenomenon, where the network is preferentially updated based on relations with initially high attention scores.
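For reference, the dynamic variant evaluated above can be sketched as follows (our illustration; it assumes one scalar attention score per directed agent pair and a hypothetical `dynamic_graph_from_attention` helper):

```python
import torch

def dynamic_graph_from_attention(attn_scores: torch.Tensor, sparsity: float = 0.4):
    """Rebuild the communication graph at every step from encoder attention
    scores, keeping the highest-scoring fraction of directed edges."""
    n = attn_scores.shape[0]
    k = max(1, int(sparsity * n * n))
    adjacency = torch.zeros(n, n, dtype=torch.bool)
    top = attn_scores.flatten().topk(k).indices
    adjacency.view(-1)[top] = True
    return adjacency

scores = torch.rand(8, 8)                      # per-step attention scores
graph = dynamic_graph_from_attention(scores)   # densest 40% of edges kept
```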
We also compare our method with others that specifically investigate dynamic communication adjustments, such as SMS[1], TarMAC[2], and the dynamic message-passing method QGNN[3]. These comparisons are made on three maps: 1o10b_vs_1r and 1o2r_vs_4r, which are challenging due to partial observability, and 5z_vs_1ul, where strong coordination is essential for success. As illustrated in the table below, our method consistently demonstrates exceptional performance.
| Map | CommFormer(0.4) | SMS[1] | TarMAC[2] | QGNN[3] |
|---|---|---|---|---|
| 1o10b_vs_1r | 96.9 ± 3.1 | 86.0 | 40.1 | 98.0 ± 2.9 |
| 1o2r_vs_4r | 96.9 ± 1.5 | 76.4 | 39.1 | 93.8 ± 2.6 |
| 5z_vs_1ul | 100.0 ± 1.4 | 59.9 | 44.2 | 92.2 ± 1.6 |
Q2
Why do dynamic adjustments fall outside the scope of the paper? It seems like this is more about considering a simplified problem setting, where the communication graph between training and execution must be similar. Did you run any experiments testing the performance of CommFormer when the nature of the communication graph changes between training and execution?
Thank you for this helpful suggestion!
(1) Dynamic adjustments, while ensuring adherence to the sparsity requirement at each step, operate under the assumption that all agents require constant communication with one of the other agents. This process typically demands multiple rounds to establish the current communication graph, potentially leading to inefficient bandwidth usage and imposing practical challenges in real-world applications.
(2) The primary goal of our research is to identify the most effective communication graph for a given task. During the training phase, we explore the best possible communication graph among $\binom{n^2}{E}$ potential edge configurations (where $n$ represents the number of agents and $E$ denotes the number of edges). Upon transition to the execution phase, this communication graph is fixed and remains unchanged. This fixed communication model is particularly advantageous for practical deployment scenarios.
(3) Our training methodology focuses on identifying the optimal communication graph. To prevent biases towards specific edges (the self-boosting phenomenon), we employ the Gumbel-Softmax trick; a sketch of this sampling step is given after the table below. This approach enables random edge selection during training, ensuring that each potential connection is considered. Consequently, deviating from the determined communication graph during inference can lead to performance degradation; however, due to the comprehensive nature of our training process, the impact on performance might not be drastic. We validate this hypothesis through tests on four maps involving both homogeneous and heterogeneous agents, with the performance outcomes detailed below:
| Maps | CommFormer(0.4) | Graph Changes between T&E |
|---|---|---|
| 8m | 100.0 ± 0.0 | 93.8 ± 4.4 |
| 25m | 100.0 ± 0.0 | 95.6 ± 4.5 |
| MMM | 100.0 ± 0.0 | 93.8 ± 4.4 |
| 3s5z | 100.0 ± 0.0 | 92.6 ± 4.0 |
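A minimal sketch of the stochastic edge selection described in point (3) (our illustration: the binary-Concrete noise and the straight-through top-k discretization are assumptions, not necessarily the paper's exact scheme):

```python
import torch

n_agents, sparsity = 8, 0.4
# Continuous relaxation: one learnable logit per directed edge.
edge_logits = torch.nn.Parameter(torch.zeros(n_agents, n_agents))

def sample_edges(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    u = torch.rand_like(logits)
    noise = torch.log(u) - torch.log1p(-u)        # logistic (Gumbel-difference) noise
    soft = torch.sigmoid((logits + noise) / tau)  # relaxed adjacency in (0, 1)
    k = int(sparsity * logits.numel())
    hard = torch.zeros_like(soft)
    hard.view(-1)[soft.flatten().topk(k).indices] = 1.0
    # Straight-through: hard 0/1 edges in the forward pass, gradients via `soft`.
    return hard + soft - soft.detach()

adjacency = sample_edges(edge_logits)  # resampled at each training iteration
```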
Q3
"where ̄φ is the target network’s parameter, which is non-differentiable and updated every few epochs" what does this mean?
Sorry for the confusion. In our context, $\bar{\varphi}$ denotes the parameters of the target value function, a separate neural network that is a copy of the main value function. It is updated as an exponential moving average or periodically in a "hard" manner. This treatment is similar to how the parameters of the target Q-network are handled in DQN (Deep Q-Network) to enhance training stability.
The target network is not updated frequently in order to address the "moving target" problem (also described as target overestimation). When the Q-network is updated, the target values used for training change accordingly; if the target network is updated too frequently, the target values become unstable, leading to slower convergence or even divergence. By keeping the target network fixed for a certain number of iterations, the algorithm mitigates this instability; both update schemes are sketched below.
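A sketch with a placeholder network (illustrative PyTorch, not the paper's code):

```python
import copy
import torch

value_net = torch.nn.Linear(16, 1)     # stand-in for the value function
target_net = copy.deepcopy(value_net)  # the target network with parameters phi-bar

@torch.no_grad()
def soft_update(target, online, tau: float = 0.005):
    # Exponential moving average: phi_bar <- (1 - tau) * phi_bar + tau * phi
    for p_t, p in zip(target.parameters(), online.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)

@torch.no_grad()
def hard_update(target, online):
    # Periodic "hard" copy every few epochs, as with DQN target networks.
    target.load_state_dict(online.state_dict())
```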
Q4
How does the actual runtime complexity (i.e. walltime or asymptotic) compare between the different methods? S = 0.4 is still quadratic in the number of agents, which can be limiting for large numbers of agents. Rather than having a sparsity proportion, wouldn't it be more relevant to evaluate sparsity as the actual number of non-zero values in the matrix?
Thanks for the valuable comment.
(1) A feasible approach in our study is to set the total number of communication edges, corresponding to the count of non-zero entries in the adjacency matrix $\mathbf{A}$. In this context, Equation (8) in the bi-level optimization problem would be reformulated with the sparsity constraint replaced by a fixed edge count, $\sum_{i,j} \mathbf{A}_{ij} = E$, without altering the overall training process. The rationale behind considering sparsity primarily relates to the increasing need for communication among agents as their number grows, which in turn can enhance performance. As depicted in Figure 4, a lower sparsity coefficient adversely affects both the learning process and the final outcome; thus, fixing the number of edges could lead to reduced performance as the number of agents increases. However, as the reviewer points out, a fixed number of edges could be more advantageous in scenarios involving a large number of agents, since it accounts for the practical feasibility of maintaining communication among many agents.
(2) Regarding runtime efficiency, action generation, which requires the decoder to produce the action sequence in an auto-regressive manner, is typically the most time-consuming operation in the execution stage. However, considering the overall time (communication plus execution), our method is more efficient than other communication methods such as TarMAC and SMS, which necessitate multiple rounds of information exchange, or QGNN, which requires multiple rounds of message passing through a GNN model. Our method, which involves a one-time transfer of local observations and action sequences, is both straightforward and time-saving. Determining the theoretical complexity of each method is challenging owing to the intricate engineering designs involved; due to constraints such as limited time and the absence of open-source code, we substantiate our claims with empirical runtime data. For instance, on the 5z_vs_1ul map, our method completes training within 8 hours, whereas QGNN requires nearly 2 days.
Q5
Doesn't this method overfit its communication graph to the task? What does a train/test split look like in such a scenario? Do I need to assume with this training method that the communication graph remains the same between training and testing?
Our study aims to identify the optimal communication graph for specific tasks. As illustrated in Figure 1, manually determining the communication graph requires meticulous design, as an inadequately conceived graph could lead to suboptimal performance. Therefore, we approach this challenge by conceptualizing the task as finding the most effective communication graph, while simultaneously allowing for normal updates of the architectural parameters in an end-to-end manner.
All our experiments are conducted in an online setting, meaning that training and testing for each scenario occur within the same map. However, there is a distinct difference between the training and inference phases. During training, every agent is permitted to communicate with all others, facilitating the discovery of the most suitable communication graph. In contrast, the inference phase restricts each agent to communicating only with a limited set of agents, as determined by the established communication graph; a sketch of this masking is given below. This process adheres to the principle of centralized training and decentralized execution (CTDE).
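In implementation terms, the difference between the two phases amounts to swapping the attention mask (a sketch under assumed conventions; self-edges are retained so each agent always sees its own observation):

```python
import torch

def communication_mask(adjacency: torch.Tensor, training: bool) -> torch.Tensor:
    """During training every agent may attend to all others, so the graph
    search sees every candidate edge; at execution, attention is restricted
    to the learned static graph."""
    n = adjacency.shape[0]
    if training:
        return torch.ones(n, n, dtype=torch.bool)
    return adjacency.bool() | torch.eye(n, dtype=torch.bool)
```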
Q6
Why does additional environment steps seem to solve the sparsity problem in 25m?
Based on the observations from Figure 4, when the sparsity coefficient is low (e.g., 0.1), determining the communication graph may require more environment steps, because each edge becomes crucial given the limited total number of edges. Once the algorithm identifies the truly significant edges, it achieves improved performance in the subsequent steps. Therefore, more iterations may be needed to identify and prioritize the important edges when the sparsity coefficient is low.
Q7
In figure 6, any intuition as to what kind of tasks lead Commformer to perform similarly to MAT, and under FC? Since MAT allows unrestricted communication between agents, it's weird that FC seems to massively outperform MAT on some tasks.
In our analysis, we compare our method with MAT, specifically focusing on its fully decentralized actor variant for each individual agent, referred to as MAT-Dec in the original paper; this ensures a fair assessment against the other approaches. Additionally, in our study, "FC" refers to CommFormer with a sparsity setting of 1.0. This configuration implies that there are no restrictions on the communication graph, allowing agents to freely communicate with all other agents; effectively, it represents the upper performance limit of the CommFormer methods. By presenting results under this setting, we aim to demonstrate that, with our bi-level learning process, a sparsity of 0.4 can achieve results comparable to a full sparsity of 1.0 in most scenarios.
Given the "FC" framework, the bi-level optimization problem simplifies to the following optimization formulation:
The performance of CommFormer varies depending on the nature of the task. In scenarios where each agent primarily relies on its own capabilities and communication plays a less critical role, CommFormer's performance is comparable to that of MAT. However, in situations where inter-agent communication is crucial for success, MAT tends to struggle in achieving optimal performance, whereas CommFormer is designed to excel in these contexts by effectively leveraging communication strategies.
Q8
related work: Learning multi-agent coordination through connectivity-driven communication
task 8m is not in figure 4 (as opposed to what the "Sparsity" paragraph would suggest in 4.3 Ablations).
"Nevertheless, As task complexity and the number of participating agents increase, a higher level of sparsity becomes necessary to attain superior performance." this is a very confusing way to say that the matrix needs to be less sparse.
Thanks for pointing these out. CDC, similar to our CommFormer approach, represents agents as nodes and employs graph-dependent attention mechanisms to govern the weighting of incoming messages from other agents. However, CDC dynamically modifies the communication graph from a diffusion-process perspective, whereas our CommFormer learns a static communication graph from an optimization standpoint. We will add the missing figure, improve the statement, and add the missing citations in the updated version.
Reference
[1] Xue, Di, et al. "Efficient Multi-Agent Communication via Shapley Message Value." IJCAI 2022.
[2] Das, Abhishek, et al. "TarMAC: Targeted Multi-Agent Communication." ICML 2019.
[3] Kortvelesy, Ryan, and Amanda Prorok. "QGNN: Value Function Factorisation with Graph Neural Networks." arXiv preprint arXiv:2205.13005 (2022).
This paper proposes a method for learning optimal communication graphs in multi-agent systems using attention. Unlike previous methods that use a predefined graph communication structure with unlimited comms bandwidth, CommFormer learns to create directed communication links such that some level of graph sparsity is maintained. To do this, they formulate a constrained optimization problem to learn a value encoder and action decoder with an upper bound on the norm of the graph adjacency matrix. In practice, they create a bi-level optimization that steps the encoder/decoder optimizers to find approximate optima and then updates the adjacency matrix. They perform experiments on StarCraft II with various value-based and policy-gradient baselines and demonstrate that CommFormer can outperform the baselines on SMAC tasks ranging from Easy to Super Hard.
优点
This is an interesting and well written paper. To my knowledge, the learned graph for graph communication using transformers is a novel idea with clear applications to the real world. The architecture is simple/clear and the motivation for the necessity of this solution is motivated very well in Figure 1.
- The CommFormer method significantly outperforms most of the baselines on most of the tasks (with the exception of some super hard SMAC tasks)
- Performs ablative studies to demonstrate the importance of the sparsity claimed in the paper.
- Adaptable to various actor-critic methods, not just PPO
缺点
There are some concerns I have about the problem formulation. It is assumed in many MARL tasks that communication at test time is limited, as per the CTDE paradigm. However, my understanding is that at each time step CommFormer can choose to create/destroy communication links between arbitrary agents as long as a sparsity measure is met. While this is not unreasonable, it is a very large assumption to make while claiming the CTDE paradigm. Further, in Section 3.2, the authors state that they restrict communication of agent $i$ to only agents $j$ where $j < i$; this assumes some implicit (or explicit) ordering of the agents. Again, while I don't think this is unreasonable, as many MARL algorithms use one-hot agent-id encodings, it imposes additional nuances that are important to the functioning of the algorithm.
Finally, the authors do not compare to a recent graph-based MARL baseline called QGNN [1].
[1] Ryan Kortvelesy and Amanda Prorok. QGNN: Value Function Factorisation with Graph Neural Networks
问题
- Can the authors compare their method to QGNN?
- How is the ordering of agents decided when inputting to the transformer, and is there positional encoding?
- I understand that graph sparsity is a necessary assumption to manage the bandwidth of any given agent. However, can the authors discuss or demonstrate what would happen if more realistic assumptions on graph communication were made, such as only communicating with agents within some specified communication range?
Q1
How is the ordering of agents decided when inputting to the transformer and is there positional encoding?
Sorry for the confusion. It is important to clarify that we do not manually determine the order of the agents, nor is there any positional encoding specifically assigned to them. Instead, we calculate the attention between agent $i$ and its preceding agents $j$, where $j < i$, in order to ensure that the decoder generates the action sequence in an auto-regressive manner. This approach guarantees a consistent improvement in performance during training [1].
In practice, it is possible to change the order of agents at each update iteration. However, we have chosen to maintain a fixed ordering, primarily to preserve the ordering of the adjacency matrix: altering the order of agents would require corresponding adjustments to the adjacency matrix $\mathbf{A}$.
Q2
Compare to a recent graph-based MARL baseline called QGNN.
Thank you for your comment.
QGNN primarily leverages graph pooling to facilitate value factorization within a system of agents of variable size. In this system, the communication graph is predetermined by a specific protocol, such as known interactions, and is employed to represent the interdependencies within a multi-layer message-passing GNN architecture. In contrast, our CommFormer simultaneously determines the communication graph and autonomously assigns credit to received messages through a single round of communication. The requested results are presented in the table below. Considering the time constraints, we have chosen StarCraft II as the environment for testing these methods. The evaluation is conducted on three maps: 1o10b_vs_1r and 1o2r_vs_4r, which pose challenges due to partial observability, and 5z_vs_1ul, where successful outcomes require strong coordination.
To ensure a comprehensive comparison, we also include the performance of other communication methods, such as SMS [2], TarMAC [3], NDQ [4], and MAGIC [5]. It is worth noting that in the QGNN approach the graph is fully connected, following their official code settings, meaning that each agent can exchange information with all other agents. CommFormer consistently demonstrates strong performance across these environments.
| Map | CommFormer(0.4) | QGNN | SMS[2] | TarMAC[3] | NDQ[4] | MAGIC[5] | QMIX |
|---|---|---|---|---|---|---|---|
| 1o10b_vs_1r | 96.9 ± 3.1 | 98.0 ± 2.9 | 86.0 | 40.1 | 78.1 | 5.8 | 51.4 |
| 1o2r_vs_4r | 96.9 ± 1.5 | 93.8 ± 2.6 | 76.4 | 39.1 | 77.1 | 22.3 | 51.1 |
| 5z_vs_1ul | 100.0 ± 1.4 | 92.2 ± 1.6 | 59.9 | 44.2 | 48.9 | 0.0 | 82.6 |
Q3
Discuss or demonstrate what would happen if more realistic assumptions on graph communication were made, such as only communicating with agents within some specified communication range?
A possible application of this study is to create an efficient communication framework tailored for enclosed, finite environments, typical of logistics warehouses. In these settings, agent movement is limited to designated zones, and communication is facilitated through overhead wires, akin to a trolleybus system.
Our objective encompasses determining the optimal communication graph, while concurrently ensuring normal updates of architectural parameters. Upon completion of the training phase, the communication graph becomes static, forming the basis for subsequent inferences.
In contrast, open environments present unique challenges, primarily due to the potentially vast distances between agents, which necessitate wireless links and may hinder effective communication. To address this, a straightforward approach could be to add bidirectional edges between agents whenever they come within close proximity, enabling communication between them [6]. However, a more effective solution may involve a hybrid approach that considers the constraint on the available bandwidth: initially segmenting agents into groups based on proximity, followed by an internal search for an optimal communication graph within each group. If agent distances vary dynamically during testing, this process is repeated as necessary to adjust the communication graph in real time, ensuring continuous adaptability to changing environmental conditions.
Q4
At each time step, the CommFormer can choose to create/destroy communication links between any arbitrary agents as long as a sparsity measure is met. While this is not unreasonable, it is a very large assumption to make while claiming the CTDE paradigm.
Thanks for pointing these out. In addressing the practical deployment of our CommFormer model in real-world conditions, we consider two approaches based on the available resources and constraints:
(1) Simulation-Based Communication Graph Optimization: When a simulation environment closely mirroring the real-world deployment is available, we suggest leveraging this simulation to optimize the communication graph. This method efficiently circumvents the complexities associated with directly creating and destroying communication links in a live setting: the deployed system only needs to maintain a global adjacency matrix used as a mask. By utilizing a simulated environment, we can iteratively refine and determine the most effective communication graph without the immediate concerns of real-world deployment challenges.
(2) Physical Interaction-Based Approach with a Global Bus System (with overhead wires; the wireless situation is similar): In scenarios where direct physical interaction is necessary and a simulation environment is not available, we propose a feasible solution that connects each agent through a global bus system. This setup allows the dynamic establishment and disbandment of communication links, analogous to how TCP (Transmission Control Protocol) connections operate in network communications. Under this framework, each agent can dynamically establish or terminate connections based on the optimized communication strategy derived from our model.
Reference
[1] Wen, Muning, et al. "Multi-Agent Reinforcement Learning is a Sequence Modeling Problem." NeurIPS 2022.
[2] Xue, Di, et al. "Efficient Multi-Agent Communication via Shapley Message Value." IJCAI 2022.
[3] Das, Abhishek, et al. "TarMAC: Targeted Multi-Agent Communication." ICML 2019.
[4] Wang, Tonghan, et al. "Learning Nearly Decomposable Value Functions via Communication Minimization." arXiv preprint (2019).
[5] Niu, Yaru, et al. "Multi-Agent Graph-Attention Communication and Teaming." AAMAS 2021.
[6] Seraj, Esmaeil, et al. "Learning Efficient Diverse Communication for Cooperative Heterogeneous Teaming." AAMAS 2022.
Summary
We thank reviewers for their valuable feedback, and appreciate the great efforts made by all reviewers, ACs, SACs and PCs.
We are invigorated by the positive evaluation from all reviewers. Specifically, they find the method novel and well-motivated (all), the experimental results state-of-the-art (9cgB, q3Zy), and the writing clear and effective (q3Zy).
In response to the comments and suggestions, we have provided a detailed respective rebuttal for each reviewer, and here we summarize major points for convenience.
- We have conducted a more in-depth and comprehensive analysis of the experimental design and results, including more graph-based communication methods and more communication environments. (9cgB, Ux29, WiJp)
- We have extended the ablation studies to investigate the effect of a dynamic version and of changing the communication graph between training and execution. (q3Zy)
All of these will be merged into the article.
This paper proposes the CommFormer approach for learning communication graphs in multi-agent systems. Reviewers raised concerns regarding comparisons to recent graph-based MARL methods, generalization beyond StarCraft II environments, feasibility for real-world deployment with dynamic communication, and differences from prior attention-based communication approaches like TarMAC. In response, the authors added experiments on several new environments, including Predator-Prey, Predator-Capture-Prey, and Google Football, comparing CommFormer to a range of graph-based MARL baselines such as QGNN, HetNet, and MAGIC. They also discussed limitations and provided suggestions for dynamic communication in real-world settings, such as adding edges based on proximity, and clarified differences between the attention mechanisms of CommFormer and prior work like TarMAC. Overall, the authors addressed the major concerns through the discussion, demonstrating the competitiveness of CommFormer against graph-based MARL methods and its potential for broader applications.
Why not a higher score
The reviewers seem mostly satisfied with the responses, pending inclusion of the additional results and discussions.
Why not a lower score
The reviewers all acknowledge the contribution of this paper.
Accept (poster)