PaperHub
Overall rating: 6.0/10 (Poster; 4 reviewers; individual ratings 7, 5, 6, 6; min 5, max 7, std 0.7)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.5
NeurIPS 2024

MADiff: Offline Multi-agent Learning with Diffusion Models

Submitted: 2024-05-11 · Updated: 2024-11-06
TL;DR

We propose a multi-agent diffusion framework that unifies decentralized policy, centralized controller, teammate modeling, and trajectory prediction.

Abstract

Keywords
Multi-agent RL · Diffusion Models · Offline RL

Reviews and Discussion

Official Review (Rating: 7)

The paper introduces MADiff, a generative multi-agent learning framework designed to tackle the challenges of coordinating multiple agents in offline MARL. Leveraging the attention-based diffusion model, MADiff effectively captures complex agent interactions, enabling effective teammate modeling and trajectory prediction. Experimental results demonstrate MADiff's superiority over baseline algorithms across various tasks, highlighting its effectiveness in managing multi-agent interactions.

Strengths

  • This paper proposes an offline MARL algorithm based on DecisionDiffuser, incorporating an attention mechanism to integrate information from other agents.
  • The paper is clearly written and easy to understand.
  • The algorithm is compared with multiple baselines, showcasing superior performance.

Weaknesses

  • The novelty is limited, as this work extends the diffusion-based offline reinforcement learning method Decision Diffuser from single-agent to multi-agent settings. The new addition is the attention mechanism, which is not a novel concept in the multi-agent domain.
  • Why is the win rate in SMAC not reported, instead of the average return? From my perspective, a high return does not necessarily equate to a high win rate (e.g., a high-return conservative policy may still fail to win overall).
  • Experiments use only three random seeds, whereas current MARL works typically use five or more to validate algorithm effectiveness.
  • Why was a diffusion model based on the U-Net architecture with an added attention mechanism chosen? As far as I know, combining transformers with diffusion models, such as DiT [1], yields stronger performance, with transformers inherently possessing attention mechanisms.
  • Although the paper claims to be the first to introduce diffusion models to offline MARL, is MADiff's performance mainly due to the attention mechanism? Comparing the performance of the variant version without attention (Ind) in the ablation study with baseline algorithms, I observed that it often underperforms compared to OMAR across most datasets. Therefore, can I infer that the performance advantage of MADiff mainly derives from the attention mechanism, which is common in MARL algorithms? Applying the attention mechanism to other baseline algorithms would help assess the source of MADiff's performance gains.
  • As the number of agents increases, MARL faces the dimensionality curse of the joint action space, which becomes more pronounced with computationally heavy diffusion models.

[1] Peebles, William, and Saining Xie. "Scalable diffusion models with transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Questions

  1. Beyond being index-free, what other advantages does the attention mechanism provide? Can the agent observations be concatenated as a single input?
  2. The inverse dynamics model takes the local observations $o^i$ of the agents as input. If other agents are considered part of the environment, it should be non-stationary. Can the inverse dynamics model learn the environment dynamics stably and effectively under these conditions?
  3. Can the authors elaborate on how DMs avoid extrapolation errors in offline RL?
  4. Should $\mathbb{E}_{(s^i, a^i, s'^i)}$ in Eq. (6) be modified to $\mathbb{E}_{(o^i, a^i, o'^i)}$?
  5. Please see Weaknesses for additional questions.

If the authors address the above concerns, I am happy to raise the score.

Limitations

The limitations are explicitly discussed in Section 5.6 of the paper.

Comment

References

[1] Janner, Michael, et al. Planning with Diffusion for Flexible Behavior Synthesis.

[2] Iqbal, S., & Sha, F. Actor-attention-critic for multi-agent reinforcement learning.

[3] Si, C., et al. Freeu: Free lunch in diffusion u-net.

[4] Chi, C., et al. Diffusion policy: Visuomotor policy learning via action diffusion.

[5] Shen, Zhuoran, et al. Efficient attention: Attention with linear complexities.

Comment

Thank you for addressing some of my questions. However, I still have a few concerns:

A5: "Both diffusion modeling and attention networks are essential in our algorithm."

I still have some questions. As mentioned in Q5, "Comparing the performance of the variant version without attention (Ind) in the ablation study with baseline algorithms, I observed that it often underperforms compared to OMAR across most datasets." Could the authors provide an explanation for this?

A6: "For example, with DDIM samplers, we can reduce the sampling steps from 200 to 15 with negligible performance loss."

Could the authors provide specific experimental results to support this claim? Based on my personal experience, the number of sampling steps in diffusion models can have a significant impact on decision-making performance.

Comment

Thank you for participating in the discussion phase!

Q5.1: "As mentioned in Q5, 'Comparing the performance of the variant version without attention (Ind) in the ablation study with baseline algorithms, I observed that it often underperforms compared to OMAR across most datasets.' Could the authors provide an explanation for this?"

A5.1: DMs produce samples with inherent diversity due to the stochasticity in the reverse process. This is fine for single-agent RL, since any action is acceptable as long as it corresponds to the conditioned high rewards. However, in multi-agent learning, the team rewards depend on all agents' behaviors. If each agent only samples its ego trajectory from the trajectory distribution with high team rewards, this sample diversity can make the combined multi-agent trajectories inconsistent and cause cooperation to fail. The issue arising from such diversity may outweigh the advantages of the DMs' modeling capabilities, making it less effective than the baseline method, OMAR. By training a centralized DM on the joint trajectory distribution and explicitly modeling other agents' trajectories during decentralized execution, MADiff encourages the generation of globally consistent trajectories, thus mitigating the issue.

Q6.1: "Could the authors provide specific experimental results to support this claim? Based on my personal experience, the number of sampling steps in diffusion models can have a significant impact on decision-making performance."

A6.1: To avoid potential misunderstandings, we clarify that when considering sampling acceleration, we did not reduce the number of diffusion steps. The model is still trained with 200 steps. In the denoising process, we use DDIM to skip most steps, leaving only 15 steps.
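For concreteness, a minimal sketch of this kind of step skipping (deterministic DDIM with $\eta = 0$; the function and variable names are illustrative, not the MADiff implementation):

```python
# Sketch: sample with ~15 DDIM steps from a noise-prediction model trained with 200 DDPM steps.
import numpy as np

def ddim_sample(eps_model, alphas_cumprod, x, cond, n_sample_steps=15):
    """eps_model(x, t, cond) -> predicted noise; alphas_cumprod has one entry per training step."""
    T = len(alphas_cumprod)                                          # e.g. 200 DDPM training steps
    taus = np.linspace(0, T - 1, n_sample_steps, dtype=int)[::-1]    # descending subset of steps
    for i, t in enumerate(taus):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[taus[i + 1]] if i + 1 < len(taus) else 1.0
        eps = eps_model(x, t, cond)
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)           # predicted clean trajectory
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps       # deterministic DDIM update (eta = 0)
    return x
```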

We compare MADiff-D using 15-step DDIM sampling and 200-step DDPM sampling on four SMAC datasets, and the results are listed below:

| Dataset | MADiff-D DDPM (Score) | MADiff-D DDPM (Win Rate) | MADiff-D DDIM (Score) | MADiff-D DDIM (Win Rate) |
| --- | --- | --- | --- | --- |
| 5m6m-Good | 17.8 ± 0.8 | 0.78 ± 0.08 | 17.7 ± 0.7 | 0.78 ± 0.05 |
| 5m6m-Medium | 17.3 ± 0.5 | 0.72 ± 0.04 | 17.4 ± 0.5 | 0.74 ± 0.04 |
| 2s3z-Good | 19.6 ± 0.3 | 0.94 ± 0.05 | 19.5 ± 0.3 | 0.93 ± 0.04 |
| 2s3z-Medium | 17.4 ± 0.4 | 0.62 ± 0.08 | 17.4 ± 0.2 | 0.63 ± 0.05 |

The results indicate that DDIM sampling does not cause a noticeable performance loss. This aligns with findings from a recent report on DMs in single-agent RL (Table 6) [1], which claims that "5 sampling steps are adequate for most tasks, suggesting that more sampling steps in previous works, e.g., 100, are unnecessary".

References

[1] Dong, Z., Yuan, Y., Hao, J., Ni, F., Ma, Y., Li, P., & Zheng, Y. (2024). CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making. arXiv preprint arXiv:2406.09509.

Comment

Thanks for your response, which addresses most of my concerns. I would like to raise my score from 6 to 7.

Comment

Thank you for your thoughtful consideration and for raising your score. We're glad that our revisions addressed your concerns. Your support means a lot to us!

Author Response

Q1: "The novelty is limited ... which is not a novel concept in the multi-agent domain."

A1: Many successful design choices of the Decision Diffuser (DD) inspired and guided our method. Similarly, DD itself drew on the design of Diffuser [1], yet it remains excellent work.

Compared with DD, which focuses on single-agent learning (SAL), MADiff is naturally designed for multi-agent problems. By using different conditioning during evaluation, the same framework handles multi-agent coordination, teammate modeling, and joint trajectory prediction, supported by solid experimental results. We acknowledge the long-standing use of the attention mechanism in multi-agent learning (MAL) and have verified its effectiveness in conditional generative modeling of multi-agent trajectories.

Q2: "Why the win rate in SMAC is not reported, but instead the average return? "

A2: We adopted the offline datasets and the performance of most baseline algorithms from OGMARL, which reported the average return of its baseline algorithms instead of the win rate. Therefore, we chose the average return as the evaluation metric in Table 1. We have included the win rates of MADiff and our self-implemented baseline, MADT, on three SMAC maps in Table 1 of the supplement PDF. We will re-run the other baseline algorithms and the 8m map and add the win rates to the main paper.

Q3: "Experiments use only three random seeds, whereas current MARL works typically use five or more to validate algorithm effectiveness."

A3: We conducted experiments with five seeds on three SMAC maps, as shown in Table 1 of the supplement PDF. The results generally align with our previously reported performances, showing MADiff's robustness to random seeds. We will update all experimental results to five random seeds once finished.

Q4: "Why was a diffusion model based on the U-Net architecture with an added attention mechanism chosen?"

A4: We use U-Net to model each agent's trajectory and the attention mechanism to coordinate across agents for three reasons:

  • U-Net-based diffusion models are still predominantly used in offline RL due to the convolutional module's ability to fuse locally consistent subsequences into globally coherent trajectories [1].
  • Attention modules applied to U-Net's skip-connected features, which are shown to constitute high-frequency information [3], allow agent interactions on the most critical parts of the trajectory, effectively steering the generation process (see the sketch after this list).
  • A prior study found that while transformers performed better on some imitation learning tasks, they were generally more sensitive to hyperparameters [4].
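A rough sketch of the second point, one cross-agent attention layer applied to a skip-connected feature map (the module, shapes, and names are our illustrative assumptions rather than the paper's implementation):

```python
# Sketch: attention across agents on a U-Net skip feature of shape
# (batch B, agents N, channels C, horizon H); channels must be divisible by heads.
import torch
import torch.nn as nn

class CrossAgentAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, skip: torch.Tensor) -> torch.Tensor:
        B, N, C, H = skip.shape
        x = skip.permute(0, 3, 1, 2).reshape(B * H, N, C)  # tokens = agents, one set per latent timestep
        out, _ = self.attn(x, x, x)                        # each agent attends to all agents
        out = out.reshape(B, H, N, C).permute(0, 2, 3, 1)  # back to (B, N, C, H)
        return skip + out                                  # residual connection before the decoder layer
```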

Q5: "Although the paper claims ... would help assess the source of MADiff's performance gains."

A5: Both diffusion modeling and attention networks are essential in our algorithm. Baseline Q learning algorithms use single-agent independent actors, and only their centralized critics can incorporate attention modules. To our knowledge, no prior work uses attention-based critics in offline MARL. In online MARL, attention-based critics [2] do not outperform QMIX's mixing value network. MADT frames MAL problems as sequence modeling with a fully attention-based network, and MADiff outperformed MADT in most datasets.

Q6: "As the number of agents increases, MARL faces the dimensionality curse of the joint action space, which becomes more pronounced with computationally heavy diffusion models."

A6: The potentially high computational cost of MADiff comes from two parts: the iterative denoising process and the modeling of teammate agents.

  • Many off-the-shelf fast sampling techniques can be adopted to reduce the steps. For example, with DDIM samplers, we can reduce the sampling steps from 200 to 15 with negligible performance loss.
  • As mentioned in Section 5.5, we use a shared U-Net model for all agents, where different agents' trajectories can be batched together and passed through the network (a batching sketch is given after this answer). With GPU-accelerated computing, inference time does not increase much as the number of agents grows, as reported in Appendix Table 6. For self-attention among agents' embeddings, the cost can also scale with linear complexity [5].

When considering performance improvements over baseline algorithms, the higher computational cost of MADiff is still tolerable in many scenarios.
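As a concrete illustration of the batching mentioned above (shapes and names are illustrative assumptions, not the actual MADiff code), the shared U-Net can treat the agent dimension as part of the batch:

```python
# Sketch: run a parameter-shared U-Net over all agents by folding the agent
# dimension into the batch dimension; x has shape (batch B, agents N, horizon H, obs dim D).
import torch

def shared_unet_forward(unet, x: torch.Tensor) -> torch.Tensor:
    B, N, H, D = x.shape
    flat = x.reshape(B * N, H, D)    # all agents processed in one GPU batch
    out = unet(flat)                 # same U-Net weights for every agent
    return out.reshape(B, N, H, D)   # restore per-agent structure for cross-agent attention
```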

Q7: "Beyond being index-free, what other advantages does the attention mechanism provide? Can the agent observations be concatenated as a single input?"

A7: Attention is conducted on U-Net skip-connected features, allowing agent interactions on crucial trajectory parts, effectively steering the generation. We compare MADiff-C with concatenating all agents' observations (ConcatDiff) on three MPE spread datasets. Table 3 of the supplement PDF shows MADiff-C consistently performs better.

Q8: "The inverse dynamics model takes the local observations oi of the agents as input ... stably and effectively under these conditions?"

A8: The non-stationarity in the transition function $p(o'|o, a)$ does not necessarily imply non-stationary inverse dynamics $p(a|o, o')$. For example, in MPE tasks, the next observation depends on all agents' actions. However, knowing the current and next local observations allows easy determination of ego movement. Most of our benchmarked tasks don't show high non-stationarity in inverse dynamics.

If the inverse dynamics of certain tasks depend significantly on other agents, one can use DM-generated current and next observations of other agents as additional input to the inverse dynamics model.
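For reference, a minimal sketch of such an inverse dynamics model, a plain MLP over concatenated consecutive local observations (layer sizes and names are illustrative assumptions, not the authors' code):

```python
# Sketch: inverse dynamics a^i = f(o^i_t, o^i_{t+1}); generated teammate observations
# could optionally be concatenated to the input, as suggested above.
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, next_obs], dim=-1))  # predicted ego action
```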

Q9: "Can the authors elaborate on how DMs avoid extrapolation errors in offline RL?"

A9: Due to the character limit, we respond to this common question from you and Reviewer hh3p in A1 of the general response.

Q10: "Should E(si,ai,s′i) in Eq. (6) be modified to E(oi,ai,o′i)?"

A10: Thanks for checking! We have fixed the typo.

Official Review (Rating: 5)

This paper presents a novel generative multi-agent learning framework named MADIFF. The framework employs an attention-based diffusion model (DM) to address the complex coordination problems in multi-agent settings. MADIFF combines centralized training with decentralized execution, enabling effective teammate modeling and trajectory prediction. The experimental results demonstrate that MADIFF significantly outperforms existing baseline algorithms across various multi-agent learning tasks.

Strengths

S1: The paper introduces an innovative approach by integrating attention-based diffusion models for multi-agent learning, which is a novel contribution to the field.

S2: The framework effectively addresses coordination among multiple agents, which is a significant challenge in MARL.

S3: Comprehensive experiments are conducted, demonstrating the superior performance of MADIFF across a range of tasks and benchmarks.

S4: The paper provides detailed ablation studies, showcasing the effectiveness of the attention mechanisms in the diffusion model.

Weaknesses

W1: The paper lacks a detailed discussion on the computational complexity and scalability of the proposed approach, which is crucial for practical applications.

W2: The motivation of this paper is unclear.

W3: The evaluation primarily focuses on performance metrics; additional analysis on the robustness and adaptability of the model in diverse scenarios could provide more insights.

Questions

  • The motivation for using DMs to solve the extrapolation errors in offline MARL is unclear to me. Why can DMs solve it? What are the strengths of DMs compared to previous works?

  • The attention mechanism and DMs for single-agent reinforcement learning (RL) have been extensively studied in previous works. This paper appears to simply integrate these two methods into MARL. Could you please provide more details on how this approach differs from directly applying these methods in MARL settings?

  • How does MADIFF perform with an increasing number of agents in MARL systems? Additionally, I am very interested in how MADIFF addresses the challenges of adding a new agent to the existing MARL systems.

If you can address my concerns, I would be happy to improve my rating.

Limitations

  • This paper presents an unclear motivation and methodology. It would be great to have a more holistic picture of the problem and solution.

  • The evaluation primarily focuses on performance metrics; additional analysis on the robustness and adaptability of the model in diverse scenarios could provide more insights.

  • Related works should discuss using diffusion for single-agent RL more.

  • It would be great to add some discussion of real-world applications.

Author Response

Q1: "The paper lacks a detailed discussion on the computational complexity and scalability of the proposed approach, which is crucial for practical applications."

A1: The computational complexity of MADiff during sampling is $O(KN^3)$. The following three points warrant clarification (a rough cost breakdown is sketched after the list):

  • Within decentralized execution settings, each agent must make decisions independently. Consequently, one factor of $N$ in $KN^3$ is inevitable for all methods.
  • As shown in Figure 1, we adopt one-layer attention only before each decoder layer. Most of the computation is still performed independently within each agent's U-Net, and this majority of the computation scales linearly with the number of agents. If the number of agents is not very large, the attention operation adds little overhead.
  • For the diffusion step $K$, many off-the-shelf fast sampling techniques can be adopted. For example, with DDIM samplers, we can reduce the sampling steps from 200 to 15 with negligible performance loss.
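A rough accounting consistent with the three points above (our illustrative decomposition, not a formal analysis from the paper):

$$
\underbrace{N}_{\text{deciding agents}} \times \underbrace{K}_{\text{denoising steps}} \times \Big( \underbrace{N}_{\text{per-agent U-Net passes}} + \underbrace{N^{2}}_{\text{cross-agent attention}} \Big) \;=\; O(KN^{3}).
$$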

To ensure we address your concern accurately, could you please provide more specific details on what aspects of scalability you would like us to focus on? For instance, are you interested in the scalability in terms of computational resources, dataset size, or large number of agents? Any additional guidance would be greatly appreciated.

Q2: "The motivation of this paper is unclear."

A2: In fact, you have summarized the motivation of our paper quite well in the summary and strengths of the review: "introduces an innovative approach by integrating attention-based diffusion models for multi-agent learning" to "effectively address coordination among multiple agents, which is a significant challenge in MARL". We are continually revising the paper, especially the abstract and introduction sections, to emphasize our motivation more prominently. We are a little confused regarding the unclear motivation issue you mentioned. We would appreciate it if you could clarify it more precisely.

Q3: "The evaluation primarily focuses on performance metrics; additional analysis on the robustness and adaptability of the model in diverse scenarios could provide more insights."

A3: Thanks for the suggestion. The online evaluation return is the only optimization objective in the problem formulation of offline (MA)RL. Thus, we choose the return as the measure to compare with baseline algorithms. Regarding robustness, MADiff is not sensitive to hyperparameters. As listed in Appendix F.3, most of the hyperparameters are kept the same for all datasets. We only tune the reward discount, return scale, and diffusion horizons. Regarding adaptability, could you please specify what aspects of analysis or testing you would like to see?

Q4: "The motivation for using DMs to solve the extrapolation errors in offline MARL is unclear to me. Why using the DMs can solve it? What are the strengths of DMs compared to previous works?"

A4: As a similar question was raised by Reviewer joCT, please refer to A1 of the general response for our answer to this issue.

Q5: "The attention mechanism and DMs for single-agent reinforcement learning (RL) have been extensively studied in previous works. This paper appears to simply integrate these two methods into MARL. Could you please provide more details on how this approach differs from directly applying these methods in MARL settings?"

A5: Our contributions include but go beyond the model design; the most important one is that the same model structure handles various tasks (CTCE, CTDE, MATP) through flexible conditioning during evaluation and achieves superior results over baseline algorithms. We conducted ablation experiments on model design and case studies on teammate modeling with diffusion models in multiple environments, all of which are part of our contribution and are novel. We believe that our work, being the first to apply the diffusion model in multi-agent learning, is valuable. To better address your concern, could you specify what you mean by "directly applying these methods in MARL settings"? Thank you.

Q6: "How does MADIFF perform with an increasing number of agents in MARL systems? Additionally, I am very interested in how MADIFF addresses the challenges of adding a new agent to the existing MARL systems."

A6: As shown in Table 1 and Table 2, our experiments are done on various sets of MARL systems ranging from 2 to 10 agents, and achieve superior performances on most of them. Adding a new agent to the existing MARL systems is a different problem known as ad hoc teamwork [1], and is beyond the scope of our paper. To the best of our knowledge, there is no offline algorithm that tackles the ad hoc teamwork problem.

Q7: "Related works should discuss using diffusion for single-agent RL more."

A7: Thank you for your kind reminder. We have included an additional related work subsection to elaborate more on prior studies that use DMs in RL.

Q8: "It would be greater to add some discussions of real-world applications."

A8: MADiff has the potential to be applied in scenarios such as multi-robot collaboration, autonomous driving, and multi-player match data analysis (similar to our NBA experiments). We'll include a discussion of real-world applications in the paper.

References

[1] Chen, S., et al. Achieving the ad hoc teamwork by employing the attention mechanism.

Comment

Regarding the novelty, this paper appears to primarily involve training diffusion models on an offline MARL dataset and using these models to synthesize new data. This paradigm is already widely utilized in synthesizing offline single-agent datasets. Additionally, the application of the attention mechanism for cooperation remains unclear to me. Why does the attention mechanism work? Are there any other possible solutions? Why did you choose the attention mechanism here? This paper looks incremental to me.

Comment

Thank you for participating in the discussion phase.

Our work goes beyond merely porting DMs from single-agent learning to MARL datasets. By using flexible evaluation-time conditioning, MADiff effectively handles various MAL tasks (CTCE, CTDE, and MATP). Note that our experiments show that during decentralized execution, using a centrally trained DM to predict teammates' trajectories enhances each agent's decisions. This allows MADiff-D to outperform the model variant in which each agent generates its own trajectory, i.e., a direct application of single-agent DMs to MAL. Therefore, we consider our work a novel and non-trivial use of DMs in MAL, not an incremental one.

Q9: "Why does attention mechanism work?""Why you choose attention mechanism here?"

A9: We choose our model design, the U-Net with cross-agent attention, for three reasons:

  • Not all generated teammate trajectories are equally important for ego planning. The attention mechanism can learn to focus on interactions with important teammates, as demonstrated in Figure 1 of the supplement PDF.
  • We compare our model design with concatenating all agents' observations (ConcatDiff, Table 3 of the supplement PDF) and independently generating ego trajectories (Section 5.5), with notably better performance.
  • The attention modules are applied to the skip-connected features of the U-Net, which are shown to constitute high-frequency information. This allows the interactions among agents to happen on the most critical parts of the trajectory, most effectively steering the generation process.

Q10: "Are there any other possible solutions?"

A10: Our model design is not the only option. As mentioned in A9, we compared MADiff with other modeling choices and achieved better results. It is possible that some non-trivial modeling designs could surpass our current results. As the first to apply diffusion models in MAL, we will be thrilled if our work can inspire further research.

Thanks again for your time and effort in reviewing our paper!

Comment

Thank you for your responses.

I believe that the fundamental generality of generative models is a necessary requirement for the method. As the author claims, this paper is the first to apply diffusion models in MAL.

I recommend providing a more holistic picture of the problem and solution. This picture should include exploring the scope of this research and other solutions that do not work and then introducing current methodologies as a logical progression rather than merely stating, 'We propose...' Besides, in the experiments, please verify the other solutions as baselines to highlight the importance of the proposed method.

This revision may be challenging to complete in this interactive round. I would prefer to maintain my current score, but thank you again for your contributions. However, if you can accomplish the revision, I would be willing to improve my score.

Best,

Reviewer hh3p

Comment

Thank you for your quick response! We have carefully revised the abstract and introduction following your suggestions (see below), adding necessary content to provide readers with a holistic picture of the problem and solution.

Regarding the experimental section, since the other solutions that apply DMs in MAL are not published works, we prefer to keep the comparison of MADiff with other solutions within the ablation study subsection. In addition to the existing comparison with independent DMs, we have included comparisons with the model variant that uses a single U-Net to process concatenated multi-agent observations without using attention, which is denoted as ConcatDiff.

We have included the revised abstract and introduction in the following official comments, with important modifications bolded. Additionally, we have provided the related work subsection on DMs in single-agent learning mentioned in A7, and we hope it can offer a more comprehensive review of previous studies. The new ablation results are conducted on MAMujoco datasets and presented in the last official comment.

Please note that we have been working diligently over the past few days to finalize the experiments and incorporate your suggestions into the revisions. We hope these efforts contribute positively to the overall evaluation. Please feel free to raise any additional concerns or comments on our revised version!

Comment

Below is our revised abstract:

Offline reinforcement learning (RL) aims to learn policies from pre-existing datasets without further interactions, making it a challenging task. Q-learning algorithms struggle with extrapolation errors in offline settings, while supervised learning methods are constrained by model expressiveness. Recently, diffusion models (DMs) have shown promise in overcoming these limitations in single-agent learning, but their application in multi-agent scenarios remains unclear. Generating trajectories for each agent with independent DMs may impede coordination, while concatenating all agents' information can lead to low sample efficiency. Accordingly, we propose MADiff, which is realized with an attention-based diffusion model to model the complex coordination among multiple agents. To our knowledge, MADiff is the first diffusion-based multi-agent learning framework, functioning as both a decentralized policy and a centralized controller. During decentralized executions, MADiff simultaneously performs teammate modeling, and the centralized controller can also be applied in multi-agent trajectory predictions. Our experiments demonstrate that MADiff outperforms baseline algorithms across various multi-agent learning tasks, highlighting its effectiveness in modeling complex multi-agent interactions.

Comment

Below is our revised introduction section:

Offline reinforcement learning (RL) [Fujimoto et al., 2019, Kumar et al., 2020] learns exclusively from static datasets without online interactions, enabling the effective use of pre-collected large-scale data. However, applying temporal difference (TD) learning in offline settings causes extrapolation errors [Fujimoto et al., 2019], where target value functions are evaluated on out-of-distribution actions. Sequence modeling algorithms bypass TD-learning by directly fitting the dataset distribution [Chen et al., 2021, Janner et al., 2021]. Nevertheless, these methods are limited by the model’s expressiveness, making it difficult to handle diverse datasets. They also suffer from compounding errors [Xiao et al., 2019] due to autoregressive generation. Recently, diffusion models (DMs) have achieved remarkable success in various generative modeling tasks [Song and Ermon, 2019; Ho et al., 2020; Xu et al., 2022] owing to their exceptional ability to capture complex, high-dimensional data distributions. Their successes have also been introduced to offline RL, offering a superior modeling choice for sequence modeling algorithms [Janner et al., 2022, Ajay et al., 2023].

Compared to single-agent learning, offline multi-agent learning (MAL) has been less studied and is more challenging. Since the behaviors of all agents are interrelated, each agent is required to model interactions and coordination among agents, while making decisions in a decentralized manner to achieve the goal. Current MAL approaches typically train a centralized value function to update individual agents’ policies [Rashid et al., 2020] or use an autoregressive transformer to determine each agent’s actions [Meng et al., 2021, Wen et al., 2022]. However, without online interactions, an incorrect centralized value can lead to significant extrapolation errors, and the transformer can only serve as an independent model for each agent.

In this paper, we aim to study the potential of employing DMs to solve the above challenges in offline MAL problems. Merely adopting existing diffusion RL methods by using independent DMs to model each agent can result in serious inconsistencies due to a lack of proper credit assignment among agents. Another possible solution is to concatenate all agents’ information as the input and output of the DM. However, treating the agents as a single unified agent neglects essential properties of multi-agent systems. One agent may have strong correlations with only a few other agents, which makes a full feature interaction redundant. In many multi-agent systems, agents exhibit certain symmetry and can share model parameters for efficient learning [Arel et al., 2010]. However, concatenating them in a fixed order breaks this symmetry, forcing the model to treat each agent differently.

To address the aforementioned coordination challenges, we propose the first centralized-training-decentralized-execution (CTDE) diffusion framework for MA problems, named MADiff. MADiff adopts a novel attention-based DM to learn a return-conditional trajectory generation model on a reward-labeled multi-agent interaction dataset. In particular, the designed attention is computed in several latent layers of the model of each agent to fully interchange the information and integrate the global information of all agents. To model the coordination among agents, MADiff applies the attention mechanism on latent embedding for information interaction across agents. The attention mechanism enables the dynamic modeling of agent interactions through learned weights, while also enabling the use of a shared backbone to model each agent’s trajectory, significantly reducing the number of parameters. During training, MADiff performs centralized training on the joint trajectory distributions of all agents from offline datasets, including different levels of expected returns. During inference, MADiff adopts classifier-free guidance with low-temperature sampling to generate behaviors given the conditioned high expected returns, allowing for decentralized execution by predicting the behavior of other agents and generating its own behavior. Therefore, MADiff can be regarded as a principled offline MAL solution that not only serves as a decentralized policy for each agent or a centralized controller for all agents, but also includes teammate modeling without additional cost. Comprehensive experiments demonstrated superior performances of MADiff on various multi-agent learning tasks, including offline MARL and trajectory prediction.

Comment

In summary, our contributions are (1) the first diffusion-based multi-agent learning framework that unifies decentralized policy, centralized controller, teammate modeling, and trajectory prediction; (2) a novel attention-based DM structure that is designed explicitly for MAL and enables coordination among agents in each denoising step; (3) achieving superior performances for various offline multi-agent problems.

References

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? International Conference on Learning Representations, 2023.

Itamar Arel, Cong Liu, Tom Urbanik, and Airton G Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. IET Intelligent Transport Systems, 4(2):128–135, 2010.

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pages 2052–2062. PMLR, 2019.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. Advances in neural information processing systems, 34:1273–1286, 2021.

Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pages 9902–9915. PMLR, 2022.

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.

Linghui Meng, Muning Wen, Yaodong Yang, Chenyang Le, Xiyun Li, Weinan Zhang, Ying Wen, Haifeng Zhang, Jun Wang, and Bo Xu. Offline pre-trained multi-agent decision transformer: One big sequence model tackles all smac tasks. arXiv e-prints, pages arXiv–2112, 2021.

Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. The Journal of Machine Learning Research, 21(1):7234–7284, 2020.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.

Muning Wen, Jakub Kuba, Runji Lin, Weinan Zhang, Ying Wen, Jun Wang, and Yaodong Yang. Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35:16509–16521, 2022.

Chenjun Xiao, Yifan Wu, Chen Ma, Dale Schuurmans, and Martin Müller. Learning to combat compounding error in model-based reinforcement learning. arXiv preprint arXiv:1912.11206, 2019.

Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022.

Comment

Below is the additional related work subsection on DMs in single-agent learning:

There is a recent line of work applying diffusion models (DMs) to decision-making problems such as RL and imitation learning. Janner et al. [2022] design a diffusion-based trajectory generation model and train a value function to sample high-rewarded trajectories. A subsequent work [Ajay et al., 2023] takes conditions as inputs to the DM, thus bringing more flexibility to generate behaviors that satisfy combinations of diverse conditions. Another line of work [Wang et al., 2022, Hansen-Estruch et al., 2023, Kang et al., 2024] uses the DM as a form of policy, i.e., generating actions conditioned on states, and the training objective behaves as a regularization under the framework of TD-based offline RL algorithms. Different from the above, SynthER [Lu et al., 2024] adopts the DM to upsample the rollout data to facilitate learning of any RL algorithms. All of these existing methods focus on solving single-agent tasks. The proposed MADiff is structurally similar to Ajay et al. [2023], but includes effective modules to model agent coordination in MAL tasks.

References

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making? International Conference on Learning Representations, 2023.

Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. Idql: Implicit q-learning as an actor-critic method with diffusion policies, 2023.

Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. In International Conference on Machine Learning, pages 9902–9915. PMLR, 2022.

Bingyi Kang, Xiao Ma, Chao Du, Tianyu Pang, and Shuicheng Yan. Efficient diffusion policies for offline reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.

Cong Lu, Philip Ball, Yee Whye Teh, and Jack Parker-Holder. Synthetic experience replay. Advances in Neural Information Processing Systems, 36, 2024.

Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning. arXiv preprint arXiv:2208.06193, 2022.

Comment

In addition to the MPE experiments in Section 5.5, we compared MADiff with independent DMs in MAMujoco datasets. Since shared-parameter independent DMs performed significantly worse than non-shared ones in Section 5.5, here we compared MADiff (with shared U-Net parameters) to independent DMs without parameter sharing in the decentralized execution setting. The results are listed below:

| Dataset | MADiff-D | Independent DMs |
| --- | --- | --- |
| 2ant-Good | 2940 ± 58 | 2918 ± 131 |
| 2ant-Medium | 1210 ± 89 | 1078 ± 25 |
| 2ant-Poor | 902 ± 24 | 856 ± 86 |
| 2halfcheetah-Good | 8254 ± 179 | 8139 ± 452 |
| 2halfcheetah-Medium | 2215 ± 27 | 2026 ± 84 |
| 2halfcheetah-Poor | 751 ± 74 | 659 ± 89 |
| 4ant-Good | 3090 ± 26 | 3036 ± 70 |
| 4ant-Medium | 1679 ± 93 | 1543 ± 149 |
| 4ant-Poor | 1268 ± 51 | 945 ± 88 |

To validate the advantages of the attention mechanism over direct concatenation (ConcatDiff) in modeling multi-agent interactions, we conducted ablation experiments on MAMujoco datasets in the decentralized execution setting. When performing decentralized execution with ConcatDiff, we condition only on the dimensions corresponding to the ego agent in the current and historical observations, while the other dimensions are generated by diffusion. The results are listed below:

| Dataset | MADiff-D | ConcatDiff |
| --- | --- | --- |
| 2ant-Good | 2940 ± 58 | 2905 ± 98 |
| 2ant-Medium | 1210 ± 89 | 1152 ± 64 |
| 2ant-Poor | 902 ± 24 | 856 ± 59 |
| 2halfcheetah-Good | 8254 ± 179 | 8070 ± 355 |
| 2halfcheetah-Medium | 2215 ± 27 | 1982 ± 31 |
| 2halfcheetah-Poor | 751 ± 74 | 642 ± 91 |
| 4ant-Good | 3090 ± 26 | 2851 ± 316 |
| 4ant-Medium | 1679 ± 93 | 1439 ± 237 |
| 4ant-Poor | 1268 ± 51 | 1015 ± 97 |

Comment

Thanks for your revision.

I have raised my score.

Comment

Thank you again for valuable suggestions and recognition of our contributions. We will incorporate the above modifications and experimental results in the revised manuscript.

Official Review (Rating: 6)

This paper proposes a novel diffusion-based offline multi-agent learning framework called MADiff. It extends previous diffusion-based offline RL work to offline cooperative MARL, particularly in CTDE and centralized control settings. The main contribution is the inclusion of an attention module in the diffusion model for integrating information and facilitating coordination among agents during centralized training. MADiff can be applied in both centralized and decentralized execution settings and performs opponent modeling in decentralized execution. It is evaluated in various settings such as SMAC, MPE, and the NBA trajectory prediction dataset, and the experiments show its effectiveness against competitive baselines.

Strengths

  1. It extends the application of diffusion models to offline multi-agent RL and modifies the U-Net architecture for better coordination, addressing a significant problem with potential applications in multiple fields.
  2. MADIFF is a principled framework without complex components yet achieving effective performance, and its code is available for reproducibility.
  3. The proposed attention mechanism for information interchange among agents is suitable and important in offline MARL, as confirmed by the ablation study.

Weaknesses

The weaknesses of this paper include:

  1. The novelty of the work is somewhat limited, being seen as a simple extension of Decision Diffuser to the MARL setting with an additional attention layer.
  2. Writing quality needs improvement; in particular, a table explaining the mathematical symbols would help.
  3. It is unclear how sensitive MADiff's performance is to some critical hyperparameters, e.g.,

3.1) do you need to balance the weights of the two losses in Eq. (6)?

3.2) how many encode-decode layers are required to perform satisfactorily in different tasks?

  4. The self-containedness needs to be improved, and many symbols in the paper have not been explained properly (e.g., $\emptyset$, $\omega$, and $\beta * \emptyset$ in Eqs. (4) and (6)).

Questions

  1. How are “Decentralized policy and centralized controller” related to "centralize training decentralize execution"?

  2. What is the condition y during training in Algorithms 1 and 2? Is it the trajectory return or something else? It doesn't seem to be mentioned in the text.

  3. Please check whether the Classifier-free guided generation part in Equations 4 and 6 is represented correctly or explained intuitively. Although it is known that these equations are similar to (Ajay, 2023), the paper should ensure its self-containedness.

3.1) Is $\omega$ a scalar or a network taking $\tau$, $k$, etc. as input in Eq. (4)?

3.2) Why is there a $\beta$ before $\emptyset$? It seems that this term makes no difference.

Overall, the readability of this paper should be improved and intuitive descriptions should be included.

Limitations

Limitations on scalability to many agents and applicability in highly stochastic environments have been discussed.

Author Response

Q1: "The novelty of the work is somewhat limited, being seen as a simple extension of Decision Diffuser to the MARL setting with an additional attention layer."

A1: It is true that many successful design choices in single-agent diffusion learning inspired us and guided the derivation of our method. Similarly, much of Decision Diffuser (DD) [1] also drew on the design of Diffuser [2], but that does not take away from the fact that DD is excellent work.

Also, many of the well-acknowledged multi-agent RL methods inherit a lot from single-agent RL in both online (MADDPG, MAPPO) and offline (OMAR, MADT) settings. MADiff is the first to apply diffusion models in multi-agent settings, naturally designed for multi-agent problems and taking multi-agent coordination, opponent modeling, and trajectory planning in a unified framework.

Q2: "Writing quality needs improvement, especially to a table to explain the mathematical symbols would help."

A2: Thanks for pointing it out. We have made a table that lists and explains all the mathematical symbols used. Due to the space limit, we are unable to include it in the supplement PDF. We will add the table in the Appendix of our paper.

Q3: "Is it unclear how sensitive is the MADiff's performance to some critical hyperparameters, ..."

A3: Thanks for the question.

  • The diffusion loss and inverse dynamics loss are independent and used to train different model parameters, so one does not need to balance between them.
  • We did not tune the number of encoder and decoder layers for each experiment setting. We use three encoder layers and three decoder layers for all our experiments. Intuitively, for tasks with smaller observation dimensions (e.g., MPE), a smaller number of network layers may suffice.

Q4: "How are 'Decentralized policy and centralized controller' related to 'centralize training decentralize execution'?"

A4: Our proposed algorithm has two variants depending on the use case. Both MADiff-C and MADiff-D are trained in a centralized way.

  • MADiff-C: If the task setting allows for a centralized controller to set actions of all agents, the diffusion model can be used to generate trajectories for all agents conditioned on all agents' current observations. This is the "centralized training with centralized execution" case.
  • MADiff-D: If the agents are required to make decisions independently using a decentralized policy based on their local observations, the diffusion model is used to generate ego trajectory along with predictions of other teammate agents. This is the "centralized training with decentralized execution" case.

Q5: "What is condition y during training in Algorithm 1 and 2?"

A5: The condition $y$ refers to conditioning information other than the historical (and current) observations used in trajectory generation.

  • In offline MARL experiments, the condition $y$ is the cumulative discounted reward in the offline dataset. During online evaluation, it is set to a relatively large value to generate high-rewarded behaviors.
  • In MATP experiments on the NBA dataset, the condition $y$ includes the ball's historical trajectories, player IDs, and a binary variable indicating the side of each player's frontcourt, as described in Lines 256-257.

Q6: "Please check whether the Classifier-free guided generation part in Equations 4 and 6 is represented correctly or explained intuitively. Although it is known that these equations are similar to (Ajay, 2023), the paper should ensure its self-containedness."

A6: Thanks for your suggestion! Regarding your questions:

  • As mentioned in A3 to Reviewer vCUA, there is a typo in Equation 4; the correct form is $\hat{\epsilon} := \epsilon_\theta(\hat{\tau}_k, \emptyset, k) + \omega\big(\epsilon_\theta(\hat{\tau}_k, y(\tau), k) - \epsilon_\theta(\hat{\tau}_k, \emptyset, k)\big)$, where $\omega$ is a scalar that balances the conditioned and unconditioned model outputs. If $\omega$ is set to 1, it reduces to pure conditioned generation. Prior diffusion model studies reveal that an $\omega$ larger than 1 can yield better condition quality [3], as it explicitly forces the sample to stay away from unconditioned regions.
  • The empty set $\emptyset$ means the model is not conditioned on any extra information $y$. In practice, it is implemented by zeroing out the embedding of $y$.
  • $\beta$ is not a constant but a random variable with a Bernoulli distribution (we choose $p(\beta = 1) = 0.25$ for all our experiments). This means that during training, the embedding of $y$ has a probability of 0.25 of being masked with all zeros. Therefore, the trained diffusion model can provide both conditioned and unconditioned outputs in Equation 4 with the same set of parameters (see the sketch below).
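A minimal sketch of how the condition dropout and guided sampling fit together (tensor shapes and function names are illustrative assumptions, not the authors' code):

```python
# Sketch: classifier-free guidance with condition dropout.
import torch

def masked_condition(y_embed: torch.Tensor, p_mask: float = 0.25) -> torch.Tensor:
    # beta ~ Bernoulli(p_mask): when beta = 1 the condition embedding is zeroed out,
    # so the same network also learns the unconditioned noise prediction.
    beta = torch.bernoulli(torch.full((y_embed.shape[0], 1), p_mask))
    return (1.0 - beta) * y_embed

def guided_eps(eps_model, traj_k, k, y_embed, omega: float = 1.2):
    # omega = 1 recovers pure conditioned generation; omega > 1 strengthens the condition.
    eps_uncond = eps_model(traj_k, torch.zeros_like(y_embed), k)  # epsilon(tau_k, empty, k)
    eps_cond = eps_model(traj_k, y_embed, k)                      # epsilon(tau_k, y(tau), k)
    return eps_uncond + omega * (eps_cond - eps_uncond)
```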

We have revised the explanation of the above symbols in the paper to ensure sufficient self-containedness.

References

[1] Ajay, Anurag, et al. Is Conditional Generative Modeling all you need for Decision Making?.

[2] Janner, Michael, et al. Planning with Diffusion for Flexible Behavior Synthesis.

[3] Ho, J., & Salimans, T. Classifier-Free Diffusion Guidance.

Comment

Thanks for the authors' response. I have read the rebuttal and would like to maintain my score.

Comment

Thank you for taking the time to read our response! We appreciate your feedback and hope that our response has addressed your concerns. We are more than happy to discuss further if you have any additional questions or need further clarification.

Official Review (Rating: 6)

The paper introduces MADiff, a generative multi-agent learning framework that leverages diffusion models (DMs) to address coordination challenges in multi-agent scenarios. It extends the previous work of using DMs for single-agent decision-making tasks, where the direct application to multi-agent problems is limited due to the complexity of inter-agent interactions. MADiff bridges this gap by using an attention-based diffusion model to coordinate behaviors among multiple agents. This model operates both as a decentralized policy and a centralized controller, enabling it to generate desired trajectories during online execution while considering multi-agent interactions and teammate modeling effectively.

Strengths

  • The idea of the paper is straightforward and easy to understand.
  • The proposed framework is extensively tested on multiple environments and datasets, demonstrating its robustness and superior performance compared to baseline methods.
  • The results provide detailed and interpretable analysis of the important components of the paper: teammate modeling and the attention layer for inter-agent communication.

Weaknesses

  • The proposed method is mostly evaluated in domains with a small number of agents (up to 8), and MADiff with decentralized execution requires each agent to predict all agents' trajectories, which is difficult to apply at larger scales.
  • It would be beneficial to provide some discussion about how attention helps with teammate modeling for decentralized execution.
  • The clarity of Section 2.3 could be improved. For example, what is $w$ in Equation (4)?
  • The subscripts used in Equations (7) and (8) are not consistent with the ones in Figure 1 ($\tilde{x}^i_{K, t+1}$, $o^i_{t+H}$).

Questions

  • In decentralized execution with teammate modeling, agent $i$ cannot access other agents' current observations. In this case, the imagined agent $j$'s current state $\tilde{x}^j_{K, t}$ could be very different from the ground-truth current observation $o^j_t$. How will this gap influence the performance of the ego agent? Some visualizations of the heatmap of the attention layer for communication and interaction among agents might be helpful.
  • Following the previous question, is the centralized training procedure exactly the same for both centralized control and decentralized execution with teammate modeling?
  • In Section 5.4, how do you define whether two paths are consistent?
  • In Lines 307 and 308, it is said that "the red agent and the purple agent generate inconsistent plans," but I think the inconsistent plans are generated by the purple and green agents (i.e., planning agents), right?
  • How will the planning horizon $H$ affect the performance?

Limitations

  • This method is hard to scale to problems with a large number of agents.

Author Response

Q1: "The proposed method is mostly evaluated across domains with a small number of agents (up to 8) and MADiff with decentralized execution requires each agent to predict all agents' trajectories, which is difficult to be applied to larger scales."

A1: Thanks for your comment!

  • Most existing works in offline MARL do not consider tasks with a large number of agents. For example, MADT-KD [2] and OMAR [1] are tested on environments with up to 8 agents. Meanwhile, our MATP experiments are conducted on NBA datasets with 10 agents.
  • Although predicting all agents' trajectories introduces additional computational cost, the computation time does not increase much with the number of agents thanks to GPU-accelerated computing, as shown in Appendix Table 6.
  • When the number of agents is extremely large, teammate agents can be modeled by a latent diffusion model [3] with the agent dimension reduced, thus avoiding the curse of dimensionality. We leave this direction to future work.

Q2: "It would be beneficial to provide some discussions about how attention help with the teammate modeling for decentralized execution."

A2: In a decentralized execution setting, although different agents make decisions independently, they can use the same attention network to jointly infer other agents' future trajectories and plan their own trajectories. Since the attention network is trained in a centralized manner, all agents' trajectories should be reasonable, and tend to be consistent given enough information. Intuitively, it requires the ego agent to first think at a higher level, i.e., make a coordinated plan for all agents, and then place itself in that plan. We will include more discussion of the role of attention modules in CTDE settings in our paper.
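A minimal sketch of one such decentralized decision step (the sampler and model interfaces below are hypothetical names used for illustration, not the released MADiff code):

```python
# Sketch: one decentralized decision of ego agent i with teammate modeling.
import torch

def madiff_d_step(diffusion, inv_dyn, ego_obs_history, target_return, agent_id: int):
    # Generate a joint plan for all agents, conditioned only on the ego agent's local
    # history and a high target return (hypothetical sampler interface).
    joint_plan = diffusion.sample(cond_obs=ego_obs_history, cond_return=target_return)
    # joint_plan: (N agents, horizon, obs_dim); rows other than agent_id are the ego
    # agent's predictions of its teammates' trajectories (teammate modeling).
    obs_now, obs_next = joint_plan[agent_id, 0], joint_plan[agent_id, 1]
    return inv_dyn(obs_now, obs_next)   # recover the ego action from consecutive observations
```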

Q3: "The clarity of section 2.3 could be improved. For example, what is ww in Equation (4)?"

A3: Thanks for the careful check! Equation 4 should be $\hat{\epsilon} := \epsilon_\theta(\hat{\tau}_k, \emptyset, k) + \omega\big(\epsilon_\theta(\hat{\tau}_k, y(\tau), k) - \epsilon_\theta(\hat{\tau}_k, \emptyset, k)\big)$.

Q4: "The subscripts used in Equation (7) and (8) are not consistent with the ones in Figure 1."

A4: We have revised the subscripts in Figure 1 to be the same as those in Equation 7 and 8.

Q5: "In decentralized execution with teammate modeling, agent cannot access other agents' current observations ... Some visualizations of the heatmap of the attention layer for communication and interaction among agents might be helpful."

A5: Great question, thanks! In decentralized execution, what matters most for teammate modeling is reasonableness given local information, not correctness (which is hard to achieve). Figure 2 shows that the purple agent's trajectory predicted by the red agent differs from the purple agent's own plan. However, the three agents still cover different landmarks in the red agent's imagination, which is valid for completing the teamwork. The model also has the ability to adjust its prediction when more information from other agents is observed later.

Regarding the potential problem you mention, the most extreme case is when some teammates are completely outside the visible range, at which point their imagined current observations could be highly stochastic and the model could not use those estimated trajectories to make decisions. We verified on SMACv2 that the attention mechanism mitigates this problem. Figure 1 of the supplement PDF shows the attention scores in two environment states, and we found that the teammates outside the visible range do have lower attention scores.

Q6: "Is the centralized training procedure exactly the same for both centralized control and decentralized execution with teammate modeling?"

A6: Yes, the centralized training procedure is the same for CTCE and CTDE.

Q7: "In section 5.4, how do you define two paths are consistent?"

A7: In Section 5.4, consistency is defined on plans made by different agents. Two agents' plans are consistent if the ego trajectories in both plans cover different landmarks. Therefore, if inconsistent plans are executed without runtime correction, two agents will cover the same landmark, and the 'Spread' task will fail. We have revised the corresponding part of the paper for better clarity.

Q8: "In Line 307 and 308, it is said that 'the red agent and the purple agent generate inconsistent plans,' but I think the inconsistent plans are generated by purple and green agents (i.e., planning agents) right?"

A8: In Figure 2, at t = 0, the plans (dashed lines) in the first and second plots are generated by the red and purple agents, respectively (the planning agents are represented by triangular marks). The ego trajectories (starting from triangular marks) are both heading towards the middle landmark. Thus, the red and purple agents generate inconsistent plans.

Q9: "How will planning horizon HH affect the performance?"

A9: We did not carefully tune the planning horizon; it is chosen per environment and kept the same for datasets from the same environment. The choice of planning horizon is a tradeoff between planning ability and computational efficiency. A short planning horizon cannot take full advantage of diffusion planning, while an overlong planning horizon requires more computing resources and potentially more training data. We report the performance of MADiff-C with different planning horizons on the SMAC 3m-Medium dataset in Table 2 of the supplement PDF.


Comment

I sincerely thank the authors for their responses, additional results, and visualizations. My major concerns have been addressed, and I will be raising my score.

Comment

Thank you for your thoughtful review and for taking the time to consider our additional results and visualizations. We are pleased that our revisions have addressed your concerns and appreciate your positive feedback.

Author Response

We want to express our sincere gratitude to all the reviewers for their valuable feedback and insightful comments. We appreciate the recognition of our paper's strengths, highlighted by the reviewers: novelty (nRbJ, hh3p), superior / effective performance (all reviewers), and extensive experiments and ablations (vCUA, hh3p).

We have uploaded a one-page PDF file with additional results. Specifically, Table 1 reports the win rates of MADiff and MADT on SMAC datasets. Table 2 contains an ablation study on different planning horizons on the SMAC 3m Medium dataset. In Table 3, we compare MADiff to a model variant, denoted ConcatDiff, that uses a single U-Net to process concatenated multi-agent observations without attention. Figure 1 illustrates the attention scores assigned to other teammates from one agent's perspective, using its own U-Net embedding as the query, at two different timesteps in SMACv2.

As Reviewers hh3p and joCT share a common question regarding how DMs avoid extrapolation errors in offline RL, we respond to it in this general response.

Q1: "The motivation for using DMs to solve the extrapolation errors in offline MARL is unclear to me. Why using the DMs can solve it? What are the strengths of DMs compared to previous works?" / "Can the authors elaborate on how DMs avoid extrapolation errors in offline RL?"

A1: The term "extrapolation error" in the context of offline MARL refers specifically to the estimation error induced by the target policy selecting an out-of-distribution action $a'$ when computing $Q(s', a')$ in offline Q-learning [1, 2]. Since using DMs as trajectory planners does not involve Q-learning, this extrapolation error does not arise in the first place.
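For clarity, the sketch below marks where this error enters a standard offline TD update; `q_net`, `target_q_net`, and `policy` are generic placeholders rather than components of any particular baseline.

```python
import torch

def offline_td_loss(q_net, target_q_net, policy, s, a, r, s_next, gamma=0.99):
    """Standard TD(0) loss on offline data. The bootstrapped action
    a' = policy(s') may be out-of-distribution for the dataset, so
    Q(s', a') is queried where it was never trained -- this is the
    extrapolation error. Diffusion planners never form this target."""
    with torch.no_grad():
        a_next = policy(s_next)                       # possibly OOD action
        td_target = r + gamma * target_q_net(s_next, a_next)
    return ((q_net(s, a) - td_target) ** 2).mean()
```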

Other algorithms, such as decision transformers, similarly solve offline RL without Q-learning. DMs achieve superior performance mainly for two reasons:

  • As powerful generative models, DMs are well suited to modeling potentially diverse, mixed-quality offline RL datasets.
  • DMs handle high-dimensional data well, which enables them to generate multiple future steps simultaneously; this yields better temporal consistency and fewer compounding errors [3] (a minimal sampling sketch follows below).
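To illustrate the second point, the sketch below shows a generic DDPM ancestral-sampling loop that refines an entire horizon-length trajectory at every denoising iteration; it reuses the `guided_epsilon` helper from the earlier sketch, and the variance schedule and shapes are illustrative assumptions rather than the exact MADiff sampler.

```python
import torch

def denoise_trajectory(eps_model, cond, horizon, obs_dim, betas, omega=1.2):
    """DDPM ancestral sampling over a whole (horizon x obs_dim) plan: all
    future steps are refined together at each denoising iteration, which is
    what yields the temporal consistency discussed above."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    tau = torch.randn(horizon, obs_dim)                      # start from pure noise
    for k in reversed(range(len(betas))):
        eps_hat = guided_epsilon(eps_model, tau, cond, k, omega)
        mean = (tau - betas[k] / torch.sqrt(1.0 - alpha_bars[k]) * eps_hat) / torch.sqrt(alphas[k])
        tau = mean + torch.sqrt(betas[k]) * torch.randn_like(tau) if k > 0 else mean
    return tau
```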

In the following responses, we address each reviewer's comments and provide detailed explanations. We are willing to engage in further discussion regarding specific issues.

References

[1] Fujimoto, S., et al. Off-policy deep reinforcement learning without exploration.

[2] Fujimoto, S., and Gu, S. S. A minimalist approach to offline reinforcement learning.

[3] Janner, M., et al. Planning with diffusion for flexible behavior synthesis.

Comment

Dear Reviewers,

Before the author-reviewer discussion period ends, please make sure that you have read the author responses, acknowledge that your review reflects them, and engage with the authors if necessary.

Thank you!

Comment

We would like to extend our sincere appreciation to the reviewers for their evaluations and invaluable feedback throughout the discussion period. We are deeply grateful to the area chair for guiding the discussion process.

We are pleased that the discussions have addressed the reviewers' concerns, leading to improved ratings from 3 out of 4 reviewers. In response to the feedback, we have implemented several changes aimed at clarifying the motivation behind our model design, enhancing the readability of the paper, and providing more comprehensive experimental results:

  1. As suggested by Reviewer hh3p, we have thoroughly revised the abstract and introduction and added a new related-work subsection to give readers a holistic picture of the problem and our solution. This revision strengthens the logical progression toward our MADiff proposal. Many thanks to Reviewer hh3p!

  2. To further emphasize MADiff's contribution in introducing DMs to MAL, we compare it against independent DMs and ConcatDiff and highlight their limitations in the revised abstract and introduction. The ablation results (Table 3 in the attached PDF and the responses to Reviewer joCT) and visualizations (Figure 1 in the attached PDF) clearly demonstrate MADiff's superiority over these alternative design choices.

  3. We have conducted a thorough review of the symbols used in the paper to ensure consistency and clarity. A table has been added to the appendix to explain all symbols, facilitating a better understanding for the reader.

  4. We have expanded our experimental evaluation by reporting win rates in SMAC and increasing the number of random seeds. The results confirm that MADiff maintains a clear advantage over MADT. We are in the process of completing additional experiments, including results with more random seeds in other environments and the win rates of other baselines in SMAC, which will be incorporated into the next version of the paper.

We believe that the revisions and additional experiments have substantially strengthened our paper, addressing the key concerns raised by the reviewers. We once again thank the reviewers and the area chair for their thoughtful feedback and guidance, which have been instrumental in refining our paper.

Sincerely,

Paper 4655 authors

Final Decision

This paper presents MADIFF, an offline multi-agent RL framework that uses diffusion models to address the challenge of complex coordination among agents. MADIFF utilizes an attention-based diffusion model that can simultaneously function as a decentralized policy and a centralized controller, enabling effective modeling of multi-agent interactions. Its CTDE approach supports multi-agent trajectory prediction and teammate modeling without additional cost. Experimental results demonstrate that MADIFF outperforms existing baseline algorithms across various multi-agent learning tasks, confirming its effectiveness in offline multi-agent coordination. The paper highlights the framework's robustness and applicability in environments requiring both centralized control and decentralized execution.

Most of the weaknesses and questions were resolved through revisions during the author-reviewer discussion period. The authors are highly recommended to incorporate these revisions into the final version of the paper.