INS: Interaction-aware Synthesis to Enhance Offline Multi-agent Reinforcement Learning
We propose an interaction-aware approach to synthesize high-quality datasets for offline MARL.
Abstract
Reviews and Discussion
In this paper, the authors propose INS, a novel multi-agent reinforcement learning (MARL) data synthesizer aimed at enhancing the performance of offline MARL algorithms. INS leverages a diffusion model to generate synthetic data and employs a sparse attention architecture and a value-based selection mechanism to improve the data quality. Compared to the naive extension of a single-agent diffusion-based data synthesizer to multi-agent tasks (MA-SynthER), INS demonstrates significant improvements.
Strengths
- The paper is clearly presented and easy to follow. The idea is simple yet effective.
- The authors made several incremental improvements (Bit Action, Sparse Attention, Value-Based Selection) to progressively transform SynthER into INS. The motivation and methods for each improvement are convincing and predictably effective.
- In the ablation studies, the authors use many figures and tables to illustrate the contribution of each improvement and explain the reasons behind these gains. This level of detail provides valuable insights for future research.
Weaknesses
- In my view, the novelty is somewhat limited. INS is essentially an improved version of SynthER in the MARL context, and the ideas of sparse attention and value-based selection are neither difficult to conceive nor novel, as these methods have already been widely applied.
- The experimental results are somewhat disappointing. First, many results in Table 1 do not show significant improvements over MA-SynthER. For example, in the SMAC-8m-good task, while the mean score improves by 0.1, the standard deviation is around 1.0, making the increase in the mean unconvincing. Second, SynthER has demonstrated its ability to significantly improve the sample efficiency of Online RL in the experiments. Including similar experiments for INS would help enrich the paper’s content and better showcase its capabilities.
Questions
- See Weaknesses.
Thank you for taking the time to read our paper and for providing useful comments. We carefully address your concerns as follows:
Q1: What are the differences between INS and SynthER?
A1: Thank you for your concern. While SynthER has indeed inspired some design choices in our method, it primarily focuses on single-agent scenarios. In contrast, INS is specifically designed for multi-agent settings. We introduce a sparse attention-based transformer encoder to model agent interactions more accurately while reducing computational overhead. Additionally, we incorporate a value-based selection mechanism to improve the quality of the synthesized dataset and use bit action to handle discrete actions. These innovations are supported by extensive evaluations of both data quality metrics and trained policy performance.
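For intuition, here is a highly simplified sketch of such an agent-token denoiser in PyTorch. It is an illustration rather than our exact implementation: standard softmax attention is used as a stand-in for the sparsemax attention described in the paper, and all names and dimensions are illustrative.

```python
# Simplified illustration only: each agent's noisy transition is embedded as a
# token, a transformer encoder mixes information across agents, and the head
# predicts the per-agent denoising target. Softmax attention stands in here
# for the sparse (sparsemax) attention used in the paper.
import torch
import torch.nn as nn

class AgentTokenDenoiser(nn.Module):
    def __init__(self, transition_dim, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(transition_dim, d_model)        # per-agent transition -> token
        self.time_embed = nn.Linear(1, d_model)                # diffusion-step embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # cross-agent interaction
        self.head = nn.Linear(d_model, transition_dim)         # per-agent noise prediction

    def forward(self, noisy_transitions, t):
        # noisy_transitions: (batch, n_agents, transition_dim); t: (batch, 1)
        tokens = self.embed(noisy_transitions) + self.time_embed(t).unsqueeze(1)
        return self.head(self.encoder(tokens))
```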
We would like to respectfully emphasize that, while some of these techniques may have been applied in other domains, we believe their application to multi-agent data synthesis is novel and has been experimentally validated to be effective. Similar MARL works that adapt successful single-agent techniques and demonstrate their effectiveness include MADDPG, MAPPO, MADT, MADiff, CFCQL, and MAZero [1-6]. We believe such works are worth discussing because they bring a new perspective or problem to the community, and we believe our paper does the same.
Q2: The experimental results do not show significant improvements over MA-SynthER.
A2: Thank you for your comment. It is true that the improvements brought by the synthetic dataset may not appear as "significant" as those achieved by improving the policy algorithms themselves. Moreover, the choice of policy algorithm can significantly affect the extent of improvements; for example, the suboptimal performance of MA-BCQ may lead to less noticeable gains from the dataset. In all of our experiments, we used 8 random seeds for each setting, and we believe that INS achieved consistently better results in SMAC. Additionally, compared to MA-SynthER, INS demonstrates broader improvements across most scenarios (56/63 better than the original dataset), while MA-SynthER underperforms the original dataset in most cases (only 28/63 better), highlighting the performance advantage of INS over MA-SynthER.
Q3: Could you add experiments in online scenarios similar to those in SynthER?
A3: Thank you for your suggestion regarding experiments in online scenarios. The primary motivation of our work is to address the challenge of data scarcity in offline MARL and the unique difficulties of synthesizing multi-agent data. Therefore, we believe that evaluating performance in online settings is somewhat beyond the scope of this paper. However, as the reviewer mentioned, the data generation capabilities of INS could be beneficial for some online RL algorithms that rely on experience buffers. To address this concern, we have conducted preliminary evaluations of INS in online MARL scenarios, using an update-to-data (UTD) ratio of 20, following a setup similar to SynthER's [7]. Due to time and computational constraints, we only carried out an initial evaluation in two typical scenarios for 1M steps: MADDPG in MPE and MAPPO in SMAC.
| Scenario | Raw Algorithm | INS-Enhanced Algorithm |
|---|---|---|
| Tag (MADDPG, score) | 13 | 82 |
| 5m_vs_6m (MAPPO, win rate) | 0% | 35% |
The initial results indicate a clear improvement in sample efficiency when combining INS with online MARL. If these results are well-received, we plan to use the time before the camera-ready submission to further discuss the potential of applying INS to online MARL, as well as to conduct more comprehensive online experiments. We greatly appreciate your valuable feedback and your contribution to improving the quality of our paper.
Thank you once again for your valuable feedback. We hope our response has satisfactorily addressed your concerns. If you find that we have addressed your concerns, we kindly hope you reconsider your rating. If you need further elaboration or additional points to include in the response, we welcome further discussion to ensure everything is clear and satisfactory.
Reference:
[1] Multi-agent actor-critic for mixed cooperative-competitive environments, NeurIPS 2017.
[2] The surprising effectiveness of ppo in cooperative multi-agent games, NeurIPS 2022.
[3] Offline pre-trained multi-agent decision transformer, Machine Intelligence Research 2023.
[4] Counterfactual conservative Q learning for offline multi-agent reinforcement learning, NeurIPS 2023.
[5] MADiff: Offline Multi-agent Learning with Diffusion Models, NeurIPS 2024.
[6] Efficient Multi-agent Reinforcement Learning by Planning, ICLR 2024.
[7] Synthetic Experience Replay, NeurIPS 2023.
As the rebuttal period is coming to an end, we would like to thank you again for your valuable feedback on our work. We hope that our clarifications, together with the additions to the revised manuscript, have addressed your concerns. Assuming this is the case, we would like to ask if you would be willing to update your review score. Otherwise, please let us know if you have any further questions. Thanks a lot.
Thank you for your response. I believe the community does need a multi-agent diffusion-based data synthesizer. INS provides insights on how to make SynthER more effective in a multi-agent setting. Although I still think the novelty of INS is somewhat lacking, its contribution and the additional experiments conducted by the authors make me feel that it is still a paper worth accepting. I will raise my score to 6.
We sincerely thank you for your recognition and for increasing your score! We really appreciate it!
The constructive suggestions you gave during the rebuttal session are greatly helpful in improving the quality of our paper. Thanks for your time and hard work!
The paper makes a significant high-level contribution by extending data synthesis from single-agent reinforcement learning (RL) to multi-agent reinforcement learning (MARL), which is an important advancement. However, I have some concerns and suggestions.
Strengths
The paper represents the first attempt to apply data synthesis techniques to multi-agent reinforcement learning (MARL). The experiments are carefully designed and provide strong support for the proposed method. Additionally, the writing is clear and accessible, making the paper easy to follow.
Weaknesses
See questions.
Questions
1. Clarity of Motivation: The detailed motivation for addressing the unique challenges of data synthesis in MARL seems somewhat unclear. While agent interactions are indeed a key issue, other challenges such as partial observability and environmental non-stationarity also play significant roles. I encourage the authors to delve deeper into these challenges and explain how their proposed method tackles them.
2. Modeling Agent Interactions: Relying solely on an attention mechanism to model agent interactions may be overly simplistic. In the MARL domain, modeling agent relationships is a complex and standalone research topic. Would employing more sophisticated methods for agent relationship modeling enhance the model's effectiveness?
3. Generation of Global States: I noticed that the generated data includes joint actions and joint observations but seemingly not the global state. How is this generated data applied to value decomposition models like QMIX that require global state information?
4. Consistency with Environment Dynamics: How does the method ensure that the generated data conforms to the environment's dynamic model? If the generated state transitions are invalid, could this negatively impact the pre-training process? I recommend including theoretical analysis or experimental results to demonstrate the validity and effectiveness of the generated data.
5. Computational Efficiency: Utilizing diffusion models in MARL might introduce computational efficiency challenges. Could the authors provide experimental data or analysis to illustrate the computational efficiency of the proposed method?
Overall, the research shows potential, but addressing the above issues is essential to enhance the credibility and impact of its contributions.
After submitting my official review, I noticed that the novelty of this paper may not be as strong as described in the introduction. The introduction states that INS is the first data synthesis method for offline MARL. However, after studying this domain further, I find that this contribution may not be entirely valid. For example, MADiff also utilizes diffusion models to generate offline MARL datasets. Although the authors cite these works in the related work section, they do not deeply discuss how INS innovates beyond these existing methods. A more thorough discussion and a redefinition of the novelty in the introduction would help readers better understand the unique contributions of the paper.
Thank you for your thorough review and patience. MADiff [1] models a planner as a return-conditional diffusion model for decision-making in multi-agent systems. At each time step, MADiff generates an observation trajectory based on the agents' current observations and then uses only $o_t$ and $o_{t+1}$, along with an inverse dynamics model, to derive the action $a_t$. After carefully reviewing the MADiff paper, we believe that MADiff is a planner rather than a data synthesis method, for the following reasons:
- Theoretical Perspective: We define data synthesis in reinforcement learning as the process of generating data that can be used for policy training, rather than directly engaging in decision-making (as a planner/policy) [2,3]. In two surveys on diffusion models for RL [2,4], MADiff is categorized as a planner (it is particularly worth noting that the first author of [4] is also the first author of MADiff). In contrast, INS and SynthER generate synthetic data without maximizing rewards or selecting actions, and should therefore be categorized under Data Synthesizers. Therefore, we believe there is a fundamental distinction between MADiff and our method. Moreover, MADiff takes current observations as input and generates an observation trajectory, but only uses the next observation $o_{t+1}$ to generate and execute actions, making it incapable of synthesizing a dataset from scratch for policy training. In contrast, our method requires no input and can generate complete transition datasets for use in offline reinforcement learning.
- Technical Perspective: In terms of the objects being modeled, MADiff only learns from the observation sequences in the dataset, whereas INS learns from the entire transition tuple $(\mathbf{o}, \mathbf{a}, r, \mathbf{o}')$. Regarding the generated outputs: MADiff produces an observation sequence to support decision-making via inverse dynamics, while INS generates multiple transitions as part of a dataset. As for the training process: MADiff uses classifier-free guidance, requiring the agent's current observation and team rewards as inputs, while our method requires no inputs and can synthesize a multi-agent dataset from scratch. A schematic contrast is sketched below.
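To make the contrast concrete, the two modelling targets can be summarized as follows (the notation is ours and only schematic, not taken verbatim from either paper):

```latex
% Schematic contrast (our notation): MADiff models a return-conditioned
% observation-sequence distribution for planning, with actions recovered by an
% inverse dynamics model; INS models an unconditional joint-transition
% distribution whose samples directly form a training dataset.
\begin{align*}
\text{MADiff (planner):} \quad & p_\theta\!\left(o_{t:t+H}^{1:N} \mid o_t^{1:N}, R\right),
\qquad a_t^i = f_\phi\!\left(o_t^i, o_{t+1}^i\right) \\
\text{INS (synthesizer):} \quad & p_\theta\!\left(\mathbf{o}, \mathbf{a}, r, \mathbf{o}'\right)
\quad \text{(no conditioning; samples form the synthetic dataset)}
\end{align*}
```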
The key differences between INS and MADiff are summarized in the following table:
| Method | Key Feature | Input | Output | Requires Real-State Observations? |
|---|---|---|---|---|
| MADiff [1] | Planner | Agents' current (or historical) observations | Observation trajectory (actions via inverse dynamics) | Yes |
| INS (ours) | Data Synthesizer | None | Complete transitions forming a dataset | No |
In conclusion, we appreciate your valuable suggestions for improving the paper. We have added a discussion on related work in the revised manuscript (Appendix D).
Reference:
[1] MADiff: Offline multi-agent learning with diffusion models, NeurIPS 2024.
[2] CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making, NeurIPS 2024.
[3] S2p: State-conditioned image synthesis for data augmentation in offline reinforcement learning, NeurIPS 2022.
[4] Diffusion models for reinforcement learning: A survey, arXiv preprint 2023.
Q5: How is the computational efficiency of INS?
A5: Thanks for your question. We have added experiments to evaluate the computational efficiency of INS. Specifically, we measured the time taken by INS to generate 1M transitions and compared it with the time required to collect 1M transitions by interacting with the environment using trained policies (both without parallelization).
| Scenario | INS | Collecting by agents |
|---|---|---|
| spread | 1540s | 2728s (MAPPO) |
| 3m | 2396s | 9243s (QMIX) |
The results show that INS is significantly more time-efficient than collecting data through direct interaction with the environment, while still ensuring high quality in previous experiments. We believe this demonstrates that synthesizing datasets is a promising trend for MARL, where generative methods like INS can be used to synthesize multi-agent datasets, reducing the challenges and time associated with data collection in real-world environments. We have added a computational efficiency comparison between INS and the previous method to the appendix of the revised manuscript (Appendix G.5).
Please let us know if we have addressed your concerns. We are more than delighted to have further discussions and improve our manuscript.
Reference:
[1] Multi-agent deep reinforcement learning: a survey, Artificial Intelligence Review 2022.
[2] Dealing with Non-Stationarity in MARL via Trust-Region Decomposition, ICLR 2022.
[3] Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning, ICML 2022.
[4] Shared Experience Actor-Critic for Multi-Agent Reinforcement Learning, NeurIPS 2020.
[5] Towards Understanding Cooperative Multi-Agent Q-Learning with Value Factorization, NeurIPS 2021.
[6] Offline multi-agent reinforcement learning with implicit global-to-local value regularization, NeurIPS 2023.
[7] Diffusion for World Modeling: Visual Details Matter in Atari, NeurIPS 2024.
[8] World Models via Policy-Guided Trajectory Diffusion, TMLR.
[9] Diffusion models are real-time game engines, arXiv preprint 2024.
Dear reviewer, thanks for your sincere review and advice! In response to your questions and concerns, we have made the following revisions and clarifications:
Q1: Does INS face other key challenges of MARL when synthesizing multi-agent datasets?
A1: Thank you for your comment. The motivation behind INS is to address the issue of data scarcity in offline MARL by synthesizing high-quality datasets, while also considering the challenge of accurately capturing agent interactions in multi-agent systems (L16). It is true that MARL often faces challenges such as partial observability and non-stationarity of the environment. Partial observability refers to agents not having access to global observations, which can impair their decision-making [1]. Non-stationarity, on the other hand, arises because agents' policies keep evolving during learning, making it harder for any one agent to learn an optimal policy [2]. However, it is important to note that these problems mainly affect the policy learning/decision-making phase rather than the data synthesis process. Furthermore, our approach is a centralized synthesis method, where transitions for all agents are synthesized together, which helps alleviate potential issues related to partial observability or non-stationarity of the environment during the synthesis process.
Q2: Would employing more complex methods for modeling agent interactions improve the model's performance?
A2: While we agree with the reviewer that more complex mechanisms for modeling agent interactions could potentially improve performance, our main focus is on evaluating whether advanced generative models like diffusion models can effectively address the data scarcity issue in offline MARL. Our findings show that even with relatively simple multi-agent modeling—such as using transformers with sparse attention—we can achieve sufficient accuracy in synthetic datasets (as demonstrated in the comparative evaluation and dataset quality analysis sections) for offline MARL training. Additionally, diffusion models are computationally intensive, and introducing more complex interaction modeling would significantly increase the computational burden. This is one reason why we did not pursue more intricate methods, as it would lead to a sharp rise in training and inference complexity. Given our goal of generating high-quality datasets, we believe the current trade-off between simplicity and effectiveness is appropriate. We see the exploration of more complex modeling approaches as a promising direction for future work.
Q3: How is this generated data applied to value decomposition models like QMIX that require global state information?
A3: INS can be extended to generate the global state by adjusting the input to the diffusion model. Specifically, instead of modeling joint local transitions using tuples of the form $(\mathbf{o}, \mathbf{a}, r, \mathbf{o}')$, we could model expanded tuples of the form $(s, \mathbf{o}, \mathbf{a}, r, s', \mathbf{o}')$ that additionally include the global state. This extension enables the generated data to be compatible with value decomposition models, such as QMIX, that require global state information.
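As a purely illustrative sketch (names and shapes are hypothetical, not our actual code), the only change needed is in how each joint transition is flattened before being passed to the diffusion model:

```python
# Illustrative sketch: flatten one joint transition into the vector the
# synthesizer denoises; optionally include the global state so that
# value-decomposition methods such as QMIX can consume the synthetic data.
import numpy as np

def build_transition_vector(obs, acts, reward, next_obs, state=None, next_state=None):
    # obs, acts, next_obs: arrays of shape (n_agents, dim); state/next_state: global states
    parts = [obs.ravel(), acts.ravel(), np.atleast_1d(reward), next_obs.ravel()]
    if state is not None and next_state is not None:
        # expanded tuple (s, o, a, r, s', o') instead of (o, a, r, o')
        parts = [np.ravel(state)] + parts + [np.ravel(next_state)]
    return np.concatenate(parts)
```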
Additionally, we would like to clarify that while many value decomposition algorithms condition the Q-value on the global state, there are also several algorithms that condition the Q-values only on local observations [2-6]. Thus, INS can also be applied directly to these value decomposition methods. We have added a more detailed discussion on generating the global state in the appendix of the revised version of the paper (Appendix K.1).
Q4: How does the method ensure that the generated data conforms to the environment's dynamic model?
A4: We train a diffusion model on real datasets to implicitly learn the environmental dynamics contained within the data. As a powerful generative model, diffusion models are well-suited to capturing the complex distributions inherent in interaction datasets [7-9].
To assess whether the model has learned the environment dynamics, we use the Dynamic MSE metric. This metric calculates the mean squared error (MSE) between the predicted next observation and the true next observation, providing a direct evaluation of how well the synthetic data conforms to the environment dynamics. As shown in our experiments (Figures 1, 3), INS outperforms other generative methods, such as additive noise, VAE, and MA-SynthER, in terms of environment dynamics accuracy. In the future, exploring ways to design explicit constraints to further enhance the learning of environment dynamics is an interesting direction. Thank you again for your insightful comment.
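For clarity, a minimal sketch of how such a Dynamic MSE check can be computed, assuming a dynamics model fit on the real dataset (function names are illustrative, not our exact implementation):

```python
# Illustrative sketch: score synthetic transitions against a dynamics model
# f(o, a) -> o' that was fit on the real dataset; lower is better.
import numpy as np

def dynamic_mse(predict_next_obs, synthetic_transitions):
    errors = []
    for obs, act, _reward, next_obs in synthetic_transitions:
        pred = predict_next_obs(obs, act)               # model trained on real data
        errors.append(np.mean((pred - next_obs) ** 2))  # per-transition squared error
    return float(np.mean(errors))
```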
Thank you for the response. After reading the authors' replies, most of my concerns and questions have been addressed. Therefore, I am inclined to improve the rating from 6 to 8.
We sincerely thank you for your recognition and for increasing your score! We really appreciate it!
The constructive suggestions you gave during the rebuttal session are greatly helpful in improving the quality of our paper. Thanks for your time and hard work!
This paper introduces INS, a method for synthesizing high-quality multi-agent datasets to enhance offline multi-agent reinforcement learning. INS leverages diffusion models and incorporates a sparse attention mechanism to capture inter-agent interactions, ensuring the synthetic dataset reflects the underlying agent dynamics. The method also includes a select mechanism to prioritize high-value transitions, improving dataset quality. The experimental results demonstrate the effectiveness of INS.
Strengths
- The paper is well-organized.
- Considering both dataset diversity and high-value transitions is valuable.
Weaknesses
- On Line 087, the authors claim that "INS is the first data synthesis approach for offline MARL." However, on Line 118, in the related works section, the authors mention that "recent studies extend diffusion models to the MARL domain, applying them to trajectory generation [1]." This statement suggests that the authors are not the first to engage in data synthesis work within the offline MARL domain. Furthermore, upon reviewing the relevant literature, the reviewer found additional papers that have also conducted data generation tasks in offline MARL [2]. Therefore, the reviewer believes that the authors have overclaimed their contribution. It is important to acknowledge the pioneering work in the field, and the authors should provide a more accurate representation of their contribution in relation to prior art.
[1] MADiff: Offline multi-agent learning with diffusion models.
[2] A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem.
- The authors claim to introduce a sparse attention mechanism to capture inter-agent interactions, utilizing the method sparsemax, which was originally proposed in the NLP field. However, upon reviewing the literature, the reviewer found that sparsemax has already been applied in the MARL domain [3,4], reducing the novelty of this aspect of the work. Furthermore, the formulation of the sparsemax (equations 3-6) shows overlap with [4]. The authors should acknowledge the existing use of sparsemax in MARL and clarify how their implementation differs from or improves upon previous applications. A detailed comparison with related works would strengthen the paper's contribution and novelty.
[3] SparseMAAC: Sparse Attention for Multi-agent Reinforcement Learning.
[4] Interaction pattern disentangling for multi-agent reinforcement learning.
- The authors propose the use of Bit Action as an alternative to one-hot action representation for action generation. However, the reviewer has a concern regarding the orthogonality of each action in the Bit Action representation. Non-orthogonal actions may lead to ambiguity in the action space, potentially affecting the outcomes.
- The selection proportion parameter in the authors' method lacks flexibility and requires tuning for different tasks.
- The synthetic method comparison is limited to SynthER. A more comprehensive comparison should be conducted, including both single-agent methods mentioned in the related works and multi-agent methods [1,2].
Questions
See Weaknesses.
Q5: How does INS compare to methods other than SynthER?
A5: Thank you for your comment. In the Preliminary evaluation section (L62), we have already compared INS with methods other than SynthER, such as Additive Noise and VAE Augmented. Both of these methods produced poor-quality datasets and policy performance, which is why they were excluded from the main experiments. As mentioned in A1, the two other works you referred to do not meet the requirements for synthesizing datasets suitable for offline MARL training. Based on your suggestion, we have made an effort to include a comparison with MTDiff-S [11], a multi-task single-agent data synthesis algorithm in related works. We implemented a multi-agent version called MA-MTDiff-S and compared the policy performance of MA-ICQ trained on datasets generated by INS and MA-MTDiff-S across multiple datasets:
| Dataset | INS | MA-MTDiff-S | Additive Noise | VAE Augmented |
|---|---|---|---|---|
| Spread (Expert) | 107.0 ± 4.4 | 103.2 ± 3.2 | 81.6 ± 2.9 | 90.1 ± 2.7 |
| Spread (Medium) | 30.2 ± 2.5 | 27.9 ± 2.1 | 19.9 ± 1.9 | 23.5 ± 1.8 |
| 3m (Good) | 19.9 ± 1.4 | 18.4 ± 0.3 | 10.5 ± 1.0 | 14.7 ± 1.1 |
| 3m (Medium) | 18.5 ± 1.8 | 18.0 ± 0.9 | 9.6 ± 0.9 | 11.4 ± 1.3 |
The experimental results demonstrate that our method outperforms MA-MTDiff-S, further showcasing the advantages of INS in synthesizing multi-agent datasets. Since MA-MTDiff-S is designed for multi-task scenarios and does not account for interactions between agents, its performance was suboptimal. Due to computational resources and time constraints, we plan to add more detailed comparisons with MA-MTDiff-S in the subsequent revised version.
Thank you once again for your valuable feedback. We hope our response has satisfactorily addressed your concerns. If you find that we have addressed your concerns, we kindly hope you reconsider your rating. If you need further elaboration or additional points to include in the response, we welcome further discussion to ensure everything is clear and satisfactory.
Reference:
[1] MADiff: Offline multi-agent learning with diffusion models, NeurIPS 2024.
[2] A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem, AAMAS 2024.
[3] CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making, NeurIPS 2024.
[4] S2p: State-conditioned image synthesis for data augmentation in offline reinforcement learning, NeurIPS 2022.
[5] Diffusion models for reinforcement learning: A survey, arXiv preprint 2023.
[6] Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, Machine learning proceedings 1990.
[7] Interaction pattern disentangling for multi-agent reinforcement learning, IEEE TPAMI 2024.
[8] SparseMAAC: Sparse attention for multi-agent reinforcement learning, DASFAA 2019.
[9] Off-the-Grid MARL: Datasets and Baselines for Offline Multi-Agent Reinforcement Learning, AAMAS 2023.
[10] Analog bits: Generating discrete data using diffusion models with self-conditioning, ICLR 2022.
[11] Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning, NeurIPS 2023.
Thank you for the response.
- Regarding the contribution to data synthesis, after reviewing the authors' response, I can agree with distinguishing between data synthesis and planning as separate categories. However, it's crucial for the authors to clearly define these terms in the introduction to prevent any misunderstanding among readers. Moreover, based on the authors' definition, the reviewer believes that the model-based MOMA-PPO qualifies as a form of data synthesis. The authors have clearly mentioned that MOMA-PPO involves repeating a process multiple times to generate a trajectory, which is then used to train MAPPO. This aligns with the definition of data synthesis as "generating diverse data that can be used to train policies." The difference from the original method lies only in the specific approach to data synthesis. Therefore, the reviewer believes that the authors have overstated their contribution by claiming that "INS is the first data synthesis approach for offline MARL".
Despite the reviewer's emphasis on the insufficient contributions, the authors have only made adjustments in the Appendix of the revised version regarding comparisons with existing work, while the main text remains unchanged. This could potentially mislead other readers in the community.
Comparison of Related Work in the Main Text: Thank you for agreeing with our clarification that data synthesis and planning belong to different categories. We have explicitly highlighted these comparisons in the revised manuscript. We sincerely appreciate your valuable feedback, which has contributed to improving the quality of our paper.
Explanation of the Differences between Data Synthesis and Model-Based Methods: We understand your concerns, and we would like to provide a more detailed explanation to address them:
- MOMA-PPO employs a world model as a simulator, which takes the current observation and action as input to predict the next observation and reward. This places it within the category of predictive models, $p(o', r \mid o, a)$. In such algorithms, the world model's objective is to learn an accurate prediction of the environment's dynamics. During each iteration, the world model requires real initial states or histories from the dataset to perform rollouts under the current policy. These rollouts are then used for further learning, and the process is repeated in subsequent iterations with updated policies. Moreover, model-based methods typically involve an iterative process of "rollout under a given policy to generate trajectories," which we believe follows the same paradigm as traditional data collection in the environment, except that the real environment is replaced by the world model. This distinguishes model-based methods from data synthesis approaches, which directly synthesize multi-agent data according to the dataset distribution.
- INS, on the other hand, does not rely on any additional inputs. It directly samples a set of independent transitions without requiring a policy or initial states, functioning as a generative model, $p(\mathbf{o}, \mathbf{a}, r, \mathbf{o}')$. Instead of explicitly learning the environment's dynamics, INS models the dataset distribution and generates diverse, high-quality data. In addition, we reviewed works on using diffusion models for data synthesis in single-agent settings [1] [2], and they also explicitly emphasize the distinction between data synthesis and model-based methods.
Summary of the Comparison between INS and MOMA-PPO:
| Method | Category | Mathematical Form | Model Type | Requires Real States | Requires Policy |
|---|---|---|---|---|---|
| MOMA-PPO | Model-Based MARL | $p(o', r \mid o, a)$ | Predictive Model | Yes | Yes |
| INS (Ours) | Data Synthesis | $p(\mathbf{o}, \mathbf{a}, r, \mathbf{o}')$ | Generative Model | No | No |
To clarify the distinction between these two approaches, we have added a detailed explanation of the differences between INS and model-based methods in the updated Related Work section.
Relaxing the Claims of Our Contributions: To address any potential confusion for readers and to alleviate your concerns about overstating our contributions, we have followed your suggestion to appropriately relax the claims in our revised manuscript. Specifically, we now state:
> INS is the first diffusion-based data synthesis approach for offline MARL.
We hope this addresses your concerns.
Once again, thank you for taking the time to review our response and for your valuable feedback. If you find that we have addressed your concerns, we kindly hope you reconsider your rating. If you need further elaboration or additional points to include in the response, we welcome further discussion to ensure everything is clear and satisfactory.
Reference:
[1] Synthetic experience replay, NeurIPS 2023.
[2] Prioritized Generative Replay, arXiv preprint 2024.
Q2: What is the difference in the role of sparsemax in INS and [7, 8]?
A2: Thank you for pointing out the relevant literature on the application of sparsemax in MARL. We acknowledge the connection between these works and ours, and we have added relevant discussions and comparisons in the revised version (Appendix E). After carefully reviewing both papers, we would like to clarify the differences between the role of sparsemax in INS and in these works:
In OPT [7] and SparseMAAC [8], sparsemax is used to extract relationships between the real observations of multiple agents, enabling agents to focus on specific targets during inference, which helps in effective decision-making. Therefore, in these works, the input to sparsemax is the encoding of agent observations or observation-action pairs, and the output is used for subsequent action generation or Q-value computation.
In contrast, our work is a generative task, where sparsemax is applied during the denoising process of transitions (which are initially sampled entirely from noise). During training, sparsemax learns the interaction patterns present in the dataset, and during synthesis, it progressively shapes the inter-relationship among agents in the denoising process, rather than extracting pre-existing relationships. Therefore, we believe the role of sparsemax in INS is fundamentally different from its role in the aforementioned works.
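For reference, a minimal NumPy sketch of the sparsemax projection (Martins & Astudillo, 2016) that both lines of work build on; this is an illustration of the operator itself, not our training code:

```python
# Sparsemax: Euclidean projection of logits onto the probability simplex,
# yielding attention weights that can be exactly zero for weakly related agents.
import numpy as np

def sparsemax(z):
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # sort logits in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum         # indices kept in the support
    k_star = k[support][-1]
    tau = (cumsum[k_star - 1] - 1) / k_star     # threshold subtracted from logits
    return np.maximum(z - tau, 0.0)
```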
Summary of the comparison between these algorithms:
| Methods | Process | Role | Input | Output |
|---|---|---|---|---|
| OPT & SparseMAAC | Decision-making (policy inference) | Extract relationships | Real observations | Action/Q-value |
| INS (Ours) | Transition synthesis (denoising process) | Shape the inter-relationships | Noisy synthetic transitions | Transitions |
Q3: Comparison between bit action and one-hot encoding.
A3: We appreciate the reviewer's concerns. First, it is important to clarify that the discrete actions in the SMAC offline dataset [9] and in the synthetic dataset used for training are stored as integer indices, which are not orthogonal representations to begin with. Next, we explain both theoretically and experimentally why we use bit action instead of one-hot encoding for data synthesis:
Theoretically, as mentioned in the paper (L252), bit action is more efficient than one-hot encoding, significantly reducing the dimensionality of the action representation from $|\mathcal{A}|$ (one-hot) to $\lceil \log_2 |\mathcal{A}| \rceil$ (bit action).
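A minimal sketch of an analog-bits-style encoding [10] is given below; the helper names are illustrative and this is not our exact implementation:

```python
# Illustrative analog-bits encoding/decoding for discrete actions [10]:
# an integer action index is mapped to ceil(log2 |A|) bits in {-1, +1},
# and noisy generated bits are decoded by thresholding at zero.
import numpy as np

def action_to_bits(action: int, n_actions: int) -> np.ndarray:
    n_bits = max(1, int(np.ceil(np.log2(n_actions))))
    bits = [(action >> i) & 1 for i in range(n_bits)]
    return 2.0 * np.array(bits, dtype=np.float32) - 1.0    # {0,1} -> {-1,+1}

def bits_to_action(analog_bits: np.ndarray) -> int:
    hard = (np.asarray(analog_bits) > 0).astype(int)       # threshold noisy bits
    return int(sum(int(b) << i for i, b in enumerate(hard)))
```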
Experimentally, [10] has shown that using 2-bit encoding yields better results than one-hot encoding for discrete variable generation. We have also conducted experimental evaluations on bit action. Specifically, we replaced bit action with one-hot encoding in INS and tested it on four datasets.
| Dataset | INS (bit action) | INS (one-hot encoding) |
|---|---|---|
| Spread (Expert) | 107.0 ± 4.4 | 103.5 ± 5.0 |
| Spread (Medium) | 30.2 ± 2.5 | 27.7 ± 3.0 |
| 3m (Good) | 19.9 ± 1.4 | 18.5 ± 0.8 |
| 3m (Medium) | 18.5 ± 1.8 | 17.6 ± 1.4 |
The results show a notable performance drop in the MA-ICQ policy trained on the one-hot encoding datasets, indicating that the use of bit action leads to better performance in generating actions compared to one-hot encoding.
Q4: The selection proportion parameter lacks flexibility and requires tuning for different tasks.
A4: We apologize for any confusion caused. This statement is not accurate. We used the same selection proportion of 0.8 across all major experiments, and it consistently yielded satisfactory performance in most scenarios, which is why we recommend this parameter setting (L441). Additionally, when the parameter varied from 0.6 to 1 (in Figure 5), INS still achieved performance that exceeded the baseline. While INS does allow for adjusting the selection proportion to synthesize datasets with different diversity/quality trade-offs, this does not imply that the parameter needs to be fine-tuned for every task (unless, of course, one wishes to perform a hyperparameter sweep for optimal performance—something that is common across many algorithms). We have clarified this point in the revised manuscript (L441).
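For concreteness, a minimal sketch of how a fixed selection proportion can be applied, assuming a value estimate is available for each synthetic transition (illustrative code, not our exact implementation):

```python
# Illustrative value-based selection: rank synthetic transitions by an
# estimated value and keep the top `proportion` of them (0.8 in our experiments).
import numpy as np

def value_select(transitions, values, proportion=0.8):
    values = np.asarray(values)
    k = int(len(transitions) * proportion)
    keep = np.argsort(values)[::-1][:k]   # indices of the highest-value transitions
    return [transitions[i] for i in keep]
```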
Dear reviewer, thank you for taking the time to read our paper and provide useful comments. We have conducted several additional experiments and address your concerns below; we hope this eases them.
Q1: What are the differences between INS, diffusion planner (MADiff [1]), and model-based RL (MOMA-PPO [2])?
A1: After carefully reviewing the two papers you mentioned, we would like to gently emphasize the distinction between these methods and INS:
Difference to diffusion planner methods: MADiff [1] models a planner as a return-conditional diffusion model designed to maximize the cumulative reward for multi-agent decision-making. At each time step, MADiff generates an observation trajectory over a fixed horizon based on the agents' current or historical observations and then uses only the observations $o_t$ and $o_{t+1}$, along with an inverse dynamics model, to determine the action $a_t$.
- Theoretically, data synthesis in reinforcement learning often refers to the ability to generate diverse data that can be used to train policies rather than directly engaging in decision-making (as a planner or policy) [3, 4]. In two surveys on diffusion models for RL [3, 5], MADiff is consistently categorized as a planner (notably, the first author of [5] is also the first author of MADiff). In contrast, INS and SynthER generate synthetic data without maximizing rewards or selecting actions, and should therefore be categorized under Data Synthesizers. Thus, we argue that MADiff and our method are fundamentally different. Furthermore, MADiff generates observation sequences based on the agent's current state and only uses the next observation to produce an action, meaning it cannot generate a dataset from scratch that can be used for policy training. In contrast, our method requires no input and can generate a complete dataset of transitions suitable for subsequent offline multi-agent reinforcement learning.
- Technically, both MADiff and INS leverage the generative capabilities of diffusion models. In terms of the learned targets, MADiff trains on observation sequences, while INS learns from full transitions $(\mathbf{o}, \mathbf{a}, r, \mathbf{o}')$. When it comes to the output, MADiff produces observation sequences for optimal decision-making, while INS generates multiple transitions that form a dataset. In terms of training, MADiff uses classifier-free guidance, requiring the agent's current observation and rewards as inputs, whereas INS needs no input and can generate multi-agent datasets from scratch.
Difference to model-based RL methods: MOMA-PPO [2] is a model-based MARL method. It first samples historical data from the dataset, inputs it into the current policy to obtain corresponding actions, and then feeds these actions and states into a world model (MLPs). This process is repeated multiple times to roll out a trajectory, which is then used to train MAPPO.
- The authors of MOMA-PPO describe it as a Dyna-like [6] model-based approach. Dyna-like methods are classic model-based approaches that optimize a policy by unrolling trajectories "in the imagination" of the world model. In contrast, INS is a data synthesis method that generates a group of independent transitions without needing to start from a real state or the current policy, and the generated experiences are distributed according to the dataset. Furthermore, INS is an orthogonal approach that could be combined with dynamics models by generating initial states via diffusion, potentially leading to increased diversity.
Summary of the comparison between the three algorithms ($h_t$ denotes the observation history):
| Method | Key Feature | Objective | Input | Output | Requires Real State | Requires Policy |
|---|---|---|---|---|---|---|
| MADiff [1] | Planner | Higher rewards | Current observations / history $h_t$ | Observation trajectory (actions via inverse dynamics) | Yes | No |
| MOMA-PPO [2] | Model-based MARL | Higher rewards | Real states/histories + current-policy actions | Rolled-out trajectories | Yes | Yes |
| INS (ours) | Data Synthesizer | Diverse high-quality data | None | Transition dataset | No | No |
We have added a comparison between INS and the aforementioned methods (planners & model-based MARL) to the revised manuscript (Appendix D). However, if in further discussions the reviewer still believes that [1,2] should be considered data synthesis methods due to their process of generating "certain types of data", we are open to appropriately softening our claim in the contributions section.
Thanks for your clarification. Based on the revised version, I have updated my score. Good luck!
We sincerely thank you for engaging with us and for increasing your score! We really appreciate your effort in reviewing our paper and your recognition of our work!
The constructive suggestions you gave during the rebuttal session are greatly helpful in improving the quality of our paper. Thanks for your time and hard work!
The paper addresses data scarcity in offline multi-agent reinforcement learning (MARL), emphasizing the unique challenges in synthesizing high-quality multi-agent datasets due to complex inter-agent interactions. The authors propose Interaction-aware Synthesis (INS), a method that uses diffusion models with sparse attention to accurately model these interactions and a bit action module to support both discrete and continuous actions. INS also includes a selection mechanism to prioritize high-value transitions. Experiments in MPE and SMAC environments show that INS outperforms current methods, enhancing downstream policy performance and dataset quality. Remarkably, INS can synthesize effective data using only 10% of the original dataset, demonstrating its efficiency in data-scarce scenarios.
Strengths
- The paper was well-written. It's easy to understand the motivation, contributions, and methodology of the proposed work
- Collecting datasets for multi-agent reinforcement is hard. Also, generating datasets in the multi-agent scenario is non-trivial due to the difficulty of modelling the interaction between the agents.
- The experiments are complete and promising.
Weaknesses
- In terms of novelty, the proposed solution looks like simply applying a diffusion model for generating trajectories with additional attention mechanisms across the agents, which is a lack of novelty.
Questions
- Is the proposed approach scalable in terms of a large number of agents?
- It seems that the objective of the trajectory generation process only guarantees whether the generated trajectories are realistic or not. There is no mechanism to implicitly guide the diffusion model to enhance the quality of the trajectories (accumulated rewards of trajectories).
Dear reviewer, we appreciate your time and effort in reviewing this paper. We address your concerns below and hope our responses resolve them.
Q1: How does INS differ from other methods that use diffusion models for generation?
A1: Although there are existing methods that employ diffusion models for trajectory generation, INS differs significantly from them in many crucial aspects:
In terms of contribution, we introduce a transformer encoder-based diffusion model (DM) as a synthesizer for multi-agent datasets to enhance offline MARL performance, effectively reducing the dependency on real interaction data. This fundamentally differs from existing approaches in MARL that incorporate DMs as either planners [4] or policies [5].
Regarding method design, unlike single-agent data synthesis methods, we discovered that directly applying DMs to multi-agent systems is suboptimal due to the complexity introduced by inter-agent interactions. To address this, INS incorporates a sparse attention mechanism to maintain focus on agent interactions during the generation process. Additionally, we developed a select mechanism that improves dataset quality compared to previous data synthesis approaches.
Empirically, policies trained on INS-synthesized datasets demonstrate improved performance across most multi-agent scenarios. We evaluated the datasets using multiple metrics to validate our method's advantages. Furthermore, INS can synthesize high-quality data using only 10% of the original dataset, a capability that previous methods have struggled to achieve.
While we acknowledge that INS is relatively straightforward, its effectiveness and accessibility have been recognized by all reviewers in the advantages section. Moreover, several highly influential recent RL papers published at similar venues have demonstrated that simple ideas can be both intuitive and effective [1,2,3].
Q2: Is the proposed approach scalable in terms of a large number of agents?
A2: Scalability is a good point that we haven't paid much attention to. INS is tested with up to 8 agents, which is the maximum number of agents considered by existing offline MARL methods. Inspired by your comment, we additionally conducted experiments on SMAC scenario 27m_vs_30m, which involves 27 agents:
| Dataset | MAICQ (original) | MAICQ (INS) | MABCQ (original) | MABCQ (INS) |
|---|---|---|---|---|
| 27m_vs_30m (Good) | 15.7 ± 1.5 | 16.4 ± 1.2 | 10.1 ± 1.2 | 10.8 ± 1.1 |
| 27m_vs_30m (Medium) | 12.5 ± 0.3 | 13.1 ± 0.4 | 9.6 ± 0.9 | 9.9 ± 1.0 |
| 27m_vs_30m (Poor) | 9.7 ± 0.9 | 9.9 ± 0.8 | 9.6 ± 0.6 | 9.6 ± 0.7 |
The experimental results show that our algorithm can effectively scale to systems with more agents. Moreover, since collecting data in real large-scale systems is challenging, this further highlights the advantages of our approach.
Q3: Why not implicitly guide the diffusion model to enhance the quality of the trajectories?
A3: While it is true that a guided diffusion model can improve the quality of the generated strategies when used for decision-making, in the context of data synthesis, our goal is to ensure compatibility with a broad range of offline MARL algorithms. Introducing a guided diffusion model could bias the generated data toward certain patterns, which may reduce the diversity of the synthesized data [6]. Additionally, this would introduce a trade-off between diversity and fidelity [7]. However, we do see the potential for applying guided diffusion models in specific domains with particular preferences, such as safety RL [8], and have added this consideration to the limitations section of the revised manuscript. Thanks for your valuable suggestion.
Please let us know if we have addressed your concerns. We are more than delighted to have further discussions and improve our manuscript.
Reference:
[1] A minimalist approach to offline reinforcement learning, NeurIPS 2021.
[2] Image augmentation is all you need: Regularizing deep reinforcement learning from pixels, ICLR 2021.
[3] S4rl: Surprisingly simple self-supervision for offline reinforcement learning in robotics, CoRL 2022.
[4] MADiff: Offline multi-agent learning with diffusion models, NeurIPS 2024.
[5] Beyond Conservatism: Diffusion Policies in Offline Multi-agent Reinforcement Learning, arXiv preprint 2023.
[6] Diffusion model is an effective planner and data synthesizer for multi-task reinforcement learning, NeurIPS 2023.
[7] PixelAsParam: A gradient view on diffusion sampling with guidance, ICML 2023.
[8] Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model, ICLR 2024.
We sincerely thank the reviewers for their thoughtful comments and thorough reading of our paper. We are pleased to see that the reviewers found our problem setting to be "hard" and "non-trivial" (reviewer HKfA). We also appreciate the positive feedback on our approach being "easy to understand" (reviewer HKfA), "valuable" (reviewer jHSc), "easy to follow", "shows potential" (reviewer 8UPd), and "simple yet effective" as well as "convincing" (reviewer PNEG). Additionally, we are grateful for the recognition that our experiments are "complete and promising" (reviewer HKfA), "carefully designed and provide strong support" (reviewer jHSc), and "provide valuable insights for future research" (reviewer PNEG).
To address the reviewers' suggestions and concerns, we have made a number of changes to the paper. The main changes are highlighted in blue in the updated manuscript and appendices. A brief summary of the modifications is as follows:
- Add further discussion of related works [1, 2] in the main text and Appendix D.
- Include a comparison of the role of sparsemax in [3, 4] and INS in Appendix E.
- Provide an evaluation of the computational efficiency of INS in Appendix G.5.
- Update the explanation of global state generation in Appendix K.1.
- Revise the limitations section to highlight potential applications of guided diffusion models in specific domains.
- Clarify some expressions in the paper that may cause ambiguity for the reviewers.
Furthermore, we incorporate additional evaluation results in response to specific comments, and we hope that the extended experimental analysis strengthens the contribution of our method.
Please let us know if we have sufficiently addressed your concerns. We would be more than delighted to engage in further discussions and improve our manuscript. If our response has addressed your concerns, we would be grateful if you could re-evaluate our work.
Reference:
[1] MADiff: Offline multi-agent learning with diffusion models, NeurIPS 2024.
[2] A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem, AAMAS 2024.
[3] Interaction pattern disentangling for multi-agent reinforcement learning, IEEE TPAMI 2024.
[4] SparseMAAC: Sparse attention for multi-agent reinforcement learning, DASFAA 2019.
This work addresses a challenging problem (data scarcity in offline MARL), tests a simple modern baseline (MA-SynthER) and then diagnoses the weaknesses and proposes a more considered approach to apply it to the specific setting. The resulting method produces gains. That is a very nice recipe for scientific contribution, so the paper should be accepted. I would encourage the authors to be a little clearer on bolding or highlighting in the tables, and to only do so when the results are statistically significant. Aside from that it is good work.
Additional Comments on Reviewer Discussion
Discussion was mostly clarifications since the reviewers were mostly positive to begin with.
Accept (Poster)