Off-the-Grid MARL: Datasets with Baselines for Offline Multi-Agent Reinforcement Learning
Offline MARL is a nascent field which promises to turn large datasets into powerful, decentralised decision making systems. However, progress has been hampered by the lack of high-quality benchmark multi-agent datasets.
Abstract
Reviews and Discussion
This paper introduces OG-MARL, an expansive repository of datasets for cooperative offline multi-agent reinforcement learning (MARL). Addressing the current lack of standardized datasets and baselines in offline MARL, the authors offer a collection that mirrors the complexities of real-world systems, such as heterogeneous agents and non-stationarity. These datasets, classified into types such as Good, Medium, Poor, and Replay, undergo thorough quality assurance checks. The authors have made OG-MARL publicly accessible.
Strengths
The paper addresses a significant gap in the field of offline multi-agent reinforcement learning. This initiative targets the lack of standardized datasets and baselines, a challenge often overlooked in the reinforcement learning community. In terms of quality, the datasets were carefully curated and validated. The inclusion of diverse real-world system characteristics, such as heterogeneous agents and non-stationarity, strengthens the paper. The paper is also easy to follow. By providing a public repository, it contributes to the MARL research community.
Weaknesses
- The categorization of datasets based heavily on the quality of experience may inadvertently introduce biases. A more well-rounded evaluation could be achieved by integrating additional qualitative and quantitative metrics.
- The results section provides an overview of algorithmic performance but lacks analytical depth. Explaining the reasons behind the observed performances, such as the underperformance of vanilla QMIX, would offer more substantial insights.
- A similar work, "Off-the-Grid MARL: Datasets and Baselines for Offline Multi-Agent Reinforcement Learning," was previously published at AAMAS. Does this submission introduce novel datasets or environments that extend beyond those covered in the AAMAS paper? Are there any innovative algorithmic approaches, evaluation metrics, or experimental setups that were not addressed in the prior publication? Further, how does the current paper tackle the challenges and limitations identified in the AAMAS publication?
- While this paper undeniably provides significant aid to the research community in terms of establishing a baseline database and laying engineering groundwork for MARL, its depth seems somewhat superficial. The ideas, though functional, are straightforward: testing existing algorithms in different environments and producing new datasets (using established methods). Moreover, while the authors have laid out certain frameworks and methodologies, there is no clear documentation on how one might implement novel algorithms or introduce new environments within the given framework. This work is important from an engineering perspective but contributes little to the theoretical underpinnings or conceptual advancements of the field.
Questions
Please see the Weaknesses.
Details of Ethics Concerns
None
We thank the reviewer for their in-depth feedback on our work. We appreciate the reviewer's confidence that “[...] this paper undeniably provides significant aid to the research community [...]”.
- “A similar work, ‘Off-the-Grid MARL: Datasets and Baselines for Offline Multi-Agent Reinforcement Learning,’ was previously published at AAMAS.”
- The paper presented at AAMAS was an extended abstract and is therefore non-archival; it has not been published in any proceedings. Our work improves upon the extended abstract in several significant ways. First, several new environments are now supported, namely SMACv2, KAZ, MPE, Voltage Control and CityLearn. Second, we added a dataset generated from human players. Finally, we added a competitive offline dataset.
- “The ideas, though functional, are straightforward by testing different algorithms in different environments and producing new datasets (by using the old method) […] This work is engineering important but has little contribution to the theoretical underpinnings or conceptual advancements in the field.”
- We feel the significance of dataset papers to the field cannot be overstated. RL Unplugged [1] and D4RL [2] both helped drive progress in single-agent offline RL and enabled many breakthroughs in recent years [3,4]. Moreover, we echo reviewer KxKX’s remark that “Benchmarking is essential in machine-learning communities as well as multi-agent learning communities.” Without proper benchmarking contributions, it is impossible to get a clear overview of the current state of the field. We believe it is for these reasons that ICLR welcomes “Dataset and Benchmark” contributions (see the list of subject areas at https://iclr.cc/Conferences/2024/CallForPapers).
- “The results section provides an overview of algorithmic performance but lacks analytical depth.”
- As with the related offline RL dataset publications RL Unplugged [1] and D4RL [2], providing detailed theoretical or empirical evidence for why one baseline outperforms another is not the focus of this work. The purpose of including baselines is so that future works can easily compare their algorithms against them on our datasets. We agree, however, that a deeper analysis is a valuable direction for future work.
- “...there isn't clear documentation on how one might go about implementing novel algorithms or introducing new environments within the given context.”
- We provide an in-depth tutorial on how to add a new environment and record data using OG-MARL in the code available on our website (https://sites.google.com/view/og-marl). Furthermore, ours is the largest openly accessible collection of offline MARL algorithms implemented under a single framework, and is therefore arguably one of the best resources for future researchers to use when implementing new algorithms.
- “A more well-rounded evaluation could be achieved by integrating additional qualitative and quantitative metrics.”
- Determining the boundaries for the different dataset types (Good, Medium, Poor) required an in-depth qualitative analysis of the various environments and of how return relates to agent skill. We initially opted not to include these details in the final write-up because we thought they would not be important to practitioners using our datasets. However, we now see that the details may be valuable and have therefore added them to the appendix; we kindly invite the reviewer to read this analysis.
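To make this concrete, below is a minimal sketch (purely illustrative; the threshold values and helper name are hypothetical and not the actual boundaries used for OG-MARL) of how episodes could be partitioned into Good/Medium/Poor tiers by episode return:

```python
# Illustrative sketch only: partition episodes into quality tiers using
# environment-specific return thresholds. The thresholds below are made up.
from typing import Dict, List, Tuple


def partition_by_return(
    episode_returns: List[float],
    thresholds: Tuple[float, float],  # (poor_upper, medium_upper), chosen per environment
) -> Dict[str, List[int]]:
    """Return the indices of episodes falling into each quality tier."""
    poor_upper, medium_upper = thresholds
    tiers: Dict[str, List[int]] = {"Poor": [], "Medium": [], "Good": []}
    for idx, ret in enumerate(episode_returns):
        if ret < poor_upper:
            tiers["Poor"].append(idx)
        elif ret < medium_upper:
            tiers["Medium"].append(idx)
        else:
            tiers["Good"].append(idx)
    return tiers


# Example: returns below 5 go to Poor, between 5 and 15 to Medium, 15+ to Good.
print(partition_by_return([2.0, 7.5, 18.0, 12.0], thresholds=(5.0, 15.0)))
```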
[1] Gulcehre, Caglar, et al. “RL Unplugged: A suite of benchmarks for offline reinforcement learning.” NeurIPS 2020.
[2] Fu, Justin, et al. “D4RL: Datasets for deep data-driven reinforcement learning.” arXiv preprint 2020.
[3] Kumar, Aviral, et al. “Conservative Q-learning for offline reinforcement learning.” NeurIPS 2020.
[4] Kostrikov, Ilya, Ashvin Nair, and Sergey Levine. “Offline reinforcement learning with implicit Q-learning.” ICLR 2021.
Thanks for answering my questions and resolving my concerns. I have raised my scores because of the clarifications on the AAMAS paper.
This paper proposes datasets for offline multi-agent reinforcement learning, covering both games and real-world problems with both discrete and continuous actions. The paper also provides different dataset types: Good, Medium, and Poor. Evaluation results for offline multi-agent reinforcement learning baselines are provided.
Strengths
1. Datasets for offline multi-agent reinforcement learning have been missing, and they are important for this community.
2. The paper provides a comprehensive collection of datasets, covering both games and real-world problems with both discrete and continuous actions.
3. The paper is well written.
Weaknesses
1. The major concern with the paper is the correctness of the baseline implementations. OMAR definitely outperforms CQL on many tasks, as reported in “Beyond Conservatism: Diffusion Policies in Offline Multi-agent Reinforcement Learning” (https://arxiv.org/abs/2307.01472). However, this is not the case in Table D.5. For a dataset and benchmark paper, I think it is crucial to ensure that the results are replicable and that the claims made about previous baselines are correct.
2. The paper does not explain why one algorithm outperforms another.
Questions
See the above section.
We thank the reviewer for their feedback.
“OMAR definitely outperforms CQL in many tasks, as reported in ‘Beyond Conservatism …’”
We are confident that our implementation is correct, as our results closely resemble those reported by Barde et al. (2023), “A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem” (https://arxiv.org/pdf/2305.17198.pdf). There, the authors evaluate OMAR on two new continuous action settings, namely Reacher (Table 2) and Ant (Table 3). Their results show that the performance of ICQL and OMAR is very similar, with OMAR’s mean performance only marginally superior to ICQL’s in 7 out of 11 settings and their uncertainty estimates overlapping in all but one scenario. Furthermore, both OMAR and ICQL are outperformed by ITD3+BC and BC in all but one setting, and usually by some margin. These results closely resemble our reported finding, namely that the performance of ICQL and OMAR is very similar but significantly worse than that of ITD3+BC and BC on all tested continuous action settings.
The original OMAR work and the work cited by the reviewer share authors and therefore do not provide an independent verification of OMAR’s performance. To the best of our knowledge, the paper we cite above represents the only independent study that uses OMAR, and this work corroborates our findings regarding the performance of OMAR relative to other algorithms such as ICQL, ITD3+BC and BC.
In addition, we will add the following remark to the appendix to clarify why there might be a discrepancy in performance. “We could not make OMAR or ICQL perform well on our tasks. We are unsure if this is because these algorithms perform poorly on our specific tasks, or if our implementations are missing an important detail. But since our results closely resemble the results reported by Barde et al. (2023), an independent work to the original OMAR paper, we decided to include them.”
We trust we have provided enough evidence to validate our reported results for OMAR, and we kindly ask the reviewer to consider improving their score or to let us know why this is not possible.
This paper proposes Off-the-Grid MARL (OG-MARL), a collection of datasets with baselines for cooperative offline MARL. The datasets cover settings that include complex environment dynamics, heterogeneous agents, non-stationarity, many agents, partial observability, suboptimality, sparse rewards and demonstrated coordination. OG-MARL provides a range of different dataset types and profiles the composition of experiences for each dataset.
Strengths
- This paper proposes datasets for offline MARL, extending the idea of single-agent offline RL datasets such as D4RL (Fu et al., 2020) and RL Unplugged (Gulcehre et al., 2020). The datasets cover settings that include complex environment dynamics, heterogeneous agents, non-stationarity, many agents, partial observability, suboptimality, sparse rewards and demonstrated coordination.
- The paper also provides baselines for existing cooperative offline MARL algorithms such as Behaviour Cloning (BC), QMIX (Rashid et al., 2018), QMIX with Batch-Constrained Q-Learning (Fujimoto et al., 2019), QMIX with Conservative Q-Learning (Kumar et al., 2020) and MAICQ (Yang et al., 2021). The results conclude that, on PettingZoo environments with pixel observations, MAICQ is the current state-of-the-art offline MARL algorithm in discrete action settings.
- The paper is well written and mostly clear.
Weaknesses
Although this paper has some novelty in extending single-agent RL datasets and baselines to MARL, the other contributions seem ordinary. More challenging benchmarks and more real-world scenarios might provide more significance, as described below. There are also some unclear points, described below.
Questions
- P6: The authors state that “We chose these environments because they have visual (pixel-based) observations of varying sizes; an important dimension along which prior works have failed to evaluate their algorithms”. What are the prior works specifically, and why did they fail the evaluation?
- For the human data (KAZ), a more detailed description should be provided, because humans are diverse and the properties of the human participants (e.g., age and game experience) should usually be reported. If possible, a comparison with data from RL algorithms would help characterise the human data.
- Regarding KAZ, the paper mentions that “The players where given no instruction on how to play the game and had to learn through trial and error.” Does this mean the data may include not only “learned” data but also “learning” data? The data acquisition process should be clarified.
- More challenging MARL benchmarks, such as team sports (e.g., [1], [2]) or more real-world robotics data, might provide more significance.
[1] Kurach et al. “Google Research Football: A Novel Reinforcement Learning Environment.” AAAI 2020.
[2] Liu et al. “From motor control to team play in simulated humanoid football.” Science Robotics 2022.
Details of Ethics Concerns
No
We thank the reviewer for their positive feedback on our work.
- “What are the prior works specifically and why did they fail the evaluation?”
- Due to space constraints, we could only include a subset of all the baseline experiments we ran in the main text. All baselines on the other environments are included in the Appendix. We chose to include the experiments on environments with pixel-based observations in the main text because prior offline MARL works [1,2,3] had not used such environments, and in [4] the authors emphasize that benchmarks with pixel-based observations have been lacking in the broader offline RL literature.
- “... does it mean the data may include not only “learned” data but also “learning” data?”
- Yes. To make sure the human-generated dataset included sufficiently diverse data we opted to include “learning” data.
- “More challenging MARL benchmarks such as team sports … might provide more significance.”
- We agree that adding additional environments such as Google Research Football would be valuable. However, our initial priority has been to support the most popular MARL benchmarks, including SMAC v1 & v2, MPE, MAMuJoCo and PettingZoo. Having said that, the data recorder provided in OG-MARL is flexible enough for researchers to use on a wide range of currently unsupported environments with minimal effort. Please see the tutorial provided in the OG-MARL code for how to record data in a new environment.
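For illustration only, a rough sketch of what recording transitions from a new environment could look like is given below. The class name, environment/policy interfaces and storage format are hypothetical and do not reflect the actual OG-MARL data recorder API; please refer to the tutorial in the OG-MARL code for the real interface.

```python
# Hypothetical sketch of an offline-data recorder; not the OG-MARL API.
import numpy as np


class OfflineRecorder:
    """Accumulates joint transitions and writes them to a compressed .npz shard."""

    def __init__(self, out_path: str):
        self.out_path = out_path
        self.transitions = {"obs": [], "actions": [], "rewards": [], "next_obs": [], "dones": []}

    def add(self, obs, actions, rewards, next_obs, done):
        # obs/actions/rewards are assumed to be arrays stacked over agents,
        # e.g. shape (num_agents, ...).
        self.transitions["obs"].append(obs)
        self.transitions["actions"].append(actions)
        self.transitions["rewards"].append(rewards)
        self.transitions["next_obs"].append(next_obs)
        self.transitions["dones"].append(done)

    def flush(self):
        # Stack the recorded transitions and write them to disk.
        np.savez_compressed(
            self.out_path, **{k: np.asarray(v) for k, v in self.transitions.items()}
        )


def record_episode(env, joint_policy, recorder):
    """Roll out one episode with a joint policy and log every transition."""
    obs = env.reset()
    done = False
    while not done:
        actions = joint_policy(obs)
        next_obs, rewards, done, _ = env.step(actions)
        recorder.add(obs, actions, rewards, next_obs, done)
        obs = next_obs
```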
[1] Jiechuan Jiang, et al. “Offline Decentralized Multi-Agent Reinforcement Learning.” arXiv Pre-Print
[2] Ling Pan, et al. “Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification.” NeurIPS
[3] Yang et al. “Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning.” NeurIPS 2021
[4] Cong Lu, et al. “Challenges and opportunities in offline reinforcement learning from visual observations.” ICML 2022
Thank you for the reply and clear comment about your third response (and my fourth question).
- “What are the prior works specifically and why did they fail the evaluation?” According to your comment, the prior works simply did not evaluate on such environments, so we do not know whether they would have failed the evaluation or not. If so, I recommend modifying the paper in line with your response.
- “... does it mean the data may include not only “learned” data but also “learning” data?” You clarified this point, but did not clarify my second question. As I wrote in my third question, the data acquisition process should be clarified.
Dear reviewer, we thank you for your response. We believe the reviewer may have misinterpreted our statement in the text. When we said that “prior works have failed to evaluate their algorithms” on pixel-based environments, we meant that prior works have not used any pixel-based environments in their experiments; that is to say, they did not test their algorithms on pixel-based environments. We make no claim about whether their algorithms would fail or not on pixel-based observations. We hope this clarifies what we meant and apologise for the confusing wording.
With regard to the data acquisition, we will add the following to the main text. “We collected data from 10 people aged between 20 and 30 years. Half the participants were female and the other half were male. The players were unfamiliar with the game when they played it for the first time and had to learn through trial and error. Each pair of players was allowed to play the game for 10 episodes.”
The paper introduces the OG-MARL openly available offline datasets and baselines for MARL. The datasets cover a range of scenarios, including micromanagement in StarCraft 2, continuous control in MAMuJoCo, diverse environments in PettingZoo, train scheduling in Flatland, and energy management in Voltage Control/CityLearn. The paper tries to address the lack of benchmark datasets and baselines in offline MARL and aims to facilitate research and comparison of MARL algorithms.
Strengths
- The paper addresses the lack of commonly shared benchmark datasets and baselines in the field of offline MARL.
- The proposed repository contains data for a large collection of MARL environments, including SMAC, MAMuJoCo, PettingZoo, Flatland, and Voltage Control/CityLearn.
- The authors provide detailed descriptions of the different environments and datasets, including information about the composition of the datasets and visualizations of the behavior policy.
Weaknesses
- It would be beneficial to include performance comparisons with more existing algorithms and baselines on the provided datasets, for instance federated offline MARL.
- It would be beneficial if the paper provided the sizes of the datasets and the approximate amount of computational resources required for training the baselines.
- The datasets seem to include few competitive scenarios; adding more competitive datasets would probably help make the benchmark more general.
Questions
- For the different levels of data (Good, Medium, etc.), how do you make sure that each dataset contains a wide variety of experiences and is not biased towards a certain type of policy?
- Based on C.1 of the paper, is the size of the datasets sufficiently large for large-scale experiments such as federated offline MARL?
We thank the reviewer for their positive feedback on our work.
- “...it would be beneficial to include performance comparisons with more existing algorithms…”
- We kindly ask the reviewer to point us to any published works in offline MARL we may have missed; we will gladly add these to our related work section. We are unfortunately not familiar with federated offline MARL. It is, however, our hope that OG-MARL becomes a living and growing project in which the community uses the framework we provide to add new baselines and datasets in the future.
- “...adding more competitive datasets would probably be helpful…”
- The lack of additional competitive multi-agent scenarios is reflective of the fact that the offline MARL community has primarily been focused on the cooperative setting given its potential real-world applicability. Competitive offline MARL research has largely been theoretical [1,2,3]. We hope the competitive dataset we provided, the easy-to-use data recorder and our step-by-step tutorial on how to use it, will encourage other researchers to make future contributions of competitive datasets.
- “… how do you make sure that the dataset contains a wide variety of experiences …”
- For each dataset, we used four independently trained systems of policies to roll out and record experiences. Additionally, we added a certain amount of random exploration noise to each policy (epsilon-greedy in discrete action environments and clipped Gaussian noise in continuous action environments); a brief illustrative sketch is given below, after this list. We verified that our datasets were sufficiently diverse by inspecting violin plots of the distribution of episode returns in the datasets and by qualitatively inspecting recordings of the trajectories. We will add the details regarding this qualitative analysis to the appendix of the paper.
- “ … is the size of the dataset sufficiently large for large-scale experiments such as federated offline MARL?”
- We have demonstrated that the datasets are sufficiently large for the presented baseline algorithms. However, we are not familiar with federated offline MARL and can therefore not say for sure. That said, the data recorder provided in OG-MARL can be leveraged by the research community to craft their own datasets at any desired scale.
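As mentioned in our response on dataset diversity above, the following is a minimal illustrative sketch of the exploration-noise injection we described (epsilon-greedy for discrete actions, clipped Gaussian noise for continuous actions). The function names and noise parameters here are placeholders, not the exact values used to generate the OG-MARL datasets.

```python
# Illustrative sketch of exploration-noise injection; parameter values are made up.
import numpy as np


def epsilon_greedy(greedy_action: int, num_actions: int, epsilon: float, rng: np.random.Generator) -> int:
    """With probability epsilon, replace the greedy discrete action with a uniformly random one."""
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return greedy_action


def clipped_gaussian(action: np.ndarray, sigma: float, low: float, high: float, rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise to a continuous action and clip it to the valid range."""
    noisy = action + rng.normal(0.0, sigma, size=action.shape)
    return np.clip(noisy, low, high)


rng = np.random.default_rng(seed=0)
print(epsilon_greedy(greedy_action=3, num_actions=5, epsilon=0.1, rng=rng))
print(clipped_gaussian(np.array([0.4, -0.9]), sigma=0.2, low=-1.0, high=1.0, rng=rng))
```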
[1] Cui, Qiwen, et al. “When are Offline Two-Player Zero-Sum Markov Games Solvable?” NeurIPS 2022.
[2] Cui, Qiwen, et al. "Provably efficient offline multi-agent reinforcement learning via strategy-wise bonus." NeurIPS 2022
[3] Yan, Yuling, et al. "Model-based reinforcement learning is minimax-optimal for offline zero-sum markov games." arXiv preprint.
This work introduces Off-the-Grid Multi-Agent Reinforcement Learning (OG-MARL), a repository aiming to address the lack of standardized benchmark datasets and baselines in the emerging field of offline multi-agent reinforcement learning (MARL). The motivation is to leverage large datasets from real-world industrial systems, where distributed processes can be recorded during operation. The provided datasets in OG-MARL exhibit characteristics of complex real-world environments, including partial observability, suboptimality, demonstrated coordination, etc.
Strengths
- Benchmarking is essential in machine-learning communities as well as multi-agent learning communities.
- This benchmark contains a variety of multi-agent settings, such as team & individual rewards and homogeneous & heterogeneous agents.
- This paper is well written to some extent.
Weaknesses
- An explanation and comprehensive analysis of the baselines tested on the proposed dataset should be provided as well.
- Is there any measurement of the diversity of the trajectories in the dataset?
- Please clarify the difference between this work and another recent work [1].
[1] Off-the-Grid MARL: Datasets and Baselines for Offline Multi-Agent Reinforcement Learning, AAMAS 2023
Questions
Please refer to the Weaknesses section.
We would like to thank the reviewer for their time and effort in providing feedback on our work.
- “An explanation and comprehensive analysis of the baselines tested on the proposed dataset should be provided as well.”
- As with the related offline RL dataset publications RL Unplugged [1] and D4RL [2], providing detailed theoretical or empirical evidence for why one baseline outperforms another is not the focus of this work. The purpose of including baselines is so that future works can easily compare their algorithms against them on our datasets. We agree, however, that a deeper analysis is a valuable direction for future work.
- “Is there any measurement of the diversity of the trajectories in the dataset?”
- As in prior works, we used the spread of episode returns as a proxy for diversity. However, unlike most prior works, which typically report only the standard deviation of episode returns in the dataset, we propose visualizing the distribution of episode returns using violin plots, which sheds significantly more light on the diversity of trajectories in the datasets (an illustrative plotting sketch is included at the end of this response). Having said that, designing a better metric for measuring the diversity of trajectories is an important open problem in the offline RL literature and a good direction for future work.
- “Please clarify the difference between this work and another recent work…”
- The AAMAS version of Off-the-Grid MARL was an extended abstract and therefore non-archival and has not been published in any proceedings. Our work includes several new environments (SMACv2, CityLearn, KAZ, MPE and Voltage Control). Additionally, we added a dataset generated from human players and a competitive MARL dataset.
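As referenced in our response on trajectory diversity above, the following is a minimal sketch of how such a violin plot of episode returns can be produced with matplotlib. The returns below are synthetic placeholders, not OG-MARL data.

```python
# Illustrative sketch: violin plot of episode returns per dataset tier (synthetic data).
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
returns_by_dataset = {
    "Poor": rng.normal(3.0, 1.0, size=500),
    "Medium": rng.normal(8.0, 2.0, size=500),
    "Good": rng.normal(15.0, 2.5, size=500),
}

fig, ax = plt.subplots(figsize=(6, 4))
ax.violinplot(list(returns_by_dataset.values()), showmedians=True)
ax.set_xticks(range(1, len(returns_by_dataset) + 1))
ax.set_xticklabels(returns_by_dataset.keys())
ax.set_ylabel("Episode return")
ax.set_title("Distribution of episode returns per dataset (synthetic example)")
plt.show()
```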
[1] Gulcehre, Caglar, et al. “RL Unplugged: A suite of benchmarks for offline reinforcement learning.” NeurIPS 2020.
[2] Fu, Justin, et al. “D4RL: Datasets for deep data-driven reinforcement learning.” arXiv preprint 2020.
This paper builds on an earlier extended abstract, to propose a benchmark suite of tasks for offline multi-agent RL and contribute datasets, source code and tutorials. The benchmark has significant potential for impact on the research community, but the paper was currently considered by all reviewers as not yet ready for publication.
Whilst discussion focused on the relative performance of OMAR, broader issues raised across all reviews have had a larger effect on my recommendation for the paper. Most notably the diversity of environments and details on the datasets.
Regarding OMAR, the authors' and reviewers' combined references show that OMAR's performance is variable and uncertain. A clearer documentation of this in the paper could help mitigate this issue in future revisions.
Why Not a Higher Score
- Unanimous opinion among all reviewers that the paper is not ready for publication
- Missing details on datasets and breadth of environments included
Why Not a Lower Score
N/A
Reject