Negatively Correlated Ensemble Reinforcement Learning for Online Diverse Game Level Generation
This paper proposes a regularised ensemble reinforcement learning approach, together with policy regularisation theorems, to train generators that produce diverse and promising game levels in real time.
Abstract
Reviews and Discussion
This paper proposes NCERL, an ensemble Reinforcement Learning (RL) method for game level generation with more diversity. NCERL uses a set of different Gaussian actors which output actions in the latent space, and the outputs are decoded into different level segments by a GAN decoder. The final output segment is chosen according to probabilities produced by a learned selector. To encourage diversity between actors, a regulariser based on the Wasserstein distance between policy distributions (weighted by the selector) is added to the reward. The paper also gives a modified gradient and a convergence proof for the new reward, since the regulariser depends on the whole policy rather than a single action. On a well-known level-generation benchmark, NCERL achieves comparable reward (measuring game-design goals) with better level diversity, and can also trade off diversity against reward by controlling its regularisation coefficient.
Strengths
- As this paper uses a latent space for the policy and an existing decoder for segment generation, the problem addressed by this paper is not only interesting, but actually quite general: how to achieve a trade-off between policy diversity and performance, with game level generation being one of its real-world applications.
- The paper has a sound theoretical basis, with a convergence proof for the new reward (which, similar to soft actor-critic, depends on the policy distribution instead of the action) and a rigorous, modified version of the gradient update.
- The proposed work is scalable, with a well-designed and parallelised reward evaluation framework.
Weaknesses
Some details of the paper are not presented clearly enough.
- Figure 1 has many math symbols and no caption on the high-level ideas of each component; in addition, the meaning of one of the symbols in the figure is unclear.
- There is no clear definition of what a state is in the paper; readers can only speculate that it is a latent vector from the end of Section 2.1 ("... a randomly sampled latent vector is used as the initial state").
- There is no description of how the Wasserstein distance is calculated. It is true that the 2-Wasserstein distance between Gaussian distributions can be easily calculated, but it would be better if the Gaussian property were emphasised at the beginning of Section 3.2, a formula were given to make the paper self-contained, and "2-Wasserstein" rather than "Wasserstein" were specified.
- Typos: in the conclusion, "bettwe" -> "better".
Questions
I have two questions:
- In Table 1, the trade-off of NCERL between reward and diversity in some of the environments does not follow the trend; for example, in Mario Puzzle, the diversity as λ varies seems to have two peaks (0.2 and 0.5), and the reward at λ=0.1 is the worst despite its low diversity. Could the authors explain this?
- Currently, there is only a comparison between NCERL, ensemble RL methods and non-ensemble RL methods, and the encoding is based on a GAN. What is the performance of non-RL solutions, such as scripted solutions (possibly with learnable parameters), or supervised/self-supervised learning over more recent generative models such as diffusion models or VAEs?
Thank you for the insightful comments; we greatly appreciate your constructive suggestions for improving the presentation. Our paper has been carefully revised according to your suggestions, and we have also checked the paper again for typos. The changes are highlighted in blue. For your questions, we hope the following responses address them properly.
Q1: In Table 1, the trade-off of NCERL between reward and diversity in some of the environments does not follow the trend; for example, in Mario Puzzle, the diversity as λ varies seems to have two peaks (0.2 and 0.5), and the reward at λ=0.1 is the worst despite its low diversity. Could the authors explain this?
We believe this may be caused by local optima during training; as a result, the training results of NCERL can be somewhat unstable. The black-box decoder further makes the training results hard to predict. The coefficient λ not only affects the trade-off between reward and diversity but also affects the exploration efficiency, making the effect of λ on performance more non-monotonic. The relatively poor performance of the generator trained with λ=0.1 can be attributed to an abnormal trial which yielded notably low average reward and diversity. For future work, we plan to integrate our method with multi-objective reinforcement learning [1] to train a set of non-dominated policies with mutable regularisation coefficients. This could potentially bypass the problem of the unstable and non-monotonic effects of the regularisation weight. To facilitate a more in-depth analysis of the results, we report the reward and diversity of each independent trial. Full results are presented and discussed in Appendix E.1.
Q2: Currently, there is only a comparison between NCERL, ensemble RL methods and non-ensemble RL methods, and the encoding is based on a GAN. What is the performance of non-RL solutions, such as scripted solutions (possibly with learnable parameters), or supervised/self-supervised learning over more recent generative models such as diffusion models or VAEs?
Due to the limited time, we were not able to compare our approach with all the non-RL approaches listed by the reviewer, but we have made an effort to conduct a meaningful comparison. Specifically, we used the code from [2] to train diffusion models (DDPM) for five independent trials. We then randomly sampled 25 noises to generate 25 level segments and concatenated them with an initial segment used in the RL generator testing. This procedure was applied to each of the 500 initial segments used for testing the RL generators, resulting in a test set of 500 levels. We evaluated these test sets in terms of both reward and diversity.
| Task | Criterion | DDPM | λ = | λ = | λ = | λ = | λ = | λ = |
|---|---|---|---|---|---|---|---|---|
| MarioPuzzle | Reward | -29.45 | 55.24 | 51.42 | 53.78 | 53.22 | 54.59 | 53.26 |
| MarioPuzzle | Diversity | 1630 | 1342 | 1570 | 1940 | 1688 | 1698 | 1967 |
| MultiFacet | Reward | -119.4 | 46.39 | 46.16 | 45.35 | 37.87 | 40.86 | 35.77 |
| MultiFacet | Diversity | 1630 | 401.6 | 492.3 | 620.2 | 1024 | 889.6 | 1142 |
According to the results, DDPM exhibited good diversity scores but performed poorly in terms of reward. This is because the training of DDPM does not take into account the reward functions that evaluate the quality of generated levels. To our knowledge, DDPM cannot be directly optimised towards customised objectives, while scripted methods rely on domain knowledge and incur significant development costs. The generated samples of DDPM are available in our anonymous code repository (https://anonymous.4open.science/r/NCERL-Diverse-PCG-4F25/, the generation_results folder).
[1] Hayes, Conor F., et al. "A practical guide to multi-objective reinforcement learning and planning." Autonomous Agents and Multi-Agent Systems 36.1 (2022): 26.
[2] Lee, Hyeon Joon, and Edgar Simo-Serra. "Using Unconditional Diffusion Models in Level Generation for Super Mario Bros." 2023 18th International Conference on Machine Vision and Applications. IEEE, 2023.
For your suggestions about the presentation, we have addressed them carefully in the revision. Our changes in the paper are highlighted in blue.
1. Figure 1 has many math symbols and no caption on the high-level ideas of each component; in addition, the meaning of one of the symbols in the figure is unclear.
We have updated Figure 1 and its caption to explain the components and math symbols.
2. There is no clear definition of what a state is in the paper; readers can only speculate that it is a latent vector from the end of Section 2.1 ("... a randomly sampled latent vector is used as the initial state").
We are sorry for the lack of clarity in the paper. We have revised the description of the domain (state space, action space, reward criterion and diversity criterion) to make it clearer. In the revised paper, Section 3 “Problem Formulation” describes the state and action. A state is a concatenated vector of a fixed number of latent vectors from recently generated segments. If fewer segments than this fixed number have been generated, zeros are padded in the vacant entries.
3. There is no description of how the Wasserstein distance is calculated. It is true that the 2-Wasserstein distance between Gaussian distributions can be easily calculated, but it would be better if the Gaussian property were emphasised at the beginning of Section 3.2, a formula were given to make the paper self-contained, and "2-Wasserstein" rather than "Wasserstein" were specified.
We have updated Section 3.2 as suggested by the reviewer. In the revision, Appendix C.2 is added to provide the formulation of the 2-Wasserstein distance and how it can be calculated between Gaussian distributions.
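For readers of this thread, the standard closed-form expression for Gaussians (stated here up to notation; the exact formulation appears in the revised Appendix C.2) is:

```latex
W_2^2\!\left(\mathcal{N}(\mu_1,\Sigma_1),\,\mathcal{N}(\mu_2,\Sigma_2)\right)
  = \lVert \mu_1-\mu_2 \rVert_2^2
  + \operatorname{Tr}\!\left(\Sigma_1+\Sigma_2-2\left(\Sigma_2^{1/2}\,\Sigma_1\,\Sigma_2^{1/2}\right)^{1/2}\right)
```

For diagonal covariances Σᵢ = diag(σᵢ²), this reduces to ‖μ₁−μ₂‖² + ‖σ₁−σ₂‖².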
4. Typos: in the conclusion, "bettwe" -> "better".
We have fixed it and checked through the paper for typos.
The paper proposes an ensemble reinforcement learning approach for generating diverse game levels. The approach uses multiple sub-policies to generate different alternative level segments, and stochastically selects one of them following a selector model. The paper also integrates a novel policy regularisation technique, which is a negative correlation regularisation that increases the distances between the decision distributions determined by each pair of actors. The regularisation is optimised using regularised versions of the policy iteration and policy gradient, which provide general methodologies for optimising policy regularisation in a Markov decision process. The paper's contributions are:
- The proposed ensemble reinforcement learning approach for generating diverse game levels.
- The novel policy regularisation technique that encourages the sub-policies to explore different regions of the state-action space.
- The regularised versions of the policy iteration and policy gradient algorithms that provide general methodologies for optimising policy regularisation in a Markov decision process.
Strengths
originality: the paper proposes a novel approach for generating diverse game levels using ensemble reinforcement learning and policy regularisation. The paper develops two theorems to provide general methodologies for optimizing policy regularisation in a Markov decision process. The first theorem is a regularised version of the policy iteration algorithm, which is a classic algorithm for solving MDPs. The second theorem is a regularised version of the policy gradient algorithm, which is another classic algorithm for solving MDPs.
quality: the paper provides a detailed description of the proposed approach and the regularisation technique. The paper also provides theoretical proofs of the regularised versions of the policy iteration and policy gradient algorithms.
clarity: the paper is well-written and easy to follow. The authors provide clear explanations of the proposed approach and the regularisation technique.
Weaknesses
the proposed approach assumes that the reward function is known and fixed. However, in practice, the reward function may be unknown or may change over time. Therefore, the proposed approach may not be applicable in such scenarios.
the paper only considers a single game genre (platformer) and a single game engine (Super Mario Bros.). The proposed approach may not be directly applicable to other game genres or engines.
Questions
Q1: can the proposed approach be directly applicable to 3D game levels or other types of game content? or what are the difficulties in this extension, such as complex reward design or high computational burden?
Thank you for providing insightful comments. We hope the following response addresses your concerns. Our paper has been revised and updated according to the reviewers' comments; the changes are highlighted in blue.
Q1: can the proposed approach be directly applicable to 3D game levels or other types of game content? or what are the difficulties in this extension, such as complex reward design or high computational burden?
Our method can be directly applied to 3D game levels such as Minecraft generation [1] and other types of game content such as music [2] and narrative generation [3], since RL has been applied to generate such content. Those applications could benefit from our proposed method since we only change the RL algorithm while keeping the problem setting unchanged.
When applying our approach to other scenarios, the design of the reward can be challenging, as indicated by the reviewer. However, the procedural content generation (PCG) community has proposed a range of evaluation metrics [4] which can serve as reward functions in our framework. Taking Minecraft as an example again, Jiang et al. [1] have proposed reward functions to train controllable RL-based game level generators, and there is also a set of evaluation metrics validated against human evaluation scores [5]; those metrics could be used as reward functions. For the issue of high computational burden, we have devised and implemented an asynchronous framework to speed up the training. In addition, the use of an action decoder, proposed in [6], contributes to faster generation speeds. There are generative models that can generate high-quality 3D game levels, such as World-GAN [7] for Minecraft world generation and a VAE-GAN for 3D indoor scene generation [8]. By employing these models as decoders, our method can be directly applied to the corresponding application scenarios with rapid generation speeds.
[1] Jiang, Zehua, et al. "Learning Controllable 3D Level Generators." Proceedings of the 17th International Conference on the Foundations of Digital Games. 2022.
[2] Jaques, Natasha, et al. "Generating music by fine-tuning recurrent neural networks with reinforcement learning." (2016).
[3] Huang, Qiuyuan, et al. "Hierarchically structured reinforcement learning for topically coherent visual story generation." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. No. 01. 2019.
[4] Shaker, Noor, Julian Togelius, and Mark J. Nelson. "Procedural content generation in games." (2016): 978-3.
[5] Hervé, Jean-Baptiste, and Christoph Salge. "Comparing PCG metrics with Human Evaluation in Minecraft Settlement Generation." Proceedings of the 16th International Conference on the Foundations of Digital Games. 2021.
[6] Shu, Tianye, Jialin Liu, and Georgios N. Yannakakis. "Experience-driven PCG via reinforcement learning: A Super Mario Bros study." 2021 IEEE Conference on Games (CoG). IEEE, 2021.
[7] Awiszus, Maren, Frederik Schubert, and Bodo Rosenhahn. "World-gan: a generative model for minecraft worlds." 2021 IEEE Conference on Games (CoG). IEEE, 2021.
[8] Li, Shuai, and Hongjun Li. "Deep Generative Modeling Based on VAE-GAN for 3D Indoor Scene Synthesis." International Journal of Computer Games Technology 2023 (2023).
Besides the above response to the reviewer's question, we would like to clarify the significance of this work regarding the reviewer's comments.
Regarding the comment "the proposed approach assumes that the reward function is known and fixed. However, in practice, the reward function may be unknown or may change over time. Therefore, the proposed approach may not be applicable in such scenarios.": Although the reward function used in this paper is known and fixed, such a problem setting has extensive application scenarios in the domain of game content generation [1,2,3,6]. However, the reviewer has raised an interesting future research direction. Our proposed method can be made compatible with other approaches to tackle the challenge of unknown or unfixed rewards. For the issue of unknown rewards, inverse reinforcement learning can be used to build some reward functions from human demonstrations [9]. In the context of game content generation, it is possible to collect human-authored levels as demonstrations and learn a reward model from them with inverse reinforcement learning. Our method can be combined with inverse reinforcement learning to address tasks with unknown rewards. When facing the challenge of unfixed rewards, integrating reward estimation and replay sampling techniques [10] into our framework can be a viable solution. By combining our approach with these techniques, one can handle game content generation tasks with unknown or unfixed rewards.
Regarding the comment "the paper only considers a single game genre (platformer) and a single game engine (Super Mario Bros.). The proposed approach may not be directly applicable to other game genres or engines.": This paper employs Super Mario Bros. as the benchmark for our approach, as it is commonly used and representative within the Procedural Content Generation (PCG) community, and it is open-source. For more details on the potential applications of our approach to other games, please refer to our response to Q1.
[9] Arora, Saurabh, and Prashant Doshi. "A survey of inverse reinforcement learning: Challenges, methods and progress." Artificial Intelligence 297 (2021): 103500.
[10] Chen, Shi-Yong, et al. "Stabilizing reinforcement learning in dynamic environment with application to online recommendation." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.
Problem Setting
We want to do level generation but we want to induce some diversity in how the levels are generated. Here the policies define level generators which build the level through an MDP.
Algorithm / NN Structure
The policies are defined as mixtures of Gaussians, implemented by creating a set of sub-policies along with a weighting. The weighting itself is modelled by a selector policy, creating a form of hierarchy.
The sub-policies are regularized to be diverse from each other using a Wasserstein distance. This distance is clipped, encouraging policies to be diverse only if their decisions are too close. Sub-policies have Gaussian action heads. Regularization is implemented as an auxiliary reward.
Regularized versions of policy iteration and the policy gradient are presented. Thus the agent is trained to optimize for diversity not only in the current timestep, but also in future timesteps via a regularized value function.
The practical implementation is built off SAC.
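To make the regularisation concrete, here is a minimal sketch of how a clipped, selector-weighted pairwise 2-Wasserstein bonus could be computed for diagonal-Gaussian action heads. This is our own reading of the description above, not the authors' code; in particular, weighting each pair by the product of selector probabilities is an assumption.

```python
import torch

def nc_bonus(mus, sigmas, weights, clip):
    """Clipped, selector-weighted pairwise 2-Wasserstein diversity bonus (sketch).

    mus, sigmas: (m, d) means and std-devs of the m diagonal-Gaussian sub-policies
    weights:     (m,)   selector probabilities for the current state
    clip:        scalar; pairs already farther apart than this stop contributing
    """
    m = mus.shape[0]
    bonus = mus.new_zeros(())
    for i in range(m):
        for j in range(i + 1, m):
            # closed-form 2-Wasserstein distance between two diagonal Gaussians
            w2 = torch.sqrt(((mus[i] - mus[j]) ** 2).sum() + ((sigmas[i] - sigmas[j]) ** 2).sum())
            bonus = bonus + weights[i] * weights[j] * torch.clamp(w2, max=clip)
    return bonus  # added to the environment reward as an auxiliary term
```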
Experiments are presented on the Mario level generation benchmark. Results show that NCERL is able to achieve comparable reward to other methods, and outperforms them in terms of diversity.
Strengths
This paper proposes a clean and thorough study of a method to induce diverse level generation. The idea is to define policies as a mixture of sub-policies, then regularize those sub-policies so that diversity is increased. While this is a straightforward idea, it is especially applicable in a domain such as level generation where diversity is desired in itself rather than as simply a means towards exploration. The quality of the writing and presentation is solid and clear. Theoretical results are presented re-deriving the policy iteration and policy gradient update explicitly in terms of regularizing the diversity between sub-policies, and proofs are presented regarding convergence. The significance of this work stems from its thorough theoretical contributions.
Weaknesses
Because the experiments are largely domain-specific and improve on diversity rather than pure performance, the significance of this work is limited.
While there are novel derivations and a clean interpretation of regularizing the policy gradient, the idea of representing an agent as sub-policies has been explored in fields such as skill discovery and hierarchical reinforcement learning, which were not referenced in this work.
The description of the domain is unclear to me as a reader, e.g. what is the action space of an agent generating Mario levels? What are the criteria used to evaluate reward and diversity? I would have liked to see examples of the generated levels.
Questions
See above for questions related to the experimental section.
The section on asynchronous evaluation seems orthogonal to the main contribution of the work. Asynchronous RL has been explored in the actor-critic setting (e.g. A3C), which this work uses as it builds off SAC. Is there a specific connection between the asynchronous implementation and the novel contribution here?
What does the behavior of the weighting-selector policy look like? It would provide more clarity into the method to showcase how often this selector policy utilizes specific sub-policies, or if certain sub-policies go unused.
It may help to label the other comparison methods in Figure 4.
Thank you for providing insightful comments. We are especially encouraged by your appreciation of our theoretical contribution. We hope the following response addresses your concerns. Our paper has been revised and updated according to the reviewers' comments; the changes are highlighted in blue.
Q1: The section on asynchronous evaluation seems orthogonal to the main contribution of the work. Asynchronous RL has been explored in the actor-critic setting (e.g. A3C), which this work uses as it builds off SAC. Is there a specific connection between the asynchronous implementation and the novel contribution here?
Indeed, the asynchronous evaluation is orthogonal to the main contribution of this work. Therefore, we removed it from the contribution statements in Section 1 of the revised paper.
Q2: What does the behaviour of the weighting-selector policy look like? It would provide more clarity into the method to showcase how often this selector policy utilizes specific sub-policies, or if certain sub-policies go unused.
If the λ used in training is large, the probabilities of selecting sub-policies change over time steps and all the sub-policies are used, while if λ is small, it is possible that some sub-policies go unused. To showcase the selection probabilities, we pick two NCERL generators, one trained with a small λ and one with a large λ, and record their selection probabilities while stochastically generating two levels each from the same initial state.
Regarding the generator trained with the larger λ, all the sub-policies are activated multiple times within the two trials. Regarding the generator trained with the smaller λ, some of the selection probabilities are near zero and sub-policies 2 and 3 are never used within the two trials. As the regularisation coefficient increases, the probabilities of selecting sub-policies become more uniform. The tables are included in Appendix E.2 of our revised paper together with images of the levels generated in those two trials. We also present them as follows for reference.
Selection probabilities of the NCERL generator trained with the larger λ, trial 1; bold entries indicate the sub-policies activated at the corresponding time steps.
Selection probabilities of the NCERL generator trained with the larger λ, trial 2; bold entries indicate the sub-policies activated at the corresponding time steps.
Selection probabilities of the NCERL generator trained with the smaller λ, trial 1; bold entries indicate the sub-policies activated at the corresponding time steps.
Selection probabilities of the NCERL generator trained with the smaller λ, trial 2; bold entries indicate the sub-policies activated at the corresponding time steps.
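As a rough, self-contained illustration of the mechanism behind these tables (our own sketch, not the implementation in the paper; `selector`, `sub_policies` and `decoder` are assumed callables), the selector outputs probabilities over sub-policies, one sub-policy is sampled, and its Gaussian head produces the latent action that the GAN decoder turns into a segment:

```python
import torch

def act(state, selector, sub_policies, decoder):
    """Sample one segment from a selector-weighted mixture policy (illustrative sketch)."""
    probs = torch.softmax(selector(state), dim=-1)             # selection probabilities
    k = int(torch.distributions.Categorical(probs).sample())   # pick one sub-policy
    mu, sigma = sub_policies[k](state)                         # its diagonal-Gaussian action head
    latent = torch.distributions.Normal(mu, sigma).sample()    # latent action
    segment = decoder(latent)                                  # GAN decoder -> level segment
    return segment, probs                                      # probs is what the tables report
```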
Q3: It may help to label the other comparison methods in Figure 4.
Thank you for your constructive suggestion. We have added legends to label the algorithms and λ values in Figure 4.
Q4: The description of the domain is unclear to me as a reader, e.g. what is the action space of an agent generating Mario levels? What are the criteria used to evaluate reward and diversity?
We have revised the description of the domain (state space, action space, reward criterion and diversity criterion) to make it clearer. In the revised paper, Section 3 “Problem Formulation” describes the state and action, Section 6.1 "Online Level Generation Tasks" briefly describes the reward, and Section 6.2 "Performance Criteria" introduces the reward and diversity criteria. Due to the page limit, more details, including the formulations of the reward functions and performance criteria, are presented in Appendices D.4 and D.1. The state space, action space, reward criterion and diversity criterion are described as follows.
State space: a continuous space whose dimensionality is the dimensionality of the action decoder's latent vector multiplied by the number n of recently generated segments considered in the reward function. A state is the concatenation of the latent vectors of the n most recently generated segments. If fewer than n segments have been generated, zeros are padded in the vacant entries.
Action space: a continuous space with the same dimensionality as the decoder's latent space. An action is a latent vector which can be decoded into a level segment by the decoder. The decoder is a trained GAN in this work.
Reward criterion: the reward criterion for a generator is computed for each level, i.e., each MDP trajectory, and then averaged over all levels generated for testing performance. The reward functions are adopted from previous level generation papers. They are briefly described in Section 6.1 "Online Level Generation Tasks" and formulated in Appendix D.1. The description has been revised to make it clearer.
Diversity criterion: the diversity score of a generator is calculated as the average Hamming distance over all pairs of levels generated for testing performance, where the Hamming distance counts how many tiles differ between the two levels being compared.
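For concreteness, a short sketch of the two mechanical pieces described above, under our own naming (the helper functions and the parameters n and d are illustrative, not the paper's code):

```python
import numpy as np

def build_state(recent_latents, n, d):
    """Concatenate the latent vectors of the n most recent segments,
    zero-padding the vacant entries when fewer than n segments exist."""
    state = np.zeros(n * d)
    for i, z in enumerate(recent_latents[-n:]):
        state[i * d:(i + 1) * d] = z
    return state

def diversity_score(levels):
    """Average pairwise Hamming distance (number of differing tiles)
    over all levels generated for testing."""
    dists = [
        np.sum(np.asarray(levels[i]) != np.asarray(levels[j]))
        for i in range(len(levels))
        for j in range(i + 1, len(levels))
    ]
    return float(np.mean(dists))
```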
Q5: I would have liked to see examples of the generated levels.
We have added some of the generated levels in Appendix E.3 of our revised paper. For both tasks, the levels generated by NCERL generators trained with several λ values are visualised and compared with those generated by SAC. These examples show that SAC generates similar levels while our proposed NCERL generates diverse levels. Complete generation samples of all trained generators are uploaded to our anonymous code repository (https://anonymous.4open.science/r/NCERL-Diverse-PCG-4F25/, the generation_results folder).
Besides the above point-by-point responses to the reviewer's questions, we would like to clarify the significance of this work regarding the reviewer's comments.
Regarding the comment "Because the experiments are largely domain-specific and improve on diversity rather than pure performance, the significance of this work is limited.": Although our work is domain-specific, online level generation is a broad and crucial area with extensive industrial applications. Procedural content generation via reinforcement learning is actively researched and applied in various gaming domains, including 2D [1] and 3D games [2], as well as virtual reality games [3]. On the other hand, many commercial games feature online level generation, for example, Minecraft, No Man’s Sky and a wide range of roguelike games like Spelunky and Hades. While traditional methods typically rely on human-crafted rules, machine learning-based methods can largely reduce the development cost [4-7]. Moreover, though we do not improve on reward, diversity itself is a goal of interest in not only game content generation, but also other RL directions like multi-agent RL [8] and quality-diversity RL [9].
[1] Khalifa, Ahmed, et al. "PCGRL: Procedural content generation via reinforcement learning." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. Vol. 16. No. 1. 2020.
[2] Jiang, Zehua, et al. "Learning Controllable 3D Level Generators." Proceedings of the 17th International Conference on the Foundations of Digital Games. 2022.
[3] Mahmoudi-Nejad, Athar, Matthew Guzdial, and Pierre Boulanger. "Arachnophobia exposure therapy using experience-driven procedural content generation via reinforcement learning (EDPCGRL)." Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment. Vol. 17. No. 1. 2021.
[4] Yannakakis, Georgios N., and Julian Togelius. Artificial intelligence and games. Vol. 2. New York: Springer, 2018.
[5] Shaker, Noor, Julian Togelius, and Mark J. Nelson. "Procedural content generation in games." (2016): 978-3.
[6] Liu, Jialin, et al. "Deep learning for procedural content generation." Neural Computing and Applications 33.1 (2021): 19-37.
[7] Guzdial, Matthew, Sam Snodgrass, and Adam J. Summerville. Procedural Content Generation Via Machine Learning: An Overview. Springer, 2022.
[8] Cui, B., Lupu, A., Sokota, S., Hu, H., Wu, D. J., & Foerster, J. N. (2022, September). Adversarial Diversity in Hanabi. In The Eleventh International Conference on Learning Representations.
[9] Wu, Shuang, et al. "Quality-Similar Diversity via Population Based Reinforcement Learning." The Eleventh International Conference on Learning Representations. 2022.
Regarding the comment "While there are novel derivations and a clean interpretation of regularizing the policy gradient, the idea of representing an agent as sub-policies has been explored in fields such as skill discovery and hierarchical reinforcement learning, which were not referenced in this work.": Indeed, the idea of representing an agent as sub-policies has been explored in skill discovery and hierarchical reinforcement learning. We have added the relevant reference [10,11] in the "Population-based RL" part of Section 2 of our revised paper.
[10] Pateria, Shubham, et al. "Hierarchical reinforcement learning: A comprehensive survey." ACM Computing Surveys (CSUR) 54.5 (2021): 1-35.
[11] Konidaris, George, and Andrew Barto. "Skill discovery in continuous reinforcement learning domains using skill chaining." Advances in neural information processing systems 22 (2009).
This paper introduces a method for generating diverse game levels online through an ensemble of negatively correlated RL generators. The authors derive a policy update operator under the diversity bonus. Apart from that, the authors propose an asynchronous framework for speeding up the training. Experiments show that the method is able to generate a wide range of policies by tuning the diversity coefficient λ.
Strengths
- The paper is written clearly. Hypotheses are well supported by the experiments.
- Originality looks good to me (or maybe I am not following the OLG line of research, but I study MARL diversity, in which there are no noticeably similar methods to my knowledge).
Weaknesses
To my understanding, this method adds a reward bonus/diversity constraint on the diversity among the policies, where the proof is, to some extent, established in the literature. The effect of adding diversity regularization is similar to that of quality diversity methods, where you pursue the optimum in the new reward space. The role of λ is close to the Lagrange multiplier in the dual formulation of the original problem with a diversity constraint. I think it is ok to include them as contributions to the paper, but building theoretical analysis on the interactions among ensemble policies or the regularization effect would be more interesting. My ratings are subject to change.
Questions
It is similar to (adversarial) diversity in populations of policies that learn incompatible policies or impose distance(entropy) regularization. Can the authors provide their view of how it compares to population-based methods with diversity regularization?
[1] Xing, D., Liu, Q., Zheng, Q., Pan, G., & Zhou, Z. H. (2021). Learning with Generated Teammates to Achieve Type-Free Ad-Hoc Teamwork. In IJCAI (pp. 472-478).
[2] Lupu, A., Cui, B., Hu, H., & Foerster, J. (2021). Trajectory Diversity for Zero-Shot Coordination. Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research 139:7204-7213. Available from https://proceedings.mlr.press/v139/lupu21a.html.
[3] Cui, B., Lupu, A., Sokota, S., Hu, H., Wu, D. J., & Foerster, J. N. (2022, September). Adversarial Diversity in Hanabi. In The Eleventh International Conference on Learning Representations.
[4] Rahman, A., Fosong, E., Carlucho, I., & Albrecht, S. V. (2023). Generating Teammates for Training Robust Ad Hoc Teamwork Agents via Best-Response Diversity. Transactions on Machine Learning Research.
[5] Charakorn, R., Manoonpong, P., & Dilokthanakul, N. (2022, September). Generating Diverse Cooperative Agents by Learning Incompatible Policies. In The Eleventh International Conference on Learning Representations.
[6] Rahman, A., Cui, J., & Stone, P. (2023). Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents. arXiv preprint arXiv:2308.09595.
Thank you for providing insightful comments. We hope the following response addresses your concerns. Our paper has been revised according to the reviewers' comments; the changes are highlighted in blue.
Q1: It is similar to (adversarial) diversity in populations of policies that learn incompatible policies or impose distance(entropy) regularization. Can the authors provide their view of how it compares to population-based methods with diversity regularization?
We appreciate the papers recommended by the reviewer; they have been very helpful. We have checked all of them and found that [2,3,5] are the most relevant, and we have included them in the "Population-based RL" part (renamed from "Policy Ensemble in RL") of Section 2 in our revised paper. This section also includes a discussion of population diversity in ensemble RL. We discuss papers [2], [3] and [5] as follows.
- Paper [2] uses the JS divergence of individual agents’ trajectories as a diversity regularisation. It is similar to our diversity regularisation, but we use a different distance metric and apply it to decision distributions rather than trajectories.
- Paper [3] considers adversarial diversity, which makes a policy different from a “repulser” policy. This is realised by modifying the TD target, while our approach uses the distance between the decision distributions of the sub-policies.
- Paper [5] learns incompatible policies within a joint policy, meaning that substituting a policy in the joint policy with an incompatible policy causes a significant deterioration in performance. Our sub-policies can be viewed as a kind of incompatible policies, but the measurement is a distance metric instead of performance.
The core differences between our approach and those approaches are: (1) our method makes decisions with all individual policies acting as a whole, whereas in those methods each individual policy makes its own decisions; (2) those works treat the diversity of the policy population as the goal, while our work focuses on the diversity of the generated levels.
[2] Lupu, A., Cui, B., Hu, H., & Foerster, J. (2021). Trajectory Diversity for Zero-Shot Coordination. Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research 139:7204-7213. Available from https://proceedings.mlr.press/v139/lupu21a.html.
[3] Cui, B., Lupu, A., Sokota, S., Hu, H., Wu, D. J., & Foerster, J. N. (2022, September). Adversarial Diversity in Hanabi. In The Eleventh International Conference on Learning Representations.
[5] Charakorn, R., Manoonpong, P., & Dilokthanakul, N. (2022, September). Generating Diverse Cooperative Agents by Learning Incompatible Policies. In The Eleventh International Conference on Learning Representations.
Regarding the comment "... I think it is ok to include them as contributions to the paper, but building theoretical analysis on the interactions among ensemble policies or regularization effect would be more interesting.": Indeed, theoretical analysis of the interactions among ensemble policies or the regularisation effect is an interesting future direction. Unfortunately, we were unable to conduct a theoretical analysis of these aspects during this rebuttal period. In our future work, we plan to delve into such theoretical analysis. For now, we provide experimental evidence showcasing the interactions among ensemble policies in Appendix E.2 of the revised paper. Specifically, we report the selection probabilities of each sub-policy during generating a level. We observe the selection probabilities of the sub-policies can be adaptively adjusted during generating levels. The selection probabilities also become more uniform as the regularisation coefficient increases. We hope this added analysis can partially address your concern.
We would like to thank all the reviewers for their careful reviews and valuable comments. We have revised the paper to address the reviewers' comments and improve the presentation and clarity. Following the reviewers' suggestions, we also added some new analysis and some examples of generated levels to the Appendix. Complete generated levels of all trained generators can be found in our anonymous code repository (https://anonymous.4open.science/r/NCERL-Diverse-PCG-4F25/, the generation_results folder). The main updates of our paper are listed below. The revisions are highlighted in blue in the paper.
- Figure 1 and its caption have been revised as suggested by Reviewer cq3K.
- We have removed the asynchronous framework from the contribution statements in the Introduction since it is orthogonal to the main contribution of the work, as pointed out by Reviewer qX6Q.
- We have added some additional related works to Section 2, as Reviewers NGar and qX6Q pointed out relevant works that we did not cover in our original paper.
- Problem formulation: following the comments by Reviewers qX6Q and cq3K, Section 3 has been revised to better describe our considered level generation environment, and Section 6.1 "Online Level Generation Tasks" has been revised to better explain the reward functions used in our work.
- We have clarified the description of the Wasserstein distance in Section 3.2 and added Appendix C.2 to provide its formulation, as suggested by Reviewer cq3K.
- Tables 5 and 6 have been added in Appendix E.1 to show the performance of each independent NCERL generator, to better analyse the training results and answer Reviewer cq3K's question.
- Some illustrative examples have been added to Appendix E.2 (Tables 7 and 8) to showcase the selection probabilities of sub-policies and address Reviewer qX6Q's questions.
- We have added a comparison of levels generated by SAC and levels generated by some NCERL generators (Figures 6-11 in Appendix E.3). Complete generation results are available in our anonymous code repository.
This paper addresses the problem of generating a diverse sequence of levels or level segments that also correspond to certain quality metrics. This is a particular form of quality-diversity problem, dealing with generating a sequence of environments. Casting it as essentially a hierarchical RL problem, and looking for negatively correlated sub-policies, is clever. While this may seem to be a somewhat niche problem, I think it might also be useful for generating sequences of educational materials, and possibly for open-ended learning.
The paper is well-written and the results are convincing. I think it should be accepted.
Why not a higher score
The problem is of less general relevance than some others.
Why not a lower score
Well-done paper describing a novel approach; no real issues.
Accept (poster)