Adversarial Environment Design via Regret-Guided Diffusion Models
We propose a novel UED algorithm that uses regret-guided diffusion models to improve agent robustness.
Abstract
Reviews and Discussion
This work focuses on Unsupervised Environment Design (UED), a problem setting in which a teacher designs environments for a student agent that is learning to solve the task. This area of research has been in focus recently due to its ability to train more general agents in an open-ended setting. The authors look to build on recent work using generative models to generate environments, and propose a diffusion-based approach, ADD. The method seems sound and leads to gains in two of the canonical UED environments. I recommend acceptance as it incrementally improves the UED literature.
Strengths
The main strengths are listed below. Since I recommend acceptance, the focus of the review is on weaknesses, to improve the paper as much as possible for the camera-ready version.
- The method makes sense intuitively and seems to work as expected, especially with the curriculum results.
- The paper is well written and clearly builds on previous works (e.g. in the related work discussion).
- Experiments are well presented and show clear performance gains.
Weaknesses
- The x-axis of the plots shows total environment steps and not student gradient updates. This puts replay methods at a huge disadvantage since they take ~2x the number of environment steps, but the policy isn't training on those extra steps, so arguably they should not count when comparing the effectiveness of the curriculum. I think this should be changed to match the literature.
- The baselines are fairly weak, aside from PLR and ACCEL (which have been weakened by the previous point). Can the authors compare against CLUTR and show why ADD produces a better curriculum? To me that is the obvious baseline. Further, note the improvements to PAIRED proposed in [1], which make it a much stronger baseline. I would say the original PAIRED is more of a concept than a baseline at this point due to known deficiencies.
- There is a clear limitation of your method in that you require a pre-training phase with access to the environment, and you do not count these steps as part of training because it is "unsupervised". This is fine, but it then makes it wrong to show training steps for the replay methods. Please either 1) include the steps from this phase in the x-axis for ADD (which will drastically weaken performance) or 2) switch the x-axis to student gradient updates. Otherwise it is an apples-to-oranges comparison in your favour.
- I would love to see examples of the generated environments in the main body; for me it is more useful than the theory part, but maybe that is subjective :)
- More of a general comment: I find it strange that UED papers don't cite [2]. It is the largest-scale demonstration of the value of UED and motivates research on these methods. It explicitly shows PLR is effective for curricula over a massive task space with a 500M-parameter transformer-based policy. This clearly shows that UED research can have impact on large-scale AGI-focused projects, so I would have expected the UED community to all be very excited about this.
[1] Mediratta et al. "Stabilizing Unsupervised Environment Design with a Learned Adversary." CoLLAs 2023.
[2] Bauer et al. "Human Timescale Adaptation in an Open-Ended Task Space." ICML 2023.
Questions
- The metadata plots show ACCEL starting with the same number of stumps, etc., as the other baselines. However, in the original implementation/paper it should be starting with very simple terrains. Was this the case in your work?
Limitations
The limitation of requiring additional data to pre-train the diffusion model is not mentioned; I think this is actually the largest weakness by far. It could be interesting to see if pre-trained foundation models work well here too.
We appreciate Reviewer a2dv for the valuable feedback and review. Below is our response to the reviewer's comments and questions.
Weak 1: About the x-axis of the plots
We agree that using total steps is disadvantageous for replay methods. However, we want to point out that if we use student updates, it becomes disadvantageous for methods that do not use replay. We deliberated on this issue when organizing the experimental results, and decided to use total steps for two reasons: (1) it is common in reinforcement learning (RL) research to evaluate the performance of an online RL algorithm after a fixed number of environment steps; (2) replay methods prior to ACCEL also used total steps for the x-axis [1, 2].
Despite our decision, we realize that there could be a controversy about this issue. Hence, to accommodate the perspective that using the number of policy updates as a metric is more reasonable, we measured the performance of our method trained for only half of the environment steps used in the original experiment. When using half of the environment steps, the average score of our method in the bipedal walker domain is 127.4 +/- 16.0, and the average success rate in the maze environment is 0.72 +/- 0.04. We believe the performance is still competitive compared to the baselines, and we will add these results to Appendices C.1 and C.2.
Weak 2: About the baselines
We agree that it would be beneficial to compare our method against CLUTR [3] and PAIRED with high entropy and protagonist-antagonist behavioral cloning [4]. However, we want to point out that [3] was proposed to solve the combinatorial explosion issue and that the bipedal walker domain is not a target domain. Furthermore, [4] eventually utilizes replay in the bipedal walker domain, so it is not an adequate method to represent learning-based methods. Based on these points, we did not include them in the baselines, but since experiments in the maze environment are possible, we will conduct additional experiments on these baselines and add the results to Appendix C.1.
Weak 3: About including the pre-training steps in the x-axis
We want to point out that there is no interaction with environments during the pre-training phase, since we randomly sample parameters without access to environments and train the diffusion model using the sampled parameters. So we disagree with the reviewer's suggestion that the steps from the pre-training phase should be counted in the x-axis. However, in the spirit of the reviewer's point, we agree that the time spent on pre-training should be considered for a fair comparison. To resolve this issue, we will include the performance measured after a number of environment steps adjusted to reflect the time spent on pre-training. Since pre-training took much less time than agent training (9 hours vs 56 hours in the maze environment, and 7 hours vs 92 hours in the BipedalWalker environment), we expect there will be no drastic performance degradation.
Weak 4: Examples of generated environments in the main body
We will resize Figures 7 and 12 and add them to Sections 5.1 and 5.2, respectively.
Weak 5: An additional citation
We appreciate the reviewer for pointing out the remarkable research [5] related to UED. We will add it to our citations.
Question 1: About the meta data plots of ACCEL
The ACCEL paper [6] describes two implementation methods. The difference between the two methods lies in the domain from which the initial environment parameters are sampled. One method samples from the full parameter range, while the other restricts sampling to a range that ensures simple environments are generated. Starting with a simple environment and gradually evolving it is a simple yet powerful idea. However, when using ACCEL as a baseline, it might not be fair because it already incorporates prior knowledge of which parameters create simple environments. Thus, we chose to use the ACCEL implementation that samples parameters from the full parameter range as the baseline. As a result, complex environments are generated at the beginning of training, leading to differences from the original paper's plot.
References
[1] Jiang et al. "Prioritized Level Replay." International Conference on Machine Learning. 2021.
[2] Jiang et al. "Replay-Guided Adversarial Environment Design." Advances in Neural Information Processing Systems. 2021.
[3] Azad et al. "CLUTR: Curriculum learning via unsupervised task representation learning." International Conference on Machine Learning. 2023.
[4] Mediratta et al. "Stabilizing Unsupervised Environment Design with a Learned Adversary." Conference on Lifelong Learning Agents. 2023.
[5] Bauer et al. "Human Timescale Adaptation in an Open-Ended Task Space." International Conference on Machine Learning. 2023.
[6] Parker-Holder et al. "Evolving Curricula with Regret-Based Environment Design." International Conference on Machine Learning. 2022.
I already gave a high score for this paper, so the rebuttal was unlikely to see an increase. Please include a discussion of Weakness 1 in the main body. I don't agree with the comment "[4] eventually utilizes replay in the bipedal walker" because there are still stronger baselines than the ones in your paper. These should be in the main body; why wouldn't they be?
Thanks for the response. We misunderstood the reviewer's concern as a suggestion to replace PAIRED with [4]. We now understand the reviewer's point and will include the results of [4] in the main body.
This work applies regret-guided diffusion models to the UED setting in order to generate adversarial environments that preserve diversity.
Strengths
- The contributions are well motivated and appear to be novel.
- The writing is generally clear and concise.
- The paper is contextualized well within prior literature.
Weaknesses
- Minor typos/grammatical issues:
- Line 253: "challenging"—this word alone does not hold any descriptive power. Because the nature of the tasks is described next in impartial terms, this word seems redundant/rhetorical.
- The limitations do not mention the assumed structure of the environments; I am not convinced that diffusion can be applied to all environment parameterizations.
- Why is shortest path length used as a complexity metric? It seems to describe the tail of the distribution; Why not average path length? Also, number of blocks does not seem to be a well-motivated complexity metric either.
- While ADD “successfully generates adversarial environments while preserving diversity”, the behavior improvements seem to be minimal based on the training curves / zero shot performance.
- A more general point: it is hard to tell what exactly the authors are trying to show from the experiments; the conclusion nicely sums up the analysis, but it would have been nice to have been told at the top of “experiments” what each experiment is trying to test.
- Given the prevalence of UED domains in the literature, a third evaluation domain would really strengthen this work. Otherwise, I am concerned about the applicability of this method to other environments, especially since the authors have not addressed the implicit assumptions about workable environment parameterizations for their diffusion approach.
Questions
- What are the assumptions on the structure of the environment parameterization? Can this be applied to all simulated environments? Or only a subset with certain properties?
- Why were RGB channels used to describe the maze environment instead of e.g. a binary (continuized for the diffusion process); it seems odd to parameterize the environment based on the observation space, and then to extract these parameters themselves based on the observation values. Are there certain parameterizations for the diffusion process that work better/worse? What properties does this representation space need to have?
- Are the t-SNE embeddings generated using all points, and then each individual plot displays just the relevant method’s points?
Limitations
- The limitations are outlined accurately by the authors in Section 6. However, I would like the authors to address the first question above, which I believe is an additional limitation that would be worth discussing.
We appreciate Reviewer 1NF7 for the valuable feedback and review. Below is our response to the reviewer's comments and questions.
Weak 1: Minor typo "challenging"
We agree that the word "challenging" in line 253 is redundant, so we will exclude it.
Weak 2: About complexity metrics: shortest path length and number of blocks
We used shortest path length and the number of blocks following prior works [1, 2, 3]. While the number of blocks may not always correlate with an environment's complexity, there is a tendency for environments to become more maze-like and challenging as the number of blocks increases when the goal and start positions are fixed. Thus, considering both the number of blocks and the shortest path length can give a rough estimate of the environment's complexity. Additionally, we agree that a metric considering the lengths of all paths can better explain the complexity of an environment. However, since there are numerous paths to the goal within the time limit, simply averaging these paths does not meaningfully explain complexity. Hence, a carefully designed metric considering the lengths of possible paths is needed, and we believe this is an area that requires further exploration by UED researchers.
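For concreteness, the shortest path length we report can be computed with a standard breadth-first search over the maze grid. The sketch below is illustrative only and assumes a simple 0/1 wall encoding and (row, col) start/goal coordinates; it is not taken from our implementation.

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """BFS over a 2D grid maze.

    grid: list of lists with 0 = open cell, 1 = wall (assumed encoding).
    start, goal: (row, col) tuples.
    Returns the number of steps on the shortest path, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in visited:
                visited.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None  # no path: the level is unsolvable

# Example: a 4x4 maze with two wall segments.
maze = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0]]
print(shortest_path_length(maze, (0, 0), (3, 3)))  # -> 6
```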
Weak 3: About the expression “successfully generates adversarial environments while preserving diversity”
This expression summarizes the complexity and diversity of the environments generated during the training process, which are shown in Figures 2(c) and 2(d). We would like to point out that it refers to the generated curriculum, not to the agent's performance.
Weak 4: What each experiment is trying to test
To help readers better understand the purpose of each experiment, we will improve the explanation in Line 266-272 as follows. "The primary goal of our experiments is to evaluate the generalization capability of the trained agent. To this end, we measured the zero-shot transfer performance in challenging, human-designed environments. Additionally, to understand where the differences in performance originate, we conducted quantitative and qualitative analyses of the curriculum of the generated environments by tracking complexity metrics and generating t-SNE plots."
Weak 5: A third evaluation domain and applicability
Regarding the reviewer's concern about applicability, we would like to point out that most tasks used in current UED research either have a form similar to Minigrid or utilize a continuous parameter space. Hence, sampling environment parameters using a diffusion model is feasible. Therefore, we agree that experiments in a third domain would strengthen our work, but they are not necessary to demonstrate our method's applicability. For a deeper discussion of workable parameterizations, please refer to the response to Question 1.
Question 1: About the assumptions on the structure of the environment parameterization
The assumed environment parameterization is one that can be learned by the diffusion model. If the parameter space is continuous, methods such as DDPM [4] can be applied. Even when the parameter space is discrete, methods such as D3PM [5] can be applied. Hence, we believe that our algorithm can cover most tasks that can be handled by existing UED methods, which model the environment generation process as an MDP [1] or randomly sample environment parameters [2, 3].
Even for environment parameters requiring constraints, such as ensuring that a path always exists from start to goal in a maze environment, we can utilize guided diffusion and rejection sampling to meet the constraints. Furthermore, when the constraint is complex, the diffusion model can be an effective solution, as discussed in [6]. Therefore, our method has the potential to handle both continuous and discrete parameters with constraints. However, if the environment itself is challenging to parameterize, it would be difficult to apply our method and other UED algorithms.
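As a rough illustration of the rejection-sampling idea, the sketch below assumes hypothetical `sample_levels` (the guided diffusion sampler) and `is_solvable` (a connectivity check) callables; these names are placeholders and the code is not part of our released implementation.

```python
def sample_feasible_levels(sample_levels, is_solvable, n_levels,
                           batch_size=64, max_tries=100):
    """Rejection sampling: draw candidate levels from the (guided) diffusion
    model and keep only those satisfying the constraint, e.g. start-to-goal
    connectivity. Both callables are assumed, not part of the paper's code."""
    accepted = []
    for _ in range(max_tries):
        for level in sample_levels(batch_size):   # candidate environment parameters
            if is_solvable(level):                # constraint check, e.g. BFS reachability
                accepted.append(level)
                if len(accepted) == n_levels:
                    return accepted
    raise RuntimeError("Constraint too restrictive for plain rejection sampling.")
```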
We hope this response sufficiently addresses the reviewer's question. If there is an example of a parameterization where challenges in applying our approach are expected, we would be happy to discuss it further.
Question 2: About RGB channels used to describe the maze environment
The representation that we used is a continuized binary one, where open spaces are represented by 0 and other elements by 1. There are two reasons why we use three channels. First, it allows us to represent obstacles, a start point, and a goal point in different channels using binary values, which we believe is a natural representation. Second, it is easy to visualize. It is technically feasible to use a single channel and represent each element with a different value. We expect this would make training even easier than with our representation, as the reduced dimension could lead to faster learning. We hope we have understood the reviewer's question correctly and that our response addresses it sufficiently.
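For clarity, a minimal sketch of such a three-channel binary encoding is given below; the array layout, channel order, and helper names are illustrative assumptions rather than our exact code.

```python
import numpy as np

def encode_maze(walls, start, goal):
    """Encode a maze as a 3-channel binary tensor (assumed layout:
    channel 0 = walls, channel 1 = start cell, channel 2 = goal cell).
    `walls` is an HxW array with 1 for obstacles and 0 for open space."""
    h, w = walls.shape
    level = np.zeros((3, h, w), dtype=np.float32)
    level[0] = walls
    level[1, start[0], start[1]] = 1.0
    level[2, goal[0], goal[1]] = 1.0
    return level

def decode_maze(level, threshold=0.5):
    """Map a (possibly noisy) diffusion sample back to discrete maze elements
    by thresholding the wall channel and taking arg-max cells for start/goal."""
    walls = (level[0] > threshold).astype(np.int64)
    start = np.unravel_index(level[1].argmax(), level[1].shape)
    goal = np.unravel_index(level[2].argmax(), level[2].shape)
    return walls, start, goal
```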
Question 3: About t-SNE embeddings
We generated the t-SNE embeddings using all environment parameters, and each plot displays only the points corresponding to the respective methods. By doing so, we were able to compare the diversity of environments generated by each baseline.
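A minimal sketch of this procedure is shown below: a single t-SNE is fit on the pooled parameters so that all methods share one embedding space, and each panel then plots only that method's points. The `levels_by_method` data here is a random placeholder, not our actual environment parameters.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical data: method name -> (N, D) array of flattened level parameters.
levels_by_method = {
    "ADD": np.random.rand(200, 169),
    "PLR": np.random.rand(200, 169),
}

# Fit one t-SNE on the pooled data so all methods share the same embedding.
names = list(levels_by_method)
pooled = np.concatenate([levels_by_method[n] for n in names], axis=0)
embedded = TSNE(n_components=2, random_state=0).fit_transform(pooled)

# Plot each method's points in its own panel using the shared embedding.
splits = np.cumsum([len(levels_by_method[n]) for n in names])[:-1]
fig, axes = plt.subplots(1, len(names), figsize=(4 * len(names), 4))
for ax, name, pts in zip(axes, names, np.split(embedded, splits)):
    ax.scatter(pts[:, 0], pts[:, 1], s=4)
    ax.set_title(name)
plt.show()
```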
References
[1] Dennis et al. "Emergent Complexity and Zero-Shot Transfer via Unsupervised Environment Design." Advances in Neural Information Processing Systems. 2020.
[2] Jiang et al. "Replay-Guided Adversarial Environment Design." Advances in Neural Information Processing Systems. 2021.
[3] Parker-Holder et al. "Evolving Curricula with Regret-Based Environment Design." International Conference on Machine Learning. 2022.
[4] Ho et al. "Denoising Diffusion Probabilistic Models." Advances in Neural Information Processing Systems. 2020.
[5] Austin et al. "Structured Denoising Diffusion Models in Discrete State-Spaces." Advances in Neural Information Processing Systems. 2021.
[6] Yang et al. "Compositional Diffusion-Based Continuous Constraint Solvers." Conference on Robot Learning. 2023.
I thank the authors for their thorough response to my questions and concerns. I find the authors' responses convincing, and believe that the changes made in response to mine and the other reviewers' suggestions have also improved the manuscript. I am updating my score from "5: borderline accept" to "7: accept".
Thanks for the response. We are glad that our answers cleared up the reviewer's concerns.
This paper proposes an approach for gradient directed, regret-based UED based on guiding a pre-trained diffusion model.
Strengths
This paper addresses a major shortcoming of prior UED approaches. In the past, gradient-based UED approaches have been out-performed by sample-based or evolutionary approaches for searching for environments. While there was a general intuition that gradient-based approaches would ultimately be more scalable, it was difficult to realise this intuition because environment design is a high-dimensional optimisation problem for which RL algorithms have struggled. Using insights from generative modelling is a natural approach to bridging this gap, and this paper provides a few tricks which seem to have been missing from prior attempts.
Specifically, pre-training the diffusion model on random levels, and using learned guidance seems to be a particularly powerful combination.
Weaknesses
My main concern is that it seems like the baseline results in the bipedal walker domain don't replicate the results from prior work, suggesting some sort of bug/lack of tuning in the implementation. It appears that if I compare the numbers of ADD to the results in the ACCEL paper, it gets lower or equivalent performance to ACCEL itself? My main reason for not currently raising my score is questions about the accuracy of this evaluation.
There are some tricks that are rediscovered from prior work which should be attributed. Specifically, entropy regularisation for neural generators is studied in [1], and training a sort of critic to evaluate levels is studied in [2]. That being said, ADD puts a unique spin on both of these.
[1] Mediratta, Ishita, et al. "Stabilizing unsupervised environment design with a learned adversary." Conference on Lifelong Learning Agents. PMLR, 2023.
[2] Bhatt, Varun, et al. "Deep surrogate assisted generation of environments." Advances in Neural Information Processing Systems 35 (2022): 37762-37777.
For the citation of environment design, it is correct to cite the UED formalism as coming from Dennis et al., but for the more general concept of designing environments it would be best to also cite contemporaries POET [3] and GPN [4].
[3] Wang, Rui, et al. "Paired open-ended trailblazer (poet): Endlessly generating increasingly complex and diverse learning environments and their solutions." arXiv preprint arXiv:1901.01753 (2019).
[4] Bontrager, Philip, and Julian Togelius. "Learning to generate levels from nothing." 2021 IEEE Conference on Games (CoG). IEEE, 2021.
It would also be best to be careful not to equate minimax regret based UED approaches with all UED approaches as is done on line 83, as there are many non minimax regret approaches to UED such as POET, SAMPLR, CLUTR, and DRED.
Pre-training on random levels seems like it leaves only a limited amount that the diffusion model could learn. It would be interesting to periodically fine-tune the diffusion model on newly generated levels to increase the power of the generator over time. This would generally be much more convincing as it could scale much further off of the distribution of random levels.
Clarity
It would be helpful to include the per-transfer environment bar plots as is traditional with UED papers; it seems like the same information is included in Tables 5 and 6, but they are much harder to read in that format, and it is difficult to tell where error bars overlap.
It would also be helpful to include the bootstrapped CI plots often used in UED papers, as recommended by [5].
Figure 4 and Figure 9 would be amazing to include in the main body, and go a long way towards explaining and demonstrating the method.
[5] Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." Advances in Neural Information Processing Systems 34 (2021): 29304-29320.
How do you arrive at the ADD acronym? It is a bit difficult for me to remember it and what it stands for.
Questions
Have you tried replacing equation 12 with a PAIRED-style loss maximising the expectation between a protagonist and antagonist?
Limitations
See weaknesses.
We appreciate Reviewer TVkr for the valuable feedback and review. Below is our response to the reviewer's comments and questions.
Weak 1: ACCEL results in the bipedal walker domain
There are two main differences between the original ACCEL paper and our experiments. First, the domain of environment parameters is different. As shown in Table 6 of the original ACCEL paper [1], ACCEL was implemented in two ways, ACCEL+ and ACCEL++ (we replace the dagger of the original paper with "+" for readability in the review format). ACCEL++, which shows better performance, sampled environment parameters from an easy parameter range where the difficulty level is inherently low (Table 1 below, or Table 9 of the ACCEL paper). In contrast, ACCEL+ and the other baselines generated environments from the full parameter range (Table 2 below). While ACCEL++ shows powerful performance, we decided to use ACCEL+ as a baseline since ACCEL++ can be seen as using prior knowledge of which parameters create simple environments. This is one of the reasons why the average score of ACCEL is lower than that in the original paper.
The second difference is that the ACCEL paper recorded performance after a fixed number of policy updates, whereas we recorded performance after a fixed number of environment steps, following other UED papers [2, 3] and traditional deep reinforcement learning research. Since ACCEL requires episodes without policy updates after mutating replayed environments, the number of environment steps per policy update is higher than for the other baselines. Therefore, when using environment steps as the metric, the performance of ACCEL degrades relative to that reported in the original paper.
We will include an explanation of this difference in the paper to ensure that readers are aware of these differences. Additionally, we will add the performance of our method trained for only half of the environment steps used in the original experiment, to accommodate the perspective that using the number of policy updates as a metric is more reasonable. In the bipedal walker domain, the average score of our method after half of the environment steps is 127.4 +/- 16.0, which we believe is still competitive compared to the baselines.
Table 1: Easy parameter range in the bipedal walker domain
| stump height | stair height | stair steps | roughness | pit gap |
|---|---|---|---|---|
| [0, 0.4] | [0, 0.4] | [1, 1] | [0, 0.6] | [0, 0.8] |
Table 2: Full parameter range in the bipedal walker domain
| stump height | stair height | stair steps | roughness | pit gap |
|---|---|---|---|---|
| [0, 5] | [0, 5] | [1, 9] | [0, 10] | [0, 10] |
Weak 2: Prior works and citations
We appreciate the reviewer for highlighting prior works ([4], [5]) related to our approach. We will add these to related works and reference them to strengthen our claims regarding the addition of the entropy term and the use of a learned critic. We will also cite POET [6] and GPN [7] to help readers better understand previous research on the general concept of environment design.
Weak 3: No minimax regret UED
We will add an explanation to the related works to clarify that there are UED methods with non-minimax regret objectives.
Weak 4: Periodically fine-tuning the diffusion model
We agree that periodically fine-tuning the diffusion model using newly generated levels could potentially result in a more powerful generator. However, we are not sure that fine-tuning the diffusion model using levels generated by the diffusion model itself would genuinely enhance the generator's capabilities. One possible approach to address this issue is to use another UED algorithm in parallel to generate levels that our diffusion model cannot create. We believe it could be an interesting future work.
Weak 5: Clarity
Additional plots: We will include per-transfer bar plots and bootstrapped CI plots below Tables 5 and 6 to improve the readability of the experimental results.
Figure 4 and Figure 9: We will include Figure 4 and its explanation in the beginning of Section 5.1. We will also include Figure 9 after Section 5.2 and add its explanation, which is described in Appendix A.5, after Section 4.4.
ADD acronym: The ADD acronym stands for "A"dversarial environment "D"esign via regret-guided "D"iffusion models. We welcome any suggestions for a better acronym.
Question 1: Replacing equation 12 with a PAIRED-style loss
We have not yet attempted to replace Equation 12 with a PAIRED-style loss, which predicts the difference between the antagonist and protagonist. However, we believe it would be a worthwhile experiment and could potentially yield competitive performance.
References
[1] Parker-Holder et al. "Evolving Curricula with Regret-Based Environment Design." International Conference on Machine Learning. 2022.
[2] Jiang et al. "Replay-Guided Adversarial Environment Design." Advances in Neural Information Processing Systems. 2021.
[3] Garcin et al. "DRED: Zero-Shot Transfer in Reinforcement Learning via Data-Regularised Environment Design." International Conference on Machine Learning. 2024.
[4] Mediratta et al. "Stabilizing Unsupervised Environment Design with a Learned Adversary." Conference on Lifelong Learning Agents. 2023.
[5] Bhatt et al. "Deep Surrogate Assisted Generation of Environments." Advances in Neural Information Processing Systems. 2022.
[6] Wang et al. "Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions." arXiv:1901.01753. 2019.
[7] Bontrager et al. "Learning to Generate Levels from Nothing." IEEE Conference on Games. 2021.
ACCEL was implemented in two ways, ACCEL+ and ACCEL++ (we replace dagger of the original paper with "+" for readability in the review format). ACCEL++
This makes sense! I think ACCEL++ is the canonical version, as one of the core insights of that paper is that there is often a small number of canonical "empty" levels which are a better initialisation for evolution. But it does make sense to compare against ACCEL+ given that the inductive bias is orthogonal to your approach and we want to isolate the effect of the different level optimiser. You should include them both and flag this nuance, but I agree this is a valid experimental methodology and a fair comparison.
We will include an explanation of this difference in the paper to ensure that readers are aware of these differences. Additionally, we will add the performance of our method trained for only half of environmental steps used in the original experiment to accommodate the perspective that using the number of policy updates as a metric is more reasonable.
It would also be useful to compare at the same number of steps as the original ACCEL, since the longer horizon will give a better sense of the long-term performance. Including this both in terms of environment steps and policy updates is interesting, and would be useful for the community to have a sense of the nuances in the current SOTA.
Given that I'm now convinced the empirical evaluation is correct, and since this paper presents a novel and promising attack against one of the biggest shortcomings of prior UED approaches, I expect it will have a large impact on the field. I'm raising my score to reflect this.
We are glad that our clarification addressed the reviewer's concerns. We will include the results of ACCEL++ and the performance of the baselines measured after a fixed number of policy updates.
This paper proposes a diffusion model with a differentiable regret estimate for unsupervised environment design. The authors write a diffusion process to model environment parameters, where the process is described in terms of a scoring function and the derivative of the regret. The scoring function is pre-trained on a set of random environments. The diffusion process is further fine-tuned with the regret to generate environments for curriculum learning of an agent. A critic, which is trained with a cross-entropy loss using binned environment returns, is used to approximate the regret. By using the derivative of the critic w.r.t. environment parameters, the diffusion process is fine-tuned with the entropy-augmented regret. Experimental results on continuous and discrete control domains show that the model is competitive with previous best models. Using regret guidance significantly improves the performance.
Strengths
The paper introduces a diffusion process which is well suited for modeling continuous parameters. It implements a differentiable regret approximation that better assigns credit for the diffusion process. The formulation through the optimal environment distribution with a trainable scoring function is also interesting.
Weaknesses
There are a few things that need more clarification and ablations.
- While the paper shows that the diffusion process with a differentiable regret estimate gives good results, it is not clear which component is the most critical. Is the diffusion process critical for the success? Can you train PAIRED with a differentiable regret?
- Similar to above, is the entropy term critical? Can you train the diffusion process without it and get comparable results?
- Can you explain the reason why you trained the critic model with binned returns rather than using actual returns in a regression objective?
- How critical is the pre-training for the scoring function? How does the performance change with fewer or more environment samples?
Questions
Please see above for specific questions.
Limitations
The paper addresses the limitations.
We appreciate Reviewer rCfY for the valuable feedback and review. Below is our response to the reviewer's comments and questions.
Weak 1: Is the diffusion process critical for the success? Can you train PAIRED with a differentiable regret?
The proposed algorithm critically relies on both the diffusion process and differentiable regret. Without a differentiable regret, using only the pre-trained diffusion process would theoretically result in generating random environments, failing to create a meaningful curriculum. This outcome is evident in Figures 2 and 3, where the performance of ADD without guidance is shown. Additionally, since differentiable regret is an approximation of the previously used regret estimation method, it is possible to train PAIRED [1] using this approach, but significant performance improvement is not expected.
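To make the role of the guidance explicit, the following is a schematic sketch of a single regret-guided reverse diffusion step in the classifier-guidance style. All names (`score_model`, `regret_critic`, `omega`, `betas`) are placeholders, and the weighting is simplified relative to Equations 7-8 of the paper; it is an illustration of the idea, not our implementation.

```python
import torch

def guided_reverse_step(x_t, t, score_model, regret_critic, omega, betas):
    """One schematic reverse-diffusion step with regret guidance.

    `score_model(x, t)` is assumed to return the score of the pre-trained level
    distribution, `regret_critic(x, t)` a per-sample regret estimate, and
    `betas` a 1-D tensor holding the noise schedule; all are hypothetical."""
    beta_t = betas[t]
    # Gradient of the regret estimate w.r.t. the noisy level parameters.
    x_in = x_t.detach().requires_grad_(True)
    regret = regret_critic(x_in, t).sum()
    regret_grad = torch.autograd.grad(regret, x_in)[0]
    with torch.no_grad():
        # Regret-guided score: pre-trained score plus a weighted regret gradient.
        score = score_model(x_t, t) + omega * regret_grad
        mean = (x_t + beta_t * score) / torch.sqrt(1.0 - beta_t)
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        return mean + torch.sqrt(beta_t) * noise
```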
Weak 2: Is the entropy term critical? Can you train the diffusion process without it and get comparable results?
The entropy term is also critical. Adding the entropy term ensures that the distribution from which environment parameters are sampled becomes the softmax distribution of regret, as shown in Equation 8. Without the entropy term, it would be challenging to implement the sampling of environment parameters that maximize regret using the diffusion process, to the best of our knowledge. However, by setting \omega to an extremely high value, we can effectively simulate the absence of the entropy term. We will include an ablation study exploring the performance changes with varying \omega values in the appendix. We report partial results in the following table.
Table 1: Performance in accordance with \omega in the partially observable navigation task. Blanks stand for experiments that are not finished yet.
| \omega | 5 | 10 | 20 | 40 | 80 | 160 |
|---|---|---|---|---|---|---|
| mean success rate | 0.85 +/- 0.05 | 0.81 +/- 0.05 | 0.82 +/- 0.03 | - | - | - |
Additionally, as reviewer TVkr pointed out, there is prior work [2] that claims adding a high entropy bonus when training the neural generator of PAIRED yields better performance. This supports our claim that the entropy term plays a critical role.
Weak 3: Can you explain the reason why you trained the critic model with binned returns rather than using actual returns in a regression objective?
The reason for training the critic model with binned returns, rather than using actual returns in a regression objective, is to obtain a differentiable regret estimate. We need to estimate the maximum and average returns, and to accurately reflect the stochasticity of the environment and policy, we trained a network to predict the distribution of returns, similar to distributional RL. We used binned returns following one of the foundational distributional RL studies, C51 [3]. However, it is also possible to use methods like Implicit Quantile Networks (IQN) [4], which learn the distribution with actual returns as the output and are known to yield better performance.
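To illustrate the binned-return training and how a differentiable regret-like quantity can be read off the predicted distribution, a schematic sketch is given below; the architecture, bin settings, and the simple regret proxy are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic critic over binned returns (C51-style). Sizes are assumptions.
N_BINS, R_MIN, R_MAX = 51, 0.0, 1.0
bin_values = torch.linspace(R_MIN, R_MAX, N_BINS)  # bin centres

class ReturnCritic(nn.Module):
    def __init__(self, level_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(level_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, N_BINS),               # logits over return bins
        )

    def forward(self, level):
        return self.net(level)

def critic_loss(critic, levels, returns):
    """Cross-entropy against the bin index of the observed return."""
    targets = torch.bucketize(returns.clamp(R_MIN, R_MAX), bin_values) \
                   .clamp(max=N_BINS - 1)
    return F.cross_entropy(critic(levels), targets)

def regret_estimate(critic, levels):
    """Differentiable regret proxy: gap between an (illustrative) maximum
    return and the expected return under the predicted distribution."""
    probs = F.softmax(critic(levels), dim=-1)
    expected = (probs * bin_values).sum(-1)
    max_return = bin_values[-1]       # simplistic upper bound, illustrative only
    return max_return - expected      # differentiable w.r.t. `levels`
```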
Weak 4: How critical is the pre-training for the scoring function? How does the performance change with less or more number of environment samples?
Pre-training the scoring function is an essential step. Through pre-training, the diffusion process becomes capable of sampling a wide range of environment parameters. This diversity enables the diffusion process to generate meaningful curricula when guided by the differentiable regret. As the number of environment samples used in pre-training increases, we expect the diffusion process's ability to generate diverse environments to improve, thereby enhancing overall performance. We are conducting experiments with a smaller number of environment samples, and we will add the results to the appendix.
References
[1] Dennis et al. "Emergent Complexity and Zero-Shot Transfer via Unsupervised Environment Design." Advances in Neural Information Processing Systems. 2020.
[2] Mediratta et al. "Stabilizing Unsupervised Environment Design with a Learned Adversary." Conference on Lifelong Learning Agents. 2023.
[3] Bellemare et al. "A Distributional Perspective on Reinforcement Learning." International Conference on Machine Learning. 2017.
[4] Dabney et al. "Implicit Quantile Networks for Distributional Reinforcement Learning." International Conference on Machine Learning. 2018.
Dear Reviewer rCfY,
As the Area Chair for NeurIPS 2024, I am writing to kindly request your attention to the authors' rebuttal for the paper you reviewed.
The authors have provided additional information and clarifications in response to the concerns raised in your initial review. Your insights and expertise are invaluable to our decision-making process, and we would greatly appreciate your assessment of whether the authors' rebuttal adequately addresses your questions or concerns.
Please review the rebuttal and provide feedback. Your continued engagement ensures a fair and thorough review process.
Thank you for your time and dedication to NeurIPS 2024.
Best regards,
Area Chair, NeurIPS 2024
We are writing this comment to help reviewers understand some of reviewer rCfY's concerns and our response, as well as to share the complete results of additional experiments. First, in response to the concern about the importance of the entropy term (Weak 2 in the rebuttal), we explained that the entropy term is crucial because it shapes the target distribution in a way that allows environment parameters to be sampled by the diffusion model. Since the influence of the entropy term diminishes as \omega in Equation 7 increases, we mentioned that we were conducting an ablation study on different \omega values and would include the results. The completed experimental results are shown in Table 1 below. From the results, we observed that as \omega becomes large, performance decreases, which highlights the importance of the entropy term.
Table 1: Performance in accordance with \omega in the partially observable navigation task.
| \omega | 5 | 10 | 20 | 40 | 80 |
|---|---|---|---|---|---|
| mean success rate | 0.85 +/- 0.05 | 0.81 +/- 0.05 | 0.82 +/- 0.03 | 0.64 +/- 0.07 | 0.47 +/- 0.16 |
Next, to address the concern about the number of samples used during the pre-training phase (Weak 4 in the rebuttal), we trained the diffusion model using 1 million samples, which is 100 times fewer than in the original experiment, and measured the performance of the proposed algorithm. The result is a mean success rate of 0.76 +/- 0.07 in the partially observable navigation task. This is about 11% lower than the result reported in the main text (but still outperforms the baselines), which supports our claim that a larger number of samples used in pre-training would lead to better performance. Additionally, we would like to point out that since we are dealing with an unsupervised setting and the samples used in pre-training are generated through random sampling, there is no need to worry about data scarcity.
We hope that this additional comment helps the reviewers better understand our rebuttal.
The paper "Adversarial Environment Design via Regret-Guided Diffusion Models" presents an approach to unsupervised environment design (UED) in reinforcement learning. The authors propose using a diffusion model guided by differentiable regret to generate adversarial environments, which leads to the creation of meaningful curricula for training agents. The reviewers agree that the paper makes a significant contribution to the field and recommend its acceptance.
The strengths of the paper lie in its use of diffusion models and differentiable regret to guide the generation of adversarial environments. The proposed method, ADD, demonstrates superior performance compared to state-of-the-art UED methods in terms of zero-shot transfer to challenging, human-designed environments. The authors provide a comprehensive analysis of the generated curricula, showing that ADD successfully creates diverse and increasingly complex environments that facilitate the learning of robust agents.
The reviewers raised several important questions and concerns, which the authors addressed thoroughly in their rebuttal. They clarified the critical role of the diffusion process, differentiable regret, and the entropy term in the success of their method. The authors also provided additional experimental results and ablation studies to support their claims and demonstrate the robustness of their approach.
While some reviewers suggested including additional baselines and evaluation domains, the authors convincingly argued that their chosen baselines and domains are representative of the current UED literature. They also discussed the applicability of their method to various environment parameterizations, as well as its potential to handle both continuous and discrete parameters with constraints. Overall, the reviewers' concerns have been adequately addressed in the rebuttal, and the authors have committed to improving the paper based on the feedback received. Given the contributions and the overall quality of the work, I recommend accepting this paper for publication at NeurIPS.