N-agent Ad Hoc Teamwork
Proposes a generalization of ad hoc teamwork to the N-agent setting.
Abstract
Reviews and Discussion
The paper introduces a novel problem setting within cooperative multi-agent systems, where a dynamically varying number of autonomous agents must cooperate with a set of uncontrolled teammates to achieve a common goal. This setting generalizes existing paradigms of cooperative multi-agent reinforcement learning (CMARL) and ad hoc teamwork (AHT). The authors propose the Policy Optimization with Agent Modeling (POAM) algorithm, which utilizes a policy gradient approach combined with agent modeling to enable agents to adapt to diverse teammate behaviors. The algorithm's effectiveness is demonstrated through empirical evaluations in multi-agent particle environments and StarCraft II tasks, showing improved performance over baseline approaches and better generalization to unseen teammates.
Strengths
The paper presents a new problem setting, NAHT, which extends existing frameworks by addressing more realistic scenarios where teams are not fully controlled or consist of a single adaptive agent. The methodology is well-structured, with a clear explanation of the problem formulation and the proposed algorithm. The paper is well-written, with a clear flow from problem definition to solution proposal and empirical validation.
Weaknesses
Although the authors raise a new problem, the solution they adopt lacks innovation. The encoder-decoder architecture used in the Agent Modeling Network is also very common in the field of opponent modeling. I did not learn anything new in terms of methodology.
While the empirical results are good, the paper could benefit from a more thorough theoretical analysis of the POAM algorithm's convergence properties and performance guarantees.
The scalability of the POAM algorithm with respect to the number of agents and the complexity of the tasks is not fully addressed. It would be beneficial to include more extensive experiments or discussions on how the algorithm performs as these factors increase.
Questions
Please refer to the weaknesses.
Limitations
Yes
Novelty - Encoder/Decoder-based Agent Modelling
While we understand the reviewer’s reservations about the prevalent use of encoder-decoder architectures for agent modeling, we believe this should not be the sole basis for assessing the novelty of our work.
Although the use of encoder-decoders for agent modeling was proposed as early as 2016 [1], several important recent works in AHT research [2, 3, 4] have contributed to the community by proposing creative ways to use the representations produced by the encoder-decoder network to tackle previously unaddressed problems and settings in AHT. Therefore, we argue that how the agent modeling component is used to address existing issues in AHT research should also be considered when measuring novelty.
Here, our innovation is the use of the agent modeling component’s representations during joint training of all controlled agents to address the NAHT problem. We show through our motivating example (Section 4) and the POAM-AHT experimental baseline (Fig. 3) that relying solely on agent modeling without joint training yields controlled agents whose joint policy is suboptimal even on the simplest NAHT problems. We believe the AHT community could build upon this insight to extend current AHT methods into NAHT methods. We expect this insight to lead towards better solutions to the NAHT problem, which may come from improvements to the agent modeling component or to the joint training process.
Convergence & Performance Guarantees, other theoretical analysis
From a theoretical perspective, POAM would inherit IPPO’s convergence guarantees for a nonstationary learning environment, which have recently been established [6].
In POAM, the encoder-decoder (ED) embeddings are detached from the computation graph before being passed to the actor and critic networks. Thus, the actor/critic updates do not change the weights of the ED networks. Conversely, as the ED update is fully independent of the actor/critic networks, updates to the ED do not change the weights of the actor/critic networks.
The effect of this architecture and update scheme is that the ED updates cause a small amount of nonstationarity for the PPO backbone learning algorithm. Empirically, we address this issue by using a small learning rate for the ED. Further, as POAM uses an independent learning scheme, another source of nonstationarity is the updates of the controlled agents themselves.
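To make this update scheme concrete, below is a minimal PyTorch-style sketch of the detached-embedding idea. The module choices, names, and dimensions are illustrative assumptions for the sketch, not POAM's exact implementation.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, embed_dim = 8, 4, 16

encoder = nn.GRU(obs_dim + act_dim, embed_dim, batch_first=True)  # ED encoder
decoder = nn.Linear(embed_dim, obs_dim + act_dim)                 # ED reconstruction head
actor   = nn.Linear(obs_dim + embed_dim, act_dim)
critic  = nn.Linear(obs_dim + embed_dim, 1)

history = torch.randn(1, 5, obs_dim + act_dim)  # dummy trajectory history (batch=1, T=5)
obs     = torch.randn(1, obs_dim)

_, h = encoder(history)
e_t = h[-1]                                     # ED embedding summarizing the history

# Detach: gradients from the actor/critic losses never reach the ED weights.
e_in   = e_t.detach()
logits = actor(torch.cat([obs, e_in], dim=-1))
value  = critic(torch.cat([obs, e_in], dim=-1))

# The ED is trained only by its own reconstruction loss (computed on e_t, not e_in),
# so ED updates do not change the actor/critic weights either.
recon = decoder(e_t)
```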
Scalability to Increasing Number of Agents & Task Complexity
In practice, the paper demonstrates that POAM is effective on tasks with a variety of team sizes. The task with the most agents is the 10v11 task, which has 10 agents on the allied agent team, and 11 enemies controlled by the game AI, which is a relatively large number of agents—especially in the context of AHT research, which typically considers only 2 agents.
We employ several techniques to help POAM scale to a larger number of agents.
- Multi-agent reinforcement learning: the backbone multi-agent learning algorithm is independent learning with parameter sharing [5], which aids scalability in the homogeneous-agent setting.
- Agent modeling: the prediction target dimension increases linearly with the number of agents. To prevent the output dimension of the decoder network from similarly scaling with the number of agents, the decoder employs parameter sharing as well (see the sketch below).
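As a rough illustration of the shared-decoder idea, the sketch below reuses a single decoder head for every modeled teammate, so the head's output size stays fixed as the team grows. The one-hot teammate-index conditioning and all shapes are our own assumptions for the sketch, not necessarily POAM's exact mechanism.

```python
import torch
import torch.nn as nn

embed_dim, obs_dim, act_dim, n_teammates = 16, 8, 4, 7

# One decoder head, reused for every modeled teammate; its output covers a single
# teammate's observation/action prediction, so it does not grow with team size.
shared_decoder = nn.Linear(embed_dim + n_teammates, obs_dim + act_dim)

e_t = torch.randn(1, embed_dim)            # ED embedding for one controlled agent
preds = []
for j in range(n_teammates):
    teammate_id = torch.zeros(1, n_teammates)
    teammate_id[0, j] = 1.0                # condition on which teammate is being predicted
    preds.append(shared_decoder(torch.cat([e_t, teammate_id], dim=-1)))
preds = torch.stack(preds, dim=1)          # (batch, n_teammates, obs_dim + act_dim)
```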
References
[1] He et al. Opponent modeling in deep reinforcement learning. ICML 2016.
[2] Papoudakis et al. Agent modelling under partial observability for deep reinforcement learning. NeurIPS 2021.
[3] Zintgraf et al. Deep Interactive Bayesian Reinforcement Learning via Meta-Learning. AAMAS 2021.
[4] Gu et al. Online ad hoc teamwork under partial observability. ICLR 2022.
[5] Christianos et al. Shared experience actor-critic for multi-agent reinforcement learning. NeurIPS 2020.
[6] Sun et al. Trust region bounds for decentralized PPO under non-stationarity. AAMAS 2023.
Thank you for your response. After carefully reviewing the rebuttal and considering the opinions of the other reviewers, I still believe this is a borderline paper, so I am maintaining my original score.
We thank R eZ5A for their time and care in going through the rebuttal along with the other reviewers' comments. We would be more than happy to address any further questions or concerns that the reviewer has.
This paper proposes a generalization to the ad hoc teamwork (AHT) setting where N agents follow a trained policy instead of just 1 agent or all agents. Within this NAHT framework, the authors describe a technique for modeling the other agents, accelerating the ability to learn in this setting relative to baselines.
Strengths
- The NAHT domain is an important contribution to the general field of MARL. In particular, this work specifically investigates settings with more than 2 players, which is often understudied in the broader AHT field.
- This paper investigates multiple settings and includes significant supplementary data in the appendix. Furthermore, all details for reproducing the results are present in the appendix.
- There are also many ablations, justifying each design decision in their POAM algorithm.
- The paper is well written and easy to read. The proofs presented in the appendix are also sound.
Weaknesses
- The "out of distribution generalization" results are unsatisfactory to me, undermining the central thesis that POAM benefits NAHT. In particular, the "mismatched" scores are quite high relative to cross-play scores when varying algorithms, indicating that algorithms learn compatible conventions when only varying seeds in these settings. This means that the OOD evaluation may not be representative of performance when paired with humans or new conventions. A game that requires more convention-dependent behavior (such as Hanabi) would provide a more informative evaluation for your technique. Otherwise, using explicit cross-play minimization techniques to provide a performance lower bound (i.e. agents that are explicitly training to maximize self-play while minimizing cross-play with your trained model), human subject studies, or models trained on human data could demonstrate the limits of your technique's out of distribution capabilities.
- The decision to use data from uncontrolled agents is not well justified in the paper. The value function is also "on policy" in PPO, so it is unclear why this additional data is helpful in NAHT training.
Questions
- Does the agent modeling network improve few-shot learning? In particular, if there is a new team and we let the POAM agent play with this configuration for multiple episodes (starting from the previous episode's hidden state instead of resetting), does this improve performance in future episodes? [Positive results in this direction would be very significant in the domain of few-shot ad hoc coordination.]
- Does NAHT training improve AHT performance?
- Why is the probability of predicting the correct action lower for the checkpoint at 19m timesteps versus 15m timesteps?
Limitations
The limitations section is solid, though I think it should mention that finding diverse (or human-like) conventions in new settings is still an open question (or point to existing works that may be integrated with POAM as a future direction).
Out-of-Distribution Evaluation - Environment Selection
Hanabi represents a challenging scenario for AHT and requires a large amount of computational resources. Prior papers have used billions of training steps to train agents, in contrast with the tens of millions used in our work [1,2]. Unfortunately, experimenting on Hanabi is not possible for this current paper due to computational constraints, but we agree that this game is very interesting for NAHT purposes.
There are two barriers to the reviewer’s suggested strategy of using explicit cross-play minimization techniques to generate opponents. First, existing cross-play minimization techniques such as BrDIV [3] are designed for the two-player AHT scenario and do not directly apply to the N-player scenario, and such an extension would merit its own paper.
Second, explicitly training agents to perform poorly with our agents violates a key assumption of AHT — that the uncontrolled teammates are acting in good faith / have some minimal level of competency.
Out-of-Distribution Evaluation - Agent Team Generation
We do not mean to claim that the OOD experiment would represent performance when paired with arbitrary conventions or against specific classes of interest, such as human opponents, and will modify the language in the paper to clarify the limited scope of the OOD experiment. These goals remain open challenges for the AHT, ZSC, and human-AI coordination communities. The goal of our work is to introduce the NAHT problem and a viable approach to solving it.
We also acknowledge that the teammate generation method suggested by the reviewer would accentuate the generalization challenge more than the teammates we have used in the OOD generalization experiments. However, the matched-mismatched experiment (Fig. 11) and cross-play tables (Tables 12-15) clearly demonstrate the generalization challenge. Specifically, the fact that performance decreases when we pair agents with another team trained under a different seed/algorithm implies none of the uncontrolled teams perform optimally against all other teams in evaluation. This emphasizes the need for adaptation and generalization when dealing with previously unseen uncontrolled agents. Note that other papers [2, 4] have also used the teammate generation method we propose to train agents that can robustly deal with a wide range of teammates.
Question - Agent Modeling for Few-Shot Learning
As POAM was not trained under the framework of allowing multiple episodes to learn about a new set of teammates, we do not expect that it would perform well under this setting. We agree that this would be an interesting contribution for few-shot AHT, but this is out of the scope of our current paper.
Question - NAHT methods for AHT Performance
Theoretically, an optimal NAHT agent should perform optimally for N=1, which is the AHT scenario. We confirm our hypothesis empirically in the new bit matrix results. Please see Section 1 and Table 1 of the rebuttal pdf for a discussion.
Question - Lower Probability for Correct Action as Training Progresses
POAM’s encoder-decoder (ED) models both controlled and uncontrolled agents.
Since the controlled agents are updated during the training process, this creates a moving target for the encoder-decoder, thus making the problem of modeling the controlled agents both more challenging and noisy. The observation that the overall action probabilities are lower at 19m than at 15m is largely due to noise in the modeling of the controlled agents.
Fig. 4 in the rebuttal pdf shows the probability of predicting the correct action for the uncontrolled agents and the controlled agents separately. Note that the action probabilities shown in Fig. 4 of the original paper correspond to the average of these two plots.
For the uncontrolled agents (Fig. 4, left), we observe that the accuracy of action predictions increases much more consistently as training progresses, and is higher than that for the controlled agents, indicating that the ED is able to model the uncontrolled agents more easily. From these plots, we can also see that the observed sudden decrease in action probabilities from 15m to 19m occurs for the controlled agents only.
References:
[1] Hu et al. Off-belief learning. ICML 2021.
[2] Hu & Foerster. Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning. ICLR 2020.
[3] Rahman et al. Generating teammates for training robust ad hoc teamwork agents via best-response diversity. TMLR 2023.
[4] Strouse et al. Collaborating with humans without human data. NeurIPS 2021.
Thank you for the detailed rebuttal! I appreciate the new rebuttal experiments answering my questions on AHT performance and the lower probability for correct actions.
I ultimately still believe that the weaknesses I laid out in my original review are still valid, so I stand by my score.
Just as a point regarding cross-play minimization algorithms, the papers "Adversarial Diversity in Hanabi" and "Diverse Conventions for Human-AI Collaboration" focus on acting in good faith despite training to minimize cross-play scores, but they did not explicitly study more than 2-player AHT scenario, so I agree that extensions to these algorithms would merit their own papers.
We thank R hc6T for going through our rebuttal carefully! The original review pointed out two weaknesses, and three questions, summarized below:
Weaknesses:
- The way that the out-of-distribution teammates were generated
- The missing rationale for using data from uncontrolled agents
Questions:
- Whether the agent modeling network improves few-shot learning
- Whether NAHT training improves AHT performance
- Why the probability of modeling the correct action is lower at 19m training steps than 15m training steps
We thank the reviewer for acknowledging that our rebuttal responses and experiments were satisfactory for all three questions. For the two weaknesses, we wish to emphasize the following points:
Weakness 1: Given that R hc6T agrees that the cited adversarial diversity methods are not applicable to the N-agent setting considered by us, we hope R hc6T would not see the evaluation as a major weakness. We believe that in the absence of an applicable method to generate teammates for evaluation, training agent teams via self-play is both (1) the best we can do for the StarCraft domains we test on, where hand-coding evaluation agent policies is not feasible, and (2) established experimental practice for AHT/ZSC papers published at top venues. In addition to references 2 and 4 in our rebuttal to hc6T (published at ICLR 2020 and Neurips 2021 resp.), please note that Papoudakis et al., published at Neurips 2021, also used a similar approach to generate teammates for evaluation [1].
Weakness 2: The common rebuttal includes our explanation on why using uncontrolled agent data for learning could be useful. We would appreciate it if R hc6T could look through our argument and evaluate whether it addresses their concerns.
[1] Papoudakis et al. Agent modelling under partial observability for deep reinforcement learning. NeurIPS 2021.
Thanks for the follow-up. I just want to clarify the weaknesses I stated, because I think there may have been some misunderstandings.
- My main concern regarding the OOD evaluation experiments is that "mismatched" scores are quite high relative to cross-play scores when varying algorithms, indicating that algorithms learn compatible conventions when only varying seeds in these settings. In particular, I'm agnostic to the method of generating the agents for evaluation (since I suggested either cross-play minimization, scripted agents, human BC models, or studying more convention-dependent domains) and I would've been fine with only varying seeds if they had low mismatched scores. Another alternative could've been keeping each algorithm in a held-out set and determining if your new algorithm can still generalize to held-out algorithms (since we see relatively lower cross-play scores across algorithms).
- My question regarding uncontrolled agent data was asking why using data from uncontrolled agents to train the value function (which is also on-policy) was valid. In a sense this is the inverse of the question answered in the common rebuttal.
We apologize for the misunderstandings, and thank the reviewer for clarifying their questions. We would like to address the two issues raised in the follow-up.
Issue 1:
We reiterate that we believe the drop in task score from the matched to the mismatched condition means that the current OOD teammates do still pose a generalization challenge, even if that challenge is not major. We will clarify the limited scope of our experiment in the paper.
However, we agree that the reviewer’s proposed modification to the OOD experiment would be an interesting direction to deepen the analysis of the OOD properties of POAM. We are currently following the reviewer's advice by splitting the MARL algorithms used to generate teammates into two sets: agents generated with QMIX, MAPPO, and IQL will be used for training, whereas agents generated with VDN and IPPO will be used for evaluation. We hope to present results from this experiment before the rebuttal period ends. However, if time does not permit and the paper is accepted, we will add the results to the Appendix.
Issue 2:
We apologize for our mistake in referencing the common rebuttal. R 9xa8 raised the same question as this reviewer, inquiring why it is valid to use off-policy data to update the value function. Our reply was actually in the specific response to R 9xa8 (copied below). We are glad that R 9xa8 found our argument convincing.
While the data from the uncontrolled agents is not “on-policy” with respect to controlled agents, here are some reasons why using off-policy data to update POAM’s critic network is useful.
Useful cooperative behavior can be learned more quickly by bootstrapping on transitions from the initially more competent, uncontrolled teammate policies (Section 5.6 of [1]). Early in training, value function learning from the competent uncontrolled agents' data teaches the controlled agents to highly appraise the uncontrolled agents' decisions that lead towards high returns. The controlled agents then learn to adopt similar decisions once the learned value function is used for policy updates. Also, [2] demonstrated improved data efficiency and stability in single-agent RL from using off-policy data to update the critic in an otherwise on-policy algorithm.
References:
[1] Rahman et al. A general learning framework for open ad hoc teamwork using graph-based policy learning. JMLR 2023.
[2] O'Donoghue et al. Combining policy gradient and Q-learning. ICLR 2017.
Below are the results of conducting the out-of-distribution (OOD) experiment described above, where the OOD teammates come from unseen algorithms (train set: QMIX, MAPPO, IQL, test set: IPPO, VDN), on the MPE-PP task. The results reported below are the mean and 95% confidence bounds. Even on this more challenging generalization task, we observe that POAM achieves higher task scores than IPPO-NAHT, with the effect being larger against the unseen IPPO teammates.
We hope this evaluation alleviates R hc6T's concerns. If accepted, we will add the full results and discussion to the Appendix.
| | IPPO-NAHT | POAM |
|---|---|---|
| VDN | 4.198 ± 0.830 | 4.905 ± 0.844 |
| IPPO | 5.340 ± 1.461 | 8.850 ± 1.076 |
Thank you for the explanations and the new results! I will update my score accordingly.
Thank you for going over the new results and explanations, and raising your score!
The paper proposes a MARL algorithm for the N-agent ad hoc teamwork setting. The algorithm specifically includes policy optimisation by modelling the other agents. The agent modelling uses an encoder-decoder architecture. In the actor-critic framework, the critic uses data from both controlled and uncontrolled agents. The agent modelling allows their approach to generalise well to scenarios with unseen teammates.
Strengths
- The problem setting is quite interesting and the proposed solution is quite simple and easy to implement.
- The paper is quite well-written and easy to follow. I liked the motivating example for NAHT in Section 4.
- The ablation studies are quite comprehensive.
Weaknesses
- The baselines considered in the paper do not seem to include algorithms that model other agents explicitly. Also, the related works section for agent modeling seems to be incomplete. There have been a lot of papers related to agent/opponent modelling. I would be curious to see how these baseline algorithms would compare against POAM.
- In the out-of-distribution experiments, the authors do not give a reason for why POAM performs much better in only a few scenarios. Is there anything specific within those scenarios that helps POAM generalise better?
Minor: The plots could be smoothed to make them more readable, especially Figure 4.
Relevant references:
[1] Everett, R., & Roberts, S. (2018, March). Learning against non-stationary agents with opponent modelling and deep reinforcement learning. In 2018 AAAI spring symposium series.
[2] Shen, M., & How, J. P. (2021, May). Robust opponent modeling via adversarial ensemble reinforcement learning. In Proceedings of the International Conference on Automated Planning and Scheduling (Vol. 31, pp. 578-587).
[3] Papoudakis, G., & Albrecht, S. V. (2020). Variational autoencoders for opponent modeling in multi-agent systems. arXiv preprint arXiv:2001.10829.
[4] Foerster, J. N., Chen, R. Y., Al-Shedivat, M., Whiteson, S., Abbeel, P., & Mordatch, I. (2017). Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326.
Questions
How would this work when there are teammates from more than 2 different teams? For example: Team A + Competitive Team B + Competitive Team C? Is there any assumption on the identification of the team the agent belongs to in the observation space?
Limitations
The experiments are carried out in only one environment (SMAC). It would be good to see POAM in more applied scenarios like search and rescue as mentioned in the introduction.
Points addressed in common rebuttal
- Additional agent modeling baselines and suggested references
- Number of test domains
On Off-policy data for training PPO critic
(R 9xa8) The decision to use data from uncontrolled agents is not well justified in the paper. The value function is also "on policy" in PPO, so it is unclear why this additional data is helpful in NAHT training.
While the data from the uncontrolled agents is not “on-policy” with respect to controlled agents, here are some reasons why using this particular off-policy data to update POAM’s critic network is useful.
Useful cooperative behavior can be learned more quickly by bootstrapping on transitions from the initially more competent, uncontrolled teammate policies (Section 5.6 of [1]). Early in training, value function learning from the competent uncontrolled agents' data teaches the controlled agents to highly appraise the uncontrolled agents' decisions that lead towards high returns. The controlled agents then learn to adopt similar decisions once the learned value function is used for policy updates. Also, [2] demonstrated improved data efficiency and stability in single-agent RL from using off-policy data to update the critic in an otherwise on-policy algorithm.
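To illustrate the idea, below is a minimal sketch (hypothetical names; the value targets would in practice be GAE-based returns rather than random tensors) in which the value-function regression pools transitions from both controlled and uncontrolled agents, while the clipped PPO policy loss uses controlled-agent data only:

```python
import torch
import torch.nn as nn

critic = nn.Linear(8, 1)

# Dummy (observation, value-target) batches; targets stand in for GAE-based returns.
obs_ctrl,   ret_ctrl   = torch.randn(32, 8), torch.randn(32, 1)
obs_unctrl, ret_unctrl = torch.randn(32, 8), torch.randn(32, 1)

# The critic regresses on data from both controlled and uncontrolled agents...
obs_all = torch.cat([obs_ctrl, obs_unctrl])
ret_all = torch.cat([ret_ctrl, ret_unctrl])
value_loss = ((critic(obs_all) - ret_all) ** 2).mean()

# ...whereas the PPO policy loss (not shown) uses only the controlled agents'
# transitions, since the policy update is sensitive to off-policy data.
```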
Insights on Out-of-Distribution Performance
It is difficult to pinpoint causal factors that lead to the difference in performance in these domains. While they are broadly used to test coordination and cooperation, there are multiple factors that could influence the difficulty of NAHT in these domains. Nevertheless, we can look at some of the results in these domains and make educated guesses.
For example, Fig. 11 in the Appendix shows the return of teams trained together (matched seeds) versus those of teams that were not trained together (mismatched seeds), across all naive MARL algorithms. We observed that for the three domains where there is a larger difference in performance between matched and mismatched seed teams (5v6, 3s5z, MPE-PP), there is also a larger difference in performance between POAM and IPPO-NAHT in the OOD experiments. This perhaps indicates that coordination is highly sensitive in those domains, so agent modeling makes a larger difference.
Clarity - Improved Figures
Thank you for the suggestion. We will add smoothing to all figures.
Questions - Addressing Cooperation With >2 Agent Teams
Addressing the question about the observation space first: POAM does not require the observation space to include any team ID information. Instead, the goal of POAM’s agent modeling module is to infer this information.
In theory, applying POAM to the scenario of Team A + Team B + Team C would be as simple as extending the training/evaluation framework to sample from two uncontrolled teams, rather than one. In practice, we anticipate credit assignment problems. For example, if the uncontrolled teams B and C do not interact well together, the overall team performance might drop greatly, but this would not be the fault of the controlled team A.
Addressing the scenario of blending multiple teams would be an interesting direction for future work.
References
[1] Rahman et al. A general learning framework for open ad hoc teamwork using graph-based policy learning. JMLR 2023.
[2] O'Donoghue et al. Combining policy gradient and Q-learning. ICLR 2017.
Thanks a lot for answering my questions. These responses as well as the global rebuttal increase my confidence about the paper being a good contribution. I would like to increase my score from 6 to 7.
Thank you for reading our responses carefully, and recognizing the contribution of our work!
This work is motivated by the realistic limitations of current multi-agent studies, specifically the assumption that either all agents are controllable or that only a single agent is controlled in the multi-agent system. To address this challenge, the authors introduce the N-agent ad hoc teamwork (NAHT) approach, in which multiple controllable agents interact with randomly selected uncontrolled agents. In the proposed framework, the controllable agent can make flexible decisions using information from its teammates. To implement this, the authors leverage an encoder-decoder structure designed to predict the observations and actions of other teams based on the agent’s trajectory history. The simulation results demonstrate improved performance compared to other multi-agent reinforcement learning baselines.
Strengths
- This paper is well motivated. The proposed solution is easy to implement.
- The proposed solution addresses a crucial aspect of the MARL field. The reviewer believes that implementing a method that can collaborate with unknown teammates can effectively enhance the practicality of MARL solutions.
Weaknesses
The reviewer is not convinced of the effectiveness of the agent modeling network for the following reasons.
- The proposed framework is defined under a partially observable domain, in which each agent cannot observe all the other teammates. For this reason, the reviewer wonders how the model can accurately infer other teammates’ observations and actions using only the agent’s trajectory history.
- The result in Figure 4 does not demonstrate the performance of the agent modeling network. In this result, the maximum action probability achieved through training is about 0.6. The reviewer thinks that this performance does not indicate that the model has been successfully trained.
This paper has some typos and needs to redefine some notations to enhance clarity.
- In line 13, 'Modelling' should be corrected to 'Modeling'.
- The reviewer suggests that the notation $h^t$ should be redefined as $h_i^t$. This change is necessary because it represents the trajectory history of agent $i$, and including $i$ in the notation will enhance clarity.
- The reviewer believes that the figures in the simulation results section have room for improvement.
- The authors have marked the baseline names in lowercase in the figure legends. The reviewer suggests using capitalized words for consistency with the manuscript.
- The authors refer to the results in the appendix for the simulation analysis in Section 6. The reviewer believes the analysis in the main manuscript should use the results presented in the main body.
Weak evaluation
- The selected demonstration task is limited to SMAC tasks.
- It is necessary to consider an appropriate task that can show the difference between AHT and NAHT problems.
- Line 163 ($h^t = \{o_i^k, a_i^{k-1}\}_{k=1}^t$): How to handle input where the history length changes?
- Line 316: What is the difference between non-controlled and uncontrolled agents?
Questions
Proposed solution
- The reviewer wants to know if the proposed solution works only with the PPO backbone. This question arises from the authors' reasoning about why they do not leverage data from uncontrolled agents. In line 193, the authors mention that the policy update is highly sensitive to off-policy data. However, the reviewer believes this issue might be mitigated by replacing the backbone algorithm with an off-policy algorithm, e.g., soft actor-critic (SAC).
- Line 59: What is the common reward function? Is it a team reward or an individual reward? Please provide details.
- Does each controlled agent have a separate encoder and decoder? Or do the controlled agents learn a single network and then use it in a distributed manner?
- Equation (2):
  - Why is the observation decoding loss based on mean squared error (MSE) and the action decoding loss based on likelihood?
  - How to calculate the probability p? There is no definition of p. Please provide details.
Experiment
- Uncontrolled agents and controlled agents operate as a team, but their training processes are entirely separate. In the training process of the controlled agents, they can learn cooperative decision-making by adapting to the pre-trained uncontrolled agents. Conversely, in the pre-training of the uncontrolled agents, what cooperative decisions do they learn? For example, if three uncontrolled agents are needed for a task in which five agents form a team, how can we pre-train only those three agents separately?
- How does a team consisting of only uncontrolled agents perform? Such empirical results can be used as a baseline.
Figure 4
- What is the range of observation and action? If it is generally assumed that the range is between [-1, 1], the MSE in Figure 4 appears to have a very large error rate.
- What information can we get from action probability? Unlike MSE, it is difficult to identify trends.
Figure 5
- The main idea of POAM lies in building an agent modeling network, and it requires uncontrolled agent data (UCD). In Figure 5, how to train agent modeling network in POAM without UCD?
- What is the difference between POAM without a modeling network and IPPO?
Figure 6
- It would be easier to check how much performance increases or decreases if a baseline for non-OOD cases is also provided.
Limitations
The authors provided the general limitations in MARL algorithms.
The maximum theoretical value for the action probabilities is not 1.0 when the modeled teammates are stochastic
To illustrate this, we construct a toy example based on the bit matrix game described in Section 4 (The Need for Dedicated NAHT Algorithms). For each agent, the action space of the bit matrix game is $\{0, 1\}$. Let $p(A)$ denote the true action distribution of the uncontrolled agent, and let $q_\theta(A)$ denote the modeled action distribution, parametrized by $\theta$.
The quantity displayed in Fig. 4 is the average probability of the observed, ground-truth actions under the current agent model: $\mathbb{E}_{a \sim p(A)}[q_\theta(a)]$.
Suppose the uncontrolled agent selects action 1 with probability ⅓, and action 0 otherwise, i.e. $p(A=1) = \frac{1}{3}$ and $p(A=0) = \frac{2}{3}$. We call this the Bernoulli(⅓) agent. Assume that the modeled action distribution exactly equals the ground-truth agent policy, $q_\theta = p$, to compute the maximum possible value of the action probabilities.
Then $\mathbb{E}_{a \sim p(A)}[q_\theta(a)] = \frac{1}{3} \cdot \frac{1}{3} + \frac{2}{3} \cdot \frac{2}{3} = \frac{5}{9} \approx 0.56$.
While this is a toy example, it serves to illustrate the point that if the ground-truth agent policies are stochastic, the maximum action probability displayed in Fig. 4 (right) will not be 1.0.
The action probability discussed previously is very closely related to the action modeling loss that the encoder-decoder is trained with. We chose to display the action probability rather than the action modeling loss because we thought it would be more interpretable for readers, but if the reviewer thinks it would be clearer to display the action modeling loss, we are happy to make that modification.
POAM’s encoder-decoder achieves the minimum action loss on the bit matrix game.
The action loss is given by the formula $\mathbb{E}_{a \sim p(A)}[-\log q_\theta(a)]$. Using the same notation and setup as above, for the Bernoulli(⅓) uncontrolled agent, the minimal action loss can be computed by setting $q_\theta = p$:
$-\frac{1}{3}\log\frac{1}{3} - \frac{2}{3}\log\frac{2}{3} \approx 0.64$ (using the natural logarithm).
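For concreteness, both quantities for the Bernoulli(⅓) agent can be verified with a few lines of arithmetic (plain math, independent of any implementation):

```python
import math

p = {0: 2/3, 1: 1/3}   # ground-truth Bernoulli(1/3) action distribution

# Maximum expected probability of the observed action, attained when q = p:
expected_prob = sum(pa * pa for pa in p.values())               # 5/9 ≈ 0.556
# Minimum expected negative log-likelihood (natural log), also attained at q = p:
min_action_loss = -sum(pa * math.log(pa) for pa in p.values())  # ≈ 0.637

print(round(expected_prob, 3), round(min_action_loss, 3))       # 0.556 0.637
```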
Scale of Observation MSE in Fig. 4
Fig. 4 shows the observation MSE over an entire episode across multiple training checkpoints, from 0 to 19 million timesteps. Later in training, the ED requires a few time-steps of history in order to predict observations with low error. The initial MSE tends to be large early in training and early in the episode, as pointed out by the reviewer, but by the end of training and the end of the episode, we verify that it has reduced to the following values for predator prey and 5v6:
- MPE-PP: 0.041
- 5v6: 9.011e-07
For MPE-PP, we verified that we can further reduce the observation MSE by increasing the epochs of encoder-decoder updates at the beginning of training, although this does not improve performance on primary metrics.
To verify that POAM’s ED can indeed achieve the theoretical minimum observation and action loss in practice, we train the encoder-decoder on data generated by the Bernoulli(⅓) agent only (Fig. 2 in the rebuttal pdf).
Thank you for the detailed response from the author. These helped clarify my points of concern, but I still have a few questions before finalizing my score.
- In Figure 2, the author uses theoretical probability, but the correlation with the loss is not clear.
- In Figure 3, the magnitude of the error bars for all algorithms appears to be the same. Please check if this result is correct.
- The reviewer believes the authors should provide more details regarding the MPE-PP setting. Specifically, the reviewer wonders how a pre-trained prey agent can be trained.
- The author did not provide baseline performance data related to uncontrolled agents. Without providing the performance in a scenario where all agents are uncontrolled, it is difficult to evaluate the effectiveness of the proposed algorithm.
Dear authors,
The discussion period is about to expire. Please do address the further questions from Reviewer 5dYr at your earliest convenience.
Kind regards, AC
We thank the AC for their close following of the discussion with reviewers. We recently posted a reply to R 5dYr, which we hope will address their final concerns. We also plan to post a reply to R hc6T.
Thank you to the author for the active response. After discussion, the reviewer thinks many previously unclear areas have been clarified. Consequently, we have raised our score from 3 to 4.
A few of the experimental results contained errors, which the author claims are not significant. Nonetheless, this raises concerns for the reviewer regarding the clarity and completeness of the work. For example, in the Baselines section, the author reports the best naive MARL baseline. Although the paper on MAPPO [1] demonstrated nearly 100% performance for MAPPO and IPPO on SMAC tasks, the simulation results in this study show notably lower performance. In addition, the reviewer believes that incorporating the discussion from this rebuttal into the paper would require substantial revisions. Therefore, the reviewer could not raise this work to an acceptance score.
Reference [1] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen and Yi Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” NeurIPS, 2022.
We thank the reviewer for considering our rebuttal and revising their score upwards from 3 to 4. We greatly appreciate the reviewer’s active engagement.
We are sorry to hear that the reviewer believes that the paper does not meet the bar for acceptance at Neurips. We are also somewhat surprised, since we have successfully clarified all of the previous questions/issues raised by the reviewer — to the reviewer’s satisfaction, for all issues except one (the plotting mistake involving the error bars). Please see the summary at the bottom of this response for a list of resolved issues. Here, we also provide responses for the two new issues raised by the reviewer.
If we have addressed all of the reviewer’s questions/concerns, we respectfully request that the reviewer reconsider their position on the acceptance of this paper.
Validity of MAPPO as a Naive MARL Baseline
First, please note that the naive MARL baseline displayed in Fig. 3 (of the submitted paper) is performance of the best naive MARL team among VDN, QMIX, IQL, IPPO, and MAPPO, as evaluated according to the mean cross-play (XP) score, which is shown in the rightmost column of Tables 12-16 in the submitted paper. In particular, for MPE-PP, 5v6, 8m, 3s5z, and 10v11, the best naive MARL method is VDN for MPE-PP and QMIX for the Starcraft tasks. As such, the performance of MAPPO has no bearing on the naive MARL baseline in Fig. 3.
Second, the main contribution of [1] was to demonstrate that policy gradient approaches such as MAPPO and IPPO achieve SOTA performance on various cooperative MARL tasks. Thus, the authors of [1] spend a great deal of effort optimizing the performance of MAPPO. In contrast, our paper uses IPPO and MAPPO to generate uncontrolled teammates, which do not need to be optimal, for the purposes of studying the NAHT problem. Please note that an established experimental practice in ad hoc teamwork is to hand-code teammate policies, which naturally does not result in optimal teammate policies [2].
Third, to address a potential misunderstanding: the performance of the naive MARL baseline shown in Fig. 3 is the mean NAHT cross-play score, when each MARL baseline team is evaluated in the presence of a varying number of uncontrolled and potentially unseen agents. This evaluation metric is explained starting at Line 670 in the submitted paper, and is a novel metric introduced by our work to evaluate NAHT performance. The cross-play scores should not be directly compared against scores reported by MARL papers such as [1], which evaluate the performance of agents together with their training-time teammates. In our work, we refer to the latter as model self-play scores (see Line 695). We generally expect our NAHT cross-play scores to be lower than self-play scores, due to the openness and generalization challenges introduced by NAHT. Further, prior work in the AHT setting has observed that naive MARL methods’ cross-play scores tend to be lower than self-play scores, due to the generalization challenge created by an unknown teammate [3].
Fourth, to address another potential misunderstanding: [1] reports results on Starcraft according to the win percentage, which is a proxy objective that naturally tops out at 100%. In StarCraft, the RL agents attempt to maximize the sum of shaped rewards across an entire episode. The per-timestep shaped reward consists of a reward/penalty for winning/losing the battle, a reward for each enemy killed, and a reward for the damage dealt to the enemies (as explained in Line 655 of the paper). Our results are reported according to this test return mean that the RL agent is actually optimizing for, which generally tops out at around 15-20, depending on the number of enemies in the StarCraft task. When evaluated according to the win percentage, our MAPPO agents achieve the following scores, which are similar to those reported in Table 1 of [1]:
- 5v6: 85.6 ± 0.09 (compare to 88.3 ± 1.2 in [1])
- 8v9: 97 ± 0.18 (compare to 96.9 ± 0.6)
- 3s5z: 97 ± 0.29 (compare to 96.9 ± 1.9)
- 10v11: 99 ± 0.01 (compare to 96.9 ± 1.2)
Points Addressed in Common Rebuttal
- Only one evaluation domain
- Question about on/off-policy algorithms for POAM
- Requested baselines: in-distribution performance for Fig. 6; performance of uncontrolled teams
Accuracy of Inferred Actions and Observations in the Dec-POMDP setting
We agree that the agent model should experience a drop in accuracy from operating in partially observable environments, due to limited information about teammates. However, the agent model does not need to be perfect to be useful for NAHT; it only needs to characterize the uncontrolled agents well enough that the POAM agents may best respond to them.
Experiments demonstrate that compared to IPPO-NAHT (equivalent to POAM without agent modeling), POAM has improved sample efficiency, asymptotic performance (Fig. 3), and out-of-distribution generalization to unseen teammates (Fig. 10) in the partially observable SMAC tasks.
Agent Modelling Network Performance in Fig. 4
The reviewer thinks that this performance [0.6] does not indicate that the [agent] model has been successfully trained.
We provide three counter-arguments:
(1) The maximum theoretical value for the action probabilities is not 1.0, as the maximum depends on the stochasticity of the modeled policies. For example, if an agent turns left 80% of the time and right 20%, even if we perfectly model the policy, then the expected modeled action probability is $0.8 \times 0.8 + 0.2 \times 0.2 = 0.68$. A follow-up comment will make a formal argument. Further, partial observability might preclude achieving the maximum theoretical value.
(2) We have included loss plots of the encoder-decoder in Fig. 1 of the rebuttal PDF to show that the observation and action losses reduce smoothly during training.
(3) We provide additional empirical results on the bit matrix game showing that in a fully observable setting the encoder-decoder reduces both observation and action losses to the theoretical minimum value (Fig. 2 in rebuttal pdf). Derivation of the minimum action loss in follow-up comment.
Clarifications/Typos
We thank the reviewer for pointing out typos and clarity issues, and will implement all corrections in the paper.
History Length - Line 163
POAM uses a recurrent encoder-decoder, where the hidden state is only reset at the start of the episode. The lowercase $t$ in the expression $h^t = \{o_i^k, a_i^{k-1}\}_{k=1}^t$ refers to the current timestep of the episode, which varies. Thus, POAM deals naturally with changing history lengths.
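A minimal sketch of this behavior (the GRU cell and dimensions are assumptions for illustration, not the paper's exact architecture): the hidden state is initialized once per episode and then simply updated at every timestep, so no fixed history length is ever required.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, hidden_dim = 8, 4, 16
encoder_cell = nn.GRUCell(obs_dim + act_dim, hidden_dim)

h = torch.zeros(1, hidden_dim)                  # reset only at the start of the episode
for t in range(37):                             # any episode length works
    step = torch.randn(1, obs_dim + act_dim)    # current observation + previous action
    h = encoder_cell(step, h)                   # h summarizes the history up to timestep t
```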
Un- vs Non-controlled Agent - Line 316
Non-controlled agents are the same as uncontrolled agents; we will make the terminology consistent.
Common Reward Function
A common reward means that all the agents in a team get the same reward at every time step. This follows the standard Dec-POMDP setup, as specified in Section 2.
…do the controlled agents learn a single [encoder decoder] network and then use it in a distributed manner?
Controlled agents learn a single encoder-decoder and use it in a distributed fashion, a practice called parameter sharing [1]. This was mentioned in Line 394 of the original paper: “POAM, which employs full parameter sharing…”
Questions
Why is observation decoding loss based on mean squared error (MSE) and action decoding loss based on likelihood?
The MSE loss is equivalent to the negative log-likelihood (NLL) loss under a Gaussian distribution, and is the standard loss for modeling continuous features (as we have in our tasks). Thus, the observation decoding loss forms a unified, likelihood-based methodology together with the action decoding NLL loss, which uses a Categorical distribution over the discrete action space.
How to calculate probability p?
We noticed that Eq. 2 is missing a summation over the controlled agent index $i$, which may have reduced readability. We will fix this typo.
$p$ is the probability of observing the actions of the other agents, where the probability distribution is parameterized by the output of the decoder. In practice, since the experimental tasks have discrete actions, $p$ is a Categorical distribution over the action space (corresponding to the ground-truth Categorical agent policies).
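A small sketch of the two decoder losses under these assumptions (MSE for continuous observation targets, Categorical negative log-likelihood for discrete action targets); the tensor names and shapes are illustrative only:

```python
import torch
import torch.nn.functional as F

batch, obs_dim, n_actions = 32, 8, 5

pred_obs      = torch.randn(batch, obs_dim)    # decoder's predicted teammate observations
true_obs      = torch.randn(batch, obs_dim)
action_logits = torch.randn(batch, n_actions)  # decoder output parameterizing a Categorical
true_actions  = torch.randint(0, n_actions, (batch,))

obs_loss    = F.mse_loss(pred_obs, true_obs)                # MSE = Gaussian NLL up to constants
action_loss = F.cross_entropy(action_logits, true_actions)  # -log p(a) under the Categorical
ed_loss     = obs_loss + action_loss
```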
In the pre-training of uncontrolled agents, what cooperative decisions do they learn? For example, if three uncontrolled agents are needed for a task in which five agents form a team, how can we pre-train only the three agents separately?
The uncontrolled agents in our paper are derived from running MARL algorithms. Thus, they are trained within their own team, under whatever assumptions that MARL algorithm makes. The partial uncontrolled team is not trained separately; it is simply a subset of the full uncontrolled team that was trained to work together.
What is the range of observation and action? …the MSE in Fig. 4 appears to have a very large error rate.
The obs. range was normalized to [-1,1] for the experiments, while the discrete action space is one-hot encoded. The observation MSE decreases over training and over the episode, and can be further decreased by increasing agent model training epochs. We will post a further discussion of the observation MSE in Fig. 4.
In Fig. 5, how to train agent modeling network in POAM without UCD?
Training without UCD means that the encoder-decoder’s training dataset is limited to observations corresponding to agents from the controlled set. Importantly, agents from the controlled set may still observe uncontrolled agents, making it possible to predict observations and actions corresponding to uncontrolled agents.
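As a rough sketch of that ablation (hypothetical tensors and mask), the only change is that the encoder-decoder's training batch is restricted to trajectories collected by controlled agents, who may nevertheless observe, and hence be trained to predict, the uncontrolled agents:

```python
import torch

# trajectories: (n_agents, T, feat_dim); controlled_mask marks the agents we control.
trajectories    = torch.randn(5, 20, 12)
controlled_mask = torch.tensor([True, True, False, False, False])

ed_batch_with_ucd    = trajectories                    # POAM: the ED trains on all agents' data
ed_batch_without_ucd = trajectories[controlled_mask]   # ablation: controlled agents' data only
```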
What is the difference between POAM without a modeling network and IPPO?
POAM w/o modeling network is equivalent to IPPO-NAHT. The difference between IPPO-NAHT and IPPO is the use of uncontrolled agent data to train the value function.
References:
[1] Christianos et al. Scaling multi-agent reinforcement learning with selective parameter sharing. ICML 2021.
Thank you for reviewing our rebuttal with great care, and for the thoughtful follow-up. We address the questions point-by-point below:
Q1:
In Figure 2, the author uses theoretical probability, but the correlation with the loss is not clear.
The reviewer mentions Fig. 2 in their question, but we are not sure which Fig. 2 they refer to. Fig. 2 in the submitted paper is a diagram illustrating POAM, while Fig. 2 in the rebuttal PDF shows the encoder-decoder (ED) observation and action loss values, rather than action probabilities. We proceed on the assumption that the reviewer is asking for further clarification of the relationship between the action probability and action loss, but are happy to continue discussing this topic if we have misunderstood.
The action loss is the expected negative log-likelihood, as shown in Eq. 2 of the submitted paper: $\mathbb{E}_{a \sim p(A)}[-\log q_\theta(a)]$. On the other hand, as described in our comment above, the action probability displayed in Fig. 4 of the submitted paper is the average probability of the observed, ground-truth actions under the current agent model: $\mathbb{E}_{a \sim p(A)}[q_\theta(a)]$. Thus, the only differences between the two quantities are the sign and the log function within the action loss. Conceptually, we wish to minimize the action loss and maximize the action probability. The solution set for minimizing the action loss is the same as for maximizing the action probability due to the monotonic log function, which does not change the ordering of solutions.
To give a numerical example comparing the action probability and action loss, please consider the bit matrix game example described in our follow-up comment titled, "The maximum theoretical value for the action probabilities is not 1.0 when the modeled teammates are stochastic". Recall that $A$ is the action random variable that takes values 0 or 1; $p(A)$ denotes the true action distribution of the uncontrolled agent; and $q_\theta(A)$ denotes the modeled action distribution, parametrized by $\theta$.
The uncontrolled Bernoulli(⅓) agent chooses A=1 with a probability of ⅓, and A=0 with a probability of ⅔. Given this uncontrolled agent policy, a perfect decoder network would generate a distribution that also chooses A=1 with a probability of ⅓, and A=0 with a probability of ⅔, i.e. $q_\theta(A=1) = \frac{1}{3}$ and $q_\theta(A=0) = \frac{2}{3}$.
As computed in our comment titled, "POAM's encoder-decoder achieves the minimum action loss on the bit matrix game", the perfect decoder (i.e. when $q_\theta = p$) would achieve an expected action loss of approximately 0.64 (using the natural logarithm) for the Bernoulli(⅓) agent. On the other hand, this perfect decoder would achieve an expected action probability of $\frac{5}{9} \approx 0.56$.
Q2
In Figure 3, the magnitude of the error bars for all algorithms appears to be the same. Please check if this result is correct.
We thank the reviewer for pointing out this error, which was due to a variable name error in the plotting code. Unfortunately, we are not able to attach a corrected figure due to the discussion rules, but the corrected confidence bound values for Figure 3 in the rebuttal pdf are listed below, and will be corrected in the next draft of the paper. Importantly, we also verified that the significance of our results and corresponding analysis did not change due to this error.
| | IPPO | IQL | MAPPO | QMIX | VDN |
|---|---|---|---|---|---|
| IPPO-NAHT | 0.699 | 0.491 | 0.532 | 1.189 | 0.685 |
| POAM | 1.719 | 1.696 | 1.402 | 2.108 | 0.709 |
(continued in part 2 below)
Q3
The reviewer believes the authors should provide more details regarding the MPE-PP setting. Specifically, the reviewer wonders how a pre-trained prey agent can be trained.
We use the pre-trained prey policy provided by the ePymarl MARL framework, as mentioned in the Appendix at line 643. The prey policy was originally trained by Papoudakis et al. [1], who used the MADDPG MARL algorithm to train both predator and prey agents for 25M steps. We visualized the prey policy and confirmed that the prey agent moves to escape approaching predators.
Please see the code here for the exact parameter file, titled prey_params.pt. Since the prey policy is pre-trained and fixed, our predator-prey task is a fully cooperative task from the perspective of the predator (learning) agents. We will improve the existing explanation of the MPE environment and predator-prey task in the Appendix by adding this discussion.
Q4
The author did not provide baseline performance data related to uncontrolled agents. Without providing the performance in a scenario where all agents are uncontrolled, it is difficult to evaluate the effectiveness of the proposed algorithm.
We already addressed this point in the common rebuttal section titled "Baseline Selection (R 5dYr, R 9xa8)". We wish to emphasize that we did both (1) evaluate the performance of the uncontrolled agent teams, where all agents are uncontrolled, and (2) assess the performance of the uncontrolled agents in the NAHT evaluation setting, and use it to contextualize POAM, the proposed algorithm.
To summarize, we provided three analyses of the uncontrolled agent performance in the submitted version of the paper:
- Matched-mismatched evaluation (Section A.4.1): Fig. 11 shows the performance of each uncontrolled team on all tasks when the uncontrolled team is trained together (matched seed condition), versus when we mix two teams that were trained using the same algorithm but different seeds (mismatched seed condition).
- Cross-play tables: Tables 12-15 display the full cross-play results for all uncontrolled teams (generated by MARL algorithms) on all tasks.
- Naive MARL baseline: the naive MARL baseline in Fig. 3 is the performance of the best uncontrolled team, as evaluated through the cross-play scores shown in the cross-play tables mentioned above. This baseline shows how well naive MARL agents that were not trained in the NAHT setting would perform when evaluated in the NAHT setting.
Again, we thank the reviewer for their time and consideration, and are happy to answer follow-up questions.
References
[1] Papoudakis et al. Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. Neurips 2021.
Scale of Revisions Required for Paper
It is not completely apparent to us why the revisions discussed here would be substantial enough to merit rejecting the paper. The planned revisions to the main paper are largely minor, and the revisions to the Appendix consist of further experimental details and supplemental results and discussion, drawn from the current rebuttal/discussion on OpenReview. Please see the list of revisions that would be made, based on our discussion with the reviewer (5dYr).
Appendix
- Discussion: relationship between action probabilities and action loss, as explained in our discussion with R 5dYr
- Discussion: theoretical lower bound on action loss / upper bound on action probabilities (as explained in discussion with R 5dYr)
- Discussion: why the observation loss is the MSE loss, while the action decoding loss is the negative log likelihood loss (as explained in the discussion with R 5dYr)
- Supplemental result: plot of encoder-decoder loss (Fig. 1 of rebuttal PDF)
- Experimental detail: how the pre-trained prey agent is trained on MPE-PP (as explained in the discussion with R 5dYr)
- Experimental detail: the observation/action ranges in MPE-PP and StarCraft tasks
Main Paper
- Typos: modelling -> modeling, non-controlled -> uncontrolled, adding the missing summation over the controlled agent index $i$ to Eq. 2
- Correction: correcting the plotting of the confidence intervals for the OOD experiment plots (Fig. 6)
- Additional baseline: in-distribution performance of uncontrolled teams as an additional set of baselines for Fig. 6 (already implemented as Fig. 3 of the rebuttal pdf).
Summary of Successfully Clarified Issues
The following is a list of all questions/concerns that were previously raised by the reviewer, and successfully clarified by our response. Again, given that we have addressed all issues brought up by the reviewer, we respectfully request the reviewer to reconsider their position on the paper.
Misconceptions
- Only one evaluation domain
- Lack of evaluation of uncontrolled agents
Validity Concerns
- Validity of inferring actions/observations of other agents in a POMDP setting
- Whether the agent model has been successfully trained
- Magnitude of error bars is the same in Fig. 3 <- we acknowledge the error. However, follow-up analysis shows that the significance of the results is not affected.
Requested Additional Analysis
- Requested baseline of the in-distribution performance for Fig. 6
Clarification Questions
- Meaning of the term, “common reward function”
- Why the observation decoding loss is based on the mean squared error while the action decoding loss is based on the likelihood
- How to compute probability p
- Whether POAM can deal with changing history lengths
- Difference between POAM without modeling network and IPPO
- Relationship between action probability and action loss
Requested Further Experimental Details
- How uncontrolled agents are trained
- How pre-trained prey agents were obtained in MPE-PP
- Range of observation and actions
- How the agent modeling network can be trained without uncontrolled agent data
Questions About Extensions to Method
- Question about whether POAM could be implemented with an off-policy algorithm
References
[1] Yu et al., The surprising effectiveness of PPO in cooperative multi-agent games. NeurIPS 2022.
[2] Papoudakis et al., Agent modelling under partial observability for deep reinforcement learning. NeurIPS 2021.
We thank the reviewers for lending their expertise in reviewing our paper, as well as for their thoughtful and helpful feedback. Here we address questions and points brought up by multiple reviewers. We respond to the other questions individually below.
Our contributions include:
- Proposing and formulating the NAHT problem.
- Demonstrating the issues that arise from directly applying AHT methods to NAHT problems.
- Outlining how existing agent modeling techniques for AHT can be applied to NAHT through the proposed algorithm, POAM.
We thank the reviewers for recognizing the importance of Contribution 1 (hc6T, 5dYr) and for calling the NAHT problem realistic (eZ5A), interesting (9xa8), and well-motivated (5dYr). The rigor of the proofs is also highlighted as a strength (hc6T). Regarding Contribution 3 (POAM), the reviewers acknowledge that the solution is clear and easy to implement (5dYr, 9xa8) and recognize the comprehensive experiments (9xa8, hc6T). Additionally, 3 of 4 reviewers found the paper well-written and clear (hc6T, 9xa8, eZ5A).
Finally, the most negative review contains several misunderstandings. While we accept responsibility for making the paper as clear as possible, we respectfully request that the reviewer reevaluate their rating, considering that many of their points/questions were already addressed in the submission.
Use of Uncontrolled Agent Data & On/Off-Policy Algorithm Selection (R 5dYr, R 9xa8)
(R 5dYr) The reviewer wants to know if the proposed solution works only with the PPO backbone. This question arises from the authors' reasoning about why they do not leverage data from uncontrolled agents... this issue might be mitigated by replacing the backbone algorithm with an off-policy algorithm.
Please observe that POAM does use data from uncontrolled agents to update the value function. As the reviewer stated, it does not use that data to update the policy, because the policy update is sensitive to off-policy data.
It is possible to combine the agent modeling scheme with an off-policy learning method. However, a key aspect of POAM is the use of independent learning, allowing scalability and enabling the POAM team to operate in an environment where the number of controlled agents may vary. Off-policy MARL algorithms following the independent learning paradigm are very uncommon for cooperative MARL. The main challenge is that stale data in the experience replay may not reflect teammates' current behavior [1].
On the other hand, prior work has observed that independent PPO performs well in challenging multi-agent coordination problems [2], with theoretical justification [3].
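To make this data split concrete, the following is a minimal sketch (illustrative pseudocode with placeholder names such as `poam_style_update`, `batch_controlled`, and `batch_all`; not our exact implementation):

```python
# Sketch: the value function (critic) is trained on transitions from ALL agents on the
# team, while the PPO-clipped policy update only uses the controlled agents' on-policy
# data. Tensor keys below are placeholders.
import torch

def poam_style_update(actor, critic, batch_controlled, batch_all, optimizer, clip_eps=0.2):
    # 1) Value update: uses data from both controlled and uncontrolled agents.
    values = critic(batch_all["obs"]).squeeze(-1)
    value_loss = (values - batch_all["returns"]).pow(2).mean()

    # 2) Policy update: restricted to controlled agents, whose data is on-policy.
    dist = torch.distributions.Categorical(logits=actor(batch_controlled["obs"]))
    log_probs = dist.log_prob(batch_controlled["actions"])
    ratio = torch.exp(log_probs - batch_controlled["old_log_probs"])
    adv = batch_controlled["advantages"]
    policy_loss = -torch.min(
        ratio * adv,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv,
    ).mean()

    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```

Restricting the clipped surrogate to controlled agents' data avoids importance ratios computed against behavior the learner never executed, while the critic can still benefit from observing uncontrolled teammates.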
Misconception: only tested on one domain (R 5dYr, R 9xa8)
To clarify, we evaluated on two domains at submission time: StarCraft II and the Multi-Agent Particle Environment (predator-prey task). Please see Figs. 3 (left), 4, 5, 6, along with various results in the Appendix.
Baseline Selection (R 5dYr, R 9xa8)
(R 5dYr) Performance of teams consisting of only uncontrolled agents (and their potential as a baseline).
We concur that the performance of uncontrolled agents is an interesting baseline for NAHT, and we did, in fact, implement this baseline in our paper. The naive MARL baseline in Fig. 3 is the performance of the best uncontrolled team, as evaluated through cross-play.
In the paper, we consider uncontrolled agent teams that are trained by five MARL algorithms and discuss/analyze these agents in Section A.4.1. Fig. 11 shows the performance of each of these algorithms on all tasks when the uncontrolled team is trained together (matched seed condition), versus when we mix two teams that were trained using the same algorithm but different seeds (mismatched seed condition). Tables 12-15 display the full cross-play results for all MARL algorithms on all tasks.
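For clarity, a minimal sketch of what we mean by cross-play between independently trained teams is shown below (illustrative only; `teams`, `evaluate`, and `cross_play` are placeholder names, and the exact mixing protocol may differ from the paper's):

```python
# Illustrative cross-play loop: mix agents drawn from two independently trained teams and
# record the mean return of the mixed team. Evaluating teams[key] on its own corresponds
# to the matched-seed (trained-together) condition.
import itertools
import random

def cross_play(teams, evaluate, n_agents, episodes=32, seed=0):
    rng = random.Random(seed)
    results = {}
    for key_a, key_b in itertools.combinations(teams, 2):
        returns = []
        for _ in range(episodes):
            k = rng.randint(1, n_agents - 1)  # number of agents drawn from team A
            joint = rng.sample(teams[key_a], k) + rng.sample(teams[key_b], n_agents - k)
            returns.append(evaluate(joint))
        results[(key_a, key_b)] = sum(returns) / len(returns)
    return results
```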
(R 5dYr) It would be easier to check how much performance increases or decreases if a baseline for non-OOD cases is also provided.
We added the in-distribution performance of IPPO and POAM to the OOD plot (Fig. 3 of the rebuttal PDF).
(R 9xa8) The baselines considered in the paper do not seem to include algorithms that model other agents explicitly. Also, the related works section for agent modeling seems to be incomplete….
The paper does not aim to develop new agent modeling techniques. Instead, it hypothesizes that agent modeling can help in solving the NAHT problem. In that regard, the innovation of this paper lies in how agent modeling should be applied to the NAHT problem. Through the POAM-AHT baseline, which combines IPPO with agent modeling similar to LIAM (ref. 28 in the submitted paper), the paper shows that naively applying an agent modeling AHT algorithm is insufficient for solving the NAHT problem. The question of which agent modeling technique is best suited to which NAHT problem is a promising direction for future work, which we will highlight in the paper.
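For readers unfamiliar with LIAM-style agent modeling, the sketch below illustrates the general encoder-decoder structure we refer to (our own illustration with assumed names such as `AgentModel` and a GRU encoder; the paper's exact architecture and hyperparameters may differ):

```python
# Sketch of an encoder-decoder agent-modeling module: a recurrent encoder embeds the
# controlled agent's local observation history, and decoders reconstruct teammates'
# observations (MSE) and actions (categorical NLL). The embedding can condition the policy.
import torch
import torch.nn as nn

class AgentModel(nn.Module):
    def __init__(self, obs_dim, n_teammates, teammate_obs_dim, n_actions, embed_dim=64):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, embed_dim, batch_first=True)
        self.obs_head = nn.Linear(embed_dim, n_teammates * teammate_obs_dim)
        self.act_head = nn.Linear(embed_dim, n_teammates * n_actions)
        self.n_teammates, self.n_actions = n_teammates, n_actions

    def forward(self, obs_history):
        # obs_history: [batch, time, obs_dim]
        _, hidden = self.encoder(obs_history)
        z = hidden[-1]                                   # [batch, embed_dim]
        obs_pred = self.obs_head(z)                      # decoded teammate observations
        act_logits = self.act_head(z).view(-1, self.n_teammates, self.n_actions)
        return z, obs_pred, act_logits

def modeling_loss(obs_pred, obs_true, act_logits, act_true):
    # MSE for continuous teammate observations, NLL for discrete teammate actions.
    obs_loss = nn.functional.mse_loss(obs_pred, obs_true)
    act_loss = nn.functional.cross_entropy(act_logits.flatten(0, 1), act_true.flatten())
    return obs_loss + act_loss
```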
We will cite and discuss the papers that the reviewer mentioned alongside this survey on agent modeling [4]. Please note that the suggested reference 3 is actually a prior version of ref. 28 in the submitted paper.
References
[1] Foerster et al. Stabilising experience replay for deep multi-agent reinforcement learning. ICML 2017.
[2] Yu et al. The surprising effectiveness of PPO in cooperative multi-agent games. NeurIPS 2022.
[3] Sun et al. Trust region bounds for decentralized PPO under non-stationarity. AAMAS 2023.
[4] Albrecht et al. Autonomous agents modelling other agents: a comprehensive survey and open problems. AIJ 2021.
We thank all reviewers and the AC for their active engagement with us, and close readings of the paper/rebuttals. The valuable feedback from reviewers will help us round-out the discussion in the main paper, while the additional details and experiments provided as part of the discussion process will be added to the Appendix.
Overall, 3 of 4 reviewers gave our paper scores in the "Accept" range: 7 (R 9xa8), 5 (R hc6T), and 5 (R eZ5A). The remaining reviewer (R 5dYr) raised their score from 3 to 4 after acknowledging that our rebuttal and discussion addressed all of their initial concerns; their only two remaining concerns came up late in the discussion period, and we have provided a rebuttal for them.
In particular, we thank R 9xa8 for their confidence that the paper is a good contribution to the NeurIPS conference. We also thank R hc6T for recognizing that their initial proposal for generating more challenging, out-of-distribution teammates via cross-play minimization techniques is not as straightforward as initially thought, and for proposing an alternative teammate generation strategy. Using this strategy, we have conducted an initial experiment on the predator-prey task, finding that our proposed method (POAM) still outperforms the baseline method (IPPO-NAHT), even on this more challenging generalization task.
This paper introduces and formalizes a novel problem, MARL in the N-agent ad hoc teamwork setting, which generalizes MARL in both the cooperative and ad hoc teamwork scenarios and is relevant for real-world applications. The paper also sets a baseline for this problem, showing that it outperforms naive baselines. I believe this paper can inspire further research.
The discussion during the rebuttal period has been very helpful in making the contributions clearer and sounder. The authors must improve the paper according to the discussion with the reviewers, especially regarding evaluation (setup, baselines, and results), scope, the use of data from uncontrolled agents, and convergence.
P.S. A couple of additional minor comments:
- It is not clear what "Without loss of generality" in line 89 refers to; using a parametric policy is not general.
- Regarding the "training scheme presented in Section 6.1": does it mean the practical instantiation of NAHT discussed above (homogeneous, non-communicating agents that may learn about uncontrolled agents via interaction)?