One by One, Continual Coordinating with Humans via Hyper-Teammate Identification

ICLR 2024 (withdrawn)
Submitted: 2023-09-22 · Updated: 2024-03-26
Average rating: 4.7/10 from 3 reviewers (scores: 3, 5, 6; min 3, max 6, std 1.2)
Average confidence: 4.0

TL;DR

An approach that models the process of training with different teammates as a continual learning process; a mechanism based on hyper-teammate identification is proposed to avoid catastrophic forgetting while promoting forward knowledge transfer.

Keywords

Multi-agent Reinforcement Learning, Cooperation and Coordination, Continual Coordination, Human-AI Coordination

Reviews and Discussion

Review (Rating: 3)

This paper tackles the human-AI coordination problem in a sequential setting using a continual learning paradigm. The motivation for continually training such an agent is that previous population-based training methods need to maintain a diverse population, which is costly compared to the sequential setting in this paper. Experiments are conducted on two well-established benchmarks to demonstrate the effectiveness of the proposed method. However, I have several concerns about this paper, especially regarding the motivation and the experiments. Please refer to the weaknesses section.

Strengths

  1. Each component of the proposed method is clearly described.
  2. Compared to the baselines, the proposed Concord demonstrates better performance.
  3. Human study is conducted on one layout in Overcooked, where both subjective and objective results are reported.

Weaknesses

I have several concerns about this paper.

  1. One of my major concerns with this paper is the motivation. I'm not fully convinced that training human-AI coordination agents in a sequential manner with continual learning is more efficient than previous PBT methods. It is true that PBT methods like MEP [1] need to maintain a diverse population to train adaptive agents. But the proposed Concord not only needs to train a population of agents as in PBT methods (for comparison, 12 in Concord vs. 15 in MEP, not to mention the offline replay buffer needed to train the recognizer), but also needs to collect a large amount of human data to train such a population. And human data is often difficult and expensive to collect due to safety and privacy issues.

  2. As pointed out above, human data is hard to collect, so the authors use AI agents trained by PBT methods (MEP and HSP) as alternative human-like agents. However, as demonstrated by the study in FCP [2], there is a large gap between real humans' policies and agent policies trained via PBT methods. This further increases my concern about the motivation of this paper: why do we need to train the agents in a sequential manner if we still need to train a population of agents? And given the same amount of human data, would previous PBT methods perform better than or as well as the proposed continual-learning-based method?

  3. More baseline methods are needed. The compared baselines are too weak. At least some zero-shot coordination methods should be discussed to show the proposed method's advantage over them.

  4. I found the recognizer module to be very similar to the context encoder proposed in PECAN [3], but the reference is missing. The PECAN context encoder also encodes previous trajectories to identify the teammate's type and uses the encoding to constrain the agent's policy, and both modules are trained with the training teammates' interaction data and only used at evaluation time with humans or human proxies. As the hyper-network in Concord is effectively part of the actor network that processes the teammate representation z, the usage of the recognizer module in Concord and the context encoder in PECAN also has much in common. This issue needs further discussion.

[1] Zhao, Rui, et al. "Maximum entropy population-based training for zero-shot human-ai coordination." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 37. No. 5. 2023.

[2] Strouse, D. J., et al. "Collaborating with humans without human data." Advances in Neural Information Processing Systems 34 (2021): 14502-14515.

[3] Lou, Xingzhou, et al. "PECAN: Leveraging Policy Ensemble for Context-Aware Zero-Shot Human-AI Coordination." Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems. 2023.

Questions

Please respond to the weaknesses given above.

Ethics Concerns

This paper involves the use of human data for both model training and evaluation. Detailed information about the human study is provided in Appendix E.

Review (Rating: 5)

This paper focuses on improving the performance of agents when collaborating with a variety of teammates, including human participants. The authors point out that prior research mostly suffers from the limitation of a fixed teammate population set. However, predefining all potential teammates is a significant challenge. To address this issue, the authors introduce Concord, an innovative approach that simulates the agent's training with different teammates in a sequential manner. They propose a hyper-teammate identification-based mechanism, designed to 1) mitigate catastrophic forgetting and 2) enhance forward knowledge transfer. This is achieved by integrating a teammate recognition module, which transforms teammate behaviors into an embedding representation. Subsequently, the agent’s policy is dynamically generated through a hyper-network. The training of the model is executed via a decomposed policy gradient technique and its efficacy is rigorously evaluated across various multi-agent benchmarks, involving both generated human proxies and actual human participants.
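
For readers less familiar with this kind of architecture, below is a minimal sketch of a trajectory-conditioned teammate recognizer feeding a hypernetwork that generates part of the actor's weights. This is an illustration under assumed module names, layer sizes, and tensor shapes, not the authors' implementation, and it omits the decomposed policy gradient training mentioned above.

```python
# Minimal sketch (assumed names/shapes, not the paper's code): a recognizer encodes
# recent teammate trajectories into an embedding z, and a hypernetwork maps z to the
# weights of the actor's output layer, so the policy is teammate-conditioned.
import torch
import torch.nn as nn

class TeammateRecognizer(nn.Module):
    """Encodes a trajectory of (observation, teammate-action) features into an embedding z."""
    def __init__(self, traj_dim, z_dim=16, hidden=64):
        super().__init__()
        self.gru = nn.GRU(traj_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, z_dim)

    def forward(self, traj):                    # traj: [batch, T, traj_dim]
        _, h = self.gru(traj)                   # h: [1, batch, hidden]
        return self.head(h.squeeze(0))          # z: [batch, z_dim]

class HyperActor(nn.Module):
    """A hypernetwork generates the final-layer weights of the actor from z."""
    def __init__(self, obs_dim, act_dim, z_dim=16, hidden=64):
        super().__init__()
        self.obs_enc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.hyper_w = nn.Linear(z_dim, hidden * act_dim)   # generates the weight matrix
        self.hyper_b = nn.Linear(z_dim, act_dim)            # generates the bias vector
        self.hidden, self.act_dim = hidden, act_dim

    def forward(self, obs, z):                  # obs: [batch, obs_dim]
        feat = self.obs_enc(obs)                                # [batch, hidden]
        W = self.hyper_w(z).view(-1, self.act_dim, self.hidden)
        b = self.hyper_b(z)
        logits = torch.bmm(W, feat.unsqueeze(-1)).squeeze(-1) + b
        return torch.distributions.Categorical(logits=logits)  # teammate-conditioned policy
```

The design choice illustrated here is that the actor's parameters are a function of the teammate embedding z, so adapting to a new teammate amounts to producing a new embedding rather than training a separate policy.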

Strengths

  1. Training the agent step by step through continual learning is a really interesting and strong idea. It's like how humans learn to work with others in real life. Also, the way the paper is written is very clear.

  2. The experiments are conducted in great detail and are convincing. I really appreciate how the authors did thorough experiments with real human participants. This proves that their solution works well and shows it can be used in real-world situations.

Weaknesses

I have some doubts about whether training the agent in a sequential manner is beneficial.

  1. From the standpoint of cutting costs, training the agent in a "one-by-one" manner does not really lower the total amount of training needed, so it does not seem to reduce costs compared to methods that train in parallel. The motivation for this work could perhaps be stated more precisely.

  2. Also, looking at how the model is designed, this solution requires all previous teammates to compute the loss function during training. This might make it hard to scale the solution as the number of teammates grows.

  3. Regarding the issue of "creating diverse teammates" raised when discussing earlier works, I am not sure this paper has tackled it. The proposed solution does not include generating new teammates, which means it also needs a variety of teammates to start with. At each step, Concord has to train with all the teammates it has already worked with. If we regard these teammates as the diverse population built in earlier research, then Concord's training is not really different from what was done before.

Questions

  1. If the agent has to be trained step by step, could this affect how useful the solution is in situations where the agent must work with many teammates simultaneously? For instance, think of an online chatbot on an e-commerce platform where customers show up all together.

  2. I'm really curious about the Oracle setup. What does it mean when it says Oracle knows “which specific model it is to coordinate with when testing”? Does Oracle create a single policy for all teammates, or different policies for each one? How does Oracle use the information about the specific model it's coordinating with?

  3. In the anti-forgetting step, the loss function is optimized over all previous teammates. Would this become too expensive when there are more teammates? It also seems that this step would be easier if all the teammates were available simultaneously (so training could be conducted in parallel). What is the advantage of training the agent with teammates one by one?
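
To make the scaling concern in weakness 2 and question 3 concrete, an anti-forgetting objective of the kind described would typically look something like the generic sketch below (illustrative only, not necessarily the paper's exact loss); the replay term grows linearly with the number of teammates seen so far.

```latex
% Generic continual-coordination objective at stage t (illustrative sketch):
% the anti-forgetting term sums over all previously encountered teammates 1..t-1,
% so its evaluation cost grows linearly in t.
\mathcal{L}_t(\theta) \;=\;
\underbrace{\mathcal{L}^{\mathrm{coord}}\big(\theta;\, \pi^{\mathrm{tm}}_t\big)}_{\text{current teammate}}
\;+\;
\lambda \sum_{k=1}^{t-1}
\underbrace{\mathcal{L}^{\mathrm{coord}}\big(\theta;\, \mathcal{D}_k\big)}_{\text{replayed data of teammate } k}
```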

Review (Rating: 6)

This paper proposes to formulate the problem of cooperating with diverse teammates and human policies as a multi-agent continual Markov decision process (MACMDP). The authors propose a continual learning framework for training agents to solve the MACMDP. The framework consists of a latent teammate recognizer and a hypernetwork conditioned on the resulting embedding. Value-decomposition-based credit assignment is applied to facilitate coordination. The authors also demonstrate that the framework is able to coordinate with diverse teammates and real human teammates.
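
For context, value-decomposition credit assignment usually means factorizing a joint value into per-agent utilities so that each agent's contribution to the team return can be credited. An illustrative additive (VDN-style) form is shown below; the paper's exact factorization and mixing function may differ.

```latex
% Illustrative additive (VDN-style) value decomposition; QMIX-style variants
% replace the sum with a monotonic mixing network. The paper's exact form may differ.
Q_{\mathrm{tot}}(\boldsymbol{\tau}, \mathbf{a}) \;\approx\; \sum_{i=1}^{n} Q_i(\tau_i, a_i),
\qquad \frac{\partial Q_{\mathrm{tot}}}{\partial Q_i} \;\ge\; 0 .
```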

Strengths

  1. I like the figures in this paper.
  2. I have been thinking that multi-agent learning itself can be formulated as continual learning and multi-task learning, as is done in some evolutionary algorithms. I am glad someone is working in this direction.

Weaknesses

  1. Experiments are not reported with statistical metrics (e.g., standard error or standard deviation) over the 5 trials, in either the main paper or the appendix. (A minimal example of such reporting is sketched after the references below.)
  2. The claim of coordinating with unseen teammates does not have clear experimental support, or I am missing it. I did not find how the unseen policies are generated. I am concerned about this point, as I do not expect an encoder trained on past human policies to recognize a new human policy unless it is very similar to the seen human policies.
  3. The fact that the method sidesteps the teammate policy generation process makes its contribution less significant. There are works that model human values and use them for diverse policy generation [1][2].

[1] Chao Yu*, Jiaxuan Gao*, Weilin Liu, Botian Xu, Hao Tang, Jiaqi Yang, Yu Wang, Yi Wu. Learning Zero-Shot Cooperation with Humans, Assuming Humans Are Biased. ICLR 2023.

[2] Bakhtin, A., Wu, D. J., Lerer, A., Gray, J., Jacob, A. P., Farina, G., ... & Brown, N. (2022). Mastering the game of no-press Diplomacy via human-regularized reinforcement learning and planning. arXiv preprint arXiv:2210.05492.
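
As referenced in weakness 1, a minimal sketch of the kind of statistical reporting requested, using hypothetical per-seed scores, could look like the following.

```python
# Minimal sketch of reporting mean, standard deviation, and standard error
# across the 5 independent trials (the per-seed scores below are hypothetical).
import numpy as np

returns = np.array([212.0, 198.5, 205.3, 220.1, 201.7])  # hypothetical per-seed scores

mean = returns.mean()
std = returns.std(ddof=1)                 # sample standard deviation
sem = std / np.sqrt(len(returns))         # standard error of the mean

print(f"{mean:.1f} +/- {sem:.1f} (SEM over {len(returns)} seeds, std = {std:.1f})")
```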

Questions

  1. Human experiments are expensive and human data is scarce. Why do we need to assume a limited-size human replay buffer? Can we use the data better, e.g., by replaying human policies or training imitation policies to preserve them? I can understand this motivation in the case of generated AI policies, which are computationally heavy.
  2. The key issue of a coordination game is that the population of policies is not necessarily diverse, different from adversarial games where there is motivation for moving away from a policy if it is exploitable. How did you make sure the generated teammate policies are diverse?
  3. Does the order of the teammates matter?
  4. The continual learning setting makes the comparisons with the baseline methods tricky. E.g. the authors assume a small buffer to store previous opponents’ data. Shouldn’t the “vanilla” method also be equipped with a small replay buffer with a selection of interactions to make fair comparisons?
  5. I am confused by the phrase "hyper-teammate identification" in the title; it is basically a hypernetwork plus teammate modeling.

Ethics Concerns

I am not sure about ICLR's policies on human experiments. I will let the AC decide whether the paper needs an ethics review.