PaperHub

ICLR 2024 · Withdrawn
Overall rating: 5.5/10 (4 reviewers; scores: 5, 6, 6, 5; min 5, max 6, std 0.5)
Average confidence: 3.8

CAMMARL: Conformal Action Modeling in Multi Agent Reinforcement Learning

OpenReview · PDF
Submitted: 2023-09-24 · Updated: 2024-02-02

Abstract

Keywords

Multi Agent Reinforcement Learning, Reinforcement Learning, Conformal Predictions

Reviews and Discussion

Review (Rating: 5)

This paper proposes a multi-agent reinforcement learning (MARL) algorithm, CAMMARL, in which each agent's policy is conditioned not only on its own observation but also on estimates of the other agents' actions. Specifically, it uses conformal prediction to learn confident sets containing the other agents' probable actions. CAMMARL considers a two-agent setting and is tested on two MARL tasks, cooperative navigation and level-based foraging.

Strengths

  • The paper is well-motivated. Modeling other agents' actions is intuitively helpful for learning a good policy.
  • There are relevant statistical analyses of how CAMMARL works, such as the average set sizes output by the conformal predictor.
  • A range of related baselines is considered.

Weaknesses

  • The predicted conformal sets grow exponentially with respect to |A_other|, which may not be scalable when |A_other| is large.
  • The paper is restricted to the two-agent setting. It is not clear whether the proposed method can generalize to more complex tasks with potentially many agents.
  • The experimental results are preliminary, covering only two tasks. Further, the improvement over the baselines is not obvious, especially in CN in Figure 4.

Questions

  • Can CAMMARL work in settings with more than two agents? If so, it would be beneficial to show its effectiveness on more complex MARL benchmarks with more agents.
  • Does N_other's action also condition on N_self's action?
Comment

Thank you for the insightful comments and suggestions. We have addressed your concerns below.

  1. Scalability: Even with a small number of agents, MARL can be quite challenging to solve. Our main goal in this paper was to develop a framework for conformal action prediction in MARL. For every agent, the conformal set would grow by |A_other|. However, we can also use a common last layer for all the conformal prediction models and append that layer's output to the input states; that way, we can incorporate multiple agents seamlessly. Generally, even in a setting with many agents, for instance 100 or 1000, one could prefer modeling only a local set of agents instead of all of them, reducing the complexity of the scalability problem. Nonetheless, these ideas require a modified design to be investigated and would be an interesting follow-up work.

  2. Performance with respect to Baselines: While we agree that the performance improvement is not significant in CN, for LBF it is quite significant, especially considering that the baselines have full information about the actions of the other agents before making any decisions. Note that for CN, the baseline that CAMMARL largely overlaps with, EAP, is effectively CAMMARL with set size = 1. Since action prediction is quite simple in CN, conformal predictions are not as beneficial there as in environments like LBF. GIAM is clairvoyant in that it has access to all the information, so comparing it against CAMMARL is unfair; it is meant as an upper bound on performance. In comparison to the other baselines, CAMMARL does better even in CN.

  3. Multiple Agents: CAMMARL as an idea is not restricted to two agents and can easily be adapted to incorporate multiple agents. Results in mixed settings and with more than 2 agents: We also show results in the non-cooperative Level-Based Foraging setting for 2 agents with 6 food locations in Section D.1 of the Appendix in the revised version of the paper. This section shows that the utility of CAMMARL is not restricted to cooperative settings; it can be applied to a multitude of MARL problems. We also show results on Pressure Plate with 4 agents, a new environment described in Appendix Section A of the revised version. In Section D.2 of the revised paper, we show that CAMMARL converges 10% faster (in terms of the number of episodes) than the other baselines in this domain. This lends further credibility to conformal predictions and shows the adaptability of CAMMARL to MARL domains with more agents.

  4. Does N_other's action also condition on N_self's action?: No, it does not. The other agent has its own set of observations and actions and is not directly influenced by the actions of the main agent.
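To make the conditioning structure in point 4 concrete, here is a minimal Python sketch of the decision flow: each agent conditions only on its own observation plus a conformal set estimated for the other agent, never on the other agent's actual chosen action. The function names and the stub conformal model are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_action_set(obs_other, num_actions=5, coverage=0.9):
    """Hypothetical stand-in for the trained conformal model: returns a
    set of plausible actions for the other agent, built by accumulating
    predicted probability mass until the target coverage is reached."""
    probs = rng.dirichlet(np.ones(num_actions))  # stub for model output
    order = np.argsort(-probs)                   # most likely first
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, coverage) + 1)
    return set(order[:min(k, num_actions)].tolist())

def act(policy, own_obs, other_obs):
    # N_self conditions on its OWN observation and a conformal estimate
    # of N_other's action -- not on N_other's actual chosen action.
    pred_set = conformal_action_set(other_obs)
    return policy(own_obs, pred_set)

# usage with a toy policy that just reacts to the predicted set's size
action = act(lambda obs, s: len(s) % 5, np.zeros(4), np.zeros(4))
```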

Given our limited compute resources, we believe we have answered your questions and clarified your doubts as much as possible. We would be grateful if you could consider increasing your score if you find our responses adequate.

Comment

Thank you for your rebuttal. While some of my concerns are addressed, I still have some reservations.

  • The empirical results would be more convincing if CAMMARL were tested on more complex MARL benchmarks such as [1] and [2]. There are some improvements in Appendix D.1, but the selected tasks are too simple.

  • If CAMMARL is not restricted to the two-agent setting, I suggest the authors reframe Section 3 with an arbitrary number of agents.

[1] The StarCraft Multi-Agent Challenge

[2] Google Research Football: A Novel Reinforcement Learning Environment

Comment

Thank you for your review and your additional recommendations and suggestions to make the paper stronger.

  1. We have adapted Section 3 to include multiple agents and we have adapted Algorithm 1 as well to match Section 3.
  2. In the absence of a recommendation in the original review, we added Pressure Plate with 4 agents and demonstrated good sample efficiency over a significantly stronger baseline, GIAM. In addition, we added more experiments with mixed settings for multiple agents. We do believe your recommendation to add SMAC and Google Football is good and adds value to the experimental section; however, while we did our best to adapt our code and run experiments on both SMAC and Google Football in such a short time, we were not able to train any of the models long enough to see learning behavior. Moreover, our main goal is to introduce the conformal architecture and how it can be used for MARL, and in the domains we tested, we do notice improvements from using conformal predictions. We thank you again for your valuable time and helpful comments to improve the paper.
Comment

Thank you for the update of the paper. I have raised the score.

Comment

We are grateful for your consideration in increasing the score. The experiments on StarCraft were not straightforward to set up because StarCraft requires action masking, as the action space can change during training. We would also like to emphasize that our experiments on Google Football and the StarCraft Multi-Agent Challenge are running, and we hope to include them in the camera-ready version. We would be happy to address any further questions about the paper.

Review (Rating: 6)

The paper proposes CAMMARL, a novel multi-agent reinforcement learning algorithm that uses conformal predictions to model the actions of other agents in the environment as sets that contain their true actions with high probability. The paper claims that these sets can inform the decision-making of an agent and improve its performance in cooperative tasks. The paper demonstrates the effectiveness of CAMMARL in two multi-agent domains and compares it with several baselines that use different types of information about other agents.

Strengths

The paper addresses an important and challenging problem of reasoning about other agents in partially observable environments. The paper introduces a novel way of using conformal predictions to model the uncertainty and confidence of other agents' actions. The paper presents extensive experiments on two challenging cooperative tasks and demonstrates that CAMMARL improves over various baselines in terms of returns, learning speed, set sizes, and coverage. The paper also discusses some limitations and future directions for extending CAMMARL to more complex scenarios. The paper is well-written, clear, and well-structured. The code is provided for reproduction.

Weaknesses

The paper has some limitations and possible areas for improvement.

  • The paper could provide more thorough discussion of experimental details. For example, it does not explain how the regularization term in the conformal prediction model is chosen or tuned, or how it affects the performance of CAMMARL. A more transparent and rigorous analysis of this aspect would enhance the credibility and generalizability of CAMMARL. Also, what are the differences in settings between the baseline algorithms compared in the paper? Since they have different input dimensions, they may also need slightly different training settings to achieve their best performance.
  • The paper uses learning curves as the only evaluation metric and does not provide any qualitative analysis or visualization of the learned policies or behaviors of CAMMARL agents, which would help to illustrate how they leverage conformal predictions to cooperate effectively.
  • The paper only considers fully cooperative settings and does not explore how CAMMARL would perform in competitive or mixed scenarios where other agents' intentions may not be aligned or predictable. In particular, in the competitive robust RL setting, the conformal set could potentially be used for worst-case analysis.
  • The paper does not discuss any potential drawbacks or challenges of using conformal predictions, such as computational costs, calibration issues, or sensitivity to hyperparameters.

Questions

  • Intuitively, the idea of using conformal predictions to model the actions of other agents augments the agent's state with a historical memory of the other agents' behavior, which is similar to fictitious play in game theory. However, fictitious play only converges in some specific game settings, and MARL is known to be hard to converge in general settings. Moreover, the conformal predictions are based on previous observations, but the other agent's policy is also evolving. Is there any case where the algorithm does not converge well or even exhibits cycling behavior?
  • In the paragraph on Global-Information-Agent-Modeling (GIAM), the paper states that this can be infeasible in real-world scenarios. Why is that? It seems the information used in this algorithm is the same as in CAMMARL. The difference is that in CAMMARL the historical trajectories are used to first train a prediction model whose output is fed into the agent's policy network, whereas in GIAM they are used directly in the policy network in an end-to-end way.
  • How does CAMMARL handle different set sizes produced by the conformal model? How does it encode them into input features for policy learning? How does this affect the stability and convergence of policy learning?
Comment

We thank the reviewer for their detailed comments and suggestions for improving the paper. We address all the concerns below.

  1. Regularization Term in Conformal Prediction: We apologize for the lack of clarity in the explanation. Each time the conformal model is trained, we choose the lambda and k_reg parameters from a set that optimizes for small action-set sizes. The k_reg values depend on the logits, while the lambda values come from the set [0.001, 0.01, 0.1, 0.2, 0.5]. We have added this explanation in Section C of the Appendix in the revised version of the paper.

  2. Hyper-parameters for the Baselines and CAMMARL: Thank you for pointing this out. Yes, the hyper-parameters differ across the baselines. We performed a moderate grid search over the hyper-parameters and chose the best configuration for each individual baseline.

  3. Visualization of Policies: For all the environments, rewards themselves provide a way to gauge cooperation, since in all domains the rewards are designed to be higher for better cooperation. We show a policy learned by CAMMARL in Pressure Plate in Section D.3 of the Appendix of the revised version of the paper. We can see perfect cooperation among the 4 agents in the environment, with each agent staying behind to keep the gateway open.

  4. Drawbacks and Sensitivity Issues: The main drawback is the extra neural network used for conformal action prediction, but this would also be true of other methods that model actions from observations. Like any RL method, ours is sensitive to hyper-parameters; however, for the conformal regularization parameters, we sweep at every conformal model training step to find the optimal values for the conformal sets. In terms of calibration, we adopt the same procedure as RAPS [1] and obtain a conformal calibration set from the data. The core conformal prediction algorithm is exactly the same as RAPS. We hope this clarifies your doubts.

  5. Fictitious Play: Thank you for pointing out this interesting comparison; intuitively, it is indeed similar to fictitious play. All the action-modeling baselines include actions from previous observations, so they would be intuitively similar as well. However, the experiments we ran are all non-zero-sum games, and we did not observe any cycling or non-convergent behavior.

  6. Clarification of GIAM and CAMMARL Inputs: We apologize for the confusion. GIAM is infeasible because it uses the real actions of the other agents as input when selecting its own actions, whereas CAMMARL builds a model from replay data. GIAM is in a sense clairvoyant: before taking an action, it has access to the true actions of the other agents. This is why GIAM is infeasible in real-world scenarios.

  7. Different Set Sizes of the Conformal Model: In CAMMARL, the conformal prediction model is a separate MLP that learns (simultaneously with the RL agent) to output conformal prediction sets. These sets can vary in size across situations, so we devised several ways to use them in CAMMARL. One way is to pad the output set with zeros (or any other constant value) to bring it to a fixed size (the total number of possible actions). Another is to use the last layer of the conformal prediction model's MLP as an embedding (which always has a fixed size) instead of passing the conformal set itself as input to the n_self agent. We provide more details and related experiments in Appendix B of the revised version.
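As an illustration of the first (padding) option in point 7, one simple fixed-size encoding is a vector over the full action space in which member actions are marked and the remaining slots keep a constant pad value. This is a hedged sketch with illustrative names, not the paper's exact implementation.

```python
import numpy as np

def encode_conformal_set(action_set, num_actions, pad_value=0.0):
    """Encode a variable-size conformal action set as a fixed-length
    vector over the full action space: actions in the set get 1.0 and
    the remaining slots keep a constant pad value, so the policy
    network always sees an input of the same dimension."""
    vec = np.full(num_actions, pad_value, dtype=np.float32)
    for a in action_set:
        vec[a] = 1.0
    return vec

# the agent's own observation is concatenated with the encoded set
own_obs = np.zeros(4, dtype=np.float32)  # illustrative observation
policy_input = np.concatenate([own_obs, encode_conformal_set({0, 3}, 5)])
```

The embedding alternative mentioned above avoids this padding entirely by always taking a fixed-width hidden layer of the conformal MLP as the policy input.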

[1] Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I Jordan. Uncertainty sets for image classifiers using conformal prediction. arXiv preprint arXiv:2009.14193, 2020.
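Points 1 and 4 above describe a RAPS-style procedure [1]. The following sketch shows how the calibration threshold and the lambda sweep over the stated candidate set might look; the function names and values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def calibrate_qhat(cal_scores, alpha=0.1):
    """Split-conformal calibration: the ceil((n+1)(1-alpha))/n empirical
    quantile of the calibration scores gives ~(1-alpha) coverage."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

def raps_sets(probs, qhat, lam, k_reg):
    """RAPS-style sets: add classes (most likely first) until the
    regularized cumulative score crosses the calibrated threshold;
    the lam * max(rank - k_reg, 0) penalty discourages large sets."""
    sets = []
    for p in probs:
        order = np.argsort(-p)
        rank = np.arange(1, len(p) + 1)
        scores = np.cumsum(p[order]) + lam * np.maximum(rank - k_reg, 0)
        size = int(np.searchsorted(scores, qhat) + 1)
        sets.append(set(order[:min(size, len(p))].tolist()))
    return sets

def pick_lambda(probs, qhat, k_reg, candidates=(0.001, 0.01, 0.1, 0.2, 0.5)):
    """Choose the lambda yielding the smallest average set size,
    mirroring the sweep described in point 1."""
    return min(candidates,
               key=lambda lam: np.mean([len(s) for s in raps_sets(probs, qhat, lam, k_reg)]))
```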

Comment
  1. Results in mixed settings and with more than 2 agents: We also show results in the non-cooperative Level-Based Foraging setting for 2 agents with 6 food locations in Section D.1 of the Appendix in the revised version of the paper. This section shows that the utility of CAMMARL is not restricted to cooperative settings; it can be applied to a multitude of MARL problems. We also show results on Pressure Plate with 4 agents, a new environment described in Appendix Section A of the revised version. In Section D.2 of the revised paper, we show that CAMMARL converges 10% faster (in terms of the number of episodes) than the other baselines in this domain. This lends further credibility to conformal predictions and shows the adaptability of CAMMARL to MARL domains with more agents.

We hope we have answered all your doubts and questions adequately. Given our limited compute resources, we tried to complete as many experiments as possible, and we hope this further strengthens our claims. Please consider increasing your score if you find our responses satisfactory.

Comment

We kindly request that the reviewer provide a final response, as the deadline for the discussion phase is fast approaching.

Comment

The response addresses some of my concerns, and given that the authors are adding them to the appendix, I will raise my score from 5 to 6.

Comment

Thank you for your thoughtful review and suggestions to improve the clarity and strength of the paper. We are grateful that you have updated your score to reflect this.

Review (Rating: 6)

The authors propose a multi-agent reinforcement learning algorithm that uses conformal prediction to explicitly reason about the behavior of other agents. To evaluate the proposed algorithm, two experiments using cooperative simulated environments are conducted. The experiments show that CAMMARL outperforms alternative methods, though it still falls short of a model with access to global information.

Strengths

  • Applying conformal prediction to reason about agents in a MARL problem is a simple but straightforward idea that doesn't suffer from the drawbacks of alternative methods, which require considerably more compute and data (e.g., inverse reinforcement learning).
  • The conformal approach not only supports reasoning about other agents but also provides confidence measures around its reasoning.

Weaknesses

  • Beyond the idea of using conformal modeling for action prediction, there do not appear to be large algorithmic advancements in solving the conformal problem in this setting.
  • The simulated environments used in the experiments are quite simple.
  • Algorithm 1 is difficult to follow. For example, many lines contain multiple assignments and updates.

Questions

  • What is b_conformal in Algorithm 1?

Details of Ethics Concerns

I have no concerns.

Comment

Thank you for the thoughtful review. In our rebuttal, we have addressed your concerns sequentially.

  1. Novelty: While we added conformal modeling for action prediction, we also explored different ways of incorporating the predictions into the main RL algorithm. Since the conformal predictions can have different set sizes, we explored ways of incorporating them, including using the model's embeddings directly. Our main goal is to highlight that conformal predictions can be a very useful tool for agent modeling in MARL and can be more beneficial than predicting exact actions or their probabilities. Simple Environments: Multi-agent reinforcement learning is in general not an easy problem to solve, and our environments are indeed difficult, as evidenced by the performance of NOAM in CN and LBF, which does not do nearly as well.

  2. Results in mixed settings and with more than 2 agents: We also show results in the non-cooperative Level-Based Foraging setting for 2 agents with 6 food locations in Section D.1 of the Appendix in the revised version of the paper. This section shows that the utility of CAMMARL is not restricted to cooperative settings; it can be applied to a multitude of MARL problems. We also show results on Pressure Plate with 4 agents, a new environment described in Appendix Section A of the revised version. In Section D.2 of the revised paper, we show that CAMMARL converges 10% faster (in terms of the number of episodes) than the other baselines in this domain. This lends further credibility to conformal predictions and shows the adaptability of CAMMARL to MARL domains with more agents.

  3. Algorithm 1: We apologize for the lack of clarity. We have made the changes in the revised version of the paper.

  4. b_conformal: b_conformal is the replay buffer for the conformal model. It is exactly the same as the replay buffer for N_other (b_other), except that the conformal model is updated at a different frequency than b_other.

Because of our limited compute resources, we tried to include as many new experiments as possible. We hope this rebuttal and the new experiments will clarify your doubts and answer your questions. We would be grateful if you could consider increasing your score if you find our responses satisfactory.

Comment

I want to thank the authors for their thoughtful response. I am happy with the modifications (namely, improving the experimental section and the algorithmic clarity) and have bumped my score up accordingly. I feel the main contribution is providing evidence that conformal modeling can be useful for MARL. To the AC: I am not an expert in the MARL literature, so this may not be an advancement or strong evidence with respect to common baselines in the literature, but I don't see any obvious flaws.

Comment

Thank you for your thoughtful review and suggestions to improve the clarity and strength of the paper. We are grateful that you have updated your score to reflect this.

Review (Rating: 5)

In this paper, the authors propose a multi-agent reinforcement learning (MARL) algorithm, CAMMARL, which models the actions of other agents in different situations. These estimates are then used to inform an agent's decision-making. The experimental results illustrate that the proposed method elevates the capabilities of an autonomous agent in MARL.

Strengths

  • The motivation of this paper is reasonable.
  • This paper is well-organized and well-written.
  • The main idea of the proposed method is described in great detail.

Weaknesses

  • The proposed method does not seem to improve performance greatly compared with other methods.
  • The scenarios selected in this paper are not entirely convincing. More experiments should be conducted to demonstrate the superiority of the proposed method over other methods.
  • The assumptions of this paper are too idealized. Commonly, it is hard to obtain observations from other agents in MARL, especially in a decentralized context.

Questions

See weaknesses.

Comment

We thank the reviewer for the helpful comments and suggestions to improve the paper. We address all the points mentioned sequentially below.

  1. Performance with respect to Baselines: While we agree that the performance improvement is not significant in CN, for LBF it is quite significant, especially considering that the baselines have full information about the actions of the other agents before making any decisions. Note that for CN, the baseline that CAMMARL largely overlaps with, EAP, is effectively CAMMARL with set size = 1. Since action prediction is quite simple in CN, conformal predictions are not as beneficial there as in environments like LBF. GIAM is clairvoyant in that it has access to all the information, so comparing it against CAMMARL is unfair; it is meant as an upper bound on performance. In comparison to the other baselines, CAMMARL does better even in CN.

  2. More Domains: CAMMARL as an idea is not restricted to fully cooperative worlds and generalizes to mixed/competitive settings. More specifically, if an agent in a mixed or competitive setting needs to model other agents in the environment, CAMMARL can be applied there too.

Results in mixed settings and with more than 2 agents: We also show results in the non-cooperative Level-Based Foraging setting for 2 agents with 6 food locations in Section D.1 of the Appendix in the revised version of the paper. This section shows that the utility of CAMMARL is not restricted to cooperative settings; it can be applied to a multitude of MARL problems. We also show results on Pressure Plate with 4 agents, a new environment described in Appendix Section A of the revised version. In Section D.2 of the revised paper, we show that CAMMARL converges 10% faster (in terms of the number of episodes) than the other baselines in this domain. This lends further credibility to conformal predictions and shows the adaptability of CAMMARL to MARL domains with more agents.

  3. Using Observations of Other Agents: We thank the reviewer for raising this important point. It is indeed a strong assumption, particularly in the case of decentralized multi-agent learning; even with observations, the problem is still quite challenging to solve. Also, in problems without ego-centric views, the entire observation is generally available, and the key challenge is then figuring out the actions of the other agents for complete information. We have added this as a limitation in Section 7, paragraph 2.

Given our limited computing resources, we hope we have addressed all of your concerns. Please consider increasing your score if you find our responses satisfactory. Additionally, we would be happy to answer any follow-up questions.

Comment

We kindly request that the reviewer provide a final response, as the deadline for the discussion phase is fast approaching.

Comment

We thank the reviewer for the review and want to stress that we have updated the paper to address all concerns about the assumptions and the performance with respect to the baselines. We also added more domains, including mixed settings, in the Appendix of the revised version. We would appreciate it if you could consider increasing your score if you find the responses satisfactory, and we would be happy to address any remaining concerns in the remaining time.

Comment

We thank everyone for their feedback and suggestions to strengthen the paper. We have clarified all the concerns mentioned and made substantial improvements to the paper as outlined in the rebuttal and the revised version of the paper.