Inverse Factorized Soft Q-Learning for Cooperative Multi-agent Imitation Learning
An Inverse Q-Learning Algorithm for Multi-Agent Imitation Learning
Abstract
Reviews and Discussion
This paper extends the IQ-Learn method to cooperative multi-agent settings. The main insight is to use mixing networks to enable centralized training via decentralized Q functions.
Strengths
- The paper is quite relevant to NeurIPS and it is indeed important to extend IQ-Learn (or similar inverse learning algorithms) to multi-agent systems.
Weaknesses
- The major concern that I have is that, if my understanding is correct, the paper assumes access to the global state information. This is not realistic. In real applications, this will never be the case. So the algorithm does not seem useful in practice.
- Typo: In line 62, it should be "generalization" instead of "generation".
- In line 72, \citet should be used instead of \cite or \citep so that the author names will become a part of the sentence.
- In line 162, \eqref should be used instead of \ref so that the parenthesis will appear around the equation number.
- The architecture figure is on page 7. It would significantly increase readability if it came earlier.
- By the time the reader reads line 191, the IGC principle is still undefined. This makes reading very difficult.
- The same thing is true at line 203, too.
- Typo: In line 241, it should be "makes" instead of "make".
- Typo: In line 242, it should be "yields" instead of "yield".
Questions
- How do the agents have access to the global state information? If this is the case, why does the paper even define observations? Is the global state information available only in training, or after deployment, too? In what settings is this applicable?
- How could one adapt this algorithm for non-cooperative settings? Is there a straightforward way or does it require completely new approaches?
Limitations
The paper does not discuss the broader impacts. I disagree that there is no potential societal impact. I invite the authors to think about the applications their algorithm may have and then consider how their algorithm would affect those applications (both positively and negatively).
How do the agents have access to the global state information? If this is the case, why does the paper even define observations? Is the global state information available only in training, or after deployment, too? In what settings is this applicable?
Thank you for the question! We would like to clarify that our model does not assume access to global state information of the entire environment. As briefly described in Section 4.2.1, each agent in our MARL setting only has local observations of other agents (enemies or allies) in the agent’s neighborhood. The state information notion S in our MIFQ model is simply a combination of these local observations of our agents. We will make this distinction clearer in the revised version of the paper. We note that this setting of local observations is standard in many previous MARL studies. Such local observations are available in both training and deployment and, we believe, are highly realistic in practical applications.
How could one adapt this algorithm for non-cooperative settings? Is there a straightforward way or does it require completely new approaches?
We thank the reviewer for the question. Theoretically, our approach can be applied to non-cooperative settings. However, in practice, it would require a completely new algorithm. With conflicting rewards, the non-cooperative setting is much more challenging to train compared to the cooperative setting. We will definitely explore this in future work.
We also thank the reviewer for pointing out the typos, which we highly appreciate and will correct.
We hope that the above responses address your concerns. If you have any other comments or concerns, we are more than happy to address them.
Regarding the broader impact of our work, since our research focuses on imitation learning in multi-agent systems, it may have potential applications similar to areas where imitation learning has been impactful, such as autonomous driving, healthcare, and game theory. There are also potential negative impacts. For instance, imitation learning could be used for surveillance purposes, following and monitoring individuals in public spaces, or for developing autonomous weapons. We thank the reviewer for bringing this up and will elaborate such impacts in detail in our revised version.
I thank the authors for their response. After reading their answers, I did another pass of the paper and I believe I now have a better understanding of their algorithm. I will update my score and trust the authors that they will (1) clarify the confusion about global state information and the meaning of S, and (2) add the discussion about broader impacts in their camera-ready version.
We highly appreciate the reviewer for taking the time to read our responses and for the positive feedback on our work. We will definitely improve our discussion on global state information and include a discussion on the broader impacts of our work.
This paper addresses the problem of extending a single-agent imitation learning algorithm, inverse soft-Q learning (IQ-Learn, Garg et al., NeurIPS 2021), to the multi-agent cooperative setting. The proposed algorithm, MIFQ, leverages the ideas of mixing networks and the individual-global-max (IGM) principle to perform the extension. Experimental evaluations of MIFQ are conducted on SMAC-v2, MPE, and Gold Miner, and demonstrate that MIFQ improves over baselines across various domains and with varying numbers of demonstrations.
Strengths
The paper addresses the challenge of generalizing a key imitation learning (IL) algorithm from single-agent to multi-agent settings, offering a novel approach with MIFQ. The problem is clearly specified and represents an important contribution to the MARL literature.
The empirical results are robust:
- MIFQ outperforms most baselines with various demonstrations.
- Extensive experiments across multiple domains and tasks confirm MIFQ's superior performance.
- Comprehensive comparisons with baselines (BC, independent IQ learning, alternative Q-factorization methods, etc.) highlight MIFQ's advantages.
Weaknesses
- Some aspects of the method do not seem fully justified to me:
- The authors claim in lines 143-148 that a shortcoming of the IQ-Learn method is that the objective depends on the centralized state and joint action. However, Section 5.4 of the IQ-Learn paper presents a state-only objective (independent from the actions). I wonder if the authors could discuss whether a simple state-only extension of IQ-Learn, where the critic depends on the centralized state as usual but the actor depends on the observations, would be sufficient to sidestep many of the concerns raised about IQ-Learn?
- The authors also claim in Section 4.1.2 that the straightforward Independent Inverse Q-learning is not a satisfactory solution because the method "…has limitations in addressing the interdependence between agents and the global information available during the training process." Can the authors discuss more explicitly why an independent version of IQ-Learn is not satisfactory? Does it suffer from convergence problems?
- The current experimental analysis is somewhat shallow, and essentially amounts to a description of the plots. The authors could improve the analysis of MIFQ by considering the following additional questions:
- The original IQ learn paper plots the rewards to validate that their method recovers the ground truth reward. Can the same be done here?
- Why does MIFQ perform worse than BC on MPE, particularly the reference and spread tasks?
- There are some issues with how the experimental results have been reported.
- What is the number of trials for each of the results? Please include this in the main paper.
- The caption of Figure 2 is missing key information to understand the figure. What is the number of demonstrations used to train each of the methods? What does the shaded region mean? Based on the std devs reported in the Appendix, I assume it is the standard deviation; please see the note below and instead compute 95% confidence intervals.
- No measurements of uncertainty are provided in Table 2, and standard deviations are provided only in the Appendix. Standard deviations reflect the underlying variance in models learned by the algorithm, rather than providing a measure of statistical significance. Please also compute 95% confidence intervals to enable readers to judge the statistical significance of the gaps in mean test returns -- ideally, bootstrapped confidence intervals. See this paper for a reference on best practices: https://arxiv.org/abs/2304.01315
- There are also some minor clarity issues:
- IGC is used in line 192, but is only explained in the following Section 4.2.2
- Definition 4.2 - this definition is not specific enough to be useful. It handwaves by only requiring that the joint policy be 'equivalent' to the collection of individual optimal policies. Equivalent in what sense?
Questions
- Questions about experiments:
- What are some reasons why MIFQ does not achieve expert level performance? While the other methods also do not achieve expert level performance, the original IQ learn algorithm does have this ability.
- How does the method perform with demonstrations not sourced from MAPPO (an algorithm that learns gaussian policies)? For example, demonstrations sourced from QMIX, which learns 'hard max' policies?
- Why does the method need an order of magnitude more demonstrations than IQ Learn needs on complex single-agent tasks?
- Method:
- Why is it necessary to maintain Q and V networks separately? Why not derive the global V function by computing the softmax of the Q functions as described in line 163-164?
- Why is it necessary to compute Q^tot via Q^tot = -M (-Q)? What is the purpose of the double negation? The stated justification is that this enables the method "to achieve the IGC principle and the convexity", but why exactly is this? Requiring the networks to be multi-layer feedforward with nonnegative weights and convex activation functions (lines 194-195) is enough to ensure that Q^tot is monotonic w.r.t. the local Q functions, thus ensuring the IGC principle and convexity.
- Would major changes be necessary to enable this algorithm to operate on continuous action spaces? Did the authors consider continuous action space settings?
Limitations
Yes.
We thank the reviewer for carefully reading our paper and providing us with valuable questions and suggestions.
I wonder if the authors could discuss whether a simple state-only extension of IQ Learn ...
Our argument in lines 143-148 simply means that directly using the global Q, global V, and global state would be impractical in multi-agent settings. This is not a limitation of IQ-Learn but a well-known challenge when extending single-agent models to multi-agent settings. This is also why the centralized training with decentralized execution (CTDE) approach has become appealing for MARL. The state-only approach in Section 5.4 of the IQ-Learn paper is only useful when actions are not available. It is not suitable in our context because action observations are available.
Can the authors more explicitly discuss what the shortcomings of an independent version of IQ-learn ...
If we learn the local Q independently by solving (4) for each agent, it implies that we neglect the interactions between agents. This approach ensures convergence to individual local policies but does not maintain consistency between local and global policies, as required by well-known principles such as IGO and IGM, which are necessary for a successful MARL algorithm.
The original IQ learn paper plots the rewards to validate that their method recovers the ground truth reward. Can the same be done here?
Visualizing rewards in multi-agent settings is much more challenging compared to the single-agent setting due to the vast joint state and action space. So far, we are unsure how to obtain meaningful visualizations for rewards in multi-agent tasks. Therefore, we will keep this for future investigation.
Why does MIFQ perform worse than BC on MPE, particularly the reference and spread tasks?
As mentioned in the paper, MPEs are deterministic environments (i.e., there is no stochasticity in the dynamics), and BC typically performs well on such deterministic tasks.
What is the number of trials for each of the results?
We briefly mentioned these numbers in Table 2 of the appendix. Each number in Table 1 is computed based on 4 seeds and 32 evaluation runs per seed. We will add this information to the main paper.
What is the number of demonstrations used to train each of the methods? What does the shaded region mean? ...
The number of trajectories is 128 for MPEs and 4096 for Miner and SMAC-v2. We will clarify this in the caption of Figure 2. Additionally, the reviewer's point regarding the 95% confidence interval is well taken. We will compute these and update the paper accordingly.
IGC is used in line 192, but is only explained in the following Section 4.2.2. Definition 4.2 - this definition is not specific enough to be useful. ...
We appreciate the reviewer for pointing these out. We will remove the mention of IGC in line 192. In Definition 4.2, equivalence means that the joint policy is equal to the product of local policies. We will clarify this.
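Written out, with π denoting the joint learned policy and π_i the agents' local policies (the symbols here are ours, chosen to match the wording of the definition, not the paper's exact notation):

$$
\pi^{*}(\mathbf{a}\mid s) \;=\; \prod_{i=1}^{n} \pi_i^{*}(a_i \mid o_i),
$$

i.e., the joint policy induced by the global objective factorizes into the individually optimal local policies.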
What are some reasons why MIFQ does not achieve expert level performance? ...
This was stated as a limitation of our approach (and other existing multi-agent IL algorithms as well). Multi-agent tasks are much more complex, making it difficult to recover the expert policy. Increasing the amount of expert demonstrations might help, but it also leads to an excessively large replay buffer, causing out-of-memory issues. Addressing this limitation will require further efforts, which we plan to pursue in future work.
How does the method perform with demonstrations not sourced from MAPPO ? ...
In our context, MAPPO achieves the best policy in MARL, so we use it as an expert. The main reason is that it is not reasonable to use a sub-optimal policy as an expert for imitation, as a sub-optimal solution to the imitation learning problem could yield better rewards than the expert, thus biasing the evaluation.
Why does the method need an order of magnitude more demonstrations than IQ Learn needs on complex single-agent tasks?
The main reason is that multi-agent tasks are much more complex than single-agent tasks, with much larger action and state spaces. Therefore, much more data is needed to understand the environment, requiring significantly more demonstrations for imitation learning.
Why is it necessary to maintain Q and V networks separately? Why not derive the global V function by computing the softmax of the Q functions as described in line 163-164?
We have discussed this in Section B.4 of the appendix. The main reason for our approach is to make the algorithm practical. Directly computing the global V through the global Q is generally impractical because it requires sampling over the joint action space, which is exponentially large. We actually attempted this approach, but it did not work at all: the algorithm couldn't learn anything, and the win rates were always zero. Therefore, we did not include this approach in the comparison. In our approach, we compute the local V values from the local Q values (which only requires sampling over each local action space, making it much more feasible) and then aggregate the global V using the mixing network.
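To make the tractability gap concrete, here is a minimal numpy sketch (illustrative only, not the paper's implementation; the agent and action counts, the `soft_value` helper, and the stand-in weighted-sum mixer are our own assumptions):

```python
import numpy as np

n_agents, n_actions, alpha = 10, 12, 1.0
rng = np.random.default_rng(0)

# Local Q-values for each agent at its current observation: shape (n_agents, n_actions).
q_local = rng.normal(size=(n_agents, n_actions))

def soft_value(q, alpha=1.0):
    """Soft (log-sum-exp) value over the last axis: V = alpha * log sum_a exp(Q / alpha)."""
    z = q / alpha
    m = z.max(axis=-1, keepdims=True)
    return alpha * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))

# (a) Global route: the joint soft value needs a log-sum-exp over n_actions ** n_agents
#     joint actions, which is what makes the direct extension impractical.
print("joint actions to enumerate:", n_actions ** n_agents)

# (b) Factorized route: one cheap local soft value per agent (cost n_agents * n_actions),
#     then aggregate with a monotone mixer: a nonnegative weighted sum as a stand-in here.
v_local = soft_value(q_local, alpha)
w = rng.uniform(0.1, 1.0, size=n_agents)   # nonnegative "mixing" weights (stand-in)
print("local soft values:", np.round(v_local, 3))
print("mixed V_tot:", round(float(w @ v_local), 3))
```

With 10 agents and 12 actions each, the joint log-sum-exp would have to enumerate roughly 6e10 joint actions, whereas the factorized route needs only 120 local evaluations before mixing.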
Why is it necessary to compute Q^tot via Q^tot = -M (-Q)?
The main reason for this double negation is not only monotonicity; the monotone mixing structure the reviewer describes is indeed sufficient for that purpose. We use the double negation to ensure, in addition, that the global objective function is concave in the local Q values (Theorem 4.5).
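To illustrate the point numerically, the following sketch builds a small mixer with nonnegative weights and a convex (ReLU) activation, which makes M convex and nondecreasing, and then checks that -M(-q) is concave and still monotone in the local Q values (all names and shapes are our own placeholders, not the authors' architecture):

```python
import numpy as np

rng = np.random.default_rng(1)
n_agents, hidden = 5, 16

# Mixer weights constrained to be nonnegative; ReLU is a convex activation, so M is
# convex and nondecreasing in its input vector.
W1 = rng.uniform(0.0, 1.0, size=(hidden, n_agents))
b1 = rng.normal(size=hidden)
W2 = rng.uniform(0.0, 1.0, size=(1, hidden))
b2 = rng.normal(size=1)

def M(x):
    h = np.maximum(W1 @ x + b1, 0.0)
    return (W2 @ h + b2).item()

def q_tot(q):
    # Double negation: since M is convex and nondecreasing, -M(-q) is concave and
    # still nondecreasing in each local q value.
    return -M(-np.asarray(q, dtype=float))

qa, qb = rng.normal(size=n_agents), rng.normal(size=n_agents)

# Numerical concavity spot check along a segment: f(midpoint) >= average of endpoints.
print("concave on this segment:",
      q_tot(0.5 * (qa + qb)) >= 0.5 * (q_tot(qa) + q_tot(qb)) - 1e-9)

# Monotonicity spot check: raising one agent's local Q never lowers Q_tot.
print("monotone in agent 0:", q_tot(qa + np.eye(n_agents)[0]) >= q_tot(qa) - 1e-9)
```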
Did the authors consider continuous action space settings?
So far, our algorithm is generally not suitable for continuous action spaces. However, all the environments under consideration have discrete action spaces and are taken from prior SOTA MARL works. Extending the approach to continuous action spaces would require further investigation. We plan to explore this in future work.
We hope that the above responses address your concerns. If you have any other comments or concerns, we are more than happy to address them.
Thanks for addressing most of my concerns and questions. The only question whose answer I wasn't completely satisfied with is why the Q and V networks are maintained separately.
The authors argued that sampling actions from the joint policies is intractable due to the size of the joint action space. However, the paper only addresses scenarios where the number of agents is relatively limited (up to 10 agents). Further, this cost would only be incurred during the training phase. Since sample efficiency is not a primary objective of this paper, I don't think this is a key issue.
Since the authors performed the experiment, perhaps they can add the results of directly computing V through Q to the appendix.
In any case, I am largely satisfied with the authors' rebuttal and will raise my score.
We thank the reviewer for reading our responses and for maintaining a positive outlook on our paper.
The authors argued that sampling actions from the joint policies is intractable due to the size of the joint action space. However, the paper only addresses scenarios where the number of agents is relatively limited (up to 10 agents). Further, this cost would only be incurred during the training phase. Since sample efficiency is not a primary objective of this paper, I don't think this is a key issue.
At this point, we have found that this approach (computing V directly via Q) does not work in our context. There might be ways to overcome this, and we will explore them in the future. Thank you for your feedback!
Since the authors performed the experiment, perhaps they can add the results of directly computing V through Q to the appendix.
Thank you for the suggestion. We will definitely include these additional experiments (directly computing V through Q) in our paper.
This paper presents a novel algorithm, Multi-agent Inverse Factorized Q-learning (MIFQ), for cooperative multi-agent imitation learning (IL). It extends the inverse soft-Q learning framework to multi-agent settings by introducing a mixing network architecture for centralized training with decentralized execution. This enables learning local and joint value functions effectively. The authors conducted extensive experiments across multiple challenging environments, demonstrating that their approach outperforms existing methods.
Strengths
- The introduction of a multi-agent extension of inverse soft-Q learning using factorized networks is a significant and novel contribution to the field of IL.
- This paper is well-written and organized, and provides a sound theoretical analysis.
- The empirical results across three different environments, including a complex version of the StarCraft multi-agent challenge, are impressive. The proposed method outperforms existing baselines.
Weaknesses
As someone who is not an expert in the field of imitation learning, I perceive no significant weaknesses in this paper from my perspective.
Questions
- In Figure 2, the semi-transparent curves are not standardly explained. If these do not represent standard deviations, what statistical measure do they depict?
- Minor Error: On Line 62, the term "generation" is used where "generalization" might be intended. Could the authors clarify or correct this in the context?
Limitations
Yes
We thank the reviewer for reading our paper and for the positive feedback.
In Figure 2, the semi-transparent curves are not standardly explained. If these do not represent standard deviations, what statistical measure do they depict?
They represent standard deviations. We will clarify this in the paper.
Minor Error: On Line 62, the term "generation" is used where "generalization" might be intended. Could the authors clarify or correct this in the context?
Yes, it should be "generalization"; this is a typo, and we will correct it.
We hope that the above responses address your concerns. If you have any other comments or concerns, we are more than happy to address them.
The paper addresses the imitation problem in cooperative Multi-Agent Reinforcement Learning (MARL). It extends inverse soft-Q learning to the multi-agent domain by leveraging value factorizations under the Centralized Training with Decentralized Execution (CTDE) paradigm. Experimental results demonstrate the effectiveness of the proposed approach across several environments.
Strengths
- The study of imitation learning in MARL is a valuable and relevant research problem, and the paper provides promising solutions.
- The experimental results are robust and convincingly support the proposed method's effectiveness.
Weaknesses
- The paper's organization could be improved. The current structure alternates between theory and architecture without a clear flow.
- The similarity between IGC and IGO [1] requires further clarification.
- The objective function (6) introduces sub-optimality compared to the original objective (3) due to the restriction that Q^tot and V^tot must be monotonic. Additionally, since Q^tot and V^tot use different mixing networks, the relationship between them violates Equation (2). This indicates that Equation (6) does not represent the same objective as Equation (3), even without considering the sub-optimality introduced by factorization. These issues need further theoretical exploration and discussion.
- Although the experimental results are promising, the superior performance seems to stem from the QMIX algorithm's advantage over other MARL algorithms. An important missing baseline is the soft actor-critic version of IQ-Learn, which uses a centralized Q function with decentralized critics and does not seem to violate the original objective.
[1] Zhang, et al., FOP: Factorizing Optimal Joint Policy of Maximum-Entropy Multi-Agent Reinforcement Learning, ICML 2021.
Questions
- Is BC trained online, given that it shows learning curves with environment steps? If so, why not use DAGGER?
- Could the authors explain why MIFQ significantly outperforms IQVDN? Is it solely due to the factorization structure?
- Why does the paper state that QPLEX is unsuitable for the proposed method? QPLEX also has a monotonic mixing structure.
Limitations
Limitations are discussed in the conclusion section.
We thank the reviewer for reading our paper and the insightful comments and questions.
The paper's organization could be improved.
Thank you for the feedback! We will revise our writing and improve our exposition.
The similarity between IGC and IGO[1] requires further clarification.
Thank you for mentioning this. IGO and our IGC are indeed similar. However, IGC is more specific in that it requires the local policies, obtained by solving the local objective functions, to be equivalent to the global policy obtained by the global objective function.
The objective function (6) introduces sub-optimality compared to the original objective (3) due to the restriction that Q^tot and V^tot must be monotonic. Additionally, since Q^tot and V^tot use different mixing networks, the relationship between them violates Equation (2). This indicates that Equation (6) does not represent the same objective as Equation (3), even without considering the sub-optimality introduced by factorization. These issues need further theoretical exploration and discussion.
Thank you for the insightful comments. We agree that the factorized Q-learning objective violates Eq. (3). The main reason we follow this approach is that the relationship between V and Q cannot hold simultaneously at both the global and local levels, i.e., (2) and (5) cannot hold simultaneously (we discussed this in Section B.2 of the appendix). On the other hand, maintaining the relationship as in (2) is impractical because it requires computing the global V via a global Q-function and a global policy. Therefore, we choose to keep (5) valid and build our factorization approach on it (computing the local V functions via (5), and then computing the global V and Q via the mixing networks, is indeed less challenging and more practical).
Furthermore, since (5) holds, each individual objective helps match the individual learning policy with the corresponding individual expert agent. Since IGC holds, our training can ensure global convergence and consistency across all agents.
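For concreteness, a minimal statement of the relations we keep, in generic notation (the temperature α and the mixer symbols M_V, M_Q below are illustrative placeholders; the paper's own Eq. (5) and mixing definitions are authoritative):

$$
V_i(o_i) \;=\; \alpha \log \sum_{a_i} \exp\!\big(Q_i(o_i, a_i)/\alpha\big), \qquad
V^{tot} \;=\; \mathcal{M}_V\big(V_1,\dots,V_n\big), \qquad
Q^{tot} \;=\; -\,\mathcal{M}_Q\big(-Q_1,\dots,-Q_n\big),
$$

so the local V-Q relation (5) holds by construction, while the global relation (2) is not imposed and the global quantities are obtained only through the monotone mixers.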
Although the experimental results are promising, the superior performance seems to stem from the QMIX algorithm's advantage over other MARL algorithms. An important missing baseline is the soft actor-critic version of IQ-Learn, which uses a centralized Q function with decentralized critics and does not seem to violate the original objective.
Thank you for the comment and suggestion. There are two reasons we did not extend the soft actor-critic version of IQ-Learn to our multi-agent setting. First, as mentioned, it requires computing the global V function via the global Q function and the global policy, which is impractical for multi-agent settings. Second, soft actor-critic (SAC) methods only work well for continuous-action-space environments. All the environments we considered (following prior SOTA MARL papers) have discrete action spaces, making direct Q-learning algorithms more suitable. To support this argument, we have conducted an additional experiment, detailed in the attached 1-page PDF, where we compare a SAC IQ-Learn adapted to multi-agent tasks. The results generally show that SAC-IQ performs worse than our algorithm, MIFQ. We will include these results in the paper.
Is BC trained online, given that it shows learning curves with environment steps? If so, why not use DAGGER?
Our BC was trained offline. The learning curves actually reflect our evaluations after certain training steps. We will clarify this in our paper.
Could the authors explain why MIFQ significantly outperforms IQVND? Is it solely due to the factorization structure?
Yes, IQVDN simply uses a linear combination of the local Q functions, while MIFQ leverages our two-layer mixing networks with learnable parameters. Previous work has also shown that QMIX generally outperforms VDN for this same reason.
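For readers less familiar with the distinction, here is a schematic numpy contrast between a VDN-style sum and a QMIX-style mixer whose nonnegative weights are produced from the state by hypernetworks (all parameters here are random stand-ins, not the paper's trained networks):

```python
import numpy as np

rng = np.random.default_rng(2)
n_agents, state_dim, hidden = 4, 8, 16
state = rng.normal(size=state_dim)
q_local = rng.normal(size=n_agents)

def vdn(q):
    # VDN: Q_tot is a plain unweighted sum of the local Q values.
    return float(q.sum())

# Hypernetwork parameters (random stand-ins): they map the global state to mixing weights.
Hw1 = rng.normal(size=(hidden * n_agents, state_dim))
Hb1 = rng.normal(size=(hidden, state_dim))
Hw2 = rng.normal(size=(hidden, state_dim))
Hb2 = rng.normal(size=(1, state_dim))

def mixed(q, s):
    # State-conditioned, nonnegative weights keep Q_tot monotone in every local Q value,
    # while being far more expressive than a fixed sum.
    w1 = np.abs(Hw1 @ s).reshape(hidden, n_agents)
    w2 = np.abs(Hw2 @ s)
    h = np.maximum(w1 @ q + Hb1 @ s, 0.0)
    return (w2 @ h + Hb2 @ s).item()

print("VDN   Q_tot:", round(vdn(q_local), 3))
print("Mixer Q_tot:", round(mixed(q_local, state), 3))
```

Both combinations are monotone in each local Q value, but only the second can reweight the agents' contributions based on the (training-time) state.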
Why does the paper state that QPLEX is unsuitable for the proposed method? QPLEX also has a monotonic mixing structure.
The monotonicity of the Q function is simply a corollary of our mixing structure and is not a key target when constructing our learning objective. Additionally, while QPLEX utilizes the advantage function (A = Q - V), our objective is different and such an advantage function is unsuitable to use. We will elaborate more on this point in the updated paper.
We hope that the above responses address your concerns. If you have any other comments or concerns, we are more than happy to address them.
Thanks for the authors' response. I have increased my initial score.
We thank the reviewer for reading our responses and for the prompt reply!
We thank the reviewers for carefully reading our paper and providing constructive feedback and questions, which we have been happy to consider and clarify. Please find a summary of our responses below.
Reviewer GGqd raised a concern about the fact that equation (2) does not hold under our mixing architecture. In response, we have clarified that the relationship between V and Q cannot be satisfied at both local and global levels simultaneously. Therefore, we chose to keep the equation valid at the local level, making our algorithm practical. In contrast, maintaining the V-Q equation at the global level (Eq. 2) would require sampling over the joint action space, which is impractical, especially for large-scale tasks such as SMAC-v2.
The reviewer also mentioned a soft-actor-critic (SAC) IQ-learn as a missing baseline. In response, we argued that such an SAC algorithm is neither suitable nor practical in our multi-agent setting. To support our arguments, we have conducted an additional experiment, detailed in the attached 1-page PDF, where we compare a SAC IQ-learn adapted to multi-agent tasks. The results generally show that SAC-IQ performs worse than our algorithm, MIFQ. We will include these results in the paper.
We have also provided detailed responses to other questions regarding why DAGGER is not used, why MIFQ outperforms IQVDN, and why QPLEX is not suitable. We will update our paper to clarify these points.
Reviewer vZKY has a clarification question about the curves in Figure 2 and pointed out a typo. We have provided a response to address this.
Reviewer v3qG raised several questions and requested clarification on the following points: (i) whether the state-only approach in the IQ-learn paper is applicable, (ii) why independent IQ-learn is limited, (iii) why MIFQ cannot achieve expert performance as in the single-agent IQ-learn setting, (iv) why MIFQ performs worse than BC on MPE, (v) how the method performs with demonstrations not sourced from MAPPO, (vi) why our method needs more demonstrations than single-agent IQ Learn, (vii) why it is necessary to maintain Q and V networks separately, and (viii) whether the authors considered continuous action space settings. In response, we have provided detailed answers to each question.
The reviewer also requested clarification regarding the number of trials for our results, the number of demonstrations used in the experiments, and the standard deviations reported in Figure 2. We will clarify these points in the paper.
Reviewer 9wvM raised a concern about the use of global state information in our training algorithm, which makes it impractical. In response, we clarified that we only assume access to local observations of neighboring agents. These local observations are available in both training and deployment and are highly realistic in practical applications. The use of such information is also standard in previous multi-agent reinforcement learning algorithms. We will clarify this point in the updated paper.
The reviewer also asked whether our algorithm can be applied to non-cooperative settings. In response, we believe that the non-cooperative setting would be much more challenging and would require a new MARL algorithm, which we will explore in future work. The reviewer also pointed out some typos, which we highly appreciate and will correct.
We thank all the reviewers for their comments and feedback, which we have tried to address and clarify. If you have any further questions, we are more than happy to discuss and clarify them.
This paper presents a novel algorithm called multi-agent inverse factorized Q-learning for cooperative multi-agent imitation learning. It extends inverse soft Q-learning from single-agent to multi-agent settings via a mixing network under the CTDE paradigm. Experimental evaluation across several environments demonstrates the effectiveness of the proposed algorithm. The proposed algorithm is a significant and novel contribution to the imitation learning field. The reviewers are largely satisfied with the authors' rebuttal. I trust the authors will further improve the paper's organization and address clarity issues in the camera-ready version.