Multi-Agent Imitation by Learning and Sampling from Factorized Soft Q-Function
Abstract
Reviews and Discussion
The paper presents a new framework for multi-agent imitation learning that avoids the instability of adversarial methods and the compounding errors of behavior cloning. MAFIS extends the inverse soft Q-learning paradigm (IQ-Learn) to multi-agent settings by introducing a value decomposition network that factorizes the soft Q-function across agents. This factorized soft Q-function implicitly defines an energy-based policy from which actions are sampled using stochastic gradient Langevin dynamics (SGLD), enabling tractable imitation learning in both discrete and continuous domains. The method is evaluated on SMACv2, Gold Miner, MPE, and Multi-Agent MuJoCo, showing superior performance to existing imitation learning baselines in both online and offline settings.
Strengths and Weaknesses
Strengths
I think the paper introduces a well-motivated and technically sound MAIL framework. Extending IQ-Learn to the multi-agent setting by adapting both the policy and soft Q-function into joint forms is a logical step, but the key contribution is the agent-level factorization of the global soft Q-function. The insight of interpreting the soft Q-function as an energy-based model elegantly avoids the need for logsumexp over continuous action spaces. Unlike many MAIL methods that are restricted to either online or offline settings, MAFIS offers a single, unified solution. The experiments are broad and convincing.
Weaknesses
- I think the paper could better discuss the trade-offs of using value decomposition. While value decomposition makes the learning objective factorable, it may limit expressivity in tasks requiring high-order coordination. I would have liked to see either ablations with different decomposition architectures (e.g., QMIX, Qplex) or discussion of when the factorization assumption might break down.
- While I appreciate that MAFIS is positioned within imitation learning, I believe it should be evaluated against recent offline MARL methods such as OMAR or ICQ-MA, which are also designed to learn from fixed datasets without online exploration. Since MAFIS shares many traits with offline RL approaches—such as value-based training from demonstrations, scalability, and stability—I would have liked to see whether it matches or exceeds these general-purpose baselines. Including such comparisons would help clarify the relative strengths of MAFIS in broader offline settings.
Questions
- Could you elaborate on the limitations of value decomposition in high-coordination tasks?
- Why was there no comparison to recent offline MARL methods such as OMAR or ICQ-MA?
- How sensitive is MAFIS to the number and quality of expert trajectories?
Limitations
Yes
Formatting Issues
N/A
We sincerely thank you for your supportive and insightful feedback on our work. Please find our detailed responses to your concerns and questions below.
I think the paper could better discuss the trade-offs of using value decomposition...discussion of when the factorization assumption might break down... Could you elaborate on the limitations of value decomposition in high-coordination tasks?
The Value Decomposition (VD) paradigm, including seminal works like VDN and QMIX, represents a mainstream and highly successful approach in cooperative multi-agent reinforcement learning. Its prominence is rooted in several key strengths that effectively address core challenges in MARL:
First and foremost, VD methods elegantly implement the Centralized Training for Decentralized Execution (CTDE) framework. This allows them to leverage global information during the training phase to solve the difficult credit assignment problem, while producing highly efficient, decentralized policies for execution where each agent acts based on its local observations. This contributes significantly to their scalability, avoiding the exponential complexity of the joint action space.
Furthermore, methods like QMIX introduce a crucial inductive bias through the Individual-Global-Max (IGM) principle. This constraint, which enforces a monotonic relationship between individual agent utilities and the global team value, is not merely a limitation. Rather, it is a powerful mechanism that guarantees consistency between local and global optimal actions, greatly simplifying policy extraction and promoting stable learning. The strong empirical performance of these methods across numerous benchmarks validates the effectiveness of this approach.
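For reference, the IGM condition mentioned here can be written as follows, where a = (a^1, ..., a^n) is the joint action and τ^i denotes agent i's local action-observation history (notation is ours, introduced for illustration):

```latex
\arg\max_{\mathbf{a}} Q_{\mathrm{tot}}(s, \mathbf{a}) \;=\;
\Big( \arg\max_{a^1} Q^1(\tau^1, a^1),\; \dots,\; \arg\max_{a^n} Q^n(\tau^n, a^n) \Big).
```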
While it is well-recognized that the structural assumptions of VD methods (such as IGM) define their application scope—potentially limiting their expressiveness on tasks requiring highly complex, non-monotonic agent coordination—this is often viewed as a strategic trade-off between expressiveness and learnability.
It is precisely because the value decomposition paradigm provides such a solid and effective foundation that our work builds upon it. We believe that exploring advancements within this powerful framework is a crucial and promising direction for pushing the frontiers of multi-agent coordination.
While I appreciate that MAFIS is positioned within imitation learning, I believe it should be evaluated against recent offline MARL methods such as OMAR and ICQ-MA, which are also designed to learn from fixed datasets without online exploration...Why was there no comparison to recent offline MARL methods such as OMAR or ICQ-MA?
Although offline MARL methods like OMAR and ICQ-MA also learn from offline datasets, the offline dataset considered in offline MARL is fundamentally different from the one we examine. Specifically, our offline dataset is of expert quality but contains only state-action transitions without reward signals. In contrast, offline MARL does not require expert-quality data but necessitates the inclusion of reward signals. As a result, the two prominent offline MARL methods, OMAR and ICQ-MA, cannot operate on our expert dataset (due to the lack of rewards). Additionally, during the training process, offline MARL methods do not allow agents to interact with the environment, whereas we consider not only offline learning but also online learning, where agents can learn to collaborate through online interactions (though they still cannot obtain reward signals). Hence, we did not include comparisons with them in our study.
How sensitive is MAFIS to the number and quality of expert trajectories?
With more data, MAFIS can perform better. The performance of MAFIS under different amounts of expert data is shown in the following table.
| | 20 expert trajectories | 15 expert trajectories | 10 expert trajectories |
|---|---|---|---|
| Ant | 4144.09 | 2106.34 | 1013.61 |
| HalfCheetah | 3248.96 | 2456.21 | 1214.28 |
| Walker | 3272.70 | 2552.46 | 2393.45 |
Like common imitation learning algorithms such as BC, MAGAIL, etc., MAFIS currently requires high-quality expert data. However, exploring how to perform imitation learning with suboptimal expert data is an interesting direction that we will investigate in future work.
Dear Reviewer H9Rs,
We would like to express our sincere gratitude for the time and effort you have dedicated to reviewing our submission. Your thoughtful comments have been invaluable.
We hope our rebuttal has provided sufficient clarification and successfully addressed the concerns you raised. We remain fully available and would be delighted to provide any further details or discuss any remaining points.
If our responses and the planned revisions have resolved your concerns, we would kindly ask you to consider reflecting this in your evaluation of our work.
Thank you once again for your constructive engagement.
Respectfully, The Authors of Submission 28624
This paper introduces MAFIS, a novel approach for multi-agent imitation learning that works for both online and offline settings in discrete and continuous control tasks. The authors adapt the single-agent IQ-Learn framework to multi-agent scenarios by introducing a value decomposition network that factorizes the joint soft Q-function as a weighted sum of individual agent Q-functions. This factorization enables the imitation objective to be decomposed at the agent level, allowing for scalable training and decentralized execution. Unlike existing methods that rely on adversarial training, MAFIS converts the adversarial objective into a non-adversarial one through the factorization approach. The optimal joint policy can be expressed as a product of individual agent policies, eliminating the need for adversarial optimization between the Q-function and policy. For continuous action spaces where computing the log-sum-exp over actions is intractable, the authors make a key observation that the soft Q-function implicitly defines the optimal policy as an energy-based model. They use stochastic gradient Langevin dynamics (SGLD) to sample actions from this distribution, enabling gradient estimation without explicitly computing the intractable normalization terms. The experimental evaluation demonstrates MAFIS's effectiveness across multiple benchmarks including SMAC v2, MPE, Gold Miner, and MAMuJoCo, showing superior performance compared to existing baselines including Behavioral Cloning, MA-GAIL, and MIFQ.
Strengths and Weaknesses
Strengths: The paper provides rigorous mathematical derivations, particularly in Propositions 3.1 and 3.2, showing how the joint soft Q-function factorization leads to a tractable non-adversarial objective. The experiments span multiple challenging benchmarks (SMACv2, MPE, Gold Miner, MaMuJoCo) covering both discrete and continuous control tasks, with both online and offline settings. The sensitivity analysis (Section 4.2 and Appendix D) provides insights into key hyperparameters such as the number of SGLD samples and the entropy weight. Figure 1 effectively demonstrates the instability of adversarial training in multi-agent continuous control, motivating the proposed approach. The paper tackles real issues in MAIL - BC's compounding errors and AIL's training instability. Providing a single approach that works for discrete/continuous and online/offline settings is valuable. The factorization approach enables decentralized execution, which is crucial for real-world multi-agent systems. Using SGLD to sample from the soft Q-function as an EBM for continuous control is creative and well-executed.
Weaknesses: The paper only compares against BC, MA-GAIL, and MIFQ; comparisons with other recent MAIL methods could strengthen the evaluation. The paper collects only 20 trajectories for continuous tasks and 100 for discrete tasks; the sensitivity to the amount of expert data is not explored. While the method requires SGLD sampling with multiple candidate samples and gradient steps, the computational overhead compared to baselines is not discussed. While Appendix B attempts to clarify differences from MIFQ, the distinction could be clearer in the main text. MAFIS creates a fundamental mismatch between training and execution policies.
Questions
- Your training objective (Equation 7) optimizes Q-functions such that the induced policy matches expert behavior. However, during decentralized execution, agents use the policy induced by the local Q-function Q^i alone. While you correctly note these share the same argmax for greedy action selection, I have concerns about: a) In scenarios requiring exploration or stochasticity (e.g., multi-modal action distributions), the two policies will sample different actions with different probabilities. Has this been considered? b) Did you consider learning Q-functions directly for the execution policy? For instance, by reparameterizing Q^i to absorb the effect of k^i(s)?
- Can you provide analysis on why MAFIS still lags behind expert performance, particularly in continuous control? Have you tried: Varying the number of expert demonstrations (currently only 20 for continuous)? Alternative sampling methods (e.g., HMC)?
- Have you considered comparing with other recent MAIL methods, particularly: SQIL which also builds on soft Q-learning or any offline MARL methods that can work without rewards (e.g., multi-agent behavior cloning with transformers)?
Limitations
Yes
Final Justification
After carefully considering all the points raised in the authors' response, I believe that my original assessment still accurately reflects the current state of the paper and its contributions. I encourage the authors to continue their work in this direction, as the research area remains important and promising.
Formatting Issues
No formatting issues found.
Thank you for your positive assessment of our work. Please find our detailed responses to your concerns below.
The sensitivity to the amount of expert data is not explored.
With more data, MAFIS can perform better. The performance of MAFIS under different amounts of expert data is shown in the following table.
| | 20 expert trajectories | 15 expert trajectories | 10 expert trajectories |
|---|---|---|---|
| Ant | 4144.09 | 2106.34 | 1013.61 |
| HalfCheetah | 3248.96 | 2456.21 | 1214.28 |
| Walker | 3272.70 | 2552.46 | 2393.45 |
The computational overhead compared to baselines is not discussed.
Thanks to community-optimized deep learning packages like PyTorch and TensorFlow, as well as engineering techniques such as Just-in-Time (JIT) compilation, sampling an action via SGLD can be completed within milliseconds on widely used commercial GPUs (e.g., NVIDIA's A100 or 4090). Both MAFIS and baselines (like MAGAIL) can finish training within a few hours on a single GPU.
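For concreteness, below is a minimal sketch of what the per-agent SGLD sampling loop looks like. This is an illustrative reimplementation rather than our released code; the `q_net(obs, action)` critic interface, the candidate count, the step size, and the clamping range are assumptions made for the sketch.

```python
import torch


def sgld_sample_action(q_net, obs, action_dim, n_samples=20, n_steps=10, step_size=0.01):
    """Draw candidate actions from the energy-based policy pi(a | obs) ∝ exp(Q(obs, a))
    with stochastic gradient Langevin dynamics, then return the highest-Q candidate."""
    # Several random starting points, so different modes of the energy landscape can be reached.
    actions = (torch.rand(n_samples, action_dim) * 2 - 1).requires_grad_(True)
    obs_batch = obs.unsqueeze(0).expand(n_samples, -1)
    for _ in range(n_steps):
        q_values = q_net(obs_batch, actions).sum()
        (grad,) = torch.autograd.grad(q_values, actions)
        noise = torch.randn_like(actions) * (2.0 * step_size) ** 0.5
        # Langevin update: gradient ascent on Q plus injected Gaussian noise.
        actions = (actions + step_size * grad + noise).clamp(-1.0, 1.0)
        actions = actions.detach().requires_grad_(True)
    with torch.no_grad():
        best = q_net(obs_batch, actions).squeeze(-1).argmax()
    return actions[best].detach()
```

Because the loop is batched over candidates (and over agents), it amounts to a handful of forward and backward passes per decision, which is why it completes within milliseconds on a modern GPU.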
While Appendix B attempts to clarify differences from MIFQ, the distinction could be clearer in the main text.
Thank you for your advice. Due to space limitations, we had to place the full description of the differences in Appendix B and present a concise discussion of the differences between MIFQ and our approach at the end of Section 3.1. We will continue to improve the article layout in future revisions.
MAFIS creates a fundamental mismatch between training and execution policies.
We argue that this is not an issue. Inspired by QMIX, we introduced the state-dependent coefficients k^i(s) into the joint Q-function factorization. MAFIS samples actions that maximize each agent's local Q-function through Equation (9), a practice that has been adopted and recommended by widely used open-source toolkits such as Stable-Baselines3 and Tianshou in the reinforcement learning community. Taking the maximum-entropy RL method SAC as an example, these toolkits require deterministic=True to ensure that actions maximizing the Q-function are selected for evaluation. Our key observation is that although the global state s is unavailable to each agent during the execution phase in Dec-POMDPs, the action that maximizes Q^i is the same as the one that maximizes k^i(s)Q^i because k^i(s) is always greater than 0. This implies that even if we sample actions based solely on Q^i, we can still make optimal decisions.
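Stated compactly, the observation is that positive scaling preserves the maximizer:

```latex
k^i(s) > 0 \;\Longrightarrow\;
\arg\max_{a^i} \, k^i(s)\, Q^i(\tau^i, a^i) \;=\; \arg\max_{a^i} \, Q^i(\tau^i, a^i).
```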
I have concerns about: a) In scenarios requiring exploration or stochasticity (e.g., multi-modal action distributions), the two policies will sample different actions with different probabilities. Has this been considered? b) Did you consider learning Q-functions directly for the execution policy? For instance, by reparameterizing Q^i to absorb the effect of k^i(s)?
For your first concern, by treating the Q-function as an energy function and sampling actions from it, instead of directly modeling the policy with a Gaussian distribution, we are in fact accounting for the possibility that the policy distribution is multi-modal. Sampling from an energy-based model via SGLD naturally supports obtaining samples from different modes, and the scaling factor k^i(s) does not change the local optima (as claimed in our paper).
For your second concern, the performance comparison of MAFIS using the value factorization methods QMIX or VDN is shown in the following table. MAFIS(QMIX) consistently performs better than MAFIS(VDN), which verifies the effectiveness of introducing the coefficients k^i(s) in Equation (4).
| | MAFIS(QMIX) | MAFIS(VDN) |
|---|---|---|
| Ant(2x4) | 4144.09 | 4077.16 |
| HalfCheetah(2x3) | 3248.96 | 2687.33 |
| Walker2d(2x3) | 3272.70 | 3142.91 |
Can you provide analysis on why MAFIS still lags behind expert performance, particularly in continuous control? Have you tried: Varying the number of expert demonstrations (currently only 20 for continuous)? Alternative sampling methods (e.g., HMC)?
With only 20 expert trajectories, MAFIS achieved scores exceeding 4000 on both the Ant (2x4) and Walker2d (2x3) tasks. In contrast, the best baseline, Behavioral Cloning (BC), could not surpass a score of 2000, and MAGAIL failed to learn altogether. This demonstrates the significant effectiveness of our method. On the HalfCheetah (2x3) task, our method also outperformed the best baseline, but there is still a gap compared to the expert's performance. We attribute this primarily to the following reasons:
- The task itself is difficult, as the continuous action space presents new challenges for agent exploration and learning.
- The number of expert trajectories is limited. MAFIS would be able to perform better with more expert data. However, considering that expert data may not be easily obtainable in real-world applications, we chose to use only 20 expert trajectories in the paper. As can be seen, MAFIS still performs well even under this condition.
Have you considered comparing with other recent MAIL methods, particularly: SQIL which also builds on soft Q-learning or any offline MARL methods that can work without rewards (e.g., multi-agent behavior cloning with transformers)?
Thanks for your valuable feedback. In our experiments, to ensure a fair comparison, the network used for BC is a multi-layer perceptron (MLP) and a GRU, with the number of learnable parameters being almost equal to those of all other methods. Since a Transformer would introduce significantly more parameters, using it as the backbone for BC in our comparison might be somewhat unfair. However, exploring the use of a Transformer as the backbone within MAFIS is an interesting direction that we will consider for future work. The performance of MAFIS and MASQIL is shown in the following table. As the results indicate, our method MAFIS achieves a higher average score and outperforms the MASQIL baseline in a majority of the test environments.
| | MAFIS | MASQIL |
|---|---|---|
| Ant | 4144.09 | 3264.48 |
| HalfCheetah | 3248.96 | 3679.77 |
| Walker | 3272.70 | 3009.52 |
| average | 3555.25 | 3317.92 |
I would like to thank the authors for their rebuttal and comprehensively answering my concerns and questions. The rebuttal provides adequate responses to most of my concerns, including concrete data on expert trajectory sensitivity. The QMIX vs VDN ablation validates the mixing coefficients' importance. However, the training-execution mismatch remains inadequately addressed - your argument only holds for greedy action selection, while stochastic sampling will still produce different probability distributions, potentially impacting multi-modal scenarios. While the method addresses real MAIL problems with solid foundations and comprehensive experiments, the stochastic policy mismatch represents a fundamental limitation that should be acknowledged more clearly.
We are very grateful for the reviewer's detailed feedback and are pleased to hear that our previous rebuttal has addressed most of your concerns. We appreciate the opportunity to provide further clarification on the remaining point regarding the training-execution mismatch.
We agree that the underlying probability distributions for stochastic sampling differ between training (based on k^i(s)Q^i) and execution (based on Q^i). However, we argue that this does not constitute a fundamental limitation in practice, for the following reasons:
- Identical Optima for Deterministic Execution: Our core argument relies on the fact that the scaling factor k^i(s) is strictly positive (k^i(s) > 0). This property ensures that the energy landscapes of k^i(s)Q^i and Q^i share the exact same local and global optima. Consequently, when using a deterministic policy for execution—by selecting the action with the maximum Q-value—the chosen action is guaranteed to be identical to the optimal action under the training policy. This eliminates the mismatch for the most critical execution scenario.
- Deterministic Evaluation as a Best Practice: The choice to use a deterministic policy for deployment is not an arbitrary workaround but a widely adopted best practice in the RL community. Leading frameworks like Tianshou recommend it because deterministic actions "provide consistent behavior, reduce variance in performance metrics, and are more interpretable." Our evaluation protocol aligns with this standard for robust and reproducible deployment.
- Robustness in Multi-Modal Scenarios: We also acknowledge the reviewer's concern about multi-modal scenarios.
- First, our sampling method, SGLD, is inherently well-suited for exploring multi-modal distributions as it initiates sampling from multiple random points, naturally covering different modes of the energy-based model.
- Furthermore, to explicitly leverage these modes during execution, we propose a potential enhancement: one could first cluster the sampled actions and then identify the local optimum within each cluster. This would allow for the deliberate selection of actions from different high-value modes, providing a robust strategy even for complex, multi-modal policies (a rough sketch of this idea is given right after this list).
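A rough sketch of this potential enhancement, assuming SGLD has already produced a batch of candidate actions; the use of k-means and the number of clusters are illustrative choices, not part of the current method:

```python
import torch
from sklearn.cluster import KMeans


def per_mode_best_actions(q_net, obs, candidate_actions, n_modes=3):
    """Group SGLD-sampled candidate actions into clusters (approximate modes)
    and return the highest-Q candidate within each cluster."""
    labels = KMeans(n_clusters=n_modes, n_init=10).fit_predict(candidate_actions.numpy())
    obs_batch = obs.unsqueeze(0).expand(candidate_actions.shape[0], -1)
    with torch.no_grad():
        q_values = q_net(obs_batch, candidate_actions).squeeze(-1)
    best_per_mode = []
    for mode in range(n_modes):
        mask = torch.from_numpy(labels == mode)
        if mask.any():
            # The local optimum of this mode: the in-cluster candidate with the highest Q-value.
            best_per_mode.append(candidate_actions[mask][q_values[mask].argmax()])
    return best_per_mode
```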
In summary, while a theoretical distributional mismatch exists for stochastic sampling, our deterministic execution protocol, grounded in established best practices, ensures consistent optimal action selection. Moreover, our framework is inherently capable of handling multi-modal scenarios. We will clarify these points in the revised manuscript to fully address this perceived limitation.
Dear Reviewer CMSX,
We would like to express our sincere gratitude for the time and effort you have dedicated to reviewing our submission. Your thoughtful comments have been invaluable.
We hope our rebuttal has provided sufficient clarification and successfully addressed the concerns you raised. We remain fully available and would be delighted to provide any further details or discuss any remaining points.
If our responses and the planned revisions have resolved your concerns, we would kindly ask you to consider reflecting this in your evaluation of our work.
Thank you once again for your constructive engagement.
Respectfully, The Authors of Submission 28624
I thank the authors for their thorough and detailed response. The additional clarifications and explanations provided have been helpful in addressing several of my underlying questions and concerns. After carefully considering all the points raised in the authors' response, I have reassessed the paper in light of these clarifications. While I appreciate the authors' efforts to address the feedback and the additional context provided, I believe that my original assessment still accurately reflects the current state of the paper and its contributions.
- The paper studies multi-agent imitation learning by extending the single-agent IQ-Learn method (a well-known imitation learning approach in the single-agent setting).
- By employing value decomposition (where the global Q-function is factorized as a linear combination of local Q-functions), the authors propose a CTDE (Centralized Training with Decentralized Execution) approach to learn local policies through a centralized loss function adapted from the IQ-Learn objective.
- Experiments conducted on recent and strong benchmarks, such as SMAC_v2 and MAMuJoCo, show that the proposed method, MAFIS, appears to perform well.
Strengths and Weaknesses
Strengths:
- Imitation learning in the multi-agent setting is less explored than in the single-agent case, so investigating this direction is worthwhile.
- The proposed approach aims to ensure some fundamental properties in MARL, such as consistency between global and local value functions.
- The experimental results seem solid — the proposed MAFIS method, as reported in the paper, appears to outperform some recent baselines such as MFIQ.
Weaknesses
- The main weakness is that the idea of extending IQ-Learn to the multi-agent setting is not novel. This has already been thoroughly explored in recent work [1]. Compared to MFIQ in [1], the proposed method appears to follow the same strategy (extending the IQ-Learn loss function to the multi-agent setting using CTDE and value factorization to decompose the global Q-function). Moreover, MFIQ even explores a more general setting and architecture — it discusses both 1-layer and 2-layer mixing networks, while MAFIS focuses only on a simple linear combination. In addition, properties such as global-local consistency and convexity have been more thoroughly investigated in [1].
- There is a brief discussion on the difference between MFIQ and MAFIS, but the difference seems marginal, and the core idea remains essentially the same.
- Some theoretical results are trivial and can be easily deduced from prior work. For instance, Proposition 3.1 is just a special case of the local policy formulation mentioned in [1], where a 1-layer network is used. Proposition 3.2 also appears to be trivial.
- Compared to [1], several relevant baselines are missing, such as VDN and value factorization with non-linear mixing networks.
In summary, the contribution is relatively weak given that [1] has already extended IQ-Learn to the multi-agent setting with a more in-depth investigation, stronger theoretical results, and richer experimental studies.
[1] The Viet Bui, Tien Mai, and Thanh Hong Nguyen. "Inverse Factorized Soft Q-Learning for Cooperative Multi-agent Imitation Learning." In Advances in Neural Information Processing Systems (NeurIPS), 2024.
Questions
- How does your factorization method perform when using non-linear mixing structures, such as a 2-layer mixing network?
- Can you say anything about convexity in the Q-function space — an important property of IQ-Learn that has been discussed in both the original IQ-Learn paper and in MFIQ?
- How does your method compare with other standard baselines, such as VDN or independent IQ-Learn (where you perform IQ-Learn independently for each agent)?
Limitations
- Yes, the authors discuss some limitations of their work, which is reasonable.
- I do not see any significant negative social impact that needs to be discussed in the paper.
Final Justification
Although the authors have conducted additional experiments in response to my concerns, my core concerns remain.
In particular, I do not see the learning objective of the proposed algorithm as substantially different from that of MFIQ (NeurIPS 2024). While there are some structural differences, the foundational principles—namely inverse Q-learning and value decomposition—are largely shared with prior work. The theoretical contributions also appear to be incremental and could be derived straightforwardly from existing formulations.
The central claim that MAFIS significantly extends MFIQ to continuous control tasks is also not entirely convincing. The paper does not present any clear methodological innovations specifically designed for continuous control. As a result, the claimed extension seems to stem more from implementation choices and tuning rather than from a fundamentally new approach.
During the rebuttal, I mentioned several relevant techniques (e.g., XQL, IQL, SQL), which were primarily developed for offline reinforcement learning but can be effectively adapted for imitation learning to achieve better performance. The authors, however, kept arguing that these methods are specific to offline RL and not applicable to their IL setting. This response gives the impression that they did not seriously consider my suggestions and that they possess a rather limited understanding of recent developments in imitation learning and offline RL.
Given these points, I will maintain my current score as a strong reject.
Formatting Issues
The paper format seems to be good. I do not see any major issues.
Thank you for your time in reviewing our paper. Please find our responses below. We hope our responses address your concerns and questions.
The main weakness is that the idea of extending IQ-Learn to the multi-agent setting is not novel...There is a brief discussion on the difference between MFIQ and MAFIS, but the difference seems marginal, and the core idea remains essentially the same...Some theoretical results are trivial and can be easily deduced from prior work.
We appreciate the contributions of MIFQ, and we do not claim that MAFIS is the first work to extend IQ-Learn to the multi-agent setting. However, we would like to argue that MAFIS has made non-trivial contributions compared to MIFQ and IQ-Learn:
- For discrete control tasks, we found that after decomposing the joint Q-function into the form expressed in Equation (4), the correct form of the optimization objective should be as shown in our Equation (7) in Proposition 3.1, rather than Equation (7) in Proposition 4.4 of MIFQ.
- For continuous control tasks, a widely criticized problem of IQ-Learn is its instability. Starting from its adversarial optimization problem, we discovered a new optimization approach that achieves non-adversarial optimization even in continuous control tasks. This has led to performance significantly superior to adversarial methods like MAGAIL, whereas MIFQ cannot even be applied to continuous control tasks! We believe that our Propositions 3.1 and 3.2 are not simple extensions of MIFQ's propositions and theorems. On the contrary, we have not only made non-trivial contributions compared to MIFQ but have also solved the optimization problem of the original IQ-Learn in continuous control tasks.
Compared to [1], several relevant baselines are missing, such as VDN and value factorization with non-linear mixing networks...How does your factorization method perform when using non-linear mixing structures, such as a 2-layer mixing network?
The derivation of MAFIS does not require the mixing network to be a one-layer linear network. That is, MAFIS is compatible with multi-layer non-linear networks. In our submitted code, we implement the mixing network as a non-linear three-layer MLP with ReLU activation functions, as shown below (line 75 of mafis/mafis_cont/mafis/algorithms/critics/twin_continuous_q_critic.py):
```python
self.mixer = nn.Sequential(
    nn.Linear(share_obs_space.shape[0], 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
).to(device)
```
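For completeness, here is a hedged sketch of how the weighted-sum factorization described in Equation (4) can be assembled from the local Q-values. The module name, the softplus used to keep the weights positive, and the optional state-dependent bias are illustrative assumptions and are not copied from the submitted code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearMixer(nn.Module):
    """Illustrative mixer: Q_tot(s, a) = sum_i k^i(s) * Q^i(tau^i, a^i) + b(s),
    with k^i(s) kept strictly positive via a softplus."""

    def __init__(self, state_dim, n_agents, hidden=64):
        super().__init__()
        # Hypernetwork producing one positive weight k^i(s) per agent.
        self.k_net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_agents))
        # Optional state-dependent bias term (an assumption of this sketch).
        self.b_net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))

    def forward(self, state, local_qs):
        # state: (batch, state_dim); local_qs: (batch, n_agents)
        k = F.softplus(self.k_net(state))  # k^i(s) > 0
        return (k * local_qs).sum(dim=-1, keepdim=True) + self.b_net(state)
```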
Can you say anything about convexity in the Q-function space — an important property of IQ-Learn that has been discussed in both the original IQ-Learn paper and in MFIQ?
Similar to the original IQ-Learn, the objective is concave in the joint Q-function space. Our factorization of the joint Q-function (as shown in Equation (4)) does not change the concavity of the objective.
How does your method compare with other standard baselines, such as VDN or independent IQ-Learn (where you perform IQ-Learn independently for each agent)?
Independent IQ-Learn fails to learn on the continuous control tasks. The performance comparison of MAFIS using the value factorization methods QMIX or VDN is shown in the following table. MAFIS(QMIX) consistently performs better than MAFIS(VDN), which verifies the effectiveness of introducing the coefficients k^i(s) in Equation (4).
| | MAFIS(QMIX) | MAFIS(VDN) |
|---|---|---|
| Ant(2x4) | 4144.09 | 4077.16 |
| HalfCheetah(2x3) | 3248.96 | 2687.33 |
| Walker2d(2x3) | 3272.70 | 3142.91 |
I thank the authors for the response, which addresses some of my concerns. However, many issues still remain:
For continuous control tasks, a widely criticized problem of IQ-Learn is its instability.
I agree that IQ-Learn is indeed unstable for continuous control tasks. However, it is also quite outdated. Recent Q-learning approaches—such as Extreme Q-Learning—have made significant progress in addressing this instability. It appears the authors may not be aware of these advancements. Therefore, the claim that the proposed method resolves IQ-Learn's instability is not well supported, and may even be invalid in light of recent developments.
However, we would like to argue that MAFIS has made non-trivial contributions compared to MFIQ.
I do not find this argument convincing. MFIQ already extends IQ-Learn by decomposing the joint Q-function into a function of local Q-values, while ensuring properties such as global-local consistency and soft-max policy derivation. These results appear directly applicable to your setting and are not clearly distinguished in your contribution.
The derivation of MAFIS does not require the mixing network to be a one-layer linear network.
In MFIQ and QMIX, the mixing network refers to the architecture that aggregates local Q-functions. In your Equation (4), the formulation clearly implies that this aggregation is linear in its inputs. So my question is: what happens if this combination is non-linear, such as using a two-layer feedforward network?
In your paper, it seems the so-called mixing network refers to the network producing k^i(s), which functions more as a hyperparameter network than an aggregator of local Q-values.
Similar to the original IQ-Learn, the objective is concave in the joint Q-function space.
It is trivial that the training objective is concave in the joint Q-function. My question actually concerns concavity with respect to the local Q-functions, which are the parameters actually being trained. This point is explicitly addressed in MFIQ—even under more complex settings involving non-linear mixing networks.
The performance comparison of MAFIS using value factorization methods QMIX or VDN is shown in the following.
This comparison is quite incomplete, as it only covers a few small MuJoCo tasks. Additional evaluation on SMACv2 tasks should be included to provide a more comprehensive assessment
Thank you for providing us with your feedback. We are glad to hear that you found our responses helpful. Please find below our new responses:
I agree that IQ-Learn is indeed unstable for continuous control tasks...and may even be invalid in light of recent developments.
We would like to briefly clarify the motivation and novelty of our work with three key points:
- IQ-Learn is Foundational, Not Outdated: In our specific field of multi-agent imitation learning (MAIL), IQ-Learn is a foundational method with research extending it only very recently (e.g., MIFQ, NeurIPS 2024).
- Extreme Q-Learning (XQL) is Inapplicable: XQL is designed for offline RL, a fundamentally different problem setting from our focus on imitation learning. Therefore, it cannot be directly applied. Furthermore, XQL itself has well-documented stability issues [1].
- Our Contribution is Novel and Validated: Our method is the first to solve this specific instability problem for multi-agent IQ-Learn. This novelty is confirmed by our strong experimental results.
[1] Stabilizing Extreme Q-learning by Maclaurin Expansion, RLC 2024.
MFIQ already extends IQ-Learn by decomposing into a function of local Q-values...and are not clearly distinguished in your contribution.
Our primary technical contribution lies in a fundamentally different value decomposition structure. The core distinction is how the state-dependent scaling factor k^i(s) is integrated:
- In MFIQ, the scaling factor applies outside the log-sum-exp term, scaling the resulting local value: V^i = k^i(s) · log Σ_{a^i} exp(Q^i(τ^i, a^i)), which results in a standard softmax policy: π^i(a^i | τ^i) ∝ exp(Q^i(τ^i, a^i)).
- In our framework, the factor is applied inside the exponential, directly scaling the Q-values before the softmax operation: V^i = log Σ_{a^i} exp(k^i(s) Q^i(τ^i, a^i)), which yields a policy where the scaling factor directly modulates the action probabilities: π^i(a^i | τ^i) ∝ exp(k^i(s) Q^i(τ^i, a^i)).
Furthermore, a significant contribution that extends our work's scope well beyond MFIQ is our novel solution for continuous control tasks. MFIQ's framework is exclusively designed for discrete action spaces, and our extension addresses this major limitation, broadening the applicability of multi-agent IQ-Learn.
what happens if this combination is non-linear, such as using a two-layer feedforward network?
You are correct that our current formulation in Equation (4) uses a linear combination. We opted for this design for two main reasons:
- Empirical Effectiveness: As our experimental results demonstrate, this linear approach is highly effective and already outperforms existing methods on multiple benchmarks.
- Methodological Precedent: This linear decomposition is a well-established and common approach in MARL, having been successfully employed in prominent works like DOP and VDAC.
That said, we completely agree that exploring a non-linear mixing architecture, such as the two-layer network you suggested, is a valuable and promising direction. We consider it an excellent extension for future work.
My question actually concerns concavity with respect to the local Q-functions, which are the parameters actually being trained.
The concavity of the objective over the local Q-functions holds. Our Proposition 3.1 decomposes the objective into a sum of per-agent terms. Each per-agent term is analogous to the single-agent IQ-Learn objective, but with the Q-function term scaled by a state-dependent factor k^i(s). According to Proposition 3.6 of IQ-Learn, the original objective is concave in the Q-function. Since scaling by k^i(s) is a linear operation with respect to Q^i, this concavity is preserved. It follows that each per-agent term is concave in Q^i, and therefore the total objective is concave in the local Q-functions.
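Schematically, with J denoting the concave single-agent IQ-Learn objective and J^i the per-agent term (this notation is ours, introduced only for illustration):

```latex
J^i(Q^i) = J\big(k^i(s)\,Q^i\big)
\quad\text{with } J \text{ concave and } Q^i \mapsto k^i(s)\,Q^i \text{ linear}
\;\Longrightarrow\;
\sum_{i} J^i(Q^i) \text{ is concave in } (Q^1,\dots,Q^n).
```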
Additional evaluation on SMACv2 tasks should be included to provide a more comprehensive assessment.
Following your guidance, we have conducted additional experiments on the SMACv2 benchmark. As shown in the table below, MAFIS(QMIX) consistently outperforms MAFIS(VDN) across all tasks, providing strong evidence for the effectiveness of our proposed method.
| | MAFIS(QMIX) | MAFIS(VDN) |
|---|---|---|
| terran_5_vs_5 | 17.68 | 14.67 |
| protoss_5_vs_5 | 16.94 | 16.60 |
| zerg_5_vs_5 | 13.97 | 12.81 |
| terran_10_vs_11 | 17.64 | 14.38 |
| protoss_10_vs_11 | 13.73 | 13.32 |
| zerg_10_vs_11 | 16.31 | 14.33 |
Dear Reviewer zF4A,
We would like to express our sincere gratitude for the time and effort you have dedicated to reviewing our submission. Your thoughtful comments have been invaluable.
We hope our rebuttal has provided sufficient clarification and successfully addressed the concerns you raised. We remain fully available and would be delighted to provide any further details or discuss any remaining points.
If our responses and the planned revisions have resolved your concerns, we would kindly ask you to consider reflecting this in your evaluation of our work.
Thank you once again for your constructive engagement.
Respectfully, The Authors of Submission 28624
I thank the authors for their response.
Extreme Q-Learning (XQL) is Inapplicable
Yes XQL is primarily an offline RL algorithm, but the extreme-V update it employs has proven useful for stabilizing Q-learning. Recent works in offline RL and imitation learning (e.g., DualRL) have successfully leveraged this type of update.
Value Decomposition Structure and the Role of k^i(s)
Although the authors highlight a different integration of the state-dependent scaling factor k^i(s), I find the distinction from existing approaches to be relatively minor. In particular, the current formulation does not appear to be supported by new theoretical insights that would justify the change as a substantial advancement.
Choice of Linear Mixing in Equation (4)
While the authors justify using a linear combination for simplicity and clarity, this decision makes the approach less general compared to methods like MFIQ, which has demonstrated that more expressive two-layer nonlinear mixing networks yield better empirical performance. As such, the simplification here may come at the cost of generality and effectiveness.
Overall, I do not view the learning objective of the proposed algorithm as significantly different from MFIQ. While there are some structural variations, the core principles of inverse Q-learning and value decomposition remain largely aligned with prior work. The theoretical developments presented also seem incremental and derivable from existing frameworks.
Finally, the claim that MAFIS significantly extends MFIQ to handle continuous control tasks is not fully convincing. The current work does not appear to introduce any specific methodological components uniquely suited to continuous control, suggesting that the extension may primarily rely on engineering choices and hyperparameter tuning.
Given these considerations, I will maintain my current score.
Thank you for your feedback. Please find our new responses below.
Yes XQL is primarily an offline RL algorithm, but...have successfully leveraged this type of update.
The extreme-V update is NOT applicable to the IL problem we consider.
- Distinct Problem Domains: The extreme-V update is a specific tool designed to mitigate out-of-distribution (OOD) value overestimation in offline RL. Our paper addresses training instability in imitation learning (IQ-Learn), which arises from an entirely different mechanism: adversarial optimization between the policy and Q-function.
- On the DualRL Example: Regarding the reference to DualRL, a closer examination reveals that it actually supports our line of reasoning. While DualRL provides a valuable theoretical bridge between offline RL and IL, its own practical algorithm for imitation learning, ReCOIL, does not use the extreme-V update. This omission is significant, as it suggests that the update is NOT considered a standard or necessary tool for modern imitation learning, even by the authors who proposed the unifying framework.
- Our Contribution is a Direct Solution: Our work provides the correct diagnosis for IQ-Learn's instability. We identify the adversarial optimization as the root cause and propose a tailored solution that completely bypasses the need for a V-function. This is a more fundamental and appropriate solution.
Although the authors highlight a different integration of the state-dependent scaling factor k^i(s)...would justify the change as a substantial advancement.
Overall, I do not view the learning objective of the proposed algorithm as significantly different from MFIQ...seem incremental and derivable from existing frameworks.
We respectfully disagree with the characterization of our contribution as a "minor distinction." The core of our argument is NOT about the apparent magnitude of the change, but about the theoretical correctness of the formulation, which we believe constitutes a substantial advancement.
- The Critical Question is "What is the Correct Form?": The central issue is not whether our state-dependent scaling factor looks similar to previous work, but what its mathematically correct form should be. As we demonstrate in our rigorous proof in Appendix A.1, the optimal V-function for the underlying objective must take the form we have proposed. The formulation used in prior work, such as MIFQ, is, by this proof, NOT correct.
- Our Theory Exposes Flaws in Prior Work: Furthermore, our theoretical analysis in Appendix A.1 provides a crucial insight that directly addresses your concern about the lack of new theory. We prove that the optimization objective in MIFQ is NOT equivalent to the original IQ-Learn objective it aims to solve. This is a significant finding, as it implies that MIFQ is not guaranteed to learn the theoretically optimal Q-function and policy.
In summary, our modification, while perhaps appearing subtle, is justified by a formal theoretical proof that was absent in previous literature. It ensures that the learned V-function is consistent with the true objective. We believe this correction, grounded in new theoretical analysis, is a necessary and substantial contribution to the field.
While the authors justify using a linear combination for simplicity and clarity...the simplification here may come at the cost of generality and effectiveness.
We thank the reviewer for this point. Our method, using a simple linear mixer, consistently outperforms MFIQ with its two-layer non-linear network across all benchmark tasks. This strongly indicates that our core contribution—the theoretical correctness of the learning objective—is more critical for performance than the architectural complexity of the mixer in this context.
Nevertheless, we acknowledge that combining our improved objective with more expressive networks is a valuable direction for future work.
Finally, the claim that MAFIS significantly extends MFIQ to handle continuous control tasks is not fully convincing...may primarily rely on engineering choices and hyperparameter tuning.
We wish to clarify that our work's primary contribution is a novel methodology for continuous control, which the reviewer may have overlooked.
Our key insight is that the unstable adversarial policy learning in standard IQ-Learn is unnecessary for continuous control. We bypass this entirely by introducing a new methodological component: **using Stochastic Gradient Langevin Dynamics (SGLD) to sample actions directly from the learned Q-function.** This SGLD-based action sampling is not an "engineering choice." It is a principled mechanism that fundamentally replaces the policy optimization step, representing a significant advancement for applying this class of algorithms to continuous domains.
I appreciate the authors’ responsiveness.
"ReCOIL does not use the extreme-V update."
This is incorrect — ReCOIL does use the extreme-V update. Please refer to Equation (12) and Algorithm 1 in the paper.
"Standard IQ-Learn is unnecessary for continuous control... We bypass this entirely by introducing a new methodological component."
XQL is indeed designed to avoid computing the log-sum-exp operator, specifically to address the challenges posed by continuous action spaces.
"Our theory exposes flaws in prior work."
I’m not convinced by this claim. Could you clarify why your formulation should be considered correct while prior work is not? What exactly constitutes “correctness” in this context?
using a simple linear mixer, consistently outperforms MFIQ with its two-layer non-linear network across all benchmark tasks
This is insufficient to support the claim that nonlinear mixing is inferior to your linear mixing structure in your setting. Experimental results can be influenced by many factors, including hyperparameter choices, which can be tuned. Are you using the same dataset as MFIQ?
using Stochastic Gradient Langevin Dynamics (SGLD)
This appears to be a rather incremental adaptation of an existing method, and I do not consider it a significant contribution. Please note that there are other methods explicitly designed for continuous control in imitation learning/offline RL, such as XQL, IQL, sparse QL.
This is incorrect — ReCOIL does use the extreme-V update. Please refer to Equation (12) and Algorithm 1 in the paper.
Thank you for pointing this out. As the authors claimed, the extreme-V update is used to "prevent extrapolation error in the offline setting". MAFIS considers both online learning and offline learning. Moreover, we address the instability by removing the adversarial optimization of the original IQ-Learn.
XQL is indeed designed to avoid computing the log-sum-exp operator, specifically to address the challenges posed by continuous action spaces.
We did NOT identify the root cause of IQ-Learn's instability as the computation of the log-sum-exp. In fact, IQ-Learn already handles the log-sum-exp by iteratively learning the Q-function and the policy. However, we found that this does NOT work well, which motivated our method.
I’m not convinced by this claim. Could you clarify why your formulation should be considered correct while prior work is not? What exactly constitutes “correctness” in this context?
Please find our detailed responses to your previous concerns and our detailed discussion in our paper.
This is insufficient to support the claim that nonlinear mixing is inferior to your linear mixing structure in your setting. Experimental results can be influenced by many factors, including hyperparameter choices, which can be tuned. Are you using the same dataset as MFIQ?
To ensure a fair comparison, all setups remain the same.
This appears to be a rather incremental adaptation of an existing method, and I do not consider it a significant contribution. Please note that there are other methods explicitly designed for continuous control in imitation learning/offline RL, such as XQL, IQL, sparse QL.
XQL, IQL and sparse QL are all offline RL methods.
This paper presents MAFIS, a novel algorithm for Multi-Agent Imitation Learning (MAIL) that is based on IQ-Learn. Inspired by QMIX, the authors propose a value decomposition method that factorizes the joint soft Q-function into a weighted sum of individual agents' Q-functions. This factorization yields a tractable and non-adversarial training objective. It uses Stochastic Gradient Langevin Dynamics (SGLD) to extract actions from the soft Q-function for continuous actions. The authors conduct comprehensive experiments across a variety of discrete (SMACv2, MPE, Gold Miner) and continuous (MaMuJoCo) multi-agent benchmarks, showing that MAFIS consistently outperforms baselines.
Strengths and Weaknesses
Strengths:
- The combination of IQ-learn and factorized Q function is novel and elegant.
- The paper provides clear derivations for its factorized objective (Proposition 3.1) and its gradient (Proposition 3.2).
- The method is evaluated on multiple challenging multi-agent benchmarks, covering discrete and continuous domains and online and offline settings.
Weaknesses:
- My primary concern lies in the factorization of the Q-function and the policy. Since each agent selects actions and evaluates its Q-function almost independently, coordination between agents may be hindered at each timestep. Please correct me if I’m mistaken, but consider the following example: if two agents receive a high reward only when they choose the same action, then sampling actions independently would prevent them from achieving optimality. This suggests that the current factorization may fail to capture essential inter-agent dependencies.
- Another concern arises from the use of the state-dependent coefficients k^i(s), which are used to compose the joint Q-function. As stated in Lines 183–184, these coefficients—as well as the state itself—are not observable by individual agents during decentralized execution. This raises a potential issue: due to the presence of entropy regularization, the policy that is optimal with respect to an individual Q-function may not align with the policy optimal for the joint Q-function. While this misalignment may not significantly affect practical performance, it suggests a discrepancy between the training and deployment objectives that could impact theoretical guarantees or worst-case behavior.
- The use of SGLD for sampling in continuous action spaces can be computationally expensive, potentially limiting the method’s scalability. Moreover, employing SGLD for Q-function sampling is a well-established technique and, by itself, does not constitute a significant contribution.
Questions
- Though factorization is standard, I want to understand the design choice: why not consider learning a centralized Q-function and then factorizing it by learning the expectation of each agent's Q-value?
- It seems that if we want multiple agents to collaborate from the beginning, it would be necessary to introduce some hidden variables before agents make decisions. I'd like to hear the authors' thoughts on this.
Limitations
yes
Final Justification
The logic behind the factorization has become more clear to me and my concern about the SGLD has been resolved. This is a good paper with potential work. I'd keep my score and recommend acceptance.
Formatting Issues
no concern
Thank you for your positive and encouraging comments. Please find our detailed responses to your concerns below.
My primary concern lies in the factorization of the Q-function and the policy. Since each agent selects actions and evaluates its Q-function almost independently...This suggests that the current factorization may fail to capture essential inter-agent dependencies.
During training, although each agent selects actions independently, their Q-functions are evaluated in a joint manner to ensure that the learned Q-functions capture the inter-agent dependencies. Concretely, MAFIS updates the Q-functions of all agents simultaneously via Equation (7) or (8), in which the Q-functions of all agents interact to maximize the shared objective.
Another concern arises from the use of the state-dependent coefficients k^i(s), which are used to compose the joint Q-function...it suggests a discrepancy between the training and deployment objectives that could impact theoretical guarantees or worst-case behavior.
Inspired by QMIX, we introduced the state-dependent coefficients k^i(s) into the joint Q-function factorization. MAFIS samples actions that maximize each agent's local Q-function through Equation (9), a practice that has been adopted and recommended by widely used open-source toolkits such as Stable-Baselines3 and Tianshou in the reinforcement learning community. Taking the maximum-entropy RL method SAC as an example, these toolkits require deterministic=True to ensure that actions maximizing the Q-function are selected for evaluation. Our key observation is that although the global state s is unavailable to each agent during the execution phase in Dec-POMDPs, the action that maximizes Q^i is the same as the one that maximizes k^i(s)Q^i because k^i(s) is always greater than 0. This implies that even if we sample actions based solely on Q^i, we can still make optimal decisions.
The use of SGLD for sampling in continuous action spaces can be computationally expensive, potentially limiting the method’s scalability.
Thanks to community-optimized deep learning packages like PyTorch and TensorFlow, as well as engineering techniques such as Just-in-Time (JIT) compilation, sampling an action via SGLD can be completed within milliseconds on widely used commercial GPUs (e.g., NVIDIA's A100 or 4090). In terms of scalability, MAFIS samples each agent's action independently, so the computational cost increases linearly with the number of agents rather than exponentially. We believe this design ensures strong scalability.
Moreover, employing SGLD for Q-function sampling is a well-established technique and, by itself, does not constitute a significant contribution.
We do not regard sampling via SGLD as our contribution. However, to the best of our knowledge, MAFIS is the first multi-agent imitation learning method that employs SGLD for sampling from the Q-function.
Though factorization is standard, I want to understand the design choice: why not consider learning a centralized Q-function and then factorizing it by learning the expectation of each agent's Q-value?
Your idea is quite interesting, but we haven’t yet found a suitable approach to implement it—we will explore it in future work. Additionally, learning a factorized Q-function has already achieved many successful applications in the field of multi-agent reinforcement learning, which inspires us to consider attempting it in multi-agent imitation learning.
It seems that if we want multiple agents to collaborate from the beginning, it would be necessary to introduce some hidden variables before agents make decisions.
MAFIS does not explicitly introduce any hidden variables. Instead, during training, we assume demonstrations are provided by a group of experts. By "a group of experts", we mean that they collaborate well. We believe that learning from demonstrations can teach agents how to collaborate, much as humans do. During execution, although agents do not communicate with each other, their learned Q-functions help them make decisions that lead to good collaboration.
Dear Reviewer Bsbt,
We would like to express our sincere gratitude for the time and effort you have dedicated to reviewing our submission. Your thoughtful comments have been invaluable.
We hope our rebuttal has provided sufficient clarification and successfully addressed the concerns you raised. We remain fully available and would be delighted to provide any further details or discuss any remaining points.
If our responses and the planned revisions have resolved your concerns, we would kindly ask you to consider reflecting this in your evaluation of our work.
Thank you once again for your constructive engagement.
Respectfully, The Authors of Submission 28624
This paper introduces MAFIS, a multi-agent imitation learning algorithm based on a factorized soft Q-function, designed to be stable and applicable to both discrete and continuous control tasks.
Most reviewers appreciated its technical soundness, strong empirical results across a wide range of benchmarks, and the extension to continuous control. However, one reviewer raised a critical concern regarding the paper's novelty, arguing that the proposed method is an incremental and straightforward extension of MFIQ. Another reviewer pointed out a theoretical concern regarding a mismatch between the training and execution policies. In response, the authors provided new results as requested. They argued that their specific value decomposition is theoretically more correct than that of MFIQ and that their extension to continuous control is a non-trivial contribution that significantly broadens the applicability of this class of algorithms.
Although the critique is valid in that MAFIS builds directly upon MFIQ, it is true that MAFIS ensures a nice theoretical property that MFIQ did not have. Since MFIQ did not support continuous control, the extension to continuous control should also be considered a non-trivial contribution. Thus, I believe that the paper's strengths outweigh the criticism that the reviewer raised. Therefore, I recommend acceptance. I strongly encourage the authors to more clearly describe the similarity and distinction over MFIQ in the camera-ready version, and to also clarify the theoretical limitation in the revision.