Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning
We develop and analyze a scalable algorithm for multi-agent RL by sampling from the mean-field distribution of the agents to overcome the curse of dimensionality.
Abstract
Reviews and Discussion
This paper proposes a novel algorithm, SUBSAMPLE-MFQ (Subsample-Mean-Field Q-learning), to address the scalability challenges in cooperative multi-agent reinforcement learning (MARL). Traditional MARL suffers from exponential complexity in the number of agents, and although mean-field MARL reduces this to polynomial complexity, it remains computationally prohibitive for large-scale systems.
The authors introduce a sampling-based approximation strategy: by selecting a small number k of local agents and applying value iteration on this reduced system, they learn a policy that can be extended stochastically to the full agent population. The key theoretical contribution is showing that the performance gap between the learned policy and the global optimum decays at a rate of Õ(1/√k), independent of the total number of agents n. Thus, by choosing k = O(log n), the algorithm achieves a polylogarithmic runtime in n, representing an exponential speedup over previous approaches. The paper also extends this framework to off-policy learning and non-tabular settings.
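To make the subsample-and-extend execution step concrete, here is a minimal Python sketch of one plausible reading of that step (each agent samples k-1 peers and acts greedily under the learned k-agent Q-function). The interface and key format of `q_k` are illustrative assumptions, not the paper's actual pseudocode.

```python
import random
from collections import Counter

def subsample_policy_action(q_k, global_state, local_states, agent_i, k, actions, rng=random):
    """Illustrative sketch (our reading, not the paper's pseudocode): agent i
    samples k-1 other local agents, forms the empirical distribution of their
    states, and acts greedily w.r.t. the Q-function q_k learned on the
    k-agent subsampled system."""
    n = len(local_states)
    peers = [j for j in range(n) if j != agent_i]
    sampled = rng.sample(peers, k - 1)                       # fresh random subsample each step
    counts = Counter(local_states[j] for j in sampled)
    emp_dist = tuple(sorted((s, c / (k - 1)) for s, c in counts.items()))
    # Greedy local action under the learned k-agent Q-function.
    return max(actions, key=lambda a: q_k[(global_state, local_states[agent_i], emp_dist, a)])
```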
Strengths and Weaknesses
Strengths
- Theoretical contributions are strong and rigorous. The authors provide a provable convergence rate, detailed complexity bounds, and analysis for both tabular and non-tabular settings.
- The approach is sound and novel, building a strong bridge between mean-field theory and sampling-based approximation in MARL.
- The paper includes clear algorithmic details, including formal descriptions (Algorithms 1 and 2), and supportive mathematical analysis.
Weaknesses
- The experimental section is minimal and primarily relegated to the appendix. There is no strong empirical validation of the claimed exponential speedup or policy performance on standard benchmarks.
- While the paper discusses extension to heterogeneity, the current algorithm assumes mostly homogeneous agent types.
- Real-world deployment constraints (e.g., communication or decentralized execution under limited observability) are not empirically studied.
Questions
- Can the authors provide more comprehensive experiments on common MARL benchmarks (e.g., MAgent, PettingZoo, SMAC) to evaluate practical performance?
- How does the method scale or degrade with increasing agent heterogeneity? Can you support this with empirical results?
- Could the authors discuss how the learned policies can be implemented with function approximators like neural networks in practical settings?
- Is there any guidance on how to select k in real-world tasks to balance performance and computational efficiency? How does SUBSAMPLE-MFQ compare empirically to recent centralized training and decentralized execution (CTDE) methods?
Limitations
Yes.
Justification for Final Rating
The authors responded to the questions clearly, and it is a good theory work in RL.
Formatting Issues
No.
We thank reviewer PDRU for their feedback. Please find our response below:
While the paper discusses extension to heterogeneity, the algorithm assumes mostly homogeneous agent types.
Indeed, while the current algorithm can only work for the homogeneous setting, we add two caveats:
- As the reviewer points out, we mention an extension to a partially heterogeneous-agent setting by modeling types as a part of the state/action spaces of the local agents, in the “capturing heterogeneity” paragraph on page 4 and in Remark 4.7.
- There has been a line of recent work on extending mean-field MARL using a graphon-based framework that supports a continuum of distinct transition probabilities (which is also a form of heterogeneity). It may be possible to extend our subsampling framework to graphon mean-field MARL; however, adapting it may be highly non-trivial, and we leave this as an open problem for future work. Nonetheless, we can add a detailed discussion of this as part of Remark 4.7.
Can the authors provide more comprehensive experiments on common MARL benchmarks (e.g., MAgent, PettingZoo, SMAC) to evaluate practical performance?
As our work is a theoretical contribution submitted to the NeurIPS theory track, we only provided experiments sufficient to numerically validate the theory. That being said, a more thorough empirical study of cooperative multi-agent RL algorithms is an interesting avenue for future empirical research; however, it is outside the scope of our current work. Moreover, the above benchmarks (MAgent, PettingZoo, SMAC) do not fit our MARL setting (for instance, SMAC requires heterogeneous agents, MAgent requires competitive agents, and PettingZoo does not support our global-agent/local-agent setup).
How does the method scale/degrade with increasing agent heterogeneity? Can you support this with empirical results?
Increasing the agent heterogeneity under our “type” formulation makes the algorithm somewhat more expensive, since the number of types factors into the query complexity through the state space of the local agents. However, since the optimality gap of the learned policy is on the order of Õ(1/√k) (modulo very small factors), increasing the amount of agent heterogeneity does not degrade the quality of the learned policy, as the theoretical bound does not get worse.
Could the authors discuss how the learned policies can be implemented with function approximators like neural networks in practical settings?
Learning the policy and sharing it with all the agents is the computational bottleneck in practical settings. After this, implementing the learned policies is efficient, since the empirical distribution only takes O(k) time to compute, and this dependency is tight. In Appendix J, we extend our algorithm to continuous spaces; in this setup, the learned policy could be implemented with neural operators and other function approximators to parameterize and evaluate the policy.
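As one illustration of the function-approximation route (a sketch under our own assumptions, not the architecture used in the paper), the k-agent Q-function could be parameterized by a small network that takes the global state, one local state, and the empirical distribution (a histogram over local states) as input; the class name and dimensions below are hypothetical.

```python
import torch
import torch.nn as nn

class MeanFieldQNet(nn.Module):
    """Hypothetical parameterization of the k-agent Q-function: the input
    concatenates the global state, one local agent's state, and the empirical
    distribution of the k-1 sampled peers' states."""
    def __init__(self, global_dim, local_dim, num_local_states, num_actions, hidden=128):
        super().__init__()
        in_dim = global_dim + local_dim + num_local_states
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),   # one Q-value per local action
        )

    def forward(self, global_state, local_state, empirical_dist):
        x = torch.cat([global_state, local_state, empirical_dist], dim=-1)
        return self.net(x)
```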
Is there any guidance on how to select k in real-world tasks to balance performance and computational efficiency? How does SUBSAMPLE-MFQ compare empirically to recent centralized training and decentralized execution (CTDE) methods?
Theoretically, k = O(log n) suffices to achieve a decaying error rate with the best efficiency, as we mention in Remark 4.6 and as is validated by our experiments. To the best of our knowledge, no previous CTDE method uses a similar k-parameterized approach for learning a MARL policy in the full-observation setting.
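For concreteness, a purely illustrative way to read this guidance in code (the constant c and the helper name suggested_k are our assumptions, not from the paper; the 1/√k figure below ignores constants and logarithmic factors):

```python
import math

def suggested_k(n, c=2.0):
    """Illustrative only: picks k on the order of log n, per the k = O(log n)
    guidance; the constant c is an assumption and would be tuned per task.
    The optimality gap then scales roughly as 1/sqrt(k)."""
    return max(2, math.ceil(c * math.log(n)))

# e.g. n = 1_000_000 agents  ->  k = suggested_k(1_000_000) = 28,
# with an error bound on the order of 1/sqrt(28) ~ 0.19 (up to constants/log factors).
```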
We hope this answers your questions, and look forward to further discussions.
Thank you,
Authors
Dear Reviewer PDRU,
We hope this message finds you well.
As the author-reviewer discussion period is coming to an end, we would like to kindly ask whether we have adequately addressed your concerns. If so, we would appreciate it if you could consider updating your score.
If there are any remaining concerns, we would greatly appreciate it if you could share them with us, so we may have enough time to provide a comprehensive response. Thank you once again for your time and valuable feedback!
Thanks,
Authors
This paper addresses a challenging and important problem in cooperative multi-agent reinforcement learning (MARL) by proposing a Subsample-Mean-Field Q-learning algorithm that decouples the global dependence among agents through random subsampling. The paper establishes a performance gap of Õ(1/√k), where k is the number of sampled agents, with improved sample complexity. This is a strong and meaningful result in cooperative MARL under the homogeneous agent setting.
Strengths and Weaknesses
Strengths
- The paper tackles an important and challenging problem in cooperative MARL. The proposed algorithm is intuitive, and the performance guarantee of Õ(1/√k) with improved sample complexity is a great theoretical contribution to cooperative MARL, which might be leveraged to understand more complex settings.
Weaknesses
- The paper may require a simulator to generate samples, and the number of samples per iteration could be very large, scaling exponentially with the size of the local state and action spaces. The method might be limited to the homogeneous setting.
Questions
- The algorithm's sample complexity appears not to depend on the time horizon T, which is unusual in the RL theory literature. If I understand correctly, this may be due to the large number of samples used in the Bellman updates. It would be beneficial to provide a detailed discussion of this point.
- In the learning period, the algorithm still seems to require a centralized way to do value iteration and (greedy) policy improvement; it would be better to explain how it can be implemented in a practical setting.
- While the proposed method is designed for homogeneous agents, it would be valuable for the paper to discuss how the approach might be extended to heterogeneous settings, especially when the agent types are numerous or drawn from a continuous space. Even a discussion of potential challenges or directions would strengthen the broader applicability of the work.
Limitations
No
Justification for Final Rating
I think it is a nice theory paper in MARL, and I am very positive about it.
Formatting Issues
No
We thank reviewer 1oj9 for their feedback. Please find our response below:
The paper may require a simulator to generate samples, and the number of samples per iteration could be very large, scaling exponentially with the size of the local state and action spaces. The method might be limited to the homogeneous setting.
Our algorithm assumes the existence of a simulator to generate samples. As the reviewer points out, this adds a sampling cost per iteration, resulting in a large query complexity. However, this query complexity was previously exponential in the number of local agents and in the state/action spaces, whereas our work reduces this to a polynomial dependence on the number of local agents and a sub-exponential dependence on the state/action spaces. Moreover, Remark 4.5 and Appendix I detail a version of the algorithm that relaxes the assumption of having a generative oracle by only utilizing historical data (by reducing to off-policy Q-learning).
The algorithm's sample complexity appears not to depend on the time horizon T, which is unusual in the RL theory literature. If I understand correctly, it may be due to the large number of samples used in the Bellman updates. It would be beneficial to provide a detailed discussion on this topic.
In Appendix G.1, we show a condition on the time horizon T that suffices for the Bellman error to be of the required order. This enables Lemma 4.3 in the main body, which allows us to abstract away the dependence on the time horizon. We can add a parallel derivation that retains the T factor in the final bound (showing an error that decays exponentially in the time horizon T). Moreover, in Lemma G.10, we bound the query complexity and set the relevant parameters to attain the stated rate. We can make these dependencies more explicit in Theorem 4.2 in the final version of this paper.
Additionally, we can explain that this polynomial dependence may be loose, since we do not use the more sophisticated variance-reduction techniques of [1, 2, 3, 4] to optimize the number of samples used, in turn, to bound the Bellman error. This was primarily because our goal was to analyze the necessary number of sampled agents and to keep the algorithm as simple as possible (incorporating variance reduction would significantly complicate the algorithm and the intuition).
In the learning period, the algorithm still seems to require a centralized way to do value iteration and (greedy) policy improvement; it would be better to explain how it can be implemented in a practical setting.
Centralized learning is the only provable way for a multi-agent system to learn the true optimum policy in the full-information setting, as the optimum policy lives in a set whose cardinality depends on the joint state/action space. To clarify, there exist decentralized-training MARL algorithms with provable convergence guarantees to some equilibrium policy; however, these algorithms generally do not converge to the optimum policy. Since our algorithm only requires centralized training on k agents, it is more scalable than previous optimum-seeking approaches. One practical suggestion that exploits the homogeneity of the local agents is to let each local agent compute the joint Q-function itself, instead of having a controller share the learned policy with all the agents: this leads to a natural trade-off between the communication complexity and the sample complexity of the algorithm, which can be optimized in various practical settings, and we can add this as a discussion in Remark 4.8 in the main body.
While the proposed method is designed for homogeneous agents, it would be valuable for the paper to discuss how the approach might be extended to heterogeneous settings, especially when the agent types are numerous or drawn from a continuous space. Even a discussion of potential challenges or directions would strengthen the broader applicability of the work.
Indeed, while the current algorithm can only work for the homogeneous setting, we add two caveats:
- As the reviewer points out, we mention an extension to a partially heterogeneous-agent setting by modeling types as a part of the state/action spaces of the local agents, in the “capturing heterogeneity” paragraph on page 4 and in Remark 4.7.
- There has been a line of recent work on extending mean-field MARL using a graphon-based framework that supports a continuum of distinct transition probabilities (which is also a form of heterogeneity). It may be possible to extend our subsampling framework to graphon mean-field MARL; however, adapting it may be highly non-trivial, and we leave this as an open problem for future work. Nonetheless, we can add a detailed discussion of this as part of Remark 4.7.
We hope this answers your questions, and look forward to further discussions.
Thank you,
Authors
====================
References:
[1] Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes (Sidford et al., SODA 2018)
[2] Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model (Sidford et al., NeurIPS 2018)
[3] Truncated Variance Reduced Value Iteration (Jin et al., NeurIPS 2024)
[4] Variance-reduced Q-learning is minimax optimal (Wainwright, arXiv 2019)
Thank you for the detailed response! It addresses my concerns, and I think it is a nice theory paper in MARL.
This paper addresses the fundamental challenge of scalability in cooperative multi-agent reinforcement learning (MARL), where the joint state-action space grows exponentially with the number of agents, n. The authors propose a novel algorithm, SUBSAMPLE-MFQ, that tackles this curse of dimensionality by leveraging mean-field theory and subsampling. The main contribution of the paper is a rigorous theoretical proof showing that the performance gap between the learned policy and the globally optimal policy decays at a rate of O(1/√k). This error bound is independent of the total number of agents n, establishing a clear trade-off between computational complexity and sub-optimality.
Strengths and Weaknesses
Strengths
- This paper presents a significant and original theoretical breakthrough for scalable MARL. The idea of combining mean-field approximation with agent subsampling to achieve complexity that is independent of the total number of agents n is a powerful and novel concept. This directly tackles the most critical bottleneck in MARL—the curse of dimensionality—and provides a new and viable path toward creating provably good algorithms for systems with massive numbers of agents.
- The main theoretical result (Theorem 4.4) is very strong. Proving an optimality gap that does not depend on n is a powerful guarantee. The proof appears rigorous and leverages a non-trivial combination of techniques, including a novel MDP sampling result based on an extension of the DKW inequality and a careful adaptation of the performance difference lemma to the multi-agent sampling context. The analysis is extensive, with the appendices covering extensions to stochastic rewards, off-policy learning, and even continuous spaces, demonstrating the robustness of the framework.
- For a deeply theoretical paper, the presentation is exceptionally clear. The authors do an excellent job of motivating the problem, providing high-level intuition for their approach, and then systematically building up the formalisms.
- The result establishes a fundamental trade-off between computation (polynomial in k) and optimality. By setting k = O(log n), the paper demonstrates an exponential speedup over prior mean-field methods while maintaining a provably small, decaying optimality gap. This could inspire a new class of practical, scalable MARL algorithms.
Weaknesses
- The primary weakness of the paper is its experiments. The simulations are run on two relatively simple, synthetic environments ("Gaussian squeeze" and "constrained exploration") designed to illustrate the theory. The paper lacks evaluation on any standard, complex MARL benchmarks (e.g., StarCraft, Multi-Agent MuJoCo, etc.). While the results effectively demonstrate the claimed trade-off, they do not show how SUBSAMPLE-MFQ compares to or performs against established, practical MARL algorithms in more realistic settings.
- The theoretical results are derived for a specific cooperative MARL setting with homogeneous local agents and a structured reward and transition dynamic (agents do not directly interact with each other, only through the global state). While the authors mention that heterogeneity can be modeled by including a "type" in the state space, this may not be sufficient for all real-world scenarios. The restriction to purely cooperative tasks also limits the immediate applicability to competitive or mixed-motive games.
- The proposed online execution policy (Algorithm 2) requires each of the n agents to sample and observe the states of k-1 other agents at every timestep. While this is tractable from a computational perspective, it assumes a communication or observation network that can support this random sampling at scale. In a massively distributed system, this requirement could itself become a communication bottleneck, which complicates the "decentralized execution" claim.
Questions
- The theoretical results are very impressive. To better contextualize their impact, could you discuss how you expect SUBSAMPLE-MFQ to perform on standard, complex MARL benchmarks like the StarCraft Multi-Agent Challenge (SMAC)? How does the structural assumption of your model (a global agent and non-interacting local agents) map to such environments, and what challenges would arise in applying your method there?
- The paper proves an optimality gap of Õ(1/√k). In the limitations, you correctly note the absence of a matching lower bound. Do you have any intuition on whether this rate is tight for this problem? Or could a different sampling or estimation strategy potentially achieve a faster rate, such as O(1/k)?
Limitations
yes
Formatting Issues
No concerns
We thank reviewer GLs2 for their feedback on our work.
While this is tractable from a computational perspective, the decentralized execution assumes a communication or observation network that can support this random sampling at scale. In a massively distributed system, this requirement could itself become a communication bottleneck, which complicates the "decentralized execution" claim.
Thanks for pointing this out. We can add a discussion of heuristics that derandomize the decentralized execution method, which may be more conducive to practical settings. Specifically, one practical implementation of the algorithm involves allowing the agents to share some randomness by only sampling within pre-defined fixed blocks of size k. We can add more detail on this as a new section in the appendix.
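A minimal sketch of how we interpret this block-based derandomization (the function name and details are our assumptions): agents are partitioned once into fixed blocks of size k using shared randomness, and each agent thereafter forms its empirical distribution from its own block rather than resampling over all n agents at every timestep.

```python
import random

def fixed_blocks(n, k, rng=random):
    """Sketch of the derandomization heuristic (our reading): partition the n
    agents once into fixed blocks of size k; at execution time each agent only
    samples within its own block, so no fresh random sampling over all n
    agents is needed per timestep."""
    agents = list(range(n))
    rng.shuffle(agents)                              # one-time shared randomness
    return [agents[i:i + k] for i in range(0, len(agents), k)]

# Example: 10 agents, blocks of size 3 (the last block may be smaller), e.g.
# fixed_blocks(10, 3) -> [[4, 7, 0], [2, 9, 5], [1, 8, 3], [6]]
```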
To better contextualize their impact, could you discuss how you expect SUBSAMPLE-MFQ to perform on standard, complex MARL benchmarks like the StarCraft Multi-Agent Challenge (SMAC)? How does the structural assumption of your model (a global agent and non-interacting local agents) map to such environments, and what challenges would arise in applying your method there?
As the reviewer points out, the primary contributions of our work are theoretical, and we only provided simulations sufficient to numerically validate the theoretical results. For instance, the “Gaussian squeeze” experiment is similar to one conducted in the foundational mean-field MARL paper [1], which we compare against. A more thorough empirical study of cooperative multi-agent RL algorithms would be an interesting avenue for future empirical research; however, this is outside the scope of our current work, which focuses on theoretical bottlenecks of MARL. At present, the structural assumptions that prevent applying our method to some standard MARL benchmarks are that it only incorporates weak heterogeneity and that it is designed for cooperative multi-agent learning rather than competitive/game environments.
Regarding the optimality gap, do we have any intuition on whether this rate is tight for this problem? Or could a different sampling or estimation strategy potentially achieve a faster rate, such as O(1/k)?
The algorithmic bottleneck in achieving a faster rate than Õ(1/√k) actually comes from learning the Q-function, rather than from the online execution strategy; however, when k = n, our algorithm reduces to mean-field learning, which has a rate of O(1/√n) (and is known to be tight in light of [1]). Thus, at least when k = n, our rate is tight. This illustrates a natural obstacle against uniformly improving our rate and perhaps suggests intuition for why our result might be tight.
We would be happy to answer any further questions.
Thank you,
Authors
========================
References:
[1] Mean Field Multi-Agent Reinforcement Learning, Yang et al. (ICML 2018)
Dear Reviewer GLs2,
We hope this message finds you well.
As the author-reviewer discussion period is coming to an end, we would like to kindly ask whether we have adequately addressed your concerns. If so, we would appreciate it if you could consider updating your score.
If there are any remaining concerns, we would greatly appreciate it if you could share them with us, so we may have enough time to provide a comprehensive response. Thank you once again for your time and valuable feedback!
Thanks,
Authors
The paper introduces SUBSAMPLE-MFQ, a novel algorithm designed for cooperative Multi-Agent Reinforcement Learning (MARL). By subsampling agents and applying mean-field Q-learning to this reduced subset, the method achieves significant computational savings. The algorithm retains theoretical guarantees, showing that the learned policy converges to the optimal one with an optimality gap of Õ(1/√k), independent of the total number of agents n. When k = O(log n), SUBSAMPLE-MFQ offers exponential improvements over previous methods.
Strengths and Weaknesses
Strengths:
- The paper provided a rigorous theoretical proof for the algorithm.
- The paper presented the algorithm in clear detail.
Weakness:
- Although this paper focuses more on the theory side, no experiments were given.
Questions
- I wonder about the feasibility of designing a small experiment to validate the conclusions of this paper.
Limitations
The authors mentioned the work's limitation in the paper.
Formatting Issues
N/A
We thank reviewer APLY for their feedback on our work.
We did provide two experiments in Appendix B (page 16), corresponding to the motivating examples (Gaussian squeeze and constrained exploration) in Lines 126-145, to numerically validate the theoretical advantages of our SUBSAMPLE-MFQ algorithm. Across these experiments, we also provided figures that plot the log reward optimality gap as a function of k, the time taken for SUBSAMPLE-MFQ to compute its learned policy for varying k, and the cumulative discounted reward seen by the algorithm for different k.
We plan to use the extra page in the camera-ready version to move some of the experiments to the main body. We hope this eases any concerns about the empirical validation of the algorithm, and we would be happy to answer any further questions.
Thank you,
Authors
Dear Reviewer APLY,
We hope this message finds you well.
As the author-reviewer discussion period is coming to an end, we would like to kindly ask whether we have adequately addressed your concerns. If so, we would appreciate it if you could consider updating your score.
If there are any remaining concerns, we would greatly appreciate it if you could share them with us, so we may have enough time to provide a comprehensive response. Thank you once again for your time and valuable feedback!
Thanks,
Authors
The paper “develop[s] and analyze[s] a scalable algorithm for multi-agent RL by sampling the mean-field distribution of the agents to overcome the curse of dimensionality.” All reviewers rated the paper highly, universally stating that it is a good theory paper that delivers on its claims. Some concerns were expressed about the limited experimental results.