In-Context Fully Decentralized Cooperative Multi-Agent Reinforcement Learning
This paper enhances fully decentralized cooperative multi-agent reinforcement learning from a context-modeling perspective.
Abstract
Reviews and Discussion
The paper identifies two fundamental challenges in fully decentralized cooperative multi-agent reinforcement learning (MARL): non-stationarity and relative overgeneralization. The proposed Return-Aware Context (RAC) method models each agent’s task as a contextual MDP, using discretized episodic returns as proxies for joint policies. This is a simple yet elegant way to distinguish joint behavior patterns in a decentralized setting.
Strengths and Weaknesses
Strengths: This paper provides a clear motivation and a new method. The paper also clearly explains the difference between average-based value projection and the optimistic marginal value, highlighting how RAC achieves better estimation and action selection without relying on other agents' policies or actions. The authors conduct thorough evaluations on three types of tasks (Matrix Game, Predator and Prey, and SMAC maps), comparing RAC with strong baselines such as IQL, Hysteretic Q-learning, and I2Q. The empirical results demonstrate consistent and significant performance gains. RAC requires no centralized training signals and performs well under partial observability, suggesting good applicability to large-scale or real-world settings where full observability or communication is limited.
Weaknesses:
- The use of discretized episodic returns as contexts, while intuitive, is heuristic and lacks theoretical backing regarding discriminability among joint policies. It would strengthen the work to include analysis (or even empirical metrics) of context separability.
- The paper focuses only on the case where policies change between episodes (case 2). While the authors acknowledge this, it would be useful to include results or discussion on how RAC might generalize or fail under case (1), where policy changes occur within episodes.
- The performance depends on manual tuning of the number of context bins and the KL supervision weight. Exploring adaptive or data-driven discretization strategies could make the method more robust and easier to apply.
- I think the ablation studies could be further expanded. Additional ablations, such as disabling the KL term, changing the action selection strategy, or comparing with soft context clustering, would help clarify the necessity of each component.
- The manuscript repeats some concepts (e.g., relative overgeneralization) multiple times. The narrative could be made more concise for better readability and emphasis on novel contributions.
Questions
Please see the weaknesses listed above.
However, since my research field is not focused on multi-agent systems, I will make further reasonable evaluations based on the comments of other reviewers.
Limitations
Yes
Formatting Issues
N/A
1. Response to Weakness 1 and Question 1.
(1) Intuition. Our goal is to construct a representation of the agents' joint policy. Intuitively, joint policies that yield similar expected returns tend to have similar value functions, and their representations should be located close to one another in the representation space. Contexts based on episodic returns naturally satisfy this requirement, and the context-based value function produces similar outputs for similar joint policies. Specifically, such episodic return-based contexts cluster the agents' joint policies according to their returns (or associated value functions), resulting in a one-to-many mapping from contexts to joint policies.
(2) Empirical Evidence. As shown in Tab. 1 of the main text, for the matrix game, we define 4 episodic return intervals based on prior knowledge of the task. For each agent, the contexts are associated with distinct return ranges corresponding to the different policies of the other agent. The context-based value function accurately approximates the rewards of all joint actions (i.e., all deterministic joint policies learned by value-based methods). The same analysis applies symmetrically to the other agent. These results empirically demonstrate the effectiveness of episodic return-based contexts in representing the agents' joint policies.
(3) Theoretical Consideration. The theoretical foundations for the correspondence between episodic return-based contexts and the agents' joint policies have not been explored in this work. We leave this theoretical analysis for future investigation.
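For concreteness, below is a minimal sketch of how an episodic return could be mapped to a discrete context index; the bin count, the return bounds, and the function name are illustrative assumptions rather than the exact implementation in the paper.

```python
import numpy as np

def return_to_context(episodic_return, r_min, r_max, num_bins):
    """Map an episodic return to a discrete context index by splitting
    [r_min, r_max] into num_bins equal-width intervals.

    Illustrative sketch only: the bin count and bounds are task-dependent
    choices (the paper determines the bounds from prior task knowledge).
    """
    r = float(np.clip(episodic_return, r_min, r_max))
    frac = (r - r_min) / (r_max - r_min)             # position within the return range
    return min(int(frac * num_bins), num_bins - 1)   # clamp the top edge into the last bin

# Example with 4 return intervals on [0, 8], mirroring the matrix-game setup above.
print(return_to_context(6.2, r_min=0.0, r_max=8.0, num_bins=4))  # -> 3
```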
2. Response to Weakness 2 and Question 2.
Case (1) considers the scenario where the context changes between time steps, i.e., other agents update or change their joint policy at every (or every few) time steps. In this case, the contexts based on episodic returns fail to uniquely represent the agents' changing joint policy, leading RAC to struggle with the resulting severe non-stationarity. The underlying reason is as follows: when other agents change their joint policy within episodes, the generation probability of the current agent's local trajectory is determined by multiple joint policies of the other agents. Consequently, the corresponding episodic return is also determined by these multiple joint policies and thus fails to uniquely correspond to any single one. We will add this discussion to the revision to enhance clarity.
3. Response to Weakness 3 and Question 3.
Thanks for your valuable suggestions. In terms of the KL supervision weight, a straightforward way to remove the need for manual tuning is to equip RAC with efficient exploration techniques that ensure comprehensive coverage of the return space. As for the number of context bins, as discussed in the Response to Question 2 of Reviewer KFLW, adaptive context modeling may be achieved by dynamically clustering the received returns. We plan to explore both directions in the next stage of our work.
We will add this discussion in the revision to encourage further research on in-context fully decentralized cooperative MARL.
4. Response to Weakness 4 and Question 4.
Thanks for your valuable advice. At this stage, we primarily focus on evaluating the impact of the KL term. We consider two specific cases: (1) When the KL term is fully disabled, the separately learned value function degenerates into standard IQL, resulting in performance equivalent to that of IQL. (2) When the separately learned value function is removed entirely, RAC is constrained to use the context-based value function directly for action selection; we assess this variant (denoted RAC_w_ below) in the matrix game, predator and prey, and the SMAC map 5m_vs_6m.
| | RAC_w_ | Hysteretic Q-learning | I2Q | IQL |
|---|---|---|---|---|
| matrix game (test return) | 8.0±0.0 | 8.0±0.0 | 8.0±0.0 | 6.0±0.0 |
| predator and prey (test return) | 33.41±1.34 | 2.48±2.57 | 0.0±0.0 | 0.0±0.0 |
| 5m_vs_6m (test win rate) | 0.0±0.0 | 0.29±0.10 | 0.42±0.06 | 0.39±0.05 |
As shown in the table above, RAC_w_ achieves superior performance in the matrix game and predator and prey tasks. This highlights the effectiveness of the individual optimistic marginal value in RAC for mitigating relative overgeneralization. However, RAC_w_ fails to learn effective policies on 5m_vs_6m. We attribute this failure to insufficient coverage of the return space, hindering the construction of meaningful contexts based on episodic returns. A detailed explanation is provided in the Response to Question 3 of Reviewer KFLW. For complex tasks such as SMAC, integrating RAC with efficient exploration techniques remains a promising direction for future work. We will also make every effort to complete additional ablation studies on alternative action selection strategies and soft context clustering, and incorporate them into the revised manuscript.
5. Response to Weakness 5 and Question 5.
Thanks for your valuable advice. We will adopt abbreviations in the revision to enhance readability.
Dear Reviewer 7ZBz,
During the rebuttal period, we have provided detailed responses to your concerns regarding:
(1) Clarifications on context separability in terms of the discriminability among joint policies;
(2) Discussion on the limitations of our method in Case (1), where contexts change between time steps;
(3) Further extensions involving adaptive context modeling and simplified hyper-parameter tuning;
(4) Enhancements to the ablation studies;
(5) The use of abbreviations to improve readability.
We sincerely hope that the current responses resolve your concerns. We look forward to your response if you have any further concerns about this work.
Best regards,
The Authors
Dear Reviewer 7ZBz,
We sincerely appreciate the time and effort you have dedicated to providing us with your valuable feedback. We believe that we have addressed all of your concerns. Should you have any further questions or suggestions, please do not hesitate to share them, and we would be glad to address them promptly.
Best regards,
The authors
This paper presents a novel way to perform fully decentralized multi-agent learning based on contextual MDPs. The idea is to use the episodic trajectory return as the context for a single-agent MDP, so that this information can help the learning agent overcome both non-stationarity (as the context reflects the behaviour of the other agents) and the relative overgeneralization phenomenon. Sampling of the agent's action is based on an optimistic marginal estimate, so that agents can coordinate during execution, assuming that the others are acting optimally.
Strengths and Weaknesses
The problem of DTDE is a very interesting one in the multi-agent reinforcement learning literature, as it is often the case that centralized information cannot be exploited, and in general it helps reduce the growing computational cost that such centralized components may produce. The idea of modelling the non-stationarity perceived by each agent as a contextual MDP is good, and novel to the best of my knowledge. The paper is easy to read and understand, the idea is exposed nicely and the narrative is coherent.
However, I feel that some aspects of the proposed method are not sufficiently discussed or clarified, for example the increased computational requirement of RAC to compute the optimistic marginal value estimate (or the induced policies required for the KL loss term in practice). Moreover, the empirical results are not astonishing, especially on the more complex SMAC settings, and thus do not justify the proposed methodology in their own right, especially if this comes at the additional costs discussed before. Please see the Questions below for a more detailed discussion.
Questions
Q1: On page 5 you claim that "The associated episodic return implicitly represents the agents’ joint policy in the return space.". I do not think this is entirely true: different joint policies may lead to the same return in practice: for example, think of a simple meet-in-a-grid problem, where the agents only perceive some reward if they successfully meet: all the different sub-optimal joint policies that do not solve the problem will lead to the same return. Also the opposite is true: the same joint policy may lead to totally different return estimates when using Monte-Carlo samples of it. Simply put, trajectory returns are not in a bijection with the joint policies that generate them. How does your proposed solution consider these aspects when considering the return as a proxy of the joint policy? Please discuss a bit on this aspect, as I think it is an important consideration to properly understand your solution (and an interesting point in general for the methodology you are proposing).
Q2: In Equation (5), the notation is a bit confusing: the same symbol is said to be the reward emission probability, but earlier in the text the exact same notation was used for the actual reward function itself. Please differentiate the two terms with different notation in order to avoid confusion.
Q3: While it is true that the use of a marginal value estimate allows to overcome the problem of relative overgeneralization and push agents to select their optimal coordinated behaviour, this does not come without a cost: the number of contexts may be large (it depends on how we set the number of bins, but using too low a value may hinder learning and convergence), and we need to estimate the contextual action-value functions for each context-action pair (the state is an input to the value function generally), possibly making this a very expensive operation in moderately large settings. What is the impact of such an expensive operation on the computational and wall-clock time of your proposed method? Is there anything that you can do to try and dampen the cost of estimating such a marginal value function without enumerating all the data points, that perhaps I am not seeing?
Q4: The way in which you claim to address efficient training in the initial phase does not sound extremely convincing to me: your use of what is basically IQL should suffer from the same issues that independent Q-learning would have in reaching coordinated trajectories to improve the contextual estimates, no? In my understanding, the real contributor here is the supervised loss term, which is indeed pushing the agents to overcome the relative overgeneralization problem, but this does not seem to take anything away from the sample complexity arising from the need for good estimates over the larger space of contexts. Could you please discuss this aspect in more detail?
Q5: It would be interesting to also compare RAC to a CTDE baseline (like VDN or QMIX), at least on the larger problems like SMAC, in order to assess how good the obtained results are with respect to an algorithm which leverages centralized information at training time. Of course the proposed DTDE algorithm is not expected to outperform it, but for example seeing a narrow margin there would still be an interesting insight.
Q6: Are you sure that there is a relative overgeneralization phenomenon happening in the SMAC setting that RAC is helping with, as you claim in the corresponding empirical results analysis subsection? Rewards in SMAC are usually quite aligned with good trajectories, and I am not aware of any overgeneralization happening here. If so, could you explain a bit more where this is happening?
Limitations
Some limitations of the proposed methodology are discussed in the paper.
Final Justification
My concerns have mostly been addressed by the authors' rebuttal, and I consider the paper a valid contribution to the field of decentralised learning.
Formatting Issues
None
1. Response to Question 1.
Our goal is to construct a representation of the agents' joint policy. For joint policies that yield similar expected returns, their corresponding value functions are also similar, and thus their learned representations should lie close to each other in the representation space. Contexts based on episodic returns naturally satisfy this requirement, and the context-based value function produces similar outputs for similar joint policies. Specifically, such contexts cluster the agents' joint policies according to their returns (or associated value functions), resulting in a one-to-many mapping from contexts to joint policies, rather than a one-to-one correspondence.
As noted by the reviewer, a potential drawback of representing the agents' joint policy using episodic returns is the high variance caused by the stochastic policy and the stochastic environment. A promising approach to this problem is to perform multiple Monte Carlo rollouts to approximate the expected return of a policy more reliably. We will highlight this limitation in the revised manuscript. Further extensions about reducing the high variance and exploring alternative representations are left for future work.
2. Response to Question 2.
Thank you for the careful reading. We will correct this typo in the revision.
3. Response to Question 3.
The computational complexity of our approach contains two components:
(a) For the context-based value function, its inputs include both states and contexts, and its outputs correspond to the value estimates of all possible local actions of the agent. Its input space scales linearly with the number of contexts, and deep neural networks can generalize well across such a space. The number of outputs equals the size of the agent's local action space.
(b) For the individual optimistic marginal value, when the number of contexts is set too large, the enumeration of all possible contexts typically leads to high computational complexity. A promising approach is to use a sampling-based, derivative-free heuristic search method, such as CEM [1], to find approximate maxima. In detail, at each iteration we draw a batch of random context samples from a candidate distribution, e.g., a Gaussian. The best samples (with the highest context-based value estimates) are then used to fit a new Gaussian distribution, and this process is repeated for a fixed number of iterations. Such a sampling-based approximation efficiently reduces the complexity caused by exhaustive search over an enormous context space, making the approach more tractable when the context space is very large.
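To illustrate this sampling-based alternative to exhaustive enumeration, the following is a minimal CEM sketch over a continuous context representation; the Gaussian parameterization, population size, and elite fraction are assumptions made for illustration, not the settings used in our method.

```python
import numpy as np

def cem_argmax(value_fn, dim, iters=10, pop=64, elite_frac=0.2, seed=0):
    """Cross-entropy method sketch for approximating the maximum of a
    value function over a large context space without enumeration.

    value_fn: maps a batch of context vectors of shape (pop, dim)
              to value estimates of shape (pop,).
    Returns the mean of the final Gaussian as the approximate maximizer.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        samples = mu + sigma * rng.standard_normal((pop, dim))
        scores = value_fn(samples)
        elites = samples[np.argsort(scores)[-n_elite:]]  # keep the best samples
        mu, sigma = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu

# Toy usage: the "context value" peaks at the vector [1.5, -0.5].
target = np.array([1.5, -0.5])
best_c = cem_argmax(lambda c: -((c - target) ** 2).sum(axis=1), dim=2)
```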
4. Response to Question 4.
In this work, the learning processes of the separately learned value function and the context-based value function form a positive feedback loop: the former generates informative transitions to ensure effective updates of the latter, while the latter, in turn, guides the former in selecting per-agent local cooperative actions to produce more informative transitions. This mutual reinforcement facilitates efficient, fully decentralized cooperative policy learning. However, such mutual reinforcement heavily relies on appropriately weighting the supervision loss term. As acknowledged in Sec. 6 (Conclusion: Limitation and Future Work), RAC necessitates a thorough coverage of the entire return space to efficiently update the context-based value function, while separately learning and tuning the supervision loss term may perform poorly in more complex tasks. To address this limitation, integrating RAC with efficient exploration techniques is a promising direction, which we leave for future work.
Regarding the context-based value function, it is trained on the transitions generated by the separately learned value function. More explanations about the learning complexity of the context-based value function are detailed in the Response to Question 3.
5. Response to Question 5.
| | RAC | Hysteretic Q-learning | I2Q | IQL | QMIX | VDN |
|---|---|---|---|---|---|---|
| matrix game (test return) | 8.0±0.0 | 8.0±0.0 | 8.0±0.0 | 6.0±0.0 | 6.0±0.0 | 6.0±0.0 |
| predator and prey (test return) | 34.32±1.35 | 2.48±2.57 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 31.13±1.50 |
| 3s_vs_4z (test win rate) | 0.83±0.06 | 0.69±0.12 | 0.81±0.12 | 0.75±0.14 | 0.98±0.01 | 0.98±0.01 |
| 3s_vs_5z (test win rate) | 0.58±0.06 | 0.30±0.18 | 0.06±0.06 | 0.41±0.08 | 0.79±0.19 | 0.86±0.02 |
| 5m_vs_6m (test win rate) | 0.46±0.04 | 0.29±0.10 | 0.42±0.06 | 0.39±0.05 | 0.62±0.03 | 0.74±0.03 |
| 2s3z (test win rate) | 0.73±0.02 | 0.61±0.13 | 0.61±0.27 | 0.70±0.14 | 0.97±0.02 | 0.93±0.05 |
| 2s_vs_1sc (test win rate) | 0.97±0.04 | 0.97±0.03 | 0.98±0.02 | 0.98±0.03 | 1.0±0.0 | 0.99±0.01 |
Thanks for your valuable suggestions. We evaluate QMIX and VDN on the matrix game, predator and prey, and five SMAC maps, reporting the median asymptotic performance along with the standard error across five random seeds. The results, presented in the table above, show that RAC significantly outperforms both QMIX and VDN on the matrix game and predator and prey, where QMIX and VDN suffer from sub-optimal policies due to their representational limitations regarding the factored global action value function. On the five SMAC maps, RAC exhibits superior performance over the other fully decentralized baselines but underperforms relative to QMIX and VDN. We attribute this to RAC's limited ability to handle partial observability and to adequately cover the entire return space. Future work will focus on enhancing RAC by incorporating belief state representations and integrating more efficient exploration techniques. We will add the comparison results in the revision for a better presentation.
6. Response to Question 6.
For a cooperative game, relative overgeneralization refers to the scenario where a sub-optimal equilibrium is preferred over the optimal one. In the SMAC benchmark, due to challenges such as partial observability and algorithmic limitations (e.g., representational limitations in value decomposition methods or over-estimation in Hysteretic Q-learning), baseline methods often converge to sub-optimal policies on certain maps, as evidenced by their poor performance. In this context, relative overgeneralization stems from both environmental and algorithmic factors, rather than solely from the reward design. We will add this discussion to the revised manuscript to improve clarity.
Reference
[1] De Boer P T, Kroese D P, Mannor S, et al. A tutorial on the cross-entropy method[J]. Annals of operations research, 2005, 134(1): 19-67.
I would like to thank the authors for their rebuttal. Some additional points are outlined below:
Q1: While it is true that similar policies, achieving similar returns, are clustered close, the opposite is not true: policies with similar returns that are clustered close are not necessarily similar. There may be policies which achieve a similar return value by behaving in a completely different way. Does it make sense to still cluster them close and treat them similarly?
Q3: Nonetheless, even adding such a learning-driven sampling of contexts will introduce additional complexity that you should account for and at least explicitly mention in the paper.
Q5: The gap with CTDE algorithms on more complex problems (i.e., SMAC) is quite consistent. Although I was not expecting RAC to outperform these, I think it is worth mentioning and highlighting this aspect in the paper, and trying to motivate it.
Q6: I would like to disagree with this: learning a sub-optimal policy may depend on a number of factors, like a large state-action space that is difficult to cover entirely while the other agents are learning. It does not always boil down to the relative overgeneralization problem. Indeed, QMIX and VDN, which are not geared to overcome such an issue and indeed fail in the matrix game, achieve optimal performance on SMAC. Associating sub-optimal policy learning with relative overgeneralization one-to-one does not account for the many difficulties that (multi-agent) reinforcement learning has...
Thanks for your timely reply. We detail our responses as follows.
Response to Q1.
The motivation behind this question pertains to the rigor of policy similarity measurements. Broadly, we categorize such measurements into two types: performance similarity and behavioral similarity.
(1) When performance is the primary concern (e.g., in standard RL settings), we consider two typical ways to quantify the performance similarity between policies. The first approach involves generating multiple episodes (rollouts) under each of the two policies and computing their average episodic returns, as adopted by our method. However, this method suffers from high return variance. The second approach involves comparing the two policies' state value functions at every state. In comparison to the former, this constitutes a stricter notion of performance similarity, as it requires the policies to yield equivalent long-term returns from every state.
Nevertheless, performance-based similarity does not account for how the policies behave, specifically, which states they visit and what actions they take. As noted by the reviewer, this can lead to ambiguity: for example, in a grid world with multiple shortest paths, two policies may traverse different states and actions, but both reach the goal in minimal steps and achieve the same return.
(2) When behavioral characteristics are of greater importance (e.g., in imitation or transfer learning), policy similarity can be evaluated using action distribution metrics, such as the KL divergence between stochastic policies or direct equality between deterministic policies, as well as distributional differences in state or state-action occupancy measures.
The choice of similarity measure should thus be guided by the context: whether performance or behavior is more relevant. In this work, we empirically demonstrate that using episodic return-based contexts to represent the agents' joint policy performs well in the matrix game, predator and prey, and several SMAC maps. We sincerely thank the reviewer for raising the important issue of policy similarity. We will explore more context modeling techniques based on the above policy similarity metrics in future work.
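As a concrete illustration of the two notions of similarity discussed above, the sketch below computes a return-based (performance) distance and a KL-based (behavioral) distance between two policies; the function names and the tabular-policy representation are assumptions made for illustration only.

```python
import numpy as np

def performance_distance(returns_a, returns_b):
    """Absolute gap between the average Monte-Carlo episodic returns of
    two policies (smaller means more similar performance)."""
    return abs(np.mean(returns_a) - np.mean(returns_b))

def behavioral_distance(pi_a, pi_b, eps=1e-8):
    """Mean KL divergence KL(pi_a || pi_b) over a set of sampled states.

    pi_a, pi_b: arrays of shape (num_states, num_actions), rows sum to 1.
    """
    kl_per_state = np.sum(pi_a * (np.log(pi_a + eps) - np.log(pi_b + eps)), axis=1)
    return float(kl_per_state.mean())
```

Two policies can score as identical under the first distance while remaining far apart under the second, which is exactly the ambiguity raised in the question.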
Response to Q3.
We agree with the reviewer's observations. When facing a large context space (i.e., when the number of contexts is large), calculating the individual optimistic marginal value requires enumerating all possible contexts, which incurs significant computational complexity. Although introducing a sampling-based heuristic search method such as CEM to find approximate maxima can alleviate this burden, it requires maintaining multiple sampling processes and performing iterative updates of the Gaussian distribution, which adds further complexity, as rightly noted by the reviewer. We will add this discussion in the revision to clarify this existing limitation.
Response to Q5.
Thanks for your valuable advice. We will include the comparison results in the revision for a better presentation of our approach.
Response to Q6.
There may be a misunderstanding on our part regarding the relationship between relative overgeneralization and sub-optimal policy learning. Relative overgeneralization is a sub-class of sub-optimal policy learning, where a sub-optimal equilibrium is preferred over the optimal one, due to biased value estimations of per-agent local cooperative actions (e.g., IQL) or all agents' joint cooperative actions (e.g., QMIX and VDN).
As the reviewer rightly pointed out, sub-optimal policy learning can arise from a broader range of factors, including insufficient exploration, partial observability, poor credit assignment, and other factors. Baseline methods may exhibit poor performance due to one or more of these issues. In the revision, we will provide a more accurate assessment of such failures in the SMAC setting for more clarity.
Again, I would like to thank the authors for their prompt reply. My concerns have been addressed, and I will keep my positive score.
This paper investigates fully decentralized collaborative multi-agent reinforcement learning. The author proposes a new method called RAC (Return-Aware Context). The core idea of RAC is to view the constantly changing task dynamics that each agent sees as switching between different "contexts". It uses the episodic return obtained over the entire episode to distinguish and represent these different contexts. Based on this context, each agent learns its own value function. RAC also designs an optimistic value estimation approach, encouraging agents to choose actions that can facilitate optimal teamwork. The author tested RAC on several tasks (matrix games, predator and prey, SMAC) and the results showed that it performed better than several existing fully decentralized methods (IQL, Hysteretic Q-learning, I2Q).
Strengths and Weaknesses
Strengths:
- The fully decentralized learning paradigm studied here has broader applicability and practical relevance than the popular CTDE (Centralized Training with Decentralized Execution) approach. Given this setting remains under-explored, the paper's research focus is significant.
- The paper is generally clear and easy to understand. The proposed method seems straightforward to implement and offers a potentially effective solution for tackling non-stationarity.
Weaknesses & Concerns:
- Some claims feel too strong and could confuse readers. Specifically: Why is episodic return a suitable choice for representing "context"? Why can non-stationarity be (solely or partly) attributed to context switching?
- Compared to the I2Q and Hysteretic Q-learning algorithms, RAC does not show significant improvement.
- Consider adding some CTDE baselines such as VDN and QMIX to the experiments. Although comparing different paradigms is challenging, as pointed out in the introduction, CTDE is currently the mainstream paradigm in research. It is crucial to showcase performance differences and discuss trade-offs in order to demonstrate the value and potential advantages of fully decentralized research.
- Considering that it is the core motivation, we would like to see more discussion in future versions on why existing methods struggle to simultaneously solve the problems of non-stationarity and relative overgeneralization, whether from a theoretical or visual-analysis perspective.
- This article proposes attributing non-stationarity to context switching. The reviewer would like to know whether this viewpoint is first proposed in this article or whether there are relevant analyses in previous literature. If it is first proposed here, is there any proof of this hypothesis?
- The context modeling section should be the most important part of this article, but the description is not clear enough. It is recommended to add a figure to illustrate this process.
- Can you explain how RAC can avoid the lazy agent problem that may exist in the multi-agent learning process? If not, can you briefly introduce how the RAC method could be improved in the future to alleviate this problem?
- The legend in the first row and first column of Figure 2 is redundant.
Questions
See the section "Strengths and Weaknesses". If the author can respond well to my questions, I am willing to modify the score.
Limitations
Yes.
Final Justification
The author supplemented the missing specific definitions and related experiments. So my final justification is "Borderline accept".
Formatting Issues
No.
1. Response to Question 1.
The non-stationarity arises from the evolving policies of the other agents. From the perspective of each agent, its local task dynamics are determined by the other agents' joint policy, as stated in Eq. (1) of the main text. In this work, we associate each context with a unique joint policy of the other agents, and accordingly formalize the task, as locally perceived by each agent, as a contextual Markov Decision Process (CMDP). A CMDP comprises a family of MDPs that share the same state and local action spaces but differ in their state transition and reward functions, which are determined by the associated contexts. Accordingly, the non-stationarity induced by changes in the other agents' policies can be attributed to context switching.
Furthermore, the influence of the other agents' policies on a given agent is directly reflected in the rewards it receives. In this work, we focus on case (2), where all agents keep their policies fixed within each episode and change them only between episodes. Under this setting, for a fixed joint policy, the training episodic return can be regarded as a representation of that joint policy (and, by extension, of the other agents' joint policy when the agent's own local action or policy is known). This perspective is also adopted in related works [1]. In such a case, the training episodic returns are naturally associated with specific joint policies of the other agents, and thus we refer to them as contexts.
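For clarity, below is a sketch of the local dynamics induced for one agent when the other agents follow a fixed joint policy indexed by a context; the notation is ours and may differ from Eq. (1) in the main text.

```latex
% Local transition and reward functions perceived by agent i when the other
% agents follow a fixed joint policy \pi^{-i} indexed by context c
% (illustrative notation; may differ from Eq. (1) of the paper).
P_{c}^{i}(s' \mid s, a^{i}) = \sum_{a^{-i}} \pi^{-i}(a^{-i} \mid s)\, P(s' \mid s, a^{i}, a^{-i}),
\qquad
R_{c}^{i}(s, a^{i}) = \sum_{a^{-i}} \pi^{-i}(a^{-i} \mid s)\, R(s, a^{i}, a^{-i}).
```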
2. Response to Question 2.
As shown in Fig. 2 of the main text, RAC significantly outperforms I2Q and Hysteretic Q-learning on multiple tasks, including predator and prey, 5m_vs_6m, 10m_vs_11m, 3s_vs_4z, 3s_vs_5z, and 2s_vs_1sc. We will extend our evaluation to more benchmarks to better confirm the superior performance of RAC.
3. Response to Question 3.
| | RAC | Hysteretic Q-learning | I2Q | IQL | QMIX | VDN |
|---|---|---|---|---|---|---|
| matrix game (test return) | 8.0±0.0 | 8.0±0.0 | 8.0±0.0 | 6.0±0.0 | 6.0±0.0 | 6.0±0.0 |
| predator and prey (test return) | 34.32±1.35 | 2.48±2.57 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 31.13±1.50 |
| 3s_vs_4z (test win rate) | 0.83±0.06 | 0.69±0.12 | 0.81±0.12 | 0.75±0.14 | 0.98±0.01 | 0.98±0.01 |
| 3s_vs_5z (test win rate) | 0.58±0.06 | 0.30±0.18 | 0.06±0.06 | 0.41±0.08 | 0.79±0.19 | 0.86±0.02 |
| 5m_vs_6m (test win rate) | 0.46±0.04 | 0.29±0.10 | 0.42±0.06 | 0.39±0.05 | 0.62±0.03 | 0.74±0.03 |
| 2s3z (test win rate) | 0.73±0.02 | 0.61±0.13 | 0.61±0.27 | 0.70±0.14 | 0.97±0.02 | 0.93±0.05 |
| 2s_vs_1sc (test win rate) | 0.97±0.04 | 0.97±0.03 | 0.98±0.02 | 0.98±0.03 | 1.0±0.0 | 0.99±0.01 |
Thanks for your valuable suggestions. We evaluate QMIX and VDN on the matrix game, predator and prey, and five SMAC maps, reporting the median asymptotic performance along with the standard error across five random seeds. The results, presented in the table above, show that RAC significantly outperforms both QMIX and VDN on the matrix game and predator and prey, where QMIX and VDN suffer from sub-optimal policies due to their representational limitations regarding the factored global action value function. On the five SMAC maps, RAC exhibits superior performance over the other fully decentralized baselines but underperforms relative to QMIX and VDN. We attribute this to RAC's limited ability to handle partial observability and to adequately cover the entire return space. Future work will focus on enhancing RAC by incorporating belief state representations and integrating more efficient exploration techniques. We will add the comparison results in the revision for a better presentation.
4. Response to Question 4.
We begin by providing a theoretical insight on the relationship between non-stationarity and relative overgeneralization from a context modeling perspective. Non-stationarity arises due to the lack of access to other agents' actions or policies. If a perfect context (i.e., an accurate representation of other agents' joint policies) can be obtained, the context-based value function effectively conditions the current agent’s policy on the policies of the others. In this scenario, the agent is essentially learning a best-response policy, leading to convergence toward a Nash Equilibrium. However, in cooperative games, multiple Nash equilibria may exist, and not all are desirable. Without convergence to a Pareto-optimal solution, agents may fall into suboptimal equilibria, manifesting as the relative overgeneralization problem.
We attribute the inability of existing methods to simultaneously address non-stationarity and relative overgeneralization to two main factors.
(1) For methods under the centralized training with decentralized execution (CTDE) paradigm, this training framework inherently alleviates the non-stationarity but retains relative overgeneralization. This issue is evident in, for example, the representational limitations of factored global action value functions in value decomposition methods [2], or the centralized-decentralized mismatch encountered in multi-agent policy gradient methods [3].
(2) For methods under the fully decentralized learning paradigm, they typically lack the capacity to model other agents' policies (contexts) due to the absence of centralized information. As a result, they primarily focus on addressing relative overgeneralization in isolation, as seen in Distributed Q-learning, Hysteretic Q-learning, lenient learners, and related methods.
5. Response to Question 5.
To the best of our knowledge, this work is the first to formalize the task, as locally perceived by each agent, as a CMDP, thereby attributing the non-stationarity of local task dynamics to switches between unobserved contexts. A theoretical foundation for this formulation could be established by explicitly associating the context with the joint policy of other agents. We leave this theoretical extension for future work.
6. Response to Question 6.
Thanks for your valuable feedback. We will incorporate the context modeling process into a figure in the revision for more clarity.
7. Response to Question 7.
The lazy agent issue refers to the scenario where some agents select non-cooperative actions. In RAC, the individual optimistic marginal value enables agents to accurately identify and select their cooperative actions. As a result, each agent tends to select its local cooperative actions, effectively addressing the lazy agent issue.
8. Response to Question 8.
Thanks for your careful reading. We will address this issue in the revision.
Reference:
[1] Qiu W, Ma X, An B, et al. Rpm: Generalizable multi-agent policies for multi-agent reinforcement learning[C]//The Eleventh International Conference on Learning Representations. 2023.
[2] Rashid T, Farquhar G, Peng B, et al. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning[J]. Advances in neural information processing systems, 2020, 33: 10199-10210.
[3] Wang Y, Han B, Wang T, et al. Dop: Off-policy multi-agent decomposed policy gradients[C]//International conference on learning representations. 2020.
The author's response partially resolved my doubts, so I will increase my rating. However, the biggest shortcoming of this work at present is still the lack of sufficient and comprehensive comparative methods and experiments. It is hoped that the author can improve this in the final version.
Thanks for your valuable feedback. At this stage, we have included comparison results against two representative CTDE methods, namely QMIX and VDN. We will make every effort to further enrich our experiments by incorporating additional comparisons on more cooperative multi-agent tasks (e.g., the modified predator-prey [1], and a broader set of SMAC maps) as well as against more fully decentralized baselines and CTDE methods. These additions will be integrated into the revision to more comprehensively demonstrate the effectiveness of our proposed approach.
Reference:
[1] Son K, Kim D, Kang W J, et al. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]//International conference on machine learning. PMLR, 2019: 5887-5896.
Dear Reviewer GYBm,
During the rebuttal period, we have provided detailed responses to your concerns regarding:
(1) the rationale behind representing contexts using episodic returns and attributing non-stationarity to context switches;
(2) the performance against existing baselines (Hysteretic Q-learning, I2Q) and additional CTDE methods (i.e., add comparison results against QMIX and VDN);
(3) the discussion on why existing methods struggle to simultaneously address non-stationarity and relative overgeneralization;
(4) the clarifications on our contribution (i.e., the first work to attribute non-stationarity to context switching) and details about context modeling (i.e., include an illustrative figure);
(5) the advantage of our method in addressing the lazy agent issue.
We sincerely hope that the current responses resolve your concerns. We look forward to your response if you have any further concerns about this work.
Best regards,
The Authors
This paper, "In-Context Fully Decentralized Cooperative Multi-Agent Reinforcement Learning," proposes a method called Return-Aware Context (RAC) to tackle two big problems in decentralized multi-agent reinforcement learning (MARL): non-stationarity and relative overgeneralization. Basically, when agents can't see what other agents are doing, their local world keeps changing (non-stationarity), and they might make bad joint decisions because of over-optimism or bad estimates (relative overgeneralization).
RAC tries to fix this by treating each agent's local task as a contextual MDP, where "contexts" represent different joint policies of other agents. The clever bit is using the overall episodic return (like the total score for a game) to figure out what context they're in. This helps agents learn a context-specific value function to deal with the changing environment. They also introduce an "individual optimistic marginal value" to push agents towards better joint actions, hopefully avoiding that overgeneralization issue. The paper assumes that these joint policies only change between episodes, not during them. They tested RAC on simple games, predator-prey, and StarCraft II, showing it often works better than some other decentralized methods.
Strengths and Weaknesses
Strengths
- Tries to solve two important problems at once: Many decentralized MARL methods only focus on non-stationarity OR relative overgeneralization separately.
- Context from returns is neat: Using episodic returns to define a "context" for other agents' behavior is a pretty smart way to get around not knowing their actions. It's a pragmatic solution for truly decentralized learning.
- Some good empirical results: On several tasks, especially some SMAC maps, RAC shows noticeably better performance than other fully decentralized baselines. This suggests it can indeed help in some settings.
Weaknesses
- Big Assumption: Policies only change between episodes. The paper explicitly states that the method assumes "joint policy changes occur only between episodes". While they say this is common in some MARL, it’s a massive simplification of how agents might learn or react in real time. If other agents' behaviors do shift mid-episode (even if their learning algorithms only update at the end), RAC's fixed context for that episode seems fundamentally unable to capture that dynamism.
- Reactive Context: After the fact. The context (episodic return) is only known after an episode finishes. This means that during an episode, agents are flying blind regarding the actual context. They rely on a separately learned value function to guide action selection, which then tries to "imitate" the context-aware one. This indirect approach feels clunky and might limit how quickly agents can adapt to subtle shifts. If the "true" context is always a step behind, how much can it really help with real-time non-stationarity?
- Fragile context learning, especially early on. The paper admits that learning the context-based value function needs "comprehensive coverage across the entire return space," which is "unavailable during the early training process and leads to sub-optimal outcomes". This suggests the context modeling itself might be quite brittle without enough diverse experience. The introduction of the separate value function is a band-aid for this, adding complexity rather than solving the core issue of learning robust contexts.
- Why not universally better? RAC doesn't win on all SMAC maps; it performs "comparably to IQL" on some. They hypothesize that these maps pose "fewer challenges from non-stationarity and relative overgeneralization". If that's the case, it raises questions about RAC's broader utility if its benefits are only seen when these specific problems are severe. Also, I2Q sometimes beats RAC. It suggests there isn't a clear, universal win for RAC, and its advantages are task-dependent.
- Hyperparameter tuning nightmare? The number of discretization intervals and the scaling factor for the auxiliary loss are key hyperparameters, and the ablation studies show RAC is sensitive to them. Finding the right values for these will likely be a pain for new environments, and the paper doesn't offer much guidance beyond "appropriate selection".
Questions
- The "Between Episodes" Straitjacket: You lean heavily on the assumption that other agents' policies only change between episodes. But agents are constantly exploring and updating with stochastic policies, even if their parameters only officially change at episode end. This means their effective policy can drift within an episode. How would RAC handle a scenario where, for example, an agent's exploration strategy changes mid-episode, subtly shifting its local dynamics for others? Does the current single, episode-wide context become stale in such cases? It seems like this assumption significantly limits the true "non-stationarity" it can tackle.
- Dynamic Contexts and Return Bounds: Your context relies on discretizing episodic returns between fixed lower and upper return bounds. How are these bounds determined, especially at the start of training when returns might be wildly unpredictable? Are they fixed or adaptive? If they're fixed and the actual returns go far outside those initial bounds, how does RAC cope? This could mess up context assignment and value learning. Did you consider more adaptive ways to define contexts, maybe using online clustering of returns or something that doesn't rely on fixed bounds?
- The Dependency on a Separate Value Function: You introduced a separately learned value function because learning the context-based one directly "necessitates a comprehensive coverage across the entire return space, which is unavailable during the early training process". This sounds like the context-based value function alone struggles to learn effectively. Could you elaborate more on why it can't learn on its own early on? And if the separate value function is so crucial for "informative transitions," doesn't that suggest a fundamental weakness in the context-based function's ability to drive exploration and learning? What happens if the weighting parameter (line 241) is set poorly, hindering the separate function's ability to "imitate" the context-based one?
- Beyond Decentralized Baselines: You compare against other fully decentralized methods. While that's fair, the introduction briefly mentions centralized training with decentralized execution (CTDE) as a common paradigm. Since CTDE often achieves higher performance by using global information during training, it would be valuable to see how RAC stacks up against a well-known CTDE method (like QMIX or VDN) on one or two tasks. This would really highlight the performance cost (or benefit) of staying fully decentralized and help contextualize RAC's contributions in the broader MARL landscape.
Limitations
The authors are upfront about two limitations, which is good:
- Exploration: They acknowledge that comprehensive coverage of the return space is needed, and that a separate value function is needed because the context-based one performs poorly in complex tasks without it. They suggest better exploration techniques as future work. This is a reasonable point.
- Context Changes Within Episode: They explicitly state that RAC only handles contexts changing between episodes, not within time steps. They plan to model local task dynamics over time intervals for future work. This is a major limitation for real-world application, as agent policies often evolve more frequently.
One key limitation they didn't really dwell on, but which is evident, is the sensitivity and arbitrary nature of the episodic return discretization (the number of discretization intervals). It's a crucial part of how contexts are formed, yet its optimal value seems highly task-dependent, and there's no clear guidance on how to set it. This could make RAC quite brittle and challenging to apply to new problems without extensive hyperparameter tuning. They should consider this as another significant area for future work, perhaps exploring adaptive or more principled context formation.
Final Justification
My rating is now a 4 (Borderline Accept).
The authors' rebuttal was very constructive and helped resolve most of my primary concerns. The detailed explanations of their ongoing work on step-wise non-stationarity and adaptive context modeling show a clear path forward for the method's limitations. The additional experimental results, including comparisons with QMIX and VDN, provided a more complete picture of RAC's performance within the broader MARL landscape.
However, some limitations still exist. The core paper's reliance on the "policies change only between episodes" assumption is a significant constraint, even if it's being addressed in future work. While the authors' new results on adaptive context modeling are promising, they are preliminary and not part of the core submission. The complexity of tuning the number of context bins and the supervision weight also remains a practical challenge that wasn't fully resolved.
I'm raising my score because the authors demonstrated a deep understanding of my concerns and provided strong evidence that the most critical issues are being actively researched. The work, as presented with the clarifications and promises of revision, is technically solid enough to warrant acceptance, but the remaining limitations prevent a higher score.
Formatting Issues
There are no major formatting issues in this paper.
We begin by revisiting two key insights underlying our work. (1) We aim to mitigate non-stationarity during training (i.e., in value function updates) by shaping contexts based on episodic returns, under the assumption that other agents update / change their policies only between episodes (that is, the case 2 setting, where context changes between episodes). This assumption allows us to focus exclusively on episode-wise non-stationarity. (2) During execution (i.e., in value function estimation), we address the issue of relative overgeneralization, thereby enabling agents to select their cooperative actions. Together, these two insights form the foundation of our approach.
1. Response to Weakness 1 and Question 1.
We develop RAC under the assumption that contexts change only between episodes, where agents keep their behavior (exploration) policies fixed within each episode and update their evaluation policies only between episodes. To follow this assumption, for agents with ε-greedy policies, we keep ε fixed during each episode and decrease it between episodes, as described in Lines 244-247 of the main text.
We begin with this simplified case to creatively develop an in-context fully decentralized cooperative multi-agent reinforcement learning (MARL) framework. For the more general case where contexts (other agents' policies) change between time steps, we acknowledge that RAC, in its current form, is not equipped to handle such dynamics. Extending RAC to address this more challenging scenario is the subject of our ongoing work.
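For concreteness, a minimal sketch of the episode-wise exploration schedule described above is given below; the linear decay shape and the constants are illustrative assumptions, not the settings used in the paper.

```python
def epsilon_for_episode(episode, eps_start=1.0, eps_end=0.05, decay_episodes=5000):
    """Episode-wise epsilon schedule: the returned value is used unchanged for
    every step of the given episode, so the behavior policy stays fixed within
    an episode (case 2) and only changes between episodes.
    Constants are illustrative, not the paper's settings."""
    frac = min(episode / decay_episodes, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```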
2. Response to Weakness 2.
As stated at the outset, during execution, we focus on addressing relative overgeneralization using the per-agent individual optimistic marginal value. To support this, we learn a separate value function to achieve a coarse coverage of the return space, thereby enhancing the learning of the context-based value function, which in turn facilitates the computation of the optimistic marginal value. Based on the assumption that other agents update policies only between episodes, we address episode-wise non-stationarity during training by learning the context-based value function. In contrast, we do not need to consider real-time, step-wise non-stationarity during execution; instead, our primary objective at execution time is to guide agents toward cooperative action selection using the context-based value function or its imitator, the separately learned value function.
3. Response to Weakness 3.
We agree with your argument. For cooperative tasks of moderate complexity, RAC using the context-based value function directly to select actions achieves superior performance, as demonstrated by the RAC_w_ variant in Sec. 5.2. For more complex tasks, such as SMAC, we propose learning a separate value function as a coarse means to cover the return space. A more straightforward alternative would be to enhance RAC with efficient exploration techniques, for example, by separately learning an exploratory joint policy for all agents to generate informative transitions, a strategy commonly adopted in the single-agent RL literature. We leave this extension for future work.
4. Response to Weakness 4.
RAC is specifically designed to simultaneously address both non-stationarity and relative overgeneralization. When these two issues are relatively mild, RAC is expected to perform comparably to baselines such as IQL, as evidenced by their similar performance on certain maps. However, when these challenges become more pronounced, the performance gap between RAC and other baselines widens, as demonstrated by RAC's superior results on tasks such as the matrix game, predator and prey, 5m_vs_6m, 10m_vs_11m, 3s_vs_4z, and 3s_vs_5z.
Regarding scenarios where I2Q outperforms RAC, it is important to note that I2Q and RAC address both non-stationarity and relative overgeneralization from different perspectives. Each method exhibits distinct adaptability to different tasks, and it is not guaranteed that RAC will be optimal across all tasks.
5. Response to Weakness 5.
As noted in our Response to Weakness 3, a more straightforward approach to covering the return space would be to equip RAC with efficient exploration techniques, such as separately learning an exploratory joint policy for all agents. This approach could eliminate the need to tune the additional supervision-weight hyper-parameter, and we plan to investigate it in the next stage of our work. For the remaining hyper-parameter, the number of context bins, we recommend determining an approximate value through a grid-search procedure.
6. Response to Question 2.
In this work, we determine the lower and upper bounds of the return range based on prior knowledge of the task. We have also considered incremental context modeling; that is, as the return space covered by RAC expands, the context space should be expanded accordingly. As the reviewer suggested, when employing online clustering of returns to construct contexts, it is important to ensure that the same return interval consistently maps to the same context across multiple clustering processes. We plan to further explore approaches for adaptively defining contexts in the next stage of our work.
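One hypothetical form of such adaptive context modeling, using quantiles of the returns observed so far instead of fixed bounds, is sketched below; this is not the mechanism used in the paper, only an illustration of the direction discussed above.

```python
import numpy as np

def quantile_context(observed_returns, episodic_return, num_bins):
    """Hypothetical adaptive alternative to fixed return bounds: bin edges are
    the empirical quantiles of returns seen so far, so the context assignment
    tracks the return distribution as training expands its coverage.
    (Not the mechanism used in the paper.)"""
    edges = np.quantile(observed_returns, np.linspace(0.0, 1.0, num_bins + 1)[1:-1])
    return int(np.searchsorted(edges, episodic_return))
```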
7. Response to Question 3.
The context-based value function is conditioned on the state, the agent's local action, and the context instantiated as an episodic return interval. During the early training process, coordinated behaviors among agents are limited, and the achieved returns tend to fall within a low-value region (corresponding to low-return contexts). Consequently, when computing the individual optimistic marginal value, the maximization over contexts effectively ranges only over these low-return contexts, which in turn correspond to sub-optimal actions yielding low episodic returns. In summary, accurately learning the context-based value function necessitates comprehensive coverage of the return space. We plan to achieve this by equipping RAC with efficient exploration techniques in the next stage of our work.
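In symbols, one way to write the quantity discussed above is the following sketch, where the maximization runs over the discrete set of contexts; the notation is ours and may differ from that of the paper.

```latex
% Individual optimistic marginal value as a maximum of the context-based
% value over contexts (illustrative notation; may differ from the paper).
Q^{i}_{\mathrm{opt}}(s, a^{i}) \;=\; \max_{c \in \mathcal{C}} \, Q^{i}_{\mathrm{ctx}}(s, a^{i}, c).
```

Early in training, only the low-return elements of the context set appear in the data, so the maximum is effectively taken over a small, low-value subset of contexts.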
In this work, the learning processes of the separately learned value function and the context-based value function form a positive feedback loop: the former generates informative transitions to ensure effective updates of the latter, while the latter guides the former in selecting per-agent local cooperative actions to produce more informative transitions. This mutual reinforcement facilitates efficient, fully decentralized cooperative policy learning. However, if the supervision weight is poorly set, such mutual reinforcement may be disrupted. As depicted in Fig. 3(e), on 5m_vs_6m, a poorly chosen weight causes the separately learned value function to overly imitate the poorly trained context-based value function during the early training process, which in turn fails to generate informative transitions. Consequently, RAC fails to learn an efficient policy and suffers from poor performance.
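One plausible instantiation of the supervision term in question, assuming Boltzmann policies induced from the two value functions, a weighting coefficient, and a temperature, is sketched below; the exact form, the direction of the KL, and the symbols used in the paper may differ.

```latex
% A plausible form of the weighted supervision loss (illustrative only;
% symbols \lambda and \tau are assumptions, not the paper's notation):
% the Boltzmann policy induced by the separately learned Q^i imitates the
% Boltzmann policy induced by the optimistic marginal value Q^i_opt.
\mathcal{L}^{i} \;=\; \mathcal{L}_{\mathrm{TD}}\!\left(Q^{i}\right)
  \;+\; \lambda \,
  \mathrm{KL}\!\left(
    \mathrm{softmax}\!\big(Q^{i}_{\mathrm{opt}}(s, \cdot)/\tau\big)
    \;\Big\|\;
    \mathrm{softmax}\!\big(Q^{i}(s, \cdot)/\tau\big)
  \right).
```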
8. Response to Question 4.
| | RAC | Hysteretic Q | I2Q | IQL | QMIX | VDN |
|---|---|---|---|---|---|---|
| matrix game (test return) | 8.0±0.0 | 8.0±0.0 | 8.0±0.0 | 6.0±0.0 | 6.0±0.0 | 6.0±0.0 |
| predator and prey (test return) | 34.32±1.35 | 2.48±2.57 | 0.0±0.0 | 0.0±0.0 | 0.0±0.0 | 31.13±1.50 |
| 3s_vs_4z (test win rate) | 0.83±0.06 | 0.69±0.12 | 0.81±0.12 | 0.75±0.14 | 0.98±0.01 | 0.98±0.01 |
| 3s_vs_5z (test win rate) | 0.58±0.06 | 0.30±0.18 | 0.06±0.06 | 0.41±0.08 | 0.79±0.19 | 0.86±0.02 |
| 5m_vs_6m (test win rate) | 0.46±0.04 | 0.29±0.10 | 0.42±0.06 | 0.39±0.05 | 0.62±0.03 | 0.74±0.03 |
| 2s3z (test win rate) | 0.73±0.02 | 0.61±0.13 | 0.61±0.27 | 0.70±0.14 | 0.97±0.02 | 0.93±0.05 |
| 2s_vs_1sc (test win rate) | 0.97±0.04 | 0.97±0.03 | 0.98±0.02 | 0.98±0.03 | 1.0±0.0 | 0.99±0.01 |
Thanks for your valuable suggestions. We evaluate QMIX and VDN on the matrix game, predator and prey, and five SMAC maps, reporting the median asymptotic performance along with the standard error across five random seeds. The results, presented in the table above, show that RAC significantly outperforms both QMIX and VDN on the matrix game and predator and prey, where QMIX and VDN suffer from sub-optimal policies due to their representational limitations regarding the factored global action value function. On the five SMAC maps, RAC exhibits superior performance over the other fully decentralized baselines but underperforms relative to QMIX and VDN. We attribute this to RAC's limited ability to handle partial observability and to adequately cover the entire return space. Future work will focus on enhancing RAC by incorporating belief state representations and integrating more efficient exploration techniques. We will add the comparison results in the revision for a better presentation.
9. Response to Limitation.
Thanks for your valuable advice. We will place emphasis on the following directions in our future work: (1) developing adaptive context modeling, (2) ensuring comprehensive coverage of the return space through efficient exploration, and (3) extending to the more general setting in which contexts (other agents' policies) change between time steps.
Thanks for the authors' detailed response. It clarifies many aspects of the proposed method. However, some of my key concerns remain unaddressed.
- Step-wise Non-stationarity: The rebuttal confirms that RAC is designed for the simplified case where policies change only between episodes. It acknowledges that the current framework cannot handle the more general, and arguably more common, case of step-wise non-stationarity. While I appreciate the honesty, this is a major limitation that restricts the method's applicability. Acknowledging it as future work doesn't solve the problem for this paper. The majority of Q-learning based MARL methods can perform step-wise updates (e.g., with TD learning) and grapple with this issue directly. The lack of a solution for this common setting remains a weakness.
- The Role of the Separately Learned Value Function: The explanation for why learning a separate value function achieves "a coarse coverage of the return space" is still vague. The response suggests that it generates "informative transitions," which then help train the context-conditioned value function. This sounds like a bootstrap mechanism, but it's not clear why a separate value function is uniquely suited for this task. A more detailed explanation is needed. For example, does it use a different exploration strategy or learning rule that makes it more effective at covering the state-action space early on? The current explanation feels like a high-level description of a positive feedback loop rather than a technical justification.
- "Incremental Context Modeling": The authors mentioned they considered "incremental context modeling" but didn't provide any specifics on how this was done. Was there an actual implementation and analysis? Without a concrete description of the approach and some form of quantitative analysis (e.g., results of an ablation study or a comparison to the static discretization method), this claim is difficult to evaluate. I would need more details to understand if this is a viable solution to the problem of fixed / bounds and limited return coverage.
Dear reviewer KFLW,
We deeply appreciate the effort and time you have devoted to providing us with your valuable feedback. We have provided detailed responses to address your remaining concerns and questions.
As the discussion period concludes in approximately 48 hours, we would be grateful to know if you have any additional concerns or suggestions regarding our work. If possible, we would welcome the opportunity to engage in further technical discussion to clarify any remaining issues and, hopefully, earn your endorsement.
Best regards,
The authors
Dear Reviewer KFLW,
During the rebuttal period, we have provided detailed responses to your concerns regarding:
(1) the assumption that contexts change between episodes;
(2) the role of episodic return-based contexts in addressing episode-wise non-stationarity;
(3) the rationale behind separately learning an additional value function;
(4) the performance against other baselines and additional CTDE methods (i.e., added comparison results with QMIX and VDN);
(5) the further extensions involving adaptive context modeling and simplified hyper-parameter tuning.
We sincerely hope that the current responses resolve your concerns. We look forward to your response if you have any further concerns about this work.
Best regards,
The Authors
Response to Q2.
We appreciate the reviewer's insightful question and acknowledge that our previous explanation may have been too high-level, leading to potential misunderstandings. As discussed in our Response to Question 3, the context-conditioned value function is conditioned on the state, agent i's local action, and the context instantiated as the episodic return interval. During the early training process, coordinated behaviors among agents are limited, and the achieved returns tend to fall within a low-value region (i.e., the contexts corresponding to low returns). Consequently, when computing the individual optimistic marginal value, the maximization is effectively taken only over values associated with low-return contexts, which in turn correspond to sub-optimal actions yielding low episodic returns. In summary, accurately learning the optimistic marginal value necessitates comprehensive coverage of the return space.
For tasks with moderate complexity, directly selecting actions according to the individual optimistic marginal value for each agent yields superior performance, as stated in our Response to Weakness 3. For more complex tasks such as SMAC maps, we empirically find that separately learning an additional value function and optimizing it with the standard IQL value loss and the supervision loss leads to satisfactory performance.
Importantly, we clarify that the inclusion of this separately learned value function is not a unique choice, and other alternatives are viable. In the context of SMAC maps, its ability to aid in return-space coverage can be attributed to the following two main factors:
(1) The competitive performance of vanilla IQL on SMAC maps, which makes IQL a reasonable candidate for generating informative transitions.
(2) The incorporation of additional supervision signals, such as the proposed supervision loss, through which the context-conditioned value function helps guide the separately learned value function in identifying and selecting agents' local cooperative actions, improving its effectiveness in supporting exploration and learning.
From perspective (1), the separately learned value function could be replaced by other RL algorithms that demonstrate strong performance. From perspective (2), its exploration capability could be further enhanced using intrinsic objectives (e.g., curiosity-based rewards) or alternative supervision signals derived from the context-conditioned value function to encourage broader or more targeted exploration.
We sincerely thank the reviewer for prompting a deeper discussion on this topic. We will incorporate this elaboration into the revision to better communicate the flexibility and extensibility of our framework.
Thanks for your timely feedback. Our responses are detailed as follows.
Response to Q1.
We would like to emphasize that each work is typically characterized by a critical innovation. The primary contribution of this work lies in the introduction of in-context fully decentralized cooperative MARL, a unified framework for addressing both non-stationarity and relative overgeneralization. The current focus on Case (2), where contexts (i.e., agents' policies) change only between episodes, serves as a temporary simplification. This treatment also underscores the potential of Case (1) as a promising direction for future research.
To further advance this line of research, we are actively working on extending the framework to Case (1), where contexts change between time steps. Our ongoing work effectively addresses this scenario through dynamics-aware context modeling. Specifically, from the local perspective of each agent, we consider its perceived local transition and reward dynamics, where the context is associated with the real-time joint policy of other agents at each time step. As policies evolve over time, both the context and the induced local task dynamics change. Accordingly, we propose to represent the contexts by the real-time task dynamics distribution.
Under Case (1), the step-wise changing local task dynamics distribution parallels the setting of concept drift, where the underlying data distribution evolves over time. To address such shifts, maintaining a sliding window that holds the latest data within the data stream has proven effective in capturing the real-time data distribution, ensuring model adaptability and accuracy in dynamic settings [1]. Accordingly, we propose to maintain a sliding window alongside each agent's local trajectory, and to model the task dynamics distribution using latent variables based on the transitions within this sliding window. This allows us to integrate dynamics-aware context modeling within the same unified framework proposed in this work, enabling the learning of context-based value functions that address step-wise non-stationarity.
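A minimal sketch of this sliding-window idea is given below; the GRU encoder and all names are illustrative assumptions rather than a finalized design:

```python
from collections import deque
import torch
import torch.nn as nn

class SlidingWindowContext(nn.Module):
    """Sketch: encode the most recent local transitions of one agent into a
    latent context that tracks its step-wise changing local task dynamics."""
    def __init__(self, obs_dim, act_dim, latent_dim=16, window_size=32):
        super().__init__()
        self.buffer = deque(maxlen=window_size)
        # Each transition is (obs, one-hot action, reward, next obs).
        self.encoder = nn.GRU(input_size=2 * obs_dim + act_dim + 1,
                              hidden_size=latent_dim, batch_first=True)

    def push(self, obs, act_onehot, reward, next_obs):
        self.buffer.append(torch.cat([obs, act_onehot,
                                      torch.tensor([reward], dtype=obs.dtype), next_obs]))

    def context(self):
        if not self.buffer:
            return None
        seq = torch.stack(list(self.buffer)).unsqueeze(0)   # (1, W, feature_dim)
        _, h = self.encoder(seq)
        return h.squeeze(0).squeeze(0)                      # latent context vector
```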
This ongoing work directly corresponds to the future work outlined in Lines 336-338 of the main text (Sec. 6, Conclusion: Limitation and Future Work):
Second, RAC currently focuses only on the scenario where context changes between episodes (case (2)). To address case (1), where context changes between time steps, we plan to construct contexts by modeling the local task dynamics distribution over different time intervals. We leave them as our future works
We also present preliminary empirical results to validate the proposed method on the matrix game and 5m_vs_6m. As shown in the table below, the proposed method succeeds in selecting the optimal joint action (achieving the highest return of 8) on the matrix game, and outperforms baselines on 5m_vs_6m, demonstrating its potential effectiveness.
| | Ours (Ongoing Work) | Hysteretic Q-learning | I2Q | IQL |
|---|---|---|---|---|
| matrix game (test return) | 8.0±0.0 | 8.0±0.0 | 8.0±0.0 | 6.0±0.0 |
| 5m_vs_6m (test win rate) | 0.52±0.07 | 0.29±0.10 | 0.42±0.06 | 0.39±0.05 |
In summary, we sincerely appreciate the reviewer's insightful comments, which inspired us to broaden the scope of our work to address both Case (1) and Case (2). We believe these two works form a strong foundation for the proposed in-context fully decentralized cooperative MARL framework. Further investigations into addressing step-wise non-stationarity remain a promising and worthwhile direction for future research.
Reference
[1] Lu J, Liu A, Dong F, et al. Learning under concept drift: A review[J]. IEEE transactions on knowledge and data engineering, 2018, 31(12): 2346-2363.
Response to Q3.
We apologize for the confusion caused by our earlier statement. In this work, we accurately determine the lower and upper return bounds based on prior knowledge of the task. As such, there is no scenario where the actual returns go far outside these initial bounds.
The notion of incremental context modeling is part of our ongoing exploration into extending the proposed framework and is not implemented in the current version of the work. Nevertheless, we appreciate the reviewer's interest and would like to share some preliminary ideas related to this direction.
We identify two key challenges:
(1) How to adaptively identify the lower and upper return bounds?
(2) How to ensure consistent one-hot context representations when using an adaptive discretization of the return space across time? That is, we should ensure the same one-hot context representation consistently maps to the same return value interval across multiple adaptive modeling processes of the return bounds.
To address (1), we consider a simple adaptive approach where the bounds are maintained as the minimum and maximum returns observed so far. Based on these values, we dynamically partition the return space into the fixed number of context intervals. However, this leads to challenge (2): as the return bounds evolve over time, the semantic meaning of each one-hot encoded context (i.e., return interval) may change, violating consistency and potentially confusing the learning process.
To address this issue, we propose a warm-up phase during which only the minimum and maximum returns are collected over an initial number of time steps, without updating the context-conditioned value function or computing the supervision loss. After this initialization, the return space is discretized into fixed intervals, and any out-of-bound returns are assigned to the first or last interval. Only then do we begin training and computing the supervision loss to guide the separately learned value function.
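A minimal sketch of this warm-up scheme (hypothetical class and parameter names; the one-hot context is only returned once the bounds are frozen) could look as follows:

```python
import numpy as np

class AdaptiveReturnContext:
    """Sketch of Adaptive_RAC's context construction: track return bounds during
    a warm-up period, then freeze a fixed discretization; out-of-bound returns
    are clipped into the first or last interval."""
    def __init__(self, n_bins, warmup_steps):
        self.n_bins, self.warmup_steps = n_bins, warmup_steps
        self.steps = 0
        self.r_min, self.r_max = np.inf, -np.inf
        self.edges = None                                  # frozen after warm-up

    def observe(self, episodic_return, episode_len):
        self.steps += episode_len
        if self.edges is None:                             # still warming up
            self.r_min = min(self.r_min, episodic_return)
            self.r_max = max(self.r_max, episodic_return)
            if self.steps >= self.warmup_steps and self.r_max > self.r_min:
                self.edges = np.linspace(self.r_min, self.r_max, self.n_bins + 1)

    def context(self, episodic_return):
        if self.edges is None:
            return None                                    # no supervision loss yet
        idx = np.digitize(episodic_return, self.edges) - 1
        idx = int(np.clip(idx, 0, self.n_bins - 1))        # clip out-of-bound returns
        one_hot = np.zeros(self.n_bins)
        one_hot[idx] = 1.0
        return one_hot
```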
We refer to this approach as Adaptive_RAC, and evaluate its performance on the matrix game and 5m_vs_6m. In the matrix game, an appropriately chosen warm-up duration yields the best performance, where Adaptive_RAC succeeds in achieving the highest reward of 8. On 5m_vs_6m, we experiment with warm-up durations of 50000, 150000, 250000, 350000, 450000, and 500000 time steps. We report the corresponding test win rates as well as the recorded return bounds under these configurations. The results, summarized in the table below, demonstrate that a warm-up duration of 350000 steps or more accurately finds the maximum episodic return bound and yields satisfactory performance. In contrast, smaller warm-up durations result in sub-optimal policies and inaccurate return bounds.
We will include Adaptive_RAC as a baseline in the revision, and provide comparative results on the remaining tasks to further validate and enrich our contributions.
| Warm-up steps | 50000 | 150000 | 250000 | 350000 | 450000 | 500000 |
|---|---|---|---|---|---|---|
| Test Win Rate | 0.29 | 0.32 | 0.30 | 0.45 | 0.47 | 0.49 |
| Recorded max / min return | 8.52 / 0 | 11.02 / 0 | 11.96 / 0 | 20 / 0 | 20 / 0 | 20 / 0 |
We note, however, that the effectiveness of this approach still depends on the return diversity observed during the warm-up phase. Inadequate coverage during this phase may lead to an incomplete context space, as also discussed in our Response to Q2. This limitation highlights the need for future work on both efficient return-space exploration and more flexible context representations, as advised by the reviewer.
Following the reviewer's valuable suggestion, we will place emphasis on the following directions in the revision to inspire further research within the proposed in-context fully decentralized cooperative MARL framework: (1) developing adaptive context modeling, (2) ensuring comprehensive coverage of the return space through efficient exploration, and (3) extending to the more general setting in which contexts (other agents' policies) change between time steps.
We are grateful for the reviewer's constructive feedback, which motivates important future extensions of our work.
Thanks for the detailed response. The added explanations and preliminary analysis for the ongoing work and adaptive modeling are very helpful and address most of my initial concerns. I appreciate the authors' proactive approach in considering these extensions. If the proposed revisions and new experimental results are included in the final version, this work will be a strong contribution to the decentralized cooperative MARL community. I will raise my score accordingly.
Thank you for your valuable comments, which have significantly contributed to the improvement of our paper. At this stage, we have included the following key updates:
(1) Inclusion of comparison results against two representative CTDE methods, namely QMIX and VDN.
(2) Clarification of RAC's episodic return-based contexts in addressing episode-wise non-stationarity (corresponding to Case (2), where contexts change only between episodes).
(3) Detailed explanation of the separate learning of an additional value function and discussion of viable alternatives.
(4) Analysis of potential extensions to handle step-wise non-stationarity (corresponding to Case (1), where contexts change between time steps).
(5) Preliminary experimental results for RAC with adaptive context modeling (referred to as Adaptive_RAC in our Response to Q3).
We will further enhance our work in the revision by incorporating the following primary updates:
(6) Inclusion of Adaptive_RAC as a baseline, along with its empirical results across all testbeds.
(7) Additional comparisons on more cooperative multi-agent tasks (e.g., the modified predator-prey [1], and a broader set of SMAC maps), as well as against more fully decentralized baselines and CTDE methods.
These additions will be integrated into the revision to further strengthen and enrich our work.
Reference
[1] Son K, Kim D, Kang W J, et al. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]//International conference on machine learning. PMLR, 2019: 5887-5896.
Hi Reviewers,
Thank you for your efforts in providing valuable reviews. Since the authors have provided thorough responses, we do need your participation in the discussion period to ensure the quality of the review process.
Please read and respond to the author's feedback at your earliest convenience.
Thanks
Your AC.
Dear Reviewers:
Thank you for your valuable comments, which have greatly contributed to improving our paper. At this stage, we have included the following key updates:
(1) Inclusion of comparison results against two representative CTDE methods, namely QMIX and VDN. (Reviewers KFLW, GYBm, Fo1D)
(2) Clarification of RAC's episodic return-based contexts in representing the agents' joint policies and addressing episode-wise non-stationarity. (Reviewers KFLW, Fo1D, 7ZBz)
(3) Detailed explanation of the separate learning of an additional value function and discussion of viable alternatives. (Reviewers KFLW, Fo1D)
(4) Analysis of potential extensions to handle step-wise non-stationarity. (Reviewers KFLW, 7ZBz)
(5) Preliminary experimental results for RAC with adaptive context modeling (referred to as Adaptive_RAC). (Reviewers KFLW, 7ZBz)
(6) Clarification of representing contexts using episodic returns and attributing non-stationarity to context switches. (Reviewer GYBm)
(7) Discussion on why existing methods struggle to simultaneously address non-stationarity and relative overgeneralization. (Reviewer GYBm)
(8) Clarifications of our contribution (i.e., the first work to attribute non-stationarity to context switching) and additional details about context modeling, including an illustrative figure. (Reviewer GYBm)
(9) Discussion of reducing computational complexity when facing a large context space and of accurately assessing performance degradation caused by relative overgeneralization or sub-optimal policy learning. (Reviewer Fo1D)
We will further enhance our work in the revision by incorporating the following primary updates:
(10) Inclusion of Adaptive_RAC as a baseline, along with its empirical results across all testbeds. (Reviewers KFLW, 7ZBz)
(11) Additional comparisons on more cooperative multi-agent tasks (e.g., the modified predator-prey, and a broader set of SMAC maps), as well as against more fully decentralized baselines and CTDE methods. (Reviewers KFLW, GYBm)
These additions will be integrated into the revision to further strengthen and enrich our work.
Best regards,
The authors
This work investigates fully decentralized training in multi-agent reinforcement learning. The proposed method formulates the environment as well as the joint policy of other agents as contexts and learns context-conditioned value function for each agent. The proposed method outperforms existing decentralized training methods across different benchmarks.
Strengths:
- The studied question, decentralized learning, is under-explored in the domain of MARL
- The proposed solution of modeling the randomness as contexts is novel
- The paper is well-written
Weaknesses:
- Computing the individual optimistic marginal value adds additional computation cost
During the rebuttal session, the authors have conducted additional experiments and analyses to answer the questions and address the concerns of the reviewers. All reviewers selected "borderline accept" for this work. In general, this work could be accepted due to its novelty and theoretical soundness.
Key changes during the rebuttal period include:
- Inclusion of comparison results against two representative CTDE methods, namely QMIX and VDN. (Reviewers KFLW, GYBm, Fo1D)
- Clarification of RAC's episodic return-based contexts in representing the agents' joint policies and addressing episode-wise non-stationarity. (Reviewers KFLW, Fo1D, 7ZBz)
- Detailed explanation of the separate learning of the individual optimistic value and discussion of viable alternatives. (Reviewers KFLW, Fo1D)
- Analysis of potential extensions to handle step-wise non-stationarity. (Reviewers KFLW, 7ZBz)
- Preliminary experimental results for RAC with adaptive context modeling (referred to as Adaptive_RAC). (Reviewers KFLW, 7ZBz)
- Clarification of representing contexts using episodic returns and attributing non-stationarity to context switches. (Reviewer GYBm)
- Discussion on why existing methods struggle to simultaneously address non-stationarity and relative overgeneralization. (Reviewer GYBm)
- Discussion of reducing computational complexity when facing a large context space and of accurately assessing performance degradation caused by relative overgeneralization or sub-optimal policy learning. (Reviewer Fo1D)
In summary, I recommend accepting this work due to the additional experiments and the novelty of the proposed method.