PaperHub
Rating: 6.1/10 · Poster · 4 reviewers
Scores: 2, 3, 4, 4 (min 2, max 4, std 0.8)
ICML 2025

Agent-Centric Actor-Critic for Asynchronous Multi-Agent Reinforcement Learning

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose ACAC, an algorithm designed to enhance learning efficiency in asynchronous MARL by eliminating padding, leading to faster convergence and improved performance.

Abstract

Keywords

Multi-Agent Reinforcement Learning · Asynchronous Multi-Agent Reinforcement Learning · MacDec-POMDP

Reviews and Discussion

Review (Rating: 2)

This paper proposes an Agent-Centric Actor-Critic framework for asynchronous multi-agent reinforcement learning, which includes a module that addresses asynchrony without relying on padding. The proposed module incorporates agent-centric history encoders for independent trajectory processing and an attention-based centralized critic for integrating agent information. Experimental results demonstrate that ACAC outperforms existing methods in macro-action benchmarks.

Questions for Authors

  1. What is the total number of parameters in ACAC compared to the other baselines?

  2. Do the other baselines also incorporate RNNs in their value function?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

N/A

Experimental Design and Analysis

The experimental designs, particularly the ablation studies, raise several concerns.

  1. The ACAC method includes additional encoders and a transformer module, which significantly increases the number of parameters. It is crucial to determine whether the observed performance improvements are attributable to the additional parameters.

  2. The effectiveness of each modification in the proposed method has not been sufficiently verified through ablation studies. For instance, the paper integrates timestep information—an ablation comparing ACAC with and without timestep information, and previous methods with this timestep information, would help clarify its impact. Similarly, applying the modified GAE in a PPO version of Mac-IAICC would help demonstrate its contribution.

  3. The current ablation study is somewhat unclear. For example, the authors appear to examine the negative effect of duplicated observations, but the ACAC-Duplicate condition does not accurately reflect previous methods, as they do not involve encoders or timestep updates. This might only indicate that the encoder mechanism, when poorly applied, leads to negative effects.

  4. The appendix lacks implementation details for the proposed module, which would be beneficial for replicating and understanding the method.

Supplementary Material

Yes.

Relation to Prior Literature

This paper makes a contribution to improving the performance of asynchronous MARL by addressing the padding problem inherent in existing methods. The ACAC method may inspire further advancements in asynchronous MARL algorithms.

Missing Essential References

No.

Other Strengths and Weaknesses

Strengths:

  • The paper is well-written and structured, making it easy to follow.

  • Illustrative figures and graphs effectively enhance understanding of the setting and methodology.

  • The motivation for the work is clear, and the proposed method is concisely presented.

Weaknesses:

  • A key concern is the practical applicability of the setting in MARL. The algorithm requires individually pre-defined macro-actions, which may not be realistic in a multi-agent environment, where the execution of one agent’s macro-action can be disrupted by other agents. For example, one agent’s macro-action may be hindered by others, causing the action to be incomplete or interrupted. The current macro-action framework does not explicitly account for the interactions between agents, which is critical in highly coordinated tasks where agents’ actions are interdependent.

  • The padding problem in asynchronous MARL seems relatively minor, and the proposed modification to the critic input does not appear to be significantly novel.

  • It remains unclear why the proposed method outperforms existing ones. The performance gains could stem from the additional parameters or the incorporation of time embeddings (as discussed in the experimental part above). Without isolating these factors, it is difficult to determine the source of the improvements. Furthermore, the use of self-attention seems unnecessary in this context.

Other Comments or Suggestions

No.

Author Response

We appreciate your insightful comments and would like to clarify our contributions and address concerns as follows:

[Q1] Parameter Comparison

We agree that comparing parameters is important, so we compared the total number of parameters between ACAC and Mac-IAICC. The results are as follows:

  • Overcooked/Rand: ACAC (320k) > Mac-IAICC (241k)
  • Overcooked-Large/Rand: ACAC (874k) < Mac-IAICC (963k)

Crucially, ACAC significantly outperforms Mac-IAICC in Overcooked-Large despite having fewer parameters. This strongly indicates ACAC's performance gains stem from its architectural design (more effective processing of asynchronous information), not just model capacity.

 

[Q2] Use of RNN in baselines

Yes, all macro-action baselines utilize RNNs, as ACAC does.

 

[W1] Practicality of Pre-defined Macro-Actions

Our framework doesn't assume uninterrupted macro-actions. In the environments used (e.g., Overcooked), actions can be interrupted by other agents (e.g., blocking). When this occurs, the action terminates early, allowing the agent to immediately choose a new macro-action based on the situation. This provides reactive adaptation to dynamic interactions. While the environment handles these interruptions, our paper focuses on ACAC's ability to handle the resulting asynchrony in decision points for effective learning in the CTDE framework.

 

[W2] Padding Issue & Critic Novelty

We appreciate the perspective on the padding problem. While padding is simple, our paper argues, and our results show, that its impact is significant in asynchronous MARL, especially in complex tasks. Padding can cause misleading temporal information (it obscures the actual time elapsed between an agent's valid observations, hiding crucial duration information), inaccurate history abstraction (reusing past observations via padding makes it hard to distinguish new from repeated stale information, leading to inaccurate joint history representations), and hindered credit assignment. Our experiments (Figs. 7 and 8) consistently show that ACAC (no padding) outperforms padding-based methods, especially in complex settings (Overcooked-Large), suggesting that addressing padding is crucial, not minor. This motivated ACAC's design to handle asynchronous inputs directly.
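To make the distortion concrete, here is a minimal Python sketch (purely illustrative; the timesteps and observation labels are hypothetical, and this is not the authors' code) contrasting a padded joint history with an agent-centric one:

```python
# Agent A observes at t = 0, 2, 5; agent B at t = 0, 1, 5 (hypothetical).
obs_a = {0: "a0", 2: "a2", 5: "a5"}
obs_b = {0: "b0", 1: "b1", 5: "b5"}
joint_ts = sorted(set(obs_a) | set(obs_b))  # decision points: [0, 1, 2, 5]

def padded_history(obs, ts):
    """Padding-based update: reuse the last observation whenever any agent
    gets a new one, so stale entries look identical to fresh ones."""
    hist, last = [], None
    for t in ts:
        last = obs.get(t, last)
        hist.append(last)
    return hist

def agent_centric_history(obs):
    """Agent-centric update: keep only genuinely new observations, each
    paired with its timestep so elapsed time is preserved."""
    return sorted(obs.items())

print(padded_history(obs_a, joint_ts))   # ['a0', 'a0', 'a2', 'a5'] -- 'a0' duplicated at t=1
print(agent_centric_history(obs_a))      # [(0, 'a0'), (2, 'a2'), (5, 'a5')]
```

The padded history repeats `'a0'` at t=1 with no marker that it is stale, while the agent-centric form keeps only real observations with their timesteps.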

 

[W3] Component Contribution Verification

We agree that ablation studies are crucial, so we have performed ablations for the time embedding ("No TE") and self-attention ("No SA", replaced with an MLP) across the Overcooked and Overcooked-Rand environments. Since the complex Overcooked-Rand-B map showed the clearest and most significant performance differences between configurations, we focus on these results here for brevity. Full results will be included in the revised manuscript. We report the average final performances and standard errors over five random seeds.

Table 3. Ablation study on time embedding (TE) and self-attention (SA).

| ACAC | No TE | No SA |
| --- | --- | --- |
| 212.42 ± 0.64 | 174.65 ± 18.62 | 136.69 ± 19.44 |

These results show that removing either the TE or SA component severely degrades both stability and the final performance, confirming that each component is vital to ACAC's effectiveness.

Regarding the PPO/GAE ablation, directly applying PPO to Mac-IAICC (N critics) is non-trivial. However, our ACAC vs. ACAC-Vanilla comparison (Sec 5.2, Fig 7) isolates the PPO objective's impact. They share the same architecture; only the optimization differs (PPO vs. simpler policy gradient). ACAC's superior performance demonstrates the benefit of our PPO-based objective.

 

[Experimental Designs 3] Validity of Ablation Comparison

This experiment aimed to isolate the negative impact of processing padded/duplicated information, which is difficult within standard padding methods. ACAC-Duplicate uses the ACAC architecture, including time embedding and per-agent history encoders, but adopts the padding method's history update rule (update on any agent's new observation). This creates a valid intermediate comparison, effectively showing the performance degradation caused by processing duplicated information within our agent-centric framework.

 

[Experimental Designs 4] Lack of Implementation Details

Key architectural details are in Sec 3/Fig 4, with hyperparameters in Appendix F/Table 1. Please feel free to ask if further hyperparameter details are needed. We will release the full source code publicly for reproducibility.

Reviewer Comment

Some follow-up questions:

  • How does the framework decide when a macro-action is interrupted? For example, if two agents block each other—say, by attempting to occupy the same space in a narrow passage—will both of their macro-actions be interrupted, or does the system prioritize one agent over the other?

  • Another concern is whether this framework supports coordination between agents, which is essential in cooperative MARL. For instance, in a scenario where a path is blocked, an effective system might allow one agent to wait while the other passes, then proceed in turn. However, if macro-actions are predefined without considering inter-agent interactions—as they often are in single-agent settings—such coordination may not emerge naturally. This is my primary worry about the practical applicability of the approach. Directly extending single-agent macro-actions to cooperative MARL could undermine cooperation, as agents might pursue individual goals without adapting to each other's actions.

  • How are value and policy updates handled when macro-actions are interrupted?

Author Comment

We thank Reviewer SHS8 for the thoughtful and detailed comments. We understand the concerns regarding the applicability of our MacDec-POMDP framework, particularly in scenarios requiring agent coordination and potential interruptions of macro-actions. It seems that the last paragraph of Section 2.2 may have caused some confusion. To clarify up front, macro-actions in our framework are not simple fixed sequences of micro-actions, but are instead defined using the Options framework within the Semi-Markov Decision Process (SMDP) setting, with agent interactions explicitly considered. The explanation below is intended to clarify this distinction and address the reviewer’s concerns. We will also revise the relevant part in the final version of the paper to avoid such confusion.

We hope this response helps resolve any misunderstandings and provides a clearer view of how our method works in practice.

 

1. Recap: Definition of Macro-Action

In our framework, macro-actions are implemented as options in the Semi-Markov Decision Process (SMDP) framework [1]. Each macro-action is defined as $m = \langle \pi_m, I_m, \beta_m \rangle$, where:

(i) $\pi_m(a \mid h_{\text{mic}})$ is an intra-option policy over micro-actions given a history of micro-observations $h_{\text{mic}}$;

(ii) $I_m$ is the initiation set from which the macro-action can start (typically assumed to be all states, as in our work);

(iii) $\beta_m(h_{\text{mic}})$ is the termination condition, giving the probability that the macro-action ends given the current micro-observation history $h_{\text{mic}}$.

See Appendix D for more detail. Since $I_m$ is typically unrestrictive, the key behaviors are encoded in (i) $\pi_m$ and (iii) $\beta_m$. We explain how these components are used to support inter-agent interaction and adaptive behavior, using concrete examples from the Overcooked environment.

 

2. Example of Macro-Actions in the Overcooked Environment

To illustrate how inter-agent interaction is handled in practice, consider the Overcooked environment:

  • Intra-option policy: A macro-action like “go to tomato” may involve path planning. If another agent is blocking the path, the intra-option policy handles this by having the agent wait until the path is clear, then continue. This behavior is naturally encoded into the policy without requiring the macro-action to terminate.
  • Termination condition: If the task goal becomes invalid—for instance, another agent picks up the tomato—the macro-action terminates automatically according to its predefined termination condition.

These behaviors are not considered "interruptions" in the conventional sense. Rather, they are expected outcomes under the macro-action’s design. The inter-agent interaction and adaptation are achieved through well-defined intra-option policies and termination conditions, enabling cooperative behaviors to naturally emerge within the framework.
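The option tuple and the Overcooked behaviors described above can be sketched as follows; the observation strings, the lambda policies, and the `MacroAction` class name are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

History = List[str]  # a history of micro-observations (simplified to strings)

@dataclass
class MacroAction:
    """An option m = <pi_m, I_m, beta_m> in the SMDP sense."""
    intra_option_policy: Callable[[History], str]  # pi_m: history -> micro-action
    initiation_set: Callable[[History], bool]      # I_m: can the option start here?
    termination_prob: Callable[[History], float]   # beta_m: P(terminate | history)

# "Go to tomato": wait while blocked (no termination), but terminate once
# the goal becomes invalid because another agent took the tomato.
go_to_tomato = MacroAction(
    intra_option_policy=lambda h: "wait" if h and h[-1] == "path_blocked"
                                  else "step_toward_tomato",
    initiation_set=lambda h: True,  # unrestricted, as in the paper
    termination_prob=lambda h: 1.0 if h and h[-1] == "tomato_taken" else 0.0,
)
```

Under this sketch, blocking yields a "wait" micro-action rather than an interruption, while goal invalidation triggers a normal termination via the option's own termination condition.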

 

3. On Macro-Action “Interruptions”

As all macro-actions are designed with agent interaction in mind, including blocking or dynamic invalidation of goals, what might appear as “interruptions” are in fact normal terminations governed by the macro-action’s own $\beta_m$. Consequently, and in direct response to Reviewer SHS8’s third question, the learning of value functions and policies requires no special handling of such cases. There is no distinction made between “complete” and “incomplete” executions, as all terminations are intentional under the defined option.

 

4. Practical Applicability of MacDec-POMDP

We acknowledge the reviewer’s concern regarding the emergence of coordination in cooperative settings. Although our macro-actions may resemble single-agent options, they are learned and executed entirely within a multi-agent context. Specifically, we integrate intra-option policies and termination conditions to handle inter-agent interactions, enabling agents to adapt to one another’s behaviors during execution. In doing so, our approach demonstrates that macro-actions can effectively address inter-agent dynamics, allowing coordination behaviors—such as yielding in narrow spaces—to emerge naturally through the learning process.

 

We hope this final response clarifies how macro-actions are structured in our environments, how they enable coordination among agents, and how termination is handled during learning. Together with our earlier responses, we believe these address your concerns and offer a clearer understanding of our proposed approach.

 

References

[1] Sutton et al., “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, 1999.

Review (Rating: 3)

This paper introduces Agent-Centric Actor-Critic (ACAC), a novel algorithm designed for asynchronous multi-agent reinforcement learning (MARL) in environments with sparse rewards and varying macro-action durations. Each agent's trajectory is processed independently, capturing the history of macro-observations along with their timesteps, enabling accurate temporal abstraction without padding. Meanwhile, an attention mechanism integrates the individual agent histories into a centralized critic, allowing for more accurate value estimation in asynchronous settings. The paper adapts a modified GAE to handle irregular intervals between macro-observations, ensuring effective policy optimization in asynchronous environments. The paper evaluates ACAC on several macro-action-based benchmarks, such as BoxPushing and Overcooked, demonstrating that ACAC achieves faster convergence and higher final returns compared to baseline methods.

Questions for Authors

  1. Can the authors provide formal theoretical guarantees or convergence proofs for the proposed Agent-Centric Actor-Critic (ACAC) algorithm, particularly for the modified GAE in asynchronous settings?
  2. How does the agent-centric approach compare to padding-based methods in terms of information loss or distortion?
  3. Can the authors provide evaluations on the SMAC environment? Choose some representative tasks if possible.
  4. Will ACAC be influenced by mixed opponent policies? I suggest conducting an experiment on SMAC-HARD to test whether ACAC is robust to varying opponent strategies. Choose some representative tasks if possible.

If you respond to these questions and address these concerns, I'll be willing to raise the score.

Claims and Evidence

Yes. The paper provides theoretical and empirical evidence for its main claims. This paper claims that ACAC generalizes well to more complex and randomized environments. This evidence is strong, but additional experiments with larger complexity, such as SMAC, could further validate the generalization capabilities.

Methods and Evaluation Criteria

The methods and evaluation criteria are well-designed and appropriate for the problem. The choice of benchmarks, baseline comparisons, and ablation studies supports the validity of the approach.

Theoretical Claims

The paper does not present formal theoretical proofs in the traditional sense, but it does provide detailed derivations and justifications for key algorithmic components. In the modified GAE section and agent-centric history encoder and centralized critic section, it would be beneficial to show that the modified GAE preserves the convergence guarantees of the original GAE or to discuss any potential limitations introduced by the asynchronous adaptation.

Experimental Design and Analysis

The paper presents a well-designed set of experiments, Overcooked and BoxPushing, to evaluate the proposed Agent-Centric Actor-Critic (ACAC) algorithm. The baselines such as Mac-NIACC, Mac-IAICC, and MAPPO are appropriate. However, I believe a much harder environment such as SMAC will be more beneficial.

Supplementary Material

Yes, I reviewed all the supplementary materials.

Relation to Prior Literature

The paper's contributions are related to MARL and HRL. By addressing the challenges of asynchronous MARL with macro-actions, avoiding the pitfalls of padding-based methods, and adapting GAE for asynchronous settings, the paper extends the state-of-the-art in the field. The use of attention mechanisms and rigorous empirical validation further strengthens the paper's contributions.

Missing Essential References

I believe the references are sufficient.

Other Strengths and Weaknesses

  • Strengths:
  1. The paper introduces a novel approach to handling asynchronous multi-agent reinforcement learning (MARL) by avoiding the common padding technique, which is a significant departure from existing methods.
  2. The proposed Agent-Centric Actor-Critic (ACAC) algorithm has the potential to significantly impact real-world applications where agents operate asynchronously.
  3. The adaptation of Generalized Advantage Estimation (GAE) to asynchronous settings is an original contribution that addresses a gap in the literature.
  • Weaknesses:
  1. While the paper provides detailed derivations for the modified GAE, it lacks formal theoretical guarantees or convergence proofs for the proposed methods.
  2. The paper could benefit from a sensitivity analysis to show how the performance of ACAC varies with different hyperparameter settings.
  3. While the chosen benchmarks are appropriate, including additional environments could further validate the generalizability of ACAC, such as SMAC and its variants.

Other Comments or Suggestions

N/A

Author Response

We are grateful for your thoughtful remarks and would like to provide clarification on our contributions and address the raised concerns.

 

[Q1,W1] Analysis of Modified GAE

Similar to the original GAE, our proposed method does not guarantee convergence in general. However, just as the original GAE converges when $\lambda$ equals 0 or 1, our proposed GAE also inherently converges in these specific cases. The rationale behind using macro-action-based $\lambda$-discounting in GAE within ACAC is detailed in our response to reviewer Cf7Q (Analysis of Modified GAE); please refer to our comments there for further clarification.

 

[Q2] Information Loss Comparison

Given an episode obtained using macro-actions, both ACAC and padding-based methods theoretically acquire the same amount of information. However, the padding-based approach continuously uses information from observations that are not actually collected, making it difficult to distinguish between cases where no new information is obtained and those where identical information from a previous observation is repeated. This can result in information distortion. To eliminate such distortion without losing information, ACAC employs an agent-centric encoder that utilizes only each agent's available information for history encoding. This structure allows ACAC to more accurately estimate joint histories compared to padding-based methods, thereby achieving superior performance.

 

[Q3,W3] Evaluations on the SMAC

We agree with your point that evaluating our approach in environments like SMAC could enhance the persuasiveness of our results. Indeed, several studies have explored hierarchical approaches in SMAC; however, most of them have focused on synchronous scenarios where macro-actions have a fixed, identical duration for all agents. Our work, in contrast, specifically addresses scenarios involving asynchronous macro-actions, where each macro-action has varying durations. Consequently, conducting performance comparisons in synchronous SMAC environments would not meaningfully reflect the contributions of our method.

To compare ACAC and the baseline methods effectively in SMAC under asynchronous settings, we would need to explicitly define macro-actions with varying durations. Unfortunately, it is not feasible to develop and test such an asynchronous version of SMAC within the limited time frame available for this rebuttal. Nevertheless, we are actively working on developing an asynchronous variant of SMAC. We strongly anticipate that ACAC will exhibit superior performance over existing baselines, such as Mac-IAICC and Mac-NIACC, once evaluated in this asynchronous SMAC environment.

 

[Q4] Influence by Opponent Policies

Thank you very much for suggesting an evaluation in environments like SMAC-Hard, where multiple opponent policies are mixed. Similar to our earlier explanation regarding SMAC, we believe conducting evaluations in such environments would be meaningful only within an asynchronous version of SMAC. This, however, requires developing a dedicated asynchronous SMAC environment. Additionally, effectively responding to various opponent strategies, as seen in SMAC-Hard, would necessitate developing a module explicitly designed for opponent strategy inference. While this is beyond the current scope of our research, which primarily targets asynchronous MARL scenarios, we anticipate that ACAC's per-agent history encoder could be extended or enhanced to perform opponent strategy inference. This capability would potentially enable the model to adapt effectively to different opponent strategies, representing an exciting and promising direction for future research.

 

[W2] Sensitivity Analysis on Hyperparameter

We agree with your suggestion, so we have performed sensitivity analyses on two hyperparameters, the clipping ratio and the GAE $\lambda$ (micro-level vs. macro-level $\lambda$-discounting), across the Overcooked and Overcooked-Rand environments. Since the complex Overcooked-Rand-B map showed the clearest and most significant performance differences between configurations, we focus on these results here for brevity. Full results will be included in the revised manuscript. We report the average final performances and standard errors over five random seeds.

Table 1. Sensitivity analysis on the clipping ratio (ε).

| ACAC (ε = 0.01) | ε = 0.005 | ε = 0.015 |
| --- | --- | --- |
| 212.42 ± 0.64 | 188.73 ± 14.99 | 209.62 ± 3.33 |

Table 2. GAE $\lambda$ comparison.

| ACAC | GAE with micro-level $\lambda$-discounting |
| --- | --- |
| 212.42 ± 0.64 | 132.98 ± 19.02 |

The results show that performance is robust to variations in the clipping ratio, and that ACAC's macro-action-based GAE discounting outperforms the original timestep-based discounting.

Review (Rating: 4)

This paper proposes the Agent-Centric Actor-Critic (ACAC) algorithm to address asynchronous multi-agent reinforcement learning (MARL) in sparse-reward environments with macro-actions. The key innovation lies in replacing padding-based centralized critics with agent-centric history encoders and attention-based aggregation, which avoids spurious correlations from padded observations. ACAC introduces a modified GAE for asynchronous settings and demonstrates superior performance on macro-action benchmarks like BoxPushing and Overcooked, showing faster convergence and higher returns compared to baselines.

Questions for Authors

Please see the weaknesses.

Claims and Evidence

The superiority of ACAC over padding-based methods is supported by experiments. The ablation study (ACAC-Duplicate) convincingly demonstrates padding’s negative impact.

Methods and Evaluation Criteria

The agent-centric encoder and attention-based critic are well-justified for asynchronous settings. The modified GAE adapts standard GAE to variable intervals logically. Benchmarks selected are appropriate for demonstrating algorithm efficacy.

Theoretical Claims

The theoretical foundations are basically sound.

Experimental Design and Analysis

The experiments are thoughtfully designed and robust in demonstrating the method's strengths. Broader scenario testing would better illustrate the adaptability of the techniques.

Supplementary Material

I have read the Supplementary Materials, including MacDec-POMDP definitions, GAE derivations, hyperparameters, etc.

Relation to Prior Literature

The paper contextualizes its work within macro-action MARL and CTDE frameworks, citing key works (Xiao et al., 2020a; Amato et al., 2019).

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths:

  • ACAC introduces agent-centric history encoders with timestep embeddings and an attention-based critic, effectively addressing the misalignment caused by padding in asynchronous MARL. This architecture is novel and well-motivated.
  • Extensive experiments on diverse benchmarks (including randomized and large-scale variants) demonstrate ACAC’s advantages in convergence speed, final performance, and generalization. Ablation studies (e.g., ACAC-Duplicate) confirm the necessity of avoiding padding.
  • The modified GAE formulation for asynchronous MARL is theoretically justified and empirically validated, addressing irregular macro-observation intervals.
  • The paper clearly identifies limitations of padding-based methods (e.g., spurious correlations) and provides a structured comparison of ACAC’s workflow versus baselines.

Weaknesses:

  • While the modified GAE is motivated empirically, its convergence properties and bias-variance trade-off in asynchronous settings lack formal analysis.
  • All experiments focus on grid-world tasks (BoxPushing, Overcooked). Real-world applicability (e.g., continuous control) remains unverified.

Other Comments or Suggestions

N/A

Author Response

We are grateful for your feedback and would like to offer a detailed explanation of our contributions while addressing the concerns raised.

[W1] Analysis of Modified GAE

It is known that the original GAE does not guarantee convergence in general. However, specific boundary cases are clearly defined: when $\lambda = 0$, GAE reduces to the TD error; when $\lambda = 1$, it simplifies precisely to the empirical return. Similarly, the proposed GAE in our paper maintains this important property: when $\lambda = 1$, it equals the empirical return, and when $\lambda = 0$, it becomes the multi-step TD error between consecutive decision points ($\delta_{l(k)}$ in our formulation, where $l(k)$ is the $k$-th timestep at which a new observation becomes available for any agent).

In MacDec-POMDP settings featuring temporal abstraction via macro-actions, there are two choices for designing the advantage function based on how $\lambda$-discounting is applied:

  1. Applying $\lambda$-discounting at the micro-timestep level: this approach discounts future TD errors based on the number of primitive timesteps elapsed.
  2. Applying $\lambda$-discounting at the macro-timestep level: this approach discounts future TD errors based on the number of macro-action decision steps taken.

Let's illustrate using an example where macro-observations (and value estimates $V_t$) are obtained at timesteps 0, 2, 5, and 6:

  • Rewards: $r_0, r_1, r_2, r_3, r_4, r_5, r_6, \dots$
  • Values: $V_0, -, V_2, -, -, V_5, V_6, \dots$
  • Multi-step TD errors between decision points:
    • $\delta_0 = r_0 + \gamma r_1 + \gamma^2 V_2 - V_0$
    • $\delta_2 = r_2 + \gamma r_3 + \gamma^2 r_4 + \gamma^3 V_5 - V_2$
    • $\delta_5 = r_5 + \gamma V_6 - V_5$
  • The resulting advantage calculations at $t = 0$ for each approach:
    • (1) Micro-level $\lambda$-discounting: $A^{\lambda,\text{micro}}_0 := \delta_0 + (\lambda^2 \gamma^2)\,\delta_2 + (\lambda^5 \gamma^5)\,\delta_5 + \dots$
    • (2) Macro-level $\lambda$-discounting: $A^{\lambda,\text{macro}}_0 := \delta_0 + (\lambda \gamma^2)\,\delta_2 + (\lambda^2 \gamma^5)\,\delta_5 + \dots$

We adopted the second approach (macro-level $\lambda$-discounting) for ACAC, as it emphasizes both the significance of future rewards and the critical decision points associated with macro-actions. As shown in Table 2 of our response to Reviewer RGMo (Sensitivity Analysis on Hyperparameter), this approach effectively handles the variable intervals in our asynchronous setting and contributes to the strong performance observed.
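The two discounting schemes can be written in a few lines of Python; the rewards, values, and $\lambda$ below are hypothetical toy numbers, and this sketch only reproduces the formulas above, not ACAC's actual implementation:

```python
def multistep_td_errors(rewards, values, decision_ts, gamma):
    """Multi-step TD errors between consecutive decision points.
    `values` maps each decision timestep t to the estimate V_t."""
    deltas = {}
    for t, t_next in zip(decision_ts[:-1], decision_ts[1:]):
        # Discounted reward sum from t up to (but excluding) t_next,
        # plus the bootstrapped value at the next decision point.
        ret = sum(gamma ** (i - t) * rewards[i] for i in range(t, t_next))
        deltas[t] = ret + gamma ** (t_next - t) * values[t_next] - values[t]
    return deltas

def advantage_at_start(deltas, decision_ts, gamma, lam, level):
    """lambda-discounted advantage at the first decision point.
    level='micro' uses lambda**t (elapsed primitive timesteps);
    level='macro' uses lambda**k (number of decision steps taken)."""
    adv = 0.0
    for k, t in enumerate(decision_ts[:-1]):
        lam_pow = lam ** t if level == "micro" else lam ** k
        adv += lam_pow * gamma ** t * deltas[t]
    return adv

# Example from the text: decision points at t = 0, 2, 5, 6 (toy numbers).
rewards = [1.0] * 7
values = {0: 0.0, 2: 0.0, 5: 0.0, 6: 0.0}
ts = [0, 2, 5, 6]
deltas = multistep_td_errors(rewards, values, ts, gamma=1.0)
a_macro = advantage_at_start(deltas, ts, gamma=1.0, lam=0.5, level="macro")
a_micro = advantage_at_start(deltas, ts, gamma=1.0, lam=0.5, level="micro")
```

With these toy numbers, δ0 = 2, δ2 = 3, δ5 = 1, so the macro-level advantage is 2 + 0.5·3 + 0.25·1 = 3.75 while the micro-level one is 2 + 0.25·3 + 0.03125·1 ≈ 2.78, illustrating that the macro-level scheme weights later decision points more heavily.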

 

[W2] Real-world Applicability

Thank you for raising this important point regarding the evaluation environments. We agree that verifying applicability beyond grid-world tasks is crucial. While settings requiring asynchronous coordination are common in the real world, the field of Asynchronous MARL, especially involving macro-actions, is still emerging. Consequently, there is currently a limited set of established benchmarks available, most of which are grid-based environments like BoxPushing and Overcooked. We aimed for a comprehensive evaluation within the current scope of the field and thus tested our ACAC method on these known and publicly available benchmarks specifically designed for macro-action-based asynchronous MARL. We recognize the importance of demonstrating applicability in more realistic scenarios. Extending our work to real-world problems is a key direction for our future research.

Review (Rating: 4)

This paper tackles the challenges encountered in asynchronous multi-agent reinforcement learning (MARL) arising from the use of macro-actions with varying durations. In traditional Centralized Training with Decentralized Execution (CTDE) frameworks, a padding technique is often used to fill in missing macro-observations. However, such padding can introduce redundancy and misleading correlations in the history representation. To address this, the authors propose the Agent-Centric Actor-Critic (ACAC) algorithm. ACAC utilizes individual history encoders for each agent to process its own macro-observation trajectory, and an attention-based aggregation module to integrate these histories into a centralized critic. Additionally, the algorithm incorporates a modified Proximal Policy Optimization (PPO) objective and an adapted version of Generalized Advantage Estimation (GAE) suitable for asynchronous settings. The experimental results, conducted on several benchmark tasks including BoxPushing, Overcooked, Overcooked-Rand, and the more challenging Overcooked-Large, demonstrate that ACAC converges faster and achieves higher returns compared to padding-based baselines.

Questions for Authors

  1. Can the authors provide qualitative analyses (e.g., visualizations of attention weights or hidden state dynamics) that illustrate how the agent-centric encoder and aggregation module enhance the representation of asynchronous histories compared to the padding-based approach? A detailed response and accompanying visualizations would help clarify the internal workings of the proposed model and further justify the design choices, potentially increasing the paper's impact.
  2. Have the authors conducted sensitivity analyses on key hyperparameters such as the clipping ratio in PPO and the λ parameter in the modified GAE? How robust is the performance of ACAC across different settings? Insights into hyperparameter robustness would strengthen the empirical evidence and provide practical guidance for applying the method in various settings.
  3. Could the authors elaborate on the potential challenges and necessary modifications for extending ACAC to continuous action spaces? Are there any preliminary experiments or theoretical insights in this direction? A clearer discussion on this topic would help situate the current work within broader applications and guide future research efforts.
  4. What assumptions, if any, are made regarding the distribution or variability of the time intervals between macro-observations? Could extreme variability in these intervals affect the validity of the modified GAE formulation? Clarifying these assumptions would help determine the generality of the method and its applicability to diverse asynchronous environments.

Claims And Evidence

The paper makes several key claims:

  1. Problem Identification: The conventional padding method in asynchronous MARL introduces redundant and inaccurate information into the joint history representation, thus impairing the effectiveness of the centralized critic.
  2. Methodological Contribution: By employing agent-centric history encoders and an attention-based aggregation mechanism, ACAC can directly utilize the latest available macro-observations without resorting to padding.
  3. Performance Improvement: The proposed ACAC, when combined with a modified PPO objective and an adapted asynchronous GAE, outperforms existing padding-based approaches in terms of convergence speed, stability, and final performance.

The experimental results, including learning curves, ablation studies (e.g., comparison with ACAC-Duplicate), and evaluations across various environments, provide clear and convincing evidence supporting these claims. In particular, the superior performance in complex scenarios like Overcooked-Large reinforces the paper’s assertions about the benefits of the proposed method.

Methods And Evaluation Criteria

The methodological design and evaluation criteria in the paper are well-motivated and appropriate:

  1. Method Design: The authors identify the core issue in asynchronous settings—the irregularity in receiving new macro-observations—and address it by designing an agent-centric encoder that processes each agent’s macro-observation along with its associated timestamp. This enriched representation allows for a more accurate reconstruction of the agents’ histories. The subsequent attention-based aggregation module enables the centralized critic to combine these individual histories without the need for padding.
  2. Evaluation Criteria: The paper employs a diverse set of benchmark environments that vary in complexity (from BoxPushing to Overcooked variants). The experiments are conducted using multiple random seeds and include metrics such as convergence speed and final returns. Additionally, ablation studies are performed to isolate the effects of padding versus non-padding, providing further empirical support for the proposed approach.
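The agent-centric design described in the review can be sketched in a few lines of NumPy. Everything below (the recurrent update, the dimensions, the single-head attention form) is an illustrative assumption for exposition, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
H, Z = 8, 4  # hidden and observation dims (illustrative)

def encode_history(history, W, U):
    """Agent-centric encoder: fold each (obs, timestamp) pair into a
    hidden state only when that agent actually receives an observation."""
    h = np.zeros(H)
    for obs, t in history:
        x = np.concatenate([obs, [t]])  # append timestep information
        h = np.tanh(W @ x + U @ h)      # simple recurrent update
    return h

def attention_pool(hiddens, Wq, Wk, Wv):
    """Attention-based aggregation over the N agent representations."""
    Hmat = np.stack(hiddens)            # (N, H)
    q, k, v = Hmat @ Wq, Hmat @ Wk, Hmat @ Wv
    scores = q @ k.T / np.sqrt(H)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return (attn @ v).mean(axis=0)      # pooled joint representation

# Asynchronous histories: agents hold different numbers of observations,
# so nothing needs to be padded to a common length.
histories = [
    [(rng.normal(size=Z), 0.0), (rng.normal(size=Z), 3.0)],   # agent 1
    [(rng.normal(size=Z), 0.0), (rng.normal(size=Z), 1.0),
     (rng.normal(size=Z), 2.0)],                              # agent 2
]
W, U = rng.normal(size=(H, Z + 1)), rng.normal(size=(H, H))
Wq, Wk, Wv = (rng.normal(size=(H, H)) for _ in range(3))
hiddens = [encode_history(hist, W, U) for hist in histories]
joint = attention_pool(hiddens, Wq, Wk, Wv)  # feeds the centralized critic
print(joint.shape)  # (8,)
```

The point of the sketch is structural: each agent's encoder is stepped only on its own observation arrivals, and the attention pooling consumes however many agent summaries exist, so no padded joint history is ever constructed.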

Theoretical Claims

The theoretical contributions primarily focus on the derivation of an adapted GAE for asynchronous MARL:

  1. Theoretical Derivation: The authors provide a detailed derivation of a modified GAE formulation that accounts for irregular intervals between macro-observations. By introducing a variable that captures these intervals, the new formulation properly adjusts the temporal difference errors and advantage estimates.
  2. Assessment: The derivation is logically sound and recovers the standard GAE formulation as a special case when the intervals are uniform. While the derivation is complex, the underlying assumptions and steps are clearly articulated. It would be beneficial for the final version to further clarify any underlying assumptions about the distribution of the time intervals and boundary conditions to ensure readers fully grasp the scope of the theoretical results.
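To make the interval-adjusted recursion concrete, here is a minimal sketch assuming that both γ and λ are raised to the length of each macro-step interval. This assumption is chosen so that unit intervals recover the standard GAE recursion; the paper's exact formulation may differ:

```python
import numpy as np

def async_gae(rewards, values, intervals, gamma=0.99, lam=0.95):
    """Interval-aware GAE sketch.

    rewards[k]   : reward accumulated over macro-step k
    values[k]    : critic value at decision point k (len K + 1, last is bootstrap)
    intervals[k] : number of primitive timesteps macro-step k lasted

    Assumption (not the paper's exact formula): gamma and lambda are
    raised to the interval length, so unit intervals reduce this to
    the standard GAE backward recursion.
    """
    K = len(rewards)
    adv = np.zeros(K)
    gae = 0.0
    for k in reversed(range(K)):
        g = gamma ** intervals[k]
        td = rewards[k] + g * values[k + 1] - values[k]  # interval-adjusted TD error
        gae = td + g * (lam ** intervals[k]) * gae
        adv[k] = gae
    return adv

r = [1.0, 0.5, 2.0]
v = [0.3, 0.1, 0.4, 0.0]
print(async_gae(r, v, [1, 1, 1]))  # matches standard GAE
print(async_gae(r, v, [2, 1, 3]))  # longer intervals stretch the discounting
```

Checking the unit-interval case against the textbook recursion is a quick way to verify that a formulation of this kind degenerates correctly, which is the special-case property the review highlights.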

Experimental Design And Analyses

The experimental design and analysis are robust and comprehensive:

  1. Experimental Setup: The paper evaluates ACAC across multiple environments (BoxPushing, Overcooked, Overcooked-Rand, and Overcooked-Large) that effectively capture the challenges of asynchronous MARL. The inclusion of both standard and randomized scenarios demonstrates the method’s generalization capabilities.
  2. Data Analysis: Learning curves, accompanied by mean and standard error statistics, clearly illustrate the performance improvements of ACAC over baseline methods. The ablation studies, particularly the comparison between ACAC and a variant that duplicates macro-observations (ACAC-Duplicate), effectively show the detrimental effects of padding in the learning process.
  3. Recommendations for Improvement: Future work might explore the extension of ACAC to continuous action spaces and larger-scale multi-agent systems. Additionally, a discussion on hyperparameter sensitivity and computational efficiency would further enhance the experimental analysis.

Supplementary Material

I reviewed the supplementary material thoroughly. In particular, I focused on:

  • Appendix B: Detailed derivation of the asynchronous Generalized Advantage Estimation (GAE), which clarifies how the modified GAE adapts to irregular macro-observation intervals.
  • Appendix C: A comparison between padding-based methods and the proposed ACAC, providing a deeper understanding of the advantages of the agent-centric approach.
  • Appendix D: The formal definition of the MacDec-POMDP framework, which helps ground the theoretical discussion in a precise problem formulation.
  • Appendix E: Additional experimental results (ablation studies).

Relation To Broader Scientific Literature

The key contributions of the paper are well-situated within the broader literature:

  1. Macro-actions and Hierarchical RL: The paper builds on the established ideas of macro-actions and the options framework (e.g., Sutton et al., 1999) as well as hierarchical reinforcement learning methods. It extends these ideas by addressing the challenges of asynchrony, a topic that has been explored in works such as Amato et al. (2014, 2019) and Xiao et al. (2020, 2022).
  2. CTDE and Multi-Agent Coordination: The work leverages the Centralized Training with Decentralized Execution (CTDE) framework—common in multi-agent RL—to propose an innovative agent-centric design. This is related to recent advances in methods like MAPPO (Yu et al., 2022) and other actor-critic variants that tackle coordination issues in multi-agent settings.
  3. Modified GAE: The adaptation of Generalized Advantage Estimation for asynchronous settings ties back to foundational work by Schulman et al. (2016), but the modification to account for variable macro-observation intervals is novel and addresses a gap in existing methods.
  4. Attention Mechanisms: The use of attention-based aggregation to combine agent histories connects with broader trends in using attention (e.g., Vaswani et al., 2017) for handling sequence data, which is increasingly common in multi-agent communication and coordination research.

Essential References Not Discussed

No.

Other Strengths And Weaknesses

Strengths:

  1. Originality: The agent-centric approach to avoid padding issues in asynchronous MARL is innovative and addresses a well-known challenge in the field.
  2. Theoretical Rigor: The derivation of the modified GAE for asynchronous settings is detailed and, overall, mathematically sound.
  3. Empirical Validation: The experiments are comprehensive, spanning several benchmark environments and including ablation studies that clearly demonstrate the benefits of the proposed method.

Weaknesses:

  1. Complexity of Theoretical Derivations: Some parts of the derivation (especially the asynchronous GAE) are complex and may benefit from additional intuitive explanations or visual aids.
  2. Scalability and Computational Overhead: While the experiments are convincing, more discussion on the computational cost and scalability of the attention-based aggregation, especially in larger agent populations, would be helpful.
  3. Extension to Continuous Actions: The paper acknowledges that extending ACAC to continuous action spaces is an open challenge. More discussion on potential approaches or anticipated difficulties could strengthen the contribution.

Other Comments Or Suggestions

  1. Supplementary Material Organization: Consider reorganizing the supplementary material for clearer navigation. For instance, clearly labeling each appendix section with a brief overview of its contents could help readers.
  2. Minor Typos and Clarity: I noticed a few minor typos and formatting issues in the text; a careful proofreading would enhance readability.
  3. Visualization of Encoder Dynamics: Including visualizations of the agent-centric encoder’s hidden state evolution or attention weight distributions could provide deeper insights into how the aggregation module improves history representation.
Author Response

Your valuable comments are much appreciated. In response, we aim to clarify our contributions and address the points of concern.

[Q1] Qualitative Analysis Request

Thank you for the insightful suggestion to correlate attention scores with behavior. This is an excellent direction for future work to gain deeper understanding.


[Q2] Hyperparameter Sensitivity

We conducted experiments to analyze hyperparameter sensitivity. Due to space limitations in this response, we have detailed these results in our separate response provided to reviewer RGMo (Sensitivity Analysis on Hyperparameter). We kindly request your understanding and refer you to that specific comment for the detailed results.


[Q3,W3] Continuous Action Space

We believe ACAC is structurally suitable for continuous actions with standard actor-critic modifications. The main challenge is the current lack of asynchronous MARL benchmarks with continuous action spaces and defined macro-actions. Defining these requires careful effort, preventing results within the rebuttal period, but it remains important future work.


[Q4,W1] Analysis of Modified GAE

Similar to the original GAE, our proposed method does not guarantee convergence in general. However, just as the original GAE converges when λ equals 0 or 1, our proposed GAE also inherently converges in these specific cases. The rationale behind using macro-action-based λ discounting in GAE within ACAC is detailed in our response to reviewer Cf7Q (Analysis of Modified GAE); please refer to our comments there for further clarification.


[W2] Computational Complexity of ACAC Structure

Define N = number of agents, K = observations per agent, Z = observation dimension, and H = hidden dimension:

  • Actor: Complexity is nearly identical, aside from ACAC's minor time embedding overhead.
  • Critic:
    • History Encoding
      • Mac-IAICC: Encodes the concatenated joint history (size NZ) whenever any agent gets a new observation (up to NK times total). This leads to a worst-case complexity of O(N²KZ).
      • ACAC: Processes each agent's observation (size Z) individually through its dedicated encoder. Across all agents and observations (NK total), the complexity is O(NKZ).
      • Thus, ACAC's history encoding scales linearly with N rather than quadratically as in Mac-IAICC's joint O(N²KZ) encoding, a factor-of-N advantage that grows with the agent population.
    • Value Estimation
      • Mac-IAICC: Typically uses an MLP processing aggregated hidden features (size NH). Computed up to NK times, giving O(N²KH) total complexity.
      • ACAC: Employs an attention mechanism over N agent representations (size H), requiring O(N²H) computation per step where an observation arrives. In the worst case (NK steps), this totals O(N³KH) complexity.
      • A simpler MLP aggregation (as in Mac-IAICC) is one alternative for better scalability, potentially trading off some performance. A more promising direction for future work is to integrate efficient attention mechanisms [1] that reduce the per-step cost (e.g., to linear O(NH)) while retaining the expressive capacity to model inter-agent dependencies.
  • Practical Runtime: Despite theoretical differences, the observed runtimes for ACAC and Mac-IAICC were similar in our experiments (N=3, 6). This suggests ACAC's more efficient history encoding helps offset the computational cost of the attention module in practice for these agent populations.

[1] Wang, S., Li, B. Z., Khabsa, M., Fang, H., & Ma, H. (2020). Linformer: Self-Attention with Linear Complexity. arXiv preprint arXiv:2006.04768.
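The asymptotic comparison above can be illustrated with a toy operation count. `encoding_cost` is a hypothetical helper that drops all constant factors; it is not a profile of either implementation:

```python
def encoding_cost(N, K, Z, agent_centric):
    """Rough operation counts for the two history-encoding schemes.

    Joint (Mac-IAICC-style): whenever any agent receives an observation
    (up to N*K events), the full joint input of size N*Z is re-encoded.
    Agent-centric (ACAC-style): each of the N*K observations of size Z
    passes through its own agent's encoder exactly once.
    """
    if agent_centric:
        return N * K * Z           # O(NKZ)
    return (N * K) * (N * Z)       # O(N^2 K Z)

for N in (3, 6, 12):
    joint = encoding_cost(N, K=100, Z=32, agent_centric=False)
    acac = encoding_cost(N, K=100, Z=32, agent_centric=True)
    print(N, joint // acac)  # the ratio grows linearly with N
```

Under these simplified counts, doubling the agent population doubles the per-observation advantage of agent-centric encoding, which is the scaling argument made in the bullet points above.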

Final Decision

This paper proposes an agent-centric actor-critic algorithm that targets the challenge of asynchronous decision-making in multi-agent reinforcement learning by eliminating padding through independent trajectory encoding and attention-based aggregation, resulting in improved learning efficiency and convergence.

The paper demonstrates multiple strengths:

  • Methodological novelty and strong motivation: The paper presents a well-justified and innovative solution to asynchronous MARL by avoiding spurious correlations introduced by padding (Reviewer FJpH, Reviewer Cf7Q, Reviewer RGMo).

  • Sound theoretical insights: The derivation of a modified GAE formulation for irregular macro-action intervals is logical and recovers standard GAE as a special case (Reviewer FJpH, Reviewer Cf7Q).

  • Comprehensive empirical validation: ACAC outperforms strong baselines across diverse macro-action benchmarks with robust ablations and sensitivity analyses (Reviewer FJpH, Reviewer Cf7Q, Reviewer RGMo).

  • Clear presentation and accessibility: The paper is well-written, with helpful figures and well-organized supplementary materials, enhancing readability and reproducibility (Reviewer SHS8, Reviewer FJpH).

  • Scalability and practicality: The authors demonstrate that ACAC has favorable complexity and runtime characteristics compared to existing methods, even in larger agent populations (Reviewer FJpH).

Prior to the rebuttal, common concerns raised by reviewers included the lack of evaluation in highly coordinated or continuous control environments and the potential impact of added model complexity (Reviewer SHS8, Reviewer RGMo, Reviewer Cf7Q), as well as the unclear contribution of individual components such as timestep embeddings and attention modules (Reviewer SHS8), and limited discussion on convergence guarantees for the modified GAE (Reviewer RGMo, Reviewer Cf7Q). These concerns were addressed through detailed clarifications, new ablation studies, architectural comparisons, and explanations of the macro-action framework’s adaptability to agent interaction.

Therefore, the AC recommends acceptance of this paper.