PaperHub

ICLR 2025 · Rejected
Overall rating: 4.3/10 (4 reviewers; min 3, max 6, std 1.3)
Individual ratings: 6, 3, 5, 3
Average confidence: 3.8 · Correctness: 2.3 · Contribution: 2.0 · Presentation: 2.5

Multi-Agent Decision S4: Leveraging State Space Models for Offline Multi-Agent Reinforcement Learning

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

State space model-based offline multi-agent reinforcement learning

Abstract

Keywords

offline multi-agent reinforcement learning, sequence-based offline reinforcement learning

Reviews and Discussion

Official Review

Rating: 6

The paper presents Multi-Agent Decision S4 (MADS4), an innovative approach to offline multi-agent reinforcement learning (MARL). The authors build on the success of Structured State Space Sequence (S4) models in single-agent tasks and extend them to MARL by leveraging their parameter efficiency, long-context handling capabilities, and faster inference compared to transformers. The paper introduces a Sequentially Expanded Markov Decision Process (SE-MDP) framework that promotes inter-agent cooperation by enabling sequential action-taking with limited communication between agents. The results on benchmark environments (Multi-Robot Warehouse and StarCraft2) demonstrate the superior performance of MADS4 over state-of-the-art baselines in offline MARL settings.

Strengths

  1. The use of S4 models provides clear benefits in terms of efficiency, handling longer temporal contexts, and faster inference times. This is a critical advantage over transformers, especially for offline reinforcement learning tasks that involve long trajectories.
  2. The paper provides empirical results showing that MADS4 outperforms state-of-the-art methods on RWARE and SMAC, highlighting its potential for cooperative multi-agent tasks that require coordination among agents.
  3. Offline reinforcement learning in MARL is a challenging but important area, especially when online interaction with the environment is costly or risky. The focus on leveraging pre-collected data is a strength of this work, aligning it with real-world applications.

Weaknesses

  1. The choice of environments (RWARE and SMAC) does not sufficiently test the generalizability of MADS4. These benchmarks are widely used but do not fully capture the complexity of real-world multi-agent tasks. Testing in environments with higher-dimensional state spaces or dynamic agent interactions (e.g., more heterogeneous teams or real-time strategy games) would provide stronger evidence of scalability and robustness.
  2. The paper mainly compares MADS4 to transformer-based models. However, given the growing body of work in MARL using other architectures like graph neural networks (GNNs) or attention mechanisms that directly model agent interactions, it would be valuable to include comparisons to such baselines.

     Insufficient Analysis of Sequential Dependencies: The SE-MDP framework introduces a strong assumption of sequential action-taking, which may not always be practical in environments where agents need to act simultaneously or in parallel. This limitation isn't fully addressed, and the paper does not provide an alternative for situations where this sequential action may be a bottleneck.
  3. The focus on S4 models without considering the possible limitations of these models (such as their effectiveness in high-dimensional continuous action spaces) reduces the generality of the conclusions. Additionally, the paper does not explore whether there are settings where transformers might outperform S4, such as tasks that require more flexible attention mechanisms.

Questions

  1. How does the sequential dependency between agents in SE-MDP affect performance in environments that require real-time parallel actions from all agents? Would MADS4 struggle in such settings, and if so, how could it be adapted?
  2. Have you considered testing MADS4 in environments with more dynamic interactions, such as continuous control tasks or real-world applications (e.g., autonomous driving, resource allocation)?
  3. How does MADS4 compare to architectures like graph neural networks, which are also designed for multi-agent cooperation? It would be beneficial to see a comparison with GNN-based approaches to understand the specific advantages of S4.
  4. Given that the S4 model relies on long-term dependencies, what are the trade-offs in using S4 over transformers in tasks with varying sequence lengths? Would transformers perform better in environments with shorter sequences or fewer agents?
  5. Can you provide more intuition or formal analysis regarding the gradient flow between agents in SE-MDP? How does this contribute to stability, and why does it improve learning compared to other methods?
Comment

We would like to thank the reviewer for their thorough technical review. We are glad that the reviewer finds our S4-based offline MARL approach novel and beneficial.

Comment 1

The choice of environments (RWARE and SMAC) does not sufficiently test the generalizability of MADS4. These benchmarks are widely used but do not fully capture the complexity of real-world multi-agent tasks. Testing in environments with higher-dimensional state spaces or dynamic agent interactions (e.g., more heterogeneous teams or real-time strategy games) would provide stronger evidence of scalability and robustness.

Response Thank you for this feedback. Offline MARL algorithms heavily rely on pre-collected offline datasets which are typically available and easy to collect in existing MARL benchmark environments, and therefore most (if not all) offline MARL algorithms are evaluated across these well-established environments. In our work, we therefore also compared our performance against state-of-the-art offline RL-based and transformer-based supervised-learning-based offline MARL algorithms across these domains. Additionally, we are working on experiments on the challenging Multi-Agent MuJoCo HalfCheetah-v2 domain to extend the scope of our evaluation and test our algorithm's effectiveness in continuous action control tasks. S4 models have been shown to outperform transformers in continuous control domains (e.g., single-agent Mujoco), and we anticipate that S4-based MADS4 would also outperform or perform on par with transformer-based models like MADT and MADTKD, even though their performance in continuous control tasks has also not been reported to our knowledge.

Comment 2

The paper mainly compares MADS4 to transformer-based models. However, given the growing body of work in MARL using other architectures like graph neural networks (GNNs) or attention mechanisms that directly model agent interactions, it would be valuable to include comparisons to such baselines.

Response Thank you for this suggestion. Our work focuses on developing an offline MARL algorithm within the sequence-to-sequence supervised learning paradigm. As a result, the most relevant benchmarks for comparison are transformer-based offline MARL algorithms, such as MADT and MADTKD, which also aim to learn policies in offline supervised learning settings. While graph-based algorithms are often employed in MARL to effectively model agent communication, they are primarily designed for online RL scenarios. Since no graph-based offline MARL baselines are currently available, we have not included them in our comparisons with MADS4.

Comment 2 (continued)

Insufficient Analysis of Sequential Dependencies

Response In the SE-MDP framework used in this work, information sharing is needed only between consecutive agents, and each agent cooperates with its preceding agent in the sequence to take better cooperative actions. By sharing information through the hidden state representation of the S4 model, which accumulates information about the previous agents in the sequence, an agent does not need to communicate with all the previous agents but only with its immediate predecessor, allowing for a scalable communication mechanism. This communication can therefore be performed with constant O(1) communication memory overhead, since any agent needs to communicate with only one other agent, while still achieving bi-directional cooperative behavior with the entire team: the memory representation holds information about all preceding agents, and gradients are backpropagated from the agents later in the sequence. This communication mechanism induces a recurrent, sequential flow of information between agents, resulting in a linear worst-case time complexity of O(M) for the last agent in the sequence, where M is the number of agents. However, by employing a simple adjustment (passing the previous agent's hidden state from the previous timestep instead of the current updated hidden state), we can enable fully decentralized policies across agents. This modification removes the need for sequential, recurrent communication, thereby eliminating the linear O(M) time complexity. Furthermore, leveraging S4-based models enables per-agent inference to be performed in constant memory and time, offering a significant advantage over attention-based transformer models like MADT and MADTKD, which incur quadratic runtime relative to the sequence length.

As mentioned above, MADS4 can be adapted for fully decentralized settings with a minor adjustment to how memory information is shared. Instead of accessing the current timestep's memory from the previous agent in the sequence, each agent uses the memory information of the previous agent from the previous timestep. This approach allows all agents to take actions in parallel, as the required memory information from the previous timestep is already available when making decisions at the current timestep.
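To make the two execution modes described above concrete, here is a minimal illustrative sketch. It is not the authors' implementation: a GRU cell stands in for the S4 block, and the class and function names, dimensions, and zero-initialized peer state are assumptions made purely for illustration.

```python
import torch

# Illustrative only: a GRU cell stands in for the per-agent S4 block; the point
# is how the predecessor's hidden state is consumed, not the sequence model itself.
class AgentPolicy(torch.nn.Module):
    def __init__(self, obs_dim, hid_dim, act_dim):
        super().__init__()
        self.cell = torch.nn.GRUCell(obs_dim + hid_dim, hid_dim)
        self.head = torch.nn.Linear(hid_dim, act_dim)

    def forward(self, obs, peer_hidden, own_hidden):
        x = torch.cat([obs, peer_hidden], dim=-1)   # condition on the peer's memory
        new_hidden = self.cell(x, own_hidden)       # update own memory
        return self.head(new_hidden), new_hidden

def step_sequential(agents, obs, hiddens):
    """SE-MDP mode: agent i reads agent i-1's *current* hidden state (O(M) latency per step)."""
    acts, new_h = [], []
    peer = torch.zeros_like(hiddens[0])
    for i, agent in enumerate(agents):
        logits, h = agent(obs[i], peer, hiddens[i])
        acts.append(logits.argmax(-1))
        new_h.append(h)
        peer = h                                    # next agent sees the updated memory
    return acts, new_h

def step_decentralized(agents, obs, hiddens):
    """Decentralized mode: agent i reads its peer's hidden state from the *previous*
    timestep, so all agents can act in parallel at the current timestep."""
    acts, new_h = [], []
    for i, agent in enumerate(agents):
        peer = hiddens[i - 1] if i > 0 else torch.zeros_like(hiddens[0])
        logits, h = agent(obs[i], peer, hiddens[i])
        acts.append(logits.argmax(-1))
        new_h.append(h)
    return acts, new_h

# Usage sketch: 3 agents, batch of 1.
agents = [AgentPolicy(obs_dim=8, hid_dim=16, act_dim=5) for _ in range(3)]
obs = [torch.randn(1, 8) for _ in range(3)]
hiddens = [torch.zeros(1, 16) for _ in range(3)]
actions, hiddens = step_sequential(agents, obs, hiddens)
```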

Comment

Comment 3

The focus on S4 models without considering the possible limitations of these models (such as their effectiveness in high-dimensional continuous action spaces) reduces the generality of the conclusions. Additionally, the paper does not explore whether there are settings where transformers might outperform S4, such as tasks that require more flexible attention mechanisms.

Response Thank you for this insightful comment. Our work builds on the demonstrated success of S4-based models over Decision Transformers in single-agent tasks (Decision S4, Shmuel et al., 2023). This motivated our decision to leverage S4-based models within the SE-MDP framework for scalable cooperative multi-agent tasks. S4 models have consistently outperformed transformers in continuous control domains (e.g., single-agent MuJoCo), and we anticipate that S4-based MADS4 would either outperform or perform on par with transformer-based models like MADT and MADTKD, even though their performance in continuous control tasks has not been reported to our knowledge. The key advantage of S4-based models lies in their parameter efficiency and computational scalability. For example, our best-performing model achieves superior performance while using only 12% of the parameters of MADTKD (see the table below). Moreover, S4 enables constant-time, memory-efficient inference, unlike transformers, which scale quadratically with sequence length. These strengths make S4 particularly suited for our focus on scalable and efficient multi-agent reinforcement learning.

| Environments | (N=96, H=96) | (N=64, H=64) | (N=32, H=32) | (N=64, H=96) | (N=96, H=64) | (N=32, H=64) | MADTKD |
|---|---|---|---|---|---|---|---|
| 2 agents | 6.58 | 6.21 | 5.53 | 6.53 | 6.25 | 5.87 | 3.65 |
| 4 agents | 9.47 | 8.86 | 8.57 | 9.15 | 8.88 | 8.64 | 6.85 |
| 6 agents | 10.87 | 10.31 | 9.55 | 10.76 | 9.97 | 9.85 | 7.85 |
| % Parameters (Ours) | 100 | 60 | 40 | 81 | 82 | 55 | 100 |
| % Parameters (MADTKD) | 12 | 7 | 5 | 8 | 8 | 6 | 100 |

Comment 4 How does the sequential dependency between agents in SE-MDP affect performance in environments that require real-time parallel actions from all agents? Would MADS4 struggle in such settings, and if so, how could it be adapted?

Response Thank you for this comment. MADS4 can be adapted for fully decentralized settings with a minor adjustment to how memory information is shared. Instead of accessing the current timestep's memory from the previous agent in the sequence, each agent uses the memory information of the previous agent from the previous timestep. This approach allows all agents to take actions in parallel, as the required memory information from the previous timestep is already available when making decisions at the current timestep.

Comment 5 Have you considered testing MADS4 in environments with more dynamic interactions, such as continuous control tasks or real-world applications (e.g., autonomous driving, resource allocation)?

Response Thank you for this comment. S4 models have been shown to outperform transformers in continuous control domains (e.g., single-agent Mujoco), and we anticipate that S4-based MADS4 would also outperform or perform on par with transformer-based models like MADT and MADTKD, even though their performance in continuous control tasks has also not been reported to our knowledge.

Comment

Comment 6 How does MADS4 compare to architectures like graph neural networks, which are also designed for multi-agent cooperation? It would be beneficial to see a comparison with GNN-based approaches to understand the specific advantages of S4.

Response Thank you for this suggestion. Since our work focuses on developing an offline MARL algorithm within the sequence-to-sequence supervised learning paradigm, the most relevant benchmarks for comparison are transformer-based offline MARL algorithms, such as MADT and MADTKD, which also aim to learn policies in offline supervised learning settings. While graph-based algorithms are often employed in MARL to effectively model agent communication, they are primarily designed for online RL scenarios. Since no graph-based offline MARL baselines are currently available, we have not included them in our comparisons with MADS4.

Comment 7 Given that the S4 model relies on long-term dependencies, what are the trade-offs in using S4 over transformers in tasks with varying sequence lengths? Would transformers perform better in environments with shorter sequences or fewer agents?

Response Thank you for this insightful comment. Our findings indicate that S4-based MADS4 excels in handling longer contexts, as evidenced by its significantly better performance than transformer-based MADTKD on the RWARE environment, where trajectories span 500 timesteps. Notably, MADS4 achieves this superior performance while utilizing only 12% of the model size of MADTKD.

In the SMAC domain, where the longest trajectories (e.g., in the Corridor map) are limited to 100 timesteps, the performance gap between MADS4 and MADTKD narrows. We observed that reducing trajectory lengths leads to performance drops for both S4-based and transformer-based models. However, no consistent pattern emerged to suggest that MADS4 underperforms compared to MADT in shorter-context scenarios.

Comment 8 Can you provide more intuition or formal analysis regarding the gradient flow between agents in SE-MDP? How does this contribute to stability, and why does it improve learning compared to other methods?

Response Thank you for this helpful feedback. In response, we have added a more formal analysis of information flow in our SE-MDP framework in manuscript lines 233-253. Each agent's memory, represented by the hidden state of its S4 module, encodes information about all prior agents in the sequence. A projection of this latent state, $h^t_{i-1}$, is passed as input to the next agent along with its other inputs $\hat{u}^t_i$, influencing its action $a_i^t$ and its memory $h_i^t$:

$$a_i^t, h_i^t = \pi_i(\hat{u}_i^t, h_j^t; \theta_i),$$

where $j = i-1$.

During training, gradients flow backward through the shared latent states, enabling the entire system to learn cooperative strategies:
$$\frac{\partial J}{\partial \theta_i} = \frac{\partial J}{\partial a_i^t} \cdot \frac{\partial a_i^t}{\partial \theta_i} + \frac{\partial J}{\partial a_{i+1}^t} \cdot \frac{\partial a_{i+1}^t}{\partial h_i^t} \cdot \frac{\partial h_i^t}{\partial \theta_i}.$$

Here, $J$ represents the supervised loss function computed across all agents in the system. This sequential flow of information eliminates the need for an agent to communicate with more than one peer or identify useful collaborators, a challenge that grows with the number of agents.
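As an informal sanity check of the two-term gradient above (a toy construction, not the paper's model; the linear "agents", inputs, and loss are made up for illustration), automatic differentiation through a shared latent state reproduces exactly this flow, with agent 1's parameter gradient receiving a contribution through agent 2's action via the shared memory:

```python
import torch

# Toy illustration: agent 2's input includes agent 1's memory h1, so the joint
# loss J sends gradient back to theta1 both through a1 and through a2 via h1.
torch.manual_seed(0)
u1, u2 = torch.randn(4), torch.randn(4)       # per-agent inputs \hat{u}^t_i
theta1 = torch.randn(4, requires_grad=True)
theta2 = torch.randn(8, requires_grad=True)

h1 = torch.tanh(u1 * theta1)                  # agent 1's memory h^t_1
a1 = h1.sum()                                 # agent 1's action a^t_1
a2 = (torch.cat([u2, h1]) * theta2).sum()     # agent 2 acts on (u^t_2, h^t_1)

J = (a1 - 1.0) ** 2 + (a2 + 2.0) ** 2         # joint supervised loss over both agents
J.backward()
print(theta1.grad)  # includes the indirect term dJ/da2 * da2/dh1 * dh1/dtheta1
```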

We sincerely thank the reviewer for their valuable feedback. Following the clarifications and revisions made to address their comments and suggestions, we kindly request the reviewer to consider revising their score. We remain open and eager to address any further comments or feedback.

Comment

Dear Reviewer Ntod, Could you please read the authors' rebuttal and give them feedback at your earliest convenience? Thanks. AC

Comment

Dear Area Chair b1DV, I have read and modified the scores. Best, Ntod

Comment

Thank you for your feedback and comments. We have improved the clarity of our manuscript based on the insightful discussions and suggestions shared during the rebuttal.

Official Review

Rating: 3

This paper proposes using the Structured State Space Sequence (S4) model for offline MARL and tests it on the Multi-Robot Warehouse (RWARE) and StarCraft Multi-Agent Challenge (SMAC).

Strengths

This work is interesting in exploring whether offline MARL could benefit from advancements in models such as S4.

Weaknesses

(1) The motivation for using S4 instead of a transformer does not fully convince me. The paper claims, “Existing transformer-based approaches for offline MARL either train agents independently, without fully considering them as a multi-agent system, or rely on a centralized transformer model, which suffers from scalability issues.” However, the proposed method also requires the global environment state, and it seems to require the global environment state even during execution (this is a strong requirement!). How, then, does your method scale with respect to the number of agents? This needs further clarification.

(2) The paper introduces the setting of a Sequentially Expanded MDP, which implies a sequential structure in the decision-making process. Why limit the proposed algorithm to this setting? This requires justification. Additionally, how do you determine the order of agents? Is it done randomly, since your test cases do not have a predefined order?

(3) Another major concern is the information sharing with limited communication. This assumes that an agent can communicate with the previous agent who just made a decision. This is already a strong form of coordination and should not be assumed to come freely. Could you provide real-world examples where this is applicable? And how could the agents achieve this?

(4) Is the comparison with other works fair? The proposed algorithm is tested with this information-sharing mechanism and sequential structure, while other algorithms are not. If other algorithms had access to similar forms of communication and coordination, they might also benefit.

Questions

(1) It is necessary to discuss and compare your approach with CFCQL (see 1’), which is the state-of-the-art CQL-based method designed for offline MARL. It performs very well in SMAC domains, so a comparison would be valuable.

(2) The claim, “However, extending these regularization techniques to multi-agent learning presents significant challenges due to the exponential growth of the joint state-action space as the number of agents increases,” is inaccurate. CFCQL has extended CQL to the multi-agent setting, and the growth is linear with the number of agents, not exponential.

(3) This work states that "these algorithms do not provide guarantees of global-level regularizations and fail to explicitly or implicitly learn cooperative behavior." What exactly is meant by "global-level regularizations"? What are the global regularizations in the proposed method?

(4) For the 5m_vs_6m task, why do all algorithms fail to reach the performance level of the offline good dataset?

(5) How are the results obtained for OMAR on the SMAC task? The official repository only contains the continuous version of OMAR, with no discrete versions available. Did the authors implement the discrete-action version of OMAR themselves?

[1'] Shao, J., Qu, Y., Chen, C., Zhang, H., & Ji, X. (2024). Counterfactual conservative Q learning for offline multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36.

Comment

Comment 2

The paper introduces the setting of a Sequentially Expanded MDP, which implies a sequential structure in the decision-making process. Why limit the proposed algorithm to this setting? This requires justification. Additionally, how do you determine the order of agents? Is it done randomly, since your test cases do not have a predefined order?

Response Thank you for this feedback and these insightful questions. Our proposed algorithm MADS4 remains applicable under the setting of MADT, a transformer-based sequence-to-sequence offline MARL algorithm in which agents are trained independently. However, we have utilized the Sequentially Expanded MDP setting to incorporate scalable cooperation among the S4-based agents, where communication is limited to consecutive agents in the sequence. This setting yields better performance than the independent setting, where agents are trained independently. Utilizing the hidden state of the current agent's S4 module, information about all prior agents is efficiently passed down to the next agent, and gradients from agents later in the sequence are also backpropagated during training. Even though this communication mechanism induces a sequential flow of information over the sequence of agents, it eliminates the need for an agent to communicate with more than one agent in the system or to identify useful peers, which becomes challenging as the number of agents grows. Typical communication-based MARL algorithms scale poorly with the number of agents, as the memory required during training and execution becomes exceedingly large and scales quadratically with the number of agents. In contrast, the communication mechanism used in our work is relatively scalable, requiring only constant memory, since any agent only needs information from the previous agent in the sequence.

The order of the agents is decided randomly during training and execution. We have evaluated the effect of the order of agents on the performance by comparing two training settings of MADS4: 1) Random Order: where the order of the agents is randomly generated during training, and 2) Fixed Order: where the agents are trained in the fixed order in which the offline dataset is collected. The results in Figure 8 in Appendix Section C.4 show very little to no difference in performance between the two settings, which indicates MADS4 is robust to the order of agents in the SE-MDP framework.

Comment 3

Another major concern is the information sharing with limited communication. This assumes that an agent can communicate with the previous agent who just made a decision. This is already a strong form of coordination and should not be assumed to come freely. Could you provide real-world examples where this is applicable? And how could the agents achieve this?

Response Thank you for this comment. MARL with limited communication typically assumes that agents share information with useful peers in the system. This poses significant scalability challenges, since identifying useful peers for each agent becomes harder as the number of agents grows. Moreover, in typical algorithms, both training and execution communication memory scale poorly with the number of agents. In our work, utilizing the hidden state of the current agent's S4 module in the proposed SE-MDP framework, information about all prior agents is efficiently passed down to the next agent, and gradients from agents later in the sequence are also backpropagated during training. Even though this communication mechanism induces a sequential flow of information over the sequence of agents (which can be eliminated by a small modification, as indicated below), it removes the need for an agent to communicate with more than one agent in the system or to identify useful peers. Thus, the communication mechanism used in our work is relatively scalable, requiring only constant memory, since any agent only needs information from the previous agent in the sequence.

MADS4 can also be adapted for fully decentralized settings with a minor adjustment to how memory information is shared. To achieve a fully decentralized MARL setting, instead of accessing the current timestep's updated memory from the previous agent in the sequence, each agent uses the memory information of the previous agent from the previous timestep. This approach allows all agents to take actions in parallel, as the required memory information from the previous timestep is already available when making decisions at the current timestep. Since memory accumulates over multiple timesteps, relying on the previous timestep's information does not compromise performance, as demonstrated in Figure 10 in Appendix C.6. This simple modification enables our algorithm to function effectively in decentralized policy settings without performance degradation.

Comment

Comment 4

Is the comparison with other works fair? The proposed algorithm is tested with this information-sharing mechanism and sequential structure, while other algorithms are not. If other algorithms had access to similar forms of communication and coordination, they might also benefit.

Response Thank you for this comment. Our work focuses on developing an offline MARL algorithm within the sequence-to-sequence supervised learning paradigm. As a result, the most relevant benchmarks for comparison are transformer-based offline MARL algorithms, such as MADT and MADTKD, which also aim to learn policies in offline supervised learning settings. There is notably a lack of offline MARL baselines with communication; therefore, we could not compare our method against any existing methods that allow information sharing. To develop an equivalent transformer-based method, MADT needs to be augmented with memory in a Recurrent Memory Transformer style; this warrants a substantial amount of work in itself.

Comment 5

It is necessary to discuss and compare your approach with CFCQL (see 1’), which is the state-of-the-art CQL-based method designed for offline MARL. It performs very well in SMAC domains, so a comparison would be valuable.

Response Thank you for your thoughtful comment. Our work focuses on developing an offline MARL algorithm within the sequence-to-sequence supervised learning paradigm. Consequently, the most relevant benchmarks for comparison are transformer-based offline MARL algorithms, such as MADT and MADTKD, which also operate within offline supervised learning frameworks. Nevertheless, we have also compared our method with state-of-the-art offline RL-based baselines. For the SMAC domain, we utilized the offline datasets and baseline results provided by OMIGA (Wang, Xiangsen, et al., "Offline multi-agent reinforcement learning with implicit global-to-local value regularization," NeurIPS 2023), which reports state-of-the-art performance across various MARL benchmarks. While CFCQL offers strong baseline results on many SMAC maps, it relies on its own datasets, which differ from those used in our work. Therefore, a direct comparison would not be entirely fair, as the performance of any offline MARL algorithm is highly dependent on the quality and characteristics of the training dataset.

Comment 6

The claim, “However, extending these regularization techniques to multi-agent learning presents significant challenges due to the exponential growth of the joint state-action space as the number of agents increases,” is inaccurate. CFCQL has extended CQL to the multi-agent setting, and the growth is linear with the number of agents, not exponential.

Response Thank you for this insightful comment. In our literature review section, we wanted to give a brief overview of current offline MARL algorithms and their challenges, which stem from extrapolation errors on OOD state-action pairs whose space scales exponentially with the number of agents, as also highlighted in the CFCQL paper. The reviewer is right to point out that the CFCQL paper addresses this challenge by employing counterfactual regularization for each agent and combining these via a weighted average for global regularization. We have now added this paper appropriately to our literature review.

Comment

Comment 7

This work states that "these algorithms do not provide guarantees of global-level regularizations and fail to explicitly or implicitly learn cooperative behavior." What exactly is meant by "global-level regularizations"? What are the global regularizations in the proposed method?

Response Thank you for this question. As highlighted in prior literature (e.g., CFCQL, OMIGA), most regularization-based offline MARL algorithms apply regularization at the local or individual agent level. However, these approaches do not guarantee that locally regularized policies will remain jointly optimal under a given value decomposition method. Achieving jointly optimal policies from offline datasets requires global-level regularization imposed on the joint action space. OMIGA attempts to address this by claiming to achieve implicit global-to-local regularization.

In contrast, our proposed method takes a different approach. Rather than competing with regularization-based offline RL methods, we focus on learning joint policies from offline datasets using sequence-to-sequence supervised learning techniques. As such, the most relevant baselines for comparison are transformer-based offline MARL algorithms like MADT and MADTKD.

Unlike MADT, which trains agents locally, our approach aims to learn jointly optimal policies by leveraging the hidden state (memory) of the S4 module from each agent. In our proposed SE-MDP framework, this hidden state is passed to the next agent in sequence, enabling more effective cooperative behavior across the system. The hidden state of the current agent encapsulates information about all prior agents, allowing efficient information sharing in constant memory overhead. During training, gradients from agents later in the sequence are also backpropagated through this shared state, further reinforcing joint optimization and cooperation. To clarify this mechanism, we have revised lines 233–253 in the manuscript.

Comment 8

For the 5m_vs_6m task, why do all algorithms fail to reach the performance level of the offline good dataset?

Response Thank you for this question! While the exact reason is not entirely clear, a likely explanation is that the good offline dataset collected for the 5m_vs_6m task has very low variance and lacks diversity. As a result, the dataset fails to expose the algorithms to a wide range of scenarios, making it difficult for them to generalize effectively to new or previously unseen situations during evaluation.

Comment 9

How are the results obtained for OMAR on the SMAC task? The official repository only contains the continuous version of OMAR, with no discrete versions available. Did the authors implement the discrete-action version of OMAR themselves?

Response Thank you for this question. The results for OMAR on the SMAC task have been obtained from the OMIGA paper (Wang, Xiangsen, et al., "Offline multi-agent reinforcement learning with implicit global-to-local value regularization," NeurIPS 2023).

We sincerely thank the reviewer for their thorough feedback. Following the clarifications and additional analysis made to address their comments and suggestions, we kindly request the reviewer to consider revising their score. We would be happy to address any further comments or feedback.

Comment

We sincerely thank the reviewer for their critical feedback which helped us improve our paper. Below, we have provided our responses to the comments and suggestions.

Comment 1

The motivation for using S4 instead of a transformer does not fully convince me. The paper claims, “Existing transformer-based approaches for offline MARL either train agents independently, without fully considering them as a multi-agent system, or rely on a centralized transformer model, which suffers from scalability issues.” However, the proposed method also requires the global environment state, and it seems to require the global environment state even during execution (this is a strong requirement!). How, then, does your method scale with respect to the number of agents? This needs further clarification.

Response Thank you for this comment. Let us clarify how our proposed method compares against transformer-based methods in terms of scalability. The existing literature on transformer-based offline MARL typically involves training agents independently (MADT), where each agent is trained independently on its own sequence of observations, actions, and desired returns. Another transformer-based method, MADTKD, aims to incorporate cooperation among agents by learning a centralized teacher model, which requires information from all the agents involved in the problem, and distills individual agent policies from this teacher model using knowledge distillation. Learning the centralized teacher is not scalable with an increasing number of agents, as it introduces a quadratic computational overhead in the attention mechanism due to inter-agent interactions during training. In both of these methods, global environment state variables are used as inputs to the transformer models. Moreover, the global environment state is a fixed-size input, which may not pose scalability issues with respect to the number of agents. To analyze the effect of using the global state variable, we have also run experiments without passing the global state variable as input, and the results do not show a significant drop in performance. We show this ablation in Figure 9 in Appendix Section C.5.

We also want to clarify why we chose S4-based models over traditional transformers, particularly within the SE-MDP framework. S4-based models offer several advantages over transformers, including parameter efficiency (as shown below), the ability to handle longer context lengths, and faster inference times. In contrast to the quadratic runtime of transformers with respect to sequence length, S4 models operate with asymptotic constant time and memory. In addition to these advantages, the S4 model is associated with inherently learning an internal (hidden) state or memory. In our SE-MDP framework, we uniquely leverage this memory to facilitate limited communication between agents, where each agent only needs to access information from its immediate predecessor, shared through its S4-based latent state.

| Environments | (N=96, H=96) | (N=64, H=64) | (N=32, H=32) | (N=64, H=96) | (N=96, H=64) | (N=32, H=64) | MADTKD |
|---|---|---|---|---|---|---|---|
| 2 agents | 6.58 | 6.21 | 5.53 | 6.53 | 6.25 | 5.87 | 3.65 |
| 4 agents | 9.47 | 8.86 | 8.57 | 9.15 | 8.88 | 8.64 | 6.85 |
| 6 agents | 10.87 | 10.31 | 9.55 | 10.76 | 9.97 | 9.85 | 7.85 |
| % Parameters (Ours) | 100 | 60 | 40 | 81 | 82 | 55 | 100 |
| % Parameters (MADTKD) | 12 | 7 | 5 | 8 | 8 | 6 | 100 |
Comment

Thank you to the authors for their detailed responses. However, the responses still do not fully address my concerns. My biggest concern remains the information-sharing mechanism. While this mechanism is feasible in the SE-MDP setting, it may not be feasible in more general scenarios. The authors should propose a way to ensure that agents can (even approximately) obtain these hidden states from other agents. Passing a hidden state from the previous decision-making agent to the next should not come for free! Are there real-world examples where you could automatically achieve this sequential information-sharing pattern? Also, your proposed decentralized version is not really fully decentralized; it still needs access to other agents' hidden states. I view the SE-MDP not as a setting but as a strong requirement of the algorithm. I think this is limiting, and I am not convinced by the proposed approach. In its current form, I think the paper should not be accepted to ICLR.

Comment

Response

We would like to clarify again that our proposed algorithm falls into the category of multi-agent learning with limited communication. Algorithms in this category enable some form of information sharing among agents, whether through observations, actions, memory, or other means. For example, many MARL works with communication (e.g., Sukhbaatar et al., 2016; Foerster et al., 2016; Ma et al., 2024; Guan et al., 2022; Wang et al., 2020) allow agents to access the observations of other agents. The ACE paper (Li et al., 2023) allows the sharing of actions between agents. A concurrent work ("Shared Memory for Multi-agent Lifelong Pathfinding"), also falling under this category, allows agents to access a shared memory buffer containing the memory information of all the agents in the problem when making decisions.

In our work, we develop an offline MARL algorithm based on S4 under the paradigm of multi-agent learning with communication. Here, we focus on utilizing the concise S4 hidden state representation, or memory, of only one agent to be accessed by another agent in the system. Importantly, our work is not constrained to the Sequentially Expanded MDP (SE-MDP) framework. Agents can also access the memory information of a peer agent from the previous timestep, which is already available when making decisions at the current timestep. This mechanism supports decentralized decision-making, eliminating the need for a strictly sequential flow of information. As shown in Figure 10, accessing a peer agent's memory from the previous timestep does not negatively impact performance. The S4 models' ability to effectively capture long-context memory ensures that the memory from the previous timestep serves as a highly reliable proxy for the current timestep. This allows our framework to maintain its efficacy and facilitate decentralized decision-making with minimal communication overhead.

Our method remains largely applicable in practical settings where communication can be allowed while making decisions. Examples include traffic junction control [1], combat tasks [1], and infrastructure management planning [7], where the states of the agent components are almost always available and can be used for better system performance. MADS4 leverages the inference efficiency of S4-based models and utilizes a fixed-size, concise memory representation derived from S4 latent states. This design enables constant time and memory requirements per agent during the decision-making process, ensuring scalability and efficiency.

[1] Sukhbaatar, Sainbayar, and Rob Fergus. "Learning multiagent communication with backpropagation." Advances in neural information processing systems 29 (2016).

[2] Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, volume 29, 2016.

[3] Ma, Chengdong, et al. "Efficient and scalable reinforcement learning for large-scale network control." Nature Machine Intelligence (2024): 1-15.

[4] Guan, Cong, et al. "Efficient multi-agent communication via self-supervised information aggregation." Advances in Neural Information Processing Systems 35 (2022): 1020-1033.

[5] Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly decomposable value functions via communication minimization. In Proceedings of the International Conference on Learning Representations, 2020b.

[6] Li, Chuming, et al. "Ace: Cooperative multi-agent q-learning with bidirectional action-dependency." Proceedings of the AAAI conference on artificial intelligence. Vol. 37. No. 7. 2023.

[7] Leroy et al., "IMP-MARL: A Suite of Environments for Large-Scale Infrastructure Management Planning via MARL," Advances in Neural Information Processing Systems 36, 2024

We have revised the manuscript to better clarify the positioning of our algorithm within the existing literature. We hope this discussion addresses the reviewer's concern and convinces them to reconsider their evaluation. We would be happy to address any additional feedback.

Comment

I have concerns about the proposed information-sharing mechanism for a reason. It requires agents to access the hidden state of the previous decision-making agent, which involves not only information sharing but also a form of coordination. How are the agents able to determine which agent just made an action, how to communicate with that agent, and how synchronization among agents is achieved? Why are the agents assumed to always perfectly identify and communicate with the previous agent? Could this lead to deadlocks or other potential issues?

Even in the decentralized version, this kind of coordination is still assumed to come for free. However, this work does not propose any mechanisms to achieve such information sharing. I feel that if we take this type of coordination for granted, it becomes easier to design algorithms that perform well on certain benchmarks. However, this approach is not particularly useful and is unlikely to have a positive impact on either the advancement of MARL algorithms or their applications in the real world.

Comment

Response

The coordination can be established in any random order. As demonstrated in Figure 8, the sequence/order in which agents interact with each other does not affect the outcome. This is only possible because agents share information through their memory or hidden state. As a result, an agent interacting with any other agent is also informed about its peer's behavior via the shared memory. Therefore, any order of agents can be selected for information sharing, eliminating the need to identify useful peers or communicate with more than one agent in the system, a task that becomes increasingly challenging as the number of agents grows.

Thank you for your question. We hope this clarifies that MADS4 does not require any additional coordination scheme to identify peers. Any agent can choose to share information with any other agent in the system.

Comment

I am still not convinced. It is not about order! Why could other agents perfectly identify the agent who just made an action? Suppose an agent fails to identify the agent who just made a decision - is it possible for it to encounter a deadlock and wait forever? Also, even if all agents are globally visible to each other and one agent has just made a decision, should all agents reach out to that particular agent to get its hidden state? What if two or more agents made decisions at the same time? How do the agents select one to make the next decision? How do the agents, as a decentralized system, know that all agents have made a decision this turn and that it is time to proceed to a new round of decisions? There is clearly some kind of non-trivial coordination required, but the proposed work has not proposed a way to achieve this coordination and simply takes it for free!

Comment

Response

Our method is designed to operate in decision-making settings typical of multi-agent reinforcement learning (MARL), where all agents make decisions at each timestep. Each agent will always take an action; no-op (do nothing) is always available as an action choice for an agent. During this process, each agent can share or rely on information from a fixed peer agent. By "random order," we mean that the peer for each agent can be selected arbitrarily, eliminating the need for any coordination mechanism to identify specific useful peers. The only requirement is that each agent communicates with a unique peer through a one-to-one communication channel, which can be established randomly before the start of an episode.

To enable decentralized or parallel actions across agents, each agent accesses the hidden state or memory of its peer from the previous timestep. This eliminates the need to wait for the peer to make decisions at the current timestep, thereby avoiding sequential dependencies at the current decision step. Since decisions from the previous timestep have already been made, the corresponding hidden states or memories are readily available for use when making decisions at the current timestep.

We hope this explanation addresses the reviewer's concern.

Comment

I do not agree. Typical MARL algorithms do not require this sequential information-sharing mechanism. Furthermore, why is "communicating with a unique peer through a one-to-one communication channel" considered sufficient? Agents must at least have the ability to identify who just made a decision and determine who is next. What if multiple agents try to communicate with the agent that just made a decision, each assuming they are the next to act? Even with a predefined fixed order, what happens if an agent is unavailable? The predefined next agent would not know whether the previous agent is unavailable or has simply not yet made a decision, which could potentially lead to a deadlock (waiting forever is different from performing a no-op!). Or in such cases, how are the agents going to figure out a new fixed order?

The authors should at least propose a procedure for agents to achieve the suggested information-sharing mechanism, rather than ignoring these challenges and claiming that this is simply your setting and the rest comes for free. This hides many details and could be unrealistic. It might be easy to hack SMAC, but such an approach is neither meaningful nor useful, and this is not a good sign for future MARL research.

Comment

Given the number of questions raised, we believe it would be most effective to address them individually for clarity, as below.

Comment

Typical MARL algorithms do not require this sequential information-sharing mechanism

Response It is important to clarify that MADS4 is not restricted to operating solely within the SE-MDP framework. If agents need to make decisions simultaneously at the current timestep, each agent in the network can rely on its peer's memory information from the previous timestep to make decisions at the current timestep.

When decisions are made at the current timestep, all decisions from the previous timestep have already been finalized. As a result, the memory information of all agents is readily available for use. This eliminates the need for any agent to wait for its peer to decide at the current timestep. By utilizing the memory information from the previous timestep, agents can make decisions without relying on sequential dependencies during the current timestep.

Comment

Why is "communicating with a unique peer through a one-to-one communication channel" considered sufficient?

Response Thank you for this question. Communicating with a single peer is sufficient because the agents share memory information. To illustrate, consider a scenario with three agents (1, 2, and 3). If agent 1 shares its memory with agent 2, and agent 2 shares its memory with agent 3, there is no need for agent 3 to directly communicate with agent 1. This is because agent 2's memory already encapsulates the information from agent 1, effectively propagating agent 1's behavior to agent 3 indirectly. As a result, each agent only needs to communicate with one unique peer to ensure that every agent's information is passed across the network.

Comment

What if multiple agents try to communicate with the agent that just made a decision, each assuming they are the next to act?

Response This scenario will not arise, since each agent only has to communicate with one unique peer. For instance, if agent 3 communicates with agent 2, no other agent will attempt to communicate with agent 2. Instead, they can communicate with agent 3, which already encapsulates information about agent 2 through its shared memory.

The selection of unique peers can be predetermined randomly as any permutation of agent numbers (1 to m) before the start of an episode, thereby establishing a clear communication structure in advance (no optimization or sophisticated coordination mechanism needed). Once the episode begins, every agent knows exactly which peer to communicate with from the generated permutation, preventing any confusion or redundancy during the decision-making process.
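A minimal sketch of such a pre-episode peer assignment appears below. It is purely illustrative: the function name, interface, and the use of a single random permutation are hypothetical choices for realizing the scheme described above, not the authors' code.

```python
import random
from typing import Dict, Optional

def assign_peers(num_agents: int, seed: Optional[int] = None) -> Dict[int, Optional[int]]:
    """Draw one random agent order before the episode starts; each agent's unique
    peer is its predecessor in that order (the first agent has none). No further
    coordination is needed during the episode."""
    rng = random.Random(seed)
    order = list(range(num_agents))
    rng.shuffle(order)
    peers: Dict[int, Optional[int]] = {order[0]: None}
    for prev, curr in zip(order, order[1:]):
        peers[curr] = prev
    return peers

# Each agent reads memory only from its single assigned peer, e.g.:
print(assign_peers(4, seed=0))
```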

Comment

Even with a predefined fixed order, what happens if an agent is unavailable?

Response We interpret the reviewer’s query as addressing real-world scenarios where an agent might become unavailable due to various reasons. However, this scenario is not addressed by any MARL algorithm, whether it is online (e.g., QMIX, MAPPO) or offline (e.g., MADT, CFCQL).

In established MARL benchmark environments such as MAMuJoCo, SMAC, or RWARE, it is always assumed that all agents complete their actions to proceed to the next time step. If an agent becomes unavailable (e.g., is "killed" in combat), it typically defaults to taking a no-op action for the remainder of the episode. Consequently, its actions and memory information remain available, ensuring that parallel decision-making across all agents at the current timestep is unaffected.

The notion of an agent being "unavailable" also appears ambiguous in this context. Every agent’s policy is implemented as a neural network model, which outputs actions based on current observations or inputs without interruption. Similarly, in our proposed algorithm, MADS4, each agent is represented by an S4-based model that always predicts the next action and updates its memory. These predictions are based on current observations and the peer's memory from the previous timestep, which is guaranteed to be available because decisions for the previous timestep are already complete. Thus, scenarios where an agent becomes unavailable are not encountered in MARL benchmarks and are not addressed by any MARL algorithm.

That said, real-world scenarios where communication channels between agents might fail or become unreliable could pose challenges. However, such issues fall outside the scope of reinforcement learning algorithms and pertain more to the reliability of the communication infrastructure. In MARL algorithm design with communication, it is typically assumed that agents can reliably share information. If the reviewer is referring to specific challenges or scenarios beyond these assumptions, we kindly request further clarification to better address the concern.

Comment

I am still not a fan of the proposed hidden-state information-sharing mechanism. It is too restrictive and makes a strong assumption. It is more like a trick to hack these test benchmarks. While it might lead to better performance on benchmarks like SMAC, it does not bring any real advances in terms of MARL algorithms. We should not design MARL algorithms specifically for these benchmarks while ignoring that they are impractical. Also, the comparison with other algorithms is not fair, as they do not use this information-sharing mechanism. QMIX or other RNN-based methods could benefit from this mechanism and perform better on these benchmarks. It solves no fundamental problems nor advances the field. I believe I have spent enough time discussing this information-sharing mechanism and do not want to discuss it further.

Comment

We sincerely thank the reviewer for their valuable feedback and comments. To summarize, our work focuses on developing an offline MARL algorithm within the sequence-to-sequence supervised learning paradigm. Consequently, the most relevant benchmarks for comparison are transformer-based offline MARL algorithms, such as MADT and MADTKD, which also aim to learn policies in offline supervised learning settings.

Currently, there is a notable lack of offline MARL baselines that incorporate communication, limiting our ability to compare directly with existing methods that support information sharing. Creating an equivalent transformer-based method for comparison, such as augmenting MADT with memory in the style of a Recurrent Memory Transformer, would require substantial additional development. Communication is often allowed in online MARL settings (e.g., the concurrent work “Shared Memory for Multi-Agent Lifelong Pathfinding,” where agents use a shared memory buffer to access information from all agents), and our algorithm offers a distinct offline MARL approach. Specifically, it is based on S4 models, which inherently learn a hidden state representing each agent’s memory and require information only from one other agent in the environment.

Regarding the choice of benchmark environments, offline MARL algorithms rely on pre-collected offline datasets, which are typically available and easy to collect in existing MARL benchmark environments. As a result, most (if not all) offline MARL algorithms are evaluated in these well-established domains. In line with this standard, we compared our approach against state-of-the-art offline RL-based and transformer-based supervised learning-based offline MARL algorithms across these widely used benchmarks.

We would like to once again thank the reviewer for their insightful discussions and feedback during the rebuttal process.

Official Review

Rating: 5

This paper investigates using Structured State Space Sequence (S4) models for multi-agent offline reinforcement learning. The authors extend the Decision S4 model, originally designed for single-agent RL, to a multi-agent setting, following the decision transformer approach that frames offline RL as a sequence modeling problem. To enable cooperation among agents, the authors adopt a sequentially expanded MDP (SE-MDP) paradigm, dividing each decision-making step into mini-steps. Agents share information through the latent representation of S4 models. The proposed approach is evaluated on a subset of the Multi-Robot Warehouse (RWARE) and StarCraft2 Multi-Agent Challenge (SMAC) benchmarks. Experimental results demonstrate that the proposed approach outperforms several baseline methods.

Strengths

Originality & Significance

While there is abundant research on using S4 for reinforcement learning, there is limited work on adapting S4 in a multi-agent offline RL setting. This work is well-placed in the literature and could provide the community with valuable insights on the effectiveness of S4 in multi-agent offline RL.

Clarity

The paper is well-structured and generally easy to follow. However, some technical details require further clarification. For instance, the model inference process and communication costs could be elaborated upon. Please refer to the "Weaknesses" section for a more detailed discussion.

Weaknesses

  1. The reviewer has some concerns about the experimental results presented in the paper. Specifically, only one sequence-based baseline (MADT) is used, and the improvements demonstrated by the proposed method over this baseline are relatively small, with some falling within confidence intervals. To strengthen the experimental section, the reviewer suggests comparing the proposed method with more competitive baselines, such as [a], which have been reported to achieve better results on SMAC than MADT. This would provide a more convincing demonstration of the effectiveness of the proposed method.

  2. The reviewer found some technical contents that need further clarification. Specifically, the mechanism for information sharing during the testing and inference phases remains unclear. The current configuration, which appears to involve a chain of information sharing, could potentially introduce dependencies between agents during inference. If the reviewer understands correctly, the authors use the convolutional nature of S4 to address this issue. However, the precise methodology remains unclear. This is a crucial technical aspect of the paper and requires further elaboration.

  3. The reviewer found some citations missing, which may lead to confusion. For example, the Methodology section discusses the concept of Sequentially Expanded MDPs without providing any accompanying citations. This omission could lead to confusion among readers regarding the original contributions of this work versus those of prior research. The reviewer recommends adding appropriate citations, such as [b], and explicitly differentiating between the elements that are drawn from existing literature and the novel contributions introduced in this work. This clarification will enhance the overall transparency and understanding of the research presented.

[a] Offline Multi-Agent Reinforcement Learning with Knowledge Distillation, Tseng et al, 2022

[b] ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency, Li et al, 2022

Questions

  1. In a sequentially expanded MDP, how to decide the order of agents? Does the order have a significant impact on the overall performance?

  2. L52 - L53. Please provide citations for the mentioned regularization methods. It is unclear to the reviewer what ‘these regularization techniques’ are referring to and why it is challenging.

  3. Could you elaborate on the potential communication cost and delay by sharing the information among all agents? Would it be scalable to hundreds of agents?

  4. There are some inconsistencies between Figure 2 and Algorithm 1. In Figure 2, the agent doesn’t take observations as input, but in Algorithm 1 it does.

  5. In Algorithm 1, what is d_i^t? Please provide a formal definition.

Comment

We sincerely thank the reviewer for their thoughtful feedback and for acknowledging the originality of our work in implementing S4 for offline multi-agent tasks. Below, we have provided our responses to the comments and suggestions.

Comment 1

The reviewer has some concerns about the experimental results presented in the paper. Specifically, only one sequence-based baseline (MADT) is used, and the improvements demonstrated by the proposed method over this baseline are relatively small, with some falling within confidence intervals. To strengthen the experimental section, the reviewer suggests comparing the proposed method with more competitive baselines, such as [a], which have been reported to achieve better results on SMAC than MADT. This would provide a more convincing demonstration of the effectiveness of the proposed method.

Response We would like to thank the reviewer for this comment and suggestion. While we would like to compare the performance of MADS4 against MADTKD, there is no public codebase or reported performance available for MADTKD that would allow us to evaluate it on some of these SMAC maps. However, we have compared our performance with MADTKD in the challenging RWARE domain in Table 1. This domain has been demonstrated in prior literature to be particularly challenging due to its longer context lengths of 500 timesteps (in contrast to the maximum trajectory length of 100 timesteps encountered in the SMAC Corridor domain) and higher-dimensional observations. In this setting, MADS4 significantly outperforms MADTKD.

Comment 2

The reviewer found some technical content that needs further clarification. Specifically, the mechanism for information sharing during the testing and inference phases remains unclear. The current configuration, which appears to involve a chain of information sharing, could potentially introduce dependencies between agents during inference. If the reviewer understands correctly, the authors use the S4 convolutional nature to address the issue. However, the precise methodology remains unclear. This is a crucial technical aspect of the paper and requires further elaboration.

Response Thank you for this comment. In the SE-MDP framework used in this work, information sharing is needed only between consecutive agents, and each agent cooperates with its preceding agent in the sequence to take better cooperative actions. By sharing information through the hidden-state representation of the S4 model, which accumulates information about all previous agents in the sequence, an agent does not need to communicate with all of its predecessors, only with its immediate predecessor, allowing for a scalable communication mechanism. This induces a sequential decision-making process (which can be avoided with a small modification, detailed next), where each decision step takes linear O(m) time, with m the number of agents. However, the communication itself requires only constant O(1) memory overhead, since any agent needs to communicate with only one other agent, while still achieving bi-directional cooperative behavior with the entire team: the shared memory representation holds information about all preceding agents, and gradients are backpropagated from the agents ahead in the sequence. Moreover, by leveraging the convolutional view of the S4 model, training is executed efficiently and in parallel over the time-sequence length for each agent.

MADS4 can also be adapted for fully decentralized settings with a minor adjustment to how memory information is shared. To achieve a fully decentralized MARL setting, instead of accessing the current timestep's updated memory from the previous agent in the sequence, each agent uses the memory information of the previous agent from the previous timestep. This approach allows all agents to take actions in parallel, as the required memory information from the previous timestep is already available when making decisions at the current timestep.
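To make this concrete, below is a minimal sketch of the per-timestep decision loop under this communication scheme, covering both the sequential variant (each agent reads its predecessor's memory from the current timestep) and the decentralized variant (each agent reads its predecessor's memory from the previous timestep). The class `S4AgentStub`, its `step()` signature, and all dimensions are hypothetical placeholders standing in for the actual MADS4 agents, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch (not the authors' code): per-timestep action selection where
# each agent receives only the S4 hidden state of its immediate predecessor.

class S4AgentStub:
    """Stand-in for an S4-based policy with a fixed-size hidden (memory) state."""
    def __init__(self, obs_dim, act_dim, hidden_dim, seed):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(size=(hidden_dim, obs_dim + hidden_dim))
        self.W_out = rng.normal(size=(act_dim, hidden_dim))

    def step(self, obs, predecessor_memory):
        # Fold the predecessor's memory into this agent's new hidden state.
        h = np.tanh(self.W_in @ np.concatenate([obs, predecessor_memory]))
        action = int(np.argmax(self.W_out @ h))
        return action, h

obs_dim, act_dim, hidden_dim, n_agents = 8, 5, 16, 3
agents = [S4AgentStub(obs_dim, act_dim, hidden_dim, seed=i) for i in range(n_agents)]
memory = [np.zeros(hidden_dim) for _ in range(n_agents)]   # last shared hidden states

def act(obs_batch, decentralized=False):
    """Sequential mode: read the predecessor's memory updated at this timestep.
    Decentralized mode: read the predecessor's memory from the previous timestep,
    so all agents can act in parallel."""
    snapshot = [m.copy() for m in memory] if decentralized else memory
    actions = []
    for i, agent in enumerate(agents):
        pred = snapshot[i - 1] if i > 0 else np.zeros(hidden_dim)
        action_i, memory[i] = agent.step(obs_batch[i], pred)
        actions.append(action_i)
    return actions

rng = np.random.default_rng(0)
for t in range(3):                                          # a few toy timesteps
    obs_batch = rng.normal(size=(n_agents, obs_dim))
    print(t, act(obs_batch, decentralized=True))
```

In either mode, each agent handles a single fixed-size memory vector from one peer, which is what gives the constant per-agent communication cost discussed above.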

Leveraging S4-based models enables per-agent inference to be performed in constant memory and time, offering a significant advantage over attention-based transformer models like MADT and MADTKD, which incur quadratic runtime relative to the sequence length. This algorithm is thus highly scalable and applicable in scenarios with limited communication budgets, as it achieves constant time and memory communication overhead with the number of agents.

Comment

Comment 3

The reviewer found some citations missing, which may lead to confusion. For example, the Methodology section discusses the concept of Sequentially Expanded MDPs without providing any accompanying citations. This omission could lead to confusion among readers regarding the original contributions of this work versus those of prior research. The reviewer recommends adding appropriate citations, such as [b], and explicitly differentiating between the elements drawn from existing literature and the novel contributions introduced in this work. This clarification will enhance the overall transparency and understanding of the research presented.

Response Thank you for this comment. We would like to point out that we had cited the mentioned paper in Lines 71-76 as follows: “To model cooperation among agents, we adopt a sequentially expanded MDP (SE-MDP) paradigm, where each decision-making step is divided into mini-steps. This approach has also been recently used in online MARL settings (Li et al., 2023), where each agent takes an action at a given mini-step based on the actions of all preceding agents in the sequence. In our work, we allow limited communication between the agents, with each agent receiving information only from its immediate predecessor in the sequence.”

In the SE-MDP framework proposed by Li et al. (2023), agents share information through their actions, with each agent's action depending on the actions of all preceding agents in the sequence. In contrast, our approach introduces sharing representations of the hidden state of the S4 model, which encapsulates the accumulated memory of an agent. By leveraging this hidden state, we enable the transfer of information about all preceding agents while limiting the information exchange to consecutive agents in the SE-MDP sequence. This approach ensures a more scalable communication mechanism, where each agent requires information only from its immediate predecessor. Consequently, the framework scales efficiently to a larger number of agents, maintaining a constant O(1) communication overhead per agent, compared to the linear O(m) overhead per agent, where m is the number of agents, in the original SE-MDP framework proposed by Li et al. (2023).
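As a rough illustration of the difference in communication load (a sketch with assumed message counts, not measurements from the paper): under the action-dependency scheme, agent i receives the actions of all i-1 predecessors at every decision step, whereas under hidden-state sharing every agent after the first receives exactly one message.

```python
def messages_action_dependency(m: int) -> int:
    # Agent i receives i-1 messages, so the per-step total is 0 + 1 + ... + (m-1).
    return m * (m - 1) // 2

def messages_hidden_state_sharing(m: int) -> int:
    # Every agent except the first receives exactly one message per step.
    return max(m - 1, 0)

for m in (3, 10, 100):
    print(m, messages_action_dependency(m), messages_hidden_state_sharing(m))
# m=3: 3 vs 2,  m=10: 45 vs 9,  m=100: 4950 vs 99
```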

Comment 4

In a sequentially expanded MDP, how to decide the order of agents? Does the order have a significant impact on the overall performance?

Response Thank you for this insightful comment. We have evaluated the effect of the order of agents on the performance by comparing two training settings of MADS4: 1) Random Order: where the order of the agents is randomly generated during training, and 2) Fixed Order: where the agents are trained in the fixed order in which the offline dataset is collected. The results in Figure 8 in Appendix Section C.4 show very little to no difference in performance between the two settings, which indicates that MADS4 is robust to the order of agents in the SE-MDP framework. However, we would suggest training the agents in a random order to prevent introducing any bias in the learning process.

Comment 5

L52 - L53. Please provide citations for the mentioned regularization methods. It is unclear to the reviewer what ‘these regularization techniques’ refer to and why extending them is challenging.

Response Thank you for this comment. In response, we have added appropriate citations for the regularization methods used in single-agent offline RL and revised manuscript lines 47-53 to better articulate the challenges of extending these methods to multi-agent settings. Specifically, in single-agent offline RL, distribution shift is mitigated by enforcing explicit or implicit constraints on the learned policy to ensure it remains within the support of the offline dataset. However, applying such constraints to the global joint policy space in multi-agent settings is significantly more challenging. The exponential growth of the joint action space with the number of agents can lead to sparse and ineffective constraints, making it difficult to prevent out-of-distribution actions at the global level.
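To illustrate the scale of this problem (with made-up numbers, purely for intuition): the joint action space grows as |A|^m, so the fraction of joint actions that any fixed-size offline dataset can possibly cover shrinks exponentially with the number of agents.

```python
actions_per_agent = 10            # assumed per-agent action count (illustrative)
observed_joint_actions = 100_000  # assumed distinct joint actions in an offline dataset

for m in (2, 4, 6, 8):
    joint_space = actions_per_agent ** m
    max_coverage = min(1.0, observed_joint_actions / joint_space)
    print(f"m={m}: joint actions = {joint_space:,}, max dataset coverage = {max_coverage:.4%}")
```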

Comment

Comment 6

Could you elaborate on the potential communication cost and delay by sharing the information among all agents? Would it be scalable to hundreds of agents?

Response We provide a detailed analysis of the communication costs, including memory and time requirements for information sharing between agents, and how these scale with the number of agents. We have also revised the manuscript to incorporate these details.

Computational Requirements During Training

The training process unfolds over two sequences:

  1. Timesteps for each agent: Training over timesteps is efficiently parallelized using the convolutional view of the S4 model.
  2. Agents: Training over the number of agents may occur sequentially or in parallel, depending on which memory information is shared. Sharing the current, updated memory information induces a linear flow of information from one agent to the next. However, if the previous timestep’s memory information is used as a surrogate for the current memory information, the agents can act in parallel, since the required memory from the previous timestep is already available when decisions are made at the current timestep.

Each agent shares its hidden state representation from the S4 model, encapsulating information about all preceding agents in the sequence. This eliminates the need for direct communication between non-adjacent agents. As a result:

  • Memory: Only the hidden state of the previous agent is required, resulting in constant O(1) memory usage.
  • Computation time: Decisions can be made in constant time, simultaneously across all agents, if the previous timestep’s memory information is used. If, however, decisions are made sequentially, this leads to a worst-case linear time complexity of O(m) for the last agent, where m is the number of agents.

During training, the parallelized training over timesteps for each agent (the number of timesteps is typically larger than the number of agents) ensures training efficiency.

Computational Requirements During Inference

The same communication mechanism remains valid during inference, maintaining constant-memory and constant-time decisions at each timestep. This avoids the quadratic or near-quadratic communication overheads often encountered in MARL settings with limited communication. Moreover, S4-based MADS4 offers constant-memory and constant-time inference relative to the sequence length for each agent at any time, compared to the quadratic runtime relative to the sequence length of attention-based transformer architectures like MADT and MADTKD, thus offering a scalable and efficient solution for multi-agent coordination.
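A back-of-the-envelope operation count (assumed, not measured, and assuming the transformer attends over the full, untruncated context) illustrates this scaling claim: causal attention at step t touches the t tokens seen so far, so total per-episode work grows roughly quadratically with the horizon T, whereas the S4 recurrence performs one constant-size state update per step.

```python
def transformer_attention_ops(T: int) -> int:
    # Sum over steps of the number of past tokens attended to: 1 + 2 + ... + T.
    return T * (T + 1) // 2

def s4_state_updates(T: int) -> int:
    # One fixed-size hidden-state update per step.
    return T

for T in (100, 500):   # horizons mentioned above for SMAC Corridor and RWARE
    print(T, transformer_attention_ops(T), s4_state_updates(T))
# T=100: 5050 vs 100,  T=500: 125250 vs 500
```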

Comment 7

There are some inconsistencies between Figure 2 and Algorithm 1. In Figure 2, the agent doesn’t take observations as input, but in Algorithm 1 it does.

Response Thank you for pointing this out. We have mentioned in Lines 266-267 that the state of each agent at each time step is composed of the global environment state and its local observation. We have also clarified this in the caption of Figure 2.

Comment 8

In Algorithm 1, what is d_i^t? Please provide a formal definition.

Response Thank you for pointing this out as well. d_i^t denotes the done or termination signal of an episode. We have now defined this formally in Algorithm 1.

We sincerely thank the reviewer once again for their valuable feedback. In light of the clarifications and revisions made to address their comments and suggestions, we kindly request the reviewer to reconsider their score. We are happy to address any further comments or feedback.

Comment

Dear Reviewer R7cj, Could you please read the authors' rebuttal and give them feedback at your earliest convenience? Thanks. AC

Comment

Thanks to the authors for the response. The response addressed most of my questions, including the order of agents and the computation-time overhead. However, I still have concerns about the experimental results presented in the paper. The improvements of the proposed method are relatively small, with some falling within confidence intervals.

Comment

Response

We appreciate the reviewer’s additional comment and are glad that our responses addressed most of their concerns. Our proposed algorithm, MADS4, is designed to compete with transformers as a sequence-to-sequence learning-based offline MARL framework. As such, the most relevant baselines for comparison are transformer-based offline MARL algorithms like MADT and MADTKD. We selected S4-based models due to their strong performance, comparable to or exceeding that of MADT and MADTKD, particularly in the RWARE domain, while offering several other distinct advantages over transformers:

  1. Parameter Efficiency: S4-based MADS4 requires only about 12% of the model size of MADTKD, making it significantly more resource-efficient.
  2. Handling Longer Context Lengths: S4-based models excel in scenarios with longer trajectories, as observed in the RWARE domain (typically 500 timesteps). This allows MADS4 to outperform MADTKD in such environments.
  3. Faster Inference Times: Unlike transformers, which require quadratic runtime relative to sequence length, S4-based models offer constant time and memory inference.

Moreover, the S4-based model inherently learns an internal (hidden) state or memory. In our framework, this memory property of an agent is uniquely leveraged to foster cooperation among agents, further enhancing the effectiveness of MADS4 in multi-agent learning tasks.

We again thank the reviewer for their comments and suggestions that helped us improve our manuscript. We hope this discussion adequately addresses the reviewer's comment above.

Comment

Thank you for your valuable feedback, which has helped us enhance the quality and clarity of our work. We are pleased that our rebuttal addressed most of your concerns. Based on your suggestions, we have revised the manuscript and believe our responses effectively address your remaining comments. In light of these revisions and clarifications, we sincerely hope you will reconsider your evaluation. As the discussion period is closing soon, please let us know if you have any remaining questions or need further clarification; we are more than willing to provide additional clarification.

Review
3

This paper introduces a Structured State Space Sequence (S4) model to enhance the learning performance of offline pre-training combined with online fine-tuning in multi-agent reinforcement learning (MARL). Experimental results demonstrate that the proposed approach outperforms both offline RL-based and transformer-based MARL algorithms across several SMAC and RWARE tasks.

Strengths

This paper is the first to apply the S4 model to sequence-based offline MARL tasks, contributing a novel framework for improving performance in these settings.

Weaknesses

First of all, the writing in the paper lacks clarity, and the logical flow is sometimes hard to follow. There are several instances of grammatical errors and awkward phrasing that undermine the overall readability of the paper.

The paper primarily combines the S4 model with the SE MDP framework, which does not represent a significant departure from existing methods. While the S4 model itself is innovative, its application to SE MDP does not present a new solution that clearly advances the field. The motivation for the proposed method is not fully convincing. The paper asserts that the introduction of the S4 model to the SE MDP framework provides an advantage over MADT; however, MADT is also compatible with SE MDP, and the paper does not sufficiently clarify the comparative benefits of S4. The paper should provide a more detailed analysis comparing S4 with other multi-agent methods (such as MADT, QMIX, etc.) in the same experimental settings, particularly focusing on key aspects like cooperation, information sharing, and stability. A thorough comparison would help determine whether S4 offers substantial improvements over current approaches.

It is easy to see that using the S4 model improves representational ability, and thus the results are not surprising. The experimental evaluation is insufficient, and the paper would benefit from comparisons across a broader range of SMAC maps. Additionally, the paper mentions that online fine-tuning has limited effectiveness, with some cases showing performance degradation. However, the authors do not provide a clear explanation of why this occurs, nor do they offer potential solutions to mitigate this issue. It is worth noting that in the SMAC 2c vs 64zg task, the proposed MADS4 method shows lower average returns with on-policy training without offline pretraining compared to MAPPO (which achieves a 100% win rate and an average return of 20.0), raising concerns about the practical effectiveness of the method.

Questions

  1. What does the sequential decision-making process bring to the cooperation in MARL, particularly in environments with limited communication or delayed feedback?
  2. Could you provide a detailed computational comparison between the MADS4 and MADT methods, particularly in terms of training time, scalability, and computational complexity?
  3. In Figure 4, could you clarify the number of interactions represented on the x-axis? Specifically, how many interactions correspond to the 60×1e3 training iterations?
Comment

Comment 4

What does the sequential decision-making process bring to the cooperation in MARL, particularly in environments with limited communication or delayed feedback?

Response Thank you for this question. The sequential decision-making formulation as an SE-MDP, with communication limited to consecutive agents in the sequence, is beneficial, as evidenced by our model's better performance over the independent setting in which agents are trained independently. Through the hidden state of the current agent's S4 module, information about all prior agents is efficiently passed to the next agent, and gradients from agents later in the sequence are also backpropagated during training. Even though this communication mechanism induces a sequential flow of information over the sequence of agents, it eliminates the need for an agent to communicate with more than one agent in the system or to identify useful peers, which becomes challenging as the number of agents grows. Typical communication-based MARL algorithms scale poorly with the number of agents, as the memory required during training and execution becomes exceedingly large and grows quadratically with the number of agents. In contrast, the communication mechanism used in our work is scalable, requiring only constant memory, since any agent only needs information from the previous agent in the sequence.

MADS4 can also be adapted for fully decentralized settings with a minor adjustment to how memory information is shared. To achieve a fully decentralized MARL setting, instead of accessing the current timestep's updated memory from the previous agent in the sequence, each agent uses the memory information of the previous agent from the previous timestep. This approach allows all agents to take actions in parallel, as the required memory information from the previous timestep is already available when making decisions at the current timestep.

Comment 5

Could you provide a detailed computational comparison between the MADS4 and MADT methods, particularly in terms of training time, scalability, and computational complexity?

Response Thank you for this question. We have provided a detailed comparison of the computational complexity and model size (number of parameters) between MADS4 and MADTKD, the state-of-the-art MADT architecture.

Computational Complexity During Training

MADS4 leverages the convolutional view of S4 modules, enabling parallelizable training across timesteps. Specifically, S4 modules use a parallel scan operation, achieving a training complexity of O(log(N)) for a sequence of length N with N parallel processors. In contrast, the transformer-based MADT architecture has a quadratic complexity of O(N^2), which, while parallelizable, is significantly higher.
MADT trains agents independently, resulting in suboptimal performance compared to MADTKD, which uses a centralized transformer to model agent interactions. However, MADTKD's centralized approach introduces a quadratic computational overhead in the attention mechanism due to inter-agent interactions during training. In MADS4, cooperation among agents is achieved by limiting communication to consecutive agents in the sequence. This design requires only constant memory, as each agent only accesses the latent state of its predecessor. This communication mechanism induces a recurrent, sequential flow of information between agents, resulting in a linear worst-case complexity of O(M) for the last agent in the sequence, where M is the number of agents, which is typically smaller than the number of timesteps N involved in decision-making. However, by employing a simple adjustment (passing the previous agent's hidden state from the previous timestep instead of the current updated hidden state), we can enable fully decentralized policies across agents. This modification removes the need for sequential, recurrent communication, thereby eliminating the linear O(M) time complexity.
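To make the convolutional view concrete, the sketch below implements a toy diagonal linear state space layer (a simplification, not the actual S4/HiPPO parameterization used by MADS4) and verifies that its step-by-step recurrence and its convolution with the unrolled kernel produce the same outputs; the convolutional form is the one that parallelizes training over the sequence length.

```python
import numpy as np

# Toy diagonal state space layer (illustrative only; the real S4 layer uses the
# HiPPO/DPLR parameterization). Shows the equivalence of the recurrent view
# (used for constant-time online inference) and the convolutional view
# (used for parallel training over the sequence length).

rng = np.random.default_rng(0)
N, L = 16, 100                           # state size, sequence length
A = -np.abs(rng.normal(size=N)) - 0.1    # stable diagonal state matrix (assumed)
B = rng.normal(size=N)
C = rng.normal(size=N)
dt = 0.1                                 # discretization step (assumed)
Abar = np.exp(A * dt)                    # zero-order-hold discretization, diagonal case
Bbar = (Abar - 1.0) / A * B

u = rng.normal(size=L)                   # scalar input sequence

# Recurrent view: x_k = Abar * x_{k-1} + Bbar * u_k,  y_k = C . x_k
x = np.zeros(N)
y_recurrent = np.empty(L)
for k in range(L):
    x = Abar * x + Bbar * u[k]
    y_recurrent[k] = C @ x

# Convolutional view: y = u * Kbar with kernel Kbar_k = C . (Abar**k * Bbar)
Kbar = np.array([C @ (Abar ** k * Bbar) for k in range(L)])
y_convolutional = np.convolve(u, Kbar)[:L]   # in practice computed with an FFT

print("views agree:", np.allclose(y_recurrent, y_convolutional))
```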

Computational Complexity During Inference

MADS4 offers highly scalable runtime performance, with constant memory and time per timestep, irrespective of the sequence length. In contrast, MADT's inference complexity scales quadratically with the sequence length for each agent due to its attention mechanism.

Model Size and Performance Comparison

MADS4 achieves superior performance compared to MADT and MADTKD while being far more parameter-efficient. For example, on the RWARE small map, the best-performing MADS4 model (input channels H = 96 and hidden state size N = 96) requires only about 12% of the parameters used by MADTKD. Despite this, MADS4 delivers better results, as demonstrated in the experiments.
We hope this addresses your query and highlights the computational and performance benefits of MADS4 over MADT and MADTKD.

Comment

Comment 6

In Figure 4, could you clarify the number of interactions represented on the x-axis? Specifically, how many interactions correspond to the 60×1e3 training iterations?

Response Thank you for this question. During the online fine-tuning stage, 100 environment interactions (number of timesteps) are generated and stored in the buffer per training iteration. Therefore, 60x1e3 training iterations correspond to 6M environment timesteps. The advantage of offline pretraining is particularly noteworthy here, as it leverages just 1,000 episode trajectories—amounting to approximately 50,000 environment interactions—to establish a strong initial foundation for subsequent online fine-tuning.

We would like to once again thank the reviewer for their comments and suggestions. Following the revisions and clarifications provided, we kindly request the reviewer to consider revising the score. We are happy to address any further comments or feedback.

Comment

I apologize for the delayed response, as I’ve been occupied with a project. I’ve reviewed the responses and some of the comments from other reviewers. The clarifications provided in the rebuttal have addressed several details, including the sequential decision-making mechanism and the computational complexity. However, two key concerns still remain:

  • Fine-Tuning Results: The fine-tuning results are not convincing. As shown in Figure 4 of the revised paper, fine-tuning on SMAC 6h vs 8z (a Super-Hard task) performs even worse than training from scratch. This raises questions about the effectiveness of the proposed method. Additionally, the fine-tuning process requires over 6 million environment samples, yet it shows no substantial improvement, while the offline training method—using only 50,000 to 100,000 samples—achieves significant performance gains. Given this, I find it difficult to justify the value of online fine-tuning in these two domains.

  • Information Sharing Mechanism: The proposed information-sharing mechanism, which requires access to the last hidden state of other agents, introduces a new concern. Specifically, this could be restrictive during the inference phase, as the hidden states would need to be shared in real-time across the group, which could limit the scalability and flexibility of the approach.

While the rebuttal has certainly enhanced the clarity and quality of the work, I believe there are still areas that could be improved.

Comment

Response

We sincerely thank the reviewer for their comments and suggestions above. Below, we address each point in detail.

Comment

Fine-Tuning Results

Response

We appreciate the reviewer’s question and their previous feedback, which guided us in addressing this issue in Section C.3 of the Appendix. The observed degradation in performance during fine-tuning of the pre-trained S4-based model likely stems from the recurrent training of the S4 kernel parameters (A, B, and C). These parameters are initially trained offline using a stable convolutional operation.

To mitigate the degradation in performance during fine-tuning, we froze the S4 kernel parameter A, which governs the state-to-state transitions independent of inputs. By fixing A during fine-tuning, we allow only the input-dependent parameters (B and C) to be updated. This approach significantly improves on-policy fine-tuning performance without any adverse effects, as demonstrated in Figure 7 of the Appendix.
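A minimal sketch of this fine-tuning recipe, assuming a toy layer layout (the real S4 kernel is parameterized differently, and the module and parameter names here are hypothetical): freeze the state-transition parameter A after offline pretraining and hand only the remaining parameters to the optimizer.

```python
import torch
from torch import nn

class ToyS4Layer(nn.Module):
    """Illustrative stand-in for an S4 kernel with parameters A, B, C."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.A = nn.Parameter(torch.randn(state_dim))   # state-to-state transition
        self.B = nn.Parameter(torch.randn(state_dim))   # input-to-state map
        self.C = nn.Parameter(torch.randn(state_dim))   # state-to-output map

    def forward(self, u, x):
        # One recurrent step on a toy diagonal system: x' = exp(A)*x + B*u, y = <C, x'>.
        x_next = torch.exp(self.A) * x + self.B * u
        return (self.C * x_next).sum(dim=-1), x_next

layer = ToyS4Layer(state_dim=16)
# ... load the offline-pretrained weights into `layer` here ...

layer.A.requires_grad_(False)                            # freeze A for online fine-tuning
trainable = [p for p in layer.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)         # updates only B and C
```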

In light of this feedback, we propose moving this discussion from the Appendix to the main manuscript, along with additional analyses conducted during the rebuttal phase, to strengthen the presentation of our findings.

Comment

Comment

Information Sharing Mechanism

Response

Thank you for this comment. Our work develops an offline MARL algorithm under the category of multi-agent learning with communication, where agents are allowed to share information in terms of observations, actions, or other information. Common approaches in this field typically involve agents sharing peer observations (e.g., Sukhbaatar & Fergus, 2016; Foerster et al., 2016; Ma et al., 2024; Guan et al., 2022; Wang et al., 2020). In the ACE paper (Li et al., 2023), the agents are allowed to share actions. In a concurrent work (“Shared Memory for Multi-agent Lifelong Pathfinding”), also falling under this category, the agents are allowed to access a shared memory buffer, containing the memory information of all the agents in the problem, for making decisions.

In contrast, our method requires each agent to communicate with only one peer in the network. This eliminates the need for centralized optimization or coordination to identify communication pairs, as the peer can be chosen uniquely but arbitrarily. Communicating with a single, unique peer is sufficient because the agents share memory information. To illustrate, consider a scenario with three agents (1, 2, and 3). If agent 1 shares its memory with agent 2, and agent 2 shares its memory with agent 3, there is no need for agent 3 to directly communicate with agent 1. This is because agent 2’s memory already encapsulates the information from agent 1, effectively propagating agent 1’s behavior to agent 3 indirectly. As a result, each agent only needs to communicate with one unique peer to ensure that every agent’s information is passed across the network. Thus, in this design, each agent requires only constant-memory communication per decision step, as the S4-based hidden state has a fixed-size representation.

Now, for real-time parallel action execution across agents, each agent in the network can rely on its peer’s memory information from the previous timestep to make decisions at the current timestep. When decisions are made at the current timestep, all decisions from the previous timestep will already be finalized. As a result, the memory information of all agents is readily available for use. This eliminates the need for any agent to wait for its peer to decide the current timestep. By utilizing the memory information from the previous timestep, agents can make decisions without relying on sequential dependencies during the current timestep. Since memory accumulates over multiple timesteps, relying on the previous timestep's information does not compromise performance, as demonstrated in Figure 10 in Appendix C.6.

In response to this comment, we plan to move this discussion from the Appendix to the main manuscript to better address the reviewer’s points. We sincerely thank the reviewer again for their constructive feedback and hope this clarification adequately addresses their concerns.

[1] Sukhbaatar, Sainbayar, and Rob Fergus. "Learning multiagent communication with backpropagation." Advances in neural information processing systems 29 (2016).

[2] Jakob Foerster, Ioannis Alexandros Assael, Nando De Freitas, and Shimon Whiteson. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, volume 29, 2016.

[3] Ma, Chengdong, et al. "Efficient and scalable reinforcement learning for large-scale network control." Nature Machine Intelligence (2024): 1-15.

[4] Guan, Cong, et al. "Efficient multi-agent communication via self-supervised information aggregation." Advances in Neural Information Processing Systems 35 (2022): 1020-1033.

[5] Tonghan Wang, Jianhao Wang, Chongyi Zheng, and Chongjie Zhang. Learning nearly decomposable value functions via communication minimization. In Proceedings of the International Conference on Learning Representations, 2020b.

[6] Li, Chuming, et al. "Ace: Cooperative multi-agent q-learning with bidirectional action-dependency." Proceedings of the AAAI conference on artificial intelligence. Vol. 37. No. 7. 2023.

Comment

We sincerely thank the reviewer for their detailed feedback. Below, we provide our responses to the comments and concerns raised.

Comment 1

First of all, the writing in the paper lacks clarity, and the logical flow is sometimes hard to follow. There are several instances of grammatical errors and awkward phrasing that undermine the overall readability of the paper.

Response We sincerely appreciate the reviewer’s feedback and have made substantial revisions to enhance the paper's clarity and readability, incorporating the provided comments and suggestions.

Comment 2

The paper primarily combines the S4 model with the SE MDP framework, which does not represent a significant departure from existing methods. While the S4 model itself is innovative, its application to SE MDP does not present a new solution that clearly advances the field. The motivation for the proposed method is not fully convincing. The paper asserts that the introduction of the S4 model to the SE MDP framework provides an advantage over MADT; however, MADT is also compatible with SE MDP, and the paper does not sufficiently clarify the comparative benefits of S4. The paper should provide a more detailed analysis comparing S4 with other multi-agent methods (such as MADT, QMIX, etc.) in the same experimental settings, particularly focusing on key aspects like cooperation, information sharing, and stability. A thorough comparison would help determine whether S4 offers substantial improvements over current approaches.

Response We appreciate the reviewer’s comment and the general recognition of our novel implementation of S4-based models in offline MARL. We want to clarify why we chose S4-based models over traditional transformers, particularly within the SE-MDP framework. S4-based models offer several advantages over transformers, including parameter efficiency (as shown below), the ability to handle longer context lengths, and faster inference times. In contrast to the quadratic runtime of transformers with respect to sequence length, S4 models operate with asymptotically constant time and memory. In addition to these advantages, the S4 model inherently learns an internal (hidden) state, or memory. In our SE-MDP framework, we uniquely leverage this memory to facilitate limited communication between agents, where each agent only needs to access information from its immediate predecessor, shared through its S4-based latent state. Through this hidden state of the current agent's S4 module, information about all prior agents is efficiently passed down to the next agent, and gradients flow backward from the current agent through this shared memory to the previous agents during training. This sequential flow of information eliminates the need for an agent to communicate with more than one peer or to identify useful collaborators, a challenge that grows with the number of agents. In contrast to typical communication-based MARL algorithms, which scale poorly due to the quadratic growth in memory requirements during training and execution, our mechanism is highly efficient, requiring only constant memory per agent. This sharing of information also stabilizes the learning process and results in better performance, as evidenced in the figure comparing agents trained independently with agents trained with information sharing between them.

Environments              (N=96, H=96)  (N=64, H=64)  (N=32, H=32)  (N=64, H=96)  (N=96, H=64)  (N=32, H=64)  MADTKD
2 agents                  6.58          6.21          5.53          6.53          6.25          5.87          3.65
4 agents                  9.47          8.86          8.57          9.15          8.88          8.64          6.85
6 agents                  10.87         10.31         9.55          10.76         9.97          9.85          7.85
% Parameters (Ours)       100           60            40            81            82            55            100
% Parameters (MADTKD)     12            7             5             8             8             6             100
Comment

Comment 3

It is easy to see that using the S4 model improves representational ability, and thus the results are not surprising. The experimental evaluation is insufficient, and the paper would benefit from comparisons across a broader range of SMAC maps. Additionally, the paper mentions that online fine-tuning has limited effectiveness, with some cases showing performance degradation. However, the authors do not provide a clear explanation of why this occurs, nor do they offer potential solutions to mitigate this issue. It is worth noting that in the SMAC 2c vs 64zg task, the proposed MADS4 method shows lower average returns with on-policy training without offline pretraining compared to MAPPO (which achieves a 100% win rate and an average return of 20.0), raising concerns about the practical effectiveness of the method.

Response We thank the reviewer for their thoughtful comments, which we address point-by-point below. To validate our algorithm, we selected two environments with distinct characteristics. The first is the RWARE domain, which features high-dimensional observations and challenging learning patterns with long context lengths (up to 500 timesteps) in offline datasets. While MADT typically trains with truncated context lengths to reduce computational costs, MADS4 efficiently processes full trajectory lengths with significantly fewer parameters than MADT. The second environment is SMAC, where we evaluated performance on two hard and two super-hard maps. These maps used offline datasets provided by OMIGA, the current state-of-the-art across these scenarios. The longest trajectory encountered was 100 timesteps in the challenging Corridor map. MADS4 demonstrated comparable or superior performance to OMIGA on these maps, showcasing its robustness and effectiveness in handling diverse tasks.

The offline training of MADS4 takes place using convolution over entire trajectories in the offline datasets. For online fine-tuning, MADS4 is trained using recurrence on the online-collected transition tuples. Recurrent training, however, is prone to error accumulation, which can degrade performance. To mitigate this effect, we propose freezing the offline pre-trained S4 kernel parameter A during online fine-tuning and updating only parameters B and C, ensuring stability and improved adaptation in the fine-tuning phase. To clarify this point, we have added lines 473–478 to the manuscript and Figure 7 in the Appendix.

The comparison between the online-trained version of MADS4 without pretraining and MAPPO may not be entirely fair for two key reasons. First, MADS4's online training relies on the recurrent view of the S4 modules, which inherently results in lower performance. Second, MADS4 is designed as a sequence learning-based offline reinforcement learning algorithm, capable of achieving an average return of 20.0 on the SMAC map 2c vs 64zg using just 1,000 offline trajectories (equivalent to 50,000–100,000 environment steps). In contrast, online MARL algorithms like QMIX and MAPPO typically require significantly more interactions—on the order of 10 million environment steps—to achieve comparable performance. Importantly, offline pre-trained MADS4 provides a strong starting point for further fine-tuning with any online MARL algorithm, combining the strengths of both approaches.

Comment

Dear Reviewer zdmt, Could you please read the authors' rebuttal and give them feedback at your earliest convenience? Thanks. AC

Comment

Thank you for your valuable feedback. We are glad our rebuttal addressed most of your concerns and improved the quality and clarity of our work. We have revised the manuscript based on your suggestions and believe our responses effectively address your additional comments. In light of these revisions and clarifications, we sincerely hope you will reconsider your evaluation. As the discussion period is closing soon, please let us know if you have any remaining questions or need further clarification; we are more than willing to provide additional clarification.

Comment

We sincerely thank all the reviewers for their detailed comments and valuable suggestions. As the paper revision deadline is tomorrow, we kindly request the reviewers to review our rebuttal and share any additional feedback. This will enable us to address any remaining concerns thoroughly in the revised manuscript.

AC Meta-Review

This paper proposes to use the state space model S4 for multi-agent tasks. However, the idea is only moderately motivated, without a clear intuition. The evaluation domains are on the relatively simple side of the spectrum, which might undermine the true evaluation of the proposed approach. Furthermore, the comparison is not comprehensive as it only compares with the transformer-based approach, neglecting part of the literature that approaches these tasks from other perspectives. Overall, this paper is below the bar of ICLR.

Additional Comments from the Reviewer Discussion

The rebuttal addressed some concerns of the reviewers.

Final Decision

Reject