/10

Poster4 位审稿人

最低2最高5标准差1.1

ICML 2025

Sable: a Performant, Efficient and Scalable Sequence Model for MARL

Omayma Mahjoub,Sasha Abramowitz,Ruan John de Kock,Wiem Khlifi,Simon Verster Du Toit,Jemma Daniel,Louay Ben Nessir,Louise Beyers,Juan Claude Formanek,Liam Clark,Arnu Pretorius

OpenReview PDF

提交: 2025-01-23更新: 2025-07-24

TL;DR

We develop a state-of-the-art, memory efficient and scalable cooperative multi-agent reinforcement learning algorithm that leverages retention.

摘要

关键词

Multi-agent reinforcement learningReinforcement LearningSequence modelinglinear recurrent modelsRetentive Networks

评审与讨论

审稿意见

评分: 32025-03-10

This paper proposes to use retentive networks to process multiple agents' observations and actions in MARL. The proposed framework can scale to a large number of agents. Extensive experiments and analysis are conducted. As a result, the proposed framework shows performance improvements in 34/45 of tasks and also achieves memory efficiency compared to MAT and IPPO.

给作者的问题

In euqation 5, do you share K dan V for different agents and different time steps?
in page 4: which decomposition theorem are you referring to? In the auto-regressive way, agents' actions are dependent in a order. How would \hat{h}_0 be?
it is unclear about the \tau and \tau_prev. The author needs to explain how does \tau and \tau_prev differ from the subscript i used in Equation 3.
it is unclear whether you refer to a chunk as a set of agents or a set of observations over time step. In line 205, you are talking about processing agents in parallel. However, in equation 6, L is actually analogous to B in euqation 3 so a chunk is a set of observations over time step. Based on this, using \tau to refer to a chunk can be confused since \tau refers to a trajectory. Moreover, in equation 6, i <= Nt_do is also confusing since i is agent index so only when t_do=0 you will get i > Nt_do. And, I'm not sure what j stands for. Is it an agent or a time step?
it is confusing that in section 2 you are using continuing tasks while in equation 4 you consider epsiodes. If you are using episodic tasks, Equation 6 will never achieve a second terminal time step (therefore t_do is the terminal time step rather than the first terminal timestep).
it is not clear how's the decay matrix updated according to your equation (where equation 3 explicitly incorporate exponentially updates)
since the GPU memories with chunk size 128 is below 3.3 and the performance of Sable seems to be insensitive to the size of chunk. Would it be possible to use a smaller chunk size, e..g, 8, for all tasks? What if the number of agents increase as well?

论据与证据

Regarding the memory used, I'm wondering if agents share parameteres, espicially on Q, V, K matrices, which may lead to linearly increased memory with respect to agents and time steps.

方法与评估标准

Equation 6 is unclear and confusing. Please read Questions For Authors

理论论述

No theoretical claims provided

实验设计与分析

The experiments are cndense. I'm wondering what would be the effect when the chunk size and the number of agents increase jointly.

补充材料

Yes

与现有文献的关系

Due to the auto-regressive way used in generating actions, the paper may relate to extensive-form games or MARL which considers communication among agents.

遗漏的重要参考文献

其他优缺点

This paper provides extensive details about the algorithms, implementatio, and experiments with figures and tables.

其他意见或建议

it is a bit confusing to concurrently use subscript for agent index and time step. besides, it is confusing to use i refers to both agent index and a chunk.

作者回复

2025-04-01

Thank you for the feedback. Your comments on the retention equations and other aspects of our work have helped us identify areas for improvement. We address your questions and comments below.

wondering if agents share parameteres, espicially on Q, V, K matrices

Sable uses a single network for all agents. The QVK matrices are not unique per agent, but shared in the sense that each agent is treated as another element in the sequence. While longer sequences incur higher memory usage, RetNets address this by constraining computational memory to a fixed chunk size.

the effect when the chunk size and the number of agents increase jointly

The chunk size can be increased as the number of agents increases, but this will affect memory requirements. The optimal setting is to process the entire training sequence at once, but if this isn't feasible, Sable allows the chunk size to be tuned to maximize hardware utilization and enable training on arbitrarily large sequences.

which decomposition theorem are you referring to

Please see our reply to D22g marked (**).

agents' actions are dependent in a order. How would \hat{h}_0 be?

As shown in Line 177 Column 2, $\hat{h}\_{0}$ is $h\_{t-1}^{dec}$ , where $h_{t-1}^{dec}$ is the decayed hidden state from the previous timestep. $\hat{h}$ is an intermediary variable that accumulates the hidden state over agents within a single timestep. This is decayed once per timestep to produce $h_{t}^{dec}$ . Regarding the agents’ actions order dependency, we mitigate any potential bias by shuffling agent order during training.

it is unclear about the \tau and \tau_prev using \tau to refer to a chunk can be confused

(***) Thank you for going through our paper in such detail, we will make the following changes to the paper to help with the clarity of our method.

For the chunkwise representation, we split a trajectory $\tau$ consisting of $L$ timesteps, with $N$ agents gathered during inference into smaller chunks each of length $C$ such that the retention for chunk $i$ can be given as:

\begin{aligned}Q_{[ \tau_{i} ]} &= Q_{C(i-1):Ci}, K_{[\tau_i]} = K_{C(i-1):Ci}, V_{[\tau_i]} = V_{C(i-1):Ci} \\h_i &= K^T_{[\tau_i]} \left( V_{[\tau_i]} \odot \zeta \right) + \delta \kappa^{ \lfloor L/C \rfloor } h_{i-1}, \zeta = D_{N \cdot \lfloor L/C \rfloor, 1:N \cdot \lfloor L/C \rfloor} \\\text{Ret}(\boldsymbol x_{[\tau_i]}) &= \left( Q_{[\tau_i]} K^T_{[\tau_i]} \odot D \right) V_{[\tau_i]} + \left( Q_{[\tau_i]} h_i \right) \odot \xi \\\text{where } \ & \xi_{j} = \begin{cases} \kappa^{\left\lfloor j / N \right\rfloor + 1}, & \text{if } j\leq Nt_{d_0} \\ 0, & \text{if } j > Nt_{d_0} \\\end{cases}.\end{aligned}

Here $h$ is an intermediary variable carrying information from one chunk to the next, $h_0$ is the hidden state at the beginning of $\tau$ that will be used for training, $\zeta$ is the last row of the decay matrix that is created from the data of the chunked trajectory for chunk $i$ and $\xi$ is a duplicated column vector. Please see our answer to (C1) of reviewer qBu1.

This removes the confusing $\tau$ and $\tau_{prev}$ notation from the text and should link Equations 3 and 6 more clearly. It also removes $i$ from the definition of $\xi$ giving clarity around $Nt_{d_0}$ , as $j$ is now an index in $\xi$ .

it is unclear whether you refer to a chunk as a set of agents or a set of observations over time step

Sable's flexibility allows for treating either entire rollouts with multiple agents at each timestep, or just the number of agents, as the training sequence length. With E environments, L timesteps, N agents, and C chunks, the default training batch shape is $(E, NL)$ , divisible into $(E, N [L / C] )$ size chunks. When using only the number of agents as the sequence length, the shape is $(EL, N)$ , divisible into $(EL, N/C)$ size chunks.

in section 2 you are using continuing tasks while in equation 4 you consider epsiodes

All tasks we consider have fixed length time horizons and termination conditions and we allow for environments to automatically reset once an episode terminates. Thus, for a fixed rollout length it is possible for there to be multiple terminal timesteps.

not clear how's the decay matrix updated according to your equation

In our case, the decay matrix is blockwise lower diagonal of size $(LN, LN)$ with block size $(N, N)$ where $N$ is the number of agents. Each element is exponentially decayed given its position in time for a given trajectory which follows Equation 2. We discuss our adaptations to the decay matrix after Equation 6 in Lines 213-240 and give an example in Appendix D.

Would it be possible to use a smaller chunk size, e..g, 8, for all tasks?

This is possible, but not practical. Smaller chunks during training use less memory at the cost of wall clock time. In practice, we train with a chunk size that is as large as our computational memory permits for the fastest training wall clock time.

审稿人评论

2025-04-05

I confirm that I have read the author response to my review and will update my review in light of this response as necessary.

审稿意见

评分: 22025-03-11

The paper proposes to replace the attention mechanism in Multi-Agent Transformers with Retentive Networks and shows that such tweak (called Sable method by the paper) leads to improvement in the following three dimensions: strong performance, memory efficiency and scalability. The paper evaluates the Sable method on multiple multi-agent benchmarks to show its performance against the baseline methods spanned from independent learning, centralized training with decentralized execution, to centralized learning.

给作者的问题

The paper provides the hyperparameter search space for Sable, MAT etc. but what are the final configs used to produce the reported results?

论据与证据

The claim on the scalability may be a bit untenable. My understanding of RetNet is that it was proposed to reduce the memory cost in inference time, assuming that the model is well trained and everything else would remain the same as the transformers. So, in this sense, Sable is essentially the same as MAT and should be a centralized method. So why Centralized Learning is deemed as not memory efficient or scalable. But Sable, as one of such kind, is scalable? Further, if the memory constraint is the main issue in MAT when scaling up to large numbers of agents, we can then resort to some memory-efficient optimizations of transformers (E.g., SGD instead of Adam) or memory-efficient transformers? Can the authors explain why RetNet is required specifically here?
The claim of “a new sequence model for MARL” is also a bit debatable. The Sable is no different than MAT in the sequence modelling perspective. They are both centralized methods that take the whole sequence from all agents as input and output the joint actions autoregressively.

方法与评估标准

The method part is rather vague in general and there is not much info on this in the main text. The paper, possibly intentionally, puts the method details (implementation details) into the appendix, which I guess might imply that the algo details of Sable are pretty like MAT: they both use the PPO-like training for the actor and critic updates, as specified in Algorithm 1, and the observation sequence encodings from the encoder are fed into the decoder to produce actions.

However, it’s unclear what $o_b$ is. What does $b$ mean here?
In the main text, “The decoder takes a similar sequence but of actions instead of observations as input”, which implies the decoder only takes the actions as input (different from the algorithm).
Again, “we use MAT-style single-timestep sequences to optimise memory usage and reserve chunking to be applied across agents.”, which implies that Sable does not take the trajectories as input (different from algorithm).
Furthermore, “this change to the encoder makes it unable to perform full self-retention across agents, as it cannot be applied across chunks” what does it mean here? The Sable does not rely on the retention then?

理论论述

There are no theoretical proofs and claims in this paper.

The paper mentions "It is this autoregressive action selection which leverages the advantage decomposition theorem to give Sable theoretically grounded convergence guarantees." But there is no such analysis throughout the paper, including appendix. It's even unsure what advantage decomposition theorem it refers to (no references provided).

实验设计与分析

In section 4.2, the degradation of IPPO performance looks a bit suspicious: any reasons why it happened? Can it be addressed by normalizing the returns or learning rate decay? Regarding the memory usage, is it in the training or in the inference?

补充材料

Checked the appendix in detail, especially D. Sable implementation details, and C. Hyperparameters

与现有文献的关系

n/a

遗漏的重要参考文献

there are quite a lot of papers on reducing the memory cost in training transformers, e.g., Memory-efficient Transformers via Top-k Attention, Memory Efficient Continual Learning with Transformers, and in transformer optimizations, e.g., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Full Parameter Fine-tuning for Large Language Models with Limited Resources. The paper should discuss how these are related and should consider some of them as the baseline improvements for transformer architectures used in MAT

其他优缺点

Strengths: the empirical results are promising
Weakness: the paper in its current version does not present the method in a clear and convincing way; the paper also misses quite a large chunk of related work on memory-efficient transformers and the optimizations.

其他意见或建议

There is an error in Line 20 in algorithm 1: should be gradient descent for $\phi$ .

伦理审查问题

n/a

作者回复

2025-04-01

Thank you for your clarifying questions and feedback. We provide detailed responses below.

The claim on the scalability may be a bit untenable

RetNets' chunkwise formulation allows us to process long sequences in small chunks, scaling to arbitrarily long sequences regardless of whether they consist of many agents at a single timestep or over multiple timesteps. This scalability difference is evident in Figure 4, where MAT scales much worse than Sable.

Sable’s details are vague, it is essentially the same as MAT, both are centralized methods, the claim of "a new sequence model for MARL" is debatable

We emphasize that Sable introduces a fundamentally different sequence modeling approach compared to MAT. While both Sable and MAT are CL methods, Sable can reason temporally, which allows it to model sequences of agents over time and capture long-term dependencies, unlike MAT, which is limited to reasoning within a single timestep.

Due to the page limit, we did not have enough space to add all the method details in our initial submission but they can all be found in the appendix. We have an extra page for the camera-ready version and will move these method details into the main text to improve clarity.

What does b mean

The b subscript denotes a batch of trajectories from the buffer. We acknowledge that this notation could be made more clear, and we will update Algorithm 1 and the notation accordingly to avoid confusion.

the decoder only takes the actions as input

We are only referring to the first block of the decoder, the second block performs cross retention with the output of the first block and the encoded observations as can be seen in Figure 13.

Sable does not take the trajectories as input

This statement referred to is specifically to the scaling strategy described in Section 3 (Scaling the number of agents) for handling thousands of agents. It represents a variant of Sable optimized for extremely large agent counts, not the default implementation used in most experiments. The algorithm in Appendix D shows the full version of Sable that conditions on trajectories.

Sable does not rely on the retention

Sable still relies on retention. This statement referred to a limitation when using the agent-chunking scaling strategy described in Section 3. When chunking across a large number of agents, self-retention can only be applied per chunk, which means that agents in different chunks do not “retend” to each other. When the number of agents all fit into a single chunk, it is possible to perform full self-retention across all agents as they all fit within the same chunk.

This provides a design trade-off to the user when considering the scale of the problem at hand and the memory available. Furthermore, we would like to point out that in Figure 4, we show that even though Sable cannot perform full self-retention across all agents (only per chunk) at large scales, it still achieves the best performance, whereas MAT is not even able to fit into memory at the extreme end.

unsure what advantage decomposition theorem it refers to

(**) We are referring to the advantage decomposition theorem (ADT) originally derived in [Kuba (2022)] (there called Lemma 1). This theorem underpins the Fundamental Theorem of Heterogeneous-Agent Mirror Learning (HAML) as mentioned on Line 58 Column 2 in the introduction. We will amend the text to make the link between HAML and the ADT clear.

degradation of IPPO performance looks a bit suspicious

Please refer to our answer to a similar question from reviewer pmav marked (*).

memory usage, is it in the training or in the inference?

The memory usage reported in Figure 5 is measured during training as this is when the bulk of the memory is used.

On the usage of memory efficient transformers

Memory efficient transformers without a dual form do not maintain a hidden state, thus would need to keep a cache of per-agent/timestep observations during inference. The two most important aspects of retention in Sable for scaling is low memory requirements and the dual form for efficient inference. For a further discussion on the differences between RetNets and Transformers, we refer the reviewer to Section 2.4 in [Sun (2023)]. We are not against adding some of these related works into the text if the reviewer thinks it would be useful.

should be gradient descent for ϕ

Thank you for pointing this out, we will update it.

final configs used to produce the reported results?

The final configs, source code, and raw experimental data are available at the link at the end of Section 3. You can download the data by pressing "Experiment Data" at the top of the page and find optimal hyperparameters at All experiments data/Benchmark/optimal-hyperparams/… to reproduce our results. We chose not to add the configs directly to the appendix as it would add a large number of extra pages.

审稿意见

评分: 52025-03-12

The work proposes a novel sequence model architecture for multi-agent reinforcement learning (MARL) and conducts a large empirical evaluation to validate the efficacy of the new approach. The architecture is based on retention networks and optimises the sequence model architecture similar to the prior multi-agent transformer (MAT) in a central learning fashion. However, in evaluations across 45 tasks, the novel architecture is found to outperform standard MARL baselines and MAT by significant margins, and to be significantly more efficient in terms of memory requirements, model inference speed, and to be more scalable to tasks with many agents. To verify the importance of different components of the proposed approach, ablation studies are provided in two tasks that verify the importance of each novel component.

给作者的问题

Would the authors be able to elaborate why the multi-agent transformer architecture is stated to not be able to use observation history as inputs?
In Figure 4, the performance of IPPO is shown to degrade throughout training for a LBF task with 128 agents which is unexpected. Would the authors be able to elaborate on why this might occur?
Would the authors be able to provide information on the training and inference cost of Sable in comparison to MAT, IPPO and MAPPO?
Figure 5 illustrates how a reduced chunk size can reduce the GPU memory cost without any cost to algorithm performance. Would I be correct in understanding that a smaller chunk size comes at a cost of reduced training speed due to lower degrees of parallelisation?

In response to the author rebuttal, I increased my score

论据与证据

Overall, I find the claims made in this work to be clearly presented and well supported by clarifications and empirical evidence.

The only minor point that I found confusing is in the introduction, in which the authors contrast their approach to three categories of MARL: independent learning, centralised training with decentralised execution, and centralised learning. However, as I understand, Sable represents a centralised learning in the same way as the multi-agent transformer is centralised learning. This makes the contrast to the categories somewhat confusing, and I would suggest to clarify the relationship of Sable to such prior work to avoid confusion.

方法与评估标准

The methodology appears sound and is largely well presented in Section 3 of the work. However, the work omits several details that are only mentioned in the Appendix and should at least be briefly stated in the main part of the work:

The training objective is only defined in Appendix D and should be included in Section 3.
The network architecture is somewhat hard to follow from Section 3 without visualisation. Figure 13 is very well presented but unfortunately only shown in the Appendix of this work.
Experiments include tasks with continuous and discrete action spaces but the work does not clarify how the Sable network is adapted to adjust for these differences. Would the authors be able to clarify how the policy is adjusted for these settings and whether the optimisation objective differs across these settings?

理论论述

The work does not provide any theoretical claims and proofs.

实验设计与分析

I verified the details provided about the conducted experiments. I find the evaluation to be well presented and very detailed. I commend the authors for following suggestions of recent work on evaluation practises in RL, and for providing plenty of details in the supplementary material.

Below are some clarification questions and further comparison points that have not been presented in this work and would benefit the contextualisation of this work:

The work compares heavily to the multi-agent transformer approach and states that MAT is "not able to condition on observation histories". Would the authors be able to elaborate on this statement? Given the MAT architecture is based on a transformer, it would seem plausible to add longer context based on the observation history of agents, similar to in Sable, even if this has not been done in the original work.
In Figure 4, the performance of IPPO is shown to degrade throughout training for a LBF task with 128 agents which is unexpected. Would it be possible that this degradation is the result of suboptimal hyperparameter tuning? How do the authors explain that performance of IPPO becomes worse as the algorithm continues to train?
The work discusses the achieved returns and memory efficiency of Sable and MARL baselines but does not discuss their training cost. Would the authors be able to provide the cost of training Sable in comparison to MAT, IPPO and MAPPO? Related, Figure 5 (b) shows how the memory cost can be reduced by using smaller chunk sizes without deteriorating performance. Would I be correct in assuming that reduced chunk size comes at a cost of reduced training speed?
Figure 1 compares the throughput of Sable to MAT in terms of steps per second. Would the authors be able to provide a similar comparison to IPPO and MAPPO and clarify how exactly these numbers were obtained?

补充材料

I reviewed supplementary material B, C and D.

与现有文献的关系

The authors state that their work takes inspiration from recent work in linear recurrent models that considered e.g. the application of state space models in RL [1]. Would the authors be able to elaborate in what way the retentive network architecture applied in Sable differs from this work?

[1] Lu, Chris, Yannick Schroecker, Albert Gu, Emilio Parisotto, Jakob Foerster, Satinder Singh, and Feryal Behbahani. "Structured state space models for in-context reinforcement learning." Advances in Neural Information Processing Systems 36 (2023): 47016-47031.

遗漏的重要参考文献

I am not aware of any essential references that are not discussed.

其他优缺点

I would like to commend the authors on a strong empirical evaluation that provides depth and breadth and answers focused questions. I further would like to emphasise that the work releases all experimental data and code.

其他意见或建议

No further suggestions.

作者回复

2025-04-01

Thank you for your feedback, especially your comments on our positioning within the context of different MARL algorithms, and questions on implementation and experimental details, which helped us improve the paper. We provide detailed responses below.

Confusion from the introduction's contrast between Sable and prior MARL approaches.

We acknowledge that the introduction to Sable can be improved and will make it clear that Sable is a CL method. Our narrative was that Sable breaks the typical CL mold by being both performant, memory efficient and scalable, unlike other CL methods. We will update the introduction to convey this more clearly.

Include the training objective in Section 3 and add Figure 13 to Section 3 to visualize the network architecture.

Due to the page limit constraint, we moved some additional details and the architecture figure to the appendix. We will add these back into the main text of the updated version.

How is the Sable network and policy adjusted for continuous vs discrete action spaces, and does the optimization objective differ?

Sable's policy network uses different output heads for discrete and continuous actions, but the architecture and PPO optimisation objective remain the same. For discrete actions, the policy head outputs action logits per agent, which are used to sample actions and train the policy. For continuous actions, the policy outputs mean values and a shared log standard deviation parameter, to sample actions from a Gaussian distribution. We will add this and more details to the appendix.

in what way the retentive network architecture applied in Sable differs from Structured state space models for in-context reinforcement learning?

While both approaches share the core idea of replacing attention with a more memory-efficient mechanism to enable scalable sequence processing, they differ in their underlying architecture, and research goals. Sable relies on the cross-retention mechanism which is an extension we added to RetNets. There is no obvious analogue for this in S5. Additionally, the focus of S5 is on single-agent RL, in-context learning and meta-learning, while our work focusses on computationally efficient long context memory in MARL. If interested, we refer the reviewer to Section 2.4 of the RetNet paper [Sun (2023)] where the difference between S4 and RetNets are discussed in detail.

Why can't the multi-agent transformer architecture use observation history as inputs?

MAT’s architecture lacks a recurrent formulation and handling temporal memory with transformers is challenging [Parisotto (2019), Meng (2022)]. Although it is possible to maintain a cache at inference time for memory over the sequence, it is less scalable due to high memory requirements from maintaining a cache. Our RetNet for RL is advantageous in that it only requires a hidden state and constant memory at inference time.

Why does IPPO's performance degrade during training in the 128-agent LBF task in Figure 4?

(*) Sharing parameters makes distinguishing agents harder, and partial observability leads to non-stationarity. Sable and MAT overcome this with auto-regressive action selection, which aids coordination and non-stationarity. Hyperparameters shouldn't be the issue, as they were tuned. For additional information, we refer the reviewer to Appendix A3, Lines 808-824.

Would the authors be able to provide information on the training and inference cost of Sable in comparison to MAT, IPPO and MAPPO?

Below we show training and inference SPS in Neom.

Table 1: Training SPS

Task Name	IPPO	Sable	MAT
Neom-512-ag	~24k	410	63
Neom-128-ag	~59k	2542	1505
Neom-32-ag	~180k	11111	10391

Table 2: Inference SPS

Task Name	IPPO	Sable	MAT
Neom-512-ag	4600	759	234
Neom-128-ag	3735	1590	1503
Neom-32-ag	5022	3229	3198

Sable is significantly faster than MAT but slower than IPPO. This is expected as Sable and MAT both use larger transformer style networks, while IPPO uses a smaller MLP. Additionally, we believe this is a more fair comparison as MAT is also a centralised learning method and the previous SOTA in MARL, while IPPO is independent and has significantly worse performance than Sable.

Would I be correct in understanding that a smaller chunk size comes at a cost of reduced training speed due to lower degrees of parallelisation?

Yes, exactly. Decreasing chunk size reduces training speed. Larger chunk sizes allow more parallel computation and faster training.

审稿人评论

2025-04-07

I thank the authors for their clarifications that address most of my comments. I remain convinced that this is an excellent submission that should be accepted. I decided to increase my score to strong accept.

That being said, in line with reviewer qBU1 and my prior comments, I hope the authors will be able to make the assumptions made by Sable and its central learning setting more clear and include a nuanced discussion of it with respect to other algorithms. Similarly, as stated by the authors in their response, I hope to see additional details and Figures (e.g. Figure 13) in the main text of the work.

作者评论

2025-04-08

Thank you for taking our reply into consideration and increasing your score. We truly appreciate your constructive feedback and we are happy to hear that our clarifications helped to address your comments.

In our updated manuscript, we will make sure to include what you, and reviewer qBU1, have asked for. Specifically, we will:

Clarify the assumptions made by Sable with a more nuanced discussion of its positioning as a CL method with respect to the other algorithms.
Include additional details and figures in the main text, in particular, the pseudocode (optimisation objectives and Algorithm 1), architecture diagram (Figure 13), visualisation of Neom and an improved Equation 6, to further aid in clarity and understanding.
Update the problem formulation of the Dec-POMDP to include an observation function to make it more clear what agents condition on during execution and mention in the experiment section how this could influence performance when comparing IL, CTDE and CL.

审稿意见

评分: 32025-03-12

This paper presents a novel sequence modeling approach for MARL. It adopts the retention mechanism instead of the attention mechanism in MAT to achieve computational efficiency, memory efficiency, and scalability.

update after rebuttal

During the rebuttal, the authors adequately addressed most of my concerns. Although the problem settings are not clearly presented in the current version of the manuscript, the authors showed their willingness and plan to address this in the modified manuscript. In this regard, I will maintain my score toward acceptance.

给作者的问题

(Q1) Do all baseline methods are trained in via centralized training?

(Q2) The authors mentioned that their approach is classified as centralized training. Then, what is the formal formulation for the main problem setting? DecMDP? MMDP? or what? The proper formulation rather than just general problem formulation for cooperative MARL tasks should be mentioned somewhere in the manuscript.

(Q3) How critical is it for performance to conduct a random permutation of the order of agents within a timestep?

(Q4) Do QMIX and MAPPO still leverage partial information during decision-making, while Sable and MAT utilize global information?

(Q5) It would be helpful for readers to better understand the content if the dimensions of each matrix were explicitly defined somewhere in the manuscript. For example, $\zeta=D_{NL,1:NL}$ is a confusing. Is it different from $\zeta=D_{NL,NL}$ ?

论据与证据

In general, yes. The authors elaborate the reasoning mathematically and prove their argument experimentally. (e.g. Memory usage comparison with baseline methods)

方法与评估标准

Yes. They compared the proposed methods in various MARL benchmark problems.

理论论述

N/A

实验设计与分析

Yes, their experiments mostly seem valid. However, to my understanding, some of the baselines utilize partial information (in the default setting), unlike the proposed method or MAT. This information gap makes direct comparisons unfair. An explicit acknowledgment of this information gap (if it exists) may be necessary to avoid misleading readers unfamiliar with the baselines and experiments. If the authors have modified their implementation to address this gap, it should be properly mentioned in the manuscript.

补充材料

Most of them, including additional experimental results, task settings, and the structure of the proposed method.

与现有文献的关系

The paper presents a novel approach that adopts RetNet (Retentive Networks) for MARL, achieving scalable methods applicable to very large-scale multi-agent tasks, including scenarios with thousands of agents.

遗漏的重要参考文献

As the paper covers various MARL settings, this version reasonably includes essential literature, although it omits some state-of-the-art (SOTA) algorithms in specific test settings. For example, the paper introduces and compares somewhat outdated literature on value-based methods.

其他优缺点

Strength

The paper explored the multi-agent problems in various perspectives, such as IL, CTDE, and CL.
The paper conducted extensive experiments to evaluate the proposed model in diverse benchmark problems.
The authors open-sourced their code.
The proposed methods are applicable to very large-scale multi-agent problems.

Weakness

Although the paper evaluated the proposed method in diverse MARL tasks, its major contribution is replacing the attention mechanism in MAT with the retention mechanism.

其他意见或建议

(C1) In Eq. (3), $\nu_{ij}$ and $\zeta_{ij}$ contain index $j$ but $j$ does not appear in their expressions. Perhaps, it could be expressed differently to avoid any confusion.

(C2) In training part, the corresponding loss functions and algorithm presented in Appendix should be mentioned for readers to refer to them.

(C3) Some pictorial illustration would be helpful for readers to understand Neom, a newly introduced task, if possible.

作者回复

2025-04-01

Thank you for your feedback, especially the close attention paid to our equations/notation. We provide detailed responses below.

(Q1) Do all baseline methods are trained in via centralized training?

In addition to answering the above question, we also wish to clarify a misunderstanding evidenced by the following comment:

“However, to my understanding, some of the baselines utilize partial information (in the default setting), unlike the proposed method or MAT. This information gap makes direct comparisons unfair.”

While MAT and Sable process information from all agents using a single network, they only use local observations during training and not the global state, unlike CTDE methods, e.g. MAPPO, QMIX. All methods use the same observations at inference time. We will clarify the definition of CL in the introduction. Not all baselines belong to the CL paradigm; we include baselines from IL, CTDE and CL.

(Q2) Formal problem setting

In Section 2, we define the problem setting as a decentralised-POMDP.

(Q3) Random permutation of agents and performance

We didn't investigate this, believing it's more principled to randomly permute agent order each timestep. This prevents the model from relying on specific orderings, avoiding bias [Kuba (2022)].

(Q4) Do QMIX and MAPPO still leverage partial information during decision-making, while Sable and MAT utilize global information?

Please refer to our answer in Q1.

(Q5) Dimensions of matrices

Please refer to our answer to (C1) below and to how we intend to rewrite Equation 6 in our answer to reviewer wrvH marked (***).

outdated literature/baselines on value-based methods.

A well-tuned QMIX has been shown to outperform various extensions [Hu (2023)]. For this reason, we feel that QMIX represents a sufficiently strong value-based baseline. We will also clarify this in the experiments section.

(C1) In Eq. (3), $\nu_{ij}$ and $\zeta_{ij}$ contain index $j$ but $j$ does not appear in their expressions. Perhaps, it could be expressed differently to avoid any confusion.

Since Equation 3 does not have $\xi$ or $\nu$ , we assume the reviewer is referring to $\xi$ and $\zeta$ in Equation 6. We acknowledge that our presentation of Equation 6 was imprecise. This misrepresentation was inadvertently transferred from the original RetNet Paper. Both $\zeta$ and $\xi$ are matrices of the same shape as the decay matrix $D$ , with dimensions $C \times C$ , where $C$ is the chunk size. These matrices contain values that are constant across columns but vary across rows. We will revise Equation 6 to eliminate the ambiguity caused by the overloaded use of the index $i$ , and we will explicitly define the role of $j$ to avoid confusion. We refer the reviewer to our response to reviewer wrvH marked (***) for an overview of how we intend to update Equation 6.

(C2) In training part, the corresponding loss functions and algorithm presented in Appendix should be mentioned for readers to refer to them.

Due to the page limit, we did not have enough space to add this in our initial submission (moving it to the appendix), but since we have an extra page for the camera-ready version, we will add the loss function back to the main text.

(C3) Some pictorial illustration would be helpful for readers to understand Neom, a newly introduced task, if possible.

Thank you for the suggestion, we will add a render of a step in Neom.

major contribution is replacing the attention mechanism in MAT with the retention.

Indeed, the reviewer is correct that this is our main contribution. However, we do not see it as a weakness of our work. The use of retention in Sable goes beyond a straightforward replacement of attention in MAT. To get it to work, we had to change several aspects of the original retention mechanism including:

Introducing a reset mechanism within the decay matrix to ensure that memory is retained within episodes and not across their termination boundaries.
Carefully control the decay over timesteps, which, unlike the original RetNet’s decay that operates only over token positions, has to handle multiple tokens/observations in each timestep.
Developing a cross-retention mechanism, a retentive encoder and an encoder-decoder RetNet, none of which are part of the original RetNet design and are also not straightforward implementations.

Therefore, Sable as a working retention-based sequence model for RL, is a highly non-trivial algorithmic implementation. This should also be clear when comparing our code with the original implementation of RetNets and/or MAT. Additionally, retention enables Sable to attend over entire trajectories, which is impossible in MAT and is the main reason for Sable's impressive performance. We are excited about what Sable is capable of, with extensive empirical evidence giving such a strong signal for its potential use in applications.

审稿人评论

2025-04-04

Thank you for the response. I need some more clarifications on some points regarding Q1 and Q2.

In general cooperative MARL settings, the problem is considered as DecPOMDP, as each agent executes based on its own partial observation, not including others. In CL and 3.Method.Execution, the model utilizes aggregations of observations from all agents and "iteratively" generates action via the "centralized decision" maker. How is this viewed as DecPOMDP? Do authors view partial observability as there are some states (perhaps part of the global state) affecting transitions but not being included in the aggregation of observations?

If the decision maker utilizes the aggregated observations from all agents during execution, this additional information can lead to improved performance compared to general MARL settings based on Dec-POMDPs, which rely on partial information during execution.

Please clarify if I’m mistaken; otherwise, I hope these differences are clearly addressed in the problem formulation and experimental settings.

作者评论

2025-04-04

Thank you for engaging with us in discussion, we sincerely appreciate it.

“Do authors view partial observability as there are some states (perhaps part of the global state) affecting transitions but not being included in the aggregation of observations?”

This is exactly correct, the general problem setting we consider is a cooperative task with shared rewards where the global state is not factorised across individual agent observations. That is, even if at execution the agents can condition on other agents’ observations through attention/retention for CL, this does not reconstruct the full state, and therefore remains a partial (but aggregated) observation. We do acknowledge this provides more information per agent compared to CTDE and IL methods at execution time, but it also comes with increased inference costs, which is exactly what we are addressing with Sable.

As a concrete example, consider a two-agent grid world where agents receive a joint reward when they simultaneously reach a goal G.

|-----|-----|-----|-----|-----|-----|-----|
|  #  |  #  |  #  |  #  |  #  |  #  |  #  |
|  #  |  .  | A2  |  .  |  .  |  G  |  #  |
|  #  |  .  |  #  |  .  |  .  |  .  |  #  |
|  #  |  .  |  .  |  .  |  .  |  .  |  #  |
|  #  | A1  |  .  |  .  |  #  |  .  |  #  |
|  #  |  #  |  #  |  #  |  #  |  #  |  #  |
|-----|-----|-----|-----|-----|-----|-----|

A1 then has a partial observation of the grid, which can be given as

  #  |  .  |  .  
  #  | A1  |  . 
  #  |  #  |  #

while A2 has partial observation

 #  |  # | #
 .  | A2 | .
 .  |  # | .

An aggregation over these observations won’t reconstruct the true global state which implies that the problem remains 1) partially observable and 2) cooperative due to the shared reward.

We do however notice that our current notation concerning what agents condition on (in section 2) does not capture this as precisely as it should. We will update our definition of a Dec-POMDP to include an observation function (which is quite standard, e.g. Oliehoek and Amato, 2016). In our case, the observation function maps from the underlying global state and agent id to the agent’s probability distribution over the power set of concatenated observations. For IL, the probabilities are only non-zero over singleton sets (i.e. single observations) and for CL it has full support (i.e. includes probability mass on all possible combinations). We note, still in both these cases, the emitted observation remains partial with respect to the full state. We will also make this more clear in our experiment section to highlight the differences and that this could influence performance.

Finally, we note that the MAT paper (Wen, et al., 2022) considers the Markov game formulation of the problem. We do not feel this is the best setting given the environments considered. Most, if not all the environments in MAT, and those we consider in our work (as well as the practical applications we ultimately care about) do not have full state observability at execution. Therefore, we remain convinced that the Dec-POMDP formulation is the most well-suited to describe our problem setting. That said, we remain open to any counter arguments to this view, and would happily update our definition if an improved formulation is proposed.

References

Wen, M., Kuba, J., Lin, R., Zhang, W., Wen, Y., Wang, J. and Yang, Y., 2022. Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35, pp.16509-16521.
Oliehoek, F.A. and Amato, C., 2016. A concise introduction to decentralized POMDPs (Vol. 1). Cham, Switzerland: Springer International Publishing

最终决定Accept (poster)

2025-05-01

The authors adapt a RetNet architecture to the cooperative MARL setting to produce a scalable (high throughput, low memory cost) centralized learning method that beats SOTA by significant margins on a wide range of domains. The discussion period raised several constructive comments. As the authors have agreed in their rebuttal, they should carry on with the improvements suggested by the reviewers (refine the problem formulation, pull in algorithmic details from the appendix, etc).