Decision Mixer: Integrating Long-term and Local Dependencies via Dynamic Token Selection for Decision-Making
In this paper, we propose Decision Mixer (DM), which addresses the conflict between features of different scales in the modeling process from the perspective of dynamic integration.
Abstract
Reviews and Discussion
This paper proposes adapting the attention block in transformers based on the Mixture of Experts (MoE) design to effectively balance capturing long-term dependencies and extracting local features. Adapted to the Decision Transformer architecture, the approach demonstrates superior performance in offline reinforcement learning through extensive experiments.
Questions for Authors
No other questions.
Claims and Evidence
Extensive experiments on standard benchmark datasets empirically demonstrate its superior performance compared to various offline RL methods, including both value-based and CSM-based approaches. Additionally, experiments highlight its computational efficiency over existing CSM methods, suggesting its potential to support scalable research in offline RL. However, the claim of theoretical consideration is unreasonable due to the absence of any formal theoretical analysis.
Methods and Evaluation Criteria
The module design is generally reasonable for its intended purpose. However, I have two concerns:
- The inconsistency of training and inference. During training, the selection of tokens passed through the attention block is determined by the hypernetwork, which considers the entire input sequence—an approach that is not feasible during inference. To address this, the authors introduce an auxiliary predictor to approximate the hypernetwork's decision. However, this predictor operates on an incomplete sequence, missing crucial information available to the hypernetwork. This raises doubts about its ability to make accurate predictions. A more consistent approach might involve ensuring the hypernetwork also processes a causal sequence by masking future steps during training.
- The novelty and contribution seem to be somewhat incremental. Many studies have explored the combination of MoE and transformers to improve computational efficiency. While adapting this to the Decision Transformer is a reasonable extension, the paper's novelty and impact are limited due to insufficient discussion and comparison from a broader perspective beyond its specific application area.
Theoretical Claims
Although the authors claimed theoretical consideration in Introduction, there is no theoretical analysis in this paper.
Experimental Design and Analysis
The experiments are thorough and well-executed. The ablation study and computational complexity analyses effectively validate the strength of their design.
Supplementary Material
I briefly reviewed the additional experimental analyses, which appear to be informative and supportive.
Relation to Prior Literature
Scaling offline RL is both a compelling and significant topic. However, many works have explored combining MoE and transformers to enhance computational efficiency. Although adapting this idea to the Decision Transformer is reasonable, the novelty and impact of this paper are limited due to the lack of broader discussion and comparative analysis beyond a specific scenario.
Essential References Not Discussed
Several previous works have explored combining MoE and transformers to enhance computational efficiency:
Fedus, William, Barret Zoph, and Noam Shazeer. "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity." Journal of Machine Learning Research 23.120 (2022): 1-39.
Dai, Damai, et al. "Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models." arXiv preprint arXiv:2401.06066 (2024).
Csordás, Róbert, et al. "Moeut: Mixture-of-experts universal transformers." arXiv preprint arXiv:2405.16039 (2024).
Lepikhin, Dmitry, et al. "Gshard: Scaling giant models with conditional computation and automatic sharding." arXiv preprint arXiv:2006.16668 (2020).
Other Strengths and Weaknesses
The paper is well-written, providing clear explanations and precise notations.
Other Comments or Suggestions
No other comments.
Thanks for the careful review of our work. Due to the strict word limit, we have tried to address the reviewers' comments carefully. All additional experiments will be incorporated into the text.
Theoretical Claims
We provided a brief analysis of the CSM method from the perspective of re-weighting in equation (2). Given the word limit and the need for academic rigor, we have removed the statement "theoretical analysis" from the main text.
Concern 1: The inconsistency of training and inference
We carefully considered using a more consistent approach to achieve our objectives. We designed three schemes: (1) The hypernetwork uses a single token as input for training and inference without needing an auxiliary predictor. (2) The hypernetwork uses a causal sequence as input for training and inference without needing an auxiliary predictor. (3) The hypernetwork uses the entire sequence as input for training, providing data for the binary classification training of the auxiliary predictor, which is then used for inference (adopted by DM).
| | (1) | (2) | (3) |
|---|---|---|---|
| halfcheetah-medium | 27.5 ± 5.1 (48.0h) | 40.9 ± 0.6 (20.5h) | 43.5 ± 0.7 (7.5h) |
| hopper-medium | 82.7 ± 18.0 (48.0h) | 94.7 ± 5.0 (20.5h) | 98.1 ± 3.6 (7.5h) |
| walker2d-medium | 55.2 ± 4.9 (48.0h) | 83.4 ± 2.7 (20.5h) | 83.8 ± 0.8 (7.5h) |
| maze2d-umaze | 59.1 ± 4.7 (24.0h) | 83.2 ± 5.8 (15.0h) | 86.9 ± 1.9 (6.9h) |
| antmaze-umaze | 80.3 ± 12.4 (24.0h) | 75.0 ± 9.1 (16.0h) | 100.0 ± 0.5 (6.5h) |
(3) demonstrated the best performance and stability across all tasks, with significantly shorter training times. Although (1) and (2) ensured consistency between training and inference, they struggled with convergence due to insufficient utilization of global information from the training data. (3) adopts a task decomposition approach to implement training and inference hierarchically, with the auxiliary predictor trained based on the prediction data from each round of the hypernetwork and router. Figure 1 of the pdf (https://anonymous.4open.science/r/Decision-Mixer-1068/rebuttal.pdf) shows the training loss curves, which indirectly suggest that binary classification of data on a specific Mixer layer for a given task is relatively easy to learn. Experimental results in Table 1 and the ablation study in Table 2 confirm that potential distribution shifts were addressed through synchronized training.
Concern 2: The novelty and contribution seem to be incremental
All prior works have tended to introduce expert routing mechanisms in FFN or attention layers without involving token selection, where a fixed number of experts handle different tokens in the complete sequence. We emphasize that DM differs fundamentally from existing works in component design and training approach. DM innovatively designs (1) a dynamic token selection mechanism to address sequence modeling conflicts specific to offline RL, differing from conventional static MoE. DM handles incomplete sequences and uses generalized residual connections after each layer to ensure the consistency of output and input lengths. (2) During inference, we also designed a unique auxiliary predictor from the task decomposition perspective to address inconsistencies between training and inference. (3) We deploy the selection before the attention layer, making DM a more flexible plug-and-play architecture. It is more cost-efficient than conventional combinations of MoE and transformers, providing a reasonable direction for exploring scaling laws under the DT architecture.
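For illustration, a minimal PyTorch-style sketch of one such select-concatenate-compute layer is shown below. All class and variable names, shapes, and the single-k-per-batch simplification are illustrative assumptions rather than our released implementation, and a causal mask is omitted for brevity.

```python
import torch
import torch.nn as nn

class MixerLayerSketch(nn.Module):
    """Illustrative select-concatenate-compute layer (not the released DM code)."""

    def __init__(self, embed_dim: int, context_length: int, num_heads: int = 4):
        super().__init__()
        self.router = nn.Linear(embed_dim, 1)          # per-token importance score
        self.hypernet = nn.Sequential(                 # predicts a selection ratio in (0, 1)
            nn.Linear(context_length * embed_dim, 512),
            nn.LeakyReLU(),
            nn.Linear(512, 1),
        )
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, S, D)
        B, S, D = x.shape
        scores = self.router(x).squeeze(-1)                             # (B, S)
        ratio = torch.sigmoid(self.hypernet(x.reshape(B, -1)))          # (B, 1)
        k = max(1, int(ratio.mean().item() * S))                        # one k per batch (simplification)
        idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values        # keep temporal order
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
        selected = torch.gather(x, 1, gather_idx)                       # (B, k, D)
        h = self.norm(selected)
        attended, _ = self.attn(h, h, h)                                # attention only over selected tokens
        # Generalized residual: write attended tokens back into the full sequence,
        # so the output length matches the input length and unselected tokens pass through.
        out = x.clone()
        out.scatter_add_(1, gather_idx, attended)
        return out
```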
Essential References Not Discussed
We will discuss the combination of MoE and transformers in the text. GShard[1] introduced MoE into transformers to address load imbalance via routing and expert capacity constraints. Switch Transformers[2] proposed a Top-1 gating mechanism to reduce computation and communication overhead. DeepSeekMoE[3] optimized expert utilization with fine-grained segmentation and expert isolation. MoEUT[4] combined MoE with the Universal Transformer, addressing the parameter-computation efficiency trade-off. Additionally, several works[5-8] explored MoE integration in visual tasks from the perspectives of data processing[5], multi-task learning[6], pre-training[7], and global training[8].
We discuss MoE because DM can be viewed as using a single expert to filter tokens before the standard transformer layer. However, this does not mean we have merely made a simple transplant. The collaboration between the router and the hypernetwork in DM, along with the auxiliary predictor and generalized residual connections, are seamlessly integrated, forming an efficient and practical framework that paves the way for future paradigms in offline RL.
[1] Gshard...
[2] Switch transformers...
[3] Deepseekmoe...
[4] Moeut...
[5] Scaling vision with sparse mixture...
[6] Adamv-moe: Adaptive multi-task vision...
[7] Moe jetpack: From dense checkpoints...
[8] Mod-squad: Designing mixtures...
The rebuttal has satisfactorily addressed my concern regarding the inconsistency. I recommend that the authors incorporate the relevant discussion and supporting experiments into the main paper, as this will significantly enhance its quality. Additionally, the discussion on prior work combining MoE and Transformers is valuable and should also be included. However, I still find the novelty to be somewhat limited. Taking these factors into account, I have decided to raise my score to a weak accept.
We sincerely thank the reviewer for kindly raising the score. We will incorporate all relevant discussions and supporting experiments suggested by the reviewer into the main paper. We are also grateful for the reviewer's thoughts on the innovation of our work, which have prompted us to further reflect on and rearticulate DM's unique contributions in terms of novelty. We strive to address the reviewers' concerns, not only within this rebuttal but also by providing meaningful insights for the RL community.
- DM approaches the inherent feature trade-off problem in the DT architecture from a novel perspective of dynamic token selection, enabling a systematic exploration that significantly improves computational efficiency while ensuring robust performance gains.
- Unlike previous MoE+Transformer approaches that focus on statically expanding the number of experts and assigning tokens to experts, DM simplifies the design by discarding the concept of "experts." It adopts a single-router structure and performs token selection through a tightly coupled mechanism between a hypernetwork and an auxiliary predictor.
- The router, hypernetwork, and auxiliary predictor in DM are all simple in structure, tightly integrated, and highly modular, making them easy to plug into existing architectures to enhance performance.
- DM is the first to explore potential scaling laws in conditional sequence modeling (CSM). In contrast to prior closed-source studies that focus solely on increasing parameter scale, we have released all our code to support reproducibility and further research by the community.
We also explored integrating existing MoE+Transformer methods into the DT framework. Specifically, we focused on two approaches: MoE with token choice routing [1] + DT (referred to as Method 1) and MoE with expert choice routing [2] + DT (referred to as Method 2). Given the architectural differences between DM and existing MoE methods, we have made every effort to ensure fairness by limiting the number of experts to 8 and the top-k value to 2 in both Method 1 and Method 2. Other hyperparameters (e.g., model depth, batch size) were kept broadly consistent with Table 6 in the appendix, with minor adjustments to maximize performance. All experiments were conducted in the Gym environment, and results were averaged over three random seeds.
| | Method 1 | Method 2 | DM |
|---|---|---|---|
| halfcheetah-medium | 23.9 | 20.1 | 43.5 |
| hopper-medium | 54.9 | 72.1 | 98.1 |
| walker2d-medium | 64.3 | 40.3 | 83.8 |
| maze2d-umaze | 49.5 | 77.6 | 86.9 |
The performance of MoE+Transformer methods is inferior to that of DM, which is consistent with our intuition. Method 1 and Method 2 rely on static mechanisms to assign tokens to specific experts within the FFN layers, resulting in lower flexibility than DM. Moreover, the absence of a precise token selection mechanism before the standard transformer layers makes it difficult to support trajectory stitching, which limits the effectiveness of off-the-shelf MoE+Transformer methods in RL tasks. Additionally, MoE architectures are notoriously difficult to tune, and when data quality or quantity is insufficient, they are prone to sudden performance degradation. We will further elaborate on the uniqueness and contributions of our approach in the main paper.
[1] Gshard: Scaling giant models with conditional computation and automatic sharding.
[2] Mixture-of-Experts with Expert Choice Routing.
This paper introduces Decision Mixer (DM), a Transformer-based architecture for offline reinforcement learning. DM features a dynamic token selection mechanism, where a routing module learns to selectively attend to relevant past tokens during training. To enable efficient inference, an auxiliary predictor is trained concurrently to approximate token importance without access to future information. Experimental results across a diverse set of offline RL benchmarks—including standard locomotion tasks and MetaWorld—demonstrate that DM consistently outperforms existing approaches in both performance and efficiency.
update after rebuttal
Considering the authors' response and other reviews, I have changed my score to weak accept.
Questions for Authors
- Can you explain the results in Table 2 in more detail, specifically why incorporating such an auxiliary predictor improves overall performance compared to dynamic token selection alone?
- I believe it is necessary to include a clock-time measurement in addition to the complexity measurement, as the method involves the additional training of multiple networks. This would provide a more comprehensive assessment of the method's efficiency and resource requirements.
Claims and Evidence
Although the motivations and solutions seem to be novel, several points in the paper require clarification. I find it unclear how the router R and the hypernetwork H are trained in the proposed approach; the training process for these components is not well explained in the manuscript. What is the architecture of these networks (not included in the manuscript), and is the training based only on the downstream loss function in Eq. (6)? Additionally, I am curious about how the auxiliary predictor is trained. More details are needed on what constitutes the ground truth for training this predictor and how its predictions are evaluated during the learning process.
Methods and Evaluation Criteria
The benchmarks that were for evaluation in this paper are diverse and suitable. These are benchmarks that have been used in previous DT related research.
Theoretical Claims
There are no theoretical proofs in this paper.
Experimental Design and Analysis
The experiment design has been checked and is reasonably formulated. The paper includes standard evaluation on MuJoCo benchmarks and also evaluates on MetaWorld environments. However, the paper lacks evaluation on discrete-action-space environments (Atari games are often used in previous DT research).
Supplementary Material
Supplementary Material has been reviewed. The architecture of router R and hypernetwork H seems to be missing from the appendix.
Relation to Prior Literature
- The paper introduces Decision Mixer (DM), a low-complexity architecture aimed at balancing local and long-range dependencies in Decision Transformer (DT) models through a layer-wise token selection mechanism. Inspired by the Mixture-of-Experts (MoE) architecture, the authors propose a router network to select the tokens.
- While this method is novel in its application to offline reinforcement learning, token selection and dropping strategies have been explored in prior Transformer research. For example, Token Dropping for Efficient BERT Pretraining (arXiv:2203.13240) proposes dropping less important tokens mid-layer to improve training speed without sacrificing performance. Similarly, Random-LTD: Random and Layerwise Token Dropping (arXiv:2211.11586) presents a technique that skips computation for random subsets of tokens at intermediate layers to reduce cost.
- The authors should more thoroughly engage with the existing literature on token dropping in Transformers, including a comparison and contrast to clearly articulate the novelty of their approach. Additionally, further empirical evidence is needed to convincingly demonstrate the superiority of the proposed method in reinforcement learning settings.
Essential References Not Discussed
Missing baseline: The paper “Long-Short Decision Transformer: Bridging Global and Local Dependencies for Generalized Decision-Making” acknowledges the same problem and proposes solution that should be compared here.
Other Strengths and Weaknesses
The paper presents a well-motivated problem, and the proposed solution seems to be novel in the RL setting, particularly in its ability to dynamically select token weights. Empirical results across multiple tasks and environments demonstrate performance improvements when applied to various base models, such as DT, QDT, and ODT.
Other Comments or Suggestions
- There is a mention of "MOD" in line 343, but I was unable to find what it refers to.
- For Figure 2, consider adding axis titles or specifying in the caption what the attention scores represent to improve clarity. For example, in DC, the relevant discussion section explicitly mentions that the attention scores reflect relationships between token level (state, action, and returns-to-go). Providing similar context here would make the figure more interpretable.
- The term "threshold k" is somewhat misleading, especially when the output of the hypernetwork is denoted as k. Changing top-k to top-\textit{k} might help readers understand that this refers to the same value.
Thanks for the detailed review of our work. Given the strict word limit, we have carefully addressed the reviewer's comments and will incorporate the additional experiments and suggestions into the updated version. Anonymous pdf: https://anonymous.4open.science/r/Decision-Mixer-1068/rebuttal.pdf
The details of R and H
We found that both R and H perform well as simple MLPs, as shown below. R and H are incorporated as part of the main model, with no additional constraints added, and training is based solely on Equation (6). Additional thoughts can be found in Q1 for Reviewer ekS2.
| Network | Layer | Input | Output |
|---|---|---|---|
| Router R | Linear | embed_dim | 1 |
| Hypernetwork H | Linear | context_length×embed_dim | 512 |
| | LeakyReLU | | |
| | Linear | 512 | 1 |
| Auxiliary predictor | Linear | embed_dim | embed_dim//2 |
| | SiLU | | |
| | Linear | embed_dim//2 | 2 |
The auxiliary predictor is trained with gradient isolation using the dynamic selection results from R and H. The main model selects the top-k tokens based on R's scores and generates binary labels as ground truth for the auxiliary predictor. The predictor outputs binary logits, optimized via Equation (5) to match these labels. We found the token selection distribution in binary classification easy to learn, with rapid loss convergence shown in Figure 1 of the anonymous pdf. Main experiments and ablation studies confirm prediction accuracy. Alternatives are discussed in our response to reviewer EJaz on "Inconsistency of training and inference."
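For clarity, a small sketch of how the auxiliary predictor could be trained from the router/hypernetwork decisions is shown below; class and function names are illustrative, and gradient isolation is realized via detach.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxPredictorSketch(nn.Module):
    """Token-level binary classifier matching the MLP layout in the table above."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim // 2),
            nn.SiLU(),
            nn.Linear(embed_dim // 2, 2),     # logits: [not selected, selected]
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, S, D)
        return self.net(tokens)                                 # (B, S, 2)

def aux_predictor_loss(predictor: AuxPredictorSketch,
                       tokens: torch.Tensor,
                       topk_idx: torch.Tensor) -> torch.Tensor:
    """topk_idx: (B, k) long indices produced by the router R and hypernetwork H."""
    B, S, _ = tokens.shape
    labels = torch.zeros(B, S, dtype=torch.long, device=tokens.device)
    labels.scatter_(1, topk_idx, 1)                 # 1 marks tokens chosen by the teacher
    logits = predictor(tokens.detach())             # gradient isolation from the main model
    return F.cross_entropy(logits.reshape(-1, 2), labels.reshape(-1))
```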
Lacks evaluation on specific environments
We have added experimental results for DM on Atari, reporting the average performance across three random seeds. Table 1 of the anonymous pdf shows that DM performs well in discrete-action environments, demonstrating the method's generalizability. This success was expected given the precedents set by DT and DC.
More engage with the existing literature
We will provide a more detailed discussion in the main text. Token dropping, initially proposed to reduce BERT inference costs [1,2], was adapted by Hou et al. [3] to improve training efficiency. Random-LTD [4] advanced this with random-layer token dropping and learning rate scheduling. While prior work focuses on static efficiency strategies for vision and language tasks [5,6], they often lack dynamic adaptation, risking semantic disruption. In contrast, DM dynamically selects tokens using a router and hypernetwork, aligning with offline RL’s Markovian nature for better performance. Its plug-and-play design and synchronized training offer greater flexibility than existing methods.
[1] Train short, test long...
[2] SpAtten: Efficient Sparse...
[3] Token dropping for...
[4] Random-ltd...
[5] Revisiting token dropping...
[6] Multi-Stage Vision Token...
Further empirical evidence
In the paper, we have included baseline comparisons in RL scenarios, component ablation studies, token selection statistics and visualizations, computational efficiency, training curves, generalization, portability, and context length experiments. Following reviewer feedback, we added DM’s Atari results and adapted Random-LTD to DT in Gym. Table 2 of the anonymous pdf shows that Random-LTD underperforms DM in RL tasks, likely due to its reliance on random token dropping. We will attempt more transplant comparison experiments in future work.
Essential References Not Discussed
We will include LSDT in the related work and experimental comparison sections. LSDT integrates DT's self-attention and DC's dynamic convolution via branch design for decision-making. DM extends these environments and outperforms LSDT in most experiments (Table 3 of the anonymous pdf).
Other Comments
- We apologize for the typo and will change "MOD" to "model".
- We have placed the updated Figure 2 in the anonymous PDF.
- We will replace the relevant terms with \textit{k}.
Q1: Why does an auxiliary predictor improve performance
The auxiliary predictor addresses issues arising from inconsistent input formats between training and inference. The hypernetwork H uses the complete sequence to predict k, but during inference the autoregressive nature of the transformer makes the subsequent part of the sequence invisible, preventing sequence-level predictions. The token-level auxiliary predictor enables filtering without full-sequence information. R and H jointly serve as teacher models, providing binary classification training data for the auxiliary predictor. This hierarchical training approach reduces the difficulty of training and fully utilizes all available data.
Q2: Include a clock-time measurement
We measured the clock time from training start to convergence across multiple tasks. The results in Table 4 (anonymous pdf) show that DM has a smaller time overhead than DT and is competitive with DC. Despite adding networks, DM's minimal parameters and dynamic token selection mechanism shorten the sequence length, avoiding significant clock time increases.
This paper introduces Decision Mixer (DM), a select-concatenate-compute mechanism that improves efficiency in offline reinforcement learning. Inspired by MoE, DM dynamically filters key tokens for attention while retaining information from unselected ones. It also integrates an auxiliary predictor to mitigate short-sightedness. Experiments demonstrate that DM outperforms existing methods while significantly reducing computational overhead.
update after rebuttal
I will keep my score.
Questions for Authors
- The existing loss functions mainly focus on the MSE loss for action prediction. Considering the stability of the training process, should additional regularization terms, such as a routing weight smoothing term, be introduced to prevent abrupt changes in token selection between adjacent tokens?
- Is the dynamic range of k generated by the hypernetwork constrained? When the input sequence contains significant noise, can the hypernetwork generate extreme values (e.g., k=0 or k=S), causing the model to degrade into a purely convolutional or attention-based architecture?
- Could the paper provide a more balanced discussion by highlighting DM's limitations, such as scenarios where it may underperform, to guide future work?
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are suitable for the real-world problem.
Theoretical Claims
Yes, the theoretical claims and proofs have been examined. The auxiliary predictor in Section 3.3, formulated with Equation (5), provides a valid approach to handle autoregressive sampling constraints.
Experimental Design and Analysis
The experimental design is robust, with comprehensive benchmarking across diverse D4RL domains and thorough ablation studies to dissect the contributions of DM's components. A powerful aspect is the visualization of the token selection mechanism (Figure 4), which provides intuitive insights into how DM adapts to different tasks, such as the positional proximity of selected tokens in standard Markov tasks versus their discrete distribution in non-standard Markov tasks. This visualization aligns with the theoretical motivation behind DM and offers a clear, interpretable understanding of its behavior across varying task complexities.
Supplementary Material
I reviewed the feasibility of the code provided in the submitted supplementary materials.
Relation to Prior Literature
Previous work, whether DT or DC, had a certain degree of data bias. DM introduces a dynamic token selection mechanism that dynamically balances local and long-term dependencies. MoE inspires this mechanism and leverages a router and hypernetwork to select tokens for attention computation. This innovation allows DM to adaptively focus on relevant features, addressing DT's limitations and improving task performance with standard and non-standard Markov properties.
Essential References Not Discussed
No; all essential related works have been cited and discussed.
Other Strengths and Weaknesses
Strengths:
The paper creatively combines ideas from CSM, MoE, and conditional computation to design DM. The dynamic token selection mechanism, inspired by MoE, is a novel and innovative approach to balancing local and long-term dependencies in offline RL. By significantly reducing FLOPs and memory usage compared to DT, DM addresses a key practical challenge in scaling RL models, making it more feasible for resource-constrained environments.
Weaknesses:
The paper does not explicitly discuss the limitations of DM, such as potential failure cases or scenarios where it may underperform compared to baseline methods. A more balanced discussion would strengthen the paper.
Other Comments or Suggestions
It is recommended to adjust some instances of "MOE" to "MoE" in the main text to ensure consistency and accuracy in terminology.
Thanks for the careful review of our work. Due to the strict word limit, we have tried to address the reviewers' comments carefully. All additional experiments and suggestions will be incorporated into the updated text.
It is recommended to adjust "MOE" to "MoE"
We apologize for this mistake. To ensure consistency and accuracy of terminology, we will change all occurrences of "MOE" to "MoE" in the main text.
Q1: Should regularization terms be introduced
We appreciate the reviewer's thoughtful insights. To ensure our algorithm's simplicity and ease of deployment, we have solely used Equation (6) for the entire model training. Introducing additional regularization terms, such as a routing weight smoothing term, can improve training stability. To investigate this, we conducted experiments by adding such a regularization term, weighted by 0.1, on top of Equation (6); the term penalizes abrupt changes in the routing weights of adjacent tokens.
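For concreteness, one plausible form of such a smoothing term (an L2 penalty on differences between adjacent routing scores) is sketched below for illustration; the function name and the exact formulation are illustrative.

```python
import torch

def routing_smoothness_penalty(router_scores: torch.Tensor, weight: float = 0.1) -> torch.Tensor:
    """router_scores: (B, S) per-token routing scores.
    Penalizes abrupt changes between adjacent tokens."""
    diffs = router_scores[:, 1:] - router_scores[:, :-1]
    return weight * diffs.pow(2).mean()
```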
The experimental results are presented in the table below. We refer to DM with the added regularization term as DM_s.
| | DM | DM_s |
|---|---|---|
| halfcheetah-medium | 43.5 ± 0.7 | 33.1 ± 0.2 |
| hopper-medium | 98.1 ± 3.6 | 90.9 ± 1.0 |
| walker2d-medium | 83.8 ± 0.8 | 80.1 ± 0.1 |
| maze2d-umaze | 86.9 ± 1.9 | 86.4 ± 0.8 |
| antmaze-umaze | 100.0 ± 0.5 | 48.0 ± 0.8 |
We found that while DM_s exhibited improved stability, it experienced varying degrees of performance degradation across all tasks compared to DM. We hypothesize that this is primarily due to the introduction of the weight smoothing term, which constrains the flexibility of token selection based on task-specific characteristics, making the selection process overly conservative. Abrupt changes in token selection across neighboring positions can be reasonable, as they allow the model to flexibly concatenate trajectories based on data characteristics while preserving the Markov properties required for specific tasks. In future work, we will explore more refined approaches to achieving efficient token selection more smoothly.
Q2: Is the dynamic range of k constrained
The range of values for k is unrestricted to ensure the algorithm's simplicity. Depending on the nature of the task, k can approach either 0 or the entire sequence length S. The medium-replay dataset (high noise), medium dataset (moderate noise), and medium-expert dataset (low noise) from specific tasks serve as references for evaluating the robustness of our model's selection, as shown in Table 8 and Figure 6. For the hopper-medium task, the average number of selected tokens fluctuates only slightly across datasets of different qualities. A similar pattern is observed in the other two tasks, with the overall average number of selected tokens remaining between 20 and 30, approximately half of the entire sequence length, without significant degradation.
To further investigate the hypernetwork’s predictions in highly noisy environments, we introduced Gaussian noise sampled from a standard normal distribution to all tokens in the hopper-medium-replay and halfcheetah-medium-replay datasets while keeping the original action labels unchanged. We refer to these modified datasets as hopper-medium-noise and halfcheetah-medium-noise. The results show that the hypernetwork outputs a higher value for hopper-medium-noise, whereas the opposite trend is observed for halfcheetah-medium-noise. This suggests that the number of selected tokens adapts to task-specific characteristics in highly noisy environments. No significant performance degradation was observed during the experiments, indicating the robustness and stability of our approach.
| | 1st Mixer Layer | 2nd Mixer Layer | 3rd Mixer Layer | Average |
|---|---|---|---|---|
| hopper-medium-replay | 40.59 | 33.60 | 15.76 | 29.98 |
| hopper-medium-noise | 45.94 | 37.66 | 29.83 | 37.81 |
| halfcheetah-medium-replay | 15.07 | 31.99 | 37.29 | 28.12 |
| halfcheetah-medium-noise | 10.04 | 25.37 | 7.99 | 14.47 |
W1/Q3: A more balanced discussion would strengthen the paper
We have summarized DM's advantages and limitations to help readers understand our approach better. By dynamically selecting and concatenating important tokens at each layer, DM reduces computational complexity while effectively balancing the trade-off between capturing long-term dependencies and extracting local Markov features. This approach enhances efficiency and offers valuable insights into scaling laws for offline RL. Notably, the dynamic token selection during inference relies on the auxiliary predictor for online decision-making, which may introduce latency in scenarios with strict real-time requirements. Additionally, DM performs slightly worse than value-based methods on low-quality or noisy data. Future work will explore data augmentation or more efficient and robust token selection strategies—such as adversarial training or noise-adaptive mechanisms—to improve adaptability to noisy or low-quality data.
The main contribution is a novel dynamic token selection mechanism termed Decision Mixer (DM), inspired by MoE to enhance CSM for offline reinforcement learning. DM adaptively selects key tokens for attention computation while preserving information from unselected tokens via feature concatenation, improving efficiency and mitigating information loss. Additionally, an auxiliary predictor in the autoregressive sampling process enhances long-term decision-making. Experiments show that DM achieves SOTA performance with reduced computational cost.
Questions for Authors
- The training of the auxiliary predictor relies on token selection labels generated during training. However, the model generates sequences autoregressively during actual sampling, which may slightly cause the input distribution to deviate from the training data. In this case, could the auxiliary predictor make incorrect selections due to distribution shift, thereby affecting the quality of the generated sequences?
- How does the "dynamic token selection mechanism" mentioned in the paper reduce computational overhead? Specifically, how does it lower computational complexity by reducing the number of tokens entering the attention layer?
Claims and Evidence
Yes, the claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria, including the benchmark datasets, are well-aligned with the problem and application at hand.
Theoretical Claims
The factorization in Equation (2) correctly follows from the conditional probability decomposition, ensuring a proper reweighting mechanism based on future returns. The formulation aligns with standard importance sampling principles, making it a valid approach. The theoretical justification for this reweighting perspective is well-grounded.
Experimental Design and Analysis
The study presents a well-structured evaluation of DM across multiple D4RL domains, supported by detailed ablation studies that effectively highlight the contributions of its core components. One notable strength is the computational complexity analysis (Table 3), demonstrating DM's efficiency in reducing memory usage and FLOPs compared to baseline methods like DT and DC. This analysis underscores DM's practical advantages and aligns with the broader goal of developing scalable and resource-efficient offline RL methods, and exploring how DM's efficiency scales with larger datasets or more complex environments would provide further insights into its applicability in real-world settings.
Supplementary Material
Yes, I primarily reviewed the DM architecture design in the code within the supplementary material.
Relation to Prior Literature
Scaling law in offline reinforcement learning has not been explored well, especially regarding how CSM methods based on the Transformer architecture can maximize the performance advantages of Transformers, which is a fascinating question. While some previous works have scaled up the architecture and parameters of DT, they have done so at the cost of a proportional increase in computational resources. In contrast, DM proposes a feasible solution for computational efficiency while maintaining performance.
[1] Lee K H, Nachum O, Yang M S, et al. Multi-game decision transformers. Advances in Neural Information Processing Systems, 2022.
[2] Reed S, Zolna K, Parisotto E, et al. A generalist agent. arXiv preprint arXiv:2205.06175.
Essential References Not Discussed
All essential related works that provide the necessary context for understanding the key contributions of the paper have been cited and discussed.
Other Strengths and Weaknesses
Strengths:
- The paper tackles underexplored challenges in offline RL, such as handling non-standard Markov properties and improving generalization in suboptimal trajectories. These contributions are original and fill important gaps in the literature. DM's ability to handle standard and non-standard Markov tasks broadly applies to various RL benchmarks, including Gym, Adroit, Kitchen, AntMaze, and Maze2D.
- The paper is well-organized, with clear explanations of the motivation, methodology, and results. The use of visualizations enhances understanding of the model's behavior.
Weaknesses:
- While the dynamic token selection mechanism can adaptively adjust computational load, it may introduce selection bias in certain long-sequence or high-noise tasks (e.g., complex maze tasks with sparse rewards) due to fluctuations in router weights caused by local noise. Therefore, the robustness of the dynamic mechanism in extremely sparse scenarios still needs to be enhanced.
Other Comments or Suggestions
The authors could consider providing a more detailed explanation of this aspect, as the autoregressive nature of sampling may cause the input distribution to gradually deviate from the training data. Since the auxiliary predictor relies on token selection labels generated during training, could this distribution shift potentially affect the quality of the generated sequences?
Thanks for the careful review of our work. Due to the strict word limit, we have tried to address the reviewers' comments carefully. These additional experiments and suggestions will be incorporated into the updated main text.
W1: The robustness of the dynamic mechanism in sparse scenarios needs to be enhanced
We fully agree that there is still potential for improvement. An intuitive approach is to adopt data augmentation. Considering that complex tasks exhibit significant noise variations mainly at the environmental state level, we introduce Gaussian noise perturbation to all state dimensions in the training data, formulated as s̃_i = s_i + ε · σ_i · z_i for each state dimension i.
Here, ε is a controllable noise intensity, which we set to 0.05. σ_i represents the standard deviation of the i-th state dimension, computed during data preprocessing. z_i denotes noise sampled from the standard normal distribution. We generate two random seeds to add noise to the initial data D, obtaining two noisy datasets D_1 and D_2. Each training iteration draws batch data from all three datasets. D, D_1, and D_2 share identical values except for the perturbed states. The processed data is then used for training, and the resulting model is referred to as DM_enhanced. The experimental results on three tasks are as follows:
| | DM | DM_enhanced |
|---|---|---|
| hopper-medium | 98.1 ± 3.6 | 98.0 ± 1.9 |
| maze2d-umaze | 86.9 ± 1.9 | 88.9 ± 1.3 |
| antmaze-umaze | 100.0 ± 0.5 | 100.0 ± 0.1 |
DM_enhanced performs better than DM on the maze2d-umaze task, with a smaller standard deviation. Although DM_enhanced does not perform better on the reward-dense task hopper-medium, a similar stability improvement is observed. This suggests that the data augmentation employed in DM_enhanced enhances its stability and robustness. Due to rebuttal time constraints, exploring more principled designs for improving robustness will be part of our future work.
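A minimal sketch of this state-level augmentation is shown below; the array shapes and the helper name are illustrative assumptions.

```python
import numpy as np

def perturb_states(states: np.ndarray, epsilon: float = 0.05, seed: int = 0) -> np.ndarray:
    """states: (N, state_dim). Adds epsilon * sigma_i * z noise per state dimension."""
    rng = np.random.default_rng(seed)
    sigma = states.std(axis=0, keepdims=True)      # per-dimension std from preprocessing
    noise = rng.standard_normal(states.shape)      # z ~ N(0, I)
    return states + epsilon * sigma * noise

# Two seeds give the two noisy copies used alongside the original data:
# states_1 = perturb_states(states, seed=1)
# states_2 = perturb_states(states, seed=2)
```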
Q1: Could the auxiliary predictor make incorrect selections due to distribution shift
The only potential distribution shift is the difference between the training and inference inputs. We designed three schemes: (1) The hypernetwork uses a single token as input for training and inference without needing an auxiliary predictor. (2) The hypernetwork uses a causal sequence as input for training and inference without needing an auxiliary predictor. (3) The hypernetwork uses the entire sequence as input for training, providing data for the binary classification training of the auxiliary predictor, which is then used for inference (the approach adopted by DM).
| | (1) | (2) | (3) |
|---|---|---|---|
| halfcheetah-medium | 27.5 ± 5.1 | 40.9 ± 0.6 | 43.5 ± 0.7 |
| hopper-medium | 82.7 ± 18.0 | 94.7 ± 5.0 | 98.1 ± 3.6 |
| walker2d-medium | 55.2 ± 4.9 | 83.4 ± 2.7 | 83.8 ± 0.8 |
| maze2d-umaze | 59.1 ± 4.7 | 83.2 ± 5.8 | 86.9 ± 1.9 |
| antmaze-umaze | 80.3 ± 12.4 | 75.0 ± 9.1 | 100.0 ± 0.5 |
| average score | 61.0 | 75.4 | 82.5 |
Scheme (3) demonstrated the best performance and stability across all tasks, with significantly shorter training times. Although schemes (1) and (2) ensured consistency between training and inference, they struggled with convergence due to insufficient utilization of global information from the training data. Scheme (3) adopts a task decomposition approach to implement training and inference hierarchically, with the auxiliary predictor trained based on the prediction data from each round of the hypernetwork and router. Figure 1 of the anonymous pdf shows the training loss curves, which indirectly suggest that binary classification of data on a specific Mixer layer for a given task is relatively easy to learn. Experimental results in Table 1 confirm that potential distribution shifts were addressed through synchronized training.
Q2: How does the "dynamic token selection mechanism" reduce computational overhead
In a specific Mixer layer, the dynamic token selection mechanism reduces computational complexity by adaptively selecting the top-k tokens (via the router R and hypernetwork H) for attention processing while skipping the others. When the sequence length entering the attention layer is reduced from S to k, the attention complexity decreases from O(S^2·d) to O(k^2·d), and the computational complexity of the FFN layer is reduced from O(S·d^2) to O(k·d^2), where d is the hidden dimension. For example, in the Key/Value projection stage, if k = S/4, the computation is reduced to 25% of the original FLOPs. Experiments show that DM reduces FLOPs by 47.0% compared to DT and achieves better clock time performance, as shown in the table below.
| | DT | DC | DM |
|---|---|---|---|
| halfcheetah-medium | 12.0h | 8.0h | 7.5h |
| hopper-medium | 10.0h | 6.5h | 7.5h |
| antmaze-umaze | 10.0h | 6.0h | 6.5h |
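As a rough illustration (hypothetical sizes, counting only the dominant matrix multiplications), the following back-of-the-envelope sketch shows how reducing the number of tokens entering attention lowers the FLOP count:

```python
def attention_flops(seq_len: int, dim: int) -> int:
    """Approximate FLOPs for one self-attention block: Q/K/V projections (3*S*d^2),
    score matrix and weighted sum (2*S^2*d), and the output projection (S*d^2)."""
    return 4 * seq_len * dim ** 2 + 2 * seq_len ** 2 * dim

S, k, d = 60, 30, 128          # hypothetical context length, selected tokens, hidden dim
ratio = attention_flops(k, d) / attention_flops(S, d)
print(f"selected/full attention FLOPs: {ratio:.2f}")   # ~0.45 when half the tokens are selected
```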
Based on the strong consensus among reviewers regarding the paper's novelty within the RL context, significant empirical contributions (SOTA performance across diverse benchmarks), demonstrated computational efficiency, and the authors' thorough and successful rebuttal addressing initial concerns, I recommend accepting this paper. The proposed Decision Mixer offers a valuable and well-validated contribution to the field of offline reinforcement learning, particularly concerning efficient and effective sequence modeling. The authors have convincingly differentiated their approach from prior work and provided strong evidence for its utility.