PaperHub

Rating: 5.7 / 10 (Poster; 3 reviewers; min 5, max 6, std 0.5)
Individual ratings: 6, 6, 5
Confidence: 3.0 | Correctness: 2.7 | Contribution: 2.3 | Presentation: 2.7

NeurIPS 2024

Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning?

OpenReview | PDF
Submitted: 2024-05-14 | Updated: 2024-11-06
TL;DR

An investigation of Mamba in offline RL.

Abstract

Keywords
Offline RL; Trajectory Optimization; Mamba

Reviews and Discussion

Review
Rating: 6

This paper comprehensively investigates the possibility of leveraging Mamba for trajectory learning. The authors take Decision Mamba as a playground and analyse the performance of this model in trajectory learning scenarios (gym/mujoco) from several aspects. A group of conclusions is attained through rigorous experiments, which are solid and potentially valuable for further research related to Mamba.

Strengths

  1. Novelty: given that Mamba is still at its exploratory stage, this paper positively probes Mamba's potential for trajectory learning, with the surprising result that, under some specific pre-conditions, Mamba is more suited than the Transformer.

Weaknesses

  1. Most discoveries in this paper have been implicitly discussed for several months within the community, although they are first presented formally in this paper. Besides, these discoveries lean toward empirical evidence, which is relatively shallow. This makes the paper's technical contribution weak. I would appreciate it if the authors could provide more in-depth explanations of these discoveries, in particular: 1) Transformer-like models favor short sequences. 2) The significant role of the hidden attention.

  2. Although the experimental results are solid, I found that this paper is more suitable for Benchmark Track, since the technical novelty revolves around benchmarking Decision Mamba.

  3. Figure 1 (the title and the picture) should be improved. For now, it confuses me, especially the correspondence between the text content (title) and the illustration.

  4. Minor: line 295: may more suitable -> may be more suitable

Questions

See Weaknesses.

Limitations

The authors provide a brief limitation summary in the Conclusion. I would appreciate it if the authors could refine this part, since the "limitations" listed there do not seem like limitations.

Author Response

We sincerely thank you for your insightful comments, which have catalyzed numerous enhancements and refinements to the paper. In the following, we reply to the questions one by one for the convenience of checking.


Weakness 1: Most discoveries in this paper have been implicitly discussed for several months within the community, although they are first presented formally in this paper. Besides, these discoveries lean toward empirical evidence, which is relatively shallow. This would make this paper's technical contribution weak. I would appreciate it if the authors could provide more in-depth explanations of these discoveries, in particular: 1. Transformer-like models favor short sequences. 2. The significant role of the hidden attention.

Response: Thanks for your comment. Past research on trajectory optimization usually approaches RL problems as sequence modeling problems in various ways. However, these studies are often constrained by issues related to large parameter counts and poor scalability, and there has been relatively little investigation into attention mechanisms. This study on DeMa contributes to the development of scalable and efficient decision models. Moreover, our empirical study lays a foundation for the further application of Mamba in RL.

More explanations of our discoveries are as follows:

  1. Most environments in trajectory optimization are Markovian. In these environments, a model that takes only the current state as input and outputs the current action can also work (e.g., CQL), which means that excessively long historical information provides limited assistance for current decision-making. As shown in Figures B-C in the one-page PDF file, this result still holds in environments where the Markov property is relaxed. The results from the maze navigation task demonstrate that while DeMa's attention to past information increases, the hidden attention mechanism still primarily focuses on current information. This explains why the Transformer-like model favors short sequences.

  2. The significant role of hidden attention lies in its ability to aggregate contextual information efficiently. Hidden attention offers a third view of Mamba, showing that such models can be viewed as attention-driven models [1]. It allows the model to selectively propagate or forget information along the sequence-length dimension depending on the current token. This property significantly reduces the number of parameters and aligns with the Markov property of RL environments. As a result, DeMa can achieve better performance than DT with fewer parameters.
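For concreteness, the hidden-attention view of [1] rewrites the selective SSM output as an attention-like weighted sum over past inputs (this is a compact restatement of the formulation in Section 3.3; the weight symbol $\alpha_{i,j}$ is introduced here only for illustration):

$$y_i=\sum_{j=1}^{i}\alpha_{i,j}\,x_j,\qquad \alpha_{i,j}=C_i\Big(\prod_{k=j+1}^{i}\bar{A}_k\Big)\bar{B}_j.$$

Because each $\bar{A}_k$ is input-dependent, the product term can suppress the contribution of distant tokens, which is exactly the selective propagation/forgetting described above, obtained without a quadratic attention matrix.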

[1] The Hidden Attention of Mamba Models.


Weakness 2: Although the experimental results are solid, I found that this paper is more suitable for Benchmark Track, since the technical novelty revolves around benchmarking Decision Mamba.

Response: Thanks for your comment. DeMa, proposed in this paper, can significantly alleviate the challenges of large parameter counts and limited scalability in Transformer-based trajectory optimization methods while maintaining good performance. Furthermore, our analysis of DeMa and the hidden attention mechanism provides valuable practical insights that apply to other sequence-based decision-making models.


Weakness 3: Figure 1 (the title and the pic) should be improved. For now, it confuses me, especially the corresponding relationship between the text content (title) and the illustration.

Response: Thanks for your comment. We have redrawn Figure 1 to improve clarity and readability; it is provided as Figure A in the one-page PDF file. Overall, we have added numbering to each subplot, revised the descriptions in the caption, and arranged the subplots side by side according to two levels of structure: architecture and block.


Weakness 4: Minor: line 295: may more suitable -> may be more suitable

Response: Thanks for your comment; we have fixed this error. Additionally, we conducted a thorough review of the entire manuscript and corrected similar typos throughout.


Limitations: The authors provide a brief limitation summary in Conclusion. I would appreciate the authors if they could refine this part since the "limitations" listed there do not seem like limitations.

Response: Thanks for your comment. We have refined the limitations at the end of our revised manuscript.

Limitations: We investigate the application of Mamba in trajectory optimization and present findings that provide valuable insights for the community. However, several limitations remain:

  1. Trajectory optimization tasks typically involve shorter input sequences, which raises the question of how well the RNN-like DeMa performs in terms of memory capacity in RL compared to models such as RNNs and LSTMs. The potential of RNN-like DeMa warrants further exploration, particularly in POMDP environments and tasks that require long-term decision-making and memory.

  2. Although we examine the importance of the hidden attention mechanism in Section 4.2, our exploration is still in its nascent stages. Future work could leverage interpretability tools to further examine the causal relationship between memory and current decisions in DeMa, ultimately contributing to the development of interpretable decision models.

  3. While we have assessed the properties of DeMa and identified improvements in both performance efficiency and model compactness compared to DT, it remains unclear whether DeMa is suitable for multi-task RL and online RL environments.


We hope that the above answers can address your concerns satisfactorily. We would be grateful if you could re-evaluate our work based on the above responses. We look forward to receiving your further feedback.

Comment

All my concerns are addressed (at least a majority of them). I suggest the authors properly add these refinements to their paper. I will raise my score to 6.

Comment

We are grateful for the endorsement of Reviewer 85kG. We will carefully follow your constructive comments and include the corresponding content in the revision to improve our submission.

Best, The Authors of Submission 8092

Review
Rating: 6

This paper investigates how Mamba performs in trajectory optimization in offline RL, with ablation analyses of Mamba's data input structures and architectural structures, and shows that the Mamba-based DT can achieve SOTA performance with fewer parameters.

Strengths

  1. The paper is well written, and the visualizations look good.
  2. The input concatenation experiments provide useful practical insight for other sequence-based decision-making models as well.
  3. The paper provides a detailed analysis of how various components of Mamba, such as the hidden attention mechanism and different residual structures, influence performance.

Weaknesses

  1. Finding 3 is not very surprising in the tested MDP environments, since by definition they should focus only on recent states. It would be interesting to explore how this mechanism might perform in environments with long-term dependencies, where the Markov property does not hold strictly.
  2. The method is only tested on standard Atari and MuJoCo tasks. How would Mamba perform on tasks that require long-horizon planning skills, such as maze navigation or tasks with delayed rewards?

Questions

Please see Weaknesses.

Limitations

Please see Weaknesses.

Author Response

We sincerely thank you for your insightful comments, which have catalyzed numerous enhancements and refinements to the paper. In the following, we reply to the questions one by one for the convenience of checking.


Weakness 1: Finding 3 is not very surprising in the tested MDP environments. It would be interesting to explore how this mechanism might perform in environments with long-term dependencies where the Markov property does not hold strictly.

Response: Thanks for your comments. We conduct additional explorations in environments involving maze navigation (maze2d, antmaze) and delayed rewards (MuJoCo with delayed rewards). The performance of the hidden attention mechanism is illustrated in Figures B-C in the one-page PDF file, which show that although DeMa's attention to past information increases, the hidden attention mechanism still prioritizes the current information when the Markov property of the environment is relaxed. Furthermore, among historical information, the hidden attention mechanism demonstrates a significantly higher focus on states compared to rewards or actions.

There remains a significant gap in the analysis of the relationship between the Markov property and attention mechanisms. Most studies focus on the attention scores of outputs $y_j$ relative to inputs $x_i$ during the training phase; however, this does not effectively evaluate the model's context attention during decision-making. Our analysis reveals that the hidden attention mechanism predominantly emphasizes the current token at each decision-making step, even when the Markov property of the environment is relaxed. Thus, our finding (Finding 3) offers valuable insights and guidance for the community.
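As a concrete illustration of this kind of analysis (a minimal sketch under our own assumptions, not the authors' analysis code), one can take the hidden-attention weights of the final token over a context of K interleaved (return, state, action) triples and compare the total mass assigned to each token type; the interleaving order and function name below are hypothetical:

```python
import numpy as np

def attention_by_token_type(attn_row: np.ndarray) -> dict:
    """Aggregate the attention of the last token over an interleaved
    (return, state, action) context and report the mass per token type.

    attn_row: weights of length 3*K, assumed ordered as
              R_1, s_1, a_1, ..., R_K, s_K, a_K.
    """
    w = np.abs(attn_row)
    w = w / w.sum()
    return {
        "return": float(w[0::3].sum()),
        "state": float(w[1::3].sum()),
        "action": float(w[2::3].sum()),
        "current step": float(w[-3:].sum()),  # the most recent triple
    }

# Toy example: weights that decay toward earlier tokens.
K = 20
toy_weights = np.exp(np.linspace(-4.0, 0.0, 3 * K))
print(attention_by_token_type(toy_weights))
```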


Weakness 2: How would Mamba perform on tasks that require long-horizon planning skills, such as maze navigation or tasks with delayed rewards?

Response: Thanks for your comment. We conduct three new experiments involving maze navigation (maze2d, antmaze) and delayed rewards (MuJoCo with delayed rewards).

  1. maze2d: This environment aims at reaching goals with sparse rewards, which is suitable for assessing the model’s capability to efficiently integrate data and execute long-range planning. The objective of this domain is to guide an agent through a maze to reach a designated goal.

  2. antmaze: This environment is similar to maze2d, but the agent is an ant with 8 degrees of freedom.

In these environments, we adjust the hyper-parameters, which will be provided in the Appendix of our revised manuscript. The results in Table B show that DeMa performs better than DT in the maze navigation tasks.

Table B: Results for maze2d and antmaze. We report the mean across 3 seeds.

| Dataset | Env | DT | GDT | DC | DeMa (Ours) |
|---|---|---|---|---|---|
| umaze | maze2d | 31 | 50.4 | 36.3 | 54.5 |
| medium | maze2d | 8.2 | 7.8 | 2.1 | 16.7 |
| large | maze2d | 2.3 | 0.7 | 0.9 | 6.6 |
| umaze | antmaze | 59.2 | 76 | 85 | 96.7 |
| umaze-diverse | antmaze | 53 | 69 | 78.5 | 96.7 |
  3. MuJoCo with delayed rewards: This is a delayed-return version of the D4RL benchmarks in which the agent does not receive any rewards along the trajectory, and instead receives the cumulative reward of the trajectory at the final timestep.
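For clarity, such a delayed-reward dataset can be constructed by zeroing the per-step rewards and placing the trajectory's cumulative return on the final step. The following is a minimal sketch of that conversion (our own illustration, not the authors' preprocessing code):

```python
import numpy as np

def delay_rewards(rewards: np.ndarray) -> np.ndarray:
    """Convert a dense per-step reward sequence into its delayed version:
    zero reward at every step except the last, which receives the
    cumulative return of the whole trajectory."""
    delayed = np.zeros_like(rewards)
    delayed[-1] = rewards.sum()
    return delayed

# Example: [1.0, 0.5, 2.0] -> [0.0, 0.0, 3.5]
print(delay_rewards(np.array([1.0, 0.5, 2.0])))
```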

In this environment, we train DeMa using the same hyper-parameter settings, and the results are shown in Table C. The results show that CQL is the most affected, while DT also experiences a certain degree of influence. In contrast, DeMa and GDT are relatively less impacted. The results indicate that DeMa performs effectively in tasks with delayed rewards.

Table C: Results for D4RL datasets with delayed (sparse) reward. The "Origin Average" in the table represents the normalized scores of evaluations across six datasets under the original dense reward setting. We report the mean across 3 seeds. The dataset names are abbreviated as follows: "medium" as "M", "medium-replay" as "M-R".

| Dataset | Env | CQL | DT | GDT | DeMa (Ours) |
|---|---|---|---|---|---|
| M-Delayed | HalfCheetah | 1.0 | 42.2 | 43 | 42.9 |
| M-Delayed | Hopper | 23.3 | 57.3 | 58.2 | 69.1 |
| M-Delayed | Walker | 0.0 | 69.9 | 78.9 | 77.6 |
| M-R-Delayed | HalfCheetah | 7.8 | 33.0 | 41 | 41.1 |
| M-R-Delayed | Hopper | 7.7 | 50.8 | 79.8 | 83.8 |
| M-R-Delayed | Walker | 3.2 | 51.6 | 70.4 | 71.7 |
| Average | - | 7.2 | 50.8 | 61.9 | 64.4 |
| Origin Average | - | 65.5 | 63.4 | 63.8 | 66 |

Overall, DeMa achieves better performance than DT with fewer parameters in tasks that require long-horizon planning skills.


We hope that the above answers can address your concerns satisfactorily. We would be grateful if you could re-evaluate our work based on the above responses. We look forward to receiving your further feedback.

Comment

Dear Reviewer qC8w,

We have carefully considered and addressed your initial concerns regarding our paper. We are happy to discuss them with you in the openreview system if you feel that there still are some concerns/questions. We also welcome new suggestions/comments from you!

Best Regards,

The authors of Submission 8092

Comment

Thank you for conducting new experiments in new environments to address my concerns. I have decided to increase my score to 6.

Comment

Thanks for your recognition of our work and feedback on our response. We will carefully follow your constructive comments and include the corresponding contents in the revision to improve our submission.

Best,

The Authors of Submission 8092

Review
Rating: 5

The work introduces Decision Mamba (DeMa) to address the challenges in offline RL posed by the large parameter size and limited scalability of Transformer-based methods. DeMa aims to achieve similar performance to Transformers with significantly fewer parameters, and it surpasses DT with significantly fewer parameters on the benchmarks.

Strengths

  1. Extensive evaluations demonstrate the effectiveness of DeMa, highlighting its superior performance and efficiency compared to existing methods.

  2. DeMa provides a novel solution to the parameter size and scalability issues in trajectory optimization.

Weaknesses

  1. Some symbols are not defined before use.

  2. This paper seems to have little relation to RL and appears more like a method applicable to all trajectory optimization.

  3. There is too little discussion of the relationship to RL in Sections 3.2 and 3.3.

Questions

  1. What are the definitions of $L_{MSE/CE}$ and $s_{-K:t}$?

  2. Is DeMa applicable to all trajectory optimization methods?

Limitations

The suggestions have been stated in "Weaknesses" and "Questions".

Author Response

We sincerely thank you for your insightful comments, which have catalyzed numerous enhancements and refinements to the paper. In the following, we reply to the questions one by one for the convenience of checking.


Weakness 1 & Question 1: Some symbols are not defined before use.

Response: Thanks for your comment. We have thoroughly reviewed the manuscript and provided definitions for all symbols. Additionally, we have removed some redundant symbols to enhance the article's readability. Major modifications include:

  1. $L_{MSE/CE}$ in line 123. It represents the loss function of DeMa: when the output corresponds to continuous actions (as in MuJoCo), the loss function is the Mean Squared Error (MSE), while when the output pertains to discrete actions, the loss function is the cross-entropy loss.
  2. $s_{-K:t}$ in line 124. The subscript denotes a range from time step $t-K+1$ to time step $t$. We have corrected it to $s_{t-K+1:t}$ to accurately represent the intended range from time step $t-K+1$ to time step $t$.
  3. $A, B, C, D$ in line 133. $A\in\mathbb{R}^{N\times N}$, $B\in\mathbb{R}^{N\times 1}$, $C\in\mathbb{R}^{1\times N}$, and $D\in\mathbb{R}$ are parameter matrices in the State Space Model.
  4. $\bar{K}$ and $\bar{C}$ in lines 136-137. We have removed the redundant symbol $L$ and eliminated the symbol $\bar{C}$, which was used incorrectly. It now becomes
$$y_{i}=C\bar{A}^{i-1}\bar{B}u_1+C\bar{A}^{i-2}\bar{B}u_2+\cdots+C\bar{A}\bar{B}u_{i-1}+C\bar{B}u_i,\quad y=u*\bar{K},\qquad \bar{K}=(C\bar{B},\,C\bar{A}\bar{B},\,\ldots,\,C\bar{A}^{k}\bar{B},\,\ldots).$$
  5. Lines 158-159. To provide a clearer formulation of the hidden attention, we have revised it as follows:
$$y_i=C_i\sum_{j=1}^{i}\Big(\prod_{k=j+1}^{i}\bar{A}_k\Big)\bar{B}_j x_j,\qquad h_i=\sum_{j=1}^{i}\Big(\prod_{k=j+1}^{i}\bar{A}_k\Big)\bar{B}_j x_j,$$
where $B_i=S_B(\hat{x}_i)$, $C_i=S_C(\hat{x}_i)$, and $\Delta_i=\text{softplus}(S_{\Delta}(\hat{x}_i))$, with $S_B$, $S_C$, and $S_{\Delta}$ being linear projection layers; SoftPlus is an elementwise function that is a smooth approximation of ReLU. $\bar{A}_i$ and $\bar{B}_i$ are the discretizations of $A$ and $B_i$, that is, $\bar{A}_{i}=\exp(\Delta_{i}A)$ and $\bar{B}_{i}=\Delta_{i}B_i$.
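For readers who prefer code to notation, the following is a minimal NumPy sketch (ours, not the paper's implementation) of how the hidden attention weights defined above can be computed for a toy one-channel selective SSM with a diagonal $A$; all sizes and random projections are illustrative, and the gating and per-channel structure of the full Mamba block are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, N = 6, 4, 8                        # sequence length, token dim, state dim

x = rng.normal(size=(L, D))              # toy input tokens x_1..x_L
A = -np.abs(rng.normal(size=(N,)))       # diagonal continuous-time state matrix
S_B = rng.normal(size=(D, N))            # linear projection S_B: x_i -> B_i
S_C = rng.normal(size=(D, N))            # linear projection S_C: x_i -> C_i
S_delta = rng.normal(size=(D,))          # linear projection S_Delta: x_i -> Delta_i

def softplus(z):
    return np.log1p(np.exp(z))

# Input-dependent (selective) parameters, discretized per token:
#   A_bar_i = exp(Delta_i * A),  B_bar_i = Delta_i * B_i
delta = softplus(x @ S_delta)            # (L,)
A_bar = np.exp(delta[:, None] * A)       # (L, N): diagonal entries of each A_bar_i
B_bar = delta[:, None] * (x @ S_B)       # (L, N)
C = x @ S_C                              # (L, N)

# Hidden attention weights: alpha[i, j] = C_i (prod_{k=j+1..i} A_bar_k) B_bar_j
alpha = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        decay = np.prod(A_bar[j + 1 : i + 1], axis=0)  # identity when j == i
        alpha[i, j] = C[i] @ (decay * B_bar[j])

print(np.round(alpha, 3))  # lower-triangular weights over past tokens
```

Rows of alpha can then be visualized as attention maps at each decision step, which is the kind of inspection behind the hidden attention visualizations referenced in our responses.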


Weakness 2: This paper seems to have little relation to RL and appears more like a method applicable to all trajectory optimization.

Response: Thank you for your comments.

1. Trajectory optimization methods treat RL problems as sequence modeling problems to obtain better performance and generalization [1]. Decision Mamba (DeMa), proposed in this paper, is designed to address the challenges posed by Transformer-based trajectory optimization methods, particularly the issues of large parameter sizes and limited scalability, which are long-standing issues that have been largely overlooked. Exploring the potential of DeMa in RL tasks will contribute to the development of scalable and efficient decision models, which will significantly enhance the practical applications of RL.

We revise Sections 1-2 to emphasize the significance of DeMa in the context of RL.

2. Our paper primarily focuses on the analysis of DeMa in RL to provide valuable insights for the community. Indeed, DeMa can be combined with other trajectory optimization methods to achieve even better performance.

We conduct additional experiments, the results of which are summarized in the table below. By integrating DeMa with QT [2], we develop Q-DeMa, which achieves performance comparable to state-of-the-art (SOTA) models while utilizing less than one-seventh of the parameter size of QT. This finding underscores the significant potential of applying Mamba to RL and emphasizes the critical importance of the research presented in this paper.

Table A: Results for D4RL datasets. The dataset names are abbreviated as follows: "medium" as "M", "medium-replay" as "M-R". We report the mean across 3 seeds.

| Dataset | Env-Gym | DT | DeMa (Ours) | QT | Q-DeMa (Ours) |
|---|---|---|---|---|---|
| M | HalfCheetah | 42.6 | 43 | 51.4 | 51.2 |
| M | Hopper | 68.4 | 74.5 | 96.9 | 88.1 |
| M | Walker | 75.5 | 76.6 | 88.8 | 89.1 |
| M-R | HalfCheetah | 37.0 | 40.7 | 48.9 | 48.6 |
| M-R | Hopper | 85.6 | 90.7 | 102.0 | 101.5 |
| M-R | Walker | 71.2 | 70.5 | 98.5 | 99.8 |
| Average | - | 63.4 | 66.0 | 81.0 | 79.7 |
| All params # | - | 0.7M/2.6M | 0.2M/0.5M | 3.7M | 0.5M |

[1] On Transforming Reinforcement Learning with Transformers: The Development Trajectory.

[2] Q-value Regularized Transformer for Offline Reinforcement Learning.


Weakness 3: There is too little discussion on the relationship to RL in Sections 3.2 and 3.3.

Response: Thanks for your comments. We revise Sections 3.2 and 3.3 to enhance the discussion regarding their relation to RL. Specifically, Section 3.2 provides a concise introduction to the two types of Mamba, allowing for the utilization of both types of DeMa in RL. To expand our analysis, we introduce hidden attention in Section 3.3. This enables us to visualize the hidden attention matrices within DeMa, thereby gaining a deeper understanding of the model's internal behaviors.


Question 2: Is DeMa applicable to all trajectory optimization methods?

Response: Thanks for your comments. DeMa is applicable to most transformer-based trajectory optimization methods, as it is designed to address the challenges posed by transformers, particularly the issues of large parameter sizes and limited scalability. By integrating DeMa with QT [2], we develop Q-DeMa, which achieves performance comparable to state-of-the-art (SOTA) models while utilizing only one-seventh of the parameter size of QT. This finding underscores the significant potential of applying Mamba to RL and emphasizes the critical importance of the research presented in this paper.


We hope that our answers can address your concerns satisfactorily and improve the clarity of our contribution. We would be grateful if you could re-evaluate our paper. We look forward to receiving your further feedback.

Comment

Thank the authors for the clarification and additional experiments. I have decided to increase my score.

Comment

We really appreciate your further comment and your recognition of our responses. We will carefully follow your constructive comments and include the corresponding contents in the revision to improve our submission.

Best,

The Authors of Submission 8092

Comment

Dear Reviewer BojN,

The authors really appreciate your time and effort in reviewing this submission, and eagerly await your response. We understand you might be quite busy. However, the discussion deadline is approaching.

We have provided detailed responses to every one of your concerns/questions. Please help us to review our responses once again and kindly let us know whether they fully or partially address your concerns and if our explanations are in the right direction.

Thanks for your attention.

Best regards,

The authors of submission 8092.

Author Response

We want to thank all the reviewers for their thoughtful suggestions on our submission, and we appreciate that the reviewers have multiple positive opinions of our work, including:

  • novelty (BojN, 85kG)
  • good writing, good visualizations (qC8w)
  • the detailed analysis provides useful practical insight (BojN, qC8w)

We provide a summary of our responses, and we will carefully revise our manuscript by adding suggested evaluations, providing more detailed explanations, and fixing the typos.

Introduction, Related Works, and Preliminaries:

  • We strengthen the relationship between DeMa, trajectory optimization, and RL. (for Reviewer BojN)
  • We refine the notations for improved clarity. (for Reviewer BojN)
  • We enhance the discussion regarding their relation to RL. (for Reviewer BojN)

Experiments:

  • We redraw Figure 1 to improve clarity and readability. (for Reviewer 85kG)
  • We explore the applicability of combining DeMa with another method. (for Reviewer BojN)
  • We analyze the hidden attention in tasks that require long-horizon planning skills and in tasks with delayed rewards. (for Reviewer qC8w)
  • We explore the potential of DeMa in tasks that require long-horizon planning skills and in tasks with delayed rewards. (for Reviewer qC8w)
  • We provide more explanations for our discoveries. (for Reviewer 85kG)

Conclusion:

  • We provide more details about our limitations. (for Reviewer 85kG)

We appreciate all the reviewers' time and effort again. All these comments and suggestions are very insightful and beneficial for us to improve the quality of this work.

Final Decision

This paper presents a comprehensive investigation of leveraging the Mamba architecture for trajectory optimization. The proposed DeMa achieves comparable or better performance than the DT class of methods, but with far fewer parameters. A group of findings is obtained through comprehensive experiments, which can potentially be valuable for future research based on the Mamba architecture.

Strengths:

  • The paper provides extensive evaluations and ablation analyses to show how various components of Mamba, such as the hidden attention mechanism, way of concatenation, and different residual structures, influence performance. These could provide useful practical insights for other sequence-based decision-making models.

Weaknesses:

  • I share Reviewer 85kG's feeling that some of the findings are somewhat shallow; it would be better if the authors could provide more in-depth explanations. Nevertheless, the comprehensive experiments and investigation are meaningful and can potentially benefit the community.
  • The paper primarily conducted evaluations on the D4RL datasets. Note that most of the D4RL datasets are generated using RL policies, which by nature exhibit strong Markovian properties. Considering that D4RL datasets are somewhat biased compared with most real-world offline decision-making datasets, conducting experiments solely on D4RL might lead to misleading insights. The study could benefit from exploring its effectiveness on datasets that are more non-Markovian, such as human demonstration data in the robotic manipulation domain. This is also partly mentioned by Reviewers 85kG and qC8w.

The paper received unanimously positive feedback after the rebuttal. Although the paper still has some issues, I think it is a timely investigation of Mamba architecture for the community.

I suggest the authors include the additional Antmaze results in the main paper, as well as conduct additional experiments on some non-Markovian human demonstration data for robotics tasks in the final version. It should be noted that D4RL is actually not a very good benchmark for evaluating the real-world practical performance of offline RL methods, as it is heavily biased (strongly Markovian, unrealistic state-action space coverage, unreasonably large amounts of data for simple tasks). I suspect that on some long-horizon, non-Markovian data/tasks, some of the findings might lead to different conclusions; we should be careful not to release misleading insights to the community.