A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks
Abstract
Reviews and Discussion
This paper takes a recurrent approach to large action models, in a setting where current robotic policies are de facto implemented as Transformers. The paper systematically analyzes xLSTM under various conditions and across different environments, demonstrating the superior performance of xLSTM compared to transformer-based policies. Due to the recurrent characteristics of xLSTM, the Large Recurrent Action Model (LRAM) can efficiently scale up to large contexts without incurring the quadratic complexity associated with increasing token counts.
Update after rebuttal
After carefully reading the rebuttal, I have decided to keep my original score.
Questions for Authors
- The paper adopts 256-bin discretization for continuous actions, following [Reed et al., 2022] and [Brohan et al., 2023b]. Many robotic models (e.g., OpenVLA [Kim et al., 2024]) use this discretization, especially for autoregressive architectures; however, some (e.g., ACT [Zhao et al., 2023]) output continuous actions directly. Are there specific reasons for modeling with discretization rather than direct continuous actions, or was this decision influenced by the Decision Transformer backbone?
- The paper replaces Decision Transformer’s transformer backbone with xLSTM. Are there alternative architectures beyond Decision Transformer that might better suit recurrent models? Do the authors have insights into what could be an even better design for recurrent large-action models?
Claims and Evidence
The claims of this work are clear and well-supported. The xLSTM-based LRAM efficiently exploits large contexts, with sufficient ablation studies on each xLSTM block (mLSTM and sLSTM). The authors also conducted latency and throughput studies, comparing xLSTM to transformer-based architectures, demonstrating that the xLSTM-based architecture is more lightweight and efficient to deploy compared to transformers.
Methods and Evaluation Criteria
The proposed method focuses on utilizing xLSTM with an existing RL model backbone, evaluating it on various offline RL benchmarks with large numbers of transitions. Although the paper does not propose a new architecture, the comprehensive analysis of the xLSTM-based LRAM is sufficient to identify trends in efficiency and performance for recurrent models compared to Transformers through experiments on various benchmarks.
Theoretical Claims
I checked the theoretical claims, especially Equation 2, which is based on the Decision Transformer framework. I also reviewed the reasoning behind excluding action tokens, which is well-justified.
Experimental Design and Analysis
I reviewed the experimental design, and the paper provides sufficient information. However, in Section 4.4 (Inference Time Comparison), the model setup with custom kernels was not fully clear. My complaint is that I did not entirely understand how the custom kernels were designed and adapted to accelerate inference, as this is not common knowledge for general readers.
Supplementary Material
I fully reviewed the supplementary material, including implementation details. The paper provides detailed information about the environment and hyperparameters used for the experiments.
Relation to Existing Literature
The contribution of this paper is not in introducing a novel architecture but in demonstrating the feasibility of recurrent models for large action models. With comprehensive analysis and ablation studies, the paper provides an important milestone for recurrent-based robotic action models.
Missing Important References
Overall, the paper sufficiently discusses related references.
Other Strengths and Weaknesses
Strengths
- The recurrent-based action model is especially important in online, real-world settings, where context length varies dynamically and parallelization is difficult. I think this paper could mark a potential turning point for real-world agent design.
- The paper thoroughly analyzes different types of recurrent-based large action models, including xLSTM and Mamba, conducting ablation studies to assess performance and efficiency.
Weaknesses
- While the paper argues for the importance of recurrence in real-world settings, all experiments are conducted in offline RL. To fully validate LRAM, real-world experiments would be highly beneficial.
- The paper extensively analyzes architectural efficiency and performance, but it does not fully explore the advantages of recurrence in long-term memory retention. More experiments in memory-intensive environments could highlight how recurrent backbones differ from transformers in terms of state persistence and recall.
Other Comments or Suggestions
None
Thank you for your constructive feedback. We appreciate your positive assessment and are very glad that you believe that our work may mark a potential turning point for real-world agent design.
Real-World Experiments: We agree that experiments in real-world robotics settings would be valuable. While we believe that such experiments are out of scope for the current work, evaluating modern recurrent architectures in real-world robotics settings is an exciting direction for future research. Nevertheless, we believe our findings will transfer to real-world scenarios and that the advantages of recurrent architectures may be even more pronounced there. For example, we conduct our inference time comparison on a high-end data-center GPU with 40GB of RAM, and the Transformer runs OOM for larger context sizes. In contrast, applications on edge devices in real-world scenarios may have to deal with less powerful accelerators, making the use of modern recurrent architectures attractive. Moreover, their ability to handle long sequences without increasing computational cost/requirements can be particularly beneficial for complex real-world applications, which often exhibit longer-term dependencies.
Memory-intensive Environments: We again agree that further exploring the advantages of modern recurrent architectures in long-term memory retention would be interesting. While we cannot provide a detailed study within the time-frame of the rebuttal, we believe that this direction is interesting for future work. Note that the experiments on Dark-Room, which exhibits sparse rewards and a partially observable observation space, go in a related direction (Figures 4 and 16). There we find that recurrent backbones compare favorably to Transformers (especially xLSTM [7:1], which enables state-tracking). In the meantime, we refer to [1] for a comparison of a vanilla LSTM to the Transformer in memory-intensive RL environments. Furthermore, we refer to Figure 5 in [2] for a comparison of xLSTM and the Transformer on associative recall tasks outside the field of RL.
Discretization for Continuous Actions: The reviewer is correct that we discretize continuous actions, similar to prior works. The 432 tasks we consider exhibit both discrete/continuous action inputs and image/vector-based state representations. The use of discretization in our large action models is motivated by the need to handle both discrete and continuous action spaces. To this end, every action dimension is discretized into 256 uniformly spaced bins. The shared action head is used to predict the action bins of all continuous action dimensions jointly, which allows handling environments with different action spaces. Furthermore, because of the shared action head, we do not (have to) rely on autoregressive prediction of continuous action dimensions, which further speeds up inference for all backbones. At inference time, the number of dimensions of the current environment is known, and we extract the respective dimensions from the joint predictions. We describe this procedure in detail in Appendix B.3.
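For readers of this thread, a minimal sketch of this kind of uniform 256-bin discretization (the [-1, 1] bounds, names, and sizes are our assumptions for illustration, not the paper's code; see Appendix B.3 for the actual procedure):

```python
# Hedged sketch of uniform 256-bin action discretization (illustrative only).
import numpy as np

NUM_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action bounds

def discretize(actions: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to one of 256 uniform bins."""
    clipped = np.clip(actions, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (NUM_BINS - 1)).astype(np.int64)

def undiscretize(bins: np.ndarray) -> np.ndarray:
    """Map bin indices back to values on the uniform grid."""
    return bins.astype(np.float64) / (NUM_BINS - 1) * (HIGH - LOW) + LOW

# At inference time, the shared head predicts bins for a fixed maximum number of
# action dimensions; only the dimensions of the current environment are kept.
max_dims, env_dims = 8, 4  # hypothetical sizes
joint_prediction = np.random.randint(0, NUM_BINS, size=max_dims)
action = undiscretize(joint_prediction[:env_dims])
```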
Alternative Designs: In this work, our goal is to better understand whether modern recurrent backbones can be alternatives for large action models (LAMs), and we focus our analyses on the DT setup, as correctly identified by the reviewer. DTs rely on return-conditioning, but existing large-scale robotic models instead rely on behavior cloning (e.g., [3]). Therefore, we validate that our findings also transfer to a behavior cloning setting at the 206M scale (Section 4.3 and Appendix E.2). Exploring alternative architectures beyond the DT and BC settings is an exciting direction for future work. We believe that exploring MoEs [4,5] for multi-task settings, which may require specialization, can be a promising approach. Moreover, we would like to point the reviewer to an interesting concurrent work that studies the design space of imitation learning policies [6].
Once again, thank you for your helpful comments and positive assessment of our work.
[1] “When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment”, NeurIPS, 2023
[2] “xLSTM: Extended Long Short-Term Memory”, NeurIPS, 2024
[3] “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control”, ArXiv, 2023
[4] “Mixture of Experts in a Mixture of RL settings”, RLC, 2024
[5] “Mixtures of Experts Unlock Parameter Scaling for Deep RL”, ICML, 2024
[6] “X-IL: Exploring the Design Space of Imitation Learning Policies”, ArXiv, 2025
Dear Authors, thank you for your detailed response. I believe most of my concerns have now been addressed.
Thank you for your reply. We are happy to hear that your remaining points have been addressed.
This paper investigates various architectures, including Transformers, Mamba, and xLSTM, for reinforcement learning. Building on the Decision Transformer framework, it systematically compares these architectures across 432 tasks spanning six datasets. The empirical results highlight xLSTM’s advantages in both performance and inference speed, demonstrating its effectiveness as a scalable alternative.
Questions for Authors
- I am a bit confused regarding the experiment setting. The paper uses 6 different datasets. Do the authors train one model over all the tasks? Additionally, in Section 3.2 “Shared action head”, can the authors explain more about action discretization?
- From my understanding, Mamba and xLSTM only run inference faster when there are more than several thousand tokens (I am not absolutely sure). However, in Figure 6, xLSTM is still faster even with a small context length.
Claims and Evidence
The claims in the paper are supported by thorough experiments.
Methods and Evaluation Criteria
The proposed method and evaluation make practical contributions to the community.
Theoretical Claims
There is no theoretical proof to check in this paper.
Experimental Design and Analysis
The paper conducts comprehensive experiments and ablation analyses.
Supplementary Material
I reviewed the experimental details and additional results.
Relation to Existing Literature
The paper is based on the Decision Transformer [1] and further explores recent architectures Mamba and xLSTM to improve the performance and inference speed.
[1] Chen, Lili, et al. "Decision transformer: Reinforcement learning via sequence modeling." Advances in neural information processing systems, 2021.
Missing Important References
The references are sufficient to understand the paper.
Other Strengths and Weaknesses
Strengths:
- The paper is well-written and easy to follow.
- The paper provides extensive experiments and ablation studies.
- The paper compares the latent representations between Transformer and xLSTM, which is informative.
Weaknesses:
- The paper is based on the Decision Transformer, which is not widely used these days. It would be better to extend the architectures to other reinforcement learning or imitation learning settings.
- The paper mainly compares Transformer, Mamba, and xLSTM through empirical evaluations, but does not discuss the reasons behind the results. It would be valuable to provide more discussion regarding why xLSTM can outperform Mamba and Transformers.
- The paper compares the inference time using different context lengths from 50 to 24576. However, there is no indication that the model will perform better with a very long context length. Additionally, most robot learning tasks do not contain such long trajectories. It would be beneficial to discuss scenarios where long contexts offer a significant advantage.
Other Comments or Suggestions
The paper introduces Transformer, Mamba, and xLSTM in the Introduction and Related Work sections. Since these architectures are the main focus, it may make sense to explain their differences in a dedicated paragraph.
Thank you for your helpful feedback and positive assessment of our work.
Imitation Learning: We agree with the reviewer that studying modern recurrent architectures in settings other than the DT setting is important. In this work, our goal is to better understand whether modern recurrent backbones can be alternatives for large action models (LAMs) and focus our analyses on the DT setup. However, we want to highlight that we already provide experiments in an imitation learning / behavior cloning (BC) setting at the 206M scale, as suggested by the reviewer (Section 4.3 and Appendix E.2). Notably, performance trends across backbones mirror the results in the DT setting (Figures 27, 28), which indicates that our findings generalize beyond the DT framework. Moreover, we agree with the reviewer that studying modern recurrent architectures in RL is interesting. While we believe that such experiments are out-of-scope for this work, we listed online RL with modern recurrent architectures in our future work section (Lines 431-435).
Differences between Backbones: While we discuss when and why to use which backbone (Section 5), we agree with the reviewer that an extended discussion on the reasons for performance differences between backbones is useful:
- We observed that xLSTM outperforms Transformers/Mamba in terms of sequence prediction and environment performance (Figure 2). One benefit of xLSTM is that it enables state tracking [1] via sLSTM blocks, which Transformers/Mamba cannot. This property can be useful for partially observable environments and may be a reason for the enhanced ICL performance (Figure 4). Another reason may be the enhanced domain separation in the embedding space (Figure 5), which could facilitate task identification at inference time. Moreover, our results mirror performance improvements observed in language settings [2].
- Nevertheless, the differences may depend on the environment. For example, Transformers may have advantages for tasks where exact recall is required. For such tasks, self-attention is typically superior (Figure 5 in [2]) and can be important for decision-making tasks [3]. Therefore, the choice for the right backbone may depend on the task at hand.
Long Context:
- Generally, performance improves when training with longer sequences both for DT and xLSTM (Section 4.3). For DT, the avg. normalized score increases from 0.3 with C=1 to 0.7 with C=50 (Figure 23). Similarly, for xLSTM it increases from 0.4 to 0.8 (Figure 25). Furthermore, domains with longer episode lengths, like DMC or Atari, benefit more from increasing context (Figures 24, 26). This is because the history helps to predict the next action (e.g., by observing past mistakes). This highlights that LAMs can strongly benefit from increased context length, even on the simulated environments we consider.
- We agree with the reviewer that for our environments, increasing the context to several thousand timesteps may not result in better performance. However, we believe that the ability to handle longer sequences can be beneficial for complex real-world applications that exhibit longer-term dependencies. Similarly, longer context can be beneficial for ICL applications, which benefit from keeping multiple episodes in the context, as indicated in Figure 4.
We added a short discussion on long sequences in real-world tasks to our manuscript.
Experiment Setup: Yes, we train a single model on datasets comprising 432 tasks. The environments contain both discrete/continuous action inputs and image/vector-based state representations. To handle both discrete and continuous actions, we make use of discretization: every action dimension is discretized into 256 uniformly spaced bins. The shared action head is used to predict the action bins of all continuous dimensions jointly, which allows handling environments with different action spaces. At inference time, the number of dimensions of the current environment is known, and we extract the respective dimensions from the joint predictions. We describe this procedure in detail in Appendix B.3.
Inference Speed: The reviewer is correct that the benefits of modern recurrent architectures become more important at longer sequence lengths. Importantly, the inference speed and memory consumption of xLSTM/Mamba does not change with increasing sequence lengths, which is particularly apparent in memory consumption (Figure 6c). It is true that in our experiments there are some inference time speed-ups at shorter context lengths, which may result from the different kernel implementations. Note that the sequence lengths are measured in timesteps with each timestep containing 3 tokens (largest C is 76K tokens).
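As a toy illustration of the constant-per-step property discussed above (our own sketch, unrelated to the paper's kernels or model sizes): a recurrent cell performs a fixed amount of work per step, whereas a cached-attention step performs work proportional to the current context length.

```python
# Toy comparison: fixed-size recurrent state vs. growing attention cache.
import torch

d = 512
W = torch.randn(d, d) / d**0.5

def recurrent_step(x, h):
    # O(d^2) work per step, independent of how many steps came before.
    return torch.tanh(W @ x + h)

keys, values = [], []

def attention_step(q, k, v):
    # Work grows with the number of cached timesteps t.
    keys.append(k); values.append(v)
    K, V = torch.stack(keys), torch.stack(values)
    attn = torch.softmax(K @ q / d**0.5, dim=0)
    return attn @ V

h = torch.zeros(d)
for t in range(100):
    x = torch.randn(d)
    h = recurrent_step(x, h)      # constant cost and memory
    _ = attention_step(x, x, x)   # cost and memory grow with t
```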
We revised our manuscript, and hope to have clarified your questions.
[1] “The Illusion of State in State-Space Models”, ICML, 2024
[2] “xLSTM: Extended Long Short-Term Memory”, NeurIPS, 2024
[3] “When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment”, NeurIPS, 2023
The authors propose changing the Decision Transformer (DT) backbone from a transformer to the recently-proposed xLSTM. They perform large-scale experimentation and compare to other DT backbones.
Questions for Authors
The meaning of the embedding space (Figure 5) is not clear to me. What does it mean if two Atari game embeddings are closer than an Atari and a Procgen embedding? Wouldn't the latter case suggest better generalization (generalizing across platforms rather than within platforms)? Could you comment on this?
Claims and Evidence
The authors contribute:
- A Large Recurrent Action Model (LRAM) using an xLSTM with favorable performance
- Comparisons against other recurrent backbones
- They release code and datasets
In my opinion, they do what they claim. I will note that the claims are fairly weak.
Methods and Evaluation Criteria
The authors evaluate their method by examining:
- Scaling capacity compared to other popular LRAMs
- Returns across many popular benchmarks (Atari, MuJoCo, etc.)
- The embedding space
- Latency and resource usage compared to a decision transformer
These experiments make sense, given they are comparing DT backbones.
Theoretical Claims
The claims made are empirical and cannot be proven.
Experimental Design and Analysis
The experiments seem well-founded and span a wide range of tasks and model baselines.
Supplementary Material
The authors present more experiments, but I did not read them too deeply.
Relation to Existing Literature
Recent prior work on DT has focused on replacing the backbone with Mamba, state-space models, etc. It is only natural to consider the xLSTM as well.
Missing Important References
None that I am aware of
Other Strengths and Weaknesses
The paper is well written and easy to follow. The authors are the first to use an xLSTM within a Decision Transformer framework, and they provide a thorough study of their architecture with a large number of experiments.
However, it is a bit disappointing that there is not much novel here besides changing the DT backbone with another well-known sequence model. As such, there appear to be relatively incremental increases in performance.
Other Comments or Suggestions
There isn't much new here, but the authors provide more than sufficient empirical validation of their method.
Thank you for your helpful feedback. We are glad that you consider our experiments well-founded and the paper well written.
Embedding space:
- Regarding your question on the embedding space, we want to clarify that Figure 5 is constructed using the aggregated hidden states (averaged across the sequence) from the final layer of the respective agents and visualized via UMAP (see Appendix F for details); a minimal sketch of this procedure follows this list.
- The purpose of this visualization is to examine how the models organize their representations of different environments. In general, tasks within the same domain tend to share similar input characteristics - such as visual inputs (e.g., image frames), possible actions to perform, and reward structures - and are therefore more likely to be “grouped” together in the embedding space. Consequently, when embeddings of Atari games are closer to each other than to Procgen games, it indicates that Atari games share more similar underlying dynamics or input structures compared to Procgen. For reference, we now also include the embedding space plots for Mamba in our updated manuscript, as suggested by Reviewer 8NAr14.
- We observe that xLSTM exhibits a slightly more refined and better-separated embedding space, which may be a reason for its better final performance. In contrast, DT produces embeddings with less clear separation between domains. This suggests that for the environments we consider in this work, it may be beneficial for the model to learn more “separate” representations, potentially because it facilitates task identification at inference time. However, we fully agree with the reviewer that generalization across domains is generally desirable if environments share structural similarities. Therefore, we believe that studying the learned embedding spaces of multi-task agents in environments that share more structure across domains is interesting for future work.
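A rough sketch of the visualization procedure, assuming the umap-learn package (shapes, names, and the random placeholder data are ours, not the paper's pipeline):

```python
# Hedged sketch: pool final-layer hidden states, project with UMAP, color by domain.
import numpy as np
import umap  # pip install umap-learn
import matplotlib.pyplot as plt

# hidden_states: (num_episodes, seq_len, d_model) final-layer activations (placeholder data)
hidden_states = np.random.randn(200, 64, 256)
domains = np.random.randint(0, 6, size=200)  # e.g., Atari, Procgen, ...

pooled = hidden_states.mean(axis=1)                          # average across the sequence
embedding = umap.UMAP(n_components=2).fit_transform(pooled)  # project to 2D

for dom in np.unique(domains):
    mask = domains == dom
    plt.scatter(embedding[mask, 0], embedding[mask, 1], s=8, label=f"domain {dom}")
plt.legend()
plt.show()
```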
We hope to have clarified your remaining question. Thank you again for your positive assessment of our work.
The authors introduce a Large Recurrent Action Model (LRAM), which replaces traditional Transformer architectures with xLSTMs to address real-time robotic applications. They demonstrate that the proposed model achieves significant speed improvements without compromising performance. Experimental validation across 432 tasks from six domains confirms LRAM's superiority over Transformers regarding inference speed while maintaining competitive predictive performance.
Update after rebuttal
The authors have partially addressed my questions. However, without any experimental results (e.g., via an anonymous link), the claims cannot be verified by the reviewers. Therefore, I maintain my score.
Questions for Authors
Q1: The dataset compilation is highlighted as a notable contribution. Could you explain the rationale behind the specific data ratios presented in Section 4.1? Understanding the impact of different ratios on multitask capabilities would be insightful.
Q2: Can you provide embedding plots for Mamba alongside xLSTM and DT in Figure 5?
Q3: Could you include latency results for Mamba in Figure 6 to clarify the comparative performance advantages of xLSTMs versus other state-space models (Mamba)?
Claims and Evidence
The authors' claims are supported by substantial experimental evidence, demonstrating that xLSTM generally performs at least as well as the Decision Transformer (DT) and often surpasses it. However, the experimental results also show that xLSTM frequently achieves performance similar to the state-space model (SSM) "Mamba." While the authors clearly illustrate xLSTM's speed advantages over DT, the paper lacks explicit evidence comparing inference speed against state-space models, which somewhat limits the strength of claims regarding real-time applicability. Lastly, the authors acknowledge that the primary target application of LAMs is robotics and that they only tested in simulations; they believe that their findings translate to real-world scenarios. Although there is no direct evidence of this translation in the current work, the authors address this limitation in their discussion.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria, including benchmark datasets, appear appropriate for the targeted problem. Evaluations of performance scores, inference speed, and the impact of fine-tuning are reasonable and thorough.
Theoretical Claims
No major theoretical claims requiring validation are presented, as the study primarily emphasizes empirical model performance.
Experimental Design and Analysis
The experiments and analyses are well-structured, effectively demonstrating the benefits of the xLSTM architecture in inference efficiency and multitask performance. A significant contribution is the thorough investigation into the beneficial effects of xLSTM block ratios, confirming that sLSTMs are advantageous for state-tracking over longer horizons. This aligns with findings in related fields such as time series forecasting (P-LSTM, Kong et al., ArXiv 2024; xLSTM-Mixer, Kraus et al., 2024), broadening the potential impact of this work. However, since the authors highlight dataset compilation as a contribution, evaluating how different data source distributions affect model performance would further enhance the paper's insights into offline RL action model pretraining.
Supplementary Material
The supplementary materials were briefly reviewed, primarily to verify dataset details and additional experimental setups.
Relation to Existing Literature
The authors build on existing large action models by addressing the inherent limitations of Transformer-based approaches in real-time robotics. By integrating and extending recurrent networks, particularly through xLSTM architectures, the paper advances prior work that has traditionally relied on Transformers. Moreover, by discussing state-space models (SSMs), the work connects to a broader scientific trend of improving inference speed and scalability in sequence modeling.
Missing Important References
None that I am aware of.
Other Strengths and Weaknesses
S1. Clear demonstration of xLSTM’s advantages for potential real-time robotic inference.
S2. Robust experimental validation across diverse robotic tasks.
S3. Valuable dataset compilation and pipeline.
W1. Although omitting actions from your policy formulation (Equation 2) is deliberate, its current form may confuse readers comparing it directly to Equation 1. Clarifying the differences, possibly through color-coding or multi-line formatting, would enhance readability.
W2. As mentioned, omitting Mamba’s performance from the experimental comparisons limits the impact of the work, as readers would benefit from a direct performance comparison between DT, xLSTMs, and Mamba.
W3. The model configurations and their particular areas of application are not quite clear to me, as the results in Figure 3 are not properly discussed. Why does [1:0] work so much better on Procgen than [7:1]?
Other Comments or Suggestions
I am really leaning towards a full accept; however, in its current form the experiments do not provide the entire picture, as the performance of SSMs is missing.
Thank you for your helpful feedback on our work and your positive assessment. We are glad that you consider it a clear demonstration, robust experimental validation and valuable dataset pipeline. We address your open points in the following.
Policy formulation: We agree with the reviewer that the differences between our policy formulation that omits actions (Equation 2) and the formulation in Equation 1 can be highlighted more clearly. Many thanks for the suggestion of using color coding. We have now introduced color coding and believe it enhances readability.
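For readers following this thread without the paper at hand, a hedged reconstruction of the two formulations, assuming the standard Decision Transformer notation with returns-to-go (the paper's exact notation and color coding may differ):

```latex
% Eq. 1 (standard DT): the policy conditions on past actions.
% Eq. 2 (this paper):  action tokens are omitted from the context.
\begin{align}
&\pi\bigl(a_t \mid \hat{R}_1, s_1, a_1, \ldots, \hat{R}_{t-1}, s_{t-1}, a_{t-1}, \hat{R}_t, s_t\bigr) && \text{(Eq. 1)}\\
&\pi\bigl(a_t \mid \hat{R}_1, s_1, \ldots, \hat{R}_t, s_t\bigr) && \text{(Eq. 2)}
\end{align}
```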
Data ratios:
- A key motivation behind our dataset compilation was the scarcity of suitable existing datasets that span a large number of simulated tasks. To address this, our primary target was to assemble a collection of datasets that span as many tasks as possible to enable a robust comparison of sequence model architectures. To facilitate usability for future works, we consider standard benchmarks (e.g., Meta-World, Atari, Procgen) that are widely adopted by the community. Therefore, we hope that our data pipeline can serve as a solid basis for future research on multi-task agents.
- Note that during pre-training, every domain is represented with approximately equal proportion in every update step (see Section 4.2 and Appendix C). Because the dataset sizes vary across domains (due to different numbers of tasks), this results in different numbers of total repetitions per dataset (Table 1). We opt for equal proportions, because we aim to study how the different backbones perform across domains, rather than optimizing performance on specific domains.
- We agree with the reviewer that understanding the impact of the data ratios on multitask capabilities would be insightful. Varying the data ratios would, for example, allow studying potential interference between the 432 tasks. While a thorough investigation of data ratios is out of scope for this work because of the excessive cost of pre-training with varying ratios, we believe that studying data ratios in large action models represents an interesting direction for future work. We hope that our data pipeline and datasets may facilitate such future studies.
We updated our manuscript to include a short discussion about the rationales behind our data ratios.
Embedding plots for Mamba: Thank you for the suggestion. We added the embedding plots for Mamba to Appendix F in our manuscript (we cannot update the PDF). Mamba exhibits a more refined embedding space than DT, but slightly less refined than xLSTM. However, while a more refined embedding may help with task identification at inference time, it is unclear how strongly it benefits final performance.
Inference speed of Mamba: We agree with the reviewer that the inference speeds for Mamba were missing from our original manuscript and are important to provide the entire picture. Initially, we focused our inference time tests on DT and xLSTM, as xLSTM tends to be slightly faster than Mamba in prior work (see Figure 9 in [1]). Therefore, we have now repeated the inference time analyses conducted in Section 4.4 with Mamba and provide the detailed results in our updated manuscript. As expected, we find that Mamba exhibits the same linear scaling properties as xLSTM. Consequently, Mamba exhibits advantages over DT in terms of speed and memory for longer sequence lengths. In our experiments, Mamba runs slightly slower than xLSTM (0.0048 sec/step for Mamba vs. 0.002 for xLSTM at B=16 in Figure 6b), but requires slightly less GPU RAM. We hypothesize that this is because the Mamba kernels are not compatible with torch.compile, which may result in a slowdown. With compatible kernels, that gap might be closed. Moreover, Mamba achieves throughputs similar to xLSTM. When using larger batch sizes with xLSTM, we found that decreasing the head dimension (more heads, same total hidden dimension) is important for enabling higher throughput, and we added an ablation on this to the Appendix. This is because a higher head dimension incurs more FLOPs. Generally, these results reflect findings in existing works [1,2].
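As a back-of-the-envelope sketch of the head-dimension point above (our assumption: an mLSTM-style matrix memory of size d_head × d_head per head, so the total state scales as d_model · d_head):

```python
# With the total hidden dimension fixed, more heads means a smaller head
# dimension, which shrinks the per-step matrix-memory state and its FLOPs.
d_model = 1024  # hypothetical total hidden dimension

def state_cells(n_heads: int) -> int:
    d_head = d_model // n_heads
    return n_heads * d_head * d_head  # = d_model * d_head

print(state_cells(n_heads=4))   # d_head=256 -> 262,144 cells
print(state_cells(n_heads=16))  # d_head=64  ->  65,536 cells (4x smaller)
```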
Our manuscript has improved considerably by addressing and incorporating your comments: thank you. In particular, we believe that the additional results for Mamba strengthen the paper.
[1] “xLSTM: Extended Long Short-Term Memory”, NeurIPS, 2024
[2] “xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference”, ArXiv, 2024
This paper presents an expansive comparison of the xLSTM backbone against Transformers for real-time robotics applications, showing that xLSTMs lead to a significant efficiency boost. However, the performance seems similar to Mamba, so it is not clear how significant the contributions of this paper are.
There are no theoretical contributions, so the work stands only on the empirical results. While there is emphasis on real-time robotics, there are no robotics experiments. The authors point out that this is still useful for general multi-task agents, but then their own emphasis could have been in that direction instead.
I recommend the authors modify the messaging and add some results for the multi-task setting for the camera-ready to make the paper stronger.