Directly Forecasting Belief for Reinforcement Learning with Delays
We present the directly forecasting belief method, which effectively reduces compounding errors and improves performance.
Abstract
Reviews and Discussion
This paper addresses reinforcement learning with delayed observations by proposing a Directly Forecasting Belief Transformer (DFBT). DFBT treats state estimation as a sequence modeling problem—predicting the current (and intermediate) states directly from past delayed observations instead of forecasting them iteratively. The authors combine DFBT with Soft Actor-Critic and introduce multi-step bootstrapping from the predicted states to improve learning efficiency. Empirical results on MuJoCo tasks with both fixed and random delays show significantly higher performance than prior augmentation-based or recursively predicted belief methods.
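For readers less familiar with the distinction, a minimal sketch of recursive versus direct (one-shot) forecasting is shown below; the callables `dynamics_model` and `belief_model` and their signatures are illustrative assumptions, not the paper's actual interfaces.

```python
def recursive_forecast(dynamics_model, s_delayed, actions):
    """Roll a one-step model forward through the delay window.
    Each prediction is fed back as the next input, so errors compound."""
    s, preds = s_delayed, []
    for a in actions:               # actions executed since the delayed observation
        s = dynamics_model(s, a)    # prediction reused as input at the next step
        preds.append(s)
    return preds                    # estimates of s_{t-d+1}, ..., s_t

def direct_forecast(belief_model, s_delayed, actions):
    """DFBT-style one-shot prediction: a sequence model conditions on the delayed
    observation and the whole action history and emits all intermediate states
    at once, so no prediction is ever fed back as an input."""
    return belief_model(s_delayed, actions)   # all of s_{t-d+1}, ..., s_t in one pass
```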
Update after rebuttal
Both before and after the rebuttal phase, I believe the work has a certain degree of novelty, and I therefore maintain my high assessment.
Questions for Authors
- How well does DFBT generalize if the policy visits out-of-distribution states not covered in the offline dataset?
- Could fine-tuning the belief model online further boost performance or stability?
- Have you considered outputting distributional beliefs for stochastic environments?
Claims and Evidence
The key claim is that direct (rather than recursive) forecasting mitigates compounding errors as delay increases, leading to superior policy returns. Their theoretical bounds show that the error does not scale exponentially with delays, and empirical comparisons on MuJoCo (e.g., HalfCheetah, Hopper, Walker2d) confirm much better performance at long delays. This is well-supported by both error metrics (on offline data) and final returns in the RL tasks.
One potential weakness is that DFBT requires an offline dataset to pre-train the belief model. In scenarios where such data is not available, one would have to gather it (possibly by random exploration) or train the model online (which the paper did not explore). Augmentation-based methods, by contrast, learn everything online (though at a heavy sample cost). The authors did not explicitly discuss this trade-off.
Methods and Evaluation Criteria
The authors pre-train the Transformer-based belief model using offline trajectories (D4RL) and then use it in online learning with a standard SAC agent. They measure belief prediction accuracy (L1/MSE error) and normalized returns on MuJoCo. Baselines include recent augmentation-based (BPQL, ADRL) and belief-based (D-SAC, D-Dreamer) methods, making the comparisons fair and thorough.
Theoretical Claims
They provide bounds showing that recursive belief estimation accumulates errors exponentially in the worst case, whereas direct prediction has a linear bound in terms of overall model error.
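As a rough schematic of this contrast (our own notation, with a Lipschitz-style assumption on the learned dynamics; the paper's precise constants and conditions differ), let $\epsilon$ denote the one-step model error, $L$ the Lipschitz constant, and $\Delta$ the delay:

```latex
% Schematic only: not the paper's exact statement.
\begin{align*}
% Recursive rollouts re-amplify the one-step error at every step:
\|\hat{s}^{\mathrm{rec}}_{t} - s_{t}\| &\;\lesssim\; \epsilon \sum_{k=0}^{\Delta-1} L^{k} \;=\; O\!\left(L^{\Delta}\epsilon\right) \quad (L > 1), \\
% whereas direct forecasting pays each horizon's own error only once:
\|\hat{s}^{\mathrm{dir}}_{t} - s_{t}\| &\;\lesssim\; \epsilon_{\Delta}, \qquad \sum_{k=1}^{\Delta}\epsilon_{k} \;=\; O(\Delta\,\epsilon).
\end{align*}
```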
Experimental Design and Analysis
Experiments are conducted methodically on MuJoCo tasks with delays ranging from 8 to 128, as well as with random delays.
Additional experiments that might make the paper stronger: (1) analyzing the algorithm's behavior on out-of-distribution (OOD) states, (2) evaluation in stochastic environments, and (3) results for fine-tuning the belief model online.
Supplementary Material
The supplementary includes proofs, detailed hyperparameters, additional results (for different delay settings), and ablations.
Relation to Existing Literature
This work builds on both augmentation-based and belief-based delay-handling methods. It connects to model-based RL (e.g., Dreamer) by learning a forward model but differs in predicting states in one shot with a Transformer.
Missing Important References
Works like TransDreamer (Chen et al., 2022) also explore Transformers in partially observable RL, but from a fully model-based viewpoint. Fu et al. also discuss compounding-error bounds under a Lipschitz assumption.
- Reinforcement Learning with Transformer World Models. Chen et al.
- Performance Bounds for Model and Policy Transfer in Hidden-parameter MDPs. Fu et al.
Other Strengths and Weaknesses
No
Other Comments or Suggestions
Minor issues:
- “delays are fundamentally affect the system’s safety”
- “cypher-physical systems”
We sincerely appreciate Reviewer 4D3p's thoughtful comments. Our detailed responses to your questions and concerns are as follows:
Q1: One potential weakness is that DFBT requires an offline dataset to pre-train the belief model. In scenarios where such data is not available, one would have to gather it (possibly by random exploration) or train the model online (which the paper did not explore). Augmentation-based methods, by contrast, learn everything online (though at a heavy sample cost). The authors did not explicitly discuss this trade-off.
Thanks for your insightful comment on the trade-off between augmentation-based and belief-based approaches. We will add a related discussion to the revised paper based on your suggestion. As mentioned in the paper (Lines 406-411), learning DFBT online from scratch suffers from instability and inefficiency issues. Therefore, we separate belief learning from the RL process to stabilize online learning. Separating belief learning from online RL and freezing the belief representation during RL also allows us to investigate the belief component in isolation, eliminating potential influences from the RL side. To address the reviewer's concern, we report experimental results for learning the belief representation online on MuJoCo tasks with deterministic 32 delays in Table R4. The results show that learning the belief representation directly within the RL process suffers from instability and inefficiency, leading to poor performance.
Q2: How well does DFBT generalize if the policy visits out-of-distribution states not covered in the offline dataset? Could fine-tuning the belief model online further boost performance or stability?
Thanks for the thoughtful suggestions. In the online RL process, the agent indeed visits out-of-distribution (OOD) states, which leads to a relatively limited performance gain. As the reviewer mentioned, this issue can be addressed by fine-tuning the belief representation during the online RL process. To address this concern, we conducted an additional experiment: fine-tuning DFBT online on MuJoCo tasks with deterministic 32 delays. The results show that fine-tuning the belief representation yields further performance improvements at the cost of substantially longer training time (from roughly 6 to 12 hours). Note that there are other methods to improve DFBT's online performance; however, they are orthogonal to this work's contribution. Therefore, in this paper, we separate belief learning from the online learning process, which allows us to investigate the belief component in isolation, eliminating potential influences from the RL side. We will add related discussions in the revised version.
Table R4. Performance comparison of fixed and fine-tuned DFBT on MuJoCo tasks with deterministic 32 delays.
| Task | Online | Offline | Offline + Fine-tuning |
|---|---|---|---|
| HalfCheetah-v2 | | | |
| Hopper-v2 | | | |
| Walker2d-v2 | | | |
Q3: Have you considered outputting distributional beliefs for stochastic environments?
Thanks for the helpful suggestions. We conducted additional experiments on stochastic MuJoCo tasks, where agent-unaware noise is injected with probability 0.001 (settings similar to related works [1, 2]), combined with deterministic 128 delays. As shown in Table R5, the results demonstrate that our DFBT-SAC with a distributional belief achieves superior performance on these stochastic MuJoCo tasks.
Table R5. Performance on stochastic MuJoCo with deterministic 128 delays.
| Task | A-SAC | BPQL | ADRL | DATS | D-Dreamer | D-SAC | DFBT-SAC |
|---|---|---|---|---|---|---|---|
| HalfCheetah-v2 | | | | | | | |
| Hopper-v2 | | | | | | | |
| Walker2d-v2 | | | | | | | |
Q4: Works like TransDreamer (Chen et al., 2022) also explore Transformers in partially observable RL but from a full model-based viewpoint. Fu et al. also discusses compounding error bound with the Lipschitz assumption.
Thanks for the suggestions on missing related works. We will discuss these related works and references in the revised paper.
Q5: Minor issues: "delays are fundamentally affect the system’s safety" and "cypher-physical systems"
Thanks for your helpful comments. We will fix these typos in the revised version.
References
[1] Kim, Jangwon, et al. "Belief projection-based reinforcement learning for environments with delayed feedback." Advances in Neural Information Processing Systems.
[2] Wu, Qingyuan, et al. "Boosting Reinforcement Learning with Strongly Delayed Feedback Through Auxiliary Short Delays." International Conference on Machine Learning.
This paper introduces a method for directly predicting the current belief state in reinforcement learning with delays using a Transformer-based model. The main idea is to use Transformers for state forecasting to help mitigate the effects of observation delays in RL environments. The approach is simple, modular, and easy to implement.
Questions for Authors
N/A
Claims and Evidence
- Performance claims are well-supported by comprehensive experiments across delay settings.
- Ablation studies on multi-step bootstrapping provide convincing evidence of this technique's importance.
- The innovation claim is somewhat overstated: the direct-prediction approach itself is not new for handling delays in RL.
Methods and Evaluation Criteria
- The proposed method is based on a Transformer architecture for belief forecasting, which is intuitive and straightforward.
- The experimental setup considers high delays (8 to 128 steps) and random delays (U(1,n) distribution), making the evaluation more aligned with real-world scenarios. However:
- The separation of the prediction module from the reinforcement learning process limits the ability of the prediction network to adapt to task-specific information. This design choice could lead to inefficiencies in tasks where the task-relevant information differs from what raw state-prediction accuracy captures.
- Predicting directly in the original state space, rather than learning a task-optimized lower-dimensional representation, presents challenges in high-dimensional environments. This approach risks the "curse of dimensionality" when states contain redundant or task-irrelevant information. In complex environments, only a subset of state variables may be critical for decision-making, making direct prediction of the entire state inefficient. A comparison with approaches that learn low-dimensional, task-specific representations could provide additional insight.
Theoretical Claims
No special points to be mentioned.
Experimental Design and Analysis
- The methodology is appropriate for the research problem, using standard delay experiments. It covers motion control; it would be great if the evaluation could be extended to other areas.
- Good testing across environments with different delay characteristics such as randomness and time scale.
- Well-designed ablation studies, such as bootstrapping steps (N=1,2,4,8).
To be improved: The direct prediction approach likely offers computational advantages over complex recursive forecasting methods. By avoiding intermediate steps, it reduces error accumulation and computational overhead. An explicit analysis of computational efficiency (training time, inference speed, memory usage) would strengthen the paper's practical value proposition.
Supplementary Material
I reviewed the supplementary material briefly.
Relation to Existing Literature
While the paper's conceptual innovation may be limited, its main contribution lies in providing an efficient, straightforward, and universally applicable solution to the delayed reinforcement learning problem using modern frameworks. Although the core ideas have appeared in previous methods, the paper's implementation using contemporary Transformer architecture brings practical improvements. The significance of this work is not in proposing entirely new concepts, but in effectively adapting modern deep learning techniques to create an "out-of-the-box" solution that outperforms existing approaches.
Missing Important References
N/A.
Other Strengths and Weaknesses
See above
Other Comments or Suggestions
See above
We sincerely appreciate Reviewer dQtK's thoughtful comments. Below, we give responses to your questions and concerns.
Q1: Contribution and Novelty Clarification
First, we would like to express our gratitude for the reviewer's comments. We recognize that our current statement may have led to some misunderstanding, and we will refine the claim and clarify the statement in the revised version. Here, we want to clarify our contribution and novelty. This paper aims to address the compounding-errors issue in belief-based methods via directly forecasting belief. While we acknowledge that direct forecasting has been used in delayed RL, and we have cited the related work and considered it as a baseline [1], we note that existing work with online belief learning suffers from instability and inefficacy issues. To overcome these issues, we separate belief training from the RL process to stabilize online learning (Lines 406-411). Specifically, as shown in Fig. 1, DFBT is trained on offline datasets, then frozen and deployed in the environment with delays, enabling efficient RL training. To this end,
- We propose DFBT, which incorporates reward signals in tokens for capturing sufficient dynamic information. Empirical results (Fig. 2) demonstrate that DFBT achieves superior prediction accuracy, effectively addressing the compounding errors issue.
- By leveraging the accurate predictions from DFBT, we integrate the multi-step bootstrapping technique on the forecasted states to improve learning efficiency (a minimal sketch is given after this list).
- We theoretically demonstrate that directly forecasting belief significantly mitigates compounding errors, providing a stronger performance guarantee.
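To illustrate the second point above, here is a minimal sketch of an N-step bootstrapped target computed on forecasted states; the names `critic_target`, `policy`, and `rewards` are our own placeholders, not the authors' implementation.

```python
def n_step_target(rewards, s_forecast_n, critic_target, policy, gamma=0.99):
    """N-step bootstrapped value target using a state forecast by the belief model.

    rewards:       r_t, ..., r_{t+N-1} along the trajectory
    s_forecast_n:  the belief model's prediction of s_{t+N}, used for bootstrapping
    """
    n = len(rewards)
    target = sum((gamma ** k) * r for k, r in enumerate(rewards))    # discounted reward sum
    a_n = policy(s_forecast_n)                                       # action at the forecasted state
    return target + (gamma ** n) * critic_target(s_forecast_n, a_n)  # bootstrap from the forecast
```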
Q2: The separately trained belief is limited in adapting to task-specific information for RL.
We acknowledge that some task-specific information may be missing if the belief is frozen during the RL process, leading to limited performance improvement. This issue can be mitigated by fine-tuning the belief within the RL process. The results in Table R2 demonstrate that fine-tuning helps DFBT capture task-specific information, yielding better performance. Note that there are many potential methods for capturing task-specific information beyond fine-tuning DFBT; however, they are orthogonal to this work's contribution. We will add related discussions for broader interest in the revised version. As mentioned in the limitations (Lines 406-411), learning the belief from scratch in the online RL process suffers from instability issues. Therefore, in this paper, we separate belief learning from the online RL process, which allows us to investigate the belief component in isolation, eliminating potential influences from the RL side.
Table R2. Performance comparison of DFBT-SAC with different training methods on MuJoCo tasks with 32 delays.
| Task | Online | Offline | Offline + Fine-tuning |
|---|---|---|---|
| HalfCheetah-v2 | | | |
| Hopper-v2 | | | |
| Walker2d-v2 | | | |
Q3: Inefficiencies in high-dimensional environments.
In this work, we mainly consider MuJoCo tasks, which have relatively low-dimensional state spaces, such as HalfCheetah (17), Hopper (11), and Walker2d (17). As discussed in the paper (Lines 47-51, 99-102), our approach belongs to the belief-based family, which avoids the "curse of dimensionality" issue that augmentation-based approaches face under long delays. For high-dimensional state spaces (e.g., image-based RL tasks), it is essential to learn a low-dimensional, compact latent state space for efficient state prediction. However, this falls outside the scope of the current paper. We will include a related discussion in the revised version and plan to explore high-dimensional delayed RL tasks in future work.
Q4: Experiments on computational efficiency.
Thanks for your helpful suggestion. Based on your advice, we conducted additional computational-efficiency experiments. As shown in Table R3, the results demonstrate that directly forecasting belief maintains a consistent and stable inference speed (around 4 ms) across different delays. In contrast, recursively forecasting belief slows down considerably as delays increase. In HalfCheetah-v2 with 128 delays, the training times of DATS and D-Dreamer are around 10 hours and 15 hours, respectively, while those of D-SAC and DFBT-SAC are both around 6 hours.
Table R3. Inference speed (ms) comparison in HalfCheetah-v2.
| Delays | DATS | D-Dreamer | D-SAC | DFBT-SAC |
|---|---|---|---|---|
| 8 | | | | |
| 32 | | | | |
| 128 | | | | |
References
[1] Liotet, Pierre, et al. "Learning a belief representation for delayed reinforcement learning".
The authors focus on reinforcement learning with delayed observations. To mitigate this issue, most prior work learns a dynamics model which, given a known delay time $\Delta t$, is rolled out from the last observed state $s_t$ to $s_{t+\Delta t}$. The policy then makes decisions based on $s_{t+\Delta t}$.
While prior work tends to use recurrent models, the authors suggest using a transformer to generate all states between $s_t$ and $s_{t+\Delta t}$ in one shot, sidestepping error propagation that might occur from calling a recurrent model sequentially.
The authors train their transformer to predict future states using a negative log-likelihood loss. They pretrain the transformer on an offline dataset, then freeze the parameters and utilize the transformer-produced states to train the policy and value function as one would in normal RL.
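A minimal PyTorch-style sketch of this pretrain-then-freeze recipe, under our own simplifying assumptions (an MLP stand-in for the Transformer, a Gaussian output head, synthetic data); none of the module or variable names come from the paper.

```python
import torch
import torch.nn as nn

class BeliefHead(nn.Module):
    """Toy stand-in for the paper's Transformer: any sequence model that maps a
    delayed state plus the intervening actions to per-step Gaussian parameters
    over the missing states would fit the same NLL pretraining loop."""
    def __init__(self, state_dim, action_dim, horizon):
        super().__init__()
        self.state_dim, self.horizon = state_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim * horizon, 256), nn.ReLU(),
            nn.Linear(256, 2 * state_dim * horizon))

    def forward(self, s_delayed, actions):
        x = torch.cat([s_delayed, actions.flatten(1)], dim=-1)
        out = self.net(x).view(-1, self.horizon, 2 * self.state_dim)
        mean, log_std = out.chunk(2, dim=-1)
        return mean, log_std.clamp(-5.0, 2.0).exp()

state_dim, action_dim, horizon, batch = 17, 6, 8, 64      # illustrative sizes only
belief = BeliefHead(state_dim, action_dim, horizon)
optim = torch.optim.Adam(belief.parameters(), lr=3e-4)

# Synthetic batch standing in for offline (D4RL-style) trajectory slices.
s_delayed = torch.randn(batch, state_dim)
actions = torch.randn(batch, horizon, action_dim)
targets = torch.randn(batch, horizon, state_dim)

mean, std = belief(s_delayed, actions)                     # one-shot forecast of all missing states
loss = -torch.distributions.Normal(mean, std).log_prob(targets).mean()
optim.zero_grad(); loss.backward(); optim.step()

for p in belief.parameters():                              # frozen before online SAC training
    p.requires_grad_(False)
```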
The authors provide some error bounds for belief forecasting, then run experiments on MuJoCo tasks. First, they plot the belief-forecasting error, and then they compare the corresponding trained policy performance.
Questions for Authors
It is unclear to me why the transformer architecture works so much better for stochastic delays. For random delays $\mathcal{U}(1, 128)$, does this just mean learning a deterministic delay of 128? If you cannot know the delay ahead of time, it seems like this is the best you can do. Can you explain this further?
Claims and Evidence
The authors claim to:
- Propose a transformer architecture to tackle the forecasting problem
- Integrate the architecture in SAC
- Demonstrate that their method reduces compounding forecasting errors
- Demonstrate that their method results in better policies
I think they provide sufficient evidence to back their claims.
Methods and Evaluation Criteria
The authors select the standard MuJoCo benchmark; however, they only focus on three tasks. I think it could be helpful to evaluate on the entire MuJoCo suite, even if the other results are only reported in the appendix.
Theoretical Claims
The authors make theoretical claims but I did not check them closely.
Experimental Design and Analysis
The experimental design consists of both deterministic and random delays, and performs ablations.
I commend the authors for also demonstrating the DFBT-SAC(1) results. Initially, I was concerned that their improved returns could result primarily from the use of an n-step return. This ablation assuages my concerns.
Supplementary Material
I looked at the appendix but not in detail.
Relation to Existing Literature
I am not familiar with the field of delayed RL.
Missing Important References
I am not familiar with the field of delayed RL.
Other Strengths and Weaknesses
The authors' method is well-founded and their experimental setup is well-done. The idea is fairly straightforward and provides strong results.
I think the writing could be a bit clearer, and I suggest the authors go through the paper and try to minimize tense changes and stick to either passive or active voice.
My biggest concern is not necessarily with the authors' work, but rather with the field of "delayed RL". I suspect that most realistic robotics tasks already integrate an RNN/Transformer to handle partial observability. Such a setup would be able to handle delayed observation-action interactions implicitly, removing the need to consider delayed RL as a separate problem.
Other Comments or Suggestions
As above, try to be more consistent with grammatical tense and voice to make the paper slightly nicer to read. More MuJoCo tasks could also strengthen the evidence for the authors' claims.
We sincerely appreciate Reviewer KEME's thoughtful comments. Below, we give responses to your questions and concerns.
Q1: Evaluation of other MuJoCo tasks.
In this work, we consider the MuJoCo benchmark. To ensure transparency and reproducibility in belief training, we utilize the open offline RL datasets (D4RL), which cover only HalfCheetah-v2, Hopper-v2, and Walker2d-v2. We acknowledge that other tasks are also valuable. Therefore, we conducted experiments on Pusher-v2, Reacher-v2, and Swimmer-v2 with deterministic 32 delays. The offline datasets (500k samples) are collected from a SAC policy, with other settings unchanged. The results in Table R1 show that DFBT-SAC achieves superior performance on these tasks.
Table R1. Performance on additional tasks with deterministic 32 delays.
| Task | A-SAC | BPQL | ADRL | DATS | D-Dreamer | D-SAC | DFBT-SAC |
|---|---|---|---|---|---|---|---|
| Pusher-v2 | | | | | | | |
| Reacher-v2 | | | | | | | |
| Swimmer-v2 | | | | | | | |
Q2: Writing and grammar issues.
Thanks for the reviewer's helpful suggestions. We will revise the paper to improve clarity and ensure consistency in grammatical tense and voice.
Q3: Necessity of treating "delayed RL" as a separate research problem.
As the reviewer mentioned, some robotics tasks incorporate RNN or Transformer to handle partial observability. We acknowledge that these models implicitly address delays by capturing sequential dependencies and retaining memory over time. However, we emphasize that explicitly handling delayed observation-action interactions through delayed RL remains essential both technique-wise and application-wise. The key reasons are as follows:
(1) Unique problem structure enables specialized algorithms. Delayed MDPs pose unique challenges due to entirely missing observations rather than partially missing ones, though they can be viewed as a specialized form of POMDP [1]. Their explicit delay-induced structure enables specialized, efficient algorithms that outperform general-purpose POMDP methods. For example, at timestep $t$ with delay $\Delta t$, the agent can only access the historical state $s_{t-\Delta t}$. In this context, the agent has to explicitly handle delays using state augmentation [2] or belief representation [3] techniques to retrieve the Markovian property and enable efficient RL. For instance, in the state-augmentation technique, decision-making is based on the augmented state $x_t = (s_{t-\Delta t}, a_{t-\Delta t}, \dots, a_{t-1})$ (a toy sketch is given after point (2) below). This paper aims to address the compounding-errors issue within the belief representation method, thereby enhancing learning efficiency and performance. These distinctive properties and challenges in delayed RL necessitate treating it as a distinct research problem, separate from the conventional partially observable RL problem.
(2) Strong application-driven motivations. Delayed RL aims to address the delayed feedback problem, which is practical and common in real-world control applications (e.g., transportation systems [4] and financial systems [5]). In robotics, several studies have demonstrated that delayed RL could improve the system's safety, agility, efficiency, and robustness [6, 7]. The practical demands of real-world scenarios also underscore the necessity to investigate delayed RL as a distinct research problem.
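To make point (1) concrete, here is a toy sketch of the augmented-state construction (our own illustrative code, not the paper's); note how the augmented state's dimension grows linearly with the delay, which is the dimensionality issue that belief-based methods avoid.

```python
import numpy as np

def augmented_state(s_delayed, actions_since):
    """Augmentation baseline for delay d: the decision variable is the last
    observed state concatenated with the d actions taken since it, i.e.
    x_t = (s_{t-d}, a_{t-d}, ..., a_{t-1})."""
    return np.concatenate([s_delayed] + list(actions_since))

state_dim, action_dim = 17, 6                    # HalfCheetah-like sizes, for illustration
for d in (8, 32, 128):
    x = augmented_state(np.zeros(state_dim), [np.zeros(action_dim)] * d)
    print(d, x.shape)                            # dimension grows as state_dim + d * action_dim
```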
Q4: Explanation of $\mathcal{U}(1, 128)$ delays and the transformer for stochastic delays.
For $\mathcal{U}(1, 128)$ delays, it does not mean learning a deterministic delay of 128. Under this setting, at each timestep $t$, the probability of observing state $s_{t-1}$ equals that of every state back to $s_{t-128}$, i.e., any of them may be the freshest available observation. Therefore, the agent must be able to handle varying delays ranging from 1 to 128. As shown in Fig. 2, the transformer can effectively address the compounding-errors issue, maintaining superior and consistent prediction accuracy across varying delays. These accurate predictions further improve the learning efficiency and final performance of RL.
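As a toy illustration of our reading of this setting (a simplification, not the paper's exact observation protocol), the delay at each step is drawn uniformly, so the forecaster must cope with every horizon from 1 to 128 rather than a single fixed one:

```python
import random

rng = random.Random(0)

def observed_state_index(t, n=128):
    """Under random delays U(1, n), the freshest state available at timestep t
    is s_{t - d_t} with d_t drawn uniformly from {1, ..., n}."""
    d_t = rng.randint(1, n)
    return t - d_t

# A few sampled delays: the belief model must handle delays anywhere in {1, ..., 128}.
print([t - observed_state_index(t) for t in range(1000, 1005)])
```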
References
[1] Karamzade, Armin, et al. "Reinforcement learning from delayed observations via world models".
[2] Bouteiller, Yann, et al. "Reinforcement learning with random delays".
[3] Walsh, Thomas J., et al. "Learning and planning in environments with delayed feedback".
[4] Cao, Zhiguang, et al. "Using reinforcement learning to minimize the probability of delay occurrence in transportation".
[5] Deng, Yue, et al. "Deep direct reinforcement learning for financial signal representation and trading".
[6] Mahmood, A. Rupam, et al. "Setting up a reinforcement learning task with a real-world robot".
[7] Hwangbo, Jemin, et al. "Control of a quadrotor with reinforcement learning".
The manuscript introduces DFBT which directly forecasts states from observations without estimating intermediate states incrementally. Reviewers generally expressed positive views of the paper. The rebuttal was very thorough and provided important clarifications. I recommend acceptance.