
ICLR 2025

Time to Truncate Trajectory: Stochastic Retrace for Multi-step Off-policy Reinforcement Learning

Submitted: 2024-09-27 · Updated: 2024-10-02

Abstract

While off-policy reinforcement learning methods based on one-step temporal difference learning have shown promise for solving complex decision-making problems, multi-step lookahead from behavior policies remains challenging due to the discrepancy between the behavior policy and the target policy. Several recent works have addressed this challenge either by introducing coefficients that correct for the discrepancy, as in Retrace, or by evolving the behavior policy in a manner similar to conservative policy iteration, as in Peng's $Q(\lambda)$. However, neither approach works universally well, because of the policy evaluation error caused by values from the later part of long trajectories. In this work, we propose a stochastic truncation method that replaces the correction coefficients of Retrace with a sequence of Bernoulli random variables, removing the later part of the trajectory, which degrades off-policy evaluation by adding unnecessary noise. Unlike prior methods for reducing the off-policy discrepancy, our stochastic truncation enjoys the strengths of both conservative and non-conservative multi-step RL methods. We demonstrate that our algorithm, time-to-truncate-trajectory (T4), outperforms various model-free RL methods.
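The abstract's core idea is that the per-step Retrace correction coefficients become Bernoulli random variables, so that sampling a zero stochastically truncates the rest of the trajectory. The sketch below shows what such a target computation could look like; it is an illustration only, and the truncation probability `lam * min(1, pi/mu)` (the usual Retrace trace coefficient reused as a success probability), along with all function and variable names, are assumptions rather than the paper's actual specification.

```python
import numpy as np

def stochastic_retrace_target(q, rewards, pi_probs, mu_probs, next_v,
                              gamma=0.99, lam=1.0, rng=None):
    """Hypothetical Retrace-style multi-step target with Bernoulli truncation.

    q        : Q(x_s, a_s) along the behavior trajectory, shape (T,)
    rewards  : r_s, shape (T,)
    pi_probs : pi(a_s | x_s) under the target policy, shape (T,)
    mu_probs : mu(a_s | x_s) under the behavior policy, shape (T,)
    next_v   : E_pi[Q(x_{s+1}, .)] bootstrap values, shape (T,)
    Returns the corrected target for Q(x_0, a_0).
    """
    rng = rng or np.random.default_rng()
    T = len(rewards)
    # One-step TD errors delta_s = r_s + gamma * E_pi Q(x_{s+1}, .) - Q(x_s, a_s),
    # as in standard Retrace.
    deltas = rewards + gamma * next_v - q

    target = q[0]
    coeff = 1.0  # accumulated product of gamma * c_i along the trajectory
    for s in range(T):
        target += coeff * deltas[s]
        if s + 1 < T:
            # Assumed rule: c_{s+1} ~ Bernoulli(lam * min(1, pi/mu)).
            # The paper's exact sampling distribution may differ.
            p = lam * min(1.0, pi_probs[s + 1] / mu_probs[s + 1])
            c = float(rng.random() < p)
            if c == 0.0:
                break  # trajectory stochastically truncated at step s + 1
            coeff *= gamma * c
    return target
```

Because a sampled zero ends the loop immediately, the later, noisier terms of a long behavior trajectory simply never enter the target, which is the effect the abstract attributes to stochastic truncation.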
Keywords
multi-step off-policy reinforcement learning

Reviews and Discussion

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.