PaperHub
Overall rating: 6.5/10 · Poster · 4 reviewers (min 5, max 8, std 1.1)
Individual ratings: 5, 8, 6, 7
Confidence: 2.5 · Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

Periodic agent-state based Q-learning for POMDPs

Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

Periodicity helps in agent-state based RL in POMDPs because agent-states are not Markov.

Abstract

Keywords
POMDPs · RL · Q-learning · non-stationary policies · non-Markovian environments

Reviews and Discussion

Review (Rating: 5)

The paper proposes a new RL algorithm for POMDPs. The standard approach of converting the POMDP into a belief MDP is not possible in the RL setting, where there is no knowledge of the system model. The alternative is to use an agent state, which is a function of the observation history. Standard RL algorithms can be adapted to the agent state to learn policies for POMDPs. However, these standard algorithms lead to stationary policies, which might not be optimal, as the agent state may not satisfy the Markov property. As a consequence, the authors propose a special form of non-stationary policies that may be better suited for non-Markovian agent states.

Strengths

  1. The paper has an interesting problem setting and describes well the limitations of stationary policies applied to non-Markovian agent-states.

  2. The authors provide an extensive discussion of the related work.

Weaknesses

  1. The proposed algorithm leads to a very limited class of policies. Periodic policies might only help because mixing several policies can lead to a better strategy than having only one deterministic policy. Apart from that, is there an inductive bias to repeat the policies?

  2. Another weakness is that the paper mainly discusses deterministic policies. It would be more interesting to enlarge the set of policies to stochastic policies. It could be that a single stochastic policy has the same effect as mixing the policies, and therefore the periodic policy would have no advantage.

  3. The numerical experiments consider only small examples.

Questions

Have you considered other non-stationary approaches?

Limitations

The authors addressed the limitations adequately.

Author Response

We thank the reviewer for the feedback.

1. Considering other non-stationary approaches

As we mention in the related work section, there are some other approaches to non-stationarity in the literature, e.g., continual learning and hierarchical learning (including options). However, for the most part, the theoretical analysis is limited and, when available, is largely restricted to the MDP setting. Given the insights obtained from our current analysis, it may be possible to analyze some of these more general algorithms for POMDPs using the tools developed in our paper.

2. The numerical experiments consider only small examples.

The purpose of our numerical experiments is two-fold. First, to show that the convergence results do agree with what is predicted by the theory. Second, to show that PASQL can outperform ASQL. To make the first point, we need to restrict ourselves to simple models, as we have presented in the paper.

3. Deterministic vs stochastic policies.

Please see our response in the global rebuttal.

Comment

I thank the authors for their answers. However, my concern that periodic policies might only help because several mixed policies might lead to a better strategy than having only one deterministic policy still remains. I keep my original score, but I would not object if other reviewers championed the paper.

Comment

Thank you for your reply. We are not completely sure what you mean by "several policies mixed". Note that we have already shown via the example on line 959 (page 22) that periodic policies can do better than stochastic policies. So, we believe that you mean something other than stochastic policies.

Another interpretation of "mixed policies" is according to the terminology used in game theory. In this setting, the agent has a set of $m$ deterministic policies $(\pi_1, \dots, \pi_m)$. Before the system starts running, the agent picks a random variable $M$ in the set $\{1, \dots, m\}$ with probability $(p_1, \dots, p_m)$ and then plays the policy $\pi_M$ forever. This is called a "mixed policy" in game theory and we will denote it by $\mu$.

Let $J(\pi_i)$ denote the performance of policy $\pi_i$. Then, by linearity of expectation,

$$J(\mu) = p_1 J(\pi_1) + \cdots + p_m J(\pi_m).$$

In particular, this means that

$$J(\mu) \le \max_{i \in \{1, \dots, m\}} J(\pi_i).$$

Thus, mixing cannot improve performance.

In the above argument, there was no restriction on the class of policies. So, if we take a policy that mixes between stationary deterministic policies, then it cannot do better than the best stationary deterministic policy. Hence, a periodic deterministic policy will perform better than mixed policies that mix between stationary deterministic (or even stationary stochastic) policies.
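To make the convex-combination argument concrete, here is a minimal numerical sketch (the policy values and the mixing distribution are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical performances J(pi_i) of three stationary deterministic policies.
J = np.array([1.0, 2.5, 4.0])

# A mixed policy samples policy i with probability p_i once and then plays it forever,
# so by linearity of expectation its value is the convex combination p . J.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(3))        # an arbitrary mixing distribution
J_mixed = p @ J

print(f"J(mixed) = {J_mixed:.3f} <= max_i J(pi_i) = {J.max():.3f}")
assert J_mixed <= J.max() + 1e-12    # mixing can never beat the best pure policy
```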

Comment

Thank you for your response. After reconsidering your discussion of deterministic vs. stochastic policies, I have decided to increase the score by one.

Comment

Thank you for the discussion and for raising your score.

Review (Rating: 8)

This work proposes a type of non-stationary policy for POMDPs that is periodic. The authors argue that typical agent states in partially observable RL do not satisfy the Markov property and illustrate why introducing non-stationarity can improve the optimal policy within the policy class (vs considering only stationary policies), even if this policy cannot achieve global optimality across all history-dependent policies. The claims are supported by several well chosen examples, rigorous theory, and numerical experiments.

I'm happy to strongly recommend acceptance.

Strengths

Originality: To my knowledge, the proposal of periodic policies for POMDPs is novel (though I am unfamiliar with theoretical papers in this area). The authors do a good job of presenting the relevant prior literature and what contributions in the work at hand are novel. Since the work generalizes stationary policies (using a period of $L=1$), I appreciated the remarks on how previous theoretical results can be obtained using the new theorems in this work.

Quality: The approach is rigorously supported by theory, and some numerical experiments. While I did not verify all the proofs, the theorems seem correct. Throughout the paper, the discussion presented by the authors was extremely insightful.

Clarity: The paper is well written and the ideas are clearly communicated. While the writing can be technically dense, this should be expected of a work of this nature. I particularly commend the authors for the example in Figure 1, since it immediately conveys a number of important features of this work in a concise manner.

Significance: While it is not surprising that performance in POMDPs can be improved by leveraging non-stationarity (in the absence of belief states), it can be challenging to implement non-stationarity due to the infinite parameter requirement the authors highlight. This work is an interesting proposal of how this can be done in principle with only finite parameters. While it may still not necessarily be practical, I think the ideas are nonetheless valuable.

Weaknesses

Originality: No weaknesses that are not already apparent.

Quality: My only major concern is with the numerical experiments. First, the agent models used are rather weak (the agent only considers the current observation when deciding an action). Perhaps some stronger choices that would be more convincing are frame stacking or an approximation of the belief state with added noise.

The following is a list of limitations. These are mostly stated clearly in the paper and I don't think they constitute reasons to reject.

  • Choosing an appropriate period $L$ is difficult. While it is shown that choosing $L = n!$ monotonically improves performance, large values of $L$ have large memory requirements.
  • For the above reason, I'm not sure this particular algorithm will see practical use.
  • The choice of behavioural policy is critical.
  • The framing of the problem is limited to deterministic Q-learning. It does not consider stochastic or policy gradient methods.

Clarity: A couple of minor issues:

  • In the introductory example, explicitly writing the values of $J^*_{BD}$, $J^*_{SD}$ under $\gamma = 0.9$ would aid in the comparison with $J^*_L$.
  • In the numerical examples, is there an explanation for why $u_1$ is a better choice than $u_2$ and $u_3$?

Significance: See prior comments on practicality.

Questions

(As mentioned above:)

  1. Could you please comment on the relevance of the insights towards larger POMDPs, like those common in deep RL?
  2. In the numerical examples, is there an explanation for why $u_1$ is a better choice than $u_2$ and $u_3$?
  3. I can see why most agent states that aren't the belief state will not satisfy the Markov property. Could you elaborate on some other cases to help clarify this (in the paper)? For example, when the agent state is uninformative (e.g. constant), or when the agent state is an RNN encoding of the history, do these satisfy the Markov property?

Limitations

Yes.

Author Response

Thank you for your comments and the positive endorsement. We address your questions and concerns below.

1. Could you please comment on the relevance of the insights towards larger POMDPs, like those common in deep RL?

The current state-of-the-art $Q$-learning based deep RL algorithms for POMDPs always converge to a stationary policy. Our main insight is that non-stationary deterministic policies perform better than stationary deterministic and stochastic policies. As far as we are aware, none of the current deep RL algorithms for POMDPs exploit this insight. In this paper, we present a simple way to exploit this insight via periodic policies. We agree with your point that this introduces additional memory requirements, but as we argue in the global response, the additional memory requirements are not too severe.

At the same time, for a practical implementation of PASQL, one would leverage the insights of deep RL algorithms, especially when it comes to the choice of behavioral policy. The convergence of both ASQL (which is an idealized version of practical Q-learning based RL algorithms for POMDPs) and PASQL depends on the behavioral policy. However, practical algorithms such as R2D2 and Dreamer use distributed actors, which help with exploration. Similar ideas can be used in a practical implementation of PASQL.

2. In the numerical examples, is there an explanation for why $u_1$ is a better choice than $u_2$ and $u_3$?

It is difficult to provide an intuitive explanation for why policy $\mu_1$ is a better choice than $\mu_2$ or $\mu_3$. The main reason is that under policy $\mu_1$, the agent executes actions with positive reward (the blue and the green arcs in Fig. 2) more often. The exact reason for this is hard to explain verbally, but we can "see" it easily via a Monte Carlo simulation.

In general, it is difficult to provide a good intuition behind what constitutes a good behavioral policy in POMDPs (this is also true for ASQL). We did perform some numerical experiments on small models where we computed the performance of all behavioral policies (by discretizing the behavioral policies, computing the converged policy $\pi_{\mu}$, and evaluating its performance via Monte Carlo). We found that in several models, there are behavioral policies that lead to a converged policy $\pi_{\mu}$ that is optimal, but they were not always intuitively obvious choices.
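For reference, a minimal sketch of this kind of Monte Carlo evaluation of a periodic policy (the environment interface `env.reset` / `env.step` and the tabular `policy` array are hypothetical conventions, not the paper's code):

```python
import numpy as np

def mc_eval_periodic(env, policy, L, gamma=0.9, n_episodes=1000, horizon=200, seed=0):
    """Monte Carlo estimate of the discounted return of a periodic agent-state policy.

    policy[l][z] is the action taken at phase l = t mod L when the agent state is z.
    env is assumed to expose reset(rng) -> z and step(a, rng) -> (z, r).
    """
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        z = env.reset(rng)
        G, discount = 0.0, 1.0
        for t in range(horizon):
            a = policy[t % L][z]      # phase-dependent (periodic) action choice
            z, r = env.step(a, rng)
            G += discount * r
            discount *= gamma
        returns.append(G)
    return float(np.mean(returns))
```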

3. I can see why most agent states that aren't the belief state will not satisfy the Markov property. Could you elaborate on some other cases to help clarify this (in the paper)? For example, when the agent state is uninformative (e.g. constant), or when the agent state is an RNN encoding of the history, do these satisfy the Markov property?

To address this, we will adapt the paragraph starting on line 42 as follows:

When the agent state is an information state, i.e., it satisfies the Markov property, $\mathbb{P}(z_{t+1} \mid z_{1:t}, a_{1:t}) = \mathbb{P}(z_{t+1} \mid z_t, a_t)$, and is sufficient for reward prediction, i.e., $\mathbb{E}[R_t \mid y_{1:t}, a_{1:t}] = \mathbb{E}[R_t \mid z_t, a_t]$, where $R_t$ is the per-step reward, $a_t$ is the action, and $y_t$ is the observation, the optimal agent-state based policy can be obtained via a dynamic program (DP) [Sub+22]. An example of such an agent state is the belief state. But, in general, the agent state is not an information state. For example, frame stacking and RNNs do not satisfy the Markov property in general. It is also possible to have agent states that satisfy the Markov property but are not sufficient for reward prediction (e.g., when the agent state is always a constant). In all such settings, the best agent-state based policy cannot be obtained via a DP. Nonetheless, there has been considerable interest in using RL to find a good agent-state based policy ...

4. My only major concern is with the numerical experiments. First, the agent models used are rather weak (the agent only considers the current observation when deciding an action). Perhaps some stronger choices that would be more convincing are frame stacking or an approximation of the belief state with added noise.

The purpose of our numerical experiments is two-fold. First, to show that the convergence results do agree with what is predicted by the theory. Second, to show that PASQL can outperform ASQL. To make the first point, we need to restrict ourselves to simple models, as we have presented in the paper. In principle, we could have used more sophisticated agents (e.g., by using frame stacking), but we did not do so in the paper because it is more difficult to visualize the theoretical results in that case.

Comment

Thank you to the authors for the detailed response to my questions. After reading all of the other reviews I continue to endorse this paper. As a sidebar, I don't necessarily think the method proposed here is practical but that's beside the point. The paper introduces insightful and interesting new ideas that fly under the radar in the existing deep RL literature and that alone is significant.

Review (Rating: 6)

This paper considers the problem of learning a policy in a partially observable environment in a model-free way. The authors address this problem by proposing to learn a non-stationary periodic policy that can outperform stationary ones. This aspect is motivated by the fact that the policy can be constructed using either the last observation or a fixed window of past observations as a state, which in general does not satisfy the Markov property.
An approach based on learning periodic policies using Q-learning is presented, and convergence guarantees for this type of policy are provided under some assumptions. Finally, a numerical experiment validates the claims and the benefits of the devised approach.

Strengths

The writing of the paper is clear. The authors consider the approach of learning periodic policies and are able to extend the convergence guarantees that hold under the stationary case also to periodic policies.

Weaknesses

I will highlight here some weaknesses of the presented approach.

The authors claim in lines 71-73 that "a non-stationary deterministic agent-state based policy performs better than stationary deterministic agent-state based policies". This claim is supported here only by the results based on Example 2, and thus lacks a more formal characterization. As a second note, if we formally consider the set of non-stationary policies as containing the set of stationary ones, the claim is always true, since the set of non-stationary policies is more expressive and thus more powerful than the second set.

Theorem 1 addresses the main challenge of the convergence analysis of PASQL, that is, the non-Markovianity of the agent state. However, the same challenge was already faced, under stationary policies (for ASQL), in [1]. Since the same result was also proved in [1], this reduces the contribution of the presented statement, which seems to extend it to the set of periodic policies.

It is not clear what in general the benefits of using periodic policies are and, even though the authors show some examples where they are effective, it is in general not easy to choose the right value of the period $L$. Moreover, as stated by the authors, an increase in $L$ does not necessarily reduce the sub-optimality of the policy. Its increase, however, leads to an increase in the complexity of finding a good estimate of the Q-function for each $l \in L$. Furthermore, no indication is given on the choice of the behavioral policies that lead to the convergence to good policies.

[1] Erfan SeyedSalehi, Nima Akbarzadeh, Amit Sinha, and Aditya Mahajan. “Approximate information state based convergence analysis of recurrent Q-learning”. In: European conference on reinforcement learning. 2023.

Questions

Point (ii) in line 80 is not clear to me. In particular, what does it mean that "the set of all periodic deterministic agent-state based policies of period $L$ is not monotonically increasing"? It is also not clear to me why periodic policies with period specified as in line 83 have monotonically increasing performance.

It is not clear why the authors state that the main challenge in this setting is the absence of the Markov property of the agent state $\{Z_t\}_{t \ge 1}$, given that Lemma 1 defines the process (including the agent state $\{Z_t\}$) as a Markov chain. Another question is: how is Lemma 1 derived? Is it an assumption?

The authors claim that previous results hold under a restrictive assumption on the learning rates but they do not state what these assumptions are. Since this aspect is claimed as an improvement over previous work, it would be useful to describe the previous assumptions.

The convergence guarantees of PASQL hold under the assumption that the Markov chain is periodic. How difficult is this assumption in practice? Would the period of the optimal policy coincide with the periodicity of the chain in this case?
Furthermore, in Section 5, the authors claim that, differently from previous results which assume the irreducibility and the aperiodicity of the Markov chain, the current work assumes that the chain is periodic, which is "a weaker assumption". I do not see why this assumption is weaker, since it appears that the irreducibility and aperiodicity assumption should also hold for each $l \in L$, as appears in Assumption 3 in the appendix. Could the authors clarify this aspect?

Limitations

N/A

Comment

In general, the set of non-stationary policies is a superset of the set of stationary policies. So, one would expect the performance of the best non-stationary policy to be greater than or equal to the performance of the best stationary policy. We are interested in whether this relationship is strict or not. In MDPs, the relationship is an equality: it is well known that stationary policies perform as well as non-stationary ones. The same is true for POMDPs when using a belief state. All RL algorithms for MDPs leverage this fact and only learn stationary policies. When these algorithms are adapted to POMDPs, one typically replaces the "state of an MDP" with an "agent state for a POMDP". Our main point is that such a naive replacement is not sufficient and more drastic structural changes in the algorithms may be needed because, in general, non-stationary policies can strictly outperform stationary policies in POMDPs. We illustrate this by providing examples that disprove the claim that "stationary policies have the same performance as non-stationary policies".

Comment

I thank the authors for the detailed answers and for clarifying my doubts. I will raise my score.

Comment

Thank you for your reply and for raising the score!

Author Response

We thank the reviewer for the feedback. We address your questions and concerns.

1. Monotonically increasing performance of periodic policies

We illustrate this via an example. Let $\Pi_L$ denote the set of all agent-state based policies with period $L$. Consider a policy $(\pi_1, \pi_2, \pi_1, \pi_2, \dots) \in \Pi_2$, where $\pi_1 \neq \pi_2$. This policy does not belong to $\Pi_3$ because the policy at time 1 ($\pi_1$) is not the same as the policy at time 4 ($\pi_2$). However, this policy does belong to $\Pi_4$ because any sequence that is periodic with period 2 is also periodic with period 4.

Since $\Pi_2 \not\subset \Pi_3$, the set of periodic policies is not monotonically increasing. However, if we take any integer sequence $\{L_n\}_{n \ge 1}$ with the property that $L_n$ divides $L_{n+1}$, then the performance of the policies with period $L_n$ is monotonically increasing in $n$. The sequence $L_n = n!$ mentioned on line 83 is just one such sequence. We will adapt the discussion around line 83 to make this clear.
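A tiny illustrative check of this nesting property (the code is ours, not from the paper):

```python
def is_periodic(seq, L):
    """True if the (finite, long enough) sequence repeats with period L."""
    return all(seq[t] == seq[t + L] for t in range(len(seq) - L))

# A period-2 policy schedule (pi_1, pi_2, pi_1, pi_2, ...), with labels 1 and 2 for the two policies.
schedule = [1, 2] * 6

print(is_periodic(schedule, 2))   # True:  the schedule is in Pi_2
print(is_periodic(schedule, 3))   # False: it is NOT in Pi_3 (2 does not divide 3)
print(is_periodic(schedule, 4))   # True:  it IS in Pi_4    (2 divides 4)
```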

2. Non-Markov agent state $Z_t$

The standard analysis of Q-learning for MDPs relies on two facts: (i) the agent-state (which is just the state in MDPs) is a controlled Markov process and (ii) the Q-learning iteration is a stochastic approximation of the Bellman update. The first fact is needed in order to define a Bellman operator. In POMDPs, if the agent state is not a controlled Markov process, we cannot define a Bellman operator and therefore need a different technique to analyze the convergence of Q-learning.

Lemma 1 is different from the fact that agent state is a controlled Markov process. The controlled Markov property states that

$$\mathbb{P}(z_{t+1} \mid z_{1:t}, a_{1:t}) = \mathbb{P}(z_{t+1} \mid z_t, a_t),$$

while Lemma 1 states that

$$\mathbb{P}(s_{t+1}, z_{t+1}, y_{t+1}, a_{t+1} \mid s_{1:t}, z_{1:t}, y_{1:t}, a_{1:t}) = \mathbb{P}(s_{t+1}, z_{t+1}, y_{t+1}, a_{t+1} \mid s_t, z_t, y_t, a_t).$$

These two are different and the latter does not imply the former, because functions of a Markov chain are not necessarily Markov.

Proof of Lemma 1: Lemma 1 is an immediate consequence of the law of total probability, which implies that

$$\begin{aligned}
\mathbb{P}(s_{t+1}, z_{t+1}, y_{t+1}, a_{t+1} \mid s_{1:t}, z_{1:t}, y_{1:t}, a_{1:t})
&= \mathbb{P}(a_{t+1} \mid s_{1:t+1}, z_{1:t+1}, y_{1:t+1}, a_{1:t})\,
   \mathbb{P}(z_{t+1} \mid s_{1:t+1}, z_{1:t}, y_{1:t+1}, a_{1:t})\,
   \mathbb{P}(s_{t+1}, y_{t+1} \mid s_{1:t}, z_{1:t}, y_{1:t}, a_{1:t}) \\
&= \mu(a_{t+1} \mid z_{t+1})\, \mathbb{1}\{ z_{t+1} = \phi(z_t, y_{t+1}, a_t) \}\,
   \mathbb{P}(s_{t+1}, y_{t+1} \mid s_t, a_t) \\
&= \mathbb{P}(s_{t+1}, z_{t+1}, y_{t+1}, a_{t+1} \mid s_t, z_t, y_t, a_t).
\end{aligned}$$

3. Improvement over previous learning rates

We compare our assumptions on learning rates with those in the literature in App E. To reiterate, the learning rates considered previously were of the form

$$\alpha_t(z, a) = \frac{\mathbb{1}\{ Z_t = z, A_t = a \}}{1 + \sum_{n=0}^{t} \mathbb{1}\{ Z_n = z, A_n = a \}},$$

whereas Assumption 1 is weaker and only requires $\sum_{t \ge 1} \alpha^{\ell}_t(z,a) = \infty$ and $\sum_{t \ge 1} \bigl(\alpha^{\ell}_t(z,a)\bigr)^2 < \infty$.
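As an illustration of the difference, a small sketch contrasting the two kinds of schedules (the visit indicators are synthetic and only for illustration):

```python
import numpy as np

T = 10_000
rng = np.random.default_rng(0)
visited = rng.random(T) < 0.3              # synthetic indicator that (z, a) is visited at time t
count = np.cumsum(visited)                 # running visit count of (z, a)

# Schedule required by previous analyses: 1 / (1 + visit count) on visits, 0 otherwise.
alpha_prev = np.where(visited, 1.0 / (1.0 + count), 0.0)

# Assumption 1 only needs sum(alpha) = inf and sum(alpha^2) < inf, so e.g. a
# polynomial schedule (visit count)^(-0.7) on visits is also admissible.
alpha_poly = np.where(visited, np.maximum(count, 1) ** -0.7, 0.0)

for name, a in [("1/(1+N)", alpha_prev), ("N^-0.7", alpha_poly)]:
    print(f"{name}: partial sum(alpha) = {a.sum():.1f}, partial sum(alpha^2) = {np.square(a).sum():.3f}")
# Both satisfy the Robbins-Monro conditions as T -> infinity, but only the first
# form was covered by the previous convergence results.
```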

4. Periodic Markov chains and optimal policies

It is relatively easy to ensure that the Markov chain $\{S_t, Y_t, Z_t, A_t\}_{t \ge 1}$ is periodic simply by choosing a behavior policy that is periodic.

The periodicity of the converged policy $\pi_{\mu}$ is the same as the periodicity of the Markov chain $\{S_t, Y_t, Z_t, A_t\}_{t \ge 1}$. However, this is not related to the periodicity of the optimal policy. In fact, as shown in the example in the introduction, in general, the optimal policy need not be periodic.

5. Weaker assumption of irreducibility and aperiodicity in periodic MC

Let $P_t$ denote the transition probability of the Markov chain $\{S_t, Y_t, Z_t, A_t\}_{t \ge 1}$. The previous results assume that $P_t$ is time-homogeneous, say equal to $P$, and that $P$ is irreducible and aperiodic. We assume that $P_t$ is time-periodic, say equal to $(P^0, \dots, P^{L-1})$, and that the cyclic products $P^0 P^1 \cdots P^{L-1}$, $P^1 \cdots P^{L-1} P^0$, etc. are aperiodic and irreducible. Note that we are not assuming that $P^0, \dots, P^{L-1}$ are individually irreducible and aperiodic. In fact, by construction, these matrices are periodic.

When $L=1$, the two assumptions are equal. For $L \neq 1$, if the previous assumption holds, then so does our assumption, but our assumption does not imply the previous one. In that sense, our assumption is mathematically weaker than the previous assumption.
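A small numerical sketch of this distinction (the matrices are toy examples chosen by us for illustration, not from the paper):

```python
import numpy as np

def is_primitive(P, tol=1e-12):
    """Irreducible + aperiodic <=> some power of P is strictly positive (primitivity)."""
    n = P.shape[0]
    Q = np.eye(n)
    for _ in range((n - 1) ** 2 + 1):      # Wielandt bound on the required power
        Q = Q @ P
        if np.all(Q > tol):
            return True
    return False

P0 = np.array([[0.0, 1.0],                 # deterministic swap: periodic on its own
               [1.0, 0.0]])
P1 = np.array([[0.9, 0.1],
               [0.1, 0.9]])

print(is_primitive(P0))        # False: P0 alone is periodic (not aperiodic)
print(is_primitive(P0 @ P1))   # True:  the cyclic product P0 P1 is irreducible and aperiodic
print(is_primitive(P1 @ P0))   # True:  and so is P1 P0
```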

6. Comparison with [Sey+23]

We strongly disagree with this characterization. The key point that we emphasize in the paper is that non-stationary policies can perform better than stationary policies in POMDPs with an agent state. Previous papers that analyze the convergence of Q-learning for POMDPs, such as [Sey+23], do so under very restrictive assumptions which do not hold in the non-stationary setting. As mentioned in point 3 above, they impose a strong assumption on the learning rates. But more importantly, [Sey+23] assume that the Markov chain $\{S_t, Y_t, Z_t, A_t\}_{t \ge 1}$ converges to a stationary limiting distribution. Under this assumption, Q-learning would converge to a stationary limit and therefore the greedy policy will be stationary. Removing this assumption is technically non-trivial, as there is no existing literature on time-periodic Markov chains (the existing literature is on "state-periodic" but time-homogeneous Markov chains). Please see App B.2, in particular Prop 2 and 3, which are both new. These results enable us to generalize stochastic approximation results to time-periodic Markov chains (Prop 4), which is then a key tool used in the convergence analysis presented in this paper. None of these results follow from the analysis presented in [Sey+23].

7. Claim regarding non-stationarity is confusing

Please see the official comment below.

Review (Rating: 7)

The paper introduces PASQL (Periodic Agent-State Based Q-Learning), a novel reinforcement learning approach tailored for Partially Observable Markov Decision Processes (POMDPs). Traditional methods often rely on transforming POMDPs into fully observable MDPs by employing belief states. However, belief states require system model knowledge, limiting their practicality in model-free settings like RL. PASQL addresses this by learning non-stationary, periodic policies using agent states—compressed representations of history—which do not assume the Markov property. This method leverages periodic Markov chains and stochastic approximation theories to prove that PASQL converges to a cyclic limit and can outperform traditional stationary policies in terms of approximation error.

Strengths

  1. PASQL creatively combines concepts from periodic Markov processes and Q-learning to handle non-Markovian dynamics, a significant departure from traditional RL methods.
  2. The paper provides a rigorous theoretical framework, including convergence proofs and bounds on sub-optimality, which substantiates the claims made about PASQL's efficacy.
  3. Through numerical experiments, the paper effectively demonstrates that PASQL can outperform standard Q-learning in POMDP settings, offering practical relevance to the theoretical claims.
  4. The paper offers a comprehensive comparison with existing methods, highlighting PASQL’s advantages in handling agent states that do not satisfy the Markov property.

Weaknesses

  1. The introduction of periodic policies increases complexity and may require more computational resources, which could limit practical applications.
  2. The current study focuses on tabular representations, which might not scale well to high-dimensional problems or those requiring function approximation.

Questions

  1. How does PASQL scale with the dimensionality of the state and action spaces in practical scenarios?
  2. On what basis were the specific periods chosen for the policies in the experiments, and how sensitive is PASQL's performance to these choices?

Limitations

yes

Author Response

Thank you for the positive endorsement. We address your questions and concerns.

1. How does PASQL scale with the dimensionality of the state and action spaces in practical scenarios?

In the paper, we have focused on the tabular setting to analyze the simplest form of the algorithm. In a practical scenario, one would use function approximation with deep neural networks. The current state-of-the-art ASQL implementations like R2D2 use LSTMs, which involve a recurrent layer that stores an "internal state" of the agent, acting as some form of memory of the history; implementations like Dreamer use recurrent state space models (RSSMs), which consist of a self-predictive model that can predict forward in a latent space, similar to a sequential variational autoencoder. So, the natural approach to implement PASQL is to use a similar architecture, with a shared RSSM or LSTM base and $L$ different heads for the different $Q^{\ell}$.
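A minimal sketch of such an architecture (assuming PyTorch; the layer sizes and names are illustrative, not the paper's implementation): a shared LSTM encoder produces the agent state, and $L$ Q-heads are indexed by the phase $\ell = t \bmod L$.

```python
import torch
import torch.nn as nn

class PeriodicQNetwork(nn.Module):
    """Shared recurrent encoder with L separate Q-heads, one per phase l = t mod L."""

    def __init__(self, obs_dim, num_actions, L, hidden=128):
        super().__init__()
        self.L = L
        self.encoder = nn.LSTM(obs_dim, hidden, batch_first=True)    # shared agent-state model
        self.q_heads = nn.ModuleList(
            nn.Linear(hidden, num_actions) for _ in range(L)         # one head per phase
        )

    def forward(self, obs_seq, phase):
        # obs_seq: (batch, time, obs_dim); phase: int in {0, ..., L-1} for the last step
        z, _ = self.encoder(obs_seq)          # z_t acts as the learned agent state
        return self.q_heads[phase](z[:, -1])  # Q^phase(z_t, .) at the final time step

# Usage sketch: Q-values at phase t % L for a batch of observation histories.
net = PeriodicQNetwork(obs_dim=8, num_actions=4, L=2)
q_values = net(torch.randn(16, 10, 8), phase=1)    # shape: (16, 4)
```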

The scaling with respect to state and action spaces will remain similar to that of ASQL, with a worst-case additional scaling factor of $L$. There are two ways in which we can think about scaling in this setting: memory and sample complexity. For memory, the RSSM/LSTM part would remain the same as in ASQL, but we would need a larger replay buffer, because a sample in the replay buffer is now associated with a particular value of $\ell$. For the sample complexity, in the worst case, we would require $L$ times more samples, but in practice, the scaling is much better because each $Q^{\ell}$ is bootstrapped from $Q^{[[\ell + 1]]}$, which leads to improved sample efficiency. We can see this fact in the experiments included in the paper, where PASQL (Fig 3) and ASQL (Fig 5) have roughly the same sample complexity.

2. On what basis were the specific periods chosen for the policies in the experiments, and how sensitive is PASQL's performance to these choices?

In general, one can view the period $L$ as a hyper-parameter. In our experiments, we only tried with $L=2$. The main point that we wanted to make was that PASQL can give a better performance than ASQL, and the experiment with $L=2$ demonstrates that; so we did not experiment with larger values of $L$.

It is difficult to give a precise characterization of the sensitivity of PASQL to the choice of $L$, mainly because the performance depends on the choice of behavioral policy and there is no obvious way to "lift" the behavioral policy for period $L$ to a behavioral policy for a larger period. In theory, as we argue on line 83, the performance of periodic policies with period $L \in \{ n! : n \in \mathbb{N} \}$ increases monotonically. In fact, if we take any integer sequence $\{L_n\}_{n \ge 1}$ with the property that $L_n$ divides $L_{n+1}$, then the performance of the policies with period $L_n$ is monotonically increasing in $n$. We will adapt the discussion around line 83 to clarify this point.

Author Response (Global Rebuttal)

We thank the reviewers for the comments and feedback. We address some of the common issues raised by the reviewers below.

1. Practical implementation of the algorithm

We want to clarify certain points regarding the practical implementation of the algorithm.

  • In terms of implementation complexity, the deep RL version of the algorithm can be implemented with a shared LSTM/RSSM base (similar to R2D2 [Kap+19] / Dreamer [Haf+20]) and multiple heads, one for each of the $L$ phases. So, the additional memory overhead of implementing the algorithm is not too restrictive.

  • In terms of sample complexity, in the worst case the sample complexity will increase by a factor of $L$, but in practice the increase is not so large because each $Q^{\ell}$ is bootstrapped from $Q^{[[\ell+1]]}$ (see the tabular sketch after this list). This is indeed the case in our current experiments, where the sample complexity for $L=2$ (Fig 3) is roughly the same as for $L=1$ (Fig 5).

  • In terms of performance, as we mention in the introduction, PASQL with period $m$ will perform at least as well as PASQL with period $n$ if $m$ is a multiple of $n$. Therefore, PASQL with any $L > 1$ would have performance which is at least as good as the performance of ASQL (which corresponds to $L=1$). Hence, one can think of $L$ as a hyper-parameter which is tuned with other hyper-parameters.
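For concreteness, here is a tabular sketch of the periodic update described above (our simplified reading of PASQL, with assumed variable names, not the paper's exact pseudocode): the phase-$\ell$ table is bootstrapped from the phase-$[[\ell+1]]$ table.

```python
import numpy as np

def pasql_update(Q, z, a, r, z_next, t, L, alpha=0.1, gamma=0.9):
    """One tabular PASQL step. Q has shape (L, num_agent_states, num_actions).

    The table for phase l = t mod L is updated by bootstrapping from the table
    for the next phase (l + 1) mod L, written [[l + 1]] in the paper.
    """
    l, l_next = t % L, (t + 1) % L
    target = r + gamma * Q[l_next, z_next].max()     # bootstrap from Q^{[[l+1]]}
    Q[l, z, a] += alpha * (target - Q[l, z, a])
    return Q

# Usage sketch with made-up sizes: L = 2 phases, 3 agent states, 2 actions.
Q = np.zeros((2, 3, 2))
Q = pasql_update(Q, z=0, a=1, r=1.0, z_next=2, t=0, L=2)
# The learned periodic greedy policy plays argmax_a Q[t % L, z, a] at time t;
# ASQL corresponds to the special case L = 1.
```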

2. Deterministic vs Stochastic policies

The main thesis of the paper is that non-stationary policies perform better than stationary policies in POMDPs with an agent state. This is true for both deterministic and stochastic policies. For instance, as we point out on line 959 (page 22), non-stationary deterministic policies (and therefore non-stationary stochastic policies) can in general perform better than stationary stochastic policies. Thus, adding non-stationarity is beneficial for both deterministic and stochastic policies.

In this paper, we propose and analyze a variant of Q-learning that converges to a non-stationary policy. The reason that we focus on Q-learning is that the current state-of-the-art RL algorithms for POMDPs, such as R2D2 and Dreamer, are variants of Q-learning. In principle, the same idea would work for actor-critic algorithms as well. In fact, our results provide a strong motivation to investigate periodic variants of actor-critic algorithms for POMDPs.

3. Choice of the behavioral policy

The converged value of any agent-state based Q-learning (ASQL) algorithm depends on the behavioral policy. This is in contrast to MDPs, where, at least in theory, any behavioral policy works as long as all state-action pairs are visited infinitely often. Even then, the behavioral policy does impact the sample complexity of Q-learning, and there is a rich literature on choosing the behavior/exploration policy. Such an understanding does not exist for ASQL. Since our algorithm generalizes ASQL to learn non-stationary policies, it shares the same limitations as every Q-learning algorithm for POMDPs.

Final Decision

This paper presents a method for improving learning in POMDPs using periodic policies. It shows that in cases where the 'agent state' that is used for decision-making is imperfect (non-Markovian), non-stationary policies can outperform stationary ones. Periodic agent-state based Q-learning (PASQL) is proposed to learn a form of non-stationary policy that can converge. Experiments also show the method can work well.

Learning in POMDPs is an important but hard topic. Developing better theory and algorithms for this case is a worthwhile challenge. This paper highlights an important problem with simpler methods that poorly approximate the belief or history and provides a rigorous solution. The proposed solution is theoretically strong and performs well.

While the theory is strong in this paper, the presentation is unclear at times due to the dense nature of the paper. For instance, the assumptions of the method are not fully clear in the current paper. First, the introduction of the paper seems to indicate that belief-state MDPs are the only 'correct' way to represent an 'agent state', whereas methods for learning history abstractions (e.g., using an RNN) could learn a Markovian agent state. Similarly, there is mention of 'memory augmentation' in the discussion, but learning a better agent state may be easier. Second, some of the other assumptions are not clear. For example, it is not clear how restrictive Assumption 2 is. Other details are provided by the reviewers. There was also some uncertainty about how practical the proposed algorithm is. The author response was helpful for clarifying some of these points, but the setting and assumptions should be clarified in the paper.