PaperHub

Rating: 6.8/10 · Decision: Poster · 4 reviewers
Reviewer scores: 5, 5, 3, 4 (min 3, max 5, std 0.8)
Confidence: 3.3 · Originality: 2.5 · Quality: 3.3 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

Reward-Aware Proto-Representations in Reinforcement Learning

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

We lay the theoretical foundation underlying the default representation, a reward-aware proto-representation, and empirically demonstrate its benefits over the successor representation in tabular settings.

Abstract

Keywords
Reinforcement Learning, Representation Learning, Proto-Representation, Default Representation, Successor Representation

Reviews and Discussion

Review
Rating: 5

This paper extends the theoretical foundations of the Default Representation (DR), a reward-aware proto-representation in reinforcement learning. Building on prior neuroscience-focused work, the authors develop dynamic programming and temporal-difference learning methods for DR acquisition, characterise its vector space properties relative to the Successor Representation (SR), introduce default features for function approximation, and extend the framework to maximum entropy RL. Empirical evaluations demonstrate DR's advantages over SR in reward shaping, option discovery, exploration, and transfer learning tasks.

Strengths and Weaknesses

STRENGTHS:

  • Provides the first complete derivation of DP/TD methods for DR learning, filling a key gap in prior theoretical work.
  • Offers valuable insights into DR's mathematical properties through eigenvector analysis and trajectory-based interpretations.
  • Introduces default features that enable DR usage without transition dynamics access, enhancing practical applicability.
  • Demonstrates DR's utility across multiple RL paradigms (reward shaping, exploration, transfer) with consistent improvements over SR.
  • Provides functional code implementations supporting claimed results.

WEAKNESSES:

  • The SR's reward-agnostic nature is framed as a weakness, though this design enables flexibility across reward functions - a core SR strength not sufficiently acknowledged.
  • All experiments use negative reward regions, leaving unclear how DR performs in neutral/positive reward landscapes.
  • The focus is exclusively on comparison with the SR, omitting comparisons to modern representation learning approaches (e.g., bisimulation metrics, deep successor features).
  • Maximum entropy extension (MER) receives limited analysis despite being a claimed contribution.

Questions

  1. What are the key challenges in applying DR to real-world RL problems compared to established proto-representations?
  2. Does DR's performance advantage persist in environments without penalty regions?

Limitations

The paper briefly mentions but under-analyses the limitations. Additionally, the biological plausibility claims from earlier DR work are not reconciled with the purely computational focus.

Final Justification

The authors have addressed all of my questions, and concerns from other reviewers as well, hence I will keep a positive score.

Formatting Issues

NA

Author Response

We thank the reviewer for the insightful comments. We are glad that the reviewer finds our work valuable. We will now address the areas of improvement suggested by the reviewer.

Q1: The SR's reward-agnostic nature is framed as a weakness, though this design enables flexibility across reward functions - a core SR strength not sufficiently acknowledged.

We did not intend to frame the reward-agnostic nature of the SR as a weakness; rather, because our paper explores whether reward-awareness can be beneficial, we have focused extensively on scenarios where being reward-aware is important. However, we agree with the reviewer that the reward-awareness of the DR comes at the price of reduced flexibility, especially in the context of transfer. We will adjust our narrative to better reflect this trade-off. For example, in Section 5.1, when discussing default features, we can add the following after line 218:

“While DFs enable directly computing the optimal policy, they are limited to the scenario where only the terminal state rewards change. On the other hand, while SFs can only compute a policy as good as the ones used to compute the SFs, they can be applied when any state rewards change, allowing more flexibility than DFs.”

Q2: All experiments use negative reward regions, leaving unclear how DR performs in neutral/positive reward landscapes.

Summary: We assume rewards at non-terminal states to be negative, which is necessary for the DR and dynamic programming/temporal difference learning methods to be well-behaved. For environments with neutral/positive reward landscapes, we can simply transform the rewards to be negative before applying our framework.

As shown in Theorem 4.1, the reward for all non-terminal states being negative is a necessary condition for the dynamic programming (DP) method to converge, and for the DR to be well-defined. This is why we use negative rewards for all non-terminal states in our experiments.

However, we do not see this as a restriction of this framework, since we can transform a reward function with neutral/positive rewards into one with only negative rewards. For example, in the count-based exploration experiments, the two environments (Riverswim and SixArms) consist of non-negative rewards. In order to apply the DR, denoting the maximum reward by $R_{max}$, we transform every reward by $(r(s, a) - R_{max} - 1) / R_{max}$. Note that the reward transformation is only performed for the DR learning, and for the SARSA part, we use the original reward function. More generally, for a reward function that is non-negative, after applying the transformation, we expect that the DR would treat states (or state-action pairs) having the smallest positive reward as low-reward regions, and states having large positive reward as more desired regions.
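
For concreteness, the transformation above can be written as a one-line helper. This is an illustrative sketch only; the function name and the example reward values are ours, not the paper's:

```python
import numpy as np

def negate_rewards(r, r_max):
    """Shift/scale a non-negative reward so every entry becomes strictly negative,
    as described above: (r - R_max - 1) / R_max."""
    return (r - r_max - 1.0) / r_max

# Example with placeholder non-negative rewards.
rewards = np.array([0.0, 10.0, 50.0, 6000.0])
dr_rewards = negate_rewards(rewards, rewards.max())
print(dr_rewards)                 # all entries are strictly negative
assert np.all(dr_rewards < 0)
```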

Q3: The focus is exclusively on comparison with the SR, omitting comparisons to modern representation learning approaches (e.g., bisimulation metrics, deep successor features).

We focus on comparing with the SR since our goal is to contrast and characterize the difference between being reward-aware and reward-agnostic. While it would be good to add comparisons, including such comparisons could shift the focus of the paper to comparing the performances of different representation learning methods, rather than establishing the fundamentals of the DR and characterizing the properties of a reward-aware proto-representation.

Q4: Maximum entropy extension (MER) receives limited analysis despite being a claimed contribution.

We thank the reviewer, and other reviewers, for pointing this out. We will adjust our contribution, and move the discussion of the MER to the Appendix.

Q5: What are the key challenges in applying DR to real-world RL problems compared to established proto-representations?

One key challenge is that a naive implementation of the framework leads to instability. This is a mechanistic limitation, not a conceptual one.

Due to taking the exponential of negative rewards, when rewards have a large magnitude, the exponential can become very small, causing numerical issues. In our experiments involving small tabular environments, we were able to overcome this using a specialized library for more precise numerical computations. In more complicated environments, this might not be easy. Even if we normalize the reward function, in complicated environments with long horizons, the accumulation of negative rewards over a large number of time steps could still make the problem persist when trying to approximate the DR.
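
To illustrate the numerical issue, the snippet below shows how $\exp(r/\lambda)$ underflows in double precision, together with one possible log-space mitigation; this is our own sketch and not necessarily the specialized library or approach used in the paper:

```python
import numpy as np

lam = 1.0
r = -800.0                          # a large-magnitude negative reward
print(np.exp(r / lam))              # 0.0 -- underflows in float64

# One possible mitigation (not necessarily the paper's approach): work in log space.
log_terms = np.array([-800.0, -805.0, -810.0]) / lam     # logs of the exp(r/lam) terms
m = log_terms.max()
log_sum = m + np.log(np.sum(np.exp(log_terms - m)))      # log-sum-exp without underflow
print(log_sum)                                           # approximately -799.993
```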

Q6: Does DR's performance advantage persist in environments without penalty regions?

As we also demonstrated in Figures 6 and 8 in Appendix C, in environments without penalty regions, the DR and the SR have very similar performance.

Q7: The paper briefly mentions but under-analyses the limitations.

We thank the reviewer for pointing this out. If our paper is accepted, we will use the extra space to better discuss the limitations. We plan to add the following paragraph:

“While the DR provides benefits in environments where being reward-aware is important, there are some limitations. First, naively computing the DR and its eigendecomposition can lead to numerical instabilities due to taking the exponential of negative rewards, as discussed in Appendix C.1. This could be problematic, especially in more complicated environments with long horizons. Second, as we show in Figures 6 and 8 in Appendix C, in environments in which all non-terminal state rewards are the same, the DR will perform as well as the SR. Given the numerical issues associated with the DR, the SR may be a better choice in these environments. Third, while the DR captures the environment dynamics more fully, this comes at the expense of flexibility. While the SR, which is reward-agnostic, can allow transfer learning when any state rewards change, the DR, which is reward-aware, can only perform transfer in settings where only the terminal state rewards change.”

This paragraph would also address the reviewer’s earlier concern that the flexibility of the SR is not sufficiently acknowledged.

Q8: Additionally, the biological plausibility claims from earlier DR work are not reconciled with the purely computational focus.

While biological plausibility inspired us to perform this work, our focus is on bridging the gap in the theoretical formulation of the DR so that it can be applied in computational RL. Discussing biological plausibility would fall outside of the scope of the paper.

Comment

I thank the authors for their thorough response and, being satisfied with their responses to other reviewers as well, I will keep my positive score.

Review
Rating: 5

The paper presents an extension of the successor feature framework to incorporate the default representation. This representation is obtained by computing the expected future rewards under a reference policy in a similar manner to the state-visitation counts used in the successor feature literature. The authors show several foundational results tying the default representation to the successor representation, present simple but tractable algorithms for computing it, and provide evidence for the usefulness of the representation in simple domains.

Strengths and Weaknesses

Overall, the paper provides a very clear and interesting account of a novel representation learning scheme which extends the established successor feature regime in a valuable manner. The paper's theoretical results are clearly established, well explained, and the proofs are correct as far as I can tell. While the paper only provides small scale experiments and doesn't expand the framework to work with modern deep learning architectures, this limitation was clearly stated in the scope of the paper, and so I do not see this as a weakness. The benefits of the default representation over successor features seem intuitively clear to me, which is also due to the overall high quality of writing which establishes (almost) all concepts very clearly.

Weaknesses

All weaknesses stated here are minor and should not be seen as barriers to acceptance. I list them in the interest of pointing out potential areas of improvement.

It would have been useful to add slightly more exposition on the properties of the successor features, for readers not immediately familiar with the field. For example, while the eigenvectors of the successor features are mentioned as "useful", some short exposition on what they intuitively characterize would be great. (line 128f)

The default feature learning approach in Section 5.1 is mentioned to only work with constant non-terminal rewards. However, this limitation is only acknowledged in the expository text of the section, not in the description of the method. It would be helpful to point out how this limitation arises, and maybe (if possible) provide some notes how it could be addressed.

Section 5.2 seems slightly superfluous and somewhat breaks the flow of the paper. While I greatly appreciate that the authors show further extensions of the method, I think this section should either be introduced a bit more clearly, or maybe moved to the appendix? Part of what makes this section slightly worse than the surrounding ones is that relevant concepts, such as the adjacency matrix, are only introduced in the appendix. Alternatively it would be possible to slightly extend the section, but given the tight space, that might be difficult.

Minor nitpicks:

  • Please state what distributional assumptions underlie the 95% confidence intervals shaded in the plots? What method was used to calculate the bounds?

Questions

Are there requirements on the default policy? I assume that it has to be somewhat reasonable for solving the MDP because otherwise no better policy can be learned by deviating from it, correct?

Would it be possible to use the default representation in a policy iteration style where a new improved policy becomes the next iteration's default policy? If so, would there be efficient algorithms to shift or update the representations to account for the new optimal policies, or would all relevant quantities have to be recomputed?

Several equations (espec. the policy update formulation) are highly reminiscent of Relative Entropy Policy Search. It might be good to highlight the connection and cite prior work here, especially as a line of empirically strong relevant algorithms has been developed which could readily be combined with the idea of default features.

[1] Relative Entropy Policy Search, Peters et al., AAAI 2010

Limitations

n/a

Final Justification

I keep my positive score. The answers of the authors to my (minor) questions were satisfactory.

Formatting Issues

n/a

Author Response

We thank the reviewer for the valuable feedback. We are glad that the reviewer finds our paper clear and interesting. We will first address the potential areas of improvement suggested by the reviewer, before moving to the questions.

Q1: It would have been useful to add slightly more exposition on the properties of the successor features, for readers not immediately familiar with the field. For example, while the eigenvectors of the successor features are mentioned as "useful", some short exposition on what they intuitively characterize would be great. (line 128f)

If our paper is accepted, given the extra page, we can make use of the space to add more explanation for the successor representation/features. As an example, we can make the following change to line 128:

“This result is particularly interesting because the eigenvectors of the SR have been shown to be useful in settings such as reward shaping and option discovery. The top eigenvector of the SR captures temporal distance between states, and such distance has been used for reward shaping as it leads to better performance than distances that are based on the coordinates of states [42, 38]. In option discovery, it has been shown that eigenoptions, which are obtained by training a policy to maximize a reward signal induced by the eigenvectors of the SR, allow the agent to reach less explored areas of the state space, facilitating exploration [21]. We will revisit these applications in Section 6.”

Apart from this change, we will revise the whole paper to provide more explanation wherever possible.

Q2: The default feature learning approach in Section 5.1 is mentioned to only work with constant non-terminal rewards. However, this limitation is only acknowledged in the expository text of the section, not in the description of the method. It would be helpful to point out how this limitation arises, and maybe (if possible) provide some notes how it could be addressed.

Summary: The limitation arises from the fact that default features (DFs) are dependent on non-terminal state rewards. As for ways to address this limitation of the framework, in practice, it may be possible to simply allow the agent to explore the areas around the non-terminal states for which the rewards have changed.

We will add the following to the end of line 206 of Section 5.1 to better explain how this limitation arises: “Note that such a form of optimal value computation is only possible when only the rewards at terminal states change. The reason is that, when only the terminal state rewards change, $\zeta(s)$ stays the same for all non-terminal $s$. However, when the rewards for non-terminal states change, $\zeta(s)$ changes, since $\zeta(s) = \mathbf{Z}_{NN} \mathbf{P}_{NT}^{\pi_d} \mathbf{\Phi}$ and $\mathbf{Z}_{NN}$ depends on non-terminal state rewards.”
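
To make the role of $\zeta(s)$ concrete, here is a small numerical sketch of default features in the linearly solvable MDP setting, assuming $\mathbf{Z}_{NN} = (\mathrm{diag}(\exp(-r_N/\lambda)) - \mathbf{P}_{NN}^{\pi_d})^{-1}$ and one-hot terminal features; all numbers, and the exact use of $\Phi$ and $w$, are our illustrative assumptions rather than the paper's Equation 16:

```python
import numpy as np

lam = 1.0
# Tiny illustrative MDP: 3 non-terminal and 2 terminal states (made-up numbers).
r_N = np.array([-0.1, -0.1, -0.1])            # negative non-terminal rewards
P_NN = np.array([[0.0, 0.5, 0.0],             # default-policy transitions, non-terminal -> non-terminal
                 [0.4, 0.0, 0.4],
                 [0.0, 0.5, 0.0]])
P_NT = np.array([[0.5, 0.0],                  # default-policy transitions, non-terminal -> terminal
                 [0.1, 0.1],
                 [0.0, 0.5]])
Phi = np.eye(2)                               # terminal-state features (one-hot)

Z_NN = np.linalg.inv(np.diag(np.exp(-r_N / lam)) - P_NN)
zeta = Z_NN @ P_NT @ Phi                      # default features: zeta = Z_NN P_NT Phi

# Zero-shot evaluation for a *new* terminal reward vector: only w changes, zeta is reused.
r_T_new = np.array([0.0, 1.0])
w = np.exp(r_T_new / lam)                     # so that exp(r_T/lam) = Phi w (Phi = I here)
v_soft = lam * np.log(zeta @ w)               # soft-optimal (KL-regularized) values at non-terminal states
print(v_soft)
```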

Without assuming access to the transition and reward functions, one way to address the change in non-terminal state rewards is to allow the agent to interact with the environment and update the default features to reflect the changes. This method, while seemingly naive, is in fact reasonable since, without assuming access to environment dynamics, the agent has no way to be aware of changes in the environment without experiencing the changes itself. However, since the default feature of every non-terminal state is dependent on other states, even if the reward has not changed in a non-terminal state, the agent still needs to visit it so that changes in other non-terminal states are reflected. In practice, it might be possible to get away with just visiting local regions around the non-terminal states for which the reward has changed, since the contribution of the DFs of a state to those of other states decays exponentially as the distance increases (see Equation 16). Therefore, DFs of states far away from the non-terminal states for which rewards have changed may still be good enough approximations of their ground-truth DFs.

This way of addressing the changes in non-terminal state rewards for the DFs is speculative, and we have not performed experiments in this particular case, so we do not plan on including this discussion in the paper. Nevertheless, we find this direction very interesting and will pursue this further in our future work.

Q3: Section 5.2 seems slightly superfluous and somewhat breaks the flow of the paper.

We will move the introduction of the MER to the Appendix.

Q4: Please state what distributional assumptions underlie the 95% confidence intervals shaded in the plots? What method was used to calculate the bounds?

We discussed the computation of 95% confidence intervals in Appendix D. We now realize that readers might not notice it due to the lack of pointers in the main text, and will add pointers.

Q5: Are there requirements on the default policy? I assume that it has to be somewhat reasonable for solving the MDP because otherwise no better policy can be learned by deviating from it, correct?

Summary: To ensure that the deviation cost from the default policy is well-defined for any agent policy, the default policy needs to assign non-zero probability to all actions in all states. Without prior knowledge of the problem, the standard choice is the uniform random policy.

Mathematically, in order for the deviation cost from the default policy, which is modeled as the KL divergence between the transition probabilities under the agent’s policy $\pi$ and the default policy $\pi_d$, to be well defined, the support of the transition probabilities under $\pi_d$ should be wider than that under $\pi$. To ensure that this is true for any $\pi$, $\pi_d$ should assign non-zero probability to all actions in all states. We will explicitly state this requirement in line 81, when the default policy is first mentioned.

As long as the above requirement is satisfied, we can choose any default policy, though, as pointed out by prior work [1], and also noticed by the reviewer, the choice of the default policy affects the quality of the optimal policy. In general, the optimal policy is softly biased towards the default policy. For example, if the default policy assigns large probabilities to $a_1$ in $s_1$, the optimal policy will be inclined to do the same. This is why, without prior knowledge of the problem, the standard practice is to use the uniform random policy.
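
As a small illustration of the full-support requirement (our own sketch; the policies below are made up), the deviation cost blows up as soon as the default policy puts zero probability on an action that the agent's policy uses:

```python
import numpy as np

def kl(p, q):
    """KL(p || q) over a discrete action set; infinite if q lacks support where p > 0."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

pi = np.array([0.7, 0.3, 0.0])           # agent's policy at some state
pi_d = np.array([1/3, 1/3, 1/3])         # uniform random default policy: full support
print(kl(pi, pi_d))                      # finite deviation cost

pi_d_bad = np.array([0.0, 0.5, 0.5])     # default policy missing an action the agent uses
print(kl(pi, pi_d_bad))                  # inf -- the deviation cost is unbounded
```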

Q6: Would it be possible to use the default representation in a policy iteration style where a new improved policy becomes the next iterations default policy?

Summary: Our option discovery algorithm, RACE, is related to the algorithm suggested by the reviewer. It is possible to construct such forms of algorithms, and they can be useful.

In RACE, we in fact compute a DR with respect to a constantly-evolving default policy. This is because, as we compute and follow options, the transitions are not collected strictly under the default policy anymore. Rather, they come from a mixture of the default policy (uniform random) and the computed options. At every iteration, the default representation induces an option, which changes the behavior policy. The new behavior policy can now reach states that are further away from the start state and at the frontier of exploration in a limited number of time steps. The default representation is then updated with respect to this new behavior policy, and induces a new option, and the cycle goes on. Therefore, the idea suggested by the reviewer is possible, and can be useful in certain scenarios.

Q7: If so, would there be efficient algorithms to shift or update the representations to account for the new optimal policies, or would all relevant quantities have to be recomputed?

Summary: When the default policy changes, there are, in general, no efficient algorithms for computing the new DR, unless we assume access to transition dynamics. However, there are two possible ways we can address this.

In RACE, we do not have an efficient method to directly compute the DR for the new default policy, and only update the DR using transitions and TD learning. While a prior work [1] has demonstrated using matrix inversion lemmas to directly compute the new DR when the default policy changes, such computation is only possible if the transition dynamics is known. For the general case when the environment dynamics is not known, we are currently not aware of efficient algorithms to directly update the DR.

There are two possible directions to address this: 1) If only the default policy changes, and environment dynamics stays constant, it might be possible to update the default representation using a replay buffer and importance sampling. 2) We can perhaps use the generalization ability of a neural network to help generalize the default representation to new policies, similar to USFA [2]. For example, a neural network can take the state and an encoding of a policy as the input, and output the corresponding default representation.

[1] P. Piray and N. D. Daw. Linear reinforcement learning in planning, grid fields, and cognitive control. Nature Communications, 12(1):4942, 2021.

[2] Borsa et al. Universal successor features approximators. ICLR 2019.

Q8: Several equations (espec. the policy update formulation) are highly reminiscent of Relative Entropy Policy Search.

We thank the reviewer for pointing out relevant work. Relative entropy policy search constrains the KL divergence between the data distribution induced by the policy and the data distribution of past observed data. This indeed is similar to linearly solvable MDPs, as both frameworks use KL divergence to avoid deviating too much from a reference distribution. One difference is that in linearly solvable MDPs, the KL term is added to the reward function, while in relative entropy policy search, the KL term is treated as a constraint of the optimization. Given the similarities, we will cite this and related papers.

Comment

Thanks for the clarifications. I am happy to keep my positive rating and recommend acceptance.

Review
Rating: 3

Based on the concept of the default representation, this paper investigates how the incorporation of reward signals into proto-representations (that is, representations that encode temporal similarity of states) can influence RL algorithms and tasks downstream. First, the paper establishes the default representation (DR) as the fixed point of a dynamic programming scheme, and presents a TD-learning update rule to estimate the DR from samples. This formalism is then generalized to default features (DF, much like successor features), which can be easily represented / estimated even in continuous MDPs. It is shown that optimal value functions can be efficiently extracted from the DR (and DFs), and additionally, the DR (resp. DFs) can be used to zero-shot infer optimal policies for any configuration of terminal rewards specified after training. The paper concludes by demonstrating empirically that substituting the successor representation for the DR in many RL approaches (e.g., option discovery, exploration bonuses) tends to lead to more efficient learning.

Strengths and Weaknesses

Strengths

Firstly, this paper is visually stunning, and largely well written.

I enjoyed reading about the DR, and I believe the contribution of formalizing this concept and establishing algorithmic techniques for estimating it is quite timely.

Moreover, despite the relatively small scale of the experiments, the breadth of the experiments is quite large, in my opinion. The purpose of the empirical section is not to demonstrate state of the art performance, but to explore how the DR can be leveraged downstream, and I think the authors have done a great job at this. The results with DR driving exploration bonuses and option discovery were surprising and encouraging to me.

Weaknesses

My main criticisms of the paper are the following:

  1. The math is quite imprecise, making some of the results difficult to interpret, and in some cases, (I believe) incorrect without further assumptions. I suspect most of this can be fixed by clarifying the statements and assumptions.
  2. The experimental section, particularly the section about options, does not have enough details. Some of these sections would be effectively impossible to follow / understand without reading the appendix.

Moreover, as I discuss in more detail later, the section on the maximum entropy representation seems basically vacuous, and (in my view) it does not add to the paper. At the very least, I highly recommend for this section to be moved to the appendix, at which point space could be used to substantially clear up the experimental sections.

The Questions section outlines more specifically some of the issues I highlighted above.

Minor Issues

  1. Nit: regarding the TD-learning update rule for the SR shown in (3), credit should be also given to the seminal work of Dayan ([7] in your submission's reference list).
  2. The notation $\mathrm{KL}(p^\pi \| p^{\pi_d} \mid S_t)$ is difficult to read and seemingly incorrect (though it's otherwise reasonably clear what you're describing). Why not $\mathrm{KL}(p^\pi(\cdot\mid S_t)\| p^{\pi_d}(\cdot\mid S_t))$? It's more verbose, but it correctly conditions both probability kernels on the state. Alternatively you can use kernel notation and write $\mathrm{KL}(p^\pi_{S_t} \| p^{\pi_d}_{S_t})$, where e.g. $p^\pi_s\in\Delta(\mathcal{S})$ for any $s\in\mathcal{S}$.
  3. Nit: the notation $\mathbf{Z}$ should specify $\lambda$, e.g. $\mathbf{Z}^\lambda$.
  4. You are using $\lambda$ for both the regularization strength of linearly-solvable MDPs and for eigenvalues.
  5. On line 174, "Neurmann series" should be "Neumann series".

Questions

I split this section into two parts, the first listing questions / concerns that directly inform my score, and the second which lists less pressing questions for clarification.

Critical Questions / Concerns

There is an important argument missing in Theorem 3.1, which is partially addressed in its proof. Namely, you claim that the $i$th eigenvectors of the SR and the DR are the same, where it is implied that the eigenvectors here are ranked by the scale of their corresponding eigenvalues. However, from equation (6), this result is not clear. In the proof of this theorem, you claim that when $\lambda_{SR, i}$ increases, then $\lambda_{DR, i}$ increases. However, I would expect the relevant quantity to be the magnitude of the eigenvalue. Is it not possible that, for example, $\lambda_{SR, i} = -1000$ while $\lambda_{DR, i} \approx 0$, in which case the corresponding eigenvector is highly descriptive of the SR while being effectively irrelevant for the DR? To be clear, I'm not necessarily convinced that this is possible, but I also believe it is not trivially impossible, and it is important to establish.

I have some concerns about equation (7). Particularly I find the notation $\tau_{s\to s'}$ imprecise and confusing. Firstly, I believe this equation only makes sense, for multiple reasons, in the finite horizon case. Maybe most simply put, if you have an MDP with a single state, there are $|\mathbb{N}|$ trajectories $\tau_{s\to s}$ (following how this notation is used in the derivation of equation 7, where you enumerate over non-distinct trajectories), each having probability 1. Then the $\mathbf{Z}(s, s)$ is not well-defined (or rather, it is infinite whenever the reward is nonzero). In the case of MDPs with more than one state, it's not immediately clear that the proof is correct, for the same reason. The only way I can see the derivation in equation (25) converging generally is if either trajectories are finitely long, or there are terminal states, but this isn't specified clearly anywhere.

I also have concerns about the MER. I don't think the $\mathbf{A}$ term is correct. The MER, as you've defined it, is precisely the same for any MDP without rewards as long as it is possible to jump between any pair of states with nonzero probability. In this case, what information does the MER provide? Clearly, one could not zero-shot infer optimal policies for instantiations of the terminal reward, because the optimal policies would be the same for any such MDP. I strongly suspect $\mathbf{A}$ has to include information about the transition probabilities between states. In any case, Definition 5.1 is effectively meaningless on its own. You have not established any useful properties of this object (indeed, as I describe just above, the generalization property of the DR doesn't appear to hold).

Section 6.2 gives no details at all about how the proposed RACE algorithm works, besides saying it's roughly the same as CEO substituting the SR for the DR. There should be at least a high level description of the algorithm, even if the details must be deferred to the appendix.

Ancillary Questions

  1. Is there a difference between the objectives of a linearly-solvable MDP with default policy $\pi_d$ and relative-entropy-regularized RL with reference policy $\pi_d$ (and, say, finite horizon with $\gamma=1$)? It would be helpful to explicitly write out the objective.
  2. In section 6.1, which policy are you estimating the SR from? Is it just the default policy?
  3. In section 6.1, can you give some intuition about why the potential-based rewards make sense? Particularly in the case of the SR (depending on which policy induces it), why should this potential-based reward be meaningful?
  4. In section 6.1, are you using the state-only DR leveraging the exact reward function? Under which circumstances would you have access to this DR but not to an optimal policy? Is the idea primarily that you can use the DR to upgrade from a "soft" optimal policy (i.e., optimal under the KL penalty) to a "true" (un-regularized) optimal policy?

Limitations

As mentioned above, there seem to be some missing assumptions regarding the DR. That said, I believe these assumptions are mentioned, just not as explicitly as they should be. Particularly, it seems that the DR requires finite episode horizon.

Final Justification

While the authors have improved on some of the issues I pointed out in the text, their theoretical results are still a little shaky -- see the discussion I had with them, where they require the introduction of several new assumptions (some of which are more severe than others). I am maintaining my score for now, curious to hear what other reviewers think in the proceeding discussion.

Formatting Issues

No major formatting issues.

Author Response

We thank the reviewer for the thoughtful and insightful feedback. We are glad that the reviewer finds our work visually appealing and well written, and the experiments broad and encouraging. Given limited space, we here address the critical concerns, and defer the response to remaining questions to a later post.

Q1a: You claim that the ith eigenvectors of the SR and the DR are the same, where it is implied that the eigenvectors here are ranked by the scale of their corresponding eigenvalues. However, from equation (6), this result is not clear.

Summary: Thank you for pointing this out. We now realize that there’s a minor flaw with Theorem 3.1, and further assumptions need to be made for the order of the eigenvectors to be the same for the SR and DR.

The eigenvalues of the SR and DR for the same eigenvector are related by the following:

$\lambda_{SR}=f(\lambda_{DR})=[\gamma (\lambda_{DR}^{-1}+\gamma^{-1}-\exp(-r(s)/\lambda))]^{-1}$

If we take the derivative, we have $f'(\lambda_{DR})=(1/\gamma)[1+\lambda_{DR}(\gamma^{-1}-\exp(-r(s)/\lambda))]^{-2}$.

For $0 < \gamma < 1$, this derivative is positive everywhere, except when $\lambda_{DR}=-1/(\gamma^{-1}-\exp(-r(s)/\lambda))$.

We overlooked the condition $\lambda_{DR}=-1/(\gamma^{-1}-\exp(-r(s)/\lambda))$, and falsely concluded that $f$ is a monotonic function, and so the order of the eigenvectors is always preserved. To fix this, we actually need one extra assumption: all eigenvalues of the DR lie on the same side of $-1/(\gamma^{-1}-\exp(-r(s)/\lambda))$.

When this holds, the order of the eigenvectors is also preserved. Theorem 3.1 can be easily fixed by including this additional condition, and we just adjust our theorem statement to the following:

“Suppose both the SR and DR are computed with respect to the same policy, i.e., $\pi = \pi_d$. When the reward function is constant, i.e., $r(s) = r(s') \ \forall s, s' \in \mathcal{S}$, the SR and DR share the same set of eigenvectors, and the eigenvalues of the SR and DR for the same eigenvector are related as follows:

$\lambda_{SR} = [\gamma (\lambda_{DR}^{-1}-\exp(-r(s)/\lambda)+\gamma^{-1})]^{-1}.$

Furthermore, when $\gamma^{-1}-\exp(-r(s)/\lambda)=0$ or when all eigenvalues of the DR lie on the same side of $-1/(\gamma^{-1}-\exp(-r(s)/\lambda))$, the order of the eigenvectors of the SR and DR by their corresponding eigenvalues is the same.”
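
As a quick numerical sanity check of this relation (our own sketch, assuming $\mathrm{DR}^{-1} = \mathrm{diag}(\exp(-r/\lambda)) - P^{\pi_d}$, consistent with the inverse-DR form used later in this discussion, and $\mathrm{SR} = (I - \gamma P^{\pi_d})^{-1}$ with a constant reward):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, lam, r = 5, 0.9, 1.0, -0.3                        # constant negative reward

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # random stochastic matrix P^{pi_d}
SR = np.linalg.inv(np.eye(n) - gamma * P)                   # SR = (I - gamma P)^{-1}
DR = np.linalg.inv(np.exp(-r / lam) * np.eye(n) - P)        # DR^{-1} = diag(exp(-r/lam)) - P

lam_DR, V = np.linalg.eig(DR)                               # eigenvectors shared with the SR
lam_SR_pred = 1.0 / (gamma * (1.0 / lam_DR - np.exp(-r / lam) + 1.0 / gamma))
lam_SR_true = np.diag(np.linalg.inv(V) @ SR @ V)            # SR expressed in the DR's eigenbasis
print(np.allclose(lam_SR_pred, lam_SR_true))                # True
```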

Q1b: In the proof of this theorem, you claim that when $\lambda_{SR,i}$ increases, then $\lambda_{DR,i}$ increases. Is it not possible that, for example, $\lambda_{SR,i}=-1000$ while $\lambda_{DR,i}\approx 0$, in which case the corresponding eigenvector is highly descriptive of the SR while being effectively irrelevant for the DR?

Summary: We agree with the reviewer, and will make a small adjustment in our argument to support the claim that the DR is a generalization of the SR.

Theorem 3.1 has been used to claim that the DR is a generalization of the SR: when the reward function is constant, they capture similar information because they share the same eigenvectors, and the order of eigenvectors is also the same. However, the reviewer is right that merely having the same order does not mean that the two matrices capture similar information. Nevertheless, the DR can still be seen as a generalization of the SR, since for the special case of $\exp(-r(s)/\lambda) = \gamma^{-1}$, the order of the eigenvectors is preserved, and the magnitude only differs by a constant $\gamma$.

We now provide a full account of how DR is a generalization of the SR:

  1. When the reward function is constant and $\exp(-r(s)/\lambda) = \gamma^{-1}$, the $i$-th eigenvectors of the DR and SR are identical, and the eigenvalues only differ by $\gamma$.
  2. When the reward function is constant, and $\exp(-r(s)/\lambda) \neq \gamma^{-1}$, the DR and SR share the same set of eigenvectors. Furthermore, when all eigenvalues of the DR lie on the same side of $-1/(\gamma^{-1} - \exp(-r(s)/\lambda))$, the order of the eigenvectors is the same.
  3. When the reward function is not constant, the DR and SR can have different eigenvectors.

We will highlight the special case of $\exp(-r(s)/\lambda) = \gamma^{-1}$ to better support our claim.

Q2: In equation (7), I find the notation $\tau_{s \to s'}$ imprecise and confusing. Firstly, I believe this equation only makes sense, for multiple reasons, in the finite horizon case. Maybe most simply put, if you have an MDP with a single state, there are $|\mathbb{N}|$ trajectories $\tau_{s \to s}$ (following how this notation is used in the derivation of equation 7, where you enumerate over non-distinct trajectories), each having probability 1. Then $\mathbf{Z}(s, s)$ is not well-defined (or rather, it is infinite whenever the reward is nonzero). In the case of MDPs with more than one state, it's not immediately clear that the proof is correct, for the same reason. The only way I can see the derivation in equation (25) converging generally is if either trajectories are finitely long, or there are terminal states, but this isn't specified clearly anywhere.

Summary: We will adjust our notation for the trajectories. The reviewer is right that equation 7 relies on an additional assumption, which we only mentioned later in the paper in Theorem 4.1: the assumption that non-terminal state rewards are all negative.

We used $\tau_{s \to s'}$ to refer to distinct trajectories from $s$ to $s'$, which may be confusing for readers. We will switch to defining distinct trajectories from $s$ to $s'$ as follows: let $\tau_{s\to s'}$ denote a trajectory from $s$ to $s'$, and let $\mathcal{T}_{s\to s'}=\{\tau_{s\to s'} \mid \mathbf{P}^\pi(\tau_{s\to s'})>0\}$. When summing over trajectories, we will use $\sum_{\tau\in\mathcal{T}_{s\to s'}}$ instead. Now, with the set notation, it will be clear that we are referring to distinct trajectories.

As for the concern about whether Z(s, s’) is defined, it is defined because we assume that non-terminal state rewards are negative. Therefore, even if a state can transition to itself with probability 1, the contribution of longer trajectories to the summation decays exponentially. While we mentioned this assumption in Theorem 4.1, we agree with the reviewer that we could have mentioned it explicitly earlier. We will add this to Section 2.2. Note that we do not see this assumption as a restriction of the framework (see our response to reviewer M3b3 Q2).

Q3: I also have concerns about the MER. I don't think the $\mathbf{A}$ term is correct. The MER, as you've defined it, is precisely the same for any MDP without rewards as long as it is possible to jump between any pair of states with nonzero probability. In this case, what information does the MER provide? Clearly, one could not zero-shot infer optimal policies for instantiations of the terminal reward, because the optimal policies would be the same for any such MDP.

Summary: The reviewer is correct that the MER would be the same regardless of specific transition probabilities when the graph connectivity stays the same. We claim that this is a property of the MER, rather than an error in our derivation.

We first provide intuition in the case of a deterministic transition function where each state-action pair leads to a different next state with probability 1. First, we can show that the linearly solvable MDP formulation can be transformed into a maximum entropy formulation by choosing the default policy to be the uniform random policy:

$r(s)-\lambda \mathrm{KL}(\pi(\cdot\mid s)\|\pi_d(\cdot\mid s))=r(s)-\lambda\mathbb{E}_{a\sim\pi}[\log\pi(a\mid s)]+\lambda\mathbb{E}_{a\sim\pi}[\log\pi_d(a\mid s)]=r(s)-\lambda\log|A|+\lambda\mathcal{H}(\pi(\cdot\mid s))$

By choosing $\pi_d$ to be the uniform random policy, the linearly solvable MDP formulation is equivalent to a maximum entropy formulation where the rewards are shifted by a constant. In this setting, note that by following the uniform random policy, the agent in fact has a uniform probability of visiting every next state. It then makes intuitive sense for the MER to not capture the specific transition probabilities, since the policy with respect to which the MER is computed assigns a uniform distribution over transitions.

For stochastic transitions, the MER can be thought of as being computed with respect to a "generalized" uniform random policy. An agent following this policy has a uniform transition probability to every next state. Therefore, it is a property of the MER to remain the same when specific transition probabilities change, but the connectivity of the graph remains the same.

We disagree with the reviewer that one-shot optimal policy inference cannot be performed in this case. It is possible to use the MER, in a similar manner as the DR, to perform transfer. However, the optimal policy we obtain will have the property of disregarding actual transition probabilities, and treating all trajectories as equally likely. This optimal policy may not be useful in the standard settings that we usually consider, but could be informative in certain settings. For example, this optimal policy can possibly be used as an “optimistic” policy that assumes that the agent can transition to a desirable next state with the same probability as any other next state, regardless of how low the actual probability is.

Finally, we provide a more general view of the SR, DR, and MER: The DR is the most general form that depends on both the transition probabilities and reward function. The SR is a special case when we disregard the reward information. The MER, on the other hand, can be thought of as another special case of the DR that considers rewards but disregards the transition probabilities.

Nevertheless, we agree with the reviewer that given the limited exploration of the properties of the MER, including it in the main text can obstruct the flow without providing much value. As the MER is not a central contribution of our paper, we will move it to the Appendix.

Comment

Thanks to the authors for their detailed comments.

Q1: I think this makes sense, I'm going to try to find some time to verify it exactly. In the meantime, can you comment at all on which MDPs you would expect to satisfy this property?

Q2: Again, this sounds plausible, but I'll need some time to take another look.

Q3: Thanks for moving this to the appendix. I am still a little suspicious of this representation, at least it doesn't seem to have nearly the same properties as the SR or DR. Suppose you have an MDP with continuous state space and Gaussian transitions with very low variance; then the equivalent of the $\mathbf{A}$ operator would contain effectively no information, whereas an environment with deterministic transitions at the centers of the aforementioned would have an entirely different $\mathbf{A}$ operator. Again, maybe this is fine, it just doesn't fit with the rest of the paper and I personally believe it needs further development.

Q7: I think I misunderstood the experimental setup here. So just to be sure, you're basically "pre-training" the SR/DR to extract the top eigenvector and then training an RL method from scratch using the potential-based rewards? I was confused because I thought you were also learning the $\mathbf{e}$ online (via SR/DR), at which point the potential-based reward may never actually provide any signal.

Comment

Q4: The experimental section, particularly the section about options, does not have enough details. Some of these sections would be effectively impossible to follow/understand without reading the appendix.

If the paper is accepted, using the additional space, we will add more experiment details. E.g., we can add the following to Section 6.2 to clarify the RACE algorithm:

“In RACE, at every iteration, the agent collects samples to learn the DR, the eigenvector of which is then used to compute an eigenoption. The eigenoption will be used in the next RACE iteration to collect samples. This forms a cycle of using eigenoptions to improve the learned representation, and using the learned representation to then refine eigenoptions.”
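
A high-level paraphrase of this cycle in code (names, signatures, and the policy-mixing step are our illustrative assumptions; the actual RACE algorithm is specified in the paper's appendix):

```python
def race(env, n_iterations, collect_samples, td_update_dr, top_eigenvector,
         train_eigenoption, mix_policies, uniform_random_policy):
    """Sketch of the RACE cycle described above: learn the DR from samples,
    extract its top eigenvector, turn it into an eigenoption, and use that
    option (mixed with the uniform random policy) to collect the next batch."""
    behavior_policy = uniform_random_policy
    dr_estimate, options = None, []
    for _ in range(n_iterations):
        transitions = collect_samples(env, behavior_policy)    # samples under current behavior
        dr_estimate = td_update_dr(dr_estimate, transitions)   # TD learning of the DR
        e = top_eigenvector(dr_estimate)                       # eigenvector of the learned DR
        options.append(train_eigenoption(env, e))              # eigenoption induced by e
        behavior_policy = mix_policies(uniform_random_policy, options)
    return options
```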

Q5: Is there a difference between the objectives of a linearly-solvable MDP with default policy $\pi_d$ and relative-entropy-regularized RL with reference policy $\pi_d$ (and, say, finite horizon with $\gamma=1$)?

For relative-entropy-regularized RL with reference policy $\pi_d$ and $\gamma=1$, the objective is identical to the linearly solvable MDP objective: $\arg\max_\pi \mathbb{E}[\sum_{t=0}^\infty (r(s_t)-\lambda \mathrm{KL}(\pi(\cdot \mid s_t) \| \pi_d(\cdot \mid s_t)))].$

However, linearly solvable MDPs make further assumptions to enable expressing the optimal value function as a linear equation: 1) It assumes that the reward function is non-positive, and 2) it assumes the existence of at least one terminal state.

Q6: In section 6.1, which policy are you estimating the SR from? Is it just the default policy?

We use the uniform random policy for both the SR and DR.

Q7: In section 6.1, can you give some intuition about why the potential-based rewards make sense? Particularly in the case of the SR (depending on which policy induces it), why should this potential-based reward be meaningful?

The top eigenvector of the SR captures the temporal distance between states. It usually assigns the highest value to one state, and the values for other states decrease smoothly as the temporal distance from the highest-value state increases. In the absence of terminal states, usually one of the corners of the environment will have the highest value. In the presence of a terminal state, the terminal state will have the highest value. Hence, the eigenvector captures the temporal distance from the terminal state: a state has a larger eigenvector entry if it is closer to the terminal state. This is why the eigenvector can be used as a potential function for reward shaping, since the agent will receive a reward bonus every time it gets closer to the terminal state.
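
For readers unfamiliar with the mechanism, the standard potential-based shaping form (Ng et al., 1999) with the top eigenvector as the potential looks as follows; whether the paper uses exactly this form of the bonus is our assumption, and the numbers are made up:

```python
import numpy as np

def shaped_reward(r, s, s_next, e, gamma=0.99):
    """Potential-based shaping with the eigenvector e as the potential:
    r + F(s, s'), where F(s, s') = gamma * e[s'] - e[s]."""
    return r + gamma * e[s_next] - e[s]

# Toy illustration: e assigns higher values to states closer to the terminal state.
e = np.array([0.1, 0.4, 0.7, 1.0])
print(shaped_reward(r=-1.0, s=1, s_next=2, e=e))   # bonus for moving toward the terminal state
```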

Q8: In section 6.1, are you using the state-only DR leveraging the exact reward function? Under which circumstances would you have access to this DR but not to an optimal policy? Is the idea primarily that you can use the DR to upgrade from a "soft" optimal policy (i.e., optimal under the KL penalty) to a "true" (un-regularized) optimal policy?

Yes. Our purpose here is to demonstrate the utility of the top eigenvector of the DR, and particularly how its reward-avoiding nature can provide benefits over the SR in applications where the SR has been applied before. Therefore, in this particular experiment, we assume that we are given the eigenvector. In more practical scenarios, we would not assume access to the eigenvector (or the reward and transition function), and would learn the eigenvector of the DR and the policy simultaneously. We plan to pursue this direction, especially in more complicated environments, in follow-up work.

Q9: Nit: regarding the TD-learning update rule for the SR shown in (3), credit should be also given to the seminal work of Dayan ([7] in your submission's reference list).

Thanks for pointing this out, we will do so.

Q10: The notation $\mathrm{KL}(p^\pi \| p^{\pi_d} \mid S_t)$ is difficult to read and seemingly incorrect (though it's otherwise reasonably clear what you're describing). Why not $\mathrm{KL}(p^\pi(\cdot\mid S_t)\| p^{\pi_d}(\cdot\mid S_t))$?

We will switch to using the suggested notation. We used this notation since it was more concise than the suggested notation. In an earlier draft of the paper, we included the derivation of the DR for the state reward case in the Appendix, and using this notation makes the equations more concise and easier to read. We later removed that derivation since it largely overlaps with the one presented by Piray & Daw, but we kept the notation.

Q11: Nit: the notation $\mathbf{Z}$ should specify $\lambda$, e.g. $\mathbf{Z}^\lambda$.

$\mathbf{Z}$ depends on $\lambda$, but $\lambda$, similar to $\gamma$ in the case of the SR, is usually fixed for a given problem. For simplicity, people have chosen to omit $\gamma$ for the SR. Similarly, we omit $\lambda$ for the DR.

Q12: You are using λ for both the regularization strength of linearly-solvable MDPs and for eigenvalues.

To improve clarity, we will use another symbol to refer to eigenvalues of the SR and the DR.

Q13: On line 174, "Neurmann series" should be "Neumann series".

We will fix it.

Comment

Re-Q3: Thanks for moving this to the appendix. I am still a little suspicious of this representation, at least it doesn't seem to have nearly the same properties as the SR or DR. Suppose you have an MDP with continuous state space and Gaussian transitions with very low variance; then the equivalent of the A operator would contain effectively no information, whereas an environment with deterministic transitions at the centers of the aforementioned would have an entirely different A operator. Again, maybe this is fine, it just doesn't fit with the rest of the paper and I personally believe it needs further development.

The difference in $\mathbf{A}$ described by the reviewer is a property of the MER. In the continuous state space example described by the reviewer, $\mathbf{A}$ contains the information that every state is connected to all other states. In the deterministic transition case, $\mathbf{A}$ would contain the information that every state is connected to one other particular state. As we explained earlier, the MER captures graph connectivity, so it actually makes sense for the MER to have a different $\mathbf{A}$ here, since in the continuous and deterministic cases described by the reviewer, the connectivity of states is different.

As our paper’s main contribution is the default representation, and the MER is only mentioned as another reward-aware proto-representation arising from another MDP framework, we will limit the discussion of the MER to the Appendix for now, and explore it further in a future work. In particular, it will be interesting to see if it really gives an “optimistic policy”.

Re-Q7: I think I misunderstood the experimental setup here. So just to be sure, you're basically "pre-training" the SR/DR to extract the top eigenvector and then training an RL method from scratch using the potential-based rewards? I was confused because I thought you were also learning the e online (via SR/DR), at which point the potential-based reward may never actually provide any signal.

Yes, the description of the reviewer is accurate. We will update Section 6.1 to more clearly state that we use the same assumptions of previous work (Wu et al., 2018; Wang et al., 2021) that we have access to the top eigenvectors prior to training the agent with the potential-based reward. For the case when we learn $\mathbf{e}$ online and train the agent simultaneously, the reviewer is right that it is not impossible that the agent learns the optimal policy before $\mathbf{e}$ is learned well enough to provide a useful signal. However, we conjecture that this would likely happen only in small environments that can be easily solved. For more complex environments, we expect the eigenvector of the SR/DR would still provide useful information while it's being learned. In our option discovery experiments, in which we do learn the eigenvectors online while learning everything else as well, we observe that the intermediate eigenvectors of the SR/DR encourage the agent to visit novel states, and the eigenvector of the DR has the additional property of avoiding low-reward regions. We foresee that given these properties, these eigenvectors are useful as evolving potential functions also for reward shaping, and we find this a promising direction for future research.

Comment

Thanks again to the authors.

Q1: This is really nice! I went over your new proof carefully, and I am almost entirely on board. Excuse me for the possibly naive questions as I am quite bogged down now, but is it obvious that the eigenvalues of the DR are all real? I believe this is an implicit assumption in your analysis of the monotonicity of that function $f$ relating DR eigenvalues to SR eigenvalues via its derivative. Perhaps this is a known fact or an obvious corollary of something. Likewise, the proof depends on the rewards being negative---is this just an assumption that needs to be made? Sorry if this was discussed somewhere in the text. Anyway, this isn't such a wild assumption, it is fine with me if it must be introduced. Besides these points, the rest of the proof looks correct to me.

Q7: Thanks for the clarification.

Comment

Re-Q1: I think this makes sense, I'm going to try to find some time to verify it exactly. In the meantime, can you comment at all on which MDPs you would expect to satisfy this property?

Summary: In revisiting our proof in response to the reviewer’s follow-up question, we realized that the condition that all eigenvalues of the DR lie on the same side of $-1/(\gamma^{-1} - \exp(-r(s)/\lambda))$ always holds. This means that our original theorem statement is, in fact, correct, although our initial reasoning was incomplete. We will update the proof in the Appendix.

In our previous response regarding the question on Theorem 3.1, we mentioned that we require all DR eigenvalues to lie on the same side of the vertical asymptote at $-1/(\gamma^{-1} - \exp(-r(s)/\lambda))$ for the order of the eigenvectors of the SR and DR to be identical. This condition was the best we could do without reasoning about the range of DR eigenvalues.

On a high level, we first derive lower and upper bounds for the eigenvalues of the DR. Recall that the position of the vertical asymptote depends on $\gamma$ and $r(s)/\lambda$. Without loss of generality, we ignore $\lambda$ here for simplicity since it can be absorbed into the reward function. We also write $r$ instead of $r(s)$ since we assume a constant reward function. We can then show that for both $\gamma > \exp(r)$ and $\gamma < \exp(r)$, the eigenvalues of the DR all lie on the same side of the asymptote.

We first reason about the range of the eigenvalues of the DR, starting with the inverse of the DR: $\mathrm{diag}(\exp(-r)) - P^{\pi_d}$. Here we refer to the $(i,j)$-th entry of $P^{\pi_d}$ as $p_{ij}$. It is not hard to see that, assuming $r < 0$, $\mathrm{diag}(\exp(-r)) - P^{\pi_d}$ is a strictly diagonally dominant matrix. Then, by Gershgorin's circle theorem, we know that this matrix has all positive eigenvalues, and thus the eigenvalues of the DR are all positive. We have then obtained a lower bound on the DR eigenvalues: $\lambda_{DR, i} > 0 \ \forall i$.

We can also obtain an upper bound on the DR eigenvalues using Gershgorin's circle theorem. For the inverse of the DR, we know that each eigenvalue $\lambda_{DR^{-1}, i}$ lies in the disc centered at $\exp(-r) - p_{ii}$ with radius $1 - p_{ii}$. Note that for rows corresponding to terminal states, the disc is centered at $\exp(-r)$ with radius 0. Given this, we can obtain a lower bound on the eigenvalues of the inverse of the DR: $\exp(-r) - p_{ii} - (1 - p_{ii}) = \exp(-r) - 1$. We then know that the upper bound on the eigenvalues of the DR is $\frac{1}{\exp(-r) - 1}$.

Given the bounds, we now show that regardless of the values of $\gamma$ and $r$, all eigenvalues always lie on the same side of the asymptote.

When $\gamma < \exp(r)$, it can be easily derived that $\frac{-1}{\gamma^{-1} - \exp(-r)} < 0$, meaning that the position of the vertical asymptote is negative. Since every DR eigenvalue is positive, we have that all eigenvalues of the DR lie on the right side of the asymptote.

When $\gamma > \exp(r)$, we can similarly show that all eigenvalues of the DR lie on the left side of the asymptote: $\gamma < 1 \implies 1/\gamma > 1 \implies 1/\gamma - \exp(-r) > 1 - \exp(-r)$. Now both sides of the inequality are negative since $\gamma > \exp(r) > 0$ and $r < 0$. Then, $\frac{1}{\gamma^{-1} - \exp(-r)} < \frac{1}{1 - \exp(-r)}$, which gives $\frac{-1}{\gamma^{-1} - \exp(-r)} > \frac{1}{\exp(-r) - 1} \geq \lambda_{DR, max}$.

For both $\gamma > \exp(r)$ and $\gamma < \exp(r)$, all DR eigenvalues lie on the same side of the asymptote, so the order of eigenvectors is preserved. Also, when $\gamma = \exp(r)$, it is obvious from Equation 6 that the order is preserved. Therefore, the order of eigenvectors of the SR and DR is always the same.
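
The two bounds above are easy to check numerically; the following sketch (our own, using a toy symmetric random walk on a ring so that the eigenvalues are real, and assuming $\mathrm{DR}^{-1} = \mathrm{diag}(\exp(-r)) - P^{\pi_d}$ as above) confirms that every DR eigenvalue lies in $(0, \frac{1}{\exp(-r)-1}]$:

```python
import numpy as np

n, r = 6, -0.2                                   # constant negative reward (lambda absorbed into r)
P = np.zeros((n, n))                             # symmetric random walk on a ring
for i in range(n):
    P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.5

DR_inv = np.exp(-r) * np.eye(n) - P              # strictly diagonally dominant since r < 0
lam_DR = 1.0 / np.linalg.eigvalsh(DR_inv)        # symmetric, so eigenvalues are real

print(np.all(lam_DR > 0))                                    # lower bound: all positive
print(np.all(lam_DR <= 1.0 / (np.exp(-r) - 1.0) + 1e-12))    # upper bound: 1 / (exp(-r) - 1)
```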

Comment

Thanks again for pointing these out. These are good questions.

Is it obvious that the eigenvalues of the DR are all real? I believe this is an implicit assumption in your analysis of the monotonicity of that function $f$ relating DR eigenvalues to SR eigenvalues via its derivative.

We will make it explicit in Theorem 3.1 that we assume the DR is symmetric. In practice, when performing the eigendecomposition of the SR, people often first symmetrize it by averaging it with its transpose. We perform a similar procedure for the DR, as described in Appendix C.1, which is why we have this implicit assumption that the eigenvalues are all real.
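For concreteness, a minimal sketch of our reading of this symmetrization step (the function name and the use of NumPy are ours, not the paper's):

```python
import numpy as np

def symmetrized_eig(D):
    # Average the (possibly non-symmetric) SR or DR with its transpose before
    # the eigendecomposition; the symmetrized matrix has a real spectrum, and
    # np.linalg.eigh returns its eigenvalues in ascending order.
    D_sym = 0.5 * (D + D.T)
    return np.linalg.eigh(D_sym)

# Usage: eigenvalues, eigenvectors = symmetrized_eig(DR)
```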

Likewise, the proof depends on the rewards being negative---is this just an assumption that needs to be made?

Yes. We will explicitly mention it in Theorem 3.1.

Comment

Thanks for the response. Though, to be clear, requiring the DR to be symmetric (or alternatively stating a result about its symmetrization) does change the implications of your theorem. I have no idea to what extent its symmetrization preserves / approximates the claimed properties of the DR. Particularly, now the theorem does not relate the eigenvectors of the DR to the SR, but rather it relates eigenvectors of their symmetrizations (which are different altogether).

Comment

We would like to clarify that when the SR and DR are not symmetric, they still share the same set of eigenvectors, because Equations 38-43 still hold. Our claim that the SR and DR span the same space is still valid. However, when eigenvalues are complex, it is not clear how to order them, so we do not see a straightforward way to reason about the order of eigenvectors.
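As a quick numerical illustration of this point, here is a sketch on a hypothetical 4-state example, assuming $\mathrm{SR} = (I - \gamma P)^{-1}$ and $\mathrm{DR} = (\exp(-r) I - P)^{-1}$ with a shared, non-symmetric $P$ and constant reward $r$; the size, seed, and constants are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, r = 4, 0.9, -0.3

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)   # row-stochastic, generally non-symmetric

SR = np.linalg.inv(np.eye(n) - gamma * P)
DR = np.linalg.inv(np.exp(-r) * np.eye(n) - P)

_, V = np.linalg.eig(P)             # columns: (possibly complex) eigenvectors of P
for name, M in (("SR", SR), ("DR", DR)):
    ok = True
    for i in range(n):
        v = V[:, i]
        lam = (v.conj() @ M @ v) / (v.conj() @ v)   # Rayleigh-quotient estimate
        ok &= np.allclose(M @ v, lam * v, atol=1e-8)
    print(f"{name} shares P's eigenvectors: {ok}")
```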

We can split the current theorem into two: 1) the first focuses on the general SR/DR and establishes that they share the same set of eigenvectors, and 2) the second focuses on the symmetric SR/DR and establishes that the eigenvectors have the same order.

Comment

Yeah, sorry I wasn't clear. I do agree that the result about the SR and the DR sharing the same set of eigenvectors is correct; I was referring specifically to the ordering for the purpose of low-rank approximations. I think it would make sense to either split the theorem in two like you said, or to keep one theorem that says "SR and DR have the same set of eigenvectors. Also, when the DR is symmetric, the eigenvectors have the same order".

Comment

Thanks for the suggestion. We will keep one theorem and clearly specify the conditions for orders to be the same.

Thanks again for all the questions and suggestions, which helped improve our paper and made it more precise. We note that the reviewer still recommends rejection; if there are further concerns, we are happy to address them.

Review
4

In this submission, the authors studied a representation similar to the popular successor representation (SR) that incorporates the rewards of the MDP. This default representation (DR) was first proposed in recent related work from neuroscience, and the authors here showed how to extend the DR to reinforcement learning. They demonstrated the relationship between the SR and the DR, and derived algorithms to learn the DR via both dynamic programming and temporal-difference learning. Furthermore, they showed the usefulness of the DR in different settings through various experiments.

Strengths and Weaknesses

Strengths:

  • The extension from the SR to the DR is an interesting direction, and this submission showed the promise of using the DR in linearly solvable MDPs.

  • The authors managed to derive efficient dynamic programming and TD learning algorithms, and to show convergence for one of them.

  • Through various experiments the authors demonstrated the versatility of the proposed method.

Weaknesses:

  • The current version focuses on a few simple domains, so it would be even more convincing if the authors could show superior performance in more complex environments, along with the extension to neural networks that the authors mentioned.

Questions

  • How did you select $\lambda$ in the DR?

  • What's the definition of "Terminal Reward Configuration" in Figure 5?

  • Is there any explanation on why the norm of the DR can be used as an exploration bonus in count-based exploration?

Limitations

yes

Final Justification

As the authors addressed my comments and questions, I will keep my score.

Formatting Issues

NA

Author Response

We thank the reviewer for the thoughtful review. We are happy that the reviewer finds our work interesting and sees it as a promising direction, with only clarifying questions remaining. Performing experiments only in the tabular setting was a conscious choice, as we discuss below.

We focused on simple environments in our experiments to fully characterize the properties of the DR, since this is the first paper in computational RL on this framework. Including results for more complicated environments using neural networks could distract from the paper's focus. We believe one has to establish the theoretical foundations and intuitions of an area before focusing on scaling it up, and, importantly, nothing in our framework suggests that it is not scalable. Also, despite focusing on simple environments, as reviewer Ldct noted, the breadth of our experiments is large, exploring several downstream applications of the DR. Nevertheless, we agree with the reviewer that extending the application of the DR to more complex environments using neural networks is an interesting direction that we plan to pursue in the future.

We now address the clarifying questions.

Q1: How did you select $\lambda$ in the DR?

Summary: We select a $\lambda$ large enough that our implementation is numerically stable.

While our algorithm is not sensitive to the value of $\lambda$, in practice, some values of $\lambda$ can lead to numerical instability. As shown by Equation 7, each entry of the DR involves taking the exponential of trajectory returns. Since we used negative rewards for all non-terminal states, the exponential of a sum of negative rewards can be very small, leading to numerical issues. As using a larger $\lambda$ has the effect of reducing the magnitude of the negative rewards in the computation, we increased $\lambda$ in increments of 0.1 until we settled on the final value of 1.3. Performance was not a consideration in this process.

We will clarify this $\lambda$ selection by adding the following paragraph to Appendix C.1:

“As a larger $\lambda$ has the effect of reducing the magnitude of negative trajectory returns in the DR (see Equation 7), it can alleviate numerical instabilities caused by taking the exponential of very negative trajectory returns. In our experiments, we started with $\lambda = 1$ and slowly increased $\lambda$ by 0.1 until we settled on $\lambda = 1.3$, which, paired with the use of python-flint, resulted in no numerical issues.”
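To make the underflow concern concrete, here is a toy illustration; the trajectory length and rewards are hypothetical and only meant to show the effect of dividing returns by $\lambda$ before exponentiating.

```python
import numpy as np

# Hypothetical trajectory return: e.g., 800 steps with reward -1 per step.
# DR entries involve exp(return / lambda); for lambda = 1 this underflows to 0
# in float64, while lambda = 1.3 keeps it representable (~1e-268).
trajectory_return = -800.0
for lam in (1.0, 1.3):
    print(f"lambda = {lam}: exp(return / lambda) = {np.exp(trajectory_return / lam)}")
```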

Q2: What's the definition of "Terminal Reward Configuration" in Figure 5?

Summary: “Terminal reward configuration” here refers to the rewards received upon entering terminal states.

Figure 5 shows the results of transfer learning experiments, where we first learn the default features and then apply them in new scenarios where only the rewards at terminal states change. As shown on the left of Figure 5, the environment has several terminal states, indicated in green. When performing transfer, we randomly sampled rewards for each of these terminal states, and we refer to the set of sampled rewards at the terminal states as the “terminal reward configuration”.

We will make this clearer by adding the phrase “where terminal state reward configuration refers to the set of sampled rewards at the terminal states” in Section 6.4.

Q3: Is there any explanation on why the norm of the DR can be used as an exploration bonus in count-based exploration?

Summary: Machado et al. (2020) showed that the norm of the SR captures state visitation, and thus can be used for count-based exploration [1]. The DR similarly captures state visitation, but does so in a way that reflects rewards.

We first provide an intuition of why the SR can be used as an exploration bonus, before explaining the case for the DR. The SR matrix is usually initialized to all zeros, and the ground-truth SR matrix has all non-negative entries. Every time a transition $(s, a, r, s')$ is observed, the SR of $s$ is updated to approach the ground-truth value. A state that is visited more is updated more, and its SR is closer to the ground-truth value. This is why the SR captures state visitation, and has been used in count-based exploration.

The DR can capture state visitation in a similar manner. Every entry of the ground-truth DR is non-negative. Entries along the diagonal have the largest magnitudes, while off-diagonal entries usually have very small magnitudes, so we chose to focus on the diagonal entries. We initialize the DR to the identity matrix (which, in practice, is an optimistic initialization), and every time state $s$ is visited, the corresponding diagonal entry of the DR matrix is updated to approach the ground-truth value from above. Therefore, the DR still captures state visitation and can encourage the agent to visit novel states. However, the DR differs from the SR in that the rate at which the ground-truth value is approached is affected by the reward. If a state has a low reward, the corresponding diagonal entry decays faster and approaches the ground-truth value at a faster rate, meaning that the exploration bonus provided to the agent shrinks faster. If a state has a high reward, the corresponding diagonal entry decays more slowly and thus, over the course of learning, provides a longer-lasting bonus for the agent. This reward-aware exploration bonus might be the reason why the DR was able to outperform the SR in count-based exploration.
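To illustrate this intuition, here is a minimal sketch (not the paper's exact algorithm): we assume a TD-style target consistent with $\mathrm{DR}^{-1} = \mathrm{diag}(\exp(-r/\lambda)) - P^{\pi_d}$, use $\lambda = 1.3$ as above, and, for simplicity, ignore the next state's contribution to the target; the step size is an arbitrary choice of ours.

```python
import numpy as np

def visit(d_ss, reward, lam=1.3, alpha=0.1, d_next=0.0):
    # One TD-style update of the diagonal entry D(s, s) after visiting s: the
    # target exp(r(s)/lambda) * (1 + D(s', s)) follows from
    # D = diag(exp(r/lambda)) (I + P D); d_next (= D(s', s)) is set to ~0 here.
    target = np.exp(reward / lam) * (1.0 + d_next)
    return d_ss + alpha * (target - d_ss)

for reward in (-0.1, -1.0):
    d = 1.0                      # optimistic (identity) initialization
    for _ in range(20):
        d = visit(d, reward)
    print(f"reward {reward}: D(s, s) after 20 visits ≈ {d:.3f}")
# The lower-reward state's diagonal entry drops toward a smaller fixed point,
# so a bonus derived from it shrinks faster with repeated visits.
```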

[1] Machado et al. Count-based exploration with the successor representation. AAAI 2020.

Final Decision

The successor representation (SR) has been widely used in reinforcement learning, for example, for state counting. The paper theoretically and empirically analyzes the default representation (DR), which, unlike the SR, also takes rewards into account.

Strengths of the paper include the analysis of the DR, which is of interest to reinforcement learning research, the theoretical proofs, the newly derived algorithms and methods, and the writing.

A weakness of the paper is the experiments' focus on simple domains, but this is acceptable given the paper's focus on analyzing the DR.

Overall, the paper provides a valuable contribution that will most likely be used in future research.

During the rebuttal period, there was a very useful discussion, mainly between reviewer Ldct and the authors, regarding the correctness of the equations in the paper. As a result of the discussion, the authors promised to modify the assumptions of the second part of Theorem 3.1. I encourage the authors to incorporate all the comments that relate to making the theoretical analysis clearer and to formally specify all the required assumptions.