PaperHub
Average Rating: 5.3 / 10 · Rejected · 4 reviewers
Ratings: 5, 5, 3, 8 (min 3, max 8, standard deviation 1.8)
Average Confidence: 3.3
ICLR 2024

RL Algorithms are Information-State Policies in the Bayes-Adaptive MDP

OpenReview | PDF
Submitted: 2023-09-24 | Updated: 2024-02-11
TL;DR

Viewing RL algorithms as hard-coded policies in BAMDPs allows a wide range of theoretical analyses, as well as insights into popular reward shaping techniques.

Abstract

Keywords
Bayesian Reinforcement Learning, BAMDPs, Reinforcement Learning Theory, Lifelong Learning, Reward Shaping, Intrinsic Motivation

Reviews and Discussion

Official Review
Rating: 5

This paper tries to describe RL algorithms in the context of BAMDPs. It expresses the regret of these algorithms in terms of the BAMDP value and decomposes that value into the incremental value of information and the value of opportunity. Finally, the regret is further analyzed using the concept of reward shaping.

Strengths

The paper provides a theoretical analysis of its main topic.

Weaknesses

The motivation and contribution of the paper are unclear. The presentation of the paper is unfocused, making it hard to follow its main concept and insights. Also, the practical value of the analysis is unclear because of the lack of practical examples or insights regarding how the analysis could be used (for example, how we can improve or design more practical RL algorithms using the insights). Since multiple concepts such as BAMDPs, RL algorithms, and reward shaping are used in the paper, it should be clear how each one is related to the others and why it is important under a main message of the paper. But it isn't, so it's confusing.

Questions

  1. The motivation and contribution of the paper are not clearly stated. It is unclear for what purpose RL algorithms are being described in the context of BAMDPs. Is it to specify a proper way to use RL algorithms for BAMDPs, or to show that RL algorithms are not suitable for BAMDPs?
  2. In a similar context, it is not clear why the paper considers BAMDP. For meta-learning, considering BAMDP is reasonable since meta-learning tries to address multiple tasks in real-world applications together, which can be modeled as different task MDPs. Are the authors arguing that real-world problems that have traditionally been modeled with MDPs are actually more suited to be modeled with BAMDPs?
  3. What precisely does "hand-written information-state policies" in the abstract stand for? The term is used only in the abstract, and there is no clear description of it in the text.
  4. The rationale for applying reward shaping in analyzing the regret is unclear. Why it is important should be justified more clearly.
Comment

We thank reviewer 6FQX for their feedback, which has helped us improve our paper. We have addressed their concerns with several edits to the writing, as indicated by the blue text in the updated manuscript. Specifically:

  • Motivation and contribution The reviewer pointed out that the motivation and contribution of the paper were unclear. We motivate BAMDPs as a way to formally specify RL problems, allowing us to analyze RL algorithms just as MDPs allow us to analyze policies. This radically different perspective makes it straightforward to derive many insights and to analyze the extent to which RL algorithms successfully navigate the explore-exploit trade-off, and it naturally has many implications for the design of RL algorithms. We have made this more concrete in the abstract and introduction to make our work's motivation and contribution clearer.
  • Unfocused presentation We have moved the less directly applicable section on potential-based shaping to the appendix to focus the story around the more practical insights.
  • Lack of practical examples We flesh out and further emphasize practical examples in our manuscript, specifically:
    • Our framework can be used to inform the design of better RL algorithms - it shows that guaranteeing convergence to the optimal policy for the underlying MDP is generally undesirable because it comes at the cost of over-exploring, and algorithms should instead be designed to explore exactly when that is expected to increase return. This is a radical departure from the mainstream view in RL of convergence to the optimal MDP policy as the gold standard. We have emphasized this more in the abstract, introduction and optimality sections.
    • We also flesh out the Negative Surprise example in the Effects of Reward Shaping section, clarifying exactly what assumptions of the algorithm and task distribution must hold for Negative Surprise rewards to be beneficial, to show how our framework provides practically useful insights for reward shaping design.
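
As an illustration of the kind of shaping term analyzed there, below is a minimal sketch of how a Negative Surprise bonus could be computed from a learned dynamics model. The diagonal-Gaussian model form, the coefficient beta, and the function names are illustrative assumptions for this sketch, not details taken from the manuscript.

```python
import numpy as np

def negative_surprise(pred_mean, pred_std, next_state):
    """log p_hat(next_state | state, action) under a diagonal Gaussian
    predictive model, i.e. the negative of the surprise of the transition."""
    var = pred_std ** 2
    return -0.5 * np.sum(np.log(2.0 * np.pi * var)
                         + (next_state - pred_mean) ** 2 / var)

def shaped_reward(env_reward, pred_mean, pred_std, next_state, beta=0.1):
    """Environment reward plus a scaled negative-surprise term: predictable
    transitions are rewarded relative to surprising ones."""
    return env_reward + beta * negative_surprise(pred_mean, pred_std, next_state)

# Hypothetical usage with outputs from a learned model:
# r_shaped = shaped_reward(r_env, model_mean, model_std, s_next, beta=0.05)
```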

We address their additional questions below.

  • Modeling RL problems with BAMDPs instead of MDPs The reviewer questioned the use of BAMDPs, rather than MDPs, to model RL problems outside of meta-learning. BAMDPs are necessary to fully specify RL problems because the underlying task MDP cannot capture the initial uncertainty and information-gathering that RL algorithms do to learn to maximize return, whereas these are baked into the BAMDP transition function and augmented state space. To reuse existing MDP theory on RL algorithms, we must frame RL problems as MDPs in which the algorithms act like policies, but RL algorithms are not valid policies in the task MDPs because they learn from experience, and thus select actions based on more than just the current task MDP state.
  • Hand-written information-state policies We have amended this vague use of language. Information-state policies are policies that act on states augmented with the information gathered thus far. Hand-written refers to the fact that they are explicitly hard-coded by the RL practitioner rather than learned or otherwise optimized.
Comment

I appreciate the authors for the great effort in responding and revising. The practical example in the revised version is reasonable, but I think it still does not directly demonstrate practical use. However, the paper has been revised to be clearer and most of my concerns and misunderstandings have been addressed. Therefore, I have increased my score.

Comment

We greatly appreciate reviewer 6FQX taking the time to consider our response and the updates to our manuscript.

We understand the remaining concern about directly demonstrating practical use. It is now clear to us that it would greatly benefit the reader to see the BAMDP value decomposition we derived applied outside of our toy Caterpillar problem, in a way that demonstrates the practical relevance to AI designers. We have fleshed out our example of how the value signaled by “Empowerment”-based intrinsic motivation (measured as mutual information) can be decomposed into two terms signaling the Value of Opportunity (represented as a negative surprise term) and the Value of Information (represented as an entropy bonus term). To highlight this usage we have moved it from the appendix up into the main text (in the Effects of Reward Shaping section), and mentioned this corollary in the introduction. We find that our decomposition provides a new perspective on the empirical behavior of “Empowerment”-driven agents, more accurately describing the observed behaviors than the prior interpretation of it motivating agents to seek states of maximum influence [1].

Specifically, Empowerment signals a positive Value of Information for exploring a wide range of actions, and a negative Value of Opportunity for unpredictable transitions. The latter component explains why we see empowerment-driven agents barricade themselves in corners to avoid danger: upon dying they respawn in random locations, which is highly unpredictable. Though this behavior looks counterintuitive for something called "Empowerment," it is directly evident when observing the value decomposition.
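
As a rough one-step sketch of the identity behind this decomposition (a one-step form for brevity; the exact expression in our manuscript may differ in details):

$$
I(S_{t+1}; A_t \mid s_t) \;=\; \underbrace{H(S_{t+1} \mid s_t)}_{\text{entropy bonus term}} \;+\; \underbrace{\mathbb{E}\big[\log p(S_{t+1} \mid s_t, A_t)\big]}_{\text{negative surprise term}},
$$

where the entropy term rewards reaching states from which a wide range of outcomes is achievable (signaling Value of Information), while the negative surprise term penalizes transitions that are unpredictable given the chosen action (signaling Value of Opportunity).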

We believe this has greatly strengthened our paper by clearly demonstrating to readers how our framework can be leveraged to gain significant novel insights on a practical example.

[1] Mohamed, S., & Jimenez Rezende, D. (2015). Variational information maximisation for intrinsically motivated reinforcement learning. Advances in neural information processing systems, 28.

Official Review
Rating: 5

This work analyses reward shaping techniques from the perspective of Bayes regret.

First, they establish notation for viewing RL algorithms as policies in the framework of the Bayes-adaptive MDP (BA-MDP), a framework which constructs a history-MDP, proposes to maintain (and predict) posteriors over the unknowns of the MDP, and whose solution will explore optimally with respect to the prior given to the agent.

The Q-values of the optimal solution to the BA-MDP are decomposed into the "incremental value of information" plus the "value of opportunity". The first value, loosely, is the utility of the knowledge (improving MDP posterior accuracy) that comes from doing an action in a state at a particular time step (through better-informed actions in the future). The latter, again informally, is the actual (long-term) utility expected under the current posterior that is gained from doing the action in the state.

Then, the paper compares the Q-values of such an optimal algorithm to those of a myopic (typical) RL algorithm which only considers the current information. Concretely, as an example, Q-learning approximates the Q-values given previous data, and not given potential future interaction based on uncertainty over the MDP. It is shown that this corresponds to maximizing only the "value of opportunity".

Finally, the paper looks at various reward shaping approaches as compensating for the difference between the Bayes-optimal Q values and those of typical RL algorithms. This analysis introduces the notion of shaping the Q signal or I (knowledge) signal. For instance, adding minor rewards for getting closer to a goal state is reward shaping the Q signal, whereas positive reward for rare occurrences such as new observations means shaping the I-signal.
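
Schematically (in my own notation, which need not match the paper's exact symbols):

$$
Q^*_{\text{BAMDP}}(\bar{s}_t, a_t) \;=\; \underbrace{I(\bar{s}_t, a_t)}_{\text{incremental value of information}} \;+\; \underbrace{O(\bar{s}_t, a_t)}_{\text{value of opportunity}},
$$

where $\bar{s}_t$ is the history-augmented (information) state. A myopic RL algorithm acts greedily with respect to an estimate of $O$ alone, and the reward shaping discussion can then be read as adding bonuses that either correct the estimate of $O$ (shaping the Q-signal) or stand in for the missing $I$ term (shaping the I-signal).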

Strengths

The paper proposes a very interesting concept, that of considering RL algorithms as policies in history-based MDPs. The resulting derivation, one where myopic (typical) RL algorithms can be seen as optimizing only one of two terms of the optimal Q-value, has much potential for understanding and framing RL.

This is important, because the exploration-exploitation problem is a core issue unique to RL and one where progress should be of interest to the majority of the RL research community. And progress, especially non-incremental progress, often requires novel insights and perspectives which, I believe, this paper has managed to find.

As a result, I find this paper a good step in a promising direction and hope to see it fleshed out a little more.

Weaknesses

Despite these strengths, my main concerns are its presentation and lack of formality (especially in the analysis) which affect the contribution significantly.

Sometimes this results in just vague statements, such as comments like "... manually programmed RL algorithms." and "... since regardless of how much meta-learning takes place, some algorithm must be written down eventually." in the introduction (and maybe even the usage of "information-state policies", which are never really defined or seemingly used technically, but pop up occasionally?). Other times, however, this is detrimental to understanding the intent. For instance, what is the "true p(M) of the problem" (before 6.2.1)? Also, it is still unclear to me how a potential-based shaping function is relevant to the paper: it does not seem important, especially since the section talks about reward shaping in the BA-MDP. Why would one want to shape the reward if a solution is already optimal with respect to exploration?

Most importantly, I feel like the key contribution is missing / not obvious in its current version. The paper consists of English text describing how various reward shaping techniques can be seen (categorized?) in different lights, but for those interested in exploration in RL there seems to be little that is novel. For example, intrinsic reward is very well known (and indeed motivated by the idea) to boost the Q-value based on the notion that "knowledge" has value.

To conclude, the definition of regret of typical RL algorithms, compared in Q-values with the Bayesian optimal solution, seems like a great step, but the actual analysis (in its current form) lacks the formality and rigor that I believe is necessary for a paper at ICLR.

Questions

N/A

Details of Ethics Concerns

N/A

Comment

We thank reviewer 7dtM for their thorough and insightful review; we are glad that they find our framework to be interesting and novel, and see the potential of this perspective for understanding RL and significantly advancing the field.

We also greatly appreciate the reviewer’s constructive feedback, which has helped us improve our paper. We have addressed their concerns with several edits, as indicated by the blue text in the updated manuscript. Specifically:

  • Vagueness/lack of formality We understand that we have not been sufficiently precise in our use of language, and have replaced or defined vague language throughout the paper, specifically:

    • Manually programmed algorithms refers to algorithms that are explicitly hard-coded by the RL practitioner rather than learned or otherwise optimized. We have removed the use of this language in the paper, as it appears to be more confusing than helpful.
    • Information state policies refers to policies that act on states augmented with the information gathered thus far (intended as a shorthand to describe BAMDP states in the abstract and introduction for readers who aren't already familiar with BAMDPs); we have added this clarification to the manuscript.
    • The true p(M) refers to the distribution of tasks that the algorithm would encounter in practice; we have replaced it with this more precise language, and also added more explanation in the definition section to clarify the meaning of p(M) when using the BAMDP to model RL problems in practice.
  • Potential based shaping We understand the reviewer’s perspective that this theory is less obviously useful because we would not need to shape a BAMDP policy that is already Bayes-optimal; preserving optimal behavior is merely a nice additional property for a reward shaping function to have, though it doesn’t tell us how effectively it would reduce the regret of a suboptimal policy. We updated this section to clarify this point, and have moved the section to the appendix to focus the story around the more practical insights. But we believe it is still a useful demonstration to the reader of an immediate application of our framework to reuse existing MDP theory for RL problems, which does not require any more definitions or formalisms beyond the BAMDP itself.

  • Key contribution and novel implications The reviewer expressed concern that the key contribution of the work is not obvious, and that many of the implications and insights derived from the framework are not novel. We have amended the abstract and introduction to make it clearer that casting RL algorithms as BAMDP policies is the key contribution, providing a radically different perspective that makes it straightforward to derive many insights and to analyze the extent to which RL algorithms successfully navigate the explore-exploit trade-off, with many natural implications for the design of RL algorithms. We highlight two novel implications here:

    • Guaranteeing convergence to the optimal policy for the underlying MDP generally comes at the cost of over-exploring, and algorithms should instead explore exactly when that is expected to increase return. This is a radical departure from the mainstream view in RL of convergence to the optimal MDP policy as the gold standard. We have fleshed this out more in the abstract, introduction and optimality sections.
    • Although intrinsic reward is indeed motivated by the general idea that knowledge has value, it was not previously possible to formalize exactly which knowledge has how much value for a given RL problem, and as a result practitioners struggle with intrinsic rewards producing counterproductive behavior because they do not assign value correctly for the problem they're used in (e.g. the noisy TV problem). The BAMDP framework allows us to formalize exactly what value intrinsic rewards should be signaling in terms of the task distribution and the properties of the learning algorithm, so we can design them more accurately. We should have made this point clearer in the Effects of Reward Shaping section; we address this by diving deeper into the Negative Surprise example there, clarifying exactly what assumptions about the algorithm and task distribution must hold for Negative Surprise rewards to be beneficial, to demonstrate the non-obvious insights our theory can provide RL practitioners trying to design appropriate shaping functions.
Comment

Thank you for the detailed response. I appreciate how it addresses my review and, in combination with the response to the reviewer above, I have incremented my score accordingly.

Overall, I find the derived decomposition provided in the paper of interest but would like to see more substantial usage of it in order to confidently vote for acceptance. On the other hand, I would not be opposed to acceptance, if other reviewers lean towards that, after the updates proposed in this response.

Comment

We greatly appreciate reviewer 7dtM taking the time to consider our response and the updates to our manuscript.

We find reviewer 7dtM’s suggestion to show a more substantial usage of our value decomposition to be very insightful. It is now clear to us that it would greatly benefit the reader to see the value decomposition applied outside of our toy Caterpillar problem in a way that demonstrates the practical relevance to AI designers. This has driven us to flesh out our example of how the value signaled by “Empowerment”-based intrinsic motivation (measured as mutual information) can be decomposed into two terms signaling the Value of Opportunity (represented as a negative surprise term) and the Value of Information (represented as an entropy bonus term). To highlight this usage we have moved it from the appendix up into the main text (in the Effects of Reward Shaping section), and mentioned this corollary in the introduction. We find that our decomposition provides a new perspective on the empirical behavior of “Empowerment”-driven agents, more accurately describing the observed behaviors than the prior interpretation of it motivating agents to seek states of maximum influence [1].

Specifically, Empowerment signals a positive Value of Information for exploring a wide range of actions, and a negative Value of Opportunity for unpredictable transitions. The latter component explains why we see empowerment-driven agents barricade themselves in corners to avoid danger: upon dying they respawn in random locations, which is highly unpredictable. Though this behavior looks counterintuitive for something called "Empowerment," it is directly evident when observing the value decomposition.

Again, we thank the reviewer for leading us to this realization and we believe this has greatly strengthened our paper by clearly demonstrating to readers how our framework can be leveraged to gain significant novel insights on a practical example.

[1] Mohamed, S., & Jimenez Rezende, D. (2015). Variational information maximisation for intrinsically motivated reinforcement learning. Advances in neural information processing systems, 28.

Official Review
Rating: 3

The paper studies reinforcement learning through the lens of Bayes-Adaptive MDPs, in which the RL algorithm is viewed as a policy of the BAMDP. The paper first provides a formal description of the setting and its components. Then, it derives a decomposition of the BAMDP value into the sum of the Incremental Value of Information, which coarsely measures the information gain over the true MDP, and the Value of Opportunity, which measures the expected return given the current information. Finally, the decomposition is used to analyse a set of reward-shaping mechanisms that have been previously considered in the RL literature, including intrinsic motivation.

Strengths

  • (Originality) The paper provides a novel and interesting view of reinforcement learning algorithms as policies over BAMDPs.
  • (Categorization) The paper provides a valuable characterization of a handful of reward-shaping approaches through the lens of the BAMDP value.

Weaknesses

  • (Motivation) Whereas the formulation is interesting, BAMDP is also known to be an intractable problem in general, so it is unclear what benefit this new perspective can bring.
  • (Implications) The paper does not fully clarify how the introduced perspective should help analyze existing RL algorithms and build more advanced algorithms in the future.

The paper provides an original interpretation of RL as the problem of solving a BAMDP, and it also reports some insights, such as the value decomposition that explicitly separates the exploration and exploitation contributions to the value. However, despite the promising formulation, the paper falls somewhat short of providing a coherent set of implications resulting from this new perspective, beyond a few informal considerations on reward-shaping methods. In my opinion, to clear the bar for acceptance this paper should narrow its scope, e.g., presenting the contribution as a study of reward shaping through the BAMDP perspective, and provide more formal/deeper implications from the analysis, such as a study of how the different shaping methods impact the Bayesian regret and under which assumptions or priors any of those methods can be considered optimal. For this reason, I am currently providing a negative evaluation of the paper, but I encourage the authors to keep working on this problem, which looks like a nice research direction to pursue.

Questions

  1. What is the point of framing the RL problem as a BAMDP, when it is well known that BAMDPs cannot be solved efficiently in general? Especially, given that provably efficient RL algorithms exist (under somewhat restrictive assumptions), does this mean that the provided formulation is missing some of the structure of the underlying problem?

  2. While setting a prior over the tasks might be reasonable in meta-RL or analogous settings, in which some knowledge of the task distribution can be collected during training, is this also reasonable in standard RL? Of course any MDP can be seen as a sample from a very uninformative prior over all the possible MDPs, but this does not seem to provide any benefit.

  3. Can the authors discuss how their framework will help produce a deeper analysis of existing RL algorithms and possibly guide the development of improved algorithms?

MINOR

  • The use of \citep and \citet in the paper is sometimes confusing. I would suggest the authors use \citet only when the papers' authors' names are part of the sentence;
  • Some choices of references are somewhat odd, such as reporting Yang et al. (2021) for regret bounds in RL. There is a plethora of papers on that topic, and perhaps a more representative reference could be chosen;
  • The itemize at the end of the Introduction is missing a full stop after the last bullet point.
Comment

We thank reviewer t2gv for their thoughtful review. We are glad that they found our formulation interesting and original, and saw the value in our use of this perspective for characterizing reward shaping functions.

We highly appreciate the reviewer’s detailed feedback, which has greatly helped us improve the paper (edits indicated by blue text in the updated manuscript). In particular:

  • Practical implications We have made the practical implications of our framework much clearer and more concrete:

    • We highlight more in the abstract, introduction and optimality sections the powerful implications our framework has for the design and development of improved RL algorithms: guaranteeing convergence to the optimal policy for the underlying MDP generally comes at the cost of over-exploring, and algorithms should instead be designed to explore exactly when gathering that additional information is expected to increase return. This is a radical departure from the mainstream view in RL of convergence to the optimal MDP policy as the gold standard.
    • We dive deeper into the Negative Surprise example in the Effects of Reward Shaping section, clarifying exactly what assumptions about the algorithm and task distribution must hold for Negative Surprise rewards to be beneficial, to demonstrate the non-obvious insights our theory can provide RL practitioners trying to select the appropriate shaping function.
  • We have also fixed the use of \citep and \citet, added a more representative reference for regret bounds, and added the full stop after the last bullet in the introduction.

  • Motivation The reviewer questioned the benefit of using the BAMDP, because BAMDPs are known to be generally intractable to solve. We should have made it clearer that we never intend to solve the BAMDP directly; we simply use it as a conceptual framework to model and analyze the behavior of RL algorithms. We don't think solving the BAMDP is necessarily a good approach to designing optimal RL algorithms; heuristic approximate RL algorithms can be very effective. The BAMDP is valuable because it allows us to formalize what these algorithms should ideally aim to approximate, clarifying when the approximations are appropriate, and giving insights into how to design them, for instance how to shape their rewards to approximate the optimal Q function. We have made this clearer in the introduction.

  • Question 1 The reviewer questioned the validity of the BAMDP formulation, citing the existence of provably efficient RL algorithms which appears incompatible with the intractability of the BAMDP. We thank the reviewer for pointing out these efficient algorithms; we have amended our statement about theoretical RL aiming to converge with unlimited interactions, and clarified our own vague use of “efficient”. We used “efficient” to mean making the optimal exploration-exploitation trade-off, so neither over-exploring past the point of diminishing returns, nor exploiting before having enough information to get high rewards. Algorithms that find the optimal MDP policy in polynomial time complexity are not necessarily efficient in this way- always finding this optimal policy often comes at the cost of over-exploring. If the discount factor were high enough that finding the optimal MDP policy is not over-exploration and if we made those same assumptions of underlying structure, then we would arrive at the same results in our framework as a special case.

  • Question 2 The reviewer questioned the assumption of a prior over MDPs in the formulation of RL problems. Our use of the word prior should have been clearer- by prior we mean the distribution of tasks the RL algorithm will encounter. We have amended the definition in the BAMDP section and changed language in other places that p(M)/the prior is mentioned to clarify this. An RL algorithm is useful insofar as it performs well on the types of tasks it’s expected to encounter in practice, while it does not matter if it performs poorly on very unrealistic tasks. RL algorithms and reward shaping functions are always designed with specific types of task in mind, and evaluated on such tasks (e.g. maze navigation, atari games, driving simulators) i.e. assumptions about the task distribution are central in the design and evaluation of virtually all RL algorithms, even though this is often not stated explicitly. The task distribution may not be expressible in closed form, but we do not need to construct and solve the BAMDP directly- we show in the Effects of Reward Shaping section how simple features of this distribution can be used to inform the design of helpful reward shaping functions.
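
As a brief worked note on how p(M) enters the framework (schematic; notation may differ from the manuscript): our analysis never requires p(M) in closed form, but conceptually the history $h_t$ induces a posterior over tasks,

$$
p(M \mid h_t) \;\propto\; p(M) \prod_{k<t} T_M(s_{k+1} \mid s_k, a_k)\, R_M(r_k \mid s_k, a_k),
$$

and the BAMDP transition and reward at the augmented state $\bar{s}_t = (s_t, h_t)$ are expectations under this posterior.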

Comment

I want to thank the authors for their detailed response and for updating the manuscript following reviewers' comments.

Can I ask them to elaborate on the following claim:

This perspective implies that RL practitioners should not design algorithms to converge to the optimal task policy, instead they should be designed to explore exactly when gathering further information is expected to maximize overall return.

Why is this not already included in the various notions of regret minimization?

I am sorry for my late reply. Hopefully, there is still enough time for a quick clarification.

Comment

We greatly appreciate reviewer t2gv taking the time to consider our response and the updates to our manuscript. We forgot to mention in our previous response that we also found the suggestion to narrow the scope of the paper to be helpful, and we have moved our section on potential-based shaping to the appendix to focus the story around the more practical insights.

Re: the reviewer's request for clarification, this implication is indeed equivalent to the notion of minimizing Bayes Regret, specifically. We should have been more precise-- adversarial or otherwise non-Bayesian regret minimization would not necessarily have the issue of sacrificing return to guarantee convergence to the optimal task policy. Its difference from Bayes Regret is that it is not aware of the distribution of tasks the designer intends to use the algorithm in, so it may still over-explore or under-exploit with respect to that.

In the claim quoted by the reviewer, we were referring to the fact that many RL practitioners, particularly in deep RL, do not consider the regret of their algorithms at all- instead using the final performance of their algorithms after training converges as their metric, reported on a few benchmark environments. Thus they only consider convergence results for their algorithms, such as for policy gradient or Q-learning style algorithms.

Although Bayes Regret is not a novel concept, this perspective is novel: solving RL problems can generally be understood as minimizing Bayes Regret, because RL algorithms are designed for and evaluated on the distribution of tasks they’re expected to encounter in practice. Our BAMDP framework provides a way to express and analyze the Bayes regret of any RL algorithm (the regret of an RL algorithm cast as a policy in the BAMDP is equivalent to its Bayes-Regret in the corresponding RL problem), explaining many findings made by RL researchers, even when not explicitly doing Bayesian RL, such as which reward shaping functions are effective in different settings. It is thus valuable for guiding RL practitioners in designing algorithms and shaping functions that intentionally and appropriately balance exploration and exploitation in the types of tasks they expect to use them on.
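
Schematically, with notation simplified relative to the manuscript: for an RL algorithm $\mathcal{A}$ viewed as a BAMDP policy,

$$
\mathrm{Regret}(\mathcal{A}) \;=\; \bar{V}^{\pi^*}(\bar{s}_0) - \bar{V}^{\mathcal{A}}(\bar{s}_0) \;=\; \mathbb{E}_{M \sim p(M)}\big[\, J_M(\pi^*) - J_M(\mathcal{A}) \,\big],
$$

where $\pi^*$ is the Bayes-optimal BAMDP policy, $\bar{s}_0$ is the initial information state, and $J_M(\cdot)$ denotes the expected lifetime return collected on task $M$. The second equality holds because the BAMDP value at $\bar{s}_0$ is exactly the expected return under the task distribution $p(M)$, which is why the BAMDP regret of an algorithm coincides with its Bayes Regret in the corresponding RL problem.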

We thank the reviewer for their question which has highlighted an aspect of our paper that should be clearer, and will revise the manuscript to clarify this.

Comment

I want to thank the authors for their further reply. I now better see their point on the Bayesian regret, although it is somewhat informal as a result. However, I think this is the right direction: Without better clarifying how the new framework can bring benefit to algorithm design or analysis, this will remain a nice perspective without clear practical use.

Unfortunately, given the previous concern, I am keeping my original score. While I think the paper is not ready for publication as is, it is an original and interesting direction, which I encourage the authors to keep following.

Comment

We are very thankful to reviewer t2gv for their time considering our response. It would be helpful to know more specifically how we could clarify the benefits brought by our framework to algorithm design or analysis. It would greatly aid in our continued efforts to improve our paper.

Specifically, it would be very helpful in our future revisions to know how our changes demonstrating the use of our framework to design negative surprise rewards under different assumptions of the task distribution, and to more accurately describe the observed behaviors of empowerment-driven agents, have been inadequate in showing these benefits.

Official Review
Rating: 8

This paper proposes a new framework for understanding reinforcement learning (RL) algorithms as policies in Bayes-adaptive Markov decision processes (BAMDPs).

A BAMDP is a species of "meta" sequential decision-making process that -- given a set of candidate tasks (represented as MDPs) within which the true task lives as well as a prior over those tasks -- formalizes the problem of balancing acquiring information about the task being solved with maximizing its lifelong expected return. Unlike policies for single tasks/MDPs that seek to myopically maximize rewards under a fixed policy without accounting for policy changes based on future information gain, a BAMDP policy can be viewed as a procedure for updating policies over time based on the task information they generate.

In light of the foregoing, the paper describes how RL algorithms can be viewed as BAMDP policies and investigates useful consequences of doing so. Specifically:

  • standard, "myopic" RL algorithms are formally characterized within the BAMDP framework;
  • the regret of such standard RL algorithms with respect to Bayes-optimal solutions to the BAMDP is characterized;
  • a decomposition of the regret into two familiar components corresponding to information gain (exploration) and value improvement (exploitation) is provided;
  • a characterization of potential-based reward shaping in the context of BAMDPs and its relationship to reward shaping of standard, myopic RL algorithms is given.

Strengths

This paper provides a creative, insightful new formal framework for reasoning about RL algorithms as policies in BAMDPs. The work is theoretical in nature and will be of significant interest to the RL theory community. The connections it draws between reward shaping in the BAMDP realm and its effects on standard, myopic RL algorithms may provide a catalyst for new developments in the experimental RL community as well. The paper is primarily a conceptual work synthesizing existing ideas (BAMDPs with RL algorithms) and providing new perspectives. The theoretical results are straightforward with no serious mathematical heavy lifting required, but they are clear and insightful. The paper is well-written and the recurring example pictured in Fig. 1 is used effectively to provide concrete illustration of the main concepts throughout.

Weaknesses

The primary weakness of this paper is the absence of a direct practical application of the ideas developed. This is natural, given the conceptual nature and theoretical focus of the work, but some experimental evidence supporting the main ideas would make the paper much more accessible to the experimental RL community. For example, it would be very helpful to have numerical experiments illustrating key aspects of the discussion of reward shaping provided in Sec. 6 on a non-trivial problem.

Questions

  • why is $R(r_t | s_t, a_t)$ included in the definition of $\bar{T}$ in bullet point four of the BAMDP definition on page 3? Can you clarify how $\bar{T}(\cdot | \bar{s}_t, a_t)$ remains a pdf/pmf?
  • the analysis in the appendix seems to consist primarily of proving Theorem 3.1 and providing computations for the caterpillar example; are there any key innovations in the analysis that you consider worth highlighting?
  • do you have any ideas for how the discussion in Sec. 6 might be illustrated experimentally?
Comment

We thank reviewer h87r for their generally positive assessment of our work and its writing. We’re glad they found our perspective to be creative and insightful, and saw the significant benefits of our framework to the RL theory community, as well as the value the characterization of reward shaping can bring to RL practitioners.

We understand their concern about the need for more explicit, direct practical applications of the ideas, so we updated the manuscript to show clearer and more concrete practical implications. Specifically,

  • In the abstract, introduction and optimality sections we highlight a powerful implication our work has for the design of RL algorithms, i.e. that exploring until finding the optimal policy for the underlying MDP is not the optimal approach to the problem itself, and algorithms should instead explore if and only if further information is expected to maximize overall return. This is a radical departure from the mainstream view in RL of convergence to the optimal MDP policy as the gold standard.
  • In the Effects of Reward Shaping section we have expanded on the negative surprise reward shaping example, to show more concretely how an RL practitioner could determine when it is beneficial- i.e., when the vast majority of surprise correlates well with negative outcomes on the distribution of trajectories the agent actually experiences.

We address their questions in detail below:

  • $R(r_t | s_t, a_t)$ is included in the BAMDP transition function because the history component $h_t$ of the BAMDP state includes all the rewards received over the agent's trajectory. This captures how the RL algorithm learns the reward function $R$ from its interactions with the task MDP. The transition function remains a pdf because the product $T(s_{t+1} | s_t, a_t) R(r_t | s_t, a_t)$ is equal to the joint probability of transitioning to MDP state $s_{t+1}$ and receiving reward $r_t$. A schematic form of this transition is sketched after this list.
  • The main innovation of this paper was this new conceptual perspective, the use of which makes analysis of RL algorithms very straightforward. We updated the manuscript to emphasize this point.
  • Our analysis in the Effects of Reward Shaping section should be predictive of the experimental performance of agents, for each type of shaping function and corresponding conditions on the task distributions described- e.g. adding prediction error rewards will improve performance in problems where the prediction error of an observation is an accurate signal for the value of that observation’s information content. We updated the manuscript fleshing out the negative prediction error example in this section to make this more concrete.
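
For concreteness, a schematic form of the BAMDP transition referred to in the first bullet above (marginalizing the task posterior given the history; the exact definition in the paper may differ in how the history update is written) is

$$
\bar{T}\big(\bar{s}_{t+1} \mid \bar{s}_t, a_t\big) \;=\; \mathbb{E}_{M \sim p(M \mid h_t)}\big[\, T_M(s_{t+1} \mid s_t, a_t)\, R_M(r_t \mid s_t, a_t) \,\big], \qquad \bar{s}_t = (s_t, h_t),
$$

with $\bar{s}_{t+1} = (s_{t+1}, h_t \oplus (a_t, r_t, s_{t+1}))$. This sums (or integrates) to one over the pairs $(s_{t+1}, r_t)$, since for each task $M$ the product $T_M \cdot R_M$ is itself a joint distribution over the next state and reward.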
Comment

Thanks to the authors for their responses to my questions. My concerns have been addressed and I will keep my score.

Comment

We greatly appreciate all the reviewers' insightful comments.

A consistent concern among reviewers was demonstration of practical implications and benefits of our framework to algorithm design and analysis. We have taken this to heart, fleshing out concrete examples demonstrating the use of our framework to design Negative Surprise shaping rewards under different assumptions of the task distribution, and to more accurately explain the observed behaviors of Empowerment-driven agents.

We are very thankful for the reviewers taking the time to leave such constructive feedback and engage in discussion with us, which has helped us significantly improve our paper.

AC Meta-Review

This paper proposes the original approach of viewing RL algorithms as hand-written information-state policies for the BAMDP, and derives a number of insights from this approach.

Reviewers have reported the following strengths:

  • Originality of the idea;
  • Interest to the community;
  • Straightforward but interesting theoretical results.

Reviewers have reported the following weaknesses:

  • Lack of motivation;
  • Unclear implications.

Decision

This paper proposes a novel approach that has been deemed very original by all Reviewers. However, the majority of Reviewers expressed concerns about the lack of motivation for the proposed approach, together with concerns about its unclear implications in RL. The authors are encouraged to improve their work clarifying these aspects in a future submission.

Why Not a Higher Score

N/A

Why Not a Lower Score

N/A

Final Decision

Reject