PaperHub
Average rating: 3.4 / 10 · Decision: Rejected
5 reviewers · Ratings: 3, 1, 3, 5, 5 (min 1, max 5, std. dev. 1.5)
ICLR 2024

Non-ergodicity in reinforcement learning: robustness via ergodic transformations

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

Optimizing the long-term performance instead of the expected accumulated reward enables learning of robust policies in non-ergodic environments.

Abstract

Keywords
Reinforcement learning, Ergodicity, Reward transformation

Reviews and Discussion

Review
Rating: 3

This paper exposes the difficulties of dealing with non-ergodic reward sequences in RL. In particular, a simple example is given which shows that such non-ergodic settings can result in policies with high expected value, but low returns with probability 1. The authors propose a method for transforming reward signals to be ergodic, and demonstrate that policies optimizing the transformed reward tend to earn much more reward than policies optimizing the original objective. Then, the authors present a method for learning this transformation from data, and demonstrate the benefits of doing so in familiar RL benchmarks.
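
As an illustration of the kind of mismatch described above, here is a minimal simulation of a multiplicative coin-toss game; the +50% / -40% payoffs are the standard ergodicity-economics parameterization and are an assumption here, not necessarily the paper's exact setup.

```python
import numpy as np

# Multiplicative coin-toss game: each round, wealth is multiplied by 1.5 (heads)
# or 0.6 (tails) with equal probability. These payoff numbers are assumed
# (the classic ergodicity-economics example), not taken from the paper.
rng = np.random.default_rng(0)
n_agents, n_steps = 100_000, 100
wealth = np.ones(n_agents)
for _ in range(n_steps):
    wealth *= np.where(rng.random(n_agents) < 0.5, 1.5, 0.6)

# Ensemble (expected-value) perspective: E[wealth] = 1.05**n_steps, i.e. it grows.
print("theoretical expected wealth:", 1.05 ** n_steps)
# Time-average (typical-trajectory) perspective: the median decays like 0.9487**n_steps.
print("median wealth over agents:  ", np.median(wealth))
print("fraction of agents that lost money:", np.mean(wealth < 1.0))
```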

Strengths

The paper, for the most part, is extremely well written and easy to follow (except for a few points mentioned in the Questions section). The problems with optimizing the expected return are often neglected in RL, and this paper offers a fresh perspective on why this can be problematic, which is convincing. The proposed method is novel, and appears to work well.

Weaknesses

My main concern with this paper is that it is not clear that the reward transformation that is proposed is generally useful in RL settings. Firstly, as I mention in the Questions section, it seems unlikely that a reasonable reward transformation can be learned in general without access to a sufficiently good policy or sufficient exploratory data. Moreover, given access to this data, I suspect that other transformations not based on ergodic increments can lead to superior performance (e.g., what if you learn a value function from the data and use that as the reward function in the test phase)?

The paper does not clearly define what it means for reward increments to be ergodic. While I suppose the reader is meant to infer that reward increments are ergodic if the resulting return is ergodic, this should be stated more clearly. It would also help to have some more intuition about the nature of ergodic increments.

Questions

In definition 2, what is the variance function? Is the integral in this equation an integral over the space of random variables?

I am a little skeptical about the proposed framework for learning the ergodic transformation given at the end of section 4. As I understand it, the proposal is to first collect a bunch of trajectories using some default strategy, then learn the ergodic transformations based on the rewards from those trajectories, and finally perform policy optimization w.r.t. the reward function transformed by the learned transformation. I can see how this works with the coin game, since here the states are essentially the returns, and the dynamics are extremely simple. Crucially, it is almost trivial to 'explore' the rewards in this setting -- by setting $F=1$, the agent will have experienced all behaviors of the reward function within a few steps. This is because the reward function is essentially the same at every state, up to a factor. When you train an agent on the transformed reward function, the distribution of observed states will likely be very different (hopefully, otherwise the transformation isn't helping much), in which case I would not expect the reward transformation to generalize properly to the new trajectories in most cases. Are some assumptions needed?

Is there any reason to expect such a large performance improvement in Reacher? Given that without the transformation, the agent gets basically no rewards, how is it even conceivable that the learned transformation is helpful here?

Minor Issues

In Figure 1, it would be a lot more clear if the red and blue lines were more distinguished from the sample paths, for instance if they were dashed lines.

Comment

We thank the reviewer for the positive evaluation of our contributions and for acknowledging the “fresh perspective” our paper provides.

Problem description. We have thoroughly revised the problem setting section and think that it is now clearer how ergodicity connects to RL and, especially, how the ergodicity of rewards relates to that of returns.

Variance function. We follow the standard definition of the variance function. For instance, from https://en.wikipedia.org/wiki/Variance_function: a smooth function that depicts the variance of a random quantity as a function of its mean. We have now added this definition to the paper.
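
For reference, in symbols (our notation, not necessarily that of Definition 2 in the paper), the variance function $v$ of a family of random variables $Y$ is the smooth map relating the variance to the mean:

```latex
\operatorname{Var}(Y) \;=\; v\big(\mathbb{E}[Y]\big).
```

For example, $v(\mu) = \mu$ for a Poisson variable, while for multiplicative, geometric-Brownian-motion-like dynamics the increment variance scales as $v(\mu) \propto \mu^{2}$.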

Contribution. The algorithm we propose in this paper is a first step toward an issue that will require much further study to resolve completely. As we show in the paper, state-of-the-art RL methods already fail on a very simple example. This indicates that, for non-ergodic environments, if we care about the long-term performance of agents, the expected value might not be the best optimization objective. We then propose a transformation that resolves the problem in the simple example, and we demonstrate that it extends to more challenging problem settings. REINFORCE is a Monte-Carlo-based algorithm: it collects a reward trajectory and uses it to update its weights. Thus, in every episode we receive a reward trajectory, can learn a transformation, transform the returns, and optimize based on transformed return increments. In Monte-Carlo-type algorithms, our transformation can therefore already be used. Clearly, an extension to an incremental transformation would be important to apply it more broadly.
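
To make the described pipeline concrete, here is a minimal sketch of a REINFORCE-style update on transformed return increments; the trajectory layout and the `transform` and `grad_log_pi` callables are placeholders introduced for illustration, not the paper's implementation.

```python
import numpy as np

def reinforce_step_on_transformed_increments(theta, trajectory, transform,
                                             grad_log_pi, lr=1e-2):
    """One Monte-Carlo policy-gradient step in which the per-step learning signal
    is the increment of the transformed cumulative reward, h(R(t_k)) - h(R(t_{k-1})).
    Sketch only: `trajectory` is a list of (state, action, reward) tuples,
    `transform` is a learned monotone transformation h acting elementwise on arrays,
    and `grad_log_pi` returns the score function -- all assumptions, not the
    paper's exact interfaces."""
    rewards = np.array([r for (_, _, r) in trajectory], dtype=float)
    returns = np.concatenate([[0.0], np.cumsum(rewards)])   # R(t_0)=0, ..., R(T)
    increments = np.diff(transform(returns))                # h(R(t_k)) - h(R(t_{k-1}))
    reward_to_go = np.cumsum(increments[::-1])[::-1]        # vanilla REINFORCE credit
    for (state, action, _), g in zip(trajectory, reward_to_go):
        theta = theta + lr * g * grad_log_pi(theta, state, action)
    return theta
```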

Experiments. In the Reacher example, the agent also receives rewards at the beginning, even though they are low. This is apparently enough to get at least a rough idea of the dynamics, and the transformation can then provide a better optimization objective than the standard expected value.

Figure 1. We followed your suggestion and now show the red and blue lines as dashed lines to distinguish them more clearly from the others.

Review
Rating: 1

The paper discusses non-ergodicity in reinforcement learning and proposes a method that avoids departing from standard RL methodology by converting a time series of rewards into a time series with ergodic increments. The paper includes various examples of how and where such problems arise, thus motivating the importance of studying them.

Strengths

The paper studies an important problem which seems to have been somewhat underexplored despite there being some works on the subject. The vision of the paper is supported by some empirical evidence showing improved performance when applying the technique proposed by the paper.

Weaknesses

The paper has several notable weaknesses:

Structure. The structure of the paper seems quite disorderly, since the flow is interrupted by examples that seem misplaced as well as by an unexpected detour into risk-sensitivity. It is also unclear which results the authors regard as their main novel contributions and which are of lesser importance.

Motivation. In the introduction, the definition of ergodicity is not clearly introduced. For example, ergodicity can be defined as the property that a process is stationary and every invariant random variable of the process is almost surely equal to a constant. Additionally, as the authors discuss, non-ergodic reward functions have been studied within RL --- the authors write “none of these works, as a consequence of non-ergodicity, question the use of the expectation operator in the objective function”, which seems a little vague. For these reasons, the work seems poorly motivated, and it is unclear why the examples given are not resolved by other existing approaches such as risk-sensitive objectives (though these approaches are discussed).

Formal details. The paper at times lacks precision and consistency in its formal details: for example, it begins with a discrete-time analysis and then makes a diversion into continuous time. Additionally, some formalities are missing or unclear; for example, the transition dynamics are not specified, and it is unclear why a subindex is needed on the time index.

There is also an awkward transition to a Markov decision setting with reward dynamics being governed by a stochastic difference equation. Without further discussion it is unclear how restrictive this is since the standard MDP/RL setup does not require conditions similar to this.

Empirical evaluation. The empirical evaluation lacks comparisons to a broader range of techniques.

Questions

  1. What restrictions do we get from imposing the reward expression in Equation 3?

  2. Can the authors clarify the contribution beyond what has already been studied toward the goal of RL with non-ergodic reward functions, e.g., what is new compared to Majeed and Hutter (2018)?

  3. How does this approach compare with risk-sensitive RL?

Comment

Thank you for critically reviewing our paper and acknowledging the importance of the problem we address.

Problem description. We have thoroughly revised the problem setting section to define the concept of ergodicity more clearly. We use a subscript $k$ on the time $t$ to clarify that we are considering discrete time steps and to distinguish it from the continuous-time analysis we carry out when comparing our contributions to risk-sensitive reinforcement learning.

Connections to related work. The main contribution of this paper is raising the question of ergodicity of reward functions in RL and how it impacts the use of the expected value as an optimization objective, as well as proposing a first approach toward resolving the problem. Other approaches that we are aware of consider non-ergodicity of MDPs but do not make the connection to the divergence between expected rewards and long-term rewards in such settings. Majeed and Hutter prove that the Q-learning algorithm still converges under non-ergodic dynamics. In the coin-toss game, the PPO algorithm also converges, and it converges toward the policy that yields the maximum expected return. However, this policy yields a return of 0 in the long run with probability 1. Convergence itself does not necessarily help if the optimization objective is the problem.

Risk-sensitive RL. As we discuss in Section 5, for a specific class of reward functions (logarithmic), the risk-sensitive transformation is an ergodicity transformation. For such settings, the risk-sensitive transformation does exactly what we propose to do in this paper. This is not the case for other classes of reward functions, which is intuitive: a fixed transformation should only work well for one class of reward functions, not for all potential classes one could think of. We applied the risk-sensitive transformation to the coin-toss game, whose dynamics are exponential, i.e., the opposite of what the risk-sensitive transformation would need. In this case, the transformation fails.

Comment

I would like to thank the authors for their responses.

Although I think the paper presents an interesting line of inquiry, my perspective is that the current set of results and presentation doesn't yet meet the threshold for acceptance at this venue. I also think the ideas presented hold potential for a good publication in the future after the analysis has been developed to include much deeper results and a more streamlined presentation.

Comment

We thank the reviewer for emphasizing the potential of our contributions. In this paper, we identify a problem that so far seems unaddressed and even unnoticed in reinforcement learning. Through an intuitive example, we show the implications that neglecting non-ergodicity has for state-of-the-art RL algorithms. We then develop a Monte Carlo algorithm that can solve the problem. We further provide theoretical foundations for the area of risk-sensitive RL. Overall, this is a significant set of contributions. The presentation of the paper, in particular, has been positively recognized by other reviewers.

Review
Rating: 3

This paper studies the viability of dealing with non-ergodicity in reinforcement learning (RL) by transforming potentially non-ergodic reward processes generated during learning into ergodic reward processes. Unlike most existing works on RL theory, which typically frame ergodicity as a property of the Markov chain induced by a given policy applied to the underlying Markov decision process (MDP), this paper considers a certain notion of ergodicity of the overall reward process, independent of the policy used. A recurring example based on a coin toss betting problem is presented to illustrate what is meant by non-ergodic rewards, a general class of ergodic transformation and a procedure for approximately learning the transformation are proposed, a derivation is provided linking recent formulations of risk-sensitive RL to the proposed framework, and experiments illustrating benefits of the proposed transformations on the coin flip problem and two RL benchmarks (Cartpole and Reacher) are provided.

Strengths

The paper considers the issue of non-ergodicity in RL from an interesting perspective: it is novel and potentially practically useful, especially in light of applications of RL to finance and economics, to directly consider non-ergodicity in induced Markov reward processes. The improvements experimentally observed on the coin toss betting problem suggest the method has merit and the connection to risk-aware RL bears further study.

Weaknesses

Though the paper presents an interesting perspective on non-ergodicity in RL, there are serious issues in the proposed approach:

  1. The proposed notion of (non-)ergodicity and how it relates to the underlying sequential decision-making problem is not clearly defined, seriously undermining the proposed approach and its potential relevance to existing work. This is a major drawback and my most serious concern. RL is typically viewed as a family of methods for (approximately) solving MDPs, where the goal is to find a policy -- a mapping from states to distributions over actions -- maximizing some notion of expected (discounted, average, or total) reward. For a fixed policy, a Markov chain (and corresponding Markov reward process) is induced over the underlying MDP. In RL, (non-)ergodicity is a property of these induced Markov chains and reward processes, so specifying a policy is necessary before one can consider (non-)ergodicity. In the paper, these issues are not discussed, resulting in lack of clarity at several critical points detailed in the questions section below. A critical issue is left unaddressed in the paper: what is the relationship between the (non-)ergodicity of the reward process and the (non-)ergodicity of the underlying sequential decision-making process?
  2. The variance-stabilizing transform developed in Sec. 4, though interesting, is insufficiently motivated. Importantly, the mechanism whereby it produces an ergodic, transformed reward sequence is unclear. This weakens support for the validity of the approach.
  3. Sec. 5 points out an interesting connection between risk-sensitive RL and the transformation proposed in Sec. 4, yet it remains unclear for what classes of problems (i.e., what classes of MDPs, rewards, and policies) the Itô process of Eq. (4) is a reasonable model for the reward process.
  4. The experiments are limited: performance improvements over baseline methods are observed only on the simple coin toss betting problem and Reacher; comparable performance to baseline is observed on Cartpole.

Questions

Some specific questions:

  • how does your notion of (non-)ergodicity relate to that considered in (Puterman, 2014), (Sutton & Barto, 2018), and many recent works on RL theory?
  • on page 2, the reward function is defined both as $R : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ and $R(T) = \sum_{t_k=0}^{T} r(t_k)$, which appear incompatible; which is correct?
  • what is the definition of $t_k$ in Eq. (1)? if $t_k$ denotes the timestep, then what is $r(t_k)$ and how does it relate to $\mathcal{S}$ and $\mathcal{A}$?
  • what is the expectation in Eq. (2) taken w.r.t.?
  • what is the action space in Sec. 2.1? what policies were used in the experiment in Fig. 1?
  • what is $T$ on the LHS of the equation in Def. 1 on page 2? what is the role of policies in Def. 1?
  • what is the role of actions and/or policies in Eq. (3)?
  • why is the variance-stabilizing transform proposed in Sec. 4 an ergodic transform, even if only approximately?
  • under what conditions on the underlying problem is Eq. (4) a good reward model?
  • the last two paragraphs of Sec. 5 are deflating; is the proposed ergodic transform not useful for risk-aware RL?
  • the test phases in Fig. 4 appear to perform comparably for both methods -- what does this mean?

Comment

We thank the reviewer for acknowledging the novelty and potential of our approach.

Ergodicity notions. In the RL literature, ergodicity is typically defined as the existence of a finite $N$ such that any state in the Markov chain can be reached from any other state in at most $N$ steps. We could also understand the coin-toss game as a Markov chain, where the different states in the chain represent the “wealth” of the agent. As long as we consider finite episode lengths, a finite path exists from any starting state to any other state in the chain. Nevertheless, in the limit, the agent will end up in an absorbing state (wealth equal to zero) with probability one. Thus, in the limit, due to the existence of an absorbing state, the game would also be non-ergodic from the MDP perspective. On the other hand, a non-ergodic MDP, which, for instance, has an absorbing state that we reach with non-zero probability, should in general also be non-ergodic from the reward perspective, because in the infinite-time limit we will end up in the absorbing state with probability 1. The two definitions would not coincide if, for instance, we defined the reward dynamics themselves to follow a geometric Brownian motion while the agent moves in a grid world in which all states can be reached, or if we had a grid world with an absorbing state in which all states, including the absorbing one, give the same reward. Typically, however, ergodicity of the MDP should imply ergodicity of the reward function and vice versa.

Notation. We agree that the notation was not clear enough in the submitted version, and we have thoroughly revised it, also defining more clearly what we mean by ergodicity and how exactly it relates to RL. The reward typically depends on the current state and action, i.e., we have $r(a(t_k), s(t_k))$. We abbreviate this for conciseness as $r(t_k)$, as we now define in the paper.

Expectation. Given a stochastic environment, such as in the coin-toss game, a policy induces a stochastic process. The expected return then gives us a measure of how good this policy is. So conventionally, in RL, we would maximize over policies to find the one with the highest expected return. In that sense, we take the expected value over trajectories induced by a certain policy.
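
In standard notation (ours here; the paper's Eq. (2) may differ in details), this reads

```latex
J(\pi) \;=\; \mathbb{E}_{\tau \sim p_{\pi}}\!\big[ R(T) \big],
\qquad
p_{\pi}(\tau) \;=\; p(s_{t_0}) \prod_{k=0}^{K-1}
\pi\big(a_{t_k} \mid s_{t_k}\big)\, p\big(s_{t_{k+1}} \mid s_{t_k}, a_{t_k}\big),
```

i.e., the expectation is taken over trajectories $\tau$ of $K$ steps sampled from the distribution induced by the policy and the environment dynamics.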

Action space. In Section 2.1, the action space is the discrete decision of playing or not playing the game. Since not playing it results in no change in wealth, in Figure 1, we show the dynamics if one decides to play the game.

Variance-stabilizing transform. The problem that this work attempts to solve is how to estimate a function that transforms a given time series into one with ergodic increments. There is no established measure of how ergodic a time series is, so we instead use a proxy metric, that is, a metric that is easier to measure and still yields a transformed time series with ergodic increments. This motivates our choice of the variance-stabilizing transform: assuming the incremental mean has no underlying trend, if the incremental variance of a random process is constant, then the process has ergodic increments. Of course, there is a class of random processes that have ergodic increments but varying incremental variance; a consequence of our proxy-measure approach is that we cannot make any inferences about these processes.
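
In symbols, the textbook variance-stabilizing construction behind this argument, sketched under the assumption that the increment variance depends only on the current return level $x$, is

```latex
h(x) \;=\; \int^{x} \frac{\mathrm{d}u}{\sqrt{v(u)}},
\qquad
\operatorname{Var}\!\big[\, h(R_{k+1}) - h(R_k) \,\big|\, R_k = x \,\big] \;\approx\; \text{const}.
```

For multiplicative dynamics with $v(x) \propto x^{2}$, this gives $h(x) = \log x$ up to affine terms, which is exactly the transformation that renders the increments of the coin-toss returns ergodic.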

Itô process. Equation (4) assumes that the reward function is an Itô process. This is a very general class of stochastic process, as the two functions $f$ and $g$ can be nonlinear or even stochastic. It is also used more as an analysis tool: we state for which underlying Itô process the risk-sensitive transformation would be an ergodicity transformation and find that we can gain valuable insights from this perspective.
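
For concreteness, written in generic form (our symbols; the paper's Eq. (4) may differ), an Itô reward process and the effect of a transformation $u$ via Itô's lemma are

```latex
\mathrm{d}r_t = f(r_t)\,\mathrm{d}t + g(r_t)\,\mathrm{d}W_t,
\qquad
\mathrm{d}u(r_t) = \Big( f(r_t)\,u'(r_t) + \tfrac{1}{2}\,g(r_t)^{2}\,u''(r_t) \Big)\mathrm{d}t
                   + g(r_t)\,u'(r_t)\,\mathrm{d}W_t .
```

For geometric Brownian motion, $f(r) = \mu r$ and $g(r) = \sigma r$, so choosing $u = \log$ gives $\mathrm{d}\log r_t = (\mu - \tfrac{1}{2}\sigma^{2})\,\mathrm{d}t + \sigma\,\mathrm{d}W_t$, whose increments are stationary and ergodic.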

Experiments. In the experiments, we indeed see that the improvements for the cart-pole are only modest. This implies either that the reward function is ergodic or that the difference between the time average and the expected value is relatively small, so it does not show clearly in the experiments. For the Reacher, on the other hand, we see a clear improvement.

Comment

Thank you to the authors for their response. While some of my concerns have been addressed, my main concerns remain regarding the lack of clarity in how the ergodicity of the reward process and of the underlying decision process are related, insufficient justification for the variance-stabilizing transform, and the absence of specific conditions or concrete examples where the Itô process is an appropriate reward model. I will therefore keep my score.

Review
Rating: 5

This paper begins by addressing the distinction between non-ergodicity and ergodicity of reward functions within the context of Reinforcement Learning (RL). It proceeds to introduce an algorithm for transforming rewards, particularly effective in non-ergodic reward environments. Additionally, the paper explores how optimizing time-average growth rates instead of expected values relates to discount factors. The proposed approach is supported by a concise experimental section for validation.

Strengths

  • The paper is well written for the most part; however, there are a few inaccuracies in the figure descriptions that should be addressed.

  • The concept of transforming rewards from non-ergodicity to ergodicity is intriguing and appears to be well-founded.

Weaknesses

  • Could the author clarify the meaning of 't_k'? If it represents the 'k'-th step within one episode, then the use of 'R(t)' in Figure 4(a) should be reconsidered, as this should represent the cumulative rewards of each episode. It is essential for the author to provide a clearer explanation regarding the interpretation of the vertical axis.

  • The author's analysis focuses solely on the case of an exponential distribution. However, given the multitude of reward functions in simulations or real environments, it becomes challenging to determine whether they exhibit ergodicity. A broader exploration of different reward functions would enhance the paper's comprehensiveness.

Questions

  • Could the author provide a more detailed explanation for the observed difference in improvement between the Reacher and Cartpole environments? Does this suggest that the reward dynamics in Cartpole are ergodic, while they are non-ergodic in Reacher?

  • In the Reacher environment, it's noted that the reward in the testing phase is significantly lower than during training. Could the author clarify whether there are differences in the parameters between Reacher's test and training environments?

  • It would be beneficial if the author could elaborate on the specific scenarios in which this reward transformation is most beneficial. Is it guaranteed to yield a better policy when applied?

Comment

We thank the reviewer for the overall positive evaluation of our paper and for acknowledging our core contribution as “intriguing and well-founded.”

Notation. The variable $t_k$ in our paper indeed represents the time step. We added the index $k$ to clearly distinguish it from the $t$ in Section 5, where we consider a continuous-time setting. Figure 4 shows the return of the cart-pole and Reacher experiments at $t_k = T$ over the number of episodes. The parameter $T$ is defined in the problem setting as the duration of an episode.

Different distributions. We focus our analysis on one particular example for which we know that the rewards are non-ergodic, and we show that state-of-the-art reinforcement learning algorithms cannot solve it despite the simple dynamics. The fact that even such a simple example can pose severe challenges for established methods should be a good reason to re-evaluate the use of the expectation operator in the highly complex environments where RL is usually applied. In Section 6, we discuss in more detail for which types of reward functions our ideas are most important. Apart from non-ergodic stochastic processes, we particularly highlight environments with an absorbing barrier. Such settings often arise in safe learning, where reaching a forbidden state, e.g., a robot crashing into an obstacle, automatically ends the episode. A risky policy, for which there is a non-zero probability of reaching the forbidden state, might, when rolled out many times for a short duration, yield a high reward on average, as it only sometimes crashes. However, if we let the same policy run for a long time, it will almost surely eventually reach the forbidden state. Thus, here too, the time average and the expected value do not coincide, and, for the sake of robustness, we might prefer to optimize the time average.
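
The arithmetic behind the absorbing-barrier argument is simple: if the risky policy has some fixed per-step probability $p > 0$ of reaching the forbidden state, then

```latex
\Pr[\text{no crash within } T \text{ steps}] \;=\; (1 - p)^{T} \xrightarrow[T \to \infty]{} 0,
```

so a single long rollout almost surely ends in the absorbing state and the time-average reward reflects that, while the expectation over many short rollouts can remain high.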

Cart-pole example. In the cart-pole environment, the differences between the runs are only modest. This suggests that the cart-pole environment is either ergodic or that non-ergodicity plays a less significant role here. For a non-ergodic process, the time average and the expected value differ, but they may still be close to each other. For the Reacher, this apparently does not hold. The difference between testing and training arises because we change the link length, as we mention in the corresponding section.

Benefits of our method. The ergodicity transformation ensures that we maximize the long-term growth rate of the return. For an ergodic reward function, this is identical to the expected growth rate, so using or not using the transformation makes no difference. For a non-ergodic reward function, this no longer holds. It then depends on whether we care more about the long-term performance or about the performance averaged over many trials. In the former case, the ergodicity transformation is beneficial.

Review
Rating: 5

The paper makes an interesting point: while RL practice aims at maximizing the expected value of the accumulated reward, this objective is only guaranteed to be meaningful if the system is ergodic, and this is not the case even in a rather simple toy stochastic setting (a coin-toss game). Ergodicity, i.e., the property that the average accumulated reward over the statistical ensemble of infinitely many trajectories agrees with the average over a single but infinitely long trajectory, is thus defined as a desirable attribute of such systems. In rather simple, hand-crafted settings it is possible, thanks to prior work, to find transformations of the payoffs such that the new system is ergodic. However, since this approach cannot be guaranteed beyond such settings, the paper next opts for simpler-to-satisfy proxies such as variance/second-moment-stabilizing transforms, which have also been considered in prior work. Adapting the payoffs of a PPO agent accordingly in the coin-toss game improves its performance. The paper then examines connections between ergodicity and risk aversion, based on the assumption that a specific class of risk-sensitive transformations extracts an ergodic observable. A calculation shows that these assumptions are valid only if the rewards grow logarithmically in time, which does not hold for the guiding example of the coin-toss game, which has exponential costs. The paper ends with another couple of simple RL settings (e.g., Cartpole and Reacher) showing that the ergodic transformation can improve the performance of vanilla REINFORCE.
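
Formalizing the definition quoted in this summary (our notation), ergodicity of the reward observable means that the single-trajectory time average and the ensemble average coincide almost surely:

```latex
\lim_{T \to \infty} \frac{1}{T} \sum_{t_k = 0}^{T} r(t_k)
\;=\;
\lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t_k = 0}^{T} r(t_k) \Big]
\quad \text{almost surely.}
```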

Strengths

The paper brings interesting ideas from the recently evolving area of ergodicity economics to RL. It is an intriguing connection that will most likely be a totally novel perspective for most of the ICLR audience. The argument made by the authors is easy to follow, with simple examples to build intuition. The presentation requires effectively no prerequisites in ergodic theory, random dynamical systems, or even RL.

Weaknesses

On the other hand, counterbalancing the easy-to-read nature of the paper, at times I feel the paper makes choices that reduce its value as a stand-alone piece of work. For example, the paper keeps referencing recent works by Peters et al. as an important precursor, if not originator, of the major ideas in the current paper. Not being an expert in this line of research, I am not very confident about the added value of this line of work. Similarly, despite the promises about the advantages of the technique, its RL applications seem somewhat underwhelming and, as the authors themselves point out, constitute more of a proof of concept than a smoking gun that this technique actually delivers. The risk-sensitive transformation is not applied successfully in any setting.

Questions

Q1. In your approach you only consider a plain summation of costs instead of a discounted summation, which I think, as you report in the paper, results in numerical issues. Is this approach applicable in the standard case of discounted rewards?

Q2. Even if ergodicity holds, this implies that an idealized infinite-length orbit captures the statistics of the whole system. But real-life samples are of course finite, and without discounting, trajectories have to be "cut" artificially to avoid overflows, even if the system could go on forever. What conditions allow for fast convergence guarantees, and would we expect these conditions to be a good match for RL settings?

Q3. The current approaches seem to fit either the case of exponentially fast-increasing rewards or logarithmically slow-increasing rewards. What about in-between cases such as linear/polynomial rewards?

Q4. Which are the thorniest obstacles when scaling these ideas to more complex RL benchmarks?

Comment

We thank the reviewer for highlighting the “totally novel perspective” that our paper might bring to most of the ICLR audience. This is indeed where we see a major contribution of this work: posing the question of non-ergodicity in reward functions and asking what the alternatives to relying on the maximization of expected rewards could then be.

Application of risk-sensitive transformation. The risk-sensitive transformation we discuss in the paper has been applied in various works, for instance, Prashanth et al. 2022, Noorani et al. 2022, and Fei et al. 2021. As this transformation is not a contribution of our paper, we did not provide a successful example ourselves. Instead, we wanted to show that the risk-sensitive transformation is an ergodicity transformation given a specific form of reward function. However, when the reward dynamics are very different from this form, as in the coin-toss game, such fixed transformations fail. Therefore, a more generic approach, such as the transformation we propose, is required.

Discounted rewards. Using discounted rewards does not pose an issue, and for the REINFORCE examples, we also used a discount factor, with which the transformation works equally well. As we point out in the conclusions, the ergodicity perspective can also motivate or rather explain certain choices of discount factors and how they decrease over time. Here, we focused on the relationship between (non-)ergodicity and the expectation operator. However, examining more thoroughly which insights this perspective can bring to the choice of discount factor will clearly be a topic for further research.

Infinite length of trajectories. When choosing a policy that optimizes the expected value, we choose a policy that would be optimal if we rolled it out infinitely often, which, in practice, we can never do. Similarly, when optimizing the time average, we choose the policy that would be optimal if we rolled it out once but for infinite time. This, too, we cannot do in practice. Thus, in both cases, what we optimize differs from what we do in practice. The question is which one is closer: are we, in the end, interested in what happens on average when we try the policy many times, or when we try it once but for a long time? Depending on the answer, we should choose our optimization objective. We are also not arguing to abandon the expected value completely. In a swarm-robotics setting, the average performance over all agents in the swarm might be what we care about; then optimizing the expected value would be a good choice.

Dynamics. Exponential dynamics are nice to consider in this setting since a widely used stochastic process exhibits such dynamics (geometric Brownian motion), and we can find closed-form solutions. The logarithmic dynamics come from a transformation widely used in risk-sensitive reinforcement learning. Nevertheless, for other dynamics, such transformations can be found as well. In the paper of Peters and Adamou from 2018, the authors also investigated square-root dynamics, for which a transformation can be found in a similar way.

Challenges. We list some challenges in the conclusions of the paper. Our current transformation requires a trajectory from which it can learn the transformation. This limits us to Monte-Carlo-type approaches. Extending this to an incremental transformation and then integrating it into the RL algorithm is what we would see as a challenging but essential next step.

Comment

I thank the authors for their response.

Although I definitely appreciate the authors' willingness to share this intriguing perspective with the ICLR community, I feel that the current level of contribution does not feel significant enough to make it through in such a highly competitive ML venue. I would advise the authors to either forward their paper as is to a slightly less competitive venue that maybe explicitly encourages blue sky ideas or develop some of the aspects of the paper in more detail (e.g. ...the ergodicity perspective can also motivate or rather explain certain choices of discount factors and how they decrease over time). Right now these feel more like plausibility arguments, which although intriguing, do not feel sufficient.

Comment

We thank the reviewer for emphasizing the intriguing nature of our contributions again. In this paper, we identify a problem that so far seems unaddressed and even unnoticed in reinforcement learning. Through an intuitive example, we show the implications neglecting non-ergodicity has for state-of-the-art RL algorithms. We then develop a Monte Carlo algorithm that can solve the problem. We further provide theoretical foundations for the area of risk-sensitive RL. Overall, this goes significantly beyond "plausibility arguments."

Comment

The plausibility comment was in regard to the statement that "the ergodicity perspective can also motivate or rather explain certain choices of discount factors and how they decrease over time." This is not a line of inquiry that is pursued in the current paper. In fact, there is no discussion of discount factors at all. If the ergodicity perspective can indeed provide insights into such parameter tuning for RL, then it would be intriguing to see this analyzed clearly.

I thank the authors again for their responses, but I will keep my score as is. Once again, I appreciate and am supportive of this line of work and very much hope to see an expanded version of it in the near future; however, in its current state this paper is not quite ready for publication at ICLR.

AC Meta-Review

The reviewers are in agreement that the motivation, technical consistency, and contributions are unclear. For instance, the jump between continuous and discrete time, the jump between exploration and risk sensitivity, and the lack of theoretical basis for why the proposed techniques enable improved performance in practice are all barriers to acceptance at this time.

Why not a higher score

There is a lack of clarity on several critical points in the manuscript, and the reviewers are clear that it is not ready for acceptance at this time.

Why not a lower score

NA

Final Decision

Reject