Online Laplacian-Based Representation Learning in Reinforcement Learning
We propose an online method for learning the Laplacian representation in reinforcement learning, and show theoretically and empirically that it converges.
Abstract
Reviews and Discussion
This paper builds on an important observation from Klissarov & Machado (2023): learning the representation in an online manner provides better exploration and total rewards. Motivated by this, the author proposes an online optimization formulation of GDO and gives some theoretical analysis.
Strengths
This paper extends the existing setting of Laplacian representation to the online optimization framework and provides a convergence analysis.
Weaknesses
The theoretical results are not satisfying and over-simplified.
- Assumption 1 is not realistic for the reinforcement learning scenario. It is known that the optimal policy could be a deterministic policy, which means that the stationary distribution mass at some states could be zero in many deterministic environments. The relaxation discussed in the following paragraph doesn't ensure that it is non-zero. A possible solution is to add a small uniform perturbation to all policies.
- Proposition 1 is a relatively loose upper bound, which nearly removes all dependence on the time $t$. The main motivation of this paper is the benefit of the online optimization framework; however, this motivation is not reflected in the final convergence analysis. This is because the only connection to the time $t$ is simply omitted in this step.
- The theoretical analysis doesn't bring anything new to this field. It simply uses the smoothness of the objective function and follows the same steps as standard optimization papers.
- Moreover, Eq. (44) doesn't imply the claimed upper bound: it contains the drift summation term, which the author simply assumes is finite in Assumption 2. But this assumption has never been made in the TRPO and PPO papers. Making such an assumption requires the learning rate of the policy-based algorithm to be appropriately chosen. The author didn't reflect the impact of this assumption in the final bound when presenting the convergence rate.
Questions
I hope the author addresses my concerns in the weakness section.
We thank the reviewer for their detailed feedback and thoughtful suggestions. Below we address their concerns and questions.
Assumption 1
The first part of Assumption 1 is a standard assumption in RL theory, particularly in works analyzing policy evaluation and optimization in Markov Decision Processes (MDPs) (e.g., Puterman, 2014; Bertsekas and Tsitsiklis, 1996). Moreover, ergodicity (irreducibility with positive recurrence) is often assumed to ensure that the chain does not get stuck in transient or absorbing states under the learned policies (e.g., Melo et al., 2008; Even-Dar et al., 2009). While these assumptions may not hold for arbitrary policies, they are reasonable when considering the restricted class of policies produced by the RL algorithm, as explicitly stated in our paper.
Additionally, we acknowledge the reviewer's observation that the assumption may not hold for some learning algorithms. However, many RL algorithms are explicitly designed to promote diverse exploration, making this assumption plausible. For instance, policy-gradient-based methods with entropy regularization (e.g., Schulman et al., 2017), value-based methods with optimism in the face of uncertainty (e.g., Strehl, 2007; Azar et al., 2017), or, as the reviewer noted, methods that randomly sample actions with a small probability inherently drive the policies towards sufficient state coverage. These mechanisms align with our assumption that the policies produced by the RL algorithm induce a stationary distribution with non-zero entries for all states.
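As a toy illustration of this point (a small chain MDP with made-up sizes, separate from the settings in the paper), mixing a deterministic policy with a small uniform component turns an absorbing chain into one whose stationary distribution puts strictly positive mass on every state:

```python
# Minimal sketch on a toy chain MDP (names and sizes are illustrative, not
# taken from the paper): a deterministic "always move left" policy absorbs
# at state 0, while an eps-uniform perturbation of it induces an ergodic
# chain with strictly positive stationary mass at every state.
import numpy as np

n_states, n_actions, eps = 6, 2, 0.2

# Deterministic chain dynamics: action 0 steps left, action 1 steps right.
P = np.zeros((n_states, n_actions, n_states))
for s in range(n_states):
    P[s, 0, max(s - 1, 0)] = 1.0
    P[s, 1, min(s + 1, n_states - 1)] = 1.0

pi_det = np.zeros((n_states, n_actions))
pi_det[:, 0] = 1.0                                   # always move left
pi_mix = (1 - eps) * pi_det + eps / n_actions        # eps-uniform perturbation

def stationary(policy):
    """Stationary distribution of the state chain induced by the policy."""
    P_pi = np.einsum("sa,sap->sp", policy, P)        # P_pi[s, s'] = sum_a pi(a|s) P(s'|s,a)
    eigvals, eigvecs = np.linalg.eig(P_pi.T)
    v = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return v / v.sum()

print("min mass, deterministic policy:", stationary(pi_det).min())   # ~0
print("min mass, perturbed policy:   ", stationary(pi_mix).min())    # > 0
```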
This work serves as an initial exploration into the online learning of Laplacian representations, which is currently an open question; we adopt certain (possibly restrictive) assumptions to establish a solid foundation for the theoretical analysis and results. We acknowledge the limitations imposed by these assumptions and plan to explore their relaxation in future work to enhance the practicality and broader applicability of our framework.
Proposition 1
In Proposition 1, we prove the Lipschitz continuity of AGDO for a fixed policy, meaning AGDO at a specific time step. This property is established for any generic policy at any fixed time step, making it independent of the time $t$. The dependence on $t$ comes from the drift bound and appears in both Lemma 2 and Theorem 2.
Theoretical Analysis
While the proof of ergodic convergence follows similar techniques from the stochastic optimization literature, we provide assumptions and proofs for the necessary conditions that make these techniques applicable to this particular problem. For instance, we start by showing that the $d$ smallest eigenvectors are the unique stable equilibrium of AGDO. Additionally:
- We define the assumptions necessary for the convergence of online AGDO.
- We rigorously analyze the properties of the online objective under these assumptions, including characterizing the drift in the objective and establishing its Lipschitz continuity.
- Leveraging these properties, we prove that online AGDO achieves ergodic convergence, drawing on established techniques from the literature on the optimization of smooth objectives in the presence of drift.
Our empirical and theoretical analyses also support that these assumptions and properties are necessary. For example, we demonstrate that algorithms such as Q-learning-based approaches—which often cause significant shifts in the action distribution and are commonly used in option discovery—may not be well-suited for learning Laplacian representations in an online setting.
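To give a schematic sense of the kind of online optimization loop being analyzed, the following toy sketch tracks the $d$ smallest Laplacian eigenvectors of a slowly drifting graph with plain gradient descent on a generic graph-drawing loss plus a soft orthonormality penalty. It is only a simplified stand-in with made-up sizes and a hand-crafted drift, not our AGDO: the symmetric penalty here recovers only the span of the eigenvectors, whereas asymmetric weightings of the kind used in GDO/AGDO are designed to recover the individual eigenvectors.

```python
# Schematic sketch only (not the paper's AGDO): plain gradient descent
# tracking the d smallest Laplacian eigenvectors of a slowly drifting graph,
# using a generic graph-drawing loss with a soft orthonormality penalty.
import numpy as np

n, d = 12, 3                 # number of states (path graph), embedding dim
beta, lr, T = 3.0, 0.05, 5000

def path_laplacian(w):
    """Laplacian of a path graph with edge weights w (length n-1)."""
    W = np.zeros((n, n))
    i = np.arange(n - 1)
    W[i, i + 1] = w
    W[i + 1, i] = w
    return np.diag(W.sum(axis=1)) - W

def grad(U, L):
    # Gradient of tr(U^T L U) + beta * ||U^T U - I||_F^2 with respect to U.
    return 2.0 * L @ U + 4.0 * beta * U @ (U.T @ U - np.eye(d))

rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(n, d))

for t in range(T):
    # Slowly drifting edge weights stand in for the policy updates that keep
    # changing the transition graph in the online setting.
    w = 1.0 + 0.3 * np.sin(5e-4 * t + np.arange(n - 1))
    L = path_laplacian(w)
    U -= lr * grad(U, L)

# Compare the learned subspace with the d smallest eigenvectors of the final L.
_, V = np.linalg.eigh(L)
Q, _ = np.linalg.qr(U)
cosines = np.linalg.svd(V[:, :d].T @ Q, compute_uv=False)
print("principal-angle cosines (1.0 = perfectly aligned):", np.round(cosines, 3))
```

With a small enough step size relative to the drift, the learned subspace tracks the moving bottom-$d$ eigenspace; this is the flavor of behavior that the ergodic-convergence analysis formalizes.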
Theorem 2 and Assumption 2
We agree with the reviewer that Assumption 2 is not universally satisfied by all reinforcement learning algorithms. However, as the reviewer noted, it can be achieved in certain algorithms with appropriate hyperparameter selection.
The first part of the assumption, which requires the policy not to deviate significantly after each update, is satisfied by algorithms like TRPO and PPO. Additionally, the condition that the drift summation is bounded by a finite constant is met whenever the learning algorithm converges to a stable policy, which can be ensured through proper hyperparameter tuning. For example, TRPO with entropy regularization has been shown to have convergence guarantees under mild assumptions (Shani et al., 2020).
Although the drift summation term can be hidden within the big-O notation since it is constant, we agree with the reviewer that explicitly including the full term in Theorem 2 enhances clarity and conveys the complete message. We have updated the manuscript to reflect this improvement.
- Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014 (Section 8.3).
- Bertsekas, D. P., and J. N. Tsitsiklis. "Neuro-dynamic programming." Athena Scientific (1996) (Section 6.7.6).
- Melo, Francisco S., Sean P. Meyn, and M. Isabel Ribeiro. "An analysis of reinforcement learning with function approximation." Proceedings of the 25th international conference on Machine learning. 2008.
- Even-Dar, Eyal, Sham M. Kakade, and Yishay Mansour. "Online Markov decision processes." Mathematics of Operations Research 34.3 (2009): 726-736.
- Strehl, Alexander L. Probably approximately correct (PAC) exploration in reinforcement learning. Diss. Rutgers University-Graduate School-New Brunswick, 2007.
- Azar, Mohammad Gheshlaghi, Ian Osband, and Rémi Munos. "Minimax regret bounds for reinforcement learning." International conference on machine learning. PMLR, 2017.
- Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
- Shani, Lior, Yonathan Efroni, and Shie Mannor. "Adaptive trust region policy optimization: Global convergence and faster rates for regularized mdps." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020.
I appreciate the author's response. Let me present my concerns more clearly:
- I agree that the first part of Assumption 1 is standard. However, I would like to further highlight two situations: (a) $\pi(a \mid s) > 0$ for all $s$ and $a$; and (b) there exists a constant $c > 0$ such that $\pi(a \mid s) \geq c$ for all $s$ and $a$.
The second condition would be strictly stronger. The reference (Melo et al., 2008) says that ergodicity holds for every policy with $\pi(a \mid s) > 0$, which is case (a), not (b). Consider a two-action policy that selects one action with probability $\epsilon$ and the other with probability $1 - \epsilon$. Then for all $\epsilon > 0$, the chain is ergodic. However, the limiting one with $\epsilon = 0$ is not.
- I also agree that there are many tricks to restrict the policy class to make Assumption 1 possible, but such a restriction will cause additional approximation error in the theoretical analysis, which has not been reflected in the proof.
- Assumption 2: with a typical diminishing learning rate for the policy-based algorithm, the drift summation grows with $T$, so your convergence rate should carry this term explicitly rather than treating it as a constant. I strongly disagree that the author simply assumes this term is finite, while it potentially grows as $T$ increases.
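To make this concrete under illustrative assumptions (policy parameters $\theta_t$ updated with bounded gradient norm $G$ and step size $\eta_t = \eta/\sqrt{t}$, none of which are taken from the paper):

$$\sum_{t=1}^{T}\left\|\theta_{t+1}-\theta_t\right\| \;\le\; G\sum_{t=1}^{T}\eta_t \;=\; G\eta\sum_{t=1}^{T}\frac{1}{\sqrt{t}} \;\le\; 2G\eta\sqrt{T},$$

so, without further assumptions, the accumulated drift can grow on the order of $\sqrt{T}$ (sublinear but unbounded) rather than being a finite constant that can be absorbed into the final rate.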
We appreciate the reviewer's further clarifications and quick response.
- We agree that (b) is more restrictive. However, note that we require the stationary state distribution to be bounded below by some $\delta > 0$, not the action probabilities. This can be achieved even if $\pi(a \mid s) = 0$ for some $s$ and $a$, given some restrictions on the underlying MDP. For example, if there is some action-failure probability with which the intended action fails and a random action is executed instead, a corresponding $\delta$ can be computed.
We have also updated our discussion on the relaxation of the assumption to only require that if a state has zero stationary mass under the current policy, then it also has zero mass under subsequent policies, but not necessarily the opposite. For this scenario, our analysis would apply to the set of states with positive stationary mass.
- The tricks discussed, such as entropy regularization and added exploration noise, are known to involve some tradeoff on downstream tasks, yet they are common techniques for exploration and for achieving faster convergence to a possibly sub-optimal policy. In this study, the noise introduced by such methods should not affect the learned representation, as we only require the algorithm to converge to some policy.
- We agree with the reviewer that the total drift depends on $T$. We have updated both Assumption 2 and Theorem 2. The updated assumption requires the policy learning algorithm to converge to some policy while generating a total drift that is sublinear in $T$. We also updated the bound in Theorem 2 to include that dependence.
This work considers the problem of online representation learning in MDPs. Building on past work that uses the standard Laplacian graph embedding of the weighted graph induced by the Markov chain associated with a given policy as a representation of the MDP, this paper introduces a new objective that allows for learning when the policy is being updated, as occurs during training of an RL agent. More precisely, the authors introduce the Asymmetric Graph Drawing Objective, which seeks to produce a graph embedding from the Laplacian of the graph induced at step $t$ of some fixed RL algorithm. Under assumptions on reachability and bounded drift, as well as technical assumptions on the graph structure, the authors then prove that the desired Laplacian embedding is the unique stable equilibrium under gradient descent dynamics for a fixed time $t$. The authors then demonstrate convergence to an approximate first-order stationary point under the bounded drift assumption on the RL algorithm. The authors conclude with an empirical analysis of their algorithm on Gridworld, where they compare their proposal with the earlier Augmented Lagrangian Laplacian Objective as well as conducting an ablation study to empirically examine the necessity of their assumptions.
Strengths
This paper introduces a novel variant of a previous representation learning objective that is simple and intuitive. They demonstrate that running gradient descent on their objective has nice properties under several technical assumptions and demonstrate empirically the efficacy of their method.
Weaknesses
I think that there are three main weaknesses:
- I think that Assumption 1 is extremely strong. Ergodicity of the Markov chain induced by a policy is already a strong assumption, and a lower bound on the stationary distribution's mass at any given state is even more so, but the part I think is strongest is that this is essentially an all-policy requirement. While reachability assumptions are often invoked in RL theory, they are usually state-dependent in the sense that for every state, there is some policy that does a good job of reaching that state; the current assumption is that every state is visited sufficiently frequently by every policy, which seems very difficult to guarantee. Note that the authors do not literally assume this, as their Assumption 1 is restricted to policies produced by the RL algorithm, but it is very difficult to control what policies an arbitrary RL algorithm will visit. The paragraph below this assumption allows for some slight weakening, where Assumption 1 only has to hold on the support of the initial stationary distribution, but this still seems to impose an extremely strong restriction on the ability of the RL algorithm to explore, especially as I expect that this lower bound would figure polynomially into any quantitative bound (as it indeed does in Lemma 2(d)).
- While it is nice that the authors demonstrate convergence to the Laplacian embedding, and I understand that this embedding is frequently used in visualizing graphs, it would be nice if the authors provided more motivation for why this is a reasonable objective for which to strive.
- The experiments seem to suggest that AGDO and ALLO behave extremely similarly; for example, the GridRoom-4 line in Figure 3 looks like the same curve in both (a) and (b). If this is the case, I am confused as to what the empirical takeaway is here and what the purpose of using AGDO over ALLO is.
Questions
See weaknesses.
We appreciate the reviewer's thoughtful comments and critiques of our manuscript. Below, we address the reviewer's questions and concerns.
Assumption 1
- Ergodicity: The first part of Assumption 1 is a standard assumption in RL theory, particularly in works analyzing policy evaluation and optimization in MDPs (e.g., Puterman, 2014, Bertsekas and Tsitsiklis, 1996). Moreover, ergodicity is often assumed to ensure that the chain does not get stuck in transient or absorbing states under the learned policies (e.g., Melo et al., 2008, Even-Dar et al., 2009). While these assumptions may not hold for arbitrary policies, they are reasonable when considering the restricted class of policies produced by the RL algorithm, as stated in our paper.
- Lower Bound of the Stationary Distribution: We acknowledge the reviewer's observation that the lower bound on the stationary distribution mass in Assumption 1 plays a role in the quantitative bounds, particularly in Lemma 2(d). These bounds depend on the stationary distribution and its minimal mass to ensure finite-sample guarantees. Similar dependencies appear in prior works that derive finite-time performance bounds for RL algorithms (e.g., Ortner, 2020; Metelli et al., 2023), where the mass of the stationary distribution or a related measure (e.g., the mixing time of the policy-induced Markov chain) appears as a scaling factor. Such dependencies are a common trade-off for achieving rigorous theoretical guarantees in RL settings. This work serves as an initial exploration into the online learning of Laplacian representations, which is currently an open question; we adopt certain (possibly restrictive) assumptions to establish a solid foundation for theoretical analysis and results. We acknowledge the limitations imposed by these assumptions and plan to explore their relaxation in future work to enhance the practicality and broader applicability of our framework.
- Example Settings: The reviewer rightly notes that our assumption applies only to policies produced by the RL algorithm, not all possible policies. While controlling the policies an RL algorithm visits during training is challenging, many RL algorithms are explicitly designed to promote diverse exploration, making this assumption plausible. For instance, policy-gradient-based methods with entropy regularization (e.g., Schulman et al., 2017) or value-based methods with optimism in the face of uncertainty (e.g., Strehl, 2007; Azar et al., 2017) inherently drive the policies towards sufficient state coverage. These mechanisms align with our assumption that the policies produced by the RL algorithm induce a stationary distribution with non-zero entries for all states in the support of the initial distribution.
Importance of the Online Learning of the Laplacian Representation
The Laplacian representation given a fixed policy has found success in reward-shaping (Wu et al., 2018), and options learning (Jinnai et al., 2019; Chen et al., 2024). Additionally, in the case of online learning of the representation, the work by Klissarov and Machado (2023) provides valuable insights into the benefits of using online representations for learning options compared to static representations. Their empirical results demonstrate the effectiveness of this approach across a variety of continuous tasks. However, the theoretical guarantees and the accuracy of the learned representations remain unexplored. In this work, we address this gap by establishing a theoretical foundation for the online learning of Laplacian representations, a method already shown to be effective for downstream tasks in prior empirical studies, providing further insights into the necessary conditions for theoretical guarantees of accurate learning.
As noted in the conclusion, our results lay the groundwork for broader applications. These include tasks like option discovery, reward shaping, and value function approximation, all supported by the theoretical guarantees provided by our analysis. We agree with the reviewer that studying the effect of this algorithm on downstream tasks is important and can be the direction of future work.
Similarity to ALLO
The result is intended to show that AGDO, a simpler objective, can achieve similar results to ALLO. AGDO is equivalent to ALLO with its eigenvalue-learning terms removed. As noted by Gomez et al., this simplification is suitable when learning eigenvalues is unnecessary. However, their analysis primarily focused on this simplified objective for learning representations associated with a static policy, with convergence assessed solely through empirical validation. In this work, we adopt AGDO for analyzing convergence in the online learning context. We begin by showing, both theoretically and empirically, that the $d$ smallest eigenvectors are the unique stable equilibrium of AGDO. AGDO enables a more practical analysis of the online drift without relying on the eigenvalue drift.
References
- Ortner, Ronald. "Regret bounds for reinforcement learning via markov chain concentration." Journal of Artificial Intelligence Research 67 (2020): 115-128.
- Metelli, Alberto Maria, Mirco Mutti, and Marcello Restelli. "A tale of sampling and estimation in discounted reinforcement learning." In International Conference on Artificial Intelligence and Statistics, pp. 4575-4601. PMLR, 2023.
- Azar, Mohammad Gheshlaghi, Ian Osband, and Rémi Munos. "Minimax regret bounds for reinforcement learning." International conference on machine learning. PMLR, 2017.
- Jin, Chi, et al. "Provably efficient reinforcement learning with linear function approximation." Conference on learning theory. PMLR, 2020.
- Puterman, Martin L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014 (Section 8.3).
- Bertsekas, D. P., and J. N. Tsitsiklis. "Neuro-dynamic programming." Athena Scientific (1996) (Section 6.7.6).
- Melo, Francisco S., Sean P. Meyn, and M. Isabel Ribeiro. "An analysis of reinforcement learning with function approximation." Proceedings of the 25th international conference on Machine learning. 2008.
- Even-Dar, Eyal, Sham M. Kakade, and Yishay Mansour. "Online Markov decision processes." Mathematics of Operations Research 34.3 (2009): 726-736.
- Sidford, Aaron, et al. "Near-optimal time and sample complexities for solving Markov decision processes with a generative model." Advances in Neural Information Processing Systems 31 (2018).
- Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
- Strehl, Alexander L. Probably approximately correct (PAC) exploration in reinforcement learning. Diss. Rutgers University-Graduate School-New Brunswick, 2007.
- Martin Klissarov and Marlos C. Machado. Deep Laplacian-based options for temporally-extended exploration. In Proceedings of the 40th International Conference on Machine Learning, pp.17198–17217. PMLR, July 2023. URL https://proceedings.mlr.press/v202/klissarov23a.html.
- Jiayu Chen, Vaneet Aggarwal, and Tian Lan. A unified algorithm framework for unsupervised discovery of skills based on determinantal point process. Advances in Neural Information Processing Systems, 36, 2024.
- Yuu Jinnai, Jee Won Park, Marlos C Machado, and George Konidaris. Exploration in reinforcement learning with deep covering options. In International Conference on Learning Representations, 2019.
- Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp.2295–2304. PMLR, 2017a.
- Yifan Wu, George Tucker, and Ofir Nachum. The Laplacian in RL: Learning representations with efficient approximations. arXiv preprint arXiv:1810.04586, 2018.
Thank you for the response. Can you point out where in Metelli et al. the authors assume a uniform lower bound on the mass at any state of any stationary distribution? From a cursory reading, it appears that their bounds depend on spectral gaps, which bound mixing times but do not imply such a lower bound. I agree that Ortner makes use of this assumption, but I maintain that this is extremely strong. Modern RL theory involving episodic MDPs does make reluctant use of reachability assumptions (e.g., Kinematic State Abstraction and Provably Efficient Rich-Observation Reinforcement Learning by Misra et al., 2019, and Model-free representation learning and exploration in low-rank MDPs by Modi et al., 2021, among many, many others), but again these assumptions allow the reaching policy to depend on the state, which is substantially weaker. Moreover, you say:
While controlling the policies an RL algorithm visits during training is challenging, many RL algorithms are explicitly designed to promote diverse exploration, making this assumption plausible. For instance, policy-gradient-based methods with entropy regularization (e.g., (Schulman et al., 2017)) or value-based methods with optimism in face of uncertainty (e.g., (Strehl, 2007), (Azar et al., 2017)) inherently drive the policies towards sufficient state coverage.
I am not sure I agree that this assumption is plausible, especially near the end of training; indeed, in MDPs where there are 'bad' states that always give poor reward (e.g., a walker falling over), the trained policy will explicitly avoid those states, and this assumption will fail to be satisfied, even by those algorithms that initially encourage sufficiently diverse exploration.
We thank the reviewer for their quick response and further discussion.
- We mentioned Metelli et al. as an example of theoretical bounds that depend on the properties of the Markov chain induced by the policy. As the reviewer stated, their work depends on the spectral properties of the induced Markov chain and the mixing time.
- We agree that it is possible for potentially dangerous states to be completely avoided by a good policy. We have updated our discussion on the relaxation of Assumption 1. It now states that if a state has zero stationary mass under the current policy, then it also has zero mass under subsequent policies, but not necessarily the opposite. This allows the policy learning to eventually drive the stationary mass at such states to zero, and our analysis would apply to the set of states with positive stationary mass. This relaxed assumption only requires that if a state is disconnected under a policy, it remains disconnected under future updates.
While I thank the authors for their clarification, I maintain my original score as I remain concerned about the strength of this assumption.
This work studies learning the policy and the Laplacian representation of the transition dynamics simultaneously in online RL. The authors propose an asymmetric graph drawing objective and show that online projected gradient descent on this objective achieves ergodic convergence.
Strengths
The proposed method enables updating the Laplacian representation simultaneously with the policy. The online PGD algorithm enjoys theoretical convergence guarantees.
Weaknesses
The problem of learning Laplacian representation in RL doesn't seem to be a super important question itself, because it doesn't really scale up to real-world, challenging settings where the state space is large. The experiments focus on a slightly toy environment and only evaluate the accuracy of the Laplacian learning, without demonstrating how online Laplacian representation learning can benefit RL (more than previous methods).
Questions
How does online Laplacian representation learning benefit RL (more than previous methods) in practical RL applications?
We thank the reviewer for their feedback and for pointing out areas that require clarification. We would also appreciate any further feedback on which areas can be improved to enhance soundness and presentation. Below, we address the reviewer's concern.
Laplacian Representation Importance
Learning the Laplacian representation in reinforcement learning has been a widely studied topic, for example in the work by Mahadevan and Maggioni (2007) and, more recently, the works of Wu et al. (2018), Wang et al. (2021), and Gomez et al. (2023). These works study how to accurately learn the Laplacian representation corresponding to some fixed policy. In addition, as we highlight in the introduction section, the Laplacian representation has found success in reward shaping (Wu et al., 2018) and option learning (Machado et al., 2017a; Jinnai et al., 2019; Chen et al., 2024). We believe this topic is particularly interesting for the readership of ICLR.
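For concreteness, the sketch below computes this representation for a uniform random walk on a small gridworld: each state's embedding is given by the $d$ smallest eigenvectors of the graph Laplacian of the state-transition graph. Sizes and names are illustrative, not those of our experiments.

```python
# Illustrative sketch: the Laplacian representation of a small gridworld
# under a uniform random walk -- state embeddings are the d smallest
# eigenvectors of the graph Laplacian of the 4-connected transition graph.
import numpy as np

rows, cols, d = 5, 4, 3
n = rows * cols

def idx(r, c):
    return r * cols + c

# Adjacency of the 4-connected grid (action labels are ignored here).
A = np.zeros((n, n))
for r in range(rows):
    for c in range(cols):
        for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                A[idx(r, c), idx(rr, cc)] = 1.0

L = np.diag(A.sum(axis=1)) - A          # combinatorial graph Laplacian
eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
phi = eigvecs[:, :d]                    # each row is a d-dim state embedding

print("smallest eigenvalues:", np.round(eigvals[:d], 3))
print("embedding of state (0, 0):", np.round(phi[idx(0, 0)], 3))
```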
Usefulness of Online Representation Learning
Here we discuss the importance of learning the representation online compared to learning the representation for a static policy. The work by Klissarov and Machado (2023) provides valuable insights into the benefits of using online representations for learning macro-extended actions (options) compared to static representations learned from a uniform policy. Their empirical results demonstrate the effectiveness of this approach across a variety of continuous tasks. However, the theoretical guarantees and the accuracy of the learned representations remain unexplored. In this work, we address this gap by establishing a theoretical foundation for the online learning of Laplacian representations, a method already shown to be effective for downstream tasks in prior empirical studies. Our theoretical and empirical results also emphasize the assumptions necessary for accurate learning. For instance, we demonstrate that algorithms such as Q-learning-based approaches—which often cause significant shifts in the action distribution and are commonly used in option discovery—may not be well-suited for learning Laplacian representations in an online setting.
As noted in the conclusion, our results lay the groundwork for broader applications. These include tasks like option discovery, reward shaping, and value function approximation, all supported by the theoretical guarantees provided by our analysis. This work opens up new directions for leveraging online representations in reinforcement learning. We agree with the reviewer that studying the effect of this algorithm on downstream tasks is important and can be the direction of future work.
References
- Gomez, Diego, Michael Bowling, and Marlos C. Machado. "Proper Laplacian Representation Learning." The Twelfth International Conference on Learning Representations.
- Martin Klissarov and Marlos C. Machado. Deep Laplacian-based options for temporally-extended exploration. In Proceedings of the 40th International Conference on Machine Learning, pp. 17198–17217. PMLR, July 2023. URL https://proceedings.mlr.press/v202/klissarov23a.html.
- Jiayu Chen, Vaneet Aggarwal, and Tian Lan. A unified algorithm framework for unsupervised discovery of skills based on determinantal point process. Advances in Neural Information Processing Systems, 36, 2024.
- Yuu Jinnai, Jee Won Park, Marlos C Machado, and George Konidaris. Exploration in reinforcement learning with deep covering options. In International Conference on Learning Representations, 2019.
- Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295–2304. PMLR, 2017a.
- Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(10), 2007.
- Kaixin Wang, Kuangqi Zhou, Qixin Zhang, Jie Shao, Bryan Hooi, and Jiashi Feng. Towards better Laplacian representation in reinforcement learning with generalized graph drawing. In Proceedings of the 38th International Conference on Machine Learning, pp. 11003–11012. PMLR, July 2021. URL https://proceedings.mlr.press/v139/wang21ae.html.
- Yifan Wu, George Tucker, and Ofir Nachum. The Laplacian in RL: Learning representations with efficient approximations. arXiv preprint arXiv:1810.04586, 2018.
Thank you for your response. After reading the work again, I have decided to keep my score.
The paper proposes a new algorithm for efficient online learning of Laplacian representations for RL. Under benign assumptions, the authors prove convergence to the Laplacian eigenvectors. Finally, the authors provide some simple gridworld experiments showing that the algorithm performs as intended, in accordance with the assumptions (e.g., it works worse when the policy update is less stable).
Strengths
The paper is well written, with a clear survey of the ideas that lead to the solution.
Authors consider an interesting problem of efficiently learning representations that dynamically adjust as policy learning occurs. We could use existing algorithms for this task by performing an expensive representation learning step every time a policy update occurs, but this is impractical.
The algorithm seems like a practical extension of existing work.
As far as I can tell (did not read the proofs in the appendix) the paper is theoretically grounded.
The ablations to evaluate how the assumptions translate into actual performance are appreciated.
Weaknesses
It is not entirely clear how much the theoretical contribution of this work relies on the existing work of Gomez et al. (2023). Given that the contribution is primarily theoretical, and that the algorithmic step of converting the existing offline approaches to an online algorithm seems reasonable but mostly direct, the value of the contribution rests primarily on the theory. To me, it appears that there is enough of a contribution here.
The experiments seem more like debug/sanity-check experiments. They are useful for convincing the reader that the algorithm works beyond the asymptotics of the theory, but they don't provide a compelling case that the method is actually useful. Without a stronger theoretical contribution, a more comprehensive experimental section (more challenging environments and a setting where the learned embeddings are used) would be helpful in showing the value of this method.
Questions
Why is cosine similarity the only thing observed in the experiments? It seems like the core point of learning representations online is to be able to use them. Unless I am misunderstanding, the policy here can be anything, and does not use the representations. I think the paper would be strengthened with an example showing how the ability to adjust the representations online improves the actual task, whether that means that the policy acts on the representations or they are part of exploration (e.g., only used in the exploration part of epsilon-greedy).
Typos: Line 214: these should be eigenfunctions/eigenvectors, not eigenvalues. Line 341: "non" -> "none".
We thank the reviewer for their thorough response and insightful feedback. We have updated the manuscript to fix the typos highlighted and we address below the other concerns and questions.
Theoretical Contribution
Our work builds on the foundation established by Gomez et al. (2023), specifically leveraging an objective similar to ALLO. We use AGDO, which is equivalent to ALLO with its eigenvalue-learning terms removed. As noted by Gomez et al., this simplification is appropriate when eigenvalues are not required. However, prior work primarily focused on this simplified objective for learning representations tied to a static policy, with convergence validated only through empirical analysis. In contrast, we make several key contributions:
- Theoretical Analysis of AGDO Stability: We provide the first theoretical proof that AGDO has a unique stable equilibrium corresponding to the $d$ smallest eigenvectors. While our proof builds on techniques similar to those used for ALLO, it introduces important changes specific to AGDO. For example, AGDO admits additional equilibrium points, which we also show are not stable. This result is crucial because we use AGDO as the core objective in our online framework. Unlike ALLO, AGDO enables a more practical analysis of the online drift without relying on the eigenvalue drift.
- Convergence Analysis of Online AGDO: Our primary contribution is a novel theoretical analysis of the online version of AGDO, which significantly diverges from prior work due to the challenge of dealing with drifting objectives. Specifically:
- We define the assumptions necessary for the convergence of online AGDO.
- We rigorously analyze the properties of the online objective under these assumptions, including characterizing the drift in the objective and establishing its Lipschitz continuity.
- Leveraging these properties, we prove that online AGDO achieves ergodic convergence.
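For concreteness, "ergodic convergence" here refers to the standard time-averaged notion for a sequence of drifting smooth objectives $f_t$ (the notation below is generic and may differ from the manuscript's): the averaged expected squared gradient norm along the iterates vanishes as the horizon grows,

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left\|\nabla f_t(u_t)\right\|^{2} \;\longrightarrow\; 0 \quad \text{as } T \to \infty,$$

with the rate additionally depending on the accumulated drift of the objectives.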
Usefulness of Online Representation Learning and Usage of Cosine Similarity
The work by Klissarov and Machado (2023) provides valuable insights into the benefits of using online representations for learning macro-extended actions (options) compared to static representations learned from a uniform policy. Their empirical results demonstrate the effectiveness of this approach across a variety of continuous tasks. However, the theoretical guarantees and the accuracy of the learned representations remain unexplored.
In this work, we address this gap by establishing a theoretical foundation for the online learning of Laplacian representations, a method already shown to be effective for downstream tasks in prior empirical studies. Our theoretical and empirical results also emphasize the assumptions necessary for accurate learning. For instance, we demonstrate that algorithms such as Q-learning-based approaches—which often cause significant shifts in the action distribution and are commonly used in option discovery—may not be well-suited for learning Laplacian representations in an online setting.
As noted in the conclusion, our results lay the groundwork for broader applications. These include tasks like option discovery, reward shaping, and value function approximation, all supported by the theoretical guarantees provided by our analysis. This work opens up new directions for leveraging online representations in reinforcement learning. We agree with the reviewer that studying the effect of this algorithm on downstream tasks is important and can be the direction of future work.
Finally, since our main goal is assessing the accuracy of the learned representation, we opted to use cosine similarity as the evaluation metric.
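As a simplified illustration of this kind of metric (hypothetical function name, not the exact protocol of our experiments), one can average the absolute cosine similarity between learned and ground-truth eigenvectors dimension by dimension, with the absolute value accounting for the sign ambiguity of eigenvectors:

```python
# Illustrative metric sketch; the name and exact protocol are hypothetical.
# Eigenvectors are defined only up to sign, hence the abs().
import numpy as np

def avg_cosine_similarity(U_learned, U_true):
    """Both inputs are (num_states, d) matrices with eigenvectors as columns."""
    sims = []
    for i in range(U_true.shape[1]):
        u, v = U_learned[:, i], U_true[:, i]
        sims.append(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return float(np.mean(sims))
```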
References
- Gomez, Diego, Michael Bowling, and Marlos C. Machado. "Proper Laplacian Representation Learning." The Twelfth International Conference on Learning Representations.
- Martin Klissarov and Marlos C. Machado. Deep Laplacian-based options for temporally-extended exploration. In Proceedings of the 40th International Conference on Machine Learning, pp. 17198–17217. PMLR, July 2023. URL https://proceedings.mlr.press/v202/klissarov23a.html.
Thank you for the detailed response; I appreciate the work to correct Assumption 2/Theorem 2 as was pointed out by reviewer fhTs.
While I acknowledge related work demonstrates the empirical value of online representations, I hold by my criticism that there are no experiments in the paper truly showing the value. I am still unconvinced that the theoretical contribution is significant enough to warrant acceptance. I retain my score.
We would like to thank all the reviewers for the thoughtful and detailed feedback on our work. We greatly appreciate the time and effort you have invested in reviewing our submission. We have made modifications to the manuscript to fix typos and incorporate suggestions. We hope our response answers the reviewer's questions and resolves their concerns. In that case, we hope the reviewers reconsider their evaluation. We would be happy to further elaborate on any of the points if the reviewers would like additional details.
This paper proposes a new algorithm for efficient online learning of Laplacian representations in reinforcement learning (RL). Under benign assumptions, the authors prove convergence to Laplacian eigenvectors. The reviewers identified several critical shortcomings in the paper, citing a lack of meaningful contributions in both algorithm design and theoretical analysis. Specifically, the algorithmic framework is largely adapted from existing works, offering limited novelty. The theoretical analysis is similar to that in Gomez et al. (2023), with minimal differentiation. Comprehensive experiments demonstrating the algorithm’s efficacy are missing. Moreover, certain assumptions made in the paper appear unrealistic or overly strong in practical scenarios. For example: Assumption 1, which lacks specific structural assumptions on the problem, could introduce approximation errors in the analysis. Assumption 2 is not valid for many existing algorithms and may lead to a different convergence rate for Theorem 2. The authors’ response did not adequately address these issues and suggested that significant modifications to the current results would be required. Given these concerns, I recommend rejecting this paper in its current form.
Additional Comments from Reviewer Discussion
The discussion failed to establish the paper's technical and algorithmic contributions. Additionally, scrutiny of the paper's underlying assumptions during the discussion suggests that the theoretical results may not hold.
Reject