Constructing an Optimal Behavior Basis for the Option Keyboard
A method for constructing an optimal behavior basis for the Option Keyboard, enabling zero-shot identification of optimal solutions for any linear-reward task.
Abstract
Reviews and Discussion
This paper is about a method for efficiently constructing an optimal behavior basis (a set of so-called base policies) for multi-task reinforcement learning. Based on this optimal behavior basis, a dynamic combination of these policies is used to solve new RL tasks. The underlying concept that makes this work is successor features. The paper provides strong formal guarantees that this optimal basis allows the Option Keyboard (a generalization of GPI with different weights in each state) to optimally solve any new linear task. One of the theoretical contributions is a proof that this can express a strictly larger set of policies than the convex combination baseline. The approach is evaluated on high-dimensional RL problems and shows favorable performance relative to baselines.
Strengths and Weaknesses
Strengths
- The paper is well-written and addresses an important and interesting problem.
- From what I understand, the paper is correct and provides a novel and interesting solution. However, I am not an expert in this area and cannot claim to understand the theoretical results of the paper completely.
- The research questions are clearly phrased in the experiments section which helps understanding the evaluation.
- Given the results, the method seems to extend to non-linear reward functions to a certain degree.
- Evaluation is done with 15 seeds.
- The method is shown to significantly reduce the number of needed base policies which is very relevant.
- I am sure the paper is of high quality and I would like to see an expert reviewer make a statement about the theoretical contributions which are extensive (considering the appendix).
Weaknesses
- I am a bit puzzled by the question of when such an approach is actually practical. The assumptions that this approach makes seem challenging for real-world RL tasks/settings. Perhaps the authors can argue more for practical suitability beyond the simulated robot example.
Questions
- See weaknesses.
- How did you design suitable successor features?
Limitations
yes
Final Justification
I thank the authors for their responses (also regarding the comments of the other reviews). At this point, I do not have any further questions and I remain positive about this paper and maintain my ratings.
Formatting Concerns
We thank the reviewer for the positive and encouraging feedback. We are glad you found the paper to be clear, novel, and impactful. We also appreciate your comments on the correctness of our approach, the strength of our empirical evaluation, and the significance of our method’s ability to produce an optimal behavior basis significantly smaller than a CCS: this was precisely the central goal of the work.
Below, we address the reviewer’s questions and comments:
Questions regarding the practical applicability of our method
The reviewer raised an excellent question about the practical suitability of our approach beyond robotics examples. This is indeed a crucial point, since Successor Features (SFs)—the core mathematical object underlying our method—form the foundation of many recent advances in reinforcement learning.
A key reason SFs scale well to large problems is that computing them reduces to estimating a vectorized value function, which allows direct application of state-of-the-art TD learning algorithms. These techniques have been shown to scale to domains with over a million state features, including computer Go. Accordingly, SF-based methods that build upon such techniques have successfully extended to high-dimensional settings such as 3D image-based navigation and robotic locomotion (Carvalho et al., 2023a; Chua et al., 2024). Our method can directly leverage all of these advances.
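To make the "vectorized value function" point concrete, below is a minimal sketch of a tabular TD(0) update for successor features. This is an illustration written for this response rather than the paper's implementation; the names and array shapes are assumptions.

```python
import numpy as np

def sf_td_update(psi, s, a, phi, s_next, a_next, gamma=0.99, alpha=0.1):
    """One TD(0) step for successor features.
    psi: array [n_states, n_actions, d] of current SF estimates;
    phi: observed d-dimensional reward features for this transition."""
    target = phi + gamma * psi[s_next, a_next]   # vector-valued TD target
    psi[s, a] += alpha * (target - psi[s, a])    # componentwise TD(0) update
    return psi

# Example: 5 states, 2 actions, 3 reward features.
psi = np.zeros((5, 2, 3))
psi = sf_td_update(psi, s=0, a=1, phi=np.array([1.0, 0.0, 0.0]), s_next=2, a_next=0)
```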
Moreover, please note that our approach is the first to ensure zero-shot optimality without requiring explicit construction of exponentially large Convex Coverage Sets (CCS), a key limitation in existing methods. This results in a substantial improvement in scalability over prior approaches.
Finally, we would like to highlight that:
(1) Our results go beyond demonstrating that OKB consistently outperforms state-of-the-art competitors in the most relevant and challenging benchmarks in the field (e.g., the autonomous driving environment in Figure 5). They also show that our method’s performance advantage over existing techniques becomes increasingly pronounced as task complexity increases.
(2) Prior methods have only been evaluated on problems of limited feature dimensionality (e.g., those in the original SFOLS and SIP papers). Our paper is the first to demonstrate effective, scalable, and provably optimal zero-shot transfer in problems with twice as many dimensions. This shows not only that our technique outperforms all relevant baselines, but also that it does so in problems an order of magnitude more complex than those addressed in prior work.
This is a great question and helped us realize the need to better emphasize that the core components of our approach (SFs) have already been systematically shown to scale to problems with millions of features, and that our own results provide the first empirical evidence of consistent outperformance over state-of-the-art competitors on domains with twice the dimensionality of those studied in prior work.
Designing Successor Features
The paper details how the reward features are designed for each domain in Appendix F.2. For example, in the Minecart domain, the reward features include quantities of collected ores and fuel consumption. In Fetch-PickAndPlace, they represent negative Euclidean distances to target locations. For Item Collection, they are composed of indicator functions for collected items, and in Highway, they include normalized forward speed, lane position, and turning actions. These are hand-crafted features based on domain knowledge, which is a common practice in the SF literature. Details about the architecture and training of the neural networks used to model the SFs follow standard practices in deep RL and are described in Appendix F.
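As a purely hypothetical illustration of such hand-crafted, indicator-style features (not the paper's actual feature definitions, which are given in Appendix F.2), a feature function for an item-collection-style domain could look like this:

```python
import numpy as np

ITEM_TYPES = ["gem", "key", "coin"]  # hypothetical item categories

def item_collection_features(state, action, next_state):
    """One indicator per item type collected on this transition; a linear task
    is then defined by a weight vector w, with reward r = features . w."""
    collected = next_state.get("just_collected")  # hypothetical state field
    return np.array([1.0 if item == collected else 0.0 for item in ITEM_TYPES])
```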
We thank the reviewer again for the positive and supportive review. Given your primary questions regarding scalability of the approach, we hope to have clarified the practical applicability of our method and SFs in general. We note that the reviewer said in their review that they are “sure the paper is of high quality” and they would like to see an expert reviewer make a statement about the theoretical contributions. Other reviewers also pointed out that: “OKB produces a set of base policies and meta-policy which are optimal with respect to the family of tasks while requiring fewer base policies than existing methods” (Reviewer rDD6); “the proposed method is novel and theoretically sound” (Reviewer 1Yjh); “the policy basis is indeed smaller than that in previous work that also guarantee optimality” (Reviewer QsaG).
If you believe the clarifications above address the questions raised, and in light of the positive comments of the other reviewers regarding our theoretical contributions, we would be grateful if you would consider revisiting your score and confidence level. If any points remain unclear, we welcome further feedback and would be happy to continue the discussion. Thank you again for your thoughtful comments!
I thank the authors for their responses (also regarding the comments of the other reviews). I have a better understanding of the scalability of the method now. At this point, I do not have any further questions and I remain positive about this paper and maintain my ratings.
We thank the reviewer once again for the thoughtful discussion, constructive feedback, and valuable comments. We greatly appreciate their positive and encouraging remarks on the relevance and importance of our contributions, as well as their support for the acceptance of our paper.
The paper proposes a novel approach to multitask reinforcement learning with successor features. Concretely, the contribution consists in computing a policy basis that enables zero-shot identification of an optimal policy for any linear task. The policy basis can be significantly smaller than a convex coverage set, and generalization occurs by considering linear combinations of the policies in the basis. The contribution is empirically tested in a set of benchmark domains.
Strengths and Weaknesses
The paper is clearly written, and achieving efficient zero-shot identification of an optimal policy in a family of MDPs is an important research question. It is also true that the policy basis is indeed smaller than that in previous work that also guarantees optimality.
On the negative side, I am not convinced that the proposed algorithm leads to more efficient learning than previous work. The reason is that the computational cost is dominated by the number of successor feature vectors, not the number of base policies. Specifically, learning the successor feature vector of a single policy is as hard as solving a single MDP in the given family. From the description of the algorithm (on page 4 and in Algorithm 1) it appears that the number of successor feature vectors in the set \Psi^{OK} is larger than that of a convex coverage set, which would imply that the overall computational complexity is greater. The authors do not discuss why this is the case, but my intuition is that we cannot linearly compose successor feature vectors of different policies, making it necessary to explicitly compute a successor feature vector for each policy considered (including many linear combinations of the base policies).
Questions
Proposition 4.4 states that the OK can represent an optimal policy for a non-linear reward function. However, how would you actually compute the function \omega(s,w) in this setting, if there is no linear task w?
In the experiments, the authors use a USFA parameterized on the task w to learn the successor features. This seems like a peculiar choice given the problem at hand. On one hand, this will likely introduce approximation errors. On the other hand, this allows representing the successor features of any policy, not just those in a small set.
There seems to be a typo in Theorem 4.2: in A^{w^*}r(s,a), r should presumably be a subscript of A.
Limitations
Notably, a discussion around the number of successor feature vectors computed by the algorithm is lacking.
Final Justification
After discussion with the authors I have raised my score, since some of my concerns were clarified (specifically, the number of successor feature vectors computed by the algorithm). I believe that the final version of the paper needs to be more honest regarding the contribution, and include the theoretical justification provided by the authors in the discussion.
Formatting Concerns
No concerns.
We thank the reviewer for the insightful review. We appreciate their positive comments acknowledging that our method is capable of producing a basis significantly smaller than a CCS while ensuring zero-shot optimality—this was precisely the primary goal of our paper.
Below, we address the reviewer's main concerns and questions:
On the dominant computational cost in our approach
Thank you for your questions. Your comments helped us realize that we should more clearly emphasize the distinct roles played by the two sets central to our method—namely, the behavior basis and the set $\Psi^{OK}$.
The reviewer makes three claims: (a) learning an SF vector is as hard as solving an MDP; (b) the computational cost of our method is dominated by the number of successor feature (SF) vectors rather than the number of base policies; and (c) the overall cost of learning $\Psi^{OK}$ exceeds that of learning a CCS.
To carefully address these, we begin with a broader overview of our method and its key properties, followed by a more technical and formal discussion.
Overview and definitions
(1) Two sets of policies are central to our method—both for formally characterizing its properties and for designing the algorithm itself: the behavior basis and $\Psi^{OK}$. The behavior basis is a small, compact set of policies (much smaller than the CCS) that our method combines to solve new tasks (line 153). In contrast, $\Psi^{OK}$ denotes the set of all optimal policies (more precisely, their SF vectors) that can be directly reconstructed by our method by combining the policies in the basis (line 151). These are the strategies our method can deploy to solve novel tasks zero-shot, without the need to train a specialized policy for each task—a process that is often costly, time-consuming, or infeasible.
(2) A key property of our method—central to its effectiveness and flexibility—is its provable ability to directly reconstruct all policies in a CCS, as well as policies for a challenging family of non-linear tasks, without ever explicitly having to compute or optimize them. Instead, it is capable of reconstructing them on demand using a meta-policy that combines policies from the behavior basis. These formal guarantees are established in Theorem 4.3 and extended by Proposition 4.4.
(3) In our paper, we formally express this property by stating that the CCS is a subset of $\Psi^{OK}$; that is, CCS $\subseteq \Psi^{OK}$. Intuitively, this means that our method is not only capable of reconstructing all optimal solutions for linear tasks, but it generalizes beyond the CCS and directly synthesizes optimal policies for particular non-linear tasks.
(4) Importantly, note that the formal statement CCS $\subseteq \Psi^{OK}$ is a claim about the expressiveness of our method—specifically, which optimal policies it is capable of directly reconstructing, zero-shot, without requiring any training. It is not a statement about which (or how many) policies must be learned to construct the behavior basis that enables such zero-shot capability.
Addressing the three primary points raised by the reviewer
Given the overview above, we now respond to the three key claims raised by the reviewer.
(a) "Learning an SF vector is as hard as solving an MDP."
Contrary to the reviewer's claim, learning SF vectors is not as hard as solving an MDP.
Estimating the successor features associated with a policy is a policy evaluation problem. Solving an MDP, by contrast, is a policy optimization problem. The sample complexity of solving an MDP and identifying an $\epsilon$-optimal policy is approximately $\tilde{O}\big(|S||A|(1-\gamma)^{-3}\epsilon^{-2}\big)$. The sample complexity of evaluating a policy (specifically, computing its successor features with respect to the MDP's initial state, as done in our paper; see line 97) is $\tilde{O}\big(|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big)$.
These well-known results imply that solving an MDP requires roughly a factor of $\frac{1}{1-\gamma}$ more samples than evaluating a fixed policy (e.g., computing its SFs). As $\gamma$ approaches 1 (as is common in the literature), solving an MDP becomes orders of magnitude harder than computing SFs. As a concrete example, consider that in deterministic environments, evaluating a policy—computing its SFs—can be done by simply executing the policy once, over a single episode, and recording the resulting rewards. By contrast, learning an optimal policy for standard RL benchmarks often requires thousands or millions of environment interactions.
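For concreteness, a minimal sketch of this single-rollout evaluation, assuming a deterministic environment with a Gymnasium-style API (illustrative only, not our actual code):

```python
def evaluate_policy_once(env, policy, gamma=0.99, max_steps=1000):
    """Return the discounted return of one rollout of `policy`. When both the
    environment and the policy are deterministic, this equals the true value
    of the start state, i.e., policy evaluation from a single episode."""
    s, _ = env.reset()
    ret, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, terminated, truncated, _ = env.step(a)
        ret += discount * r
        discount *= gamma
        if terminated or truncated:
            break
    return ret
```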
Hence, we respectfully point out that the reviewer’s claim—that learning SFs is as hard as solving an MDP—is incorrect. This misunderstanding also appears to underlie the two subsequent points they raised; please see below.
(b) “The computational cost of [our method] is dominated by the number of SF vectors, not the number of base policies”.
We believe this claim stems from the reviewer's observation that $\Psi^{OK}$ stores SF vectors, along with our statement that CCS $\subseteq \Psi^{OK}$. However, as clarified in point (4) above, the cardinality of $\Psi^{OK}$ is not related to the number of base policies (or SFs) the agent must learn, and the statement CCS $\subseteq \Psi^{OK}$ does not imply that our method needs to compute more policies than those in a CCS. Rather, these statements merely indicate that our method is capable of reconstructing a larger set of optimal policies than those in a CCS—specifically, by combining policies from the behavior basis, which is much smaller than the CCS (line 153).
The observation that the size of $\Psi^{OK}$ is unrelated to the number of base policies (or SFs) the agent must learn—combined with the result from point (a) above, which states that learning base policies requires orders of magnitude more samples than learning SFs—implies that the computational cost of our method is not dominated by the number of SF vectors. Instead, the primary computational bottleneck lies in training the base policies that make up the behavior basis.
(c) The overall cost of learning $\Psi^{OK}$ exceeds that of learning a CCS.
To address this claim, we first note that the set $\Psi^{OK}$, as defined in Section 3 (line 151), is never explicitly computed by Algorithm 1. Instead, as discussed earlier, this set characterizes the optimal policies (more precisely, their SF vectors) that can be directly reconstructed by our method by combining the policies in the behavior basis. The set maintained by Algorithm 1, on the other hand, is updated iteratively and stores the SF vectors of the CCS policies our method becomes capable of reconstructing. Hence, it is never larger than the CCS itself. Our algorithm terminates when this set converges to the CCS (line 12), at which point it contains exactly the SF vectors of a CCS. Thus, contrary to the reviewer's claim, our method does not have to learn all policies in a CCS, and certainly not more. This is formally established in Theorem 4.3, which is central to our theoretical contributions.
Using a USFA to learn SFs
We note that the use of USFAs is common practice in the SF literature—see, e.g., Borsa et al. (2019), Kim et al. (2022), and Carvalho et al. (2023a,b). Our motivation for using a USFA is to avoid the memory and computational costs associated with training a separate neural network for each policy’s SFs. Importantly, a USFA does not necessarily introduce more approximation error than training separate networks—one per policy—as generalization in both cases depends primarily on the network architecture, capacity, and the distribution of training data.
Finally, although a USFA is theoretically capable of approximating the SFs of any policy for a linear task, in practice we train it (and rely on its predictions) only on the tasks on which the OKB is trained. By limiting its use in this way, we reduce potential generalization errors. We agree this is an important point and will clarify it further in the updated version of the paper.
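For reference, a minimal sketch of what such a task-conditioned SF network can look like, in the spirit of Borsa et al. (2019); the layer sizes and names below are assumptions for illustration, not our exact architecture:

```python
import torch
import torch.nn as nn

class USFA(nn.Module):
    """Universal SF approximator: predicts psi(s, a, w) for all actions,
    conditioned on the state and the task vector w."""
    def __init__(self, state_dim, n_actions, d, hidden=256):
        super().__init__()
        self.n_actions, self.d = n_actions, d
        self.net = nn.Sequential(
            nn.Linear(state_dim + d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * d),
        )

    def forward(self, state, w):
        x = torch.cat([state, w], dim=-1)
        return self.net(x).view(-1, self.n_actions, self.d)
```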
Proposition 4.4
The reviewer asked how to learn a meta-policy in settings where the OKB is used to reconstruct the optimal policy for non-linear tasks. The reviewer is correct that when generalizing over linear tasks $w$, the meta-policy takes the form $\omega(s, w)$. In contrast, when solving potentially non-linear tasks (as in Proposition 4.4), the meta-policy is conditioned only on the state; that is, it takes the form $\omega(s)$. In this case, it reduces to the formulation used in the original Option Keyboard paper; see Section 2.4 and Eq. (4) for details.
To train $\omega(s)$, we use an actor-critic policy gradient approach, following the methodology of Barreto et al. (2019). Specifically, we follow Algorithm 3 (Appendix D), replacing the linear task reward with the scalar non-linear reward and the corresponding critic with a standard scalar Q-function. We will clarify these points in the updated version of the paper.
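To illustrate how the meta-policy's output is used at execution time, here is a minimal sketch of GPI-style action selection over the base policies' SFs, given the state-dependent weights produced by the meta-policy (illustrative code written for this response, not our exact implementation):

```python
import numpy as np

def ok_action(psi_at_s, w_s):
    """psi_at_s: array [n_base, n_actions, d] with the base policies' SF
    estimates at the current state; w_s: d-dimensional weights output by the
    meta-policy for this state (omega(s) or omega(s, w))."""
    q = psi_at_s @ w_s                     # [n_base, n_actions] action values
    return int(np.argmax(q.max(axis=0)))   # GPI: best action across base policies
```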
Minor Typo
Thank you for pointing out this minor typo. The correct notation is indeed $A_r^{w^*}(s,a)$ (i.e., with $r$ as a subscript), as defined in Eq. (9). This will be fixed in the updated version of the paper.
We hope our responses address the main point you raised regarding the computational cost of the method, as well as the other minor clarification questions. We will revise the text to incorporate a discussion about all these points. If you believe the clarifications above address the key points you raised, we'd be grateful if you would consider revisiting your score. If any points remain unclear, we welcome further feedback and would be happy to continue the discussion. Thank you again for your thoughtful comments!
I would like to thank the authors for their detailed rebuttal. My understanding of the algorithm has not changed, though it was not completely clear to me that the algorithm stops when the set is precisely a CCS. This at least clarifies that the set is at least as large as a CCS (even if it is not larger during execution of the algorithm).
Could you provide a reference to the sample complexity of policy evaluation? My understanding is that computing the successor features of a given policy requires solving a Bellman equation over state-action pairs (or at least over states), which has exactly the same complexity as computing the value function of a given policy. This is the basis of my claim that computing the successor features is as hard as solving the MDP.
Even if the sample complexity of policy evaluation is independent of the number of states and actions, the computational complexity of solving these Bellman equations by nature has to be linear in S (and possibly A). I do not see how the successor features will converge by following a single episode of a policy, since the estimate has to be updated many times for each state in order to converge.
We thank the reviewer once again and are glad that some of the key theoretical contributions of our paper are clearer.
Below, we respond to the three remaining points raised in the reviewer’s latest comments: (1) the request for references on the sample complexity of policy evaluation and policy optimization; (2) the claim that computing successor features for a policy is as hard as solving an MDP; and (3) the claim that the overall cost of our method exceeds that of learning a CCS.
(1) Sample complexity: Policy evaluation is fundamentally easier than solving an MDP
To address this point, we focus on two of the reviewer’s statements and provide a detailed response to each:
- (a) "Computing (...) successor features of a given policy (...) has exactly the same complexity as computing the value function of a given policy"; and
- (b) "This is the basis of my claim that computing the successor features is as hard as solving the MDP".
We fully agree with statement (a). Regarding statement (b), we would like to clarify a potential misunderstanding. It appears the reviewer may be equating the difficulty of computing the value function for a given policy (i.e., policy evaluation) with that of solving an MDP (i.e., policy optimization). To clarify, these are fundamentally distinct problems with qualitatively different sample complexity. The Bellman equation for a fixed policy involves a known action selection rule, whereas the Bellman optimality equation requires searching over all possible rules. Solving the latter is strictly more challenging, and the sample complexities differ accordingly—a fact well established in the literature. We review these results and provide key references below.
Sample complexity of policy optimization
The sample complexity of solving an MDP (policy optimization) is $\tilde{O}\big(|S||A|(1-\gamma)^{-3}\epsilon^{-2}\big)$.
As requested, we provide below representative references that support this result:
- "Near-Optimal Time and Sample Complexities for Solving Discounted Markov Decision Process with a Generative Model". Sidford et al., NeurIPS (2018).
- The bound appears in the Abstract, Introduction, and Theorem 4.6.
- [Note: "Generative model" here refers to settings where the transition function is not directly accessible, but the learner can sample from it; i.e., the standard model-free setting].
- "Is Plug-in Solver Sample-Efficient for Feature-based Reinforcement Learning?". Cui and Yang, NeurIPS (2020).
- See Table 1, which summarizes several papers establishing the $(1-\gamma)^{-3}$ dependency in the sample complexity of policy optimization, as we claimed.
- "Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model". Azar et al., Machine Learning (2013).
- See Theorem 3. After substituting the authors' definitions of the relevant quantities into the expression in Theorem 3, their bound matches the ones cited above.
Sample complexity of policy evaluation
The sample complexity of computing the value function of a given policy (policy evaluation) is $\tilde{O}\big(|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big)$.
As requested, we provide below a representative reference and a simple derivation from first principles to establish this result.
- [Proof] Consider estimating $q^\pi(s,a)$ for a fixed policy $\pi$ using Monte Carlo: we initialize the agent in $(s,a)$ and perform a rollout by following $\pi$ and recording the return $G$. The estimator is the average over $n$ rollouts: $\hat{q}^\pi(s,a) = \frac{1}{n}\sum_{i=1}^{n} G_i$, where $G_i$ is the return from the $i$-th rollout. Hoeffding's inequality bounds, as a function of $n$, the probability that this estimate deviates from the true value by more than $\epsilon$: $\Pr\big(|\hat{q}^\pi(s,a) - q^\pi(s,a)| \geq \epsilon\big) \leq 2\exp\!\big(-2n\epsilon^2/(G_{\max}-G_{\min})^2\big)$, where $G_{\max}$ and $G_{\min}$ are the maximum and minimum possible returns of a rollout. Define $\delta$ such that $2\exp\!\big(-2n\epsilon^2/(G_{\max}-G_{\min})^2\big) \leq \delta$ ([Eq.1]). This implies that the estimate is $\epsilon$-accurate with probability at least $1-\delta$. Solving [Eq.1] for $n$ gives $n \geq \frac{(G_{\max}-G_{\min})^2 \log(2/\delta)}{2\epsilon^2}$. In the infinite-horizon case with rewards in $[-1, 1]$, we have $G_{\max} = \frac{1}{1-\gamma}$ and $G_{\min} = -\frac{1}{1-\gamma}$, so $G_{\max}-G_{\min} = \frac{2}{1-\gamma}$. Substituting, we obtain $n = O\big((1-\gamma)^{-2}\epsilon^{-2}\log(1/\delta)\big)$, which implies that the sample complexity of estimating the $q$-value of a single $(s,a)$ pair is $\tilde{O}\big((1-\gamma)^{-2}\epsilon^{-2}\big)$. Note that this expression drops constant and logarithmic terms, as is standard under the $\tilde{O}$ notation. Since there are $|S||A|$ such pairs, the total sample complexity of policy evaluation is $\tilde{O}\big(|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big)$.
- A representative reference that also derives this bound is "A Finite Time Analysis of Temporal Difference Learning With Linear Function Approximation". Bhandari et al., COLT (2018).
- Theorem 3 of this paper presents the policy evaluation bound mentioned above: it bounds the policy evaluation error as a function of the number of samples. While the expression may initially look different, it can be rearranged into the more familiar form by solving for the number of samples required to ensure a given error level $\epsilon$. The precise definitions of the variables in that expression are not central to our goal of characterizing sample complexity, since the results of Theorem 2 imply that most of them are absorbed into constant and logarithmic factors; in the tabular case, the remaining problem-dependent quantities can be bounded in terms of $|S|$, $|A|$, and $\frac{1}{1-\gamma}$. Substituting these bounds and dropping logarithmic factors under $\tilde{O}$ yields the same $\tilde{O}\big(|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big)$ expression we previously discussed for the sample complexity of policy evaluation.
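For concreteness, a small worked computation of the Hoeffding-based bound above, with illustrative parameter values (not taken from the paper):

```python
import math

def rollouts_needed(eps, delta, gamma, r_max=1.0):
    """Monte Carlo rollouts n so that the estimate of q^pi(s, a) is within eps
    of the true value with probability at least 1 - delta, for rewards in
    [-r_max, r_max] (so G_max - G_min = 2 * r_max / (1 - gamma))."""
    g_range = 2 * r_max / (1 - gamma)
    return math.ceil(g_range ** 2 * math.log(2 / delta) / (2 * eps ** 2))

print(rollouts_needed(eps=1.0, delta=0.05, gamma=0.9))  # -> 738 rollouts per (s, a) pair
```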
We hope the clarifications above help resolve this point. Based on the citations, formal analyses, and derivations provided, we respectfully restate our original claims and reiterate their correctness:
- Policy evaluation and policy optimization have fundamentally distinct sample complexity characteristics. Solving an MDP requires more samples than evaluating a fixed policy. In common cases where $\gamma$ approaches 1, solving an MDP becomes orders of magnitude harder than performing policy evaluation.
Given the formal results provided above, as requested, we respectfully note that the reviewer's claim—that computing successor features is as hard as solving an MDP—is inaccurate and not supported by foundational results in RL theory.
(2) On the claim that computing SFs is as hard as solving an MDP
We address this concern by commenting on two specific points raised by the reviewer: (a) the difficulty of computing successor features in deterministic settings, and (b) the sample complexity of estimating successor features compared to that of solving an MDP.
- (a) In our rebuttal, we highlighted the difference in difficulty between evaluating SFs and solving an MDP through a motivating example: in deterministic environments, SFs can be evaluated exactly by executing the policy a single time.
- The reviewer responded that "I do not see how [in our method], successor features will converge (...) in a single episode (...) since the estimate has to be updated many times". Recall that $\psi^\pi(s,a)$ is the expected discounted sum of observed reward features when starting in $(s,a)$ and following $\pi$ thereafter. In a deterministic environment, such as in our motivating example, this expectation collapses to a single trajectory. That is, each step produces a deterministic reward feature and results in a deterministic transition. After one episode, the agent will have observed exactly the sequence of features that define $\psi^\pi(s,a)$; that is, $\psi^\pi(s,a) = \sum_{t=0}^{T} \gamma^t \phi_t$. Repeating this rollout from $(s,a)$ under $\pi$ would produce the same result. Thus, in the motivating setting we described, and for which the reviewer requested clarification, $\psi^\pi(s,a)$ can indeed be estimated exactly from a single episode.
- In stochastic environments, $\psi^\pi(s,a)$ can be estimated using standard policy evaluation methods, such as Monte Carlo rollouts—a policy evaluation approach whose sample complexity was discussed earlier.
- Finally, recall that—as defined in line 97—the SFs used in our paper correspond to the expected value of $\psi^\pi$ at the start of the episode. That is, each such SF is a vector of real numbers representing the expected SFs from the initial state, rather than the full SF function over all $(s,a)$ pairs. This implies that the complexity of estimating each SF vector used by OKB is even lower than the bounds discussed earlier, since OKB only requires SF estimates from the beginning of the episode. See line 7 of Algorithm 2 for reference.
- (b) The reviewer claimed that "learning the SF vector of a single policy is as hard as solving a single MDP".
- As the reviewer correctly notes, computing SFs is equivalent to performing policy evaluation. Building on this observation and on the sample complexity results discussed above, which highlight the substantial gap between policy evaluation and policy optimization, we respectfully point out that the reviewer’s claim is inaccurate: learning successor features is significantly easier—often by orders of magnitude—than solving an MDP. These are qualitatively different problems with fundamentally different sample complexity requirements.
- We believe this misunderstanding may also underlie the reviewer’s final point—namely, the claim that "the overall cost of [our method] exceeds that of learning a CCS". Building on the observations and formal results discussed above, we now address this latter point.
(3) On the claim that our method is more expensive than learning a CCS
We address this point through three key observations:
(3.1) On the cost of constructing a CCS using existing techniques.
- Recall that a CCS (Eq. 6) is a set of successor feature (SF) vectors. Existing algorithms build a CCS iteratively: at each step, they select a task (MDP), learn its optimal policy, and compute that policy's SF vector to include in the CCS.
- This is computationally expensive due to the high sample complexity of solving MDPs. Specifically, for each element of the CCS, existing techniques incur the cost of policy optimization, $\tilde{O}\big(|S||A|(1-\gamma)^{-3}\epsilon^{-2}\big)$, and policy evaluation, $\tilde{O}\big(|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big)$.
- Assume, for example, that the CCS contains 1000 elements. In this case, the total cost incurred by existing techniques to construct a CCS is $1000 \times \tilde{O}\big(|S||A|(1-\gamma)^{-3}\epsilon^{-2}\big) = \tilde{O}\big(1000\,|S||A|(1-\gamma)^{-3}\epsilon^{-2}\big)$.
(3.2) On how our method avoids the dominant costs of CCS construction.
- Our method (OKB) is capable of reconstructing the entire CCS without incurring the dominant costs associated with existing techniques. In particular, Theorem 4.3 and Proposition 4.4 formally show that OKB can directly synthesize any policy in the CCS without having to directly optimize it, thereby bypassing the costly process of solving CCS tasks individually—a process with sample complexity $\tilde{O}\big(|S||A|(1-\gamma)^{-3}\epsilon^{-2}\big)$ per policy.
- As training progresses, OKB incrementally builds its set of SF vectors by computing and storing the SFs of policies it becomes capable of reconstructing. The resulting set of SFs includes all of those in the CCS, as well as SFs of policies optimal for a particular class of nonlinear tasks. Recall that computing the SFs of a policy (i.e., performing policy evaluation) has sample complexity $\tilde{O}\big(|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big)$.
- Since the OKB can zero-shot reconstruct/synthesize all CCS policies and only needs to compute their SFs—a significantly cheaper step than full policy optimization—its total cost is orders of magnitude smaller than that of learning a CCS. Assume, again, that the CCS contains 1000 elements. Then, the total cost incurred by OKB is $1000 \times \tilde{O}\big(|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big) = \tilde{O}\big(1000\,|S||A|(1-\gamma)^{-2}\epsilon^{-2}\big)$.
(3.3) On how to characterize the sample complexity improvements achieved by OKB.
- The sample complexity improvement achieved by OKB can be quantified by comparing its cost to that of existing approaches for constructing a CCS. Specifically, the ratio between the two costs above—roughly a factor of $\frac{1}{1-\gamma}$—captures how much more expensive it is to build a CCS using existing techniques compared to OKB.
- To see why this is significant, consider the common case where $\gamma = 0.9$. In this setting, existing approaches for learning a CCS require roughly 10 times more samples than our method.
- Importantly, note that the sample complexity improvements enabled by our method increase significantly as task complexity grows—for example, in long-horizon settings where the discount factor approaches 1. Consider the impact of increasing $\gamma$ from 0.9 (as discussed above) to 0.99: in this case, learning a CCS through existing methods would require approximately 100 times more samples than our approach (see the short computation below).
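The computation behind these figures is a simple ratio of the two per-policy sample complexities discussed above (constants and logarithmic factors dropped):

```python
# Policy optimization scales as ~1/(1-gamma)^3 and policy evaluation as
# ~1/(1-gamma)^2, so the per-policy gap is a factor of roughly 1/(1-gamma).
for gamma in (0.9, 0.99):
    print(f"gamma={gamma}: CCS construction ~{1 / (1 - gamma):.0f}x more samples than OKB")
```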
Final remarks and closing thoughts
We thank the reviewer for engaging deeply with our work and for raising important questions. We hope that the detailed discussion above—including all requested citations, formal analyses, proofs from first principles, and sample complexity derivations—has addressed all remaining concerns. In particular, our response focused on thoroughly engaging with the reviewer’s latest comments by (1) precisely characterizing the complexity gap between computing a policy's successor features and solving an MDP; (2) mathematically describing how our method avoids the dominant costs incurred by existing techniques for CCS construction; and (3) providing further theoretical support establishing that OKB’s total cost is strictly lower than that of learning a CCS.
If the responses above provide all the information and clarification you requested, we would be grateful if you would consider revisiting your score. Once again, we thank you for engaging so thoughtfully with our work.
The paper proposes a novel multi-task reinforcement learning method that uses a bi-level policy structure. In the linear-reward setting, prior work has shown that it is possible to compute a Convex Coverage Set (CCS), a set of base policies where the optimal policy for any reward function (by the linear reward definition) can be expressed as a convex combination of policies in the CCS. However, constructing a CCS can be expensive and it is hard to scale it to more complex domains. The proposed method additionally uses a meta-policy that is allowed to dynamically choose the convex combination weights of the base policies depending on the state (unlike in prior work, where the weights are constant across states). This allows them to express policies that are optimal for more complex reward functions beyond the linear assumption. The proposed method is evaluated on four simulated environments. Compared to prior methods, the proposed method needs fewer base policies to achieve the same performance as baselines that use more base policies.
Strengths and Weaknesses
Strengths
- The paper is well-written and easy to read.
- The proposed method is novel and theoretically sound.
Weaknesses
- The authors consider only a limited number of baselines (SIP and DQN on one environment and SFOLS on others). The paper could benefit from including more baselines to better demonstrate the effectiveness of the proposed method. For example, how does the proposed method compare to using existing unsupervised skill discovery methods that learn latent-conditioned policies (e.g., [1])? Also, comparing to methods in the options framework would be relevant.
- One of the main motivations of the paper is that CCSs "are computationally expensive and do not scale to complex domains". However, as far as I could tell, all the tasks considered are relatively simple, and their CCS can be computed exactly. It would greatly strengthen the paper if the authors could showcase the proposed method on some environments where computing a CCS is completely intractable.
[1] Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. "Diversity is all you need: Learning skills without a reward function." arXiv preprint arXiv:1802.06070 (2018).
Questions
N/A
Limitations
yes
Final Justification
The authors have addressed all of my concerns. The new experiments provide evidence that the proposed method OKB can scale to more complex domains. I am raising my score to 5.
Formatting Concerns
N/A
We thank the reviewer for the positive comments regarding our method’s novelty and theoretical soundness, as well as the clarity and quality of the writing. The reviewer’s questions focused on two key aspects: (1) the set of baseline methods; and (2) the complexity of the domains used in the experiments. Below, we address both points.
Comparison with Unsupervised Skill Discovery and Options Baselines
We thank the reviewer for bringing up this important point. As the reviewer correctly noted, other methods from the unsupervised skill discovery and options literature are indeed relevant to the broader goal of more rapidly solving novel tasks. Examples include Diversity is All You Need (DIAYN), which learns a diverse set of skills without access to task rewards; Set-Max Policy (SMP), which constructs a set of policies that maximize worst-case performance over a task set; and DSP, which encourages diversity in the space of successor features while maintaining near-optimality (Zahavy et al., 2021).
In Section 5, we chose SFOLS and SIP as our primary baselines, as both have been repeatedly shown in the literature to outperform more task-agnostic methods across a variety of settings. Moreover, they are representative of approaches that aim to construct policy sets with theoretical guarantees—a central objective of our work.
Below, we present results comparing SIP to the three methods discussed above (DIAYN, SMP, and DSP), which belong to the family of approaches highlighted by the reviewer. These results are adapted from Alver and Precup (2022) and show that SIP consistently achieves better downstream task coverage (measured by mean normalized return over seventeen tasks uniformly covering the task space) in an experimental setting similar to ours. Since our results demonstrate that OKB systematically outperforms SIP—and SIP is well known to consistently outperform the other baselines in this class—these findings further support our choice of SIP as the most competitive and informative point of comparison.
| w | SIP | DIAYN | SMP | DSP |
|---|---|---|---|---|
| 1 | 0.81 | 0.19 | 0.47 | 0.37 |
| 2 | 0.86 | 0.23 | 0.50 | 0.39 |
| 3 | 0.87 | 0.31 | 0.51 | 0.43 |
| 4 | 0.87 | 0.44 | 0.54 | 0.50 |
| 5 | 0.90 | 0.67 | 0.60 | 0.56 |
| 6 | 0.90 | 0.84 | 0.62 | 0.67 |
| 7 | 0.89 | 0.89 | 0.67 | 0.68 |
| 8 | 0.91 | 0.91 | 0.68 | 0.68 |
| 9 | 0.93 | 0.93 | 0.70 | 0.69 |
| 10 | 0.94 | 0.94 | 0.73 | 0.71 |
| 11 | 0.93 | 0.93 | 0.70 | 0.69 |
| 12 | 0.95 | 0.90 | 0.67 | 0.66 |
| 13 | 0.92 | 0.77 | 0.60 | 0.69 |
| 14 | 0.94 | 0.65 | 0.58 | 0.51 |
| 15 | 0.94 | 0.53 | 0.56 | 0.47 |
| 16 | 0.93 | 0.38 | 0.54 | 0.44 |
| 17 | 0.86 | 0.22 | 0.49 | 0.37 |
| Mean norm. return → | 0.90 | 0.63 | 0.60 | 0.56 |
| StdDev norm. return → | 0.0388 | 0.2857 | 0.0832 | 0.1304 |
Table 1: Normalized returns (mean and standard deviation shown at the bottom) for various baselines, including SIP, DIAYN, SMP, and DSP across different weight vectors (tasks).
For completeness, we will include these comparisons in the paper. The findings above—namely, that SIP consistently outperforms more task-agnostic baselines such as DIAYN, DSP, and SMP in settings like the one we study—have been repeatedly observed in other prior work (e.g., Alegre et al., 2022). Based on the broader literature comparing these techniques, we thus fully expect to observe the same trend here, further reinforcing that SIP is the strongest and most informative point of comparison. All supporting tables and graphs will be added to Section 5 and the appendix.
Complexity of the Tasks
The second point raised by the reviewer concerns whether the domains used in our experiments are too simple and whether their CCS could be computed exactly. We have empirically evaluated OKB in challenging high-dimensional RL problems, including Minecart, Fetch-PickAndPlace (a robotic arm task), Item Collection, and Highway (an autonomous driving environment). These domains are well-established benchmarks in multi-task and multi-objective RL. Our results show that OKB consistently outperforms state-of-the-art GPI-based approaches, with its performance advantage becoming increasingly pronounced as task complexity (measured by the number of reward features) increases. These results demonstrate that OKB is indeed applicable and highly effective in complex and challenging scenarios relevant to real-world applications, such as robotics and autonomous driving, going beyond simple simulated environments.
Regarding tractability, prior to our paper, to the best of our knowledge, existing methods had been evaluated only on problems of lower feature dimensionality (e.g., SFOLS and SIP). In our paper, we double the number of reward features and show that our method remains effective—outperforming all baselines even on problems an order of magnitude more challenging.
Finally, we highlight that the CCS of some domains (e.g., FetchPickAndPlace and Highway) cannot be computed exactly. In all domains, we resorted to function approximation via neural networks to handle continuous state spaces of up to 25 dimensions. We agree with the reviewer on the importance of further discussing these points, and we will update the text to more clearly emphasize that our experiments were conducted on domains whose size and complexity had not been previously addressed in the literature.
We hope our responses address your main questions. We will revise the text to incorporate a discussion on all the points brought up by the reviewer. If you believe the clarifications above address the key points you raised, we would be grateful if you would consider revisiting your score. If any points remain unclear, we welcome further feedback and would be happy to continue the discussion. Thank you again for your thoughtful comments!
Thanks authors for the rebuttals. I appreciate the new comparisons to DIAYN and the clarification. Most of my concerns have been addressed except my concern on the task complexity. While it is nice to know that the CCS of some of the domains (e.g., FetchPickAndPlace and Highway) cannot be computed exactly, I am still not very confident that the proposed method can scale to more complex/challenging environments such as environments with image observations (e.g., MO-Supermario in the MO-Gym that the authors consider). While I remain positive about the work, I would like to maintain my current score.
We thank the reviewer for their thoughtful engagement and are glad to hear that our rebuttal addressed most of their concerns. We also appreciate their follow-up message and the additional question they would like to discuss next.
In their response to our rebuttal, the reviewer noted they were not yet confident our method could scale to more complex and challenging environments, such as MO-SuperMario. To directly address this concern, we adapted our codebase to support image-based observations and ran additional experiments evaluating OKB on the domain requested by the reviewer.
To ensure a fair and meaningful evaluation that directly addresses the reviewer’s concern, we conducted a focused literature review to identify state-of-the-art methods that have been shown to consistently perform well in general and in this particular domain. We examined recent papers on the topic as well as publicly available empirical results tracked automatically through platforms such as Weights & Biases, MORL-Baselines, and the Open RL Benchmark. We selected GPI-LS (Alegre et al., 2023b) for comparison. This is a leading MORL algorithm that performs well across a range of settings and has demonstrated strong results on MO-SuperMario specifically. It is therefore a particularly appropriate baseline, both as a state-of-the-art method in the field and as one well suited to the domain highlighted by the reviewer.
Our goals with this additional experiment were twofold: (1) to demonstrate that OKB does scale to high-dimensional domains with challenging observation spaces; and (2) to show that it not only scales but also outperforms state-of-the-art methods on such tasks. Below, we report the mean return achieved by each method on the challenging domain highlighted by the reviewer. The table shows each method’s performance—mean return on a set of novel test tasks—at various points during training (e.g., after 100k, 200k time steps, etc.).
| Time Step | OKB (ours) | GPI-LS |
|---|---|---|
| 100k | 7.48 | 5.55 |
| 200k | 7.50 | 7.05 |
| 300k | 8.47 | 7.31 |
| 400k | 8.96 | 7.11 |
Table: Mean return on a set of novel test tasks, evaluated at different points during training.
To ensure our comparisons were statistically meaningful and to establish the robustness of our findings, we went beyond the results in the table above and analyzed the statistical properties of the percentage improvement of our method over the state-of-the-art baseline, computed at each evaluation point as the relative difference in return across all runs and time steps. These analyses revealed that:
- On average, across all runs and time steps, our method achieved a 20.21% higher return than the baseline.
- To go beyond average-case performance and fully address the reviewer's concern, we also quantified the confidence interval of this improvement. Our analysis shows that the performance advantage of our method lies within the interval [8.5%, 31.8%] with 95% confidence. Put differently, our method outperforms the baseline throughout the entire training process with high confidence, with performance gains typically ranging from approximately 10% to 30%, depending on the time step.
We hope that, by complementing the theory and experiments in the paper with the additional evaluation requested by the reviewer, we have fully addressed the final concern raised. These new results demonstrate that our method not only scales effectively to such domains but also maintains a consistent performance advantage over a state-of-the-art competitor. We will include these results in the final version of the paper. Thank you for suggesting this meaningful additional comparison.
Given that the analyses above provide strong, statistically grounded evidence directly addressing the reviewer's remaining question, we would be grateful if you would consider revisiting your score. If any aspects remain unclear, we welcome further feedback and would be happy to continue the discussion. Thank you again for your thoughtful comments.
Thanks for the additional experiments. The new results do seem quite promising for demonstrating the scalability of OKB! I have a few clarification questions regarding the new experiments:
- How many seeds did you use in the experiments? (e.g., how many training runs?) How were the results for GPI-LS obtained?
- What changes did you need to make to adapt OKB to the image domain? (e.g., what is the image encoder being used? What about GPI-LS?)
We truly appreciate your encouraging feedback on the additional results we presented! Thank you, also, for suggesting this additional relevant comparison.
Given the limited time available during the discussion period, we have so far conducted experiments using 7 random seeds, and we are currently running additional seeds to strengthen the statistical robustness of the analyses presented in our previous message.
Our preliminary analyses suggest that variability will remain low since all experimental results observed so far appear to be quite consistent in terms of the average performance attainable as the training process progresses. We believe this may be the case since this domain is fully deterministic, despite its high-dimensional state space. We will report all updated statistical analyses as soon as the ongoing experiments are complete.
Regarding your second point, we adapted our method to handle domains with image observations by incorporating an image encoder into the neural network architecture. To ensure fairness in the comparison, we used exactly the same encoder and architecture (layers, neurons, activation functions, etc.) as in the original GPI-LS paper, whose code and experimental results are publicly available through the MORL-Baselines library. Specifically, the architecture consists of a CNN similar to that used in the Nature paper that introduced the DQN algorithm [1], followed by a two-layer MLP with 512 neurons per layer. We also made minor modifications to other components, such as the replay buffer, to support storing and processing image observations.
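For concreteness, a minimal sketch of this encoder (assuming PyTorch and 84x84 observations with 4 stacked frames; exact input preprocessing for MO-SuperMario may differ from this illustration):

```python
import torch.nn as nn

# Convolutional stack as in Mnih et al. (2015), followed by a two-layer MLP
# with 512 units per layer, as described above.
encoder = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 spatial map for 84x84 inputs
    nn.Linear(512, 512), nn.ReLU(),
)
```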
Once again, thank you for the constructive feedback, for suggesting this additional relevant comparison, and for your encouraging words on the promise shown by the experimental results we shared in our previous message.
[1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning". Nature 518.7540 (2015).
Thank you for the additional clarifications! All of my concerns have been addressed. I am raising my score to 5.
Hi, Thanks for already engaging in the discussion. Can you respond to the latest author comments here so they may help clarify any remaining concerns?
This paper formalizes the idea of an optimal behavior basis for use with the Option Keyboard (OK), i.e., a set of base policies which are sufficient for OK to solve a family of MDPs optimally. Further, the authors propose Option Keyboard Basis (OKB), an incremental method for constructing a behavior basis and meta-policy which, combined, are provably optimal with respect to any of the tasks in the family. The paper includes the results of several experiments intended to answer specific research questions and demonstrate the effectiveness of OKB compared to baselines in different domains.
Strengths and Weaknesses
Strengths
- The issue of discovery of a set of policies to use with Successor Feature-based methods which are sufficient to behave optimally in all tasks in a family is interesting and an important one to solve.
- The formalization of the optimal behavior basis is simple but still meaningful.
- OKB appears to be a strong approach to producing a set of base policies and meta-policy which are optimal with respect to the family of tasks while requiring fewer base policies than existing methods.
- The experiments are nicely targeted to answer the specific research questions that are proposed in Section 5.
Weaknesses
- Only a single baseline that is not an ablation of OKB is used in each experiment. While this is sufficient to answer the specific research questions that the authors presented, it would be nice to see a broader comparison to compare this method to others that don't necessarily guarantee optimality across the family.
- For guaranteed optimality, the check for OK optimality requires knowledge of (or solving for) the optimal policy for the task in question, which is obviously not practical. While the sample-based method used in the experiments appears to be sufficient to allow OKB to be effective, there is little discussion of the best way to do this and no way to know how much suboptimality this approximation can introduce.
Questions
- Can you discuss how accurate the OK optimality test is in practice? Including results from domains where this can be checked exactly but are still large would be helpful.
- What other approaches did you consider for the OK optimality test, and why did you choose the one that you did?
- Is using the OK framework specifically more constraining than a CCS? In other words, does this limit the problems OKB can be used to solve compared to previous methods?
Limitations
- The authors admit that the OK optimality test must be approximated, but do not expand much on how their approximation method was chosen or how much suboptimality it might introduce.
Final Justification
The authors addressed the questions that I had regarding OKB. These were all things that I would like to see discussed somewhere in the paper, so I am happy that they intend to adjust their text to include them.
My rating remains Accept.
Formatting Concerns
None
We thank the reviewer for their positive assessment of our paper and for their encouraging comments on the relevance of the problem, the strength of our method, and the quality of our experiments. Below, we address the reviewer’s questions:
Comparison with methods that do not guarantee optimality
The reviewer highlighted that our comparisons were sufficient to positively answer all specific research questions and wondered how our method would perform when compared to weaker methods that do not necessarily guarantee optimality.
We appreciate this question and agree that broader comparisons are valuable. For a more detailed discussion—including additional baselines from the unsupervised skill discovery literature—please see our response to Reviewer 1Yjh. Please do not hesitate to let us know if you have any thoughts or further questions based on those results: we would be happy to discuss them further if helpful.
Do approximations to the OK optimality test affect correctness or optimality guarantees?
Thank you for raising this important point. Your question helped us recognize the need to further clarify that our method does not rely on an exact test to guarantee the correctness of the behavior basis it constructs. As briefly noted in line 226 of Section 4.1, the test introduced in Eq. (9) for assessing OK optimality is a sufficient (but not necessary) condition for guaranteeing optimality. This test is designed to serve two key purposes: (1) to avoid training on tasks whose optimal policies have either already been learned or can be reconstructed from the current behavior basis and meta-policy, and (2) to prioritize which tasks to train on in order to accelerate learning (see Appendix E for further discussion). In other words, the test is primarily a tool to reduce training effort, not a prerequisite for correctness.
More concretely, even if the test is imperfect, as long as the algorithm avoids re-solving previously trained tasks, the algorithm is still guaranteed to identify a behavior basis that supports zero-shot optimal synthesis for all tasks in the family. In cases where the test is imperfect, the algorithm may take longer to converge, potentially solving more tasks than strictly necessary, but all correctness guarantees remain intact. These guarantees follow from our argument in lines 189–200 and from established results in the literature (e.g., Theorem 3 in Roijers, 2016). In short, although approximate versions of Eq. (9) may slightly increase the number of training iterations required, they do not affect the method’s correctness or result in suboptimal solutions.
We appreciate your suggestion to make this point more explicit. In the final version, we will revise Section 4.1 to more clearly state that this test is a sufficient, but not necessary, condition for optimality. We will also move portions of the discussion from Appendix E into the main text to better emphasize that its primary role is to accelerate the identification of a behavior basis.
Is the OK framework more constraining than a CCS or prior methods?
This is a great question. The set of problems solvable by the OKB is no more limited than that of previous approaches; in fact, it is broader. As shown, for example, by Alegre et al. (2022), a Convex Coverage Set (CCS) is guaranteed to contain the optimal solution to every task in the family defined in Eq. (5). Since our method can directly reconstruct any element of the CCS (and, importantly, without having to train an individual policy for each), it follows that it can optimally solve all linear problems in that family. This result is established formally in Theorem 4.3, where we prove that the OKB is at least as expressive as a CCS; that is, it can reconstruct all policies in the CCS, and potentially more. Proposition 4.4 goes further, showing that the OKB is in fact strictly more expressive than a CCS. In particular, it supports the direct reconstruction of optimal solutions not only for all linear tasks, but also for a relevant class of non-linear problems. This means that while existing methods that compute a CCS are limited to linearly expressible tasks, our method is not only guaranteed to return zero-shot optimal solutions to all such tasks, but it can also—importantly—solve a broader class of non-linear problems that previous approaches could not handle.
We hope the comments above address your questions. Please feel free to reach out with any further thoughts—we would be glad to continue the discussion. Thank you once again for your thoughtful feedback and for taking the time to carefully read our work and share your insights.
I thank the authors for their answers to my questions as they helped clarify some aspects of OKB in my mind. I will keep my Accept rating.
We thank the reviewer once again for the thoughtful discussion, constructive feedback, and valuable comments. We greatly appreciate their positive and encouraging remarks on the relevance and importance of our contributions, as well as their support for the acceptance of our paper.
The paper proposes a method to construct an Option Keyboard basis: a basis of behaviors that can be used to express the solution for all MDPs within a family, within the Option Keyboard framework. The argument is centered around the idea of a CCS (convex coverage set), which is a set of policies that includes an optimal policy for any linear task. The challenge with a CCS is that it often grows exponentially with the number of reward features. The proposed methodology incrementally constructs a basis of behaviors that can express all policies in a CCS, but significantly more compactly. The authors provide empirical and theoretical results to this effect and validate on a number of simulated problems.
Reviewers brought up several key concerns during discussion: 1) limited baselines, 2) limited task complexity, and 3) theoretical justification of complexity. Through the rebuttal period and discussion, the authors were able to include more thorough comparisons to unsupervised RL methods, increase task complexity with additional tasks like MO-SuperMario, and provide an argument on the theoretical merits of their approach. The method does still need further validation and comparisons to make it really stand out in the literature, but it is a good addition to the community.