PaperHub
Overall rating: 6.8/10 · Poster · NeurIPS 2025
4 reviewers · Ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)
Average confidence: 3.3 · Novelty: 2.5 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.5

Deployment Efficient Reward-Free Exploration with Linear Function Approximation

OpenReview · PDF
Submitted: 2025-05-03 · Updated: 2025-10-29
TL;DR

We provide a computationally efficient algorithm that achieves $O(H)$ deployment cost with polynomial sample complexity.

Abstract

We study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of distinct policies implemented during learning. By ``deployment efficient'', we mean algorithms that require few policies deployed during exploration -- crucial in real-world applications where such deployments are costly or disruptive. We design a novel reinforcement learning algorithm that achieves near-optimal deployment efficiency for linear MDPs in the reward-free setting, using at most $H$ exploration policies during execution (where $H$ is the horizon length), while maintaining sample complexity polynomial in feature dimension and horizon length. Unlike previous approaches with similar deployment efficiency guarantees, our algorithm's sample complexity is independent of the reachability or explorability coefficients of the underlying MDP, which can be arbitrarily small and lead to unbounded sample complexity in certain cases -- directly addressing an open problem from prior work. Our technical contributions include a data-dependent method for truncating state-action pairs in linear MDPs, efficient offline policy evaluation and optimization algorithms for these truncated MDPs, and a careful integration of these components to implement reward-free exploration with linear function approximation without sacrificing deployment efficiency.
Keywords
Reinforcement learning, linear MDP, deployment efficiency

Reviews and Discussion

Review (Rating: 4)

In this paper, the authors study deployment-efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of distinct policies implemented during learning. For the task, the authors design a provably efficient algorithm with near optimal deployment efficiency while the sample complexity has polynomial dependence on relevant parameters. Compared to previous algorithms, this work removes a crucial assumption of reachability, directly addressing an open problem from prior work.

Strengths and Weaknesses

Strengths:

  1. The problem of deployment efficient RL is important and interesting.
  2. The technical results in this paper are solid, and the paper provides many novel techniques.
  3. This paper removes the assumption of reachability, which is a strong assumption in previous works.

Weaknesses:

  1. The sample complexity upper bound does not appear in the main part, and the bound is far from optimal. The bound is better than previous results only when the reachability coefficient in previous papers is very close to 0. While I understand the difficulty of removing the reachability assumption, it would be helpful to have more discussion of the difficulty of improving the current sample complexity.

  2. The proof overview in the main part is not very informative. The intuition behind the three statements in Lemma 6 is not clear and requires more explanation. A proof sketch (even in the appendix) showing the amount of data used by different parts of the algorithm (e.g., policy design, matrix evaluation) and how these data combine into the overall sample complexity would be helpful.

  3. I would suggest the authors polish the writing and check for typos in the main part. E.g., in line 7 of Algorithm 4 and (i) of Lemma 6, should the $\Lambda$ be $\Lambda^{-1}$?

Overall, I think removing the reachability assumption is an important contribution. Therefore, I slightly lean towards acceptance of the paper.

Questions

Please refer to the weaknesses.

Limitations

Yes.

Formatting Issues

NA

Author Response

We thank the reviewer for the insightful comments. Below we provide our responses to each point.

Regarding the bad dependencies on $d$, $H$, and $1/\epsilon$.

The main reason for the bad dependency is that the main algorithm (Algorithm 1) involves two iterative processes, and since the results of different iterations all rely on the same offline dataset, these results are subtly coupled with each other. To break the statistical coupling, we simply make independent copies of the offline dataset by following the exploration policy and repeatedly sampling trajectories with fresh randomness.
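To make the data-splitting idea concrete, here is a minimal sketch (our own illustration; the function name and the trajectory-sampler callable are hypothetical, not the paper's code) of how the independent sub-datasets are formed:

```python
from typing import Any, Callable, List

# Minimal sketch of the data-splitting trick described above: each sub-routine
# call receives its own freshly sampled sub-dataset collected under the same
# exploration policy, so estimates computed in different iterations are
# statistically independent (at the cost of a larger total sample complexity).
def collect_sub_datasets(sample_trajectory: Callable[[], Any],
                         num_sub_datasets: int,
                         episodes_per_sub_dataset: int) -> List[List[Any]]:
    return [
        [sample_trajectory() for _ in range(episodes_per_sub_dataset)]  # fresh randomness each time
        for _ in range(num_sub_datasets)
    ]
```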

One potential approach to improving the sample complexity is to prove the concentration inequalities (e.g., Lemma 17) through union bound arguments. The major difficulty with this approach is that the complexity of the offline planning policies in Algorithms 3 and 4 could be high.

Regarding the proof sketch of the sample complexity.

Thanks for the suggestion. In Section 6, we present the framework for proving the main theorem. The most complicated components are the proofs of Lemmas 6 and 15, for which we will provide high-level proof sketches in the next revision.

Here, we present a summary of the sample complexity for each component of the algorithm. We will provide the details in Appendix E.

We begin by counting the number of sub-datasets required throughout the learning process. The sample complexity (total number of episodes) is then obtained by multiplying this number by $N = \frac{10^{9} d^7 H^7 \log\left(\frac{dH}{\epsilon\delta}\right)}{\epsilon^3}$, the number of episodes in each sub-dataset.

Algorithm 4 (Truncated-Matrix-Eval): This algorithm requires at most $H$ sub-datasets.

Algorithm 3 (Matrix-Eval): This algorithm makes $O(m)$ calls to Algorithm 4, so it requires at most $O(Hm)$ sub-datasets in total.

The effectiveness of Algorithm 3 (including Algorithm 4) is established in Lemma 18, which shows that we can achieve efficient truncation with recursive truncated planning.

Algorithm 5 (Planning) and Algorithm 6 (Planning-R): These two algorithms share a similar structure, and each requires at most $H$ sub-datasets.

Algorithm 2 (Policy-Design): This algorithm makes $O(m)$ calls to Algorithm 3 and Algorithm 5, so it requires at most $O(m^2 H)$ sub-datasets.

Algorithm 1 (Exploration): This algorithm makes $H$ calls to Algorithm 2, so it requires at most $O(m^2 H^2)$ sub-datasets.

Putting it all together, the total number of episodes is $O(m^2 H^2 N) = \tilde{O}\left(\frac{d^{15} H^{14}}{\epsilon^5}\right)$.
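Spelling out the final multiplication (a back-of-envelope check using only the quantities above; the precise choice of $m$ is set in the paper, so the expression for $m$ below is our inference from the stated bound):

$$\underbrace{O(m^2 H^2)}_{\text{sub-datasets}} \times \underbrace{\frac{10^{9} d^7 H^7 \log\left(\frac{dH}{\epsilon\delta}\right)}{\epsilon^3}}_{N:\ \text{episodes per sub-dataset}} \;=\; \tilde{O}\left(\frac{d^{15} H^{14}}{\epsilon^5}\right), \quad \text{which is consistent with } m = \tilde{O}\left(\frac{d^4 H^{5/2}}{\epsilon}\right).$$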

Regarding the typos.

Thanks for pointing out the typos; we will fix them accordingly in the next revision.

Comment

Thanks for the detailed reply. I will keep my score.

Review (Rating: 4)

This paper studies the reward-free exploration problem in linear MDPs. In particular, a linear MDP setting is assumed without knowledge of the reward function. The goal is to accumulate enough data to find a good estimate of the underlying MDP. This estimate is then used in the next step to find an estimate of the optimal policy corresponding to an arbitrary reward function. The authors develop such an algorithm and show that it requires only a polynomial number of samples with respect to the problem parameters.

Strengths and Weaknesses

The main strength of the paper is removing the infrequent-direction assumption of prior work. In particular, this work does not assume a uniform lower bound on the eigenvalues of the $\Lambda_h$ matrix over all horizons. Instead, the authors devise Algorithm 3, which helps obtain a good dataset with at most $H$ deployments (rather than a number of deployments exponential in $H$, which could happen since the lower bound on the distribution over states can decrease exponentially with the horizon). The main weakness is that the problem formulation is relatively similar to prior work, and the result is summarized in just a theorem with polynomial dependence on $\epsilon$, $H$, $d$, and $\log(1/\delta)$. This polynomial dependence can be very bad, and no lower bounds or simulations are provided to support the result.
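For context, one common way to formalize such a reachability/explorability condition (our paraphrase; the precise statement appears in the prior work, e.g., Huang et al., 2022) is a uniform eigenvalue lower bound of the form

$$\max_{\pi} \, \lambda_{\min}\left( \mathbb{E}_{\pi}\left[ \phi(s_h, a_h)\, \phi(s_h, a_h)^{\top} \right] \right) \;\ge\; \nu \;>\; 0 \qquad \text{for all } h \in [H],$$

where $\nu$ is the reachability coefficient; the sample complexity of prior algorithms scales polynomially in $1/\nu$, which can be arbitrarily large.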

Questions

  • What is "z" in Algorithm 5?
  • Can you provide a sample complexity (with exact dependence on the parameters of the MDP) for the tabular MDP setting?

Limitations

Yes.

Final Justification

The authors conducted an experiment and claim that their algorithm outperforms prior work. If the paper gets accepted, I expect to see the results of the experiment in the final version of the paper.

Formatting Issues

  • The order of presentation in the paper needs to be changed. It is not helpful to have all the discussion on pages 5 and 6 before even introducing the algorithm.
  • When discussing infrequent directions, it is better to first define them clearly and explain the assumption that appears in the prior work. Just mentioning it in words is not sufficient and would confuse the reader.
  • The paper needs rewriting, since there are several grammatical errors.
Author Response

We are grateful to the reviewer for the constructive feedback. Below we present our response.

Regarding the bad polynomial dependency.

Indeed, our current sample complexity is far from optimal. The main reason for the bad dependency is that the main algorithm (Algorithm 1) involves two iterative processes, and since the results of different iterations all rely on the same offline dataset, these results are subtly coupled with each other. To break the statistical coupling, we simply make independent copies of the offline dataset by following the exploration policy and repeatedly sampling trajectories with fresh randomness.

One potential approach to improving the sample complexity is to prove the concentration inequalities (e.g., Lemma 17) through union bound arguments. The major difficulty with this approach is that the complexity of the offline planning policies in Algorithms 3 and 4 could be high.

Regarding the lower bound.

While the precise trade-off between sample complexity and deployment complexity remains unknown, Theorem 3.2 in Huang et al., 2022 provides an $\Omega(H/\log_d(1/\epsilon))$ lower bound on the deployment complexity under polynomial sample complexity constraints.

Regarding the numerical experiments.

The primary goal of this work is to eliminate dependence on reachability in both sample complexity and deployment complexity. To ensure our algorithm outperforms reachability-dependent approaches, we require the reachability parameter to be exponentially small. However, this leads to highly artificial MDP constructions. In practice, features are typically learned via neural networks, making it unclear how to test reachability parameters in such settings. Since different features yield different reachability values, evaluating the algorithm in a natural, practical environment becomes challenging.

Regarding $z$ in Algorithm 5.

We set $z = \frac{100000\epsilon^2}{d^4 H^5}$. Please refer to line 781 for the parameter settings.

Regarding the tabular case.

Let $S$ and $A$ be the sizes of the state space and the action space, respectively. By the method in (Zhang et al., 2022), one could reach the asymptotically optimal $\tilde{O}\left(\frac{SAH^3}{\epsilon^2}\right)$ sample complexity (in episodes) with $H+1$ deployments. In this approach, we use the first $H$ deployments (which correspond to $H$ batches in Zhang et al., 2022) to learn an approximate transition model $\tilde{P}$ using $\mathrm{poly}(S,A,H)$ episodes. Then we design an exploration policy $\pi'$ with minimax coverage under the approximate model $\tilde{P}$ such that $\max_{\pi} \sum_{s,a,h} \frac{d_{\pi}(h,s,a)}{d_{\pi'}(h,s,a)} = O(SAH)$. Here $d_{\pi}(h,s,a) = \mathbb{E}_{\pi,\tilde{P}}\, \mathbb{I}(s_h=s, a_h=a)$ denotes the occupancy distribution of the policy $\pi$ under the transition $\tilde{P}$. By running $\pi'$ for $O\left(\frac{SAH^3}{\epsilon^2}\right)$ episodes, we could learn an $\epsilon$-optimal policy. Similar approaches have also been explored by Qiao et al. (2022).
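As a concrete illustration of the occupancy distribution and the coverage quantity above, here is a minimal tabular sketch (our own code, not from the paper; the array layout and function names are assumptions):

```python
import numpy as np

def occupancy(P_tilde: np.ndarray, pi: np.ndarray, s_init: int = 0) -> np.ndarray:
    """Occupancy measures d_pi(h, s, a) under a known tabular model P_tilde.

    P_tilde[h, s, a, s'] is the probability of moving to s' from (s, a) at level h,
    and pi[h, s, a] is the probability of taking action a in state s at level h.
    """
    H, S, A, _ = P_tilde.shape
    d = np.zeros((H, S, A))
    state_dist = np.zeros(S)
    state_dist[s_init] = 1.0
    for h in range(H):
        d[h] = state_dist[:, None] * pi[h]                      # Pr[s_h = s] * pi(a | s)
        state_dist = np.einsum("sa,sap->p", d[h], P_tilde[h])   # push forward one step
    return d

def coverage_ratio(P_tilde: np.ndarray, pi_explore: np.ndarray, pi_eval: np.ndarray) -> float:
    """sum_{h,s,a} d_pi(h,s,a) / d_pi'(h,s,a) for one candidate policy pi; the
    minimax coverage condition above bounds the maximum of this quantity over all pi by O(SAH)."""
    d_explore = occupancy(P_tilde, pi_explore)
    d_target = occupancy(P_tilde, pi_eval)
    return float(np.sum(d_target / np.maximum(d_explore, 1e-12)))
```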

References:

Huang et al., 2022: Towards deployment-efficient reinforcement learning: lower bound and optimality.

Zhang et al., 2022: Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning.

Qiao et al., 2022: Sample-efficient reinforcement learning with $\log\log(T)$ switching cost.

Comment

Thank you for the detailed response. For experimental results, can you test your algorithm on a small, simple MDP with a few states and actions?

Comment

We would like to thank the reviewer for the constructive comments. As suggested by the reviewer, we conducted experiments on the following small and simple MDP.

The horizon length is $H = 7$, meaning that there are $7$ levels in the MDP. The action space contains three actions: $a_1, a_2, a_3$. There is a single initial state $s_0$ at the first level. In the middle levels, i.e., $2 \le h \le 6$, there are three different states: $s_1, s_2, s_3$. At the last level $h = 7$, there are two terminal states: $s_{\text{win}}$ and $s_{\text{lose}}$.

The transitions are defined as follows:

  • Choosing action $a_i$ at the initial state $s_0$ transitions to state $s_i$ at $h = 2$.
  • For $2 \le h \le 5$, the transitions are independent of the action chosen:
    • $s_1$ and $s_2$ transition back to themselves at the next level.
    • $s_3$ transitions to $s_2$ and $s_3$ with equal probability, i.e., $1/2$.
  • At level $h = 6$, the transitions are probabilistic and independent of the action chosen:
    • State $s_1$ transitions to $s_{\text{win}}$ with probability $1/2 + \epsilon$ and to $s_{\text{lose}}$ with probability $1/2 - \epsilon$.
    • States $s_2$ and $s_3$ transition to $s_{\text{win}}$ with probability $1/2 - \epsilon$ and to $s_{\text{lose}}$ with probability $1/2 + \epsilon$.

We set $\epsilon = 0.05$ in our experiments. The agent receives a reward of $1000$ at $s_{\text{win}}$ and a reward of $-1000$ at $s_{\text{lose}}$. The reward values for all other states are set to $0$.
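For concreteness, the toy MDP above can be written down as a small simulator. The sketch below is our own illustrative reconstruction (not the authors' experimental code); the state labels, function names, and rollout helper are assumptions:

```python
import random

H = 7                      # number of levels
EPS = 0.05                 # the epsilon used in the final stochastic transition
R_WIN, R_LOSE = 1000, -1000

def step(h: int, state: str, action: str):
    """One transition of the toy MDP: returns (next_state, reward)."""
    if h == 1:                           # s0 -> s_i according to the chosen action a_i
        return "s" + action[1], 0
    if 2 <= h <= 5:                      # action-independent middle levels
        if state in ("s1", "s2"):
            return state, 0              # s1 and s2 stay where they are
        return random.choice(["s2", "s3"]), 0   # s3 moves to s2 or s3 w.p. 1/2 each
    # h == 6: stochastic transition into the terminal states
    p_win = 0.5 + EPS if state == "s1" else 0.5 - EPS
    return ("s_win", R_WIN) if random.random() < p_win else ("s_lose", R_LOSE)

def rollout(policy) -> int:
    """Run one episode; policy(h, state) returns one of 'a1', 'a2', 'a3'."""
    state, total = "s0", 0
    for h in range(1, H):                # 6 transitions across the 7 levels
        state, r = step(h, state, policy(h, state))
        total += r
    return total

# Usage: estimate the return of the policy that always plays a1 at the root.
always_a1 = lambda h, s: "a1"
print(sum(rollout(always_a1) for _ in range(10000)) / 10000)
```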

We compare the new algorithm in our paper with Algorithm 2 from the prior work by Huang et al. [2022]. This is a fair comparison, as these two algorithms have the same deployment complexity. In our experiments, we vary the number of trajectories per deployment (denoted as $N$) and report the expected total reward of the returned policy, averaged over 1000 repetitions. Note that in the underlying MDP, the smallest state-reaching probability is $2^{-4} = 0.0625$, which is not exceptionally small. However, even in this setting, we observe the following results.

$N$        Our algorithm   Huang et al.
30         1.86            -4.14
300        1.92            -3.66
3000       4.2             -0.84
30000      9.36            4.52
300000     10.00           9.54

From the results, it is clear that our new algorithm outperforms the algorithm by Huang et al. [2022] across a wide range of $N$, as long as $N$ is not very large. This aligns with our theoretical analysis, as our algorithm quickly identifies $s_3$ as an infrequent state (dimension), and will not waste samples trying to reach $s_3$. On the other hand, such a mechanism does not exist in the previous algorithm by Huang et al. [2022] (as it assumes that all directions (states) are reachable). Consequently, it will waste a significant amount of samples on infrequent states (dimensions).

We note that it is possible to further increase the performance gap between our algorithm and that of Huang et al. [2022], by possibly adding more infrequent states (dimensions) and more levels. Given the time constraints of the author-reviewer discussion period, we will add more discussion on these experiments to our final version.

Comment

Dear Reviewer Ty3N,

We sincerely thank you again for your valuable feedback, which has greatly helped us improve the quality of our work. As the discussion deadline is approaching, we kindly request that you review our rebuttal and our response on the experiments. We also sincerely ask you to consider increasing the score if our responses have resolved the main concerns.

Thank you again for your time and effort!

Review (Rating: 5)

The paper proposes an algorithm for reward-free exploration in linear MDPs. Its main result is a nearly optimal deployment complexity (i.e., the number of different exploration policies used to collect data) of $O(H)$, and a polynomial (though suboptimal) sample complexity. Its main technical novelty is providing such guarantees without assumptions on the reachability coefficients of the MDP.

Strengths and Weaknesses

Strengths

  • The paper is organised and written clearly.

  • The authors provide the first nearly optimal algorithm for their setting without making assumptions on reachability coefficients of the MDP.

Weaknesses

  • Lack of experiments (even on toy MDPs).

  • According to Assumption 2, the considered linear MDP setting assumes a finite number of states, unlike Jin et al. 2019.

Suggestions

  • A table including the deployment complexity and the sample complexity of your algorithm and related work would be useful (even if in the appendix).

Questions

  • From line 782, the sample complexity of your algorithm is $O(d^{15} H^{15} / \epsilon^5)$, right? This information would be useful in the main text. An interesting direction for future work could be improving this sample complexity. Hence, it would be helpful if you could summarise the main causes of sub-optimality there!

  • In Algorithm 1, there seem to be exactly $H$ policy deployments. Why is the deployment complexity in Theorem 4 $O(H)$ and not exactly $H$?

  • Out of curiosity, is there any result showing whether sample complexity and deployment complexity are conflicting objectives (i.e., whether both could be near-optimal)?

  • I understand that this is a theoretical RL paper, but even in that case, I believe that experiments (on toy environments at least) are important. For example, one important contribution of the paper is obtaining guarantees without reachability assumptions. From what I understood, this is also a consequence of algorithmic improvements (not only proof techniques), so it would be interesting to see (i) whether your algorithm works in a toy MDP with hard-to-reach states, and (ii) whether previous algorithms that require reachability assumptions get stuck in that case. I also understand that, in order for it to be feasible to run experiments, some theoretical constants (often too large) need to be relaxed empirically, but that shouldn't be difficult, and it would illustrate the main algorithmic ideas that make exploration work.

Limitations

Yes.

Final Justification

I think the paper brings a relevant theoretical contribution to the field. I increase my rating from 4 to 5 following the authors' responses and the experimental validation on a toy experiment.

Formatting Issues

None.

Author Response

We appreciate the reviewer's valuable comments. We respond to each comment below.

Regarding the numerical experiments.

We thank the reviewer for the suggestion. The primary goal of this work is to eliminate dependence on reachability in both sample complexity and deployment complexity. To ensure our algorithm outperforms reachability-dependent approaches, we require the reachability parameter to be exponentially small. However, this leads to highly artificial MDP constructions. In practice, features are typically learned via neural networks, making it unclear how to test reachability parameters in such settings. Since different features yield different reachability values, evaluating the algorithm in a natural, practical environment becomes challenging.

Regarding the hardness of improving the sample complexity.

You are right that the sample complexity of our algorithm is $O(d^{15} H^{15}/\epsilon^5)$. In the next revision, we will provide the precise sample complexity in the theorem statement.

The main reason for the bad dependency is that the main algorithm (Algorithm 1) involves two iterative processes, and since the results of different iterations all rely on the same offline dataset, these results are subtly coupled with each other. To break the statistical coupling, we simply make independent copies of the offline dataset by following the exploration policy and repeatedly sampling trajectories with fresh randomness.

One potential approach to improving the sample complexity is to prove the concentration inequalities (e.g., Lemma 17) through union bound arguments. The major difficulty with this approach is that the complexity of the offline planning policies in Algorithms 3 and 4 could be high.

Regarding the trade-off between deployment efficiency and sample efficiency.

By Theorem 3.2 in Huang et al., 2022, achieving polynomial sample complexity requires at least $cH/\log_d(1/\epsilon)$ deployments for some constant $c > 0$. Any deployment budget below this threshold results in super-polynomial sample complexity. With $O(H)$ deployments, our algorithm achieves polynomial sample complexity. Moreover, with $\tilde{O}(dH)$ deployments, one could learn an $\epsilon$-optimal policy with a near-optimal sample complexity of $\tilde{O}\left(\frac{d^2 H^3}{\epsilon^2}\right)$ with the help of online learning methods (He et al., 2022).

The remaining challenge is to determine the deployment complexity threshold for achieving near-optimal sample complexity. Specifically, we ask: are $\Omega(dH)$ deployments necessary to attain the near-optimal sample complexity of $\tilde{O}\left(\frac{d^2 H^3}{\epsilon^2}\right)$? We leave this problem as an open question for future research.

Regarding the finite state space.

We are sorry for the confusion. We will redefine $\mu$ as $d$ measures $(\mu_1, \mu_2, \ldots, \mu_d)$ over the state space $\mathcal{S}$, as in Jin et al., 2019. We emphasize that our algorithm and theoretical analysis hold even for infinite state spaces.
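For reference, the linear MDP condition of Jin et al., 2019 (restated here from that work as context) requires that, for each level $h$, the transition kernel and reward are linear in a known feature map $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$:

$$P_h(\cdot \mid s, a) = \langle \phi(s, a), \mu_h(\cdot) \rangle, \qquad r_h(s, a) = \langle \phi(s, a), \theta_h \rangle,$$

where $\mu_h = (\mu_h^{(1)}, \ldots, \mu_h^{(d)})$ is a vector of $d$ (possibly signed) measures over $\mathcal{S}$ and $\theta_h \in \mathbb{R}^d$; this formulation places no restriction on the cardinality of $\mathcal{S}$.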

About the exact number of deployments.

The number of deployments is exactly $H$, assuming the initial state is fixed as $s_1$. Without this assumption, we need $H+1$ deployments. We will clarify this in the next revision.

Regarding the table to summarize the related works.

We thank the reviewer for this suggestion and will include the table in the next revision.

References:

Huang et al., 2022: Towards deployment-efficient reinforcement learning: lower bound and optimality.

He et al., 2022: Nearly minimax optimal reinforcement learning for linear Markov Decision Processes.

Jin et al., 2019: Provably efficient reinforcement learning with linear function approximation.

Comment

Thank you for the clarifications! It's good to know that the guarantees apply to infinite state spaces. I have no other concerns and believe that the paper brings a relevant theoretical contribution to the field. I decided to keep my rating (4), as I believe that the paper would have a higher impact if it brought algorithmic ideas that could be used in practical implementations.

Comment

As suggested by the reviewer, we conducted experiments on the following small and simple MDP to illustrate the main algorithmic ideas.

The horizon length is $H = 7$, meaning that there are $7$ levels in the MDP. The action space contains three actions: $a_1, a_2, a_3$. There is a single initial state $s_0$ at the first level. In the middle levels, i.e., $2 \le h \le 6$, there are three different states: $s_1, s_2, s_3$. At the last level $h = 7$, there are two terminal states: $s_{\text{win}}$ and $s_{\text{lose}}$.

The transitions are defined as follows:

  • Choosing action $a_i$ at the initial state $s_0$ transitions to state $s_i$ at $h = 2$.
  • For $2 \le h \le 5$, the transitions are independent of the action chosen:
    • $s_1$ and $s_2$ transition back to themselves at the next level.
    • $s_3$ transitions to $s_2$ and $s_3$ with equal probability, i.e., $1/2$.
  • At level $h = 6$, the transitions are probabilistic and independent of the action chosen:
    • State $s_1$ transitions to $s_{\text{win}}$ with probability $1/2 + \epsilon$ and to $s_{\text{lose}}$ with probability $1/2 - \epsilon$.
    • States $s_2$ and $s_3$ transition to $s_{\text{win}}$ with probability $1/2 - \epsilon$ and to $s_{\text{lose}}$ with probability $1/2 + \epsilon$.

We set $\epsilon = 0.05$ in our experiments. The agent receives a reward of $1000$ at $s_{\text{win}}$ and a reward of $-1000$ at $s_{\text{lose}}$. The reward values for all other states are set to $0$.

We compare the new algorithm in our paper with Algorithm 2 from the prior work by Huang et al. [2022]. This is a fair comparison, as these two algorithms have the same deployment complexity. In our experiments, we vary the number of trajectories per deployment (denoted as $N$) and report the expected total reward of the returned policy, averaged over 1000 repetitions. Note that in the underlying MDP, the smallest state-reaching probability is $2^{-4} = 0.0625$, which is not exceptionally small. However, even in this setting, we observe the following results.

$N$        Our algorithm   Huang et al.
30         1.86            -4.14
300        1.92            -3.66
3000       4.2             -0.84
30000      9.36            4.52
300000     10.00           9.54

From the results, it is clear that our new algorithm outperforms the algorithm by Huang et al. [2022] across a wide range of $N$, as long as $N$ is not very large. This aligns with our theoretical analysis, as our algorithm quickly identifies $s_3$ as an infrequent state (dimension), and will not waste samples trying to reach $s_3$. On the other hand, such a mechanism does not exist in the previous algorithm by Huang et al. [2022] (as it assumes that all directions (states) are reachable). Consequently, it will waste a significant amount of samples on infrequent states (dimensions).

We note that it is possible to further increase the performance gap between our algorithm and that of Huang et al. [2022], by possibly adding more infrequent states (dimensions) and more levels. Given the time constraints of the author-reviewer discussion period, we will add more discussion on these experiments to our final version.

Comment

Thank you, I believe that this experimental validation is a very nice sanity check for the novel theoretical ideas. Also, if you needed any relaxation on the theoretical constants to make it work, it would be nice to mention it in the appendix.

Given this validation, I decided to increase my score to 5.

Review (Rating: 4)

This paper studies reward-free reinforcement learning under a linear model and aims to design a deployment-efficient algorithm, i.e., one where the number of exploration policies used is significantly smaller than the number of online episodes. The paper proposes a new algorithm that only needs $O(H)$ exploration policies to output an $\epsilon$-optimal policy in the planning phase, where $H$ is the horizon length.

Strengths and Weaknesses

Strengths:

  1. The theoretical result seems novel, and it shows that the number of exploration policies is independent of the sub-optimality gap $\epsilon$.
  2. The algorithm design is new in terms of handling infrequent visits in the linear MDP setting.

Weaknesses:

  1. While the algorithm design is indeed novel and provides insights on how to design deployment-efficient algorithms in linear or low-rank RL, the problem is limited to linear MDPs with known features, which reduces the significance.
  2. It would be better to provide a comparison table of the results of related works, showing the deployment complexity, the sample complexity, and the MDP setting.

Questions

  1. Can the authors discuss the trade-off between deployment efficiency and sample efficiency? Is there any fundamental hardness here (e.g., we cannot achieve deployment efficiency and sample efficiency at the same time), or is it because the technical tools we have so far have limitations? (While there is some discussion after Theorem 4, it seems to be about the reachability parameter rather than the deployment complexity and sample complexity themselves, if I understand correctly.)

Limitations

Yes

Final Justification

The authors have addressed my major concerns. I decide to keep my score.

Formatting Issues

No further concerns.

Author Response

We thank the reviewer for the valuable comments. Below we provide our responses to each point.

Regarding the linear MDP assumption.

While our algorithm is currently limited to linear MDPs, extending these results to low-rank MDPs or MDPs with linear Bellman completeness would be an interesting direction for future work.

Regarding the table to summarize the related works.

Thanks for the suggestion. We will include this table in the next revision. The most important related works include: (i) Huang et al., 2022, which achieves $O(H)$ deployment complexity with the reachability assumption; (ii) Zhao et al., 2023, which obtains $O(dH)$ deployment complexity for general function approximation.

Regarding the trade-off between deployment efficiency and sample efficiency.

By Theorem 3.2 in Huang et al., 2022, achieving polynomial sample complexity requires at least $cH/\log_d(1/\epsilon)$ deployments for some constant $c > 0$. Any deployment budget below this threshold results in super-polynomial sample complexity. With $O(H)$ deployments, our algorithm achieves polynomial sample complexity. Moreover, with $\tilde{O}(dH)$ deployments, one could learn an $\epsilon$-optimal policy with a near-optimal sample complexity of $\tilde{O}\left(\frac{d^2 H^3}{\epsilon^2}\right)$ with the help of online learning methods (He et al., 2022).

The remaining challenge is to determine the deployment complexity threshold for achieving near-optimal sample complexity. Specifically, we ask: are $\Omega(dH)$ deployments necessary to attain the near-optimal sample complexity of $\tilde{O}\left(\frac{d^2 H^3}{\epsilon^2}\right)$? When the reachability assumption holds, it could be possible to improve the deployment complexity to $H$ while maintaining near-optimal sample complexity with existing techniques. However, when the reachability assumption does not hold, the technical tools we have so far fail to address this problem. We leave this problem as an open question for future research.

References:

Huang et al., 2022: Towards deployment-efficient reinforcement learning: lower bound and optimality.

Zhao et al., 2023: A nearly optimal and low-switching algorithm for reinforcement learning with general function approximation.

He et al., 2022: Nearly minimax optimal reinforcement learning for linear Markov Decision Processes.

Comment

Thank you for your response and for addressing my concerns. I have no further questions and decide to keep my score.

Final Decision

All reviewers agree that the paper is a solid theoretical contribution to the problem of deployment-efficient exploration with function approximation. It is the first efficient algorithm that handles the whole family of linear MDPs. Existing work, e.g., Huang et al. and Qiao et al., while providing better dependence on parameters like $H$ and $d$, has sample complexity that (at least in the finite-sample regime) depends on certain "reachability" conditions.

The results are primarily technical, and the paper is recommended for acceptance as a theoretical work without experiments. The authors should work on addressing all reviewer comments (including improving clarity) for the camera-ready version.