PaperHub
5.3 / 10
Poster · 4 reviewers
Lowest 4 · Highest 6 · Std. dev. 0.8
Ratings: 4, 5, 6, 6
Confidence: 2.8
Correctness: 2.8
Contribution: 2.5
Presentation: 2.8
NeurIPS 2024

Achieving Constant Regret in Linear Markov Decision Processes

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

We study the constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs) where both the transition kernel and the reward function can be approximated by some linear function up to misspecification level $\zeta$. At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression, enabling us to establish an instance-dependent regret bound that is constant w.r.t. the number of episodes. Specifically, we demonstrate that for a linear MDP characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}H^2))$. Here $d$ is the dimension of the feature space and $H$ is the horizon. Remarkably, this regret bound is independent of the number of episodes $K$. To the best of our knowledge, Cert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent, high-probability regret bound in RL with linear function approximation without relying on prior distribution assumptions.
Keywords
Reinforcement learning · Constant regret · Misspecified linear MDP

Reviews and Discussion

Review
Rating: 4

The paper studies constant regret guarantees for linear MDPs. The main result is the first algorithm with this guarantee. The algorithm is a modification of LSVI-UCB with a careful quantization technique and some form of action elimination. This improves upon the regret of previous works by a factor of $\log K$, where $K$ is the number of episodes. Unlike previous works, the algorithm is robust to some degree of model misspecification and does not need to know it in advance.

Strengths

  1. The method is robust to some model misspecification and does not need to know it in advance.
  2. The regret guarantee is tighter compared to previous works.
  3. The algorithm is computationally efficient in the sense that it runs in polynomial time in the problem parameters.
  4. There are significant technical challenges in the work. In particular, the careful quantization required to avoid dependence on $K$ and the refined analysis of the regret bound in terms of the suboptimality gaps.

Weaknesses

  1. While it's challenging to achieve regret that is independent of $K$, I do not see this as a significant improvement over the logarithmic dependence. This is especially true since it is not clear whether the correct polynomial dependence on the remaining parameters has been established. Moreover, this improvement is only relevant in the regime of a fixed success probability $\delta$, which is often dependent on $K$, e.g., if bounding the expected regret.
  2. The improvement to the regret comes at a significant increase to the algorithm's complexity. It's not clear whether this is necessary. For example, are you certain that the quantization cannot be replaced by a more careful analysis of the value cover? Why do you need both bonuses and action elimination and sample rejection? The runtime complexity also seems higher (though this is a minor concern). If you think these modifications are indeed necessary, perhaps you could demonstrate this with experiments.
  3. While I am not familiar with better results, the dependence on the worst case gap weakens the result. Do you think it is possible to achieve a better dependence? e.g. on the average inverse gap with respect to the optimal policy.
  4. The writing is not very consistent with some places having phrasing and grammatical issues (e.g., quantification should be quantization and many more).
  5. The algorithm needs to be better explained. While it seems the authors made significant efforts in this regard, many parts of the algorithm remain rather unclear. For example, the bonuses are not immediately apparent. As I understand it, you use a quantization of the standard bonus. It would be much easier for the reader if this is made explicit in the algorithm and explained. Other examples are the action elimination and sample rejection whose purpose is unclear.

Questions

See above.

Limitations

N/a

Author Response

Thank you for your detailed feedback, and we address your questions as follows:

Q1. Why is it significant to improve the logarithmic dependence on $K$?

A1. We would like to first highlight that achieving constant regret is actually an important topic in bandits (Abbasi-Yadkori et al., 2011; Papini et al., 2021a; Zhang et al., 2023) and RL (Papini et al., 2021b). This is not merely shaving off a $\log$ factor but a stronger performance guarantee than logarithmic regret. Specifically, the constant regret bound indicates that the agent incurs only a finite number of mistakes when making decisions over an infinite run, whereas the number of mistakes can become arbitrarily large if the regret guarantee is only logarithmic.


Q2. Is the correct polynomial dependence on the remaining parameters established?

A2. We would like to point out that our algorithm suggests a $\tilde O(d^3H^5\Delta^{-1})$ polynomial dependence on the parameters $H, d, \Delta^{-1}$, which is consistent with previous literature (He et al., Papini et al.), as discussed in Table 1. Therefore, we are confident that the polynomial dependence on these parameters is correct.


Q3. What is the necessity of each component in the algorithm?

A3. We believe both the certified estimator and the layer-dependent quantization are necessary for our theoretical analysis. As discussed in Section 6, certified estimators build the connection between different layers $l$ so that we can avoid taking the union bound over all $L = \log K$ layers during the concentration. Layer-dependent quantization removes the additional $\log K$ dependence introduced during the covering of the value function in Vial et al.

In addition, we have also conducted experiments on a synthetic dataset as presented in the response to all reviewers. In particular, the certified estimator helps stabilize the estimation so that the algorithm can achieve lower regret when the misspecification grows larger. It is true that the quantization does not significantly contribute to improving the regret, which aligns with our intuition since the numerical results are intrinsically discrete and quantized. In addition, experimental results also suggest that neither the certified estimator nor the quantization decreases the efficiency of the algorithm.


Q4. Why is the analysis based on the minimal sub-optimality gap instead of other gaps (like averaged gaps)?

A4. The averaged sub-optimality gap, as you suggested, can only be applied to the tabular bandit setting, to the best of our knowledge. In reinforcement learning with function approximation, the number of actions can be arbitrarily large, making it difficult to define an 'averaged' sub-optimality gap in this setting.


Q5. Could you explain the details of the algorithm? Why do you need both the bonus and arm elimination?

A5. First, we would like to clarify that our algorithm does not use rejection sampling; instead, we call the exit condition in Line 11 of Algorithm 2 the certified estimator, which ensures that the optimistic estimator for the current layer $l$ is always greater than the pessimistic estimator from the previous layer $l-1$, and vice versa. Therefore, if the concentration at layer $l$ fails, it is detected and the quality of the estimation is not affected, since the certified estimator (Line 11) is triggered and the algorithm returns the estimate from the previous layer. There is no connection between the certified estimator and rejection sampling, since none of $V, \hat V, \check V$ are sampled from a distribution and our algorithm is a learning algorithm rather than a sampling algorithm.

Second, regarding the bonus and the arm elimination, we quantize the parameters $\mathbf w$ and $\mathbf U$ in Line 8 of Algorithm 1 in order to control the size of the function class. Both the bonus term $\lVert\boldsymbol{\phi}(s, a)\rVert_{\tilde{\mathbf U}}$ and the $Q$ functions are computed using the quantized $\tilde{\mathbf w}$ and $\tilde{\mathbf U}$. In Algorithm 2, for each layer $l$, any action whose bonus term is larger than $O(2^{-l})$ is directly selected for exploration.

Third, regarding the arm elimination process, for each layer $l$ in Algorithm 2 we eliminate all actions $a$ for which $Q_{h,k}^l(s, a)$ is smaller than the estimated optimal value function $V_{h,k}^l(s)$ by more than $4 \cdot 2^{-l}$. As a result, arm elimination shrinks the action set before proceeding to the next phase $l+1$. Since the algorithm pursues higher-precision decisions in deeper layers, which are more vulnerable to model misspecification, arm elimination removes suboptimal arms in the shallow layers in order to deliver a more robust decision.

Additionally, in response to your question about the simultaneous use of bonus and arm elimination, it's important to note that this combination is commonly employed in bandits (Chu et al., Takemura et al.) and reinforcement learning (Vial et al.). Therefore, our design choice is conventional and aligns with standard strategies in the field.
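To make the layered procedure described above more concrete, below is a minimal Python sketch of such a per-layer loop. It is a conceptual illustration of the components discussed in this response (the $O(2^{-l})$ bonus threshold, the certified-estimator exit check, and the $4\cdot 2^{-l}$ action elimination); it is not the paper's Algorithm 2, and all function and variable names are ours.

```python
import numpy as np

def select_action_layered(phi, w_tilde, U_tilde_inv, actions, L=10):
    """Conceptual sketch of a layered bonus / action-elimination scheme.

    phi[a]         : feature vector of action a at the current state
    w_tilde[l]     : quantized regression weights for layer l
    U_tilde_inv[l] : quantized inverse covariance matrix for layer l

    This illustrates the ideas discussed in the rebuttal (bonus threshold,
    certified-estimator exit, action elimination); it is NOT Algorithm 2.
    """
    active = list(actions)
    prev_opt, prev_pess = np.inf, -np.inf   # layer l-1 value estimates
    prev_greedy = active[0]
    for l in range(1, L + 1):
        bonus = {a: float(np.sqrt(phi[a] @ U_tilde_inv[l] @ phi[a])) for a in active}
        # Actions that are still too uncertain at resolution 2^{-l} get explored directly.
        uncertain = [a for a in active if bonus[a] > 2.0 ** (-l)]
        if uncertain:
            return uncertain[0]
        q = {a: float(phi[a] @ w_tilde[l]) for a in active}
        opt = max(q[a] + bonus[a] for a in active)    # optimistic value estimate
        pess = max(q[a] - bonus[a] for a in active)   # pessimistic value estimate
        # Certified-estimator-style exit: the optimistic estimate of layer l must stay
        # above the pessimistic one of layer l-1 (and vice versa); otherwise fall back.
        if opt < prev_pess or pess > prev_opt:
            return prev_greedy
        # Action elimination: drop actions that look clearly suboptimal at this resolution.
        active = [a for a in active if q[a] + bonus[a] >= opt - 4 * 2.0 ** (-l)]
        prev_opt, prev_pess = opt, pess
        prev_greedy = max(active, key=lambda a: q[a])
    return prev_greedy

# Tiny usage example with random data (d = 4 features, 3 actions, 5 layers).
rng = np.random.default_rng(0)
d, A, L = 4, 3, 5
phi = {a: rng.normal(size=d) for a in range(A)}
w_tilde = {l: rng.normal(size=d) for l in range(1, L + 1)}
U_tilde_inv = {l: np.eye(d) / (10.0 * l) for l in range(1, L + 1)}
print(select_action_layered(phi, w_tilde, U_tilde_inv, actions=range(A), L=L))
```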


  • Abbasi-Yadkori et al. "Improved algorithms for linear stochastic bandits." NeurIPS, 2011
  • Papini, Matteo, et al. "Leveraging good representations in linear contextual bandits." ICML, 2021a
  • Papini, Matteo, et al. "Reinforcement learning in linear mdps: Constant regret and representation selection." NeurIPS, 2021b
  • Zhang, Weitong, et al. "On the interplay between misspecification and sub-optimality gap in linear contextual bandits." ICML, 2023.
  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
  • Chu, Wei, et al. "Contextual bandits with linear payoff functions." AISTATS, 2011
  • Takemura, Kei, et al. "A parameter-free algorithm for misspecified linear contextual bandits." AISTATS, 2021
Comment

Thank you for the detailed response.

[A1]: I am fully aware that "shaving" this log factor achieves independence of the time horizon. However, this is only with a fixed probability $1-\delta$. When considering an infinite time horizon, it seems unlikely that $\delta$ would not scale with the horizon as this means we can incur linear regret with constant probability.

[A2]: The fact that previous works achieve the same dependence on $d, H, \Delta^{-1}$ does not imply that this is the correct dependence. Many works indeed focus on achieving improved dependence on $K$ as long as other parameters scale polynomially. However, I find this sensible only when the dependence on $K$ is polynomial. When it is logarithmic, I find it very likely that shaving a factor of $d$ or $H$ would be more significant than shaving a $\log K$ factor.

[A3+5]: Thank you for the explanations. While I'm not fully convinced of the necessity of each part, I get the general picture. Regardless, this is a secondary concern.

[A4]: Regarding gaps, is there a lower bound showing this is impossible in general? I am aware that this concern is not unique to this work but it is a weakness nonetheless.

Comment

Thank you for your reply. We addressed your questions and concerns as follows.

Q1. When considering an infinite time horizon, it seems unlikely that $\delta$ would not scale with the horizon, as this means we can incur linear regret with constant probability.

A1. We would like to emphasize that a constant failure probability (e.g., $\delta = 10^{-6}$) is acceptable in practice since it is highly unlikely to enter the "bad events" in a realistic RL task. Moreover, high-probability bounds are well studied and acknowledged in the literature (Jin et al., 2020) and cannot be viewed as some "intermediate" result towards the expected-regret result. Therefore, we believe our result is a direct improvement of these works (Jin et al., 2020, Vial et al., 2022), achieving constant regret in this regime.

Secondly, we would like to highlight that the constant regret bound is important because it reveals a PAC-type guarantee (w.p. $1 - \delta$, gap $\Delta$, $\text{Regret} \le O(d^3H^5\Delta^{-1}\log(1/\delta))$) on the regret distribution, which cannot be obtained from the expected regret.


Q2. The correct dependence on the remaining parameters.

A2. We would like to first acknowledge that it is indeed important to improve the factors $d$ and $H$, and we mentioned a few potential directions in the future work. However, we would like to point out that achieving a constant regret is as important as shaving off other dependence, because the factors $d, H$ and $K$ are independent of each other. In the case where $d, H$ are small and $K$ is large, removing $\log K$ is significant. We would also like to emphasize that the future potential for shaving off $d, H$ factors does not imply our algorithm is incorrect, because these results are all upper bounds. We have also double-checked the proof before and during the rebuttal, and we are confident that our analysis is correct.


Q4. Is there a lower bound showing this is impossible?

A4. In the tabular setting, Simchowitz et al., 2019 and Xu et al., 2021 provided a gap-dependent upper bound (Theorem 1.1, Xu et al., 2021):

$$\text{Regret}(K) = \tilde{O}\left( \left( \sum_{(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]:\, \Delta_h(s,a) > 0} \frac{1}{\Delta_h(s,a)} + \frac{|Z_{\text{mul}}|}{\Delta_{\min}} + SA \right) \text{poly}(H)\log K \right), \quad (1)$$

where $Z_{\text{mul}}$ is the set of state-action pairs $(s, a)$ such that $a$ is a non-unique optimal action for $s$. The upper bound (1) contains both the averaged inverse gap $\sum \Delta_h^{-1}(s, a)$ and the minimal gap $\Delta_{\min}$. Only when the optimal action for each state is unique (i.e., $|Z_{\text{mul}}| = 0$) can (1) be reduced to

$$\text{Regret}(K) = \tilde{O}\left( \left( \sum_{(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]:\, \Delta_h(s,a) > 0} \frac{1}{\Delta_h(s,a)} + SA \right) \text{poly}(H)\log K \right).$$

In the general case where the optimal action is not necessarily unique, we would like to point out that Xu et al., 2021 proved a lower bound (Theorem 1.3) showing that no algorithm ALG can achieve, for all MDPs $M$ as $K$ approaches infinity, a regret of

$$\text{Regret}(K, M, \text{ALG}) = O\left( \left( \sum_{(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]:\, \Delta_h(s,a) > 0} \frac{1}{\Delta_h(s,a)}\right)\log K \right). \quad (3)$$

As discussed in Xu et al., 2021, this suggests that it is not possible to achieve a regret bound that solely depends on the sum of the inverses of the gaps, even in the tabular MDP setting. Xu et al., 2021 also mentioned that this is a separation between RL and contextual bandits, since such an 'averaged inverse gap' bound can be achieved in contextual bandits (Bubeck et al., 2012; Lattimore et al., 2020).

Therefore, in linear MDPs, the lower bound (3) also suggests that it is impossible to achieve a gap-dependent bound that solely depends on the average inverse gap $\sum \Delta_h^{-1}$, since any tabular MDP can be translated into a linear MDP (Example 2.1, Jin et al., 2020).

In short, given the results in the literature, we believe that obtaining a bound that solely depends on the average inverse gap, without other dependencies like the minimal sub-optimality gap $\Delta_{\min}$ or $SA$, is impossible in linear MDPs.


  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
  • Jin, Chi, et al. "Provably efficient reinforcement learning with linear function approximation." COLT, 2020
  • Xu, Haike et al. "Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap." COLT, 2021
  • Simchowitz, Max et al. "Non-asymptotic gap-dependent regret bounds for tabular mdps." NeurIPS, 2019
  • Bubeck, Sébastien et al. "Regret analysis of stochastic and nonstochastic multi-armed bandit problems." Foundations and Trends® in Machine Learning 5.1 (2012): 1-122.
  • Lattimore, Tor et al. Bandit algorithms. Cambridge University Press, 2020
Review
Rating: 5

This paper proposed a constant-regret learning algorithm for linear MDPs with approximation error (misspecification). Specifically, it proved an instance-dependent regret bound that is independent of the number of episodes. In addition, the algorithm does not require prior knowledge of the misspecification level or the suboptimality gap.

Strengths

  1. The background of the problem is stated clearly, with a well-summarized review of the literature.
  2. There is a comprehensive comparison of the new constant regret results with existing works.

Weaknesses

  1. The proposed algorithm and analysis follow the algorithm in Vial et al. (2022) closely. The main differences seem to be the layer-dependent quantization in Algorithm 1 and the exit condition 3 in Algorithm 2. It would be better to discuss more clearly how these two modifications contribute to getting rid of the log(K) dependence.
  2. Some notations are used without explanation, for example, $\lambda$ in Algorithm 1 and $\gamma_l$ in Algorithm 2 and Theorem 5.1. Also, how the certified estimator is defined in Algorithm 2 is not clear.
  3. Some analysis and results require more clarification: (a) The $\delta$ condition used in Theorem 5.1 does not match the one used in Lemma C.12. (b) It is not very clear how the key result in getting rid of log(K) (Lemma C.14, Line 765) is derived using $x \le a + \sqrt{bx}$.

Questions

  1. The new algorithm does not require UniSOFT assumption made by Papini et al. (2021a). However, in Thm 5 of that paper, they showed that UniSOFT is a necessary condition for constant regret. Can you explain how this does not apply to the current setup?
  2. It is not very intuitive to understand the constant regret with no dependency on the number of episodes. Are there any simple examples that can run the algorithm empirically to verify this behavior?

Limitations

There is no potential negative societal impact.

The authors addressed the limitations in Section 7 such as the tightness of the bound and uniform misspecification assumption.

Author Response

Thank you for your positive feedback. We address your questions as follows:

Q1. How do the layer-dependent quantization and certified estimator contribute to getting rid of the $\log K$ dependence?

A1. We would like to highlight the contribution of the layer-dependent quantization and the certified estimator in eliminating the $\log K$ dependence. Vial et al., 2022 applied a global quantization $\epsilon_{\text{rnd}} = d / K^2$, making the number of quantized parameters $\tilde{\mathbf w}_l$ and $\tilde{\mathbf U}_l$ in each layer $l$ depend on $K$. Our method uses unique quantization parameters $k_l$ per layer, thus eliminating the $\log K$ dependence in the regret bound (Lemma D.3).

In addition to the $\log K$ term introduced by the quantization, the estimated value function of Algorithm 2 comes from a total of $L = O(\log K)$ individual regressions. If controlled trivially, the union bound over all $L$ layers would also introduce a $\log K$ dependency into the final regret bound. The certified estimator further helps by connecting estimations across layers. As a result, during the analysis, we can focus on the concentration of some fixed layer $l_+$, which does not depend on the parameter $K$ (Lemma 6.1). The details are discussed in Section 6, Challenge 1.
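To illustrate the difference between the two quantization schemes, here is a minimal Python sketch. The global resolution $d/K^2$ follows the description of Vial et al., 2022 above, while the per-layer exponent `k_l = l + ceil(log2(d))` is a hypothetical placeholder rather than the paper's exact choice of $k_l$; the point is only that the layer-dependent grid does not involve $K$.

```python
import math
import numpy as np

def quantize(x, eps):
    """Round every entry of x to the nearest multiple of eps."""
    return eps * np.round(np.asarray(x, dtype=float) / eps)

# Global quantization in the style of Vial et al. (2022): the grid resolution
# depends on the total number of episodes K, so the size of the induced
# function class (and hence the covering argument) picks up log K factors.
def quantize_global(w, U, d, K):
    eps = d / K**2
    return quantize(w, eps), quantize(U, eps)

# Layer-dependent quantization: the resolution depends only on the layer index l.
# The exponent below (k_l = l + ceil(log2(d))) is a hypothetical placeholder,
# not the exact choice in the paper; what matters is that it is independent of K.
def quantize_layer(w, U, d, l):
    k_l = l + math.ceil(math.log2(d))
    eps = 2.0 ** (-k_l)
    return quantize(w, eps), quantize(U, eps)

# Usage example.
d, K, l = 8, 10_000, 3
w = np.random.default_rng(0).normal(size=d)
U = np.eye(d)
w_g, _ = quantize_global(w, U, d, K)   # grid resolution shrinks with K
w_l, _ = quantize_layer(w, U, d, l)    # grid resolution fixed per layer
print(w_g[:3], w_l[:3])
```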


Q2. How are the certified estimator and other notations defined?

A2. The certified estimator ensures that each layer's estimate is within a predefined confidence radius of the previous layer's estimate. $\lambda$ is the regularization parameter from Algorithm 1, and $\gamma_l$ is the confidence radius in Theorem 5.1.


Q3. Why is the $\delta$ condition different between Theorem 5.1 and Lemma C.12?

A3. Theorem 5.1 integrates Lemmas C.12 and C.14, which hold with probabilities of $1-3\delta$ and $1-\delta$, respectively. Applying the union bound, Theorem 5.1 holds with probability $1-4\delta$, differing from Lemma C.12. We will clarify this in the revision.


Q4. Could you clarify the technical details of getting rid of $\log K$ in Lemma C.14, derived using $x \le a + \sqrt{bx}$?

A4. The fact $x \le a + \sqrt{bx} \Rightarrow x \le 2a + 2b$ is applied in Section D.10, Line 763, after obtaining
$$\textrm{Regret}(K) < \sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + \sqrt{2\,\textrm{Regret}(K)\,H^2\log(\textrm{Regret}(K)/\delta)} + H^2\log(\textrm{Regret}(K)/\delta). \quad (1)$$
Taking $x = \textrm{Regret}(K)$, $a = \sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + H^2\log(\textrm{Regret}(K)/\delta)$, and $b = 2H^2\log(\textrm{Regret}(K)/\delta)$, inequality (1) yields
$$\textrm{Regret}(K) \le 2\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 2H^2\log(\textrm{Regret}(K)/\delta) + 4H^2\log(\textrm{Regret}(K)/\delta) = 2\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 6H^2\log(1/\delta) + 6H^2\log(\textrm{Regret}(K)). \quad (2)$$

According to the fact that $x \le a \log x + b \Rightarrow x \le 4a \log(2a) + 2b$, letting $x = \textrm{Regret}(K)$, $a = 2\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 6H^2\log(1/\delta)$, and $b = 6H^2$, (2) becomes
$$\textrm{Regret}(K) \le \left(8\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 24H^2\log(1/\delta)\right)\log\left(4\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 12H^2\log(1/\delta)\right) + 12H^2.$$
Hiding the logarithmic factors within the $\tilde O$ notation yields the claimed result in Lemma C.14. We will clarify the technical details of this part in the revision.
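For the reader's convenience, one way to verify the first self-bounding fact, $x \le a + \sqrt{bx} \Rightarrow x \le 2a + 2b$ for $a, b, x \ge 0$, is sketched below; this short derivation is ours and is not quoted from the paper.

```latex
% Sketch: x <= a + sqrt(bx) implies x <= 2a + 2b for a, b, x >= 0.
\begin{align*}
x \le a + \sqrt{bx}
  &\;\Longrightarrow\; (\sqrt{x})^2 - \sqrt{b}\,\sqrt{x} - a \le 0
   \;\Longrightarrow\; \sqrt{x} \le \frac{\sqrt{b} + \sqrt{b + 4a}}{2} \\
  &\;\Longrightarrow\; x \le \frac{2b + 4a + 2\sqrt{b(b + 4a)}}{4}
   \le \frac{2b + 4a + b + (b + 4a)}{4} = 2a + b \le 2a + 2b,
\end{align*}
% where the last inequality on the second line uses 2*sqrt(uv) <= u + v (AM-GM).
```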


Q5. What’s the connection between the UniSOFT assumption and the result obtained in this paper?

A5. We would like to highlight the connection between the UniSOFT assumption and the result obtained in this paper, which we also discussed in Remark 5.2. First, Papini et al. showed that UniSOFT is necessary for obtaining an expected constant regret, suggesting that a $\log K$ expected regret is unavoidable without UniSOFT. In our result, we proved a high-probability constant regret, suggesting that with probability at least $1 - \delta$, the regret is bounded by a constant $O(d^3H^5\Delta^{-1}\log(1/\delta))$.

Our result does not violate Papini et al. When translating the high-probability regret bound to an expected regret bound, we have $\mathbb{E}[\text{Regret}(K)] \le \tilde O(d^3H^5\Delta^{-1}\log(1/\delta)) \cdot (1 - \delta) + \delta K = \tilde O(d^3H^5\Delta^{-1}\log K)$. Setting $\delta = 1/K$ will lead to a logarithmic expected regret bound. In short, a high-probability constant regret bound results in a logarithmic expected regret bound.

On the other hand, the constant regret upper bound in Papini et al. is obtained under the UniSOFT assumption. Ours, in contrast, obtains the same high-probability constant regret upper bound without UniSOFT, which is the first constant regret bound without prior assumptions in the literature.

In conclusion, the UniSOFT assumption is a necessary condition for expected constant regret but not a necessary condition for high-probability constant regret. Our result is the first work showing a high-probability constant regret bound without this assumption.


Q6. What’s the experimental result and the intuitive understanding of the constant regret?

A6. We have added some experiments on the synthetic dataset to verify the performance of the algorithm, which is presented in the global response. In addition to the experimental results, the understanding of constant regret is quite intuitive: A constant regret bound suggests that the total mistakes made by the RL agent are finite, with high probability. Taking the two-armed stochastic bandit as an example, if the gap between the two arms is strictly positive, it only takes finite time to distinguish the optimal arm from the suboptimal arm. Thus, the total mistakes made by the RL agent are finite.
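As a toy illustration of this intuition (our own example, not an experiment from the paper), consider a two-armed Bernoulli bandit with a strictly positive gap and a simple elimination strategy that commits once the confidence intervals separate: with high probability the committed arm is optimal, so the number of suboptimal pulls stays bounded no matter how long the interaction continues. A minimal Python sketch:

```python
import numpy as np

# Toy two-armed Bernoulli bandit (means 0.7 and 0.4, gap 0.3) with a simple
# elimination strategy: sample both arms alternately until their confidence
# intervals separate, then commit to the empirically better arm forever.
# With probability close to 1 - delta the committed arm is the optimal one,
# so the total number of suboptimal pulls is finite no matter how long we run.
rng = np.random.default_rng(0)
means = np.array([0.7, 0.4])
delta = 1e-3
T = 50_000

counts = np.zeros(2)
sums = np.zeros(2)
committed = None
suboptimal_pulls = 0

for t in range(T):
    arm = t % 2 if committed is None else committed
    reward = float(rng.random() < means[arm])
    counts[arm] += 1
    sums[arm] += reward
    suboptimal_pulls += int(arm != 0)

    if committed is None and counts.min() > 0:
        mu_hat = sums / counts
        # Anytime confidence radius (union bound over rounds via the t^2 term).
        rad = np.sqrt(np.log(4 * (t + 1) ** 2 / delta) / (2 * counts))
        if mu_hat[0] - rad[0] > mu_hat[1] + rad[1]:
            committed = 0
        elif mu_hat[1] - rad[1] > mu_hat[0] + rad[0]:
            committed = 1

print(f"suboptimal pulls over {T} rounds: {suboptimal_pulls} (committed to arm {committed})")
```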


  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
  • Papini, Matteo, et al. "Reinforcement learning in linear mdps: Constant regret and representation selection." NeurIPS, 2021
Review
Rating: 6

This paper gives a constant regret (total regret independent of the number of episodes $K$) algorithm for online reinforcement learning of linear MDPs. The environment is considered to be a $\zeta$-approximate linear MDP as in (Jin et al., 2019), with an assumption that the misspecification level $\zeta$ is not too large. The algorithm is a more sophisticated variant of the original LSVI-UCB, which is a further improvement of earlier work that improves the regret of LSVI-UCB (but does not achieve constant regret).

The regret is parametrized by the instance-dependent minimal suboptimality gap $\Delta := \min_{h,s,a} \{V_h^\ast(s) - Q_h^\ast(s,a) : V_h^\ast(s) - Q_h^\ast(s,a) \neq 0\}$, which also appears in earlier instance-dependent regret bounds for linear MDPs (and also in general episodic MDPs).

Their algorithm, Cert-LSVI-UCB, has a regret bound of $\tilde{O}(d^3 H^5 \Delta^{-1})$. The dependence on $\Delta$ is optimal, whereas the dependence on $d$ and $H$ may be suboptimal.

This paper is also an improvement on the previous work that provides "constant regret" bounds for linear MDPs (Papini et al., 2021) in that it does not require the "UniSoft" assumption on the feature mapping (it only requires the standard assumptions of misspecified linear MDPs as in Jin et al., 2019, plus that the misspecification level satisfies $\zeta \leq \tilde{O}(\Delta/(\sqrt{d}H^2))$).

Strengths

  • The algorithm and its analysis gives a substantial improvement over earlier algorithms with instance-dependent bounds for linear MDPs, in that it is the first algorithm with a constant regret bound that requires minimal assumptions, even for misspecified linear MDPs.
  • The algorithm is cleanly based on earlier linear MDP algorithms starting from LSVI-UCB, applied in more sophisticated ways (multiple regression phases per episode etc), and the improvements in the algorithm and challenges in the analysis have been highlighted in the main paper itself (especially, in avoiding the dependency of the regret on the number of phases).

Weaknesses

  • The contribution of the paper is entirely theoretical, but most of the proof is in the appendix (which is unavoidable).
  • Some experimental results (even on synthetic data) would definitely enhance the paper.

Questions

NA

Limitations

  • The limitations have been adequately addressed and there is no negative societal impact.
Author Response

Thank you for the positive feedback on our work! We will address your questions as follows:

Q1. Adding some experimental results in addition to the theoretical result

A1. Thank you for the suggestion. We have included some experimental results on a synthetic dataset during the rebuttal phase. Specifically, these results are provided in our response to all reviewers. The experimental results indicate the potential for achieving constant regret and highlight the contributions of each algorithm component through ablation studies. We will include this information in the camera-ready revision.

Review
Rating: 6

The paper presents improved regret bounds for linear MDPs when the suboptimality gap $\Delta$ is known. Most crucially (and a bit surprisingly), the regret bounds are constant in the number of episodes $K$.

Strengths

The contributions are broadly theoretical, and the removal of the dependence on the number of episodes $K$ seems to be a good contribution to the academic understanding of gap-dependent (instance-dependent) learning bounds for reinforcement learning.

The paper provides a good coverage of related works in this area and describes the setting well.

Weaknesses

  1. $b_h^\pi$ seems to be undefined in Proposition 3.2.
  2. $\kappa_\ell$ in Algorithm 1, Line 8, as claimed by the authors, helps to get rid of log(K). Can you explain more about how it helps to remove log(K)? Is this the reason why the regret bounds do not have a log(K) dependence?
  3. I would urge the authors to provide a full expression of the regret without O notation, including the log factors, either in the main text or in the appendix (I could not find a full expression for the result in Theorem 5.1). Specifically, it is surprising, perhaps I am missing something, that the regret is O(1) when Delta = 1/(d^3H^5) and K = 1. I suspect some lower-order terms are hidden, or maybe the result is only valid for K >= something. The asymptotic dependence can still be right.

Questions

Please see above, I am willing to engage during the rebuttal phase.

Limitations

NA

Author Response

Thank you for your positive feedback. We address your questions as follows.

Q1. What's the meaning of the notation $\lVert wb_h^\pi\rVert_2^2$ in Proposition 3.2?

A1. Sorry for the typo caused by the LaTeX macros. The $\lVert wb_h^\pi\rVert_2^2$ should be $\lVert \mathbf{w}_h^\pi\rVert_2^2$ instead.


Q2. How does $\kappa_l$ eliminate the dependence on $\log K$?

A2. We discussed the contribution of applying layer-wise quantization in Section 6.1, Line 268. Previous work (Vial et al., 2022) used a global quantization $\epsilon_{\mathrm{rnd}} = d / K^2$. Therefore, the number of all possible quantized parameters $\tilde{\mathbf w}$ and $\tilde{\mathbf U}$ in each layer $l$ unnecessarily depends on the total number of episodes $K$. In our algorithm, each layer $l$ uses a different quantization parameter $k_l$, so the number of quantized parameters $\tilde{\mathbf w}$ and $\tilde{\mathbf U}$ only depends on $l$ (Lemma D.3).

In addition to the benefit from layer-wise quantization, another technique to eliminate the $\log K$ dependence is the certified estimator discussed in Section 6. Without the certified estimator, one needs to cover the parameters for all $L$ layers. The certified estimator builds the connection between the output value functions of consecutive layers $l$ and $l+1$. As a result, in the analysis, we can focus on the concentration of some fixed layer $l_+$, which does not depend on the parameter $K$ (Lemma 6.1). The details are discussed in Section 6, Challenge 1.

Besides the aforementioned challenges and techniques to remove the logarithmic dependence in the estimation error, we also provide a novel concentration analysis in Section 6.2 that achieves constant regret. In sharp contrast, previous analysis (He et al., 2022) has the logarithmic dependence for this concentration.


Q3. Why is the regret $O(1)$ when $K = 1$, $\Delta = 1/(d^3H^5)$? What is the full expression of the regret without $O$ notation?

A3. We would like to first clarify that the extreme case with $O(1)$ regret you mentioned is incorrect. The regret is of the form $\text{Regret}(K) \le \tilde O(d^3H^5 \Delta^{-1}\log(1/\delta))$. Therefore, to achieve $\text{Regret} = O(1)$, one would need to set $\Delta = d^3H^5$ instead of $\Delta = 1/(d^3H^5)$. Since the reward is bounded by $0 < r(s, a) < 1$, the gap $\Delta$ is trivially bounded in $[0, H]$ due to $0 \le Q^*(s, a) \le H$. Therefore, setting $\Delta = d^3H^5$ is unrealizable in this setting.

In response to your request for the regret bound without OO notation, the full expression of the regret bound can be obtained through Lemmas C.12, C.13, and C.14. We present the full expression as follows:

$$
\begin{aligned}
\textrm{Regret}(K) &\le \left(8I_{\Delta} + 24H^2\log(1/\delta)\right)\log\left(4I_{\Delta} + 12H^2\log(1/\delta)\right) + 12H^2, \\
I_{\Delta} &= 4 I_{\text{C.13}}\log\left(2 I_{\text{C.13}}\right), \\
I_{\text{C.13}} &= c_1 L_{\Delta}(L_{\Delta}+\log(dH))^2 d^3 H^5 \Delta^{-1} \log(L_{\Delta} d/\delta) + c_2 H^3 \Delta^{-1} \log\left(H\log(\Delta^{-1})/\delta\right), \\
L_{\Delta} &= \lceil -\log(0.01\,\Delta / H) \rceil.
\end{aligned}
$$

The dominating term is $I_\Delta \sim I_{\text{C.13}} = \tilde O(d^3 H^5 \Delta^{-1} \log(1/\delta))$. In the paper we use the $O$ notation for readability of the proof due to the complexity of this expression.


  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
Comment

Thank you for your detailed response. I am broadly satisfied. My only recommendation (for an updated version) is that the authors focus a bit on explaining the key difference that allows the regret to not depend on $K$, as they have mentioned some of the points in the rebuttal here.

Comment

Thank you for your continuous support for our paper. As you suggested, we will expand the discussions on the technical challenges in Section 6 in the revision.

Author Response

We first thank all reviewers for their valuable feedback.

In response to Reviewers 8jf2, 33Lb, and YEKy, we added experiments on synthetic datasets to verify the performance of the algorithm and the contribution of each component. Specifically, we consider a linear MDP with $S = 4$, $A = 5$, $H = 2$, and $d = 8$. Each element of the feature vectors $\phi(s, a)$ and $\mu(s')$ is generated from a uniform distribution $U(0, 1)$. Subsequently, $\phi$ is normalized to ensure that $P(s' | s, a)$ is a probability measure, i.e., $\phi(s, a) = \phi(s, a) / \sum_{s'} \phi^\top(s, a) \mu(s')$. The reward is defined by $r(s, a) = \phi^\top(s, a) \theta$, where $\theta \sim N(0, I_d)$.

We also consider adding model misspecification to the transition $P$ and the reward function $r$. For a given misspecification level $\zeta$, the ground-truth reward function is defined by $r(s, a) = \phi^\top(s, a) \theta + Z(s, a)$, where $Z(s, a) \sim U(-\zeta, \zeta)$. When adding model misspecification to the transition kernel, we first randomly sample a subset $\mathcal S_+ \subset \mathcal S$ such that $|\mathcal S_+| = |\mathcal S| / 2$. The misspecified transition kernel is then generated by

  • For $s' \in \mathcal S_+$, $P'(s' | s, a) = P(s' | s, a) + \zeta / S$
  • For $s' \notin \mathcal S_+$, $P'(s' | s, a) = P(s' | s, a) - \zeta / S$

We can verify that $\lVert P(\cdot | s, a) - P'(\cdot | s, a)\rVert_{TV} = \zeta$.
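A minimal Python sketch of this synthetic construction is given below, following the recipe described above (uniform features, normalization so that $P(\cdot|s,a)$ is a probability measure, Gaussian $\theta$, and $\pm\zeta/S$ perturbations on a random half of the next states). Function and variable names are ours, not from the paper's code.

```python
import numpy as np

def make_synthetic_linear_mdp(S=4, A=5, H=2, d=8, zeta=0.1, seed=0):
    """Sketch of the synthetic linear MDP construction described above."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 1.0, size=(S, A, d))   # features phi(s, a)
    mu = rng.uniform(0.0, 1.0, size=(S, d))       # measures mu(s')

    # Normalize phi so that sum_{s'} phi(s, a)^T mu(s') = 1 for every (s, a).
    norm = np.einsum("sad,td->sat", phi, mu).sum(axis=2)      # shape (S, A)
    phi = phi / norm[:, :, None]
    P = np.einsum("sad,td->sat", phi, mu)                     # transition kernel P(s'|s, a)

    theta = rng.normal(size=d)
    r = phi @ theta                                           # linear reward r(s, a)

    # Reward misspecification: additive noise Z(s, a) ~ U(-zeta, zeta).
    r_true = r + rng.uniform(-zeta, zeta, size=(S, A))

    # Transition misspecification: +zeta/S on a random half of next states, -zeta/S on the rest
    # (assumes zeta is small enough that no entry becomes negative).
    s_plus = rng.choice(S, size=S // 2, replace=False)
    shift = np.full(S, -zeta / S)
    shift[s_plus] = zeta / S
    P_true = P + shift[None, None, :]

    return phi, mu, P_true, r_true

phi, mu, P_true, r_true = make_synthetic_linear_mdp()
print(P_true.sum(axis=2))   # each row still sums to 1 (the +/- shifts cancel)
```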

We investigated misspecification levels $\zeta = 0, 0.01, \ldots, 0.3$ in 16 randomly generated environments over 2000 episodes. We report the cumulative regret and runtime with respect to different misspecification levels. Additionally, we performed an ablation study by

  1. removing the certified estimation (Algorithm 2, Line 11) and
  2. removing the quantization (Algorithm 1, Line 8).

The results of these configurations are presented in the following table. The detailed regret for all misspecification levels is presented in Table 1 of the attached PDF. In Figure 1 of the attached PDF, we plot the cumulative regret over 2000 episodes with respect to the misspecification level $\zeta$. The cumulative regret curves are plotted in Figure 2 of the attached PDF.

| Cert. Est. | Quant. | $\zeta=0$ | $\zeta=0.05$ | $\zeta=0.1$ | $\zeta=0.15$ | $\zeta=0.2$ | $\zeta=0.25$ | $\zeta=0.3$ | Running Time |
|---|---|---|---|---|---|---|---|---|---|
| × | × | 754.84 ± 221.06 | 868.10 ± 278.24 | 1160.24 ± 436.41 | 1583.37 ± 690.67 | 2160.15 ± 794.80 | 2644.05 ± 823.24 | 3206.85 ± 825.13 | 1654.76 |
| × | ✓ | 755.21 ± 220.71 | 868.25 ± 278.19 | 1159.56 ± 436.90 | 1583.99 ± 690.68 | 2160.03 ± 795.28 | 2643.40 ± 823.50 | 3206.92 ± 824.89 | 1654.21 |
| ✓ | × | 758.26 ± 220.62 | 845.62 ± 297.89 | 1015.95 ± 321.12 | 1154.01 ± 340.98 | 1290.21 ± 380.01 | 1433.30 ± 544.80 | 1438.27 ± 471.33 | 1599.66 |
| ✓ | ✓ | 758.63 ± 220.23 | 845.80 ± 297.92 | 1015.41 ± 321.57 | 1154.45 ± 341.32 | 1289.96 ± 380.03 | 1434.09 ± 544.40 | 1438.55 ± 471.24 | 1593.38 |

The experimental results suggest several key findings that support our theoretical analysis:

  • When the misspecification level is low, it is possible to achieve constant regret, where the instantaneous regret in the final rounds is approximately zero.
  • The certified estimator and the quantization do not significantly affect the algorithm's runtime. In contrast, the certified estimator provides an 'early-stopping' condition in Algorithm 2, which slightly reduces the algorithm's runtime. In particular, our algorithm yields a computational complexity of $O(d^2AHK^2\log K)$, which is the same as Vial et al., 2022 and only a $\log K$ factor greater than the vanilla LSVI-UCB (Jin et al., 2020) due to the multi-phase algorithm.
  • The certified estimator helps the algorithm by providing robust estimation in the presence of misspecification. As shown in the table, using the certified estimator does not make a significant difference when the misspecification level $\zeta$ is low, but it becomes significant as the misspecification level increases.
  • The quantization does not contribute significantly to the results, as the numerical results are intrinsically discrete and quantized. In Figure 2 of the attached PDF, the regret curve with quantization and the one without quantization are highly overlapped.

We hope these results help clarify the theoretical insights of our work, and we will include this in the revision.


  • Jin, Chi, et al. "Provably efficient reinforcement learning with linear function approximation." COLT, 2020
  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
Final Decision

This paper studies the linear MDP setting and achieves a high-probability regret bound that is independent of the number of episodes, while past work had a log(K) dependence. An expected regret bound necessarily has a log(K) dependence, so the novelty holds when the high-probability bound holds with probability 1-delta where delta is independent of K. Achieving this result required a careful union bound and analysis of a novel algorithm.

Given that such results hold for bandits, this result is not particularly surprising but nevertheless was unknown. Reviewers were skeptical of the practical value of the result given that the log(K) is necessary for an expected regret bound, and the questionable dependence on the more relevant parameters like the dimension d and horizon H. Indeed, there are many papers that suggest no greater than a d^2 H^3 dependence (versus d^3 H^5) is optimal (e.g., Zannette, Lazaric et al 2020). The authors' defensive reaction to this well-justified line of inquiry was a rebuttal misstep. We suggest the authors more suitably address this dependence in the paper and comment on ways to improve the algorithm and/or analysis. Nevertheless, the techniques and technical content may warrant a valuable contribution.