PaperHub
5.3 / 10
Poster · 4 reviewers
Lowest 4 · Highest 6 · Std. dev. 0.8
Ratings: 4, 5, 6, 6
Confidence: 2.8
Correctness: 2.8
Contribution: 2.5
Presentation: 2.8
NeurIPS 2024

Achieving Constant Regret in Linear Markov Decision Processes

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

We study the constant regret guarantees in reinforcement learning (RL). Our objective is to design an algorithm that incurs only finite regret over infinite episodes with high probability. We introduce an algorithm, Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs) where both the transition kernel and the reward function can be approximated by some linear function up to misspecification level $\zeta$. At the core of Cert-LSVI-UCB is an innovative certified estimator, which facilitates a fine-grained concentration analysis for multi-phase value-targeted regression, enabling us to establish an instance-dependent regret bound that is constant w.r.t. the number of episodes. Specifically, we demonstrate that for a linear MDP characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high probability, provided that the misspecification level $\zeta$ is below $\tilde{\mathcal{O}}(\Delta / (\sqrt{d}H^2))$. Here $d$ is the dimension of the feature space and $H$ is the horizon. Remarkably, this regret bound is independent of the number of episodes $K$. To the best of our knowledge, Cert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent, high-probability regret bound in RL with linear function approximation without relying on prior distribution assumptions.
Keywords
Reinforcement learning · Constant regret · Misspecified linear MDP

Reviews and Discussion

Review
Rating: 4

The paper studies constant regret guarantees for linear MDPs. The main result is the first algorithm with this guarantee. The algorithm is a modification of LSVI-UCB with a careful quantization technique and some form of action elimination. This improves upon the regret of previous works by a factor of $\log K$, where $K$ is the number of episodes. Unlike previous works, the algorithm is robust to some degree of model misspecification and does not need to know it in advance.

Strengths

  1. The method is robust to some model misspecification and does not need to know it in advance.
  2. The regret guarantee is tighter compared to previous works.
  3. The algorithm is computationally efficient in the sense that it runs in polynomial time in the problem parameters.
  4. There are significant technical challenges in the work. In particular, the careful quantization required to avoid dependence on $K$ and the refined analysis of the regret bound in terms of the suboptimality gaps.

Weaknesses

  1. While it's challenging to achieve regret that is independent of $K$, I do not see this as a significant improvement over the logarithmic dependence. This is especially true since it is not clear whether the correct polynomial dependence on the remaining parameters has been established. Moreover, this improvement is only relevant in the regime of a fixed success probability $\delta$, which is often dependent on $K$, e.g., if bounding the expected regret.
  2. The improvement to the regret comes at a significant increase to the algorithm's complexity. It's not clear whether this is necessary. For example, are you certain that the quantization cannot be replaced by a more careful analysis of the value cover? Why do you need both bonuses and action elimination and sample rejection? The runtime complexity also seems higher (though this is a minor concern). If you think these modifications are indeed necessary, perhaps you could demonstrate this with experiments.
  3. While I am not familiar with better results, the dependence on the worst case gap weakens the result. Do you think it is possible to achieve a better dependence? e.g. on the average inverse gap with respect to the optimal policy.
  4. The writing is not very consistent with some places having phrasing and grammatical issues (e.g., quantification should be quantization and many more).
  5. The algorithm needs to be better explained. While it seems the authors made significant efforts in this regard, many parts of the algorithm remain rather unclear. For example, the bonuses are not immediately apparent. As I understand it, you use a quantization of the standard bonus. It would be much easier for the reader if this is made explicit in the algorithm and explained. Other examples are the action elimination and sample rejection whose purpose is unclear.

Questions

See above.

Limitations

N/a

Author Response

Thank you for your detailed feedback, and we address your questions as follows:

Q1. Why is it significant to improve the logarithmic dependence on $K$?

A1. We would like to first highlight that achieving constant regret is actually an important topic in bandits (Abbasi-Yadkori et al., 2011; Papini et al., 2021a; Zhang et al., 2023) and RL (Papini et al., 2021b). This is not merely shaving off a $\log$ factor but a stronger performance guarantee than logarithmic regret. Specifically, the constant regret bound indicates that the agent incurs only a finite number of mistakes when making decisions over an infinite run, whereas the number of mistakes can become arbitrarily large if the regret guarantee is only logarithmic.


Q2. Is the correct polynomial dependence on the remaining parameters established?

A2. We would like to point out that our algorithm suggests a $\tilde O(d^3H^5\Delta^{-1})$ polynomial dependence on the parameters $H, d, \Delta^{-1}$, which is consistent with previous literature (He et al., Papini et al.), as discussed in Table 1. Therefore, we are confident that the polynomial dependence on these parameters is correct.


Q3. What is the necessity of each component in the algorithm?

A3. We believe both the certified estimator and the layer-dependent quantization are necessary for our theoretical analysis. As discussed in Section 6, certified estimators build the connection between different layers $l$ so that we can avoid taking the union bound over all $L = \log K$ layers during the concentration. Layer-dependent quantization removes the additional $\log K$ dependence introduced during the covering of the value function in Vial et al.

In addition, we have also conducted experiments on a synthetic dataset as presented in the response to all reviewers. In particular, the certified estimator helps stabilize the estimation so that the algorithm can achieve lower regret when the misspecification grows larger. It is true that the quantization does not significantly contribute to improving the regret, which aligns with our intuition since the numerical results are intrinsically discrete and quantized. In addition, experimental results also suggest that neither the certified estimator nor the quantization decreases the efficiency of the algorithm.


Q4. Why is the analysis based on the minimal sub-optimality gap instead of other gaps (like averaged gaps)?

A4. The averaged sub-optimality gap, as you suggested, can only be applied to the tabular bandit setting, to the best of our knowledge. In reinforcement learning with function approximation, the number of actions can be arbitrarily large, making it difficult to define an 'averaged' sub-optimality gap in this setting.


Q5. Could you explain the details of the algorithm? Why do you need both the bonus and arm elimination?

A5. First, we would like to clarify that our algorithm does not use rejection sampling; instead, we call the exit condition in Line 11 of Algorithm 2 the certified estimator, which ensures that the optimistic estimator for the current layer $l$ is always greater than the pessimistic estimator from the previous layer $l-1$, and vice versa. Therefore, if the concentration at layer $l$ fails, it is detected and the quality of the estimation is not affected, since the certified estimator (Line 11) is triggered and the algorithm returns the estimate from the previous layer. There is no connection between the certified estimator and rejection sampling, since none of $V, \hat V, \check V$ are sampled from a distribution and our algorithm is a learning algorithm rather than a sampling algorithm.

Second, regarding the bonus and the arm elimination, we quantize the parameters $\mathbf w$ and $\mathbf U$ in Line 8 of Algorithm 1 in order to control the size of the function class. Both the bonus term $\lVert\boldsymbol{\phi}(s, a)\rVert_{\tilde{\mathbf U}}$ and the $Q$ functions are computed using the quantized $\tilde{\mathbf w}$ and $\tilde{\mathbf U}$. In Algorithm 2, for each layer $l$, any action whose bonus term is larger than $O(2^{-l})$ is directly selected for exploration.

Third, regarding the arm elimination process, for each layer $l$ in Algorithm 2 we eliminate all actions $a$ for which $Q_{h,k}^l(s, a)$ is smaller than the estimated optimal value function $V_{h,k}^l(s)$ by more than $4 \cdot 2^{-l}$. As a result, arm elimination shrinks the action set before proceeding to the next phase $l+1$. Since the algorithm pursues higher-precision decisions in deeper layers, which are more vulnerable to model misspecification, arm elimination removes suboptimal arms in the shallow layers in order to deliver a more robust decision.

Additionally, in response to your question about the simultaneous use of bonus and arm elimination, it's important to note that this combination is commonly employed in bandits (Chu et al., Takemura et al.) and reinforcement learning (Vial et al.). Therefore, our design choice is conventional and aligns with standard strategies in the field.
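To make the layered procedure described above more concrete, below is a minimal Python sketch of such a per-layer loop. It is a conceptual illustration of the components discussed in this response (the $O(2^{-l})$ bonus threshold, the certified-estimator exit check, and the $4\cdot 2^{-l}$ action elimination); it is not the paper's Algorithm 2, and all function and variable names are ours.

```python
import numpy as np

def select_action_layered(phi, w_tilde, U_tilde_inv, actions, L=10):
    """Conceptual sketch of a layered bonus / action-elimination scheme.

    phi[a]         : feature vector of action a at the current state
    w_tilde[l]     : quantized regression weights for layer l
    U_tilde_inv[l] : quantized inverse covariance matrix for layer l

    This illustrates the ideas discussed in the rebuttal (bonus threshold,
    certified-estimator exit, action elimination); it is NOT Algorithm 2.
    """
    active = list(actions)
    prev_opt, prev_pess = np.inf, -np.inf   # layer l-1 value estimates
    prev_greedy = active[0]
    for l in range(1, L + 1):
        bonus = {a: float(np.sqrt(phi[a] @ U_tilde_inv[l] @ phi[a])) for a in active}
        # Actions that are still too uncertain at resolution 2^{-l} get explored directly.
        uncertain = [a for a in active if bonus[a] > 2.0 ** (-l)]
        if uncertain:
            return uncertain[0]
        q = {a: float(phi[a] @ w_tilde[l]) for a in active}
        opt = max(q[a] + bonus[a] for a in active)    # optimistic value estimate
        pess = max(q[a] - bonus[a] for a in active)   # pessimistic value estimate
        # Certified-estimator-style exit: the optimistic estimate of layer l must stay
        # above the pessimistic one of layer l-1 (and vice versa); otherwise fall back.
        if opt < prev_pess or pess > prev_opt:
            return prev_greedy
        # Action elimination: drop actions that look clearly suboptimal at this resolution.
        active = [a for a in active if q[a] + bonus[a] >= opt - 4 * 2.0 ** (-l)]
        prev_opt, prev_pess = opt, pess
        prev_greedy = max(active, key=lambda a: q[a])
    return prev_greedy

# Tiny usage example with random data (d = 4 features, 3 actions, 5 layers).
rng = np.random.default_rng(0)
d, A, L = 4, 3, 5
phi = {a: rng.normal(size=d) for a in range(A)}
w_tilde = {l: rng.normal(size=d) for l in range(1, L + 1)}
U_tilde_inv = {l: np.eye(d) / (10.0 * l) for l in range(1, L + 1)}
print(select_action_layered(phi, w_tilde, U_tilde_inv, actions=range(A), L=L))
```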


  • Abbasi-Yadkori et al. "Improved algorithms for linear stochastic bandits." NeurIPS, 2011
  • Papini, Matteo, et al. "Leveraging good representations in linear contextual bandits." ICML, 2021a
  • Papini, Matteo, et al. "Reinforcement learning in linear mdps: Constant regret and representation selection." NeurIPS, 2021b
  • Zhang, Weitong, et al. "On the interplay between misspecification and sub-optimality gap in linear contextual bandits." ICML, 2023.
  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
  • Chu, Wei, et al. "Contextual bandits with linear payoff functions." AISTATS, 2011
  • Takemura, Kei, et al. "A parameter-free algorithm for misspecified linear contextual bandits." AISTATS, 2021
Comment

Thank you for the detailed response.

[A1]: I am fully aware that "shaving" this log factor achieves independence of the time horizon. However, this is only with a fixed probability $1-\delta$. When considering an infinite time horizon, it seems unlikely that $\delta$ would not scale with the horizon as this means we can incur linear regret with constant probability.

[A2]: The fact that previous works achieve the same dependence on $d, H, \Delta^{-1}$ does not imply that this is the correct dependence. Many works indeed focus on achieving improved dependence on $K$ as long as other parameters scale polynomially. However, I find this sensible only when the dependence on $K$ is polynomial. When it is logarithmic, I find it very likely that shaving a factor of $d$ or $H$ would be more significant than shaving a $\log K$ factor.

[A3+5]: Thank you for the explanations. While I'm not fully convinced of the necessity of each part, I get the general picture. Regardless, this is a secondary concern.

[A4]: Regarding gaps, is there a lower bound showing this is impossible in general? I am aware that this concern is not unique to this work but it is a weakness nonetheless.

Comment

Thank you for your reply. We addressed your questions and concerns as follows.

Q1. When considering an infinite time horizon, it seems unlikely that $\delta$ would not scale with the horizon, as this means we can incur linear regret with constant probability.

A1. We would like to emphasize that a constant failure probability (e.g., $\delta = 10^{-6}$) is acceptable in practice since it is highly unlikely to enter the "bad events" in a realistic RL task. Moreover, high-probability bounds are well studied and acknowledged in the literature (Jin et al., 2020) and cannot be viewed as some "intermediate" result towards the expected-regret result. Therefore, we believe our result is a direct improvement of these works (Jin et al., 2020, Vial et al., 2022), achieving constant regret in this regime.

Secondly, we would like to highlight that the constant regret bound is important because it reveals a PAC-type guarantee (w.p. $1 - \delta$, gap $\Delta$, $\text{Regret} \le O(d^3H^5\Delta^{-1}\log(1/\delta))$) on the regret distribution, which cannot be obtained from the expected regret.


Q2. The correct dependence on the remaining parameters.

A2. We would like to first acknowledge that it is indeed important to improve the factors $d$ and $H$, and we mentioned a few potential directions in the future work. However, we would like to point out that achieving a constant regret is as important as shaving off other dependence, because the factors $d, H$ and $K$ are independent of each other. In the case where $d, H$ are small and $K$ is large, removing $\log K$ is significant. We would also like to emphasize that the future potential for shaving off $d, H$ factors does not imply our algorithm is incorrect, because these results are all upper bounds. We have also double-checked the proof before and during the rebuttal, and we are confident that our analysis is correct.


Q4. Is there a lower bound showing this is impossible?

A4. In the tabular setting, Simchowitz et al., 2019 and Xu et al., 2021 provided a gap-dependent upper bound (Theorem 1.1, Xu et al., 2021):

$$\text{Regret}(K) = \tilde{O}\left( \left( \sum_{(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]:\, \Delta_h(s,a) > 0} \frac{1}{\Delta_h(s,a)} + \frac{|Z_{\text{mul}}|}{\Delta_{\min}} + SA \right) \text{poly}(H)\log K \right), \quad (1)$$

where $Z_{\text{mul}}$ is the set of state-action pairs $(s, a)$ such that $a$ is a non-unique optimal action for $s$. The upper bound (1) contains both the averaged inverse gap $\sum \Delta_h^{-1}(s, a)$ and the minimal gap $\Delta_{\min}$. Only when the optimal action for each state is unique (i.e., $|Z_{\text{mul}}| = 0$) can (1) be reduced to

$$\text{Regret}(K) = \tilde{O}\left( \left( \sum_{(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]:\, \Delta_h(s,a) > 0} \frac{1}{\Delta_h(s,a)} + SA \right) \text{poly}(H)\log K \right).$$

In the general case where the optimal action is not necessarily unique, we would like to point out that Xu et al., 2021 proved a lower bound (Theorem 1.3) showing that no algorithm ALG can achieve, for all MDPs $M$ as $K$ approaches infinity, a regret of

$$\text{Regret}(K, M, \text{ALG}) = O\left( \left( \sum_{(s,a,h) \in \mathcal{S} \times \mathcal{A} \times [H]:\, \Delta_h(s,a) > 0} \frac{1}{\Delta_h(s,a)}\right)\log K \right). \quad (3)$$

As discussed in Xu et al., 2021, this suggests that it is not possible to achieve a regret bound that solely depends on the sum of the inverses of the gaps, even in the tabular MDP setting. Xu et al., 2021 also mentioned that this is a separation between RL and contextual bandits, since such an 'averaged inverse gap' bound can be achieved in contextual bandits (Bubeck et al., 2012; Lattimore et al., 2020).

Therefore, in linear MDPs, the lower bound (3) also suggests that it is impossible to achieve a gap-dependent bound that solely depends on the average inverse gap $\sum \Delta_h^{-1}$, since any tabular MDP can be translated into a linear MDP (Example 2.1, Jin et al., 2020).

In short, given the results in the literature, we believe that obtaining a bound that solely depends on the average inverse gap, without other dependencies like the minimal sub-optimality gap $\Delta_{\min}$ or $SA$, is impossible in linear MDPs.


  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
  • Jin, Chi, et al. "Provably efficient reinforcement learning with linear function approximation." COLT, 2020
  • Xu, Haike et al. "Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap." COLT, 2021
  • Simchowitz, Max et al. "Non-asymptotic gap-dependent regret bounds for tabular mdps." NeurIPS, 2019
  • Bubeck, Sébastien et al. "Regret analysis of stochastic and nonstochastic multi-armed bandit problems." Foundations and Trends® in Machine Learning 5.1 (2012): 1-122.
  • Lattimore, Tor et al. Bandit algorithms. Cambridge University Press, 2020
Review
Rating: 5

This paper proposed a constant-regret learning algorithm for linear MDPs with approximation error (misspecification). Specifically, it proved an instance-dependent regret bound that is independent of the number of episodes. In addition, the algorithm does not require prior knowledge of the misspecification level or the suboptimality gap.

Strengths

  1. The background of the problem is stated clearly, with a well-summarized review of the literature.
  2. There is a comprehensive comparison of the new constant regret results with existing works.

Weaknesses

  1. The proposed algorithm and analysis follow the algorithm in Vial et al. (2022) closely. The main differences seem to be the layer-dependent quantization in Algorithm 1 and the exit condition 3 in Algorithm 2. It would be better to discuss more clearly how these two modifications contribute to getting rid of the log(K) dependence.
  2. Some notations are used without explanation, for example, $\lambda$ in Algorithm 1 and $\gamma_l$ in Algorithm 2 and Theorem 5.1. Also, how the certified estimator is defined in Algorithm 2 is not clear.
  3. Some analysis and results require more clarification: (a) The $\delta$ condition used in Theorem 5.1 does not match the one used in Lemma C.12. (b) It is not very clear how the key result in getting rid of log(K) (Lemma C.14, Line 765) is derived using $x \le a + \sqrt{bx}$.

Questions

  1. The new algorithm does not require UniSOFT assumption made by Papini et al. (2021a). However, in Thm 5 of that paper, they showed that UniSOFT is a necessary condition for constant regret. Can you explain how this does not apply to the current setup?
  2. It is not very intuitive to understand the constant regret with no dependency on the number of episodes. Are there any simple examples that can run the algorithm empirically to verify this behavior?

Limitations

There is no potential negative societal impact.

The authors addressed the limitations in Section 7 such as the tightness of the bound and uniform misspecification assumption.

Author Response

Thank you for your positive feedback. We address your questions as follows:

Q1. How do the layer-dependent quantization and certified estimator contribute to getting rid of the $\log K$ dependence?

A1. We would like to highlight the contribution of the layer-dependent quantization and the certified estimator in eliminating the $\log K$ dependence. Vial et al., 2022 applied a global quantization $\epsilon_{\text{rnd}} = d / K^2$, making the number of quantized parameters $\tilde{\mathbf w}_l$ and $\tilde{\mathbf U}_l$ in each layer $l$ depend on $K$. Our method uses unique quantization parameters $k_l$ per layer, thus eliminating the $\log K$ dependence in the regret bound (Lemma D.3).

In addition to the $\log K$ term introduced by the quantization, the estimated value function of Algorithm 2 comes from a total of $L = O(\log K)$ individual regressions. If controlled trivially, the union bound over all $L$ layers would also introduce a $\log K$ dependency into the final regret bound. The certified estimator further helps by connecting estimations across layers. As a result, during the analysis, we can focus on the concentration of some fixed layer $l_+$, which does not depend on the parameter $K$ (Lemma 6.1). The details are discussed in Section 6, Challenge 1.
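To illustrate the difference between the two quantization schemes, here is a minimal Python sketch. The global resolution $d/K^2$ follows the description of Vial et al., 2022 above, while the per-layer exponent `k_l = l + ceil(log2(d))` is a hypothetical placeholder rather than the paper's exact choice of $k_l$; the point is only that the layer-dependent grid does not involve $K$.

```python
import math
import numpy as np

def quantize(x, eps):
    """Round every entry of x to the nearest multiple of eps."""
    return eps * np.round(np.asarray(x, dtype=float) / eps)

# Global quantization in the style of Vial et al. (2022): the grid resolution
# depends on the total number of episodes K, so the size of the induced
# function class (and hence the covering argument) picks up log K factors.
def quantize_global(w, U, d, K):
    eps = d / K**2
    return quantize(w, eps), quantize(U, eps)

# Layer-dependent quantization: the resolution depends only on the layer index l.
# The exponent below (k_l = l + ceil(log2(d))) is a hypothetical placeholder,
# not the exact choice in the paper; what matters is that it is independent of K.
def quantize_layer(w, U, d, l):
    k_l = l + math.ceil(math.log2(d))
    eps = 2.0 ** (-k_l)
    return quantize(w, eps), quantize(U, eps)

# Usage example.
d, K, l = 8, 10_000, 3
w = np.random.default_rng(0).normal(size=d)
U = np.eye(d)
w_g, _ = quantize_global(w, U, d, K)   # grid resolution shrinks with K
w_l, _ = quantize_layer(w, U, d, l)    # grid resolution fixed per layer
print(w_g[:3], w_l[:3])
```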


Q2. How are the certified estimator and other notations defined?

A2. The certified estimator ensures that each layer's estimate is within a predefined confidence radius of the previous layer's estimate. $\lambda$ is the regularization parameter from Algorithm 1, and $\gamma_l$ is the confidence radius in Theorem 5.1.


Q3. Why is the $\delta$ condition different between Theorem 5.1 and Lemma C.12?

A3. Theorem 5.1 integrates Lemmas C.12 and C.14, which hold with probabilities of $1-3\delta$ and $1-\delta$, respectively. Applying the union bound, Theorem 5.1 holds with probability $1-4\delta$, differing from Lemma C.12. We will clarify this in the revision.


Q4. Could you clarify the technical details of getting rid of $\log K$ in Lemma C.14, derived using $x \le a + \sqrt{bx}$?

A4. The fact $x \le a + \sqrt{bx} \Rightarrow x \le 2a + 2b$ is applied in Section D.10, Line 763, after obtaining
$$\textrm{Regret}(K) < \sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + \sqrt{2\,\textrm{Regret}(K)\,H^2\log(\textrm{Regret}(K)/\delta)} + H^2\log(\textrm{Regret}(K)/\delta). \quad (1)$$
Taking $x = \textrm{Regret}(K)$, $a = \sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + H^2\log(\textrm{Regret}(K)/\delta)$, and $b = 2H^2\log(\textrm{Regret}(K)/\delta)$, inequality (1) yields
$$\textrm{Regret}(K) \le 2\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 2H^2\log(\textrm{Regret}(K)/\delta) + 4H^2\log(\textrm{Regret}(K)/\delta) = 2\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 6H^2\log(1/\delta) + 6H^2\log(\textrm{Regret}(K)). \quad (2)$$

According to the fact that $x \le a \log x + b \Rightarrow x \le 4a \log(2a) + 2b$, letting $x = \textrm{Regret}(K)$, $a = 2\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 6H^2\log(1/\delta)$, and $b = 6H^2$, (2) becomes
$$\textrm{Regret}(K) \le \left(8\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 24H^2\log(1/\delta)\right)\log\left(4\sum_{k=1}^{K} \sum_{h=1}^H \Delta^k_h + 12H^2\log(1/\delta)\right) + 12H^2.$$
Hiding the logarithmic factors within the $\tilde O$ notation yields the claimed result in Lemma C.14. We will clarify the technical details of this part in the revision.
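For the reader's convenience, one way to verify the first self-bounding fact, $x \le a + \sqrt{bx} \Rightarrow x \le 2a + 2b$ for $a, b, x \ge 0$, is sketched below; this short derivation is ours and is not quoted from the paper.

```latex
% Sketch: x <= a + sqrt(bx) implies x <= 2a + 2b for a, b, x >= 0.
\begin{align*}
x \le a + \sqrt{bx}
  &\;\Longrightarrow\; (\sqrt{x})^2 - \sqrt{b}\,\sqrt{x} - a \le 0
   \;\Longrightarrow\; \sqrt{x} \le \frac{\sqrt{b} + \sqrt{b + 4a}}{2} \\
  &\;\Longrightarrow\; x \le \frac{2b + 4a + 2\sqrt{b(b + 4a)}}{4}
   \le \frac{2b + 4a + b + (b + 4a)}{4} = 2a + b \le 2a + 2b,
\end{align*}
% where the last inequality on the second line uses 2*sqrt(uv) <= u + v (AM-GM).
```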


Q5. What’s the connection between the UniSOFT assumption and the result obtained in this paper?

A5. We would like to highlight the connection between the UniSOFT assumption and the result obtained in this paper, which we also discussed in Remark 5.2. First, Papini et al. showed that UniSOFT is necessary for obtaining an expected constant regret, suggesting that a $\log K$ expected regret is unavoidable without UniSOFT. In our result, we proved a high-probability constant regret, suggesting that with probability at least $1 - \delta$, the regret is bounded by a constant $O(d^3H^5\Delta^{-1}\log(1/\delta))$.

Our result does not violate Papini et al. When translating the high-probability regret bound to an expected regret bound, we have $\mathbb{E}[\text{Regret}(K)] \le \tilde O(d^3H^5\Delta^{-1}\log(1/\delta)) \cdot (1 - \delta) + \delta K = \tilde O(d^3H^5\Delta^{-1}\log K)$. Setting $\delta = 1/K$ will lead to a logarithmic expected regret bound. In short, a high-probability constant regret bound results in a logarithmic expected regret bound.

On the other hand, the constant regret upper bound in Papini et al. is obtained under the UniSOFT assumption. Ours, in contrast, obtains the same high-probability constant regret upper bound without UniSOFT, which is the first constant regret bound without prior assumptions in the literature.

In conclusion, the UniSOFT assumption is a necessary condition for expected constant regret but not a necessary condition for high-probability constant regret. Our result is the first work showing a high-probability constant regret bound without this assumption.


Q6. What’s the experimental result and the intuitive understanding of the constant regret?

A6. We have added some experiments on the synthetic dataset to verify the performance of the algorithm, which is presented in the global response. In addition to the experimental results, the understanding of constant regret is quite intuitive: A constant regret bound suggests that the total mistakes made by the RL agent are finite, with high probability. Taking the two-armed stochastic bandit as an example, if the gap between the two arms is strictly positive, it only takes finite time to distinguish the optimal arm from the suboptimal arm. Thus, the total mistakes made by the RL agent are finite.
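As a toy illustration of this intuition (our own example, not an experiment from the paper), consider a two-armed Bernoulli bandit with a strictly positive gap and a simple elimination strategy that commits once the confidence intervals separate: with high probability the committed arm is optimal, so the number of suboptimal pulls stays bounded no matter how long the interaction continues. A minimal Python sketch:

```python
import numpy as np

# Toy two-armed Bernoulli bandit (means 0.7 and 0.4, gap 0.3) with a simple
# elimination strategy: sample both arms alternately until their confidence
# intervals separate, then commit to the empirically better arm forever.
# With probability close to 1 - delta the committed arm is the optimal one,
# so the total number of suboptimal pulls is finite no matter how long we run.
rng = np.random.default_rng(0)
means = np.array([0.7, 0.4])
delta = 1e-3
T = 50_000

counts = np.zeros(2)
sums = np.zeros(2)
committed = None
suboptimal_pulls = 0

for t in range(T):
    arm = t % 2 if committed is None else committed
    reward = float(rng.random() < means[arm])
    counts[arm] += 1
    sums[arm] += reward
    suboptimal_pulls += int(arm != 0)

    if committed is None and counts.min() > 0:
        mu_hat = sums / counts
        # Anytime confidence radius (union bound over rounds via the t^2 term).
        rad = np.sqrt(np.log(4 * (t + 1) ** 2 / delta) / (2 * counts))
        if mu_hat[0] - rad[0] > mu_hat[1] + rad[1]:
            committed = 0
        elif mu_hat[1] - rad[1] > mu_hat[0] + rad[0]:
            committed = 1

print(f"suboptimal pulls over {T} rounds: {suboptimal_pulls} (committed to arm {committed})")
```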


  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
  • Papini, Matteo, et al. "Reinforcement learning in linear mdps: Constant regret and representation selection." NeurIPS, 2021
Review
Rating: 6

This paper gives a constant regret (total regret independent of the number of episodes $K$) algorithm for online reinforcement learning of linear MDPs. The environment is considered to be a $\zeta$-approximate linear MDP as in (Jin et al., 2019), with an assumption that the misspecification level $\zeta$ is not too large. The algorithm is a more sophisticated variant of the original LSVI-UCB, which is a further improvement of earlier work that improves the regret of LSVI-UCB (but does not achieve constant regret).

The regret is parametrized by the instance-dependent minimal suboptimality gap $\Delta := \min_{h,s,a} \{V_h^\ast(s) - Q_h^\ast(s,a) : V_h^\ast(s) - Q_h^\ast(s,a) \neq 0\}$, which also appears in earlier instance-dependent regret bounds for linear MDPs (and also in general episodic MDPs).

Their algorithm, Cert-LSVI-UCB, has a regret bound of $\tilde{O}(d^3 H^5 \Delta^{-1})$. The dependence on $\Delta$ is optimal, whereas the dependence on $d$ and $H$ may be suboptimal.

This paper is also an improvement on the previous work that provides "constant regret" bounds for linear MDPs (Papini et al., 2021) in that it does not require the "UniSoft" assumption on the feature mapping (it only requires the standard assumptions of misspecified linear MDPs as in Jin et al., 2019, plus that the misspecification level satisfies $\zeta \leq \tilde{O}(\Delta/(\sqrt{d}H^2))$).

Strengths

  • The algorithm and its analysis gives a substantial improvement over earlier algorithms with instance-dependent bounds for linear MDPs, in that it is the first algorithm with a constant regret bound that requires minimal assumptions, even for misspecified linear MDPs.
  • The algorithm is cleanly based on earlier linear MDP algorithms starting from LSVI-UCB, applied in more sophisticated ways (multiple regression phases per episode etc), and the improvements in the algorithm and challenges in the analysis have been highlighted in the main paper itself (especially, in avoiding the dependency of the regret on the number of phases).

Weaknesses

  • The contribution of the paper is entirely theoretical, but most of the proof is in the appendix (which is unavoidable).
  • Some experimental results (even on synthetic data) would definitely enhance the paper.

Questions

NA

Limitations

  • The limitations have been adequately addressed and there is no negative societal impact.
Author Response

Thank you for the positive feedback on our work! We will address your questions as follows:

Q1. Adding some experimental results in addition to the theoretical result

A1. Thank you for the suggestion. We have included some experimental results on a synthetic dataset during the rebuttal phase. Specifically, these results are provided in our response to all reviewers. The experimental results indicate the potential for achieving constant regret and highlight the contributions of each algorithm component through ablation studies. We will include this information in the camera-ready revision.

Review
Rating: 6

The paper presents improved regret bounds for linear MDPs when the suboptimality gap $\Delta$ is known. Most crucially (and a bit surprisingly), the regret bounds are constant in the number of episodes $K$.

Strengths

The contributions are broadly theoretical, and the removal of the dependence on the number of episodes $K$ seems to be a good contribution to the academic understanding of gap-dependent (instance-dependent) learning bounds for reinforcement learning.

The paper provides a good coverage of related works in this area and describes the setting well.

Weaknesses

  1. $b_h^\pi$ seems to be undefined in Proposition 3.2.
  2. $\kappa_\ell$ in Algorithm 1, Line 8, as claimed by the authors, helps to get rid of log(K). Can you explain more about how it helps to remove log(K)? Is this the reason why the regret bounds do not have a log(K) dependence?
  3. I would urge the authors to provide a full expression of the regret without O notation, including the log factors, either in the main text or in the appendix (I could not find a full expression for the result in Theorem 5.1). Specifically, it is surprising, perhaps I am missing something, that the regret is O(1) when Delta = 1/(d^3H^5) and K = 1. I suspect some lower-order terms are hidden, or maybe the result is only valid for K >= something. The asymptotic dependence can still be right.

Questions

Please see above, I am willing to engage during the rebuttal phase.

Limitations

NA

Author Response

Thank you for your positive feedback. We address your questions as follows.

Q1. What's the meaning of the notation $\lVert wb_h^\pi\rVert_2^2$ in Proposition 3.2?

A1. Sorry for the typo caused by the LaTeX macros. The $\lVert wb_h^\pi\rVert_2^2$ should be $\lVert \mathbf{w}_h^\pi\rVert_2^2$ instead.


Q2. How does $\kappa_l$ eliminate the dependence on $\log K$?

A2. We discussed the contribution of applying layer-wise quantization in Section 6.1, Line 268. Previous work (Vial et al., 2022) used a global quantization $\epsilon_{\mathrm{rnd}} = d / K^2$. Therefore, the number of all possible quantized parameters $\tilde{\mathbf w}$ and $\tilde{\mathbf U}$ in each layer $l$ unnecessarily depends on the total number of episodes $K$. In our algorithm, each layer $l$ uses a different quantization parameter $k_l$, so the number of quantized parameters $\tilde{\mathbf w}$ and $\tilde{\mathbf U}$ only depends on $l$ (Lemma D.3).

In addition to the benefit from layer-wise quantization, another technique to eliminate the $\log K$ dependence is the certified estimator discussed in Section 6. Without the certified estimator, one needs to cover the parameters for all $L$ layers. The certified estimator builds the connection between the output value functions of consecutive layers $l$ and $l+1$. As a result, in the analysis, we can focus on the concentration of some fixed layer $l_+$, which does not depend on the parameter $K$ (Lemma 6.1). The details are discussed in Section 6, Challenge 1.

Besides the aforementioned challenges and techniques to remove the logarithmic dependence in the estimation error, we also provide a novel concentration analysis in Section 6.2 that achieves constant regret. In sharp contrast, previous analysis (He et al., 2022) has the logarithmic dependence for this concentration.


Q3. Why is the regret $O(1)$ when $K = 1$, $\Delta = 1/(d^3H^5)$? What is the full expression of the regret without $O$ notation?

A3. We would like to first clarify that the extreme case with $O(1)$ regret you mentioned is incorrect. The regret is of the form $\text{Regret}(K) \le \tilde O(d^3H^5 \Delta^{-1}\log(1/\delta))$. Therefore, to achieve $\text{Regret} = O(1)$, one would need to set $\Delta = d^3H^5$ instead of $\Delta = 1/(d^3H^5)$. Since the reward is bounded by $0 < r(s, a) < 1$, the gap $\Delta$ is trivially bounded in $[0, H]$ due to $0 \le Q^*(s, a) \le H$. Therefore, setting $\Delta = d^3H^5$ is unrealizable in this setting.

In response to your request for the regret bound without OO notation, the full expression of the regret bound can be obtained through Lemmas C.12, C.13, and C.14. We present the full expression as follows:

$$
\begin{aligned}
\textrm{Regret}(K) &\le \left(8I_{\Delta} + 24H^2\log(1/\delta)\right)\log\left(4I_{\Delta} + 12H^2\log(1/\delta)\right) + 12H^2, \\
I_{\Delta} &= 4 I_{\text{C.13}}\log\left(2 I_{\text{C.13}}\right), \\
I_{\text{C.13}} &= c_1 L_{\Delta}(L_{\Delta}+\log(dH))^2 d^3 H^5 \Delta^{-1} \log(L_{\Delta} d/\delta) + c_2 H^3 \Delta^{-1} \log\left(H\log(\Delta^{-1})/\delta\right), \\
L_{\Delta} &= \lceil -\log(0.01\,\Delta / H) \rceil.
\end{aligned}
$$

The dominating term is $I_\Delta \sim I_{\text{C.13}} = \tilde O(d^3 H^5 \Delta^{-1} \log(1/\delta))$. In the paper we use the $O$ notation for readability of the proof due to the complexity of this expression.


  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
Comment

Thank you for your detailed response. I am broadly satisfied. My only recommendation (for an updated version) is that the authors focus a bit on explaining the key difference that allows the regret to not depend on $K$, as they have mentioned some of the points in the rebuttal here.

Comment

Thank you for your continuous support for our paper. As you suggested, we will expand the discussions on the technical challenges in Section 6 in the revision.

Author Response

We first thank all reviewers for their valuable feedback.

In response to Reviewers 8jf2, 33Lb, and YEKy, we added experiments on synthetic datasets to verify the performance of the algorithm and the contribution of each component. Specifically, we consider a linear MDP with $S = 4$, $A = 5$, $H = 2$, and $d = 8$. Each element of the feature vectors $\phi(s, a)$ and $\mu(s')$ is generated from a uniform distribution $U(0, 1)$. Subsequently, $\phi$ is normalized to ensure that $P(s' | s, a)$ is a probability measure, i.e., $\phi(s, a) = \phi(s, a) / \sum_{s'} \phi^\top(s, a) \mu(s')$. The reward is defined by $r(s, a) = \phi^\top(s, a) \theta$, where $\theta \sim N(0, I_d)$.

We also consider adding model misspecification to the transition $P$ and the reward function $r$. For a given misspecification level $\zeta$, the ground-truth reward function is defined by $r(s, a) = \phi^\top(s, a) \theta + Z(s, a)$, where $Z(s, a) \sim U(-\zeta, \zeta)$. When adding model misspecification to the transition kernel, we first randomly sample a subset $\mathcal S_+ \subset \mathcal S$ such that $|\mathcal S_+| = |\mathcal S| / 2$. The misspecified transition kernel is then generated by

  • For $s' \in \mathcal S_+$, $P'(s' | s, a) = P(s' | s, a) + \zeta / S$
  • For $s' \notin \mathcal S_+$, $P'(s' | s, a) = P(s' | s, a) - \zeta / S$

We can verify that $\lVert P(\cdot | s, a) - P'(\cdot | s, a)\rVert_{TV} = \zeta$.
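A minimal Python sketch of this synthetic construction is given below, following the recipe described above (uniform features, normalization so that $P(\cdot|s,a)$ is a probability measure, Gaussian $\theta$, and $\pm\zeta/S$ perturbations on a random half of the next states). Function and variable names are ours, not from the paper's code.

```python
import numpy as np

def make_synthetic_linear_mdp(S=4, A=5, H=2, d=8, zeta=0.1, seed=0):
    """Sketch of the synthetic linear MDP construction described above."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(0.0, 1.0, size=(S, A, d))   # features phi(s, a)
    mu = rng.uniform(0.0, 1.0, size=(S, d))       # measures mu(s')

    # Normalize phi so that sum_{s'} phi(s, a)^T mu(s') = 1 for every (s, a).
    norm = np.einsum("sad,td->sat", phi, mu).sum(axis=2)      # shape (S, A)
    phi = phi / norm[:, :, None]
    P = np.einsum("sad,td->sat", phi, mu)                     # transition kernel P(s'|s, a)

    theta = rng.normal(size=d)
    r = phi @ theta                                           # linear reward r(s, a)

    # Reward misspecification: additive noise Z(s, a) ~ U(-zeta, zeta).
    r_true = r + rng.uniform(-zeta, zeta, size=(S, A))

    # Transition misspecification: +zeta/S on a random half of next states, -zeta/S on the rest
    # (assumes zeta is small enough that no entry becomes negative).
    s_plus = rng.choice(S, size=S // 2, replace=False)
    shift = np.full(S, -zeta / S)
    shift[s_plus] = zeta / S
    P_true = P + shift[None, None, :]

    return phi, mu, P_true, r_true

phi, mu, P_true, r_true = make_synthetic_linear_mdp()
print(P_true.sum(axis=2))   # each row still sums to 1 (the +/- shifts cancel)
```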

We investigated misspecification levels $\zeta = 0, 0.01, \ldots, 0.3$ in 16 randomly generated environments over 2000 episodes. We report the cumulative regret and runtime with respect to different misspecification levels. Additionally, we performed an ablation study by

  1. removing the certified estimation (Algorithm 2, Line 11) and
  2. removing the quantization (Algorithm 1, Line 8).

The results of these configurations are presented in the following table. The detailed regret for all misspecification levels is presented in Table 1 of the attached PDF. In Figure 1 of the attached PDF, we plot the cumulative regret over 2000 episodes with respect to the misspecification level $\zeta$. The cumulative regret curves are plotted in Figure 2 of the attached PDF.

| Cert. Est. | Quant. | $\zeta=0$ | $\zeta=0.05$ | $\zeta=0.1$ | $\zeta=0.15$ | $\zeta=0.2$ | $\zeta=0.25$ | $\zeta=0.3$ | Running Time |
|---|---|---|---|---|---|---|---|---|---|
| × | × | 754.84 ± 221.06 | 868.10 ± 278.24 | 1160.24 ± 436.41 | 1583.37 ± 690.67 | 2160.15 ± 794.80 | 2644.05 ± 823.24 | 3206.85 ± 825.13 | 1654.76 |
| × | ✓ | 755.21 ± 220.71 | 868.25 ± 278.19 | 1159.56 ± 436.90 | 1583.99 ± 690.68 | 2160.03 ± 795.28 | 2643.40 ± 823.50 | 3206.92 ± 824.89 | 1654.21 |
| ✓ | × | 758.26 ± 220.62 | 845.62 ± 297.89 | 1015.95 ± 321.12 | 1154.01 ± 340.98 | 1290.21 ± 380.01 | 1433.30 ± 544.80 | 1438.27 ± 471.33 | 1599.66 |
| ✓ | ✓ | 758.63 ± 220.23 | 845.80 ± 297.92 | 1015.41 ± 321.57 | 1154.45 ± 341.32 | 1289.96 ± 380.03 | 1434.09 ± 544.40 | 1438.55 ± 471.24 | 1593.38 |

The experimental results suggest several key findings that support our theoretical analysis:

  • When the misspecification level is low, it is possible to achieve constant regret, where the instantaneous regret in the final rounds is approximately zero.
  • The certified estimator and the quantization do not significantly affect the algorithm's runtime. In contrast, the certified estimator provides an 'early-stopping' condition in Algorithm 2, which slightly reduces the algorithm's runtime. In particular, our algorithm yields a computational complexity of $O(d^2AHK^2\log K)$, which is the same as Vial et al., 2022 and only a $\log K$ factor greater than the vanilla LSVI-UCB (Jin et al., 2020) due to the multi-phase algorithm.
  • The certified estimator helps the algorithm by providing robust estimation in the presence of misspecification. As shown in the table, using the certified estimator does not make a significant difference when the misspecification level $\zeta$ is low, but it becomes significant as the misspecification level increases.
  • The quantization does not contribute significantly to the results, as the numerical results are intrinsically discrete and quantized. In Figure 2 of the attached PDF, the regret curve with quantization and the one without quantization are highly overlapped.

We hope these results help clarify the theoretical insights of our work, and we will include this in the revision.


  • Jin, Chi, et al. "Provably efficient reinforcement learning with linear function approximation." COLT, 2020
  • Vial, Daniel, et al. "Improved algorithms for misspecified linear markov decision processes." AISTATS, 2022
Final Decision

This paper studies the linear MDP setting and achieves a high-probability regret bound that is independent of the number of episodes, while past work had a log(K) dependence. An expected regret bound necessarily has a log(K) dependence, so the novelty holds when the high-probability bound holds with probability 1-delta where delta is independent of K. Achieving this result required a careful union bound and analysis of a novel algorithm.

Given that such results hold for bandits, this result is not particularly surprising but nevertheless was unknown. Reviewers were skeptical of the practical value of the result given that the log(K) is necessary for an expected regret bound, and the questionable dependence on the more relevant parameters like the dimension d and horizon H. Indeed, there are many papers that suggest no greater than a d^2 H^3 dependence (versus d^3 H^5) is optimal (e.g., Zannette, Lazaric et al 2020). The authors' defensive reaction to this well-justified line of inquiry was a rebuttal misstep. We suggest the authors more suitably address this dependence in the paper and comment on ways to improve the algorithm and/or analysis. Nevertheless, the techniques and technical content may warrant a valuable contribution.