PaperHub
5.0/10 · Poster · 4 reviewers
Ratings: 4, 5, 5, 6 (min 4, max 6, std 0.7)
Confidence: 2.8 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.5
NeurIPS 2024

Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

We consider the problem of learning an $\varepsilon$-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the *leverage scores* of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an $\varepsilon$-optimal policy using $\tilde{\mathcal{O}}\left(\frac{S+A}{\mathrm{poly}(1-\gamma)\,\varepsilon^2}\right)$ samples, where $S$ (resp. $A$) denotes the number of states (resp. actions) and $\gamma$ the discount factor. Our algorithm achieves this order-optimal (in $S$, $A$ and $\varepsilon$) sample complexity under milder conditions than those assumed in previously proposed approaches.
Keywords
Low-rank RL; Entry-wise matrix estimation; Matrix completion

Reviews and Discussion

Official Review (Rating: 4)

This paper proposes a leveraged matrix estimation method for low-rank policy evaluation and extends it to policy iteration as a model-free learning algorithm. The main idea is to separate the Q matrix estimation into two phases. The first step is to use half of the sample budget to estimate the leverage scores of the Q matrix. Based on the leverage scores, a skeleton of the value matrix is selected. Then roughly one-fourth of the sample budget is used to sample the entries in the skeleton. The remaining sample budget is used to sample the entries outside the skeleton, and the final Q-value matrix is constructed by CUR decomposition with inverse leverage-score weighting. It is proved that the leveraged matrix estimation method eliminates the commonly used incoherence assumption and guarantees an entry-wise estimation error. The effectiveness of the proposed method is demonstrated via simulation in the Appendix.
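For intuition, here is a minimal, self-contained sketch of the two-phase procedure summarized above (our own toy construction, not the authors' LoRa-PI code; it omits the inverse-leverage-score weighting and the singular-value thresholding of the actual LME):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, d = 50, 40, 3
Q = rng.standard_normal((S, d)) @ rng.standard_normal((d, A))  # toy low-rank "Q-matrix"

def noisy_entry(i, j, n):
    """Average of n noisy observations of Q[i, j] (unit-variance noise)."""
    return Q[i, j] + rng.standard_normal(n).mean()

# Phase 1: uniform sampling -> spectral estimate -> leverage scores.
n1 = 5
Q1 = np.array([[noisy_entry(i, j, n1) for j in range(A)] for i in range(S)])
U, _, Vt = np.linalg.svd(Q1, full_matrices=False)
row_lev = np.sum(U[:, :d] ** 2, axis=1)   # row leverage scores of the estimate
col_lev = np.sum(Vt[:d, :] ** 2, axis=0)  # column leverage scores

# Phase 2: sample the skeleton spanned by the k highest-leverage rows/columns
# more heavily, then complete the matrix CUR-style: Q_hat = C @ pinv(W) @ R.
k, n2 = 2 * d, 50
I = np.argsort(row_lev)[-k:]
J = np.argsort(col_lev)[-k:]
C = np.array([[noisy_entry(i, j, n2) for j in J] for i in range(S)])  # S x k
R = np.array([[noisy_entry(i, j, n2) for j in range(A)] for i in I])  # k x A
W = np.array([[noisy_entry(i, j, n2) for j in J] for i in I])         # k x k
Q_hat = C @ np.linalg.pinv(W) @ R

print("max entrywise error:", np.max(np.abs(Q_hat - Q)))
```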

Strengths

The paper is well-written and easy to follow. The theoretical investigation comes with rigorous proof. By adopting leveraged matrix estimation, the authors can relax the incoherence assumption of the low-rank Q matrix, which serves as a major advantage of the proposed method.

Weaknesses

There seems to be a lack of discussion about leveraged matrix estimation methods in the matrix completion/low-rank matrix estimation literature. How does the proposed method perform when compared with other regularized policy evaluation approaches (for example, nuclear-norm/max-norm regularization)? This is crucial in assessing the significance and novelty of this work.

Would it be possible to include experiments on real-world datasets to enhance the significance of this work?

Questions

Is the low-rank assumption of the Q function under any deterministic policy very restrictive? How does that compare to related works?

It would be better to replace the shaded regions in the figures (experiments) with error bars, as the shaded regions overlay each other and are hard to read.

Limitations

The paper is more in a theoretical flavor and lacks real-world experiments.

Author Response

Thank you for your review! Please find our responses below.

A. Comparison with other matrix estimation methods. As mentioned in Section B.1 of our general rebuttal, our work is based on recent progress in the analysis of matrix completion methods providing entrywise guarantees [1,8,9]. It remains an open question how more global convergence guarantees (in Frobenius or spectral norm) can be utilized effectively for RL, so we compare our work only with methods providing entrywise guarantees.

As you suggested, an alternative to our SVD-based approach would be to use nuclear norm minimization. The authors of [34] leveraged the guarantees for nuclear norm minimization from [9] for learning the $Q$-function. However, there are a few reasons why applying nuclear norm minimization is theoretically not straightforward in our setting:

  • Approximate low-rank structure. Since our algorithm is policy iteration-based, our estimates $Q_{\tau}^\pi$ are only approximately low-rank. The result of [9] is based on non-convex optimization, and it is unclear how this approximation error influences the final guarantee of the algorithm. In contrast, a simpler analysis using SVD allows us to explicitly express our bounds in terms of the approximation error.
  • Coherence-free subspace recovery. In our subspace recovery result (Theorem 4), we can bound the subspace error $\Vert U_{i,:} - \widehat{U}_{i,:} O_{\widehat{U}}\Vert_{2}$ in terms of $\Vert U_{i,:}\Vert_2$ (the alignment matrix $O_{\widehat{U}}$ is illustrated in the sketch after this list). It is not clear whether the same can be achieved using current guarantees for nuclear norm minimization, which might instead show a dependence on $\max_{i\in [S]} \Vert U_{i,:}\Vert_2$. We believe this would lead to additional incoherence constants appearing in the sample complexity of our algorithm.
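The alignment matrix $O_{\widehat{U}}$ in the bound above is the usual orthogonal Procrustes rotation between the estimated and true singular subspaces. A minimal numeric illustration on toy data (our own sketch, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
S, d = 30, 3
U, _ = np.linalg.qr(rng.standard_normal((S, d)))                 # true subspace basis
U_hat, _ = np.linalg.qr(U + 0.05 * rng.standard_normal((S, d)))  # noisy estimate

# Orthogonal Procrustes: the rotation O minimizing ||U_hat @ O - U||_F is
# A @ B^T, where A, B come from the SVD of U_hat^T @ U (a standard fact).
A_, _, Bt = np.linalg.svd(U_hat.T @ U)
O = A_ @ Bt

row_err = np.linalg.norm(U - U_hat @ O, axis=1)  # ||U_{i,:} - (U_hat O)_{i,:}||_2
print("max row-wise subspace error:", row_err.max())
```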

We refer the reviewer to Section C of our general rebuttal for a more detailed comparison with other methods employing matrix completion for RL.

B. Real-world experiments. We refer the reviewer to part B.1 of our response to Reviewer WY9t for a discussion on applying our algorithm to gym environments. Moreover, our paper serves more as a proof of concept to show that we can use active sampling to decrease the sample complexity of the problem. We believe that adapting our analysis for more practical, real-world settings by combining our sampling schemes with deep learning can lead to interesting results, but this is beyond the scope of this paper.

C. Our low-rank assumption ($Q^\pi$ low rank for every deterministic policy $\pi$) is less restrictive than previously used assumptions. We refer the reviewer to part B.3 of our general rebuttal for a discussion about our assumptions.

Comment

Dear reviewer,

Thanks again for your review. We have answered your concerns and would be grateful to hear your thoughts. If you need further clarification or want to ask more questions, we would be happy to respond.

Best regards.

Comment

Dear reviewer,

We thank you once again for your efforts. As the rebuttal period is nearing its end, we would appreciate if you could let us know whether our rebuttal addresses your concerns. We would be also happy to answer any further questions you may have while we still can.

Thank you.

Official Review (Rating: 5)

This paper considers the reinforcement learning problem with low-rank latent structure. The objective is to learn an $\epsilon$-optimal policy in a tabular setting. For this problem, they devised LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps.

Their main innovation lies in the latter part where the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, they used spikiness instead of coherence (two-to-infinity norm assumption).

Strengths

  1. According to the authors, their matrix estimation method is the first able to yield entry-wise guarantees that do not depend on the matrix coherence but on its spikiness instead.

  2. The authors use the tabular setting, where entrywise estimation is important, and apply an estimation method that fits entry-wise estimation well. As they mention in Lemma 1, the singular value difference could be very large, but the tabular setting lets the authors focus on entrywise estimation. I believe this is a nice writing strategy.

  3. The authors clearly state why policy iteration is much more useful than value iteration to alleviate condition number assumptions.

Weaknesses

  1. Readers might think this is another application of the low-rank estimation literature. Of course, this type of study, appropriately adapting results from other fields, is also important in research, but I'd say this is not ground-breaking.

  2. (Important) I am not sure how meaningful the transition from 'coherence' to 'spikiness' is. Maybe it is directly related to Question 1. If I understand correctly, this is the main reason for using CUR-based matrix completion instead of other methods, but I believe the authors should provide a more convincing reason (such as theorems?) to justify studying CUR-based matrix completion.

  3. (Important) It is not clear how much improvement they made. It would be great if there were a table for low-rank RL that concisely compares the performance/assumptions of previous works. Could you add a reference table that includes all low-rank and many recent tabular RL results, to show how much improvement you made?

  4. (Important) Dependence of the threshold parameter $\beta$ on the parameter $d$ or on $\sigma_d(Q^\pi)$. See the questions section for details. Also, this $\beta$ seems very important, but it is defined in the appendix. It feels like the authors are trying to deceive readers.

Questions

  • (Important) Is spikiness always better than coherence? The authors state that in a specific case, spikiness could be finite while coherence could be very large. Is there a case where coherence is finite but spikiness blows up? To use the word 'less restrictive', I think such a proof is also necessary.

  • (Important) Table (mentioned in the weakness section)

  • (Important) Could you explain how your threshold on the singular values can be free from the rank $d$? In many sparse linear or low-rank estimation problems, the rank serves as a parameter of the threshold. To remove the rank from the threshold, one must assume a minimum signal condition, such as $\sigma_{\min}>\epsilon$ for some $\epsilon>0$. It seems this was possible in your paper because you assume a 'minimum signal condition' in Theorem 2, $\epsilon < \sqrt{\frac{S+A}{SAd} \sigma_d (Q^\pi)}$. You don't know $d$ or $\sigma_d$, so how can you be sure about this? One main selling point of this paper is that the learner doesn't need to know $d$ or minimum eigenvalue conditions before the game starts (lines 224-225).

  • Is there any promising idea for extending your estimation method to the model-based setting?

  • This idea works well in the tabular setting. What about non-tabular settings, such as linear RL? Please let me know the difficulty of extending your result to linear RL.

*** Minor points

  1. Maybe use $\sigma_{\min}$ instead of $\sigma_d$? Researchers frequently use $\sigma_d(Q^\pi)$ for the $d$-th largest eigenvalue with $d=\max_\pi \mathrm{rank}(Q^\pi)$, which in the traditional notation would mean $\sigma_d(Q^\pi)=0$. Though they've mentioned it in their related works, it would be better to change it to $\sigma_{\min}$ or to define $\sigma_d$ somewhere near the Notation paragraph.

Limitations

N/A

Author Response

Thank you for your careful and insightful review! Please find our responses below.

A. Our low-rank matrix estimation method is novel. Thank you for raising concerns about the novelty of our matrix estimation method. We believe this is an important point to clarify, so we have included a discussion on the novelty of our matrix completion method in Section A of the general rebuttal.

B. Spikiness is always better than coherence. First, note that we not only replace the dependence on the incoherence $\mu$ with the spikiness $\alpha$ in our sample complexities but also remove the term $\mu^2$ completely (please see the table in the general rebuttal and the discussion preceding it). Moreover, the spikiness constant appears as a proxy for $\frac{\Vert Q\Vert_{\max}}{\sigma_d(Q)}$, which essentially quantifies the difficulty of recovering low-rank structure from single observations. This makes it a more intuitive parameter than incoherence, which depends on the Euclidean norm of the rows of the singular vector matrices. Assuming bounded spikiness is less restrictive than bounded incoherence because it always holds that

$$\alpha(Q) = \sqrt{SA}\frac{\Vert Q\Vert_{\max}}{\Vert Q\Vert_\textup{F}} \leq \sqrt{SA} \frac{\Vert U\Vert_{2\to\infty} \Vert Q\Vert_{\textup{op}} \Vert W \Vert_{2\to\infty}}{\Vert Q\Vert_\textup{F}} \leq \mu(Q)^2 d.$$

So bounded incoherence always implies bounded spikiness. Furthermore, note that the CUR method is not crucial for our improvement in the scaling with the incoherence constant. It is active sampling (choosing which rows and columns to sample more) that brings this improvement. However, it is unclear how to leverage active sampling for methods other than CUR, which is why we employ CUR in our algorithm.

C. Comparison with literature on low-rank $Q$-learning. We are thankful again for raising this question. We include a tabular comparison with past works in Section C of our global rebuttal.

D. About the singular value threshold $\beta$. We reassure the reviewer that we included the definition of $\beta$ in Appendix B.4 (see eq. (10), line 639) only due to space constraints in the main paper. It is clearly stated in Proposition 1 where the definition of the threshold $\beta$ can be found. For the sake of clarity, we will introduce $\beta$ in the main text in the updated version of our paper. Next, we discuss why $\beta$ does not depend on $d$ or $\sigma_d$ explicitly. Note that, by definition, the threshold $\beta$ depends on the number of samples $T$ and, moreover, it decays as $1/\sqrt{T}$. Thus, under assumptions on the sample complexity (see (4) in Proposition 1), we ensure that $\beta$ is smaller than $\sigma_d(Q)$. Therefore, all relevant singular values (of which there are $d$) are preserved, and the rest (corresponding to the noise and having a norm smaller than $\beta$) are removed. Please check Appendix B.4 for a more detailed proof. It is also important to note that we do not require knowledge of $\sigma_d(Q)$ for our algorithm to work, but our guarantees are valid for a number of samples $T$ that is sufficiently large depending on $\sigma_d(Q)$.

F. Extension to model-based RL. We refer the reviewer to [37] for entrywise guarantees for model-based RL. In that work, the recovery depends on the properties of the transition kernels and the reward matrix, and not directly on the properties of the $Q$-matrix. The authors of [37] do not use a complex iterative two-phase scheme as we have done in this paper, since in their setting, the matrices $P$ and $r$ they observe are those that should be recovered. Our setting is more challenging because we do not have direct observations of the $Q$-matrix, we employ active sampling, and we achieve coherence-free guarantees.

G. Extension to linear RL. Although they are not directly comparable, we believe that linear RL requires a stronger assumption than the one we use in the paper. Specifically, in linear RL, the $Q$-function can be written as $Q(s,a) = \phi(s,a)^\top \theta$ with $\theta\in\mathbb{R}^d$ and a known nonlinear mapping $\phi \in \Phi$. In our setting, $Q(s,a) = U_{s,:} \Sigma W_{a,:}^\top = \mathrm{vec}(U^\top e_s e_a^\top W)^\top \mathrm{vec}(\Sigma)$, and we do not assume knowledge of $U$ or $W$. Therefore, we believe our work is better aligned with low-rank MDPs where neither $\theta$ nor $\phi$ is known. Note that our setting is a special case of low-rank MDPs where the functions $\phi$ have a specific structure (depending on the singular vectors $U,W$). Solving low-rank MDPs in full generality with computationally tractable algorithms remains an interesting open research question.
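As a quick numerical sanity check of the chain of inequalities in point B, the following snippet computes both quantities for a random low-rank matrix (our own illustration; it uses the standard incoherence definition $\mu = \max\{\sqrt{S/d}\,\Vert U\Vert_{2\to\infty},\ \sqrt{A/d}\,\Vert W\Vert_{2\to\infty}\}$, which may differ from the paper's exact constants):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, d = 200, 150, 4
Q = rng.standard_normal((S, d)) @ rng.standard_normal((d, A))  # random rank-d matrix

U, _, Vt = np.linalg.svd(Q, full_matrices=False)
U, W = U[:, :d], Vt[:d, :].T

alpha = np.sqrt(S * A) * np.abs(Q).max() / np.linalg.norm(Q, "fro")
mu = max(np.sqrt(S / d) * np.linalg.norm(U, axis=1).max(),   # row incoherence
         np.sqrt(A / d) * np.linalg.norm(W, axis=1).max())   # column incoherence

print(f"alpha = {alpha:.2f}  <=  mu^2 * d = {mu**2 * d:.2f}")
```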
Comment

Dear reviewer,

Thanks again for your review. We have answered your concerns and would be grateful to hear your thoughts. If you need further clarification or want to ask more questions, we would be happy to respond.

Best regards.

Comment

Thank you for your detailed rebuttal.

  1. Seems like the table in the common rebuttal shows clear improvement compared to the previous algorithms. It looks great that this work is free from the anchor assumption, and shows the best performance.

  2. About the $\sigma_d$ dependence: I understand the authors' point, but I can't shake off a certain unease. In short, it seems that $\beta$ nominally depends only on $T$, but $T$ comes with the condition that it must be "large enough to satisfy all the conveniences." Of course, this is common practice in RL or bandit papers, but the inclusion of $\sigma_d$ makes me question whether it is really correct to say that we don't need to know the minimum signal condition. The extent to which $T$ can be assumed is subjective, but if "free from $d$ and $\kappa$ (and eventually $\sigma_d$)" is a selling point, I personally believe that we should be able to know in advance how large $T$ needs to be before running the algorithm. Time is tight, but if possible, it would help my judgment if you could point out the cases where the $T$ conditions in the papers included in the rebuttal table involve assumptions about variables that the agent cannot know.

2-1) Also, what happens when $\sigma_d$ is too small and you cannot satisfy the condition on $T$? My intuition says a small singular value should not affect the result much, but in your analysis, it feels like it affects the result a lot.

  3. It's also good to know that bounded coherence always implies bounded spikiness, but according to your calculation, $\alpha^2 < \mu^4 d^2$, and your sample complexity eventually depends on $\alpha^2$ according to Theorems 2 and 3, so in some terrible case, one might say your result depends on $\mu^4$, right? Instead of selling $\mu$-freeness, I'd rather say this paper rests on a different assumption than $\mu$-based studies. Maybe they assumed a setting of moderate $\mu$, while you are assuming the case of moderate $\alpha$.

3-1) It would also be great if the authors could provide a table that includes $d$. I know $d \ll S, A$, but when it enters multiplicatively I think one should add all the parameters.

Again, thanks a lot for your rebuttal.

Comment

Dear Reviewer,

We thank you again for your comments and feedback. We hope that our answers address your concern and would be happy to follow up before the end of the rebuttal phase.

Many thanks.

Comment

1. Thank you for recognizing the improvements made by our algorithm. Indeed, one key takeaway is that our adaptive, learning-based approach offers stronger guarantees than uniform random sampling and can even match the performance of settings with prior knowledge of the problem structure.

2. Regarding the condition on $T$ and its dependence on $\sigma_d$. First of all, we wish to clarify that by saying we do not know $\sigma_d$, we mean that our algorithm does not require $\sigma_d$, or any lower bound on it, as input. For example, the number of samples per epoch in the algorithms in [34] and [35] depends on $\sigma_d$, while our algorithm does not. This allows our approach to work without prior knowledge of the parameters of the $Q$-matrices, which previous algorithms require. This is especially challenging in iterative settings like ours, where prior knowledge would need to include the parameters of all $Q^\pi$ matrices encountered until convergence.

Next, both papers ([34] and [35]) are based on the CUR approach, and we will explain why this approach requires assumptions on the number of samples $T$. In the proof of our Theorem 5, we need to make use of the following inequality:

$$\Vert \widetilde{Q}^\pi_{\tau}(\mathcal{I},\mathcal{J})^\dagger \Vert_{\mathrm{op}} = \frac{1}{\sigma_d(\widetilde{Q}^\pi_{\tau}(\mathcal{I},\mathcal{J}))} \leq \frac{1}{\sigma_d(Q^\pi(\mathcal{I},\mathcal{J})) - \Vert \widetilde{Q}^\pi_{\tau}(\mathcal{I},\mathcal{J}) - Q^\pi(\mathcal{I},\mathcal{J}) \Vert_{\mathrm{op}}}$$

(Note that for simplicity, we set $L=R=\mathrm{Id}$.) The second term in the denominator is an error term that scales as $1/\sqrt{T}$. Therefore, to ensure the expression above is positive, we require a condition of the form $T\gtrsim 1/\sigma_d^2(Q^\pi(\mathcal{I},\mathcal{J}))$. This condition is thus necessary for the analysis of any CUR-based method, including [34].
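For completeness, the bound used in the denominator above is an instance of Weyl's inequality for singular values (a standard fact): for matrices $X, Y$ of the same size, $|\sigma_d(X) - \sigma_d(Y)| \leq \Vert X - Y\Vert_{\mathrm{op}}$, so

$$\sigma_d(\widetilde{Q}^\pi_{\tau}(\mathcal{I},\mathcal{J})) \geq \sigma_d(Q^\pi(\mathcal{I},\mathcal{J})) - \Vert \widetilde{Q}^\pi_{\tau}(\mathcal{I},\mathcal{J}) - Q^\pi(\mathcal{I},\mathcal{J}) \Vert_{\mathrm{op}},$$

and the denominator stays positive once the $1/\sqrt{T}$ error term drops below $\sigma_d(Q^\pi(\mathcal{I},\mathcal{J}))$.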

Now, note that in [35] (e.g., Theorem 14 or Corollary 16 for finite $S,A$), the authors use a much stronger assumption: the discount factor $\gamma = O\left( \frac{\sigma_d^2(Q(\mathcal{I},\mathcal{J}))}{d^2 V_{\max}} \right)$, which must hold even in the high-sample regime (for large $T$). In contrast, our analysis works for any value of $\gamma$.

2.1. Regarding small $\sigma_d$. We agree that a very small $\sigma_d$ should not drastically affect the guarantees. If $\sigma_d$ is very small and, without loss of generality, $\sigma_{d-1}$ is sufficiently large, one could set $d_{\mathrm{eff}} = d-1$ and obtain bounds that depend on $\sigma_{d-1}$ instead. However, ignoring $\sigma_d$ introduces an approximation error that depends on the incoherence of the $d$-th singular vectors. To safely disregard the recovery of the $d$-th singular value and its associated singular vectors, we must ensure that $\sigma_d$ is small and that the $d$-th singular vector is sufficiently incoherent. This contrasts with Frobenius/spectral norm guarantees, where neglecting terms corresponding to the $d$-th singular value leads only to an additive error of $\sigma_d$, regardless of the singular vectors. We are not aware of any method that can select the rank parameter $d$ to address these issues without assuming knowledge of the singular vectors or the appropriate incoherence parameters.
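To make the rank selection discussed here concrete, below is a minimal sketch of singular-value thresholding on toy data (our own illustration; the threshold choice mimics the $1/\sqrt{T}$ decay of the paper's $\beta$ but is not its exact definition):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, d, T_per_entry = 50, 40, 3, 100

M = rng.standard_normal((S, d)) @ rng.standard_normal((d, A))   # true rank-d matrix
M_hat = M + rng.standard_normal((S, A)) / np.sqrt(T_per_entry)  # averaged noisy observations

def threshold_rank(X, beta):
    """Keep only singular values above beta; the effective rank is whatever
    survives the threshold, so no prior knowledge of d is needed."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > beta
    return (U[:, keep] * s[keep]) @ Vt[keep, :], int(keep.sum())

beta = 3 * np.sqrt((S + A) / T_per_entry)  # noise-level threshold, ~ 1/sqrt(T)
M_trunc, d_eff = threshold_rank(M_hat, beta)
print(d_eff)  # recovers d = 3 once T_per_entry is large enough
```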

3. Regarding incoherence and spikiness. While our algorithm's worst-case guarantees scale with $\mu^4$, we believe this comparison is not particularly useful. As shown in the table, in a general, non-adaptive setting, the guarantees typically scale with $\alpha^2 \mu^2$. However, in this paper (following [35]), we demonstrate that adaptive sampling allows us to eliminate the dependence on the incoherence parameter $\mu$. On the other hand, the spikiness parameter $\alpha$ quantifies the difficulty of recovering the full low-rank structure. We believe that the dependence on $\alpha$ cannot be improved through different sampling methods and instead requires a novel matrix recovery method. We will discuss these issues further in the paper and clarify our concept of incoherence-free matrix recovery.

3.1. Regarding the dependence on the rank parameter $d$. We omit explicit mention of the rank parameter $d$ because the sample complexity scales as $d^3$ in all three CUR-based settings, including ours and those in [34,35]. We believe that the nuclear norm-based method used in Theorem 21 of [34] offers slightly better scaling, with $d^2$. We will clarify this further in the revised version of the manuscript.

Official Review (Rating: 5)

The work presents a policy iteration algorithm that relies on an assumed low-rank structure of the value function ($Q$-function). The proposed method first estimates the low-rank matrix of a given policy before performing the policy improvement step. The Leveraged Matrix Estimation (LME) algorithm learns the $Q$-matrix for a given policy: given a sampling budget, LME estimates the $Q$-matrix in a three-step process. The policy iteration algorithm then uses LME at the policy evaluation stage.

Strengths

  • The proposed $Q$-matrix estimation and policy iteration algorithms are interesting for sparse MDPs.
  • Sample complexity bounds are provided for the algorithm.
  • The analysis is easy to follow.

Weaknesses

It is not clear where such low-rank $Q$-matrices appear in the real world. Do the coherence and spikiness of $Q$ as assumed here generally appear in control problems? Can the authors show any real-world application, like gym environments? There is a significant body of work that leverages the rank-deficient structure of $Q$; it is not immediately clear what additional insights the current draft offers over past work.

Can you compare with the other proposed algorithms?

Questions

See weaknesses.

Limitations

See weaknesses.

Author Response

Thank you for your review! Please find our responses below.

A. Appearance of low-rank $Q$-matrices in real-world problems. Low-rank $Q$-matrices have been observed in practice, as shown in Section 4.1 of [43] for a wide range of environments (Atari games). Motivated by this observation, further empirical studies and the first theoretical results assuming prior knowledge of the most informative states were presented in [34,35]. We believe that our paper is well motivated as the first theoretical work proving nearly optimal entrywise guarantees with no prior knowledge of the problem structure.

Furthermore, we recognize that assumptions made for theoretical analysis are seldom completely satisfied in real-world problems. For example, regarding the extensively studied framework of linear MDPs, there are very few real-world problems where such structure holds (with fixed feature vectors). Nonetheless, we believe that our analysis can serve as a proxy for real-world environments and can be practically useful even in settings with an approximately low-rank structure.

B. Coherence/spikiness always appear in real-world control problems. Coherence and spikiness are general properties of matrices, and they are of constant order only in very specific settings where all entries of the matrix are approximately equally informative. We believe that making RL algorithms robust and independent of specific environment-related quantities, such as incoherence, is an important challenge. In this work, we address this challenge for a class of low-rank $Q$-matrices by providing a robust algorithm that ensures low error for a wide range of matrix instances.

C. Numerical simulations in gym environments. We refer the reviewer to part B.1 of our response to reviewer WY9t for a discussion on applying our algorithm to gym environments. Moreover, since the focus of our work is primarily theoretical, we believe that numerical simulations should complement the main theoretical results and be run in the most illustrative environments. Thus, we are not convinced that repeating the experiments of [35] in gym environments would significantly strengthen our work.

D. Comparison with literature on low-rank $Q$-learning. For a comparison with past work, we refer the reviewer to Section C of our global rebuttal.

Comment

Thank you very much for your response. I would be more inclined to see numerical evaluations as a complement to the current theoretical results; however, that is not critical for my decision. I prefer to keep my score unchanged.

Comment

Thanks again for your quick follow-up; we would be happy to answer any other questions if you have more.

Comment

Dear reviewer,

Thanks again for your review. We have answered your concerns and would be grateful to hear your thoughts. If you need further clarification or want to ask more questions, we would be happy to respond.

Best regards.

Official Review (Rating: 6)

In this paper, the authors consider the problem of learning an $\epsilon$-optimal policy in systems with low-rank latent structure, with order-optimal guarantees under weaker conditions than prior work. The proposed algorithm iterates between exploitation (policy evaluation) and exploration (policy improvement). For policy evaluation, the entries are sampled in two stages to complete the matrix estimation. The adopted approach provides entry-wise guarantees with respect to the spikiness of the matrix.

Strengths

  • The authors have done a great job of clearly defining the problem they are trying to solve and conducting a thorough literature survey showcasing the existing work as well as how their results improve upon the existing work.
  • The proposed algorithm is parameter-free, meaning it does not rely on spikiness/rank/coherence bounds to work. Similarly, the fact that the sample complexity bounds do not depend on such prior knowledge also shows the strength of these results.

Weaknesses

  • The notation can be hard to read, especially as $l$/$L$ and $r$/$R$ denote completely different things, whereas in most papers the lowercase letters denote entries of the matrix indicated by the uppercase letter.

Questions

  • Theorem 1 shows that the leverage score estimation depends on the $d$-th singular value. Wouldn't that indirectly impose constraints on the rank/spikiness of the original matrix?

  • In line 285, is the subscript of $\Omega$ a typo?

Limitations

  • Minor comment: please elaborate on acronyms before using them. The goal is to make the paper self-contained for a first-time reader, even though these acronyms are widely known in practice. Line 10: CUR; line 30: MDP; line 88: ERM, MLE.
  • Can the authors compare the simulation results in Appendix A with existing approaches akin to [35], to showcase whether their algorithm indeed works in a practical setting?
  • Can the authors also showcase how their algorithm performs for different values of rank, spikiness, and condition number?
Author Response

Thank you for your valuable review and positive feedback! Please find our responses below.

A. We do not impose any constraints on the rank/spikiness of the original matrix. According to Theorem 1, we need a number of samples $T = \widetilde{\Omega} \left( \frac{S+A}{(1-\gamma)^3} \cdot \frac{r_{\max}^2\kappa^2 SA}{\sigma_d^2(Q)} \right)$, and note that the second factor can be upper bounded by $\alpha^2$ (for $\kappa,d=\Theta(1)$). Thus, the number of samples required for this theorem to hold does depend on the spikiness parameter $\alpha$. However, this does not indirectly impose constraints on the spikiness, since the result holds for any level of spikiness. We only require a higher sample complexity for instances with higher spikiness, which intuitively correspond to more difficult matrix instances. Furthermore, note that we do not require knowledge of the spikiness parameter $\alpha$ when sufficiently many samples are provided.

B. Numerical experiments.

  • Comparison with the numerical simulations from [35]. Note that the algorithm used in the experimental section of [35] corresponds roughly to our LoRa-VI algorithm with uniform anchors, presented in Appendix A.4. More precisely, choosing anchor states uniformly at random in the first stage of our algorithm recovers the algorithm of [35] (with the slight modification of using PI instead of VI, which we do for theoretical reasons, as discussed in Section 3.3). Thus, all results obtained in [35] are obtainable within our framework as well, since it incorporates a more general setting.

  • Additional simulations. We want to emphasize that our main contributions in this paper are theoretical. However, we conducted four different experiments to show the practical relevance of our work. We agree that experimentally testing the dependence on rank, spikiness and condition number is an interesting direction. We will incorporate new experiments regarding these settings in the revised version of our paper.

C. On notation. We thank the reviewer for giving useful feedback about our notation. Here are answers to your questions:

  • using $L,R$ and $\ell,r$ for different quantities: we denote element $(i,j)$ of a matrix $M$ by $M_{i,j}$ in this paper. Thus, elements of the matrices $L$ and $R$ are denoted by $L_{i,j}$ and $R_{i,j}$, respectively. We will add a sentence explaining this in the notation section of the revised paper.
  • subscript in line 285: the subscript $\square$ is not a typo. We use the notation $\Omega_{\square}$ to denote elements in the square matrix formed by the entries $\mathcal{I}\times \mathcal{J}$ in the skeleton of the matrix.
  • regarding acronyms: We will add descriptions of the acronyms we use. Note that CUR is not an acronym.
Comment

Dear reviewer,

Thanks again for your review. We have answered your concerns and would be grateful to hear your thoughts. If you need further clarification or want to ask more questions, we would be happy to respond.

Best regards.

Author Response (General Rebuttal)

Thank you all for your efforts in reviewing our paper. We feel that some of the paper's contributions might have been overlooked, and we take the liberty of highlighting them below.

A. Our matrix completion scheme is novel.

We want to emphasize that our matrix completion method is novel and not merely an application of existing results from the matrix completion literature: our work is the first to leverage entrywise bounds for estimating leverage scores and to use them for adaptive sampling. Previously known two-phase matrix completion algorithms (such as that of [7]) only work in noiseless settings, since estimating leverage scores has been challenging without appropriate entrywise (or $\Vert \cdot \Vert_{2\to\infty}$) guarantees. We use recent advancements in establishing entrywise guarantees to prove that we can 1) successfully estimate leverage scores and 2) use these estimates to sample in a manner that results in incoherence-free entrywise guarantees.

B. Main contributions of our paper.

Next, we want to emphasize three crucial points in which our algorithm is superior compared to previously known methods:

  • Our proposed method enjoys entrywise guarantees. As discussed in [35], entrywise guarantees are crucial for RL algorithms. Handling error in Frobenius or spectral norm does not seem to be sufficient for studying RL algorithms (see section F.2 in [35]).

  • No explicit dependence on incoherence. As shown in the table below, we remove the explicit dependence on incoherence in the sample complexity bounds. In [34], sample complexities scale with the incoherence constant $\mu$ roughly as $\mu^6$, but by decoupling the effect of incoherence from the effect of spikiness (corresponding to terms with $\sigma_d(Q)$), their sample complexity scales as $\mu^2 \alpha^2$ (note that the spikiness constant $\alpha$ satisfies $\alpha \leq \mu^2 d$). In other words, our algorithm is better by a factor of $\mu^2$ than algorithms based on uniform sampling (studied in [34]), and requires the same sample complexity as the algorithm of [35], which has prior knowledge of anchor states.

  • Our proposed method requires milder assumptions. Specifically, we make two significant improvements in reducing the set of assumptions:

    • we do not require prior knowledge of the most informative states (anchor states) as in [35]. Such knowledge greatly simplifies the problem, essentially reducing it to estimating the part of the matrix corresponding to the anchor states only. We believe this assumption is too strong for real-world applications, and we devise a way to learn such states without any prior knowledge.
    • our low-rank assumption ($Q^\pi$ low rank for every deterministic policy $\pi$) is a relaxation of the previous assumption in [34] ($r$ and $P^\pi$ are low rank for every deterministic $\pi$). We achieve this by applying a policy iteration-based algorithm instead of previously reported algorithms based on value iteration. We refer reviewers to Section 3.3 of the paper, where we explain in detail the benefits of using PI-based methods.

C. Comparison with other methods

Below, we provide a brief tabular overview of the most relevant algorithms, their performance, and their assumptions. For brevity, we omit the factors $\kappa,d$ and use the abbreviations NNM (nuclear norm minimization) and MC (matrix completion).

| Method | Error guarantees | Sampling model | Assumption | Sample complexity |
| --- | --- | --- | --- | --- |
| Ours | entrywise | adaptive | bounded spikiness | $\alpha^2 (S+A)/\epsilon^2$ |
| Algorithm 1 [35] | entrywise | a priori fixed anchors | anchors known a priori | $\alpha^2 (S+A)/\epsilon^2$ |
| LR-EVI (Thm 9 [34]) | entrywise | unif. anchors | incoherence | $\mu^2 \alpha^2 (S+A)/\epsilon^2$ |
| NNM [9] (Thm 21 [34]) | entrywise | unif. anchors | incoherence | $\mu^2 \alpha^2 (S+A)/\epsilon^2$ |
| Two-phase MC [7] | exact recovery | adaptive | noiseless | not applicable |

We also note that a different type of low-rank structure has recently received significant attention, namely that the $Q$-function can be written as $Q(s,a) = \phi(s,a)^\top \theta$ for an unknown vector $\theta \in \mathbb{R}^d$ and an unknown feature mapping $\phi$ belonging to some function class $\Phi$. These methods are not directly comparable to those presented in the table above and usually do not utilize matrix completion.

Final Decision

The paper provides a method for RL when the policy reward matrices are assumed to be low rank. While similar setups have been considered before, this paper particularly focuses on replacing the assumption of incoherence or known anchors with bounded spikiness. As the authors argued in the rebuttal, spikiness is a weaker assumption. The main component of their method is a matrix completion algorithm based on estimating leverage scores. I believe this component by itself is a nice contribution and, in my opinion, in fact the main point of the paper. It is important to note that the guarantee required by the RL application is an entrywise recovery guarantee, not the spectral/Frobenius norm guarantees that have been established under the spikiness assumption.

There are a few weaknesses in the presentation of the paper that need to be addressed, which put this paper right on the borderline. The authors made many clarifications in the rebuttal, including on the understanding of the spikiness condition and its comparison with incoherence, as well as a clean table comparing their method to other approaches. Both are extremely important to include in the revision. In fact, in my opinion, all of the clarifications in their general rebuttal need to be included in the paper.

Overall, this is a borderline paper with interesting results and a presentation that can be significantly improved. I am voting for acceptance, with the caveat that my set of papers is very competitive and the presentation shortcomings put this paper on the borderline.