Gaussian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning
We prove rates of normal approximation for the iterates of the linear stochastic approximation algorithm and non-asymptotic guarantees for constructing confidence intervals with the multiplier bootstrap.
Abstract
Reviews and Discussion
The paper presents advancements in the theoretical understanding of the linear stochastic approximation (LSA) algorithm.
It establishes a Berry–Esseen bound for the normal approximation of the Polyak-Ruppert averaged iterates, achieving an optimal rate of order $n^{-1/4}$ with an aggressive step size $\gamma_k \propto k^{-1/2}$. Additionally, it demonstrates the non-asymptotic validity of confidence intervals built via a novel multiplier bootstrap procedure, marking a first in this domain.
The practical utility of these theoretical results is showcased through applications in temporal difference (TD) learning for reinforcement learning.
Strengths
- Theoretical Advancements: The paper makes theoretical contributions by establishing the Berry–Esseen bound for the normal approximation of Polyak-Ruppert averaged iterates. Though some previous works have done this in special cases before, this work differs from them in providing a tighter bound.
- Good Clarity: The paper is easy to follow and well-written. The proof seems correct (not checked very carefully).
Weaknesses
- Strong Assumptions: The requirement that the noise is uniformly bounded is a strong assumption that might limit the applicability of the results. Relaxing this condition to weaker ones, such as finite moments, could enhance the paper's generality.
- Missing References: There are some missing references; see the limitations.
- Discussion on Lower Bounds: The paper focuses on deriving upper bounds but lacks a discussion on the tightness and potential lower bounds. Addressing this aspect could provide a more comprehensive understanding of the bounds' efficacy and limitations.
Questions
See the Limitations.
Limitations
- Missing References on Statistical Inference: Statistical inference for nonlinear stochastic approximation has been considered by [1*] and [2*], which is highly related to this manuscript but hasn't been cited. Note that [1*] provides a Berry-Esseen-like bound for the whole trajectory rather than the averaged iterates. These references could be cited after [36] in line 116.
- [1*] Li, Xiang, Jiadong Liang, and Zhihua Zhang. "Online statistical inference for nonlinear stochastic approximation with Markovian data." arXiv preprint arXiv:2302.07690 (2023).
- [2*] Li, Xiang, et al. "A statistical analysis of Polyak-Ruppert averaged Q-learning." International Conference on Artificial Intelligence and Statistics. PMLR, 2023.
- Assumption on Bounded Noise: Theorem 1 requires that the noise is uniformly bounded, which is a strong condition. Is it possible to relax this condition to a weaker one, such as only having a finite order of moments (such as the fourth-order moment or smaller)?
- Tightness of Derived Upper Bounds: The paper provides upper bounds, but it would be insightful to discuss the tightness of these bounds. Are there any thoughts or conjectures regarding potential lower bounds?
Minor Corrections:
- Figure 1: The last subfigure should be labeled as (c).
We would like to thank the referee uNZU for the careful reading of the manuscript and for raising interesting questions. Below, we address the issues raised.
Missing References on Statistical Inference. We thank the referee for the provided references and will add them to the revised version of the paper.
Strong Assumptions: assumption on bounded noise. Indeed, the assumption of bounded noise is strong, but it can be partially relaxed. Following the stability-of-matrix-products technique used in Proposition 3 of [Durmus et al, 2021], we can generalize the moment bound for products of random matrices (Corollary 4) to the setting where the underlying random variables admit only a finite number of moments. In particular, a finite 3rd moment of the observations (which naturally implies that the noise also admits only a finite 3rd moment) is sufficient to obtain the first main result of the paper, the Berry-Esseen inequality (Theorem 2). However, it will not be sufficient to prove the bootstrap validity (Theorem 3), since this result requires high-probability bounds on the products of random matrices $\prod_{j}(\mathrm{I} - \gamma_j A(Z_j))$. Yet there are two settings in which we can generalize our results without the assumption of bounded noise:
- The random matrix $A(Z)$ is almost surely bounded, but the noise is sub-Gaussian or, more generally, for any $p \geq 2$ it holds that $\mathbb{E}^{1/p}[\|\varepsilon(Z)\|^{p}] \leq c\sqrt{p}$ for some $c > 0$. In such a case, we can generalize our bootstrap validity result along the lines of the current proof;
- The random variable $\|A(Z)\|$ is sub-Gaussian, and the noise is sub-Gaussian. In such a case, using the high-probability bounds outlined in Proposition 3 of [Durmus et al, 2021], we can obtain a counterpart of Theorem 3 up to additional powers of $\log n$.

We will add the discussion above to the revised version of the paper.
Tightness of Derived Upper Bounds. Indeed, it is a very interesting question to obtain matching lower bounds which would show that the rate $n^{-1/4}$ is indeed sharp. See the reply to all referees for details on lower bounds.
Minor Corrections. Thanks, we will correct this typo and perform additional proofreading for the revised version of the paper.
Dear referee,
Please kindly let us know if you have any follow-up questions that need further clarification. Your insights are valuable to us, and we stand ready to provide any additional information that might be helpful.
The present paper studies linear stochastic approximation with martingale-difference noise and diminishing step sizes. The authors obtain Berry-Esseen bounds for the parameter sequence with Polyak-Ruppert averaging, as well as finite-time guarantees for constructing confidence intervals for the LSA parameters. The obtained Berry-Esseen bounds are illustrated by a TD learning numerical example.
Strengths
To the best of the reviewer's knowledge, both of the contributions are novel. In particular, I believe that this is the first Berry-Esseen-type bound to be obtained for general linear SA, which is exciting to see.
The assumptions, contributions and approach to analysis are objectively identified. The authors also did a great job in providing discussions/remarks/intuition for their results.
The paper is well-written but some proofreading is recommended.
Weaknesses
Although the authors did a good job in outlining the scope and contributions of the paper, the analysis and main text are hard to follow given the number of symbols and equations. It is easy for a reader to get lost/distracted midway, and it is very hard to keep track of the definitions of each of the terms.
I understand that this is an issue with theory papers like this, but I would encourage the authors to move unnecessary terms or inequalities to the appendix (e.g., (9) in A3: the exact lower bound adds very little in the main text, in my opinion, and its definition could have been postponed to the Appendix).
The numerical experiments are also weak, since many of the plots are not related to the contributions of the paper itself. Plot (c) seems to be the only one directly related to the theorems of the paper, but it still does not illustrate the theory that well. Maybe adding a curve $C k^{-1/4}$, where $C$ is a constant, to plot (b) for comparison could help in inferring convergence rates.
Also, there is a mistake in the label of Figure 1. Subfigure (b) is mentioned twice.
Questions
- Could the authors clarify if using Polyak-Ruppert averaging is necessary to obtain such a Berry-Esseen bound? It would be exciting to see bounds for unaveraged estimates as well.
- Could the authors run the experiment supporting figure (c) for longer, to see whether the curve with $\gamma = 1/2$ will not continue to go upwards eventually?
Limitations
The authors clearly identified the limitations of their results through a clear list of assumptions.
We would like to thank the referee ew5L for the careful reading of the manuscript and the valuable suggestions for improving the presentation. Below, we address the issues raised.
The analysis and main text are hard to follow. Indeed, there are technical details in the main text that complicate reading, yet we wanted to be precise and state exact lower bounds on the sample size. In the revised version we will switch to $\mathcal{O}(\cdot)$-notation in the corresponding equations and move the precise bounds to the appendix, as was suggested.
The numerical experiments are also weak since many of the plots are not related to the contributions of the paper itself. We respectfully disagree with the referee: figure (a) explains why the other considered step-size exponents are less preferable. In this case plot (a) illustrates the slow convergence of the rescaled error, which in turn explains the slow convergence of the respective rescaled approximation errors in subfigures (b) and (c). For the new plot (see the attached PDF file visible to all referees) we included the Kolmogorov distance rescaled by $n^{1/4}$, with and without logarithmic scaling. Thus we expect that this quantity converges to a constant when $\gamma = 1/2$ and grows with $n$ otherwise. We will also consider running longer experiments; for now we have included a figure with one more observation point corresponding to a larger sample size.
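For concreteness, a rescaled diagnostic plot of this kind could be produced along the following lines (a minimal sketch: the sample sizes `ns` and the distances `rho` are hypothetical placeholders for the estimated Kolmogorov distances, and the constant in the reference curve is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical estimated Kolmogorov distances rho at sample sizes ns
ns = np.array([1e3, 1e4, 1e5, 1e6])
rho = 0.8 * ns ** -0.25          # placeholder values obeying the n^{-1/4} rate

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(8, 3))
ax0.loglog(ns, rho, "o-", label="estimated distance")
ax0.loglog(ns, 0.8 * ns ** -0.25, "--", label=r"$C\,n^{-1/4}$")  # reference curve
ax0.set_xlabel("n")
ax0.legend()
ax1.plot(ns, ns ** 0.25 * rho, "o-")  # flattens iff the rate is indeed n^{-1/4}
ax1.set_xlabel("n")
ax1.set_ylabel(r"$n^{1/4} \times$ distance")
plt.tight_layout()
plt.show()
```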
Could the authors clarify if using Polyak-Ruppert averaging is necessary to obtain such a Berry-Esseen bound? No, in principle it is not necessary, but the result will be slightly different in this case. It is known (see e.g. [Fort, 2015]) that a CLT holds for the last iterate $\theta_n$ as well, with a rescaling by $\gamma_n^{-1/2}$ instead of $\sqrt{n}$ and with a covariance matrix that is different from the asymptotic covariance of the averaged iterates. Then, using the perturbation-expansion technique from [Aguech et al, 2000], we can decompose the error $\theta_n - \theta^\star$ into a sum of three terms: a transient component of the error, a leading (with respect to step size) component of the error, and a remainder term. Using the argument from the current submission, the transient component is exponentially small in $n$, the leading component is a linear statistic of the noise variables that guarantees asymptotic normality after re-normalization, and the moments of the remainder term can be shown to be of smaller order. Thus, applying a similar technique of randomized concentration inequalities (formula (13) in the current submission), we would obtain a Berry-Esseen bound for the last iterate, which should scale as a power of the step size $\gamma_n$.
Typos in figure labelling. Thanks, we will correct this typo and perform additional proofreading for the revised version of the paper.
References:
[Aguech et al, 2000] Rafik Aguech, Eric Moulines, and Pierre Priouret. On a perturbation approach for the analysis of stochastic tracking algorithms. SIAM Journal on Control and Optimization, 39(3):872–899, 2000.
[Fort, 2015] Fort, G. Central limit theorems for stochastic approximation with controlled Markov chain dynamics. ESAIM: PS, 19:60–80, 2015.
I appreciate the authors' responses.
I believe that it would be beneficial to include some discussion/remarks in the final version about the extension of their results to estimates without PR averaging, as in the response provided by the authors.
I apologize for not taking enough time to fully grasp the experiments in the paper, but I understand them now. Thank you for providing a longer run.
We thank the referee for their comments. We will include a discussion on the Berry-Esseen result for last iterate as well as a longer run for simulations.
Overview
Let $(Z_k)_{k \geq 1}$ be i.i.d. random elements with a common distribution over a state space $\mathsf{Z}$. Given mappings $A : \mathsf{Z} \to \mathbb{R}^{d \times d}$ and $b : \mathsf{Z} \to \mathbb{R}^{d}$, with $\bar{A} = \mathbb{E}[A(Z_1)]$ and $\bar{b} = \mathbb{E}[b(Z_1)]$, the goal of the LSA procedure is to find the unique solution $\theta^\star$ of the system $\bar{A}\theta = \bar{b}$.
Given a decreasing sequence of step sizes $(\gamma_k)_{k \geq 1}$ and a starting point $\theta_0$, the standard LSA iterates are given by $\theta_k = \theta_{k-1} - \gamma_k\{A(Z_k)\theta_{k-1} - b(Z_k)\}$, and the Polyak-Ruppert averaged LSA estimate is given by $\bar{\theta}_n = n^{-1}\sum_{k=0}^{n-1}\theta_k$.
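A minimal runnable sketch of this recursion (the two-dimensional model, the bounded perturbations, and the step-size constant below are hypothetical illustrative choices, not the paper's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 100_000
A_bar = np.array([[1.0, 0.2], [0.0, 0.5]])   # -A_bar is Hurwitz
b_bar = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A_bar, b_bar)   # unique solution of A_bar @ theta = b_bar

theta = np.zeros(d)       # starting point theta_0
theta_bar = np.zeros(d)   # running Polyak-Ruppert average
for k in range(1, n + 1):
    gamma_k = 0.5 / np.sqrt(k)                       # aggressive step size ~ k^{-1/2}
    A_k = A_bar + rng.uniform(-0.1, 0.1, (d, d))     # observation A(Z_k), bounded noise
    b_k = b_bar + rng.uniform(-0.1, 0.1, d)          # observation b(Z_k)
    theta = theta - gamma_k * (A_k @ theta - b_k)    # standard LSA step
    theta_bar += (theta - theta_bar) / k             # online PR averaging

print(np.linalg.norm(theta_bar - theta_star))        # error of the averaged estimate
```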
The authors provide Berry-Esseen-type bounds for the Gaussian approximation of the averaged iterates and for the corresponding multiplier bootstrap process. Namely, for the Gaussian approximation result they upper bound the distance, uniformly over convex sets $B$, between $\mathbb{P}(\sqrt{n}(\bar{\theta}_n - \theta^\star) \in B)$ and the probability that a Gaussian vector with the asymptotic covariance lies in $B$. And for the bootstrap approximation result they upper bound the analogous distance between the conditional law (given the data) of the rescaled bootstrap average and the law of $\sqrt{n}(\bar{\theta}_n - \theta^\star)$, where the bootstrap iterates are obtained from a multiplier bootstrap process. It is interesting to notice that this multiplier process can be evaluated online, without keeping a history in memory.
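A sketch of how such a multiplier bootstrap process can be run online, alongside the main recursion, on the same toy model as above (the weight distribution, here Gaussian with mean 1 and variance 1, and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, M = 2, 100_000, 200                    # M parallel bootstrap trajectories
A_bar = np.array([[1.0, 0.2], [0.0, 0.5]])
b_bar = np.array([1.0, -1.0])

theta = np.zeros(d)
theta_bar = np.zeros(d)
theta_b = np.zeros((M, d))                   # bootstrap iterates, one row per trajectory
theta_b_bar = np.zeros((M, d))
for k in range(1, n + 1):
    gamma_k = 0.5 / np.sqrt(k)
    A_k = A_bar + rng.uniform(-0.1, 0.1, (d, d))
    b_k = b_bar + rng.uniform(-0.1, 0.1, d)
    theta = theta - gamma_k * (A_k @ theta - b_k)
    theta_bar += (theta - theta_bar) / k
    # multiplier bootstrap: same observation, i.i.d. weights with mean 1, variance 1
    w = rng.normal(1.0, 1.0, (M, 1))
    theta_b = theta_b - gamma_k * w * (theta_b @ A_k.T - b_k)
    theta_b_bar += (theta_b - theta_b_bar) / k

# conditional quantiles of ||sqrt(n)(theta_b_bar - theta_bar)|| serve as a proxy
# for the quantiles of ||sqrt(n)(theta_bar - theta_star)||
stat = np.sqrt(n) * np.linalg.norm(theta_b_bar - theta_bar, axis=1)
print(np.quantile(stat, 0.95))
```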
Overall proof arguments
To establish the Gaussian approximation result, the authors write the error recursion as $\theta_k - \theta^\star = (\mathrm{I} - \gamma_k A(Z_k))(\theta_{k-1} - \theta^\star) - \gamma_k \varepsilon(Z_k)$, where $\varepsilon(Z_k) = A(Z_k)\theta^\star - b(Z_k)$.
The authors observe that $\varepsilon(Z_k)$ can be viewed as a noise, which is assumed to be bounded. Meanwhile, the operator $\mathrm{I} - \gamma_k A(Z_k)$ is a random perturbation around $\mathrm{I} - \gamma_k \bar{A}$, which is shown to act as a contraction in an appropriate norm, provided that $-\bar{A}$ is Hurwitz. Thus, when one takes the PR average, the noise part is expected to behave as a sum of i.i.d. noises, while the contraction term must shrink. This is formally done by writing the PR average in the form of Theorem 2.1 [reference 60 of the paper] and bounding the terms given by the latter theorem.
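The contraction effect can be checked numerically on the same toy matrices (a hypothetical sketch; the perturbation size and step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, gamma, n_steps, n_mc = 2, 0.05, 300, 200
A_bar = np.array([[1.0, 0.2], [0.0, 0.5]])            # -A_bar is Hurwitz

norms = np.zeros(n_steps)
for _ in range(n_mc):
    G = np.eye(d)
    for k in range(n_steps):
        A_k = A_bar + rng.uniform(-0.1, 0.1, (d, d))  # random perturbation of A_bar
        G = (np.eye(d) - gamma * A_k) @ G             # product of random matrices
        norms[k] += np.linalg.norm(G, 2) / n_mc
print(norms[::100])  # average operator norm decays geometrically in k
```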
The proof of the bootstrap approximation result follows the standard practice in this literature:
- First, conditionally on the sample, a Gaussian approximation is obtained relating the bootstrap to a Gaussian with its covariance.
- Second, a Gaussian comparison theorem relates the latter Gaussian with the desired Gaussian, with some concentration results being used to bound their difference in high probability.
Main claims and observations
The authors show that taking a step size of order $k^{-1/2}$ yields the best possible convergence rate in their bounds. They provide empirical evidence that this rate is optimal. They also claim to be the first to provide a fully non-asymptotic bootstrap approximation result.
Strengths
The paper seems to be the first to provide non-asymptotic Gaussian and bootstrap approximation bounds for LSA. Their assumptions are quite mild and are in line with the ones made in similar papers in other domains. For instance, assumption A.2 is similar in nature to the boundedness assumptions and the strong-covariance assumption made in [A]. The paper is mathematically sound and poses interesting research directions. I'm particularly curious about the convergence rate of $n^{-1/4}$ suggested by their theoretical results and supported by their experiment. The proposed application to policy evaluation in RL is also interesting. Finally, I point out that the code in the supplementary material was easy to reproduce.
[A] Chernozhukov, Victor, Denis Chetverikov, and Yuta Koike. "Nearly optimal central limit theorem and bootstrap approximations in high dimensions." The Annals of Applied Probability 33.3 (2023): 2374-2425.
Weaknesses
The main claim of the paper is that their bounds suggest an optimal convergence rate of $n^{-1/4}$ when taking a step size of order $k^{-1/2}$. The bottleneck of this convergence rate comes from an application of the Cauchy-Schwarz inequality, the boundedness of the noise, and the MSE bound given by Theorem 1. Meanwhile, the authors also provide experimental evidence that this convergence rate is indeed optimal (although not considering gamma values below 0.5 in Figures 1 or 3). A counter-example or a deeper discussion of this convergence rate would be beneficial to the paper: is it an artifact of the proof, or is there reason to believe it cannot be improved?
Questions
Some observations:
- Line 54 has a typo in "Berry-Essee".
- In Equation (15), the superfluous index can be removed to enhance clarity. It is only used to explain that, to evaluate the probability in practice, one must run several samples of the bootstrap process.
- Line 263 has a typo in "date".
- The equation after line 303 uses an incorrect symbol.
- Line 307 asks for a sequence of "TD(0) updates"; what does the "(0)" stand for?
- Figure 1 lacks x and y labels. The legend can also be improved for clarity.
Limitations
The authors adequately addressed the limitations and the potential impact of their work.
We would like to thank the referee DkFj for their work and for the positive feedback! Below, we address the issues raised.
Bottleneck of the convergence rate and counter-example or a deeper discussion on this convergence rate. This is indeed a very important question. First of all, the analysis of Theorems 1-3 can be adjusted for the whole admissible range of step sizes. The only modification would affect the lower bounds on the step size in the corresponding assumptions. There is a reason to believe that the moment bound of Theorem 1 is sharp, that is, that the best possible bound on the moments of the remainder term matches the one we obtain when setting $\gamma_k \propto k^{-1/2}$. The corresponding lower bound on the moments of the remainder terms (in $n$) can be found, for the setting of strongly convex optimization, in [Li et al, 2022]. We expect that this result can be generalized to the LSA setting as well. However, the tightness of the moment bound of Theorem 1 does not directly imply the tightness of the bounds on the Kolmogorov distance. It is hard to say if the cross-correlation terms appearing between the linear and non-linear statistics in the bound (13) are sharp. See also the general discussion on this topic in the reply to all referees.
We leave further exploration of this question as a promising direction for future work.
Typos and misprints. We thank the referee for the careful reading of the manuscript and will fix the raised issues in the revised version of the paper. We will also re-generate the figure with longer trajectories and change the legend as suggested by the referee, in order to improve readability.
Line 307 asks for a sequence of "TD(0) updates", what does the "(0)" stand for? In general, one can perform policy evaluation using the whole family of TD($\lambda$) algorithms, where $\lambda \in [0, 1]$. One can find the details in the paper [Tsitsiklis and Van Roy, 1996]. However, when we choose an instance of the algorithm with parameter $\lambda > 0$, the corresponding dynamics of the parameter updates $(\theta_k)_{k \geq 0}$ becomes non-Markovian. TD(0) is arguably the most popular algorithm of this family, and the only one which falls directly into the stochastic approximation paradigm.
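To make the connection concrete, here is a minimal sketch of TD(0) with linear function approximation written as an LSA instance (the random Markov reward process, the feature map, and the discount factor are hypothetical; states are sampled i.i.d. from the stationary distribution, matching the i.i.d. setting of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, d, disc = 5, 3, 0.9
Phi = rng.uniform(-1.0, 1.0, (n_states, d))       # feature map phi(s)
P = rng.dirichlet(np.ones(n_states), n_states)    # transition kernel
r = rng.uniform(0.0, 1.0, n_states)               # rewards

# stationary distribution of P, so that -E[A(Z)] is Hurwitz (Tsitsiklis & Van Roy)
evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()

theta, theta_bar = np.zeros(d), np.zeros(d)
for k in range(1, 100_001):
    s = rng.choice(n_states, p=mu)                # i.i.d. observation Z_k = (s, s', r)
    s_next = rng.choice(n_states, p=P[s])
    # TD(0) as LSA: A(Z_k) = phi(s) (phi(s) - disc * phi(s'))^T, b(Z_k) = r(s) phi(s)
    A_k = np.outer(Phi[s], Phi[s] - disc * Phi[s_next])
    b_k = r[s] * Phi[s]
    theta = theta - (0.5 / np.sqrt(k)) * (A_k @ theta - b_k)  # TD(0) step ~ k^{-1/2}
    theta_bar += (theta - theta_bar) / k                      # Polyak-Ruppert average
```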
References: [Li et al, 2022] Li, C.J., Mou, W., Wainwright, M. and Jordan, M., 2022, June. Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm. In Conference on Learning Theory (pp. 909-981). PMLR.
[Tsitsiklis and Van Roy, 1996] Tsitsiklis, John, and Benjamin Van Roy. "Analysis of temporal-difference learning with function approximation." Advances in Neural Information Processing Systems 9 (1996).
I thank the authors for their rebuttal. The comment made in the rebuttal on the optimal $n^{-1/4}$ rate obtained in [Bolthausen, 1982] for the martingale Berry-Esseen bound should be in the main text. I wonder if the authors can find an example matching their upper bound (thus showing it is optimal), maybe drawing inspiration from the example in Section 6 of [Bolthausen, 1982]. I thank the authors for the explanation of the meaning of TD(0); it should also be included in the revised text.
Yes, we will include the corresponding discussion on the optimal rates in the martingale CLT in the revised version of the text, as well as a comment on TD(0). Fitting the example provided in [Bolthausen, 1982] into the LSA paradigm is not immediate, but we are working in this direction and, of course, if we succeed in constructing such a lower bound, we will include it in the final text.
We thank the reviewers for their thorough feedback. We are pleased that the reviewers found our contributions on the Berry-Esseen bounds for Polyak-Ruppert averaged LSA and on non-asymptotic bootstrap validity to be new.
The general question raised by the referees DkFj and uNZU relates to the tightness of our bounds and the availability of matching lower bounds. We do not have a formal proof of the tightness of our bounds and highlight that this is an excellent research question. First of all, we believe that the moment bound on the remainder term in Theorem 1 is sharp in terms of its dependence on $n$, since similar results can be traced in the setting of strongly convex optimization in [Li et al, 2022]. However, the tightness of the moment bound of Theorem 1 does not directly imply the tightness of the bounds on the Kolmogorov distance. Second, the key bound (13) in our analysis writes as
$$\sup_{x \in \mathbb{R}}\bigl|\mathbb{P}(W + \Delta \leq x) - \Phi(x)\bigr| \lesssim \sup_{x \in \mathbb{R}}\bigl|\mathbb{P}(W \leq x) - \Phi(x)\bigr| + \mathbb{E}[|W\Delta|] + \sum_{i=1}^{n}\mathbb{E}[|\xi_i(\Delta - \Delta_i)|],$$
where $W = \sum_{i=1}^{n}\xi_i$ and $\Delta$ are the leading (linear) and remainder parts of our considered non-linear statistic, and $\Delta_i$ is a version of $\Delta$ that does not depend on $\xi_i$. It is shown in [Chen and Shao, 2007], Section 4, that the 3rd term in the above sum cannot be removed. However, it is less clear if the same reasoning applies to the correlation term $\mathbb{E}[|W\Delta|]$. At the same time, it is this term which explains the $n^{-1/4}$ scaling of the final bound. However, there is another piece of evidence which suggests that $n^{-1/4}$ is the correct order. We can write the statistic of interest as a sum of three terms, where the last one collects the remainder terms of the non-linear statistic, which are of smaller order in $n$. The first two terms in this sum form a martingale with respect to the natural filtration, and it is known that the typical Berry-Esseen rate in the martingale CLT is $n^{-1/4}$; see e.g. [Bolthausen, 1982]. At the same time, the counterexample of Bolthausen has a special structure, which does not necessarily fall into the structure of the leading terms above. To conclude, further investigation of lower bounds is needed, but the available (incomplete) evidence, from both the moment point of view and the martingale CLT, suggests that $n^{-1/4}$ should be optimal.
We also attach a PDF file with a slightly increased number of observations, as suggested by the referee ew5L. We increased the maximal number of observations and added a scaling of the Kolmogorov distance by $n^{1/4}$, without logarithmic scaling of the $y$-axis, to better highlight the scaling of our approximation rate.
We address the other, more specific concerns directly in the rebuttal to each review.
References:
[Chen and Shao, 2007] Chen, L. H. and Shao, Q.-M. (2007). Normal approximation for nonlinear statistics using a concentration inequality approach. Bernoulli 13(2), 581–599.
[Bolthausen, 1982] Bolthausen, E. Exact convergence rates in some martingale central limit theorems. The Annals of Probability, pp. 672–688, 1982.
The present paper proves normal approximation (CLT) and validity of a bootstrap procedure for the law of the Polyak-Ruppert averaged iterates of linear stochastic approximation (LSA). Their main results have n^{-1/4} convergence rates (up to polylogs and dimension-dependent terms) under reasonable assumptions. Intriguingly, the authors suggest that the magnitude of the error in their CLT might be optimal.
The paper was generally well-received by the reviewers, with scores 5, 6 and 7. The issues raised in the revision period do not seem significant, and can be resolved in the final version. In that connection, I would encourage the authors to add their remarks on weaker assumptions to the text (cf. the response to reviewer uNZU).
Overall, I believe the paper deserves to be accepted: the quantitative CLT is quite interesting, and the bootstrap results seem to be new.