Quantum Speedups in Regret Analysis of Infinite Horizon Average-Reward Markov Decision Processes
Abstract
Reviews and Discussion
This paper leverages the result in Cornelissen et al. for quantum acceleration of the mean estimation problem for a $d$-dimensional random variable when the number of quantum experiments $n$ is larger than the dimension $d$. Leveraging this fact, the authors can reproduce the standard UCRL proof of Jaksch et al. 2010 with a tighter confidence radius for the estimated transition kernel. This leads to logarithmic regret in $T$ as opposed to the classic $\tilde{O}(\sqrt{T})$ regret bound.
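For intuition, in the univariate special case the quadratic speedup reads, up to logarithmic factors (the precise multivariate statement, with its dependence on the dimension $d$, is in Cornelissen et al.):
$$ |\hat{\mu}-\mu| \;\lesssim\; \frac{\sigma}{\sqrt{n}} \ \ \text{(classical, $n$ samples)} \qquad\text{versus}\qquad |\hat{\mu}-\mu| \;\lesssim\; \frac{\sigma}{n} \ \ \text{(quantum, $n$ experiments)}, $$
and the multivariate version retains this quadratic gain in $n$ once the number of quantum experiments exceeds the dimension, which is what tightens the UCRL confidence radius.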
After rebuttal
My opinion about the paper did not change
Questions for the Authors
What is a quantum experiment in reinforcement learning? Please provide an example.
Moreover, the interaction protocol between the learner and the environment is not clear: does the agent collect a trajectory and perform a quantum experiment at each step?
Claims and Evidence
The theorems are clearly proven
Methods and Evaluation Criteria
Not applicable since there are no experiments
Theoretical Claims
Yes, I checked them. I don't know where the last bound in equation (84) comes from. It seems that the authors use a bound on the number of epochs which is polynomial in $S$ and $A$ but logarithmic in $T$, that is, $O(SA\log T)$. However, this bound does not seem to be proven in the text.
Experimental Design and Analysis
There are no experiments.
Supplementary Material
Not applicable
Relation to Existing Literature
I am not qualified to assess the quantum-computing-related literature. I think there are some missing references on classical regret minimization approaches in infinite-horizon reinforcement learning, such as:
Hong et al., 2024 Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs
Wei et al. 2020, Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes
Missing Important References
See above.
Other Strengths and Weaknesses
The paper has the potential to be an interesting contribution but there are significant motivation issues in my opinion.
First of all, it should be clear what the consequences of this exponential improvement in regret are. The authors should say that, having access to quantum experiments, $\tilde{O}(1/\epsilon)$ of them is enough to learn an optimal policy (up to precision $\epsilon$), while in the classical setting $\tilde{O}(1/\epsilon^2)$ samples are needed.
However, it should be made clear what a quantum experiment is in the context of reinforcement learning. The authors just report the standard definition, but an RL example is needed to convince the RL audience that this result is worth being published.
My recommendation is weak reject because the proof is trivial once endowed with the results from Cornelissen et al. 2022 and Jaksch et al. 2010. This is not a problem per se, but it makes the technical contribution of the paper not very interesting. I am willing to raise my score if the authors clarify the learning problem considered and provide an example of what a quantum experiment is in the context of sampling a trajectory from the MDP.
Other Comments or Suggestions
See above
We thank the reviewer for the insightful comments and valuable suggestions. Below we carefully address each of your concerns:
Bound on epochs in Equation (84)
We appreciate your careful review. Your observation is correct that the bound on the number of epochs as $O(SA\log T)$ is essential. This bound follows directly from Proposition 18 in Appendix C.2 of [1], as explicitly mentioned right after equation (86) in our manuscript. We will further clarify and explicitly reference this in the revised version.
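For completeness, the bound we invoke reads (as we recall Proposition 18 of [1], where a new epoch starts whenever the visit count of some state-action pair doubles, and for $T \ge SA$):
$$ m \;\le\; SA\,\log_2\!\left(\frac{8T}{SA}\right) \;=\; O(SA\log T), $$
where $m$ denotes the number of epochs, $S$ the number of states, $A$ the number of actions, and $T$ the horizon.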
Missing references in classical regret minimization
Thank you for suggesting these references [2] and [3]. We agree that they provide essential context and relevance to classical regret minimization. We will incorporate them explicitly in the revised manuscript to improve completeness.
Definition and example of quantum experiment in RL
We clarify that a "quantum experiment" in reinforcement learning corresponds exactly to performing a state-action transition and obtaining the next state encoded as a quantum state $|\psi_{s,a}\rangle$. Specifically, given a classical state-action pair $(s,a)$, our method obtains quantum samples by leveraging a quantum oracle that prepares the superposition state $|\psi_{s,a}\rangle = \sum_{s'} \sqrt{P(s' \mid s,a)}\,|s'\rangle$, representing the probability distribution of transitioning to the next state. Each such quantum sample preparation constitutes precisely one quantum experiment in our RL setting.
Interaction protocol between learner and environment
We agree with the reviewer on their interpretation. Our learner performs one quantum experiment at each step. Specifically, after observing a classical state-action pair, the agent queries the quantum oracle, performing a quantum experiment to obtain the next state as a quantum superposition. Thus, at each interaction step, one trajectory step is sampled quantum-mechanically rather than classically.
We will clearly describe this interaction protocol in our revision to eliminate any confusion.
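To make the interaction loop concrete, here is a minimal, purely classical simulation sketch of the protocol (all names and the oracle below are hypothetical illustrations, not our implementation; on quantum hardware the oracle would return an actual superposition rather than an amplitude vector):

```python
import numpy as np

# Illustrative only: one oracle query per step ("one quantum experiment"),
# one measurement per query, and no reuse of a measured sample.
rng = np.random.default_rng(0)
num_states, num_actions = 4, 2
# Unknown transition kernel P(s' | s, a), known only to the oracle.
P = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))

def quantum_oracle(s, a):
    """Stand-in for the oracle preparing sum_{s'} sqrt(P(s'|s,a)) |s'>."""
    return np.sqrt(P[s, a])          # amplitudes encode the distribution

def measure(amplitudes):
    """Measurement collapses the superposition to one classical next state."""
    probs = amplitudes ** 2
    return int(rng.choice(len(probs), p=probs))

s = 0
for t in range(5):
    a = int(rng.integers(num_actions))   # action chosen by the (optimistic) policy
    psi = quantum_oracle(s, a)           # one quantum experiment at this step
    s = measure(psi)                     # collapse: the sample is used exactly once
    print(f"step {t}: action {a} -> next state {s}")
```

The point of the sketch is only the bookkeeping of the protocol: the agent alternates classical action selection with a single oracle query and a single measurement per time step.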
Technical novelties in the algorithms and proofs
We respectfully highlight our technical and conceptual novelties:
- Our paper provides the first quantum RL framework for infinite-horizon average-reward MDPs, a notably more challenging setting compared to finite-episode or discounted problems. The quantum infinite-horizon setting introduces complexities (e.g., convergence analysis without episodic resets, the inability to directly apply classical martingale techniques, and the quantum measurement-collapse issue) not encountered in classical analyses or in finite-horizon analyses.
- To handle the critical quantum measurement-collapse issue (absent in classical RL), we developed a novel momentum-based updating scheme (Eqs. 21–25). In addition, unlike [1], our analytical approach avoids classical martingale-based arguments and introduces entirely novel analytical techniques (Lemmas 2–3, Eqs. 34–38).
- Thus, our proofs, while building upon foundational results such as [1] and [4], involve substantial technical novelties tailored specifically to quantum infinite-horizon RL, beyond a straightforward combination of known results.
We sincerely believe these novel contributions significantly enrich the theoretical landscape of quantum RL.
We hope these clarifications address your concerns, highlighting our work's relevance, novelty, and contribution clearly. We kindly request reconsideration of your evaluation in light of these clarifications.
References:
[1] Near-optimal Regret Bounds for Reinforcement Learning, 2008.
[2] Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs, 2024.
[3] Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes, 2020.
[4] Near-optimal quantum algorithms for multivariate mean estimation, 2022.
Thanks a lot for your answer. I still feel that quantum experiments in RL are not well motivated. Could you please provide an example of what the quantum embedding of the next state could be?
Beyond that, I didn't find your argument that the analysis avoids martingale concentration arguments convincing. Since you provide a bound on the expected regret rather than high-probability bounds on the regret, I think that concentration arguments could also be avoided in a classical analysis.
I will keep my original evaluation
We sincerely thank the reviewer for their continued engagement and helpful feedback. We address your remaining concerns below and hope this clarifies our framework further.
Further Clarifications on Quantum Experiments in Reinforcement Learning
We appreciate your persistence in seeking clarity. As requested, we now provide a more concrete example of what a “quantum experiment” could look like in reinforcement learning (RL):
A quantum experiment in RL corresponds to querying a quantum device to obtain a sample from a quantum-encoded distribution. Specifically, in our setting, given a classical state-action pair $(s,a)$, the agent has access to a quantum oracle that returns a superposition state:
$$ |\psi_{s,a}\rangle \;=\; \sum_{s' \in \mathcal{S}} \sqrt{P(s' \mid s,a)}\,|s'\rangle. $$
This quantum state encodes the transition distribution $P(\cdot \mid s,a)$ directly in its amplitudes. Performing a measurement on this state yields a classical next state $s' \sim P(\cdot \mid s,a)$; this measurement process constitutes one quantum experiment.
A concrete example of this setup is illustrated in [1], where a reinforcement learning agent interacts with a quantum simulator to generate gate sequences that prepare target quantum states. In their environment, the agent takes actions (selecting quantum gates), observes quantum states (via Pauli expectation values), and updates its policy based on measurement fidelity. This is effectively a reinforcement learning process where quantum measurements are part of the interaction loop, closely aligning with our interpretation of quantum experiments in RL.
Thus, in our context, each quantum experiment corresponds to a single interaction with the quantum oracle yielding a quantum sample, which can be used once before collapsing. This is in contrast to classical transitions, which can be reused for statistical estimation.
We finally note that this form of quantum oracle, where the agent performs a quantum experiment to query the oracle and obtain a superposed quantum sample, is widely adopted in the theoretical quantum learning literature, including stochastic optimization [3], search algorithms [4], and episodic RL [5]. Such quantum experiment frameworks are the key mechanism enabling quantum speedups in sample complexity and regret analysis.
On Avoiding Martingale-Based Analysis
We agree with the reviewer that classical expected-regret analyses can, in principle, avoid martingale concentration inequalities such as Azuma-Hoeffding by resorting to union bounds (at the cost of looser bounds). However, we emphasize three key points:
- In classical model-based RL literature, high-probability regret bounds using martingale-based concentration (e.g., Azuma-Hoeffding) are standard practice (e.g., in [2]).
- In our quantum setting, such concentration inequalities do not yet have direct quantum analogs.
- We perform a careful analysis that avoids such inequalities and demonstrates that the same order of regret can still be achieved.
Therefore, our analysis intentionally avoids martingale-based arguments and uses union bounds (as in Appendix G). This approach, while possibly slightly looser, is both necessary and sufficient for our setting and reflects a genuine novelty in adapting model-based analysis to quantum learning environments.
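Schematically, the union-bound step takes the following form (illustrative only; the exact confidence radii $\beta_k$ and failure probabilities $\delta_k$ are those specified in Appendix G):
$$ \Pr\!\Big[\exists\, k \le m,\ (s,a):\ \big\|\hat{P}_k(\cdot\mid s,a)-P(\cdot\mid s,a)\big\|_1 > \beta_k(s,a)\Big] \;\le\; \sum_{k=1}^{m}\sum_{s,a}\delta_k(s,a), $$
where each per-epoch, per-pair failure probability $\delta_k(s,a)$ comes directly from the quantum mean-estimation guarantee, so no martingale concentration along the trajectory is required.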
We appreciate your thoughtful and rigorous evaluation, and we hope the additional clarifications on the nature of quantum experiments and the rationale behind our analysis address your concerns. We remain hopeful that this context and explanation may support a reconsideration of your evaluation.
References:
[1] Quantum Architecture Search via Deep Reinforcement Learning, 2021.
[2] Near-optimal Regret Bounds for Reinforcement Learning, 2008.
[3] Quantum speedups for stochastic optimization, 2024.
[4] A fast quantum mechanical algorithm for database search, 1996.
[5] Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret, 2024.
The paper addresses regret minimization in infinite-horizon MDPs with the average-reward optimality criterion. Specifically, the aim of the paper is to show a quantum speedup in the regret by employing quantum mean estimators. The paper presents an optimistic algorithm, a variation of the classical UCRL adapted to the quantum realm, which achieves a regret rate of order $\mathrm{poly}\log(T)$ and various other factors (possibly unnecessary) of $S$, $A$, and the mixing time. The regret result is given with the corresponding theoretical analysis.
Questions for the Authors
See above.
After reading the authors' responses, I have a better understanding of the contribution of this submission. I am increasing my score to weak accept, although I believe the presentation shall be heavily edited to avoid confusion about the setting considered in the paper.
Claims and Evidence
I could not understand most of the claims made in the paper.
Methods and Evaluation Criteria
Evaluating the efficiency of a learning algorithm through a regret upper bound is standard in the RL theory literature.
Theoretical Claims
I did not check the proofs of the claims. The reported regret result looks reasonable once the reader accepts the faster mean estimation rate, although the latter may be hard to conceive.
Experimental Design and Analysis
Not applicable.
Supplementary Material
I did not check the supplementary material
Relation to Existing Literature
This paper follows the literature on regret minimization for average reward MDPs, introducing quantum tools in the regret analysis. To the best of my knowledge, the latter addition is novel.
Missing Important References
I am not familiar with the literature of quantum machine learning. The references provided for regret minimization in average reward MDPs look fine.
Other Strengths and Weaknesses
General comment:
First, I shall mention that I come from an RL background, with knowledge of RL theory, but I know nothing of quantum computing and quantum machine learning. That said, if the intended audience of the paper is the RL community, then the presentation shall arguably be adapted to be accessible to RL researchers, as the current explanations of quantum aspects may be hard to grasp. If the intended audience is something else, then fine.
However, I have one main comment to share, which is the main reason behind my confusion and corresponding evaluation. How can a (new) model of computation of any kind improve the statistical complexity of reinforcement learning? From my understanding, the statistical barriers of average-reward MDPs (i.e., regret lower bound) do not depend on the choice of computation model. Thus, I do not understand how quantum computing can break the statistical barriers in this sense. I am formulating some (potentially dumb) hypotheses below:
- There is an additional implicit statistical structure/side information added to the standard RL model. Then, the general regret barrier does not hold anymore, and speedups can be achieved. If this is the case, this additional structure/information shall be explained and discussed with clarity.
- Quantum computing provides faster computation, which can speed up the regret rate of "tractable" algorithms (a.k.a. algorithms that were considered computationally intractable are now tractable thanks to quantum computing). However, the regret barriers do not refer to computationally tractable algorithms, but to any algorithm. Thus, they cannot be overcome with computational speedup only.
- The quantum perspective allows one to carry more information in a single sample, which leads to a better sample complexity/regret. However, this sounds like changing the problem, and the corresponding result shall not be compared with traditional results (a.k.a. we have a small regret, but for a different problem!). If this is the case, then it makes sense to say that there are applications that admit this new problem formulation, in which the efficiency of learning algorithms fundamentally changes. This leads to the following comment.
- Can the authors help me process the contribution through an example? Let us take an arbitrary application: learning a policy to trade stocks on the market to maximize profits. At every step, the algorithm collects a sample of the price of some stocks of interest, which constitutes the state information together with the owned stocks and their value. The current policy decides whether to buy/sell stocks at the collected prices. How can quantum computing speed up learning in this scenario?
While I am providing a negative evaluation, it shall be noted that I could not understand the contribution here, so my score can be weighed accordingly.
Other Comments or Suggestions
Reporting the regret as $\tilde{O}(1)$ is kind of odd since the regret actually depends on $T$. Why not report $O(\mathrm{poly}\log T)$ then?
We sincerely thank the reviewer for their thoughtful review. We address each point carefully below, hoping to clarify our contributions clearly.
Writing style intended for ML community
Indeed, our intended audience is the ML/RL community, which may not be deeply familiar with quantum computing. To address this, our manuscript provides comprehensive background in Sections 1–2, including:
- Definitions 1–3 (page 2-3): clearly defining classical random variables, quantum experiments, and quantum evaluation oracles.
- Lemma 1 (pages 3–4): clearly explaining the quantum mean estimation lemma critical for our quantum speedups.
- Appendix B explicitly details Quantum Amplitude Amplification (QAA), foundational to quantum mean estimation.
These definitions and explanations are designed specifically for researchers familiar primarily with classical RL.
How quantum computing improves statistical complexity
We appreciate your insightful hypotheses, all of which correctly capture our quantum RL framework:
- Implicit statistical structure: Quantum computations inherently encode probability distributions within quantum states' amplitudes, adding implicit side information absent in classical samples.
- Computational speedup vs. statistical barriers: Our framework leverages quantum computation's unique statistical properties (beyond computational speedup alone), enabling quantum mean estimation with significantly reduced sample complexity.
- More information per sample: Quantum states carry richer statistical information due to superposition, fundamentally enhancing each quantum sample's informational content.
Quantum computing fundamentally accelerates mean estimation via Quantum Amplitude Estimation (QAE) [1]. By encoding probability distributions directly into quantum amplitudes, QAE estimates expectations through Grover-like iterations [2], achieving a quadratic improvement in the required number of samples compared to classical methods. Thus, our quantum approach translates this quadratic mean-estimation improvement directly into a regret improvement, from the classical $\tilde{O}(\sqrt{T})$ to the quantum $O(\mathrm{poly}\log T)$.
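As a back-of-the-envelope illustration (schematic, not the formal argument in the paper): a classical empirical estimate built from $n$ samples has error on the order of $1/\sqrt{n}$, while QAE-based estimation achieves error on the order of $1/n$; summing the corresponding confidence radii over the visits to a state-action pair gives
$$ \sum_{n=1}^{N} \frac{1}{\sqrt{n}} \;\asymp\; \sqrt{N} \qquad\text{versus}\qquad \sum_{n=1}^{N} \frac{1}{n} \;\asymp\; \log N, $$
which is how the quadratic per-sample improvement turns an $\tilde{O}(\sqrt{T})$ regret into an $O(\mathrm{poly}\log T)$ regret.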
Concrete illustrative example
Consider an infinite horizon RL scenario for dynamic resource allocation in large-scale communication networks, a classical RL application. At each time step, the RL agent decides resource allocations (bandwidth, routing paths, etc.) based on observed network states, including current node congestion and usage levels. Classically, accurately estimating next-state transition probabilities (e.g., likelihood of congestion patterns given certain actions) typically requires extensive repeated sampling and interaction with the network, causing delays and inefficient resource utilization.
In our quantum setting, imagine a scenario where a quantum oracle—consistent with standard quantum oracles extensively studied in quantum optimization and Grover algorithms [2]—encodes the entire probability distribution of next network states into quantum states as a superposition. Such quantum oracles could realistically be implemented via quantum sensors or quantum-assisted sampling techniques that inherently exploit quantum mechanical effects (e.g., interference, coherence). By using Quantum Amplitude Estimation (QAE), our quantum RL agent directly accesses these encoded probability distributions, rapidly and accurately estimating expected state transitions using significantly fewer interactions compared to classical sampling.
This quantum-enhanced method fundamentally reduces statistical complexity and significantly accelerates policy learning, leading to improved real-time decisions in critical scenarios (e.g., minimizing congestion, maximizing throughput). Such quantum-assisted techniques align with emerging research trends in quantum technology, where quantum-based sensors and simulators increasingly enable access to rich statistical data through quantum sampling.
Regret reporting clarification
Thank you for highlighting this point. Our regret bound is explicitly reported as $\tilde{O}(1)$, where $\tilde{O}(\cdot)$ hides logarithmic factors in $T$. As defined in footnote 1 on page 1, the growth is at most logarithmic in $T$.
We hope this clarifies our work and kindly ask for reconsideration of your evaluation.
References:
[1] Quantum Sub-Gaussian Mean Estimator, 2021.
[2] A fast quantum mechanical algorithm for database search, 1996.
Dear Authors,
Thanks for your thorough replies to my comments. Reading those and the other reviews substantially helped me understand the contribution being made. I am sharing my (updated) opinion of the paper and some additional comments below. Note that these do not require further input from the authors: I am reporting them as additional feedback that may be used to improve the manuscript (if they happen to be useful).
This paper is addressing a reinforcement learning setting in which a sample collected from the environment carries more information than in the classical setting. The paper shows how this additional information can be exploited to obtain a logarithmic regret in average-reward MDPs. The concrete example hints that this may be obtained by installing quantum sensors on physical systems. I have no idea whether this is "realistic", but I concede that the paper is addressing an interesting theoretical problem, which can foster the development of useful analytical tools. I think the final judgment on the paper will boil down to whether the technical novelty introduced here (commented on in the rebuttals of the other reviews) is interesting enough (the momentum technique does look interesting...).
Bottom line: I am open to increase my score. I will engage in the discussion before making a decision.
Some additional comments:
- The paper could state more clearly that the problem is different from standard RL and the traditional results do not hold here. For instance, the MDP could be called a "quantum MDP", in which samples carry more information;
- For the same reason, the paper shall not compare results against classical RL. Although it is perhaps reasonable to say that installing quantum sensors allows for faster regret than classical sensors, the two regrets cannot be compared one-to-one;
- Not sure what is the right comparison for the regret provided here. I guess the best way to do it would be to develop a lower bound for this setting;
- What I meant with the comment is that I don't see the point in reporting $\tilde{O}(1)$ with a footnote that says it is actually $O(\mathrm{poly}\log T)$. Why not report $O(\mathrm{poly}\log T)$? $\tilde{O}(1)$ may suggest the regret is constant at first glance, which is just confusing;
- The manuscript could focus more on the technical novelty and how the employed tools could be of independent interest.
We sincerely thank the reviewer for their thoughtful follow-up, constructive suggestions, and openness to further discussion. We address the additional comments below, aiming to clarify our positioning and improve the clarity of the manuscript accordingly.
Clarifying the Difference from Standard RL/MDPs
We fully agree with the reviewer that our framework is fundamentally different from classical reinforcement learning due to the use of quantum oracles, which enable richer statistical information per sample. Referring to our setup explicitly as a “quantum MDP” is a great suggestion, and we will revise the manuscript to emphasize this distinction more clearly. Highlighting this difference will help avoid confusion and better position our work within the broader RL literature.
On Comparison with Classical RL Results
We acknowledge that our setting departs from the standard classical RL assumptions, and direct one-to-one regret comparisons are not always appropriate. Our intention in comparing with classical results was to illustrate the potential of quantum techniques in improving regret performance, particularly when quantum information is available. Moreover, we emphasize that our work is the first to propose and analyze a quantum reinforcement learning algorithm in the infinite-horizon average-reward setting, which distinguishes it from prior episodic or discounted setups.
Clarifying Regret Reporting Notation
Thank you for raising the issue regarding the use of $\tilde{O}(1)$. In the manuscript, differently from $O(\cdot)$, we use the $\tilde{O}(\cdot)$ notation, which is standard in RL theory to denote bounds that hide logarithmic factors and focus on polynomial terms. That said, we fully agree that the presentation may be misleading at first glance. In the revision, we will state more clearly that the regret is poly-logarithmic in $T$ to avoid confusion and improve accessibility for a broader audience.
On Quantum Regret Lower Bound
We appreciate the reviewer's suggestion regarding a lower bound specific to our quantum setting. Currently, we achieve a regret of order $O(\mathrm{poly}\log T)$, which is the tightest possible in terms of polynomial dependence on $T$. However, deriving tight lower bounds with respect to other variables, such as $S$ and $A$, in this quantum model is indeed an interesting direction for future research, and we will acknowledge this explicitly in our revised manuscript.
Technical Novelty of the Momentum-Based Update
We deeply appreciate the reviewer’s recognition of our momentum-based update approach. To elaborate further: classical model-based RL algorithms, including those in [1], typically reuse samples across multiple epochs by incrementally updating state-action count statistics. However, in the quantum setting, this is problematic due to the collapse of quantum states upon measurement, making such reuse infeasible.
Our momentum-based estimator (Eqs. 21–25) addresses this fundamental challenge by enabling accurate, single-use updates from quantum samples—thus ensuring compatibility with quantum constraints and enabling regret minimization despite the one-time-use limitation of quantum measurements. This development is essential for integrating quantum speedups into model-based RL frameworks and represents a core technical novelty of our work.
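For illustration (the exact recursion is given in Eqs. 21–25; the step size $\eta_k$ and the notation below are generic placeholders rather than the paper's exact symbols), the momentum-based estimate has the schematic form
$$ \hat{P}_{k+1}(\cdot\mid s,a) \;=\; (1-\eta_k)\,\hat{P}_k(\cdot\mid s,a) \;+\; \eta_k\,\tilde{P}_k(\cdot\mid s,a), $$
where $\tilde{P}_k(\cdot\mid s,a)$ is the fresh estimate obtained from the $k$-th batch of quantum samples, each of which is measured (and thus consumed) exactly once.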
We are grateful for the reviewer’s careful analysis and suggestions, and we hope these clarifications further demonstrate the significance of our contributions. We would be honored if this additional context helps in reconsidering the evaluation.
Reference:
[1] Near-optimal regret bounds for reinforcement learning, 2008.
Summary
The authors study the problem of regret analysis—specifically, finding the policy that minimizes regret—when navigating Markov Decision Processes (MDPs) with simultaneous quantum and classical access to the environment. They propose a concrete quantum protocol based on the quantum mean estimation algorithm, employing a tabular approach to identify an optimal policy for navigating an MDP. Through thorough theoretical analysis, they demonstrate that their policy achieves an exponential improvement in regret scaling with respect to the parameter T (the horizon or number of rounds). Beyond utilizing both quantum and classical access to the environment, the authors introduce two key assumptions: (i) the MDP has finite mixing time; (ii) the reward function is known. The first assumption ensures that the state-action pair table is sufficiently populated through the exploration performed by the learning algorithm.
Questions for the Authors
See comments and suggestions
Claims and Evidence
Yes, the claims are clearly stated in well-formulated lemmas and theorems, and the provided proofs appear to be sound and convincing.
Methods and Evaluation Criteria
All statements are supported by rigorous lemmas and theorems.
Theoretical Claims
All theoretical claims are clearly stated, and the provided proofs appear solid and well-structured.
Experimental Design and Analysis
N/A.
Supplementary Material
I have scanned the proofs but have not checked them in full detail.
Relation to Existing Literature
The authors provide a thorough review of the classical setting of the problem and offer a well-structured overview of the quantum methods and generalizations built upon it.
Missing Important References
None that I am aware of.
Other Strengths and Weaknesses
- This work highlights a significant exponential speedup in regret scaling with respect to T. The authors employ a well-structured tabular approach, leveraging clever exploration and the quantum mean estimation algorithm to efficiently collect the necessary information to solve for the optimal policy.
- The results seem to be both conceptually and technically close to the series of works cited, including https://arxiv.org/pdf/2205.14988 and especially https://arxiv.org/pdf/2302.10796, where the latter obtains an exponential improvement in regret (w.r.t. T) for exploration in reinforcement learning. It is not clear to me in what sense the present manuscript is an improvement. What does it achieve that was not achieved there? Are the techniques different?
- In terms of overall applicability, a weakness is the reliance on a tabular approach, which limits scalability when the number of states becomes large—a common challenge in reinforcement learning applications. However, this work could serve as an important foundation for extending these methods to model-free approaches.
Other Comments or Suggestions
The main comment/suggestion is to clearly explain in what conceptual and technical sense this work goes beyond the line of works which achieved the $\sqrt{T} \to \log T$ improvement. Since these types of works have mostly, if not exclusively, theoretical relevance (I would be happy to be corrected here), the relevance of this work hinges on novel conceptual or technical contributions. Without this being clarified, I cannot be confident in giving this paper a high score.
We thank the reviewer for their positive feedback and valuable suggestions. We address each point carefully and explicitly below:
Comparison with [1]
We clarify that the problem studied in [1] is fundamentally different from our setting:
- Episodic vs. Infinite Horizon: Reference [1] focuses on episodic RL with a finite horizon, whereas our work investigates infinite-horizon average-reward RL. Infinite-horizon RL is generally considered a more challenging setting because the optimal policy needs to sustain performance indefinitely without episodic resets, requiring different theoretical frameworks, convergence analyses, and stability considerations.
- Technical Differences: The techniques in [1] depend heavily on episode-wise decomposition, and their quantum acceleration leverages lazy updating with episodic resets. Conversely, our infinite-horizon scenario necessitates a novel momentum-based updating framework (Eqs. 21–25) to avoid repeated quantum measurement collapse. Unlike the episodic case, infinite-horizon settings cannot directly leverage lazy episodic updating; thus, our work introduces entirely novel analytical tools (Lemmas 2–3, Eqs. 34–38).
Our contribution is the first to achieve quantum RL speedup in infinite-horizon average-reward settings, filling a significant theoretical gap distinct from [1].
Scalability and tabular approach
We fully acknowledge the limitation identified by the reviewer regarding the reliance on a tabular approach. Indeed, scalability to large state-action spaces is challenging, and addressing this limitation is an important direction for future work. We explicitly mention this limitation and highlight that extending our quantum RL framework beyond tabular methods (e.g., using function approximation methods or linear mixture models) is an essential next step, deserving further exploration.
Conceptual and technical novelty compared to classical methods
We appreciate the reviewer highlighting this critical point. We emphasize our conceptual and technical novelties:
- Our work is the first to provide a quantum RL framework and regret analysis for infinite-horizon average-reward settings, a distinct and more challenging problem than episodic RL.
- Unlike classical methods, our work uses quantum amplitude estimation (QAE) techniques that fundamentally improve the statistical complexity, achieving an exponential improvement ($O(\mathrm{poly}\log T)$ regret) over the classical $\tilde{O}(\sqrt{T})$.
- The momentum-based updating scheme (Eq. 21–25) developed particularly to avoid quantum state collapse is novel and uniquely necessary for infinite-horizon RL. The absence of martingale techniques—ubiquitous in classical regret analysis—further differentiates our analytical approach (Lemma 2–3, Eqs. 34–38).
Our contribution not only establishes theoretical quantum RL foundations in an unexplored domain but also provides novel tools essential for future research in quantum RL.
We sincerely hope this addresses your concerns and clearly clarifies our contributions and novelties. We kindly ask you to reconsider and possibly improve your evaluation in light of these clarifications.
References:
[1] Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret, 2024.
I thank the authors for their response. I notice they only discussed the contributions relative to one of the references I provided? On balance, I think the paper does have merit, but overall I do not see sufficient innovation over previous results to increase the grade. I would be happy to see this paper accepted, but I am also not sure it certainly has to be.
We thank the reviewer again for the thoughtful engagement and for acknowledging the merit of our work. We would like to briefly clarify a few remaining points:
On the Additional Reference
Thank you for pointing out both references [1] and [2]. We focused our initial comparison on [1], which is the work most related to our infinite-horizon RL setting. The other reference, [2], pertains to quantum bandits, which involve a different methodology and problem formulation compared to full reinforcement learning with long-term planning and reward assignment. The key difference is that the bandit framework does not include any state evolution; we also note that its speedup is again due to an improvement in mean estimation, like the one we use, but the analysis approach cannot be extended to our case directly. That said, we appreciate its relevance to the broader context of quantum learning, and we will explicitly include and discuss this work in our revised related works section for completeness and transparency.
Clarification of Technical Innovations and Novelty
To restate and reinforce our key contributions:
- First Quantum RL Algorithm for Infinite-Horizon Average-Reward RL with Provable Guarantees: Our work is the first to establish a quantum RL and regret analysis framework for the infinite-horizon average-reward setting, which is notably more challenging than the finite-horizon or episodic setups explored in previous quantum works. Unlike classical methods, we leverage QAE to achieve a fundamental improvement in statistical complexity, reducing the regret from the classical $\tilde{O}(\sqrt{T})$ to the quantum $O(\mathrm{poly}\log T)$.
- Novel Momentum-Based Updating Scheme: The momentum-based update rule (Eqs. 21–25), developed to accommodate quantum measurement constraints and avoid state collapse, enables one-time-use quantum samples while still supporting accurate model estimation and planning. This technique is not present in classical RL and addresses a core obstacle in transferring model-based methods to quantum settings.
- Martingale-Free Regret Analysis: Our analytical approach bypasses classical martingale-based concentration (e.g., Azuma-Hoeffding), which lacks direct quantum analogues, and instead uses a looser but suitable union-bound-based analysis (Lemmas 2–3, Eqs. 34–38), representing a shift in analytical tools tailored to quantum learning environments.
Together, these contributions provide not only theoretical regret guarantees but also novel tools likely to inspire future developments in quantum reinforcement learning.
We sincerely hope these clarifications help in reassessing the novelty and potential impact of our work. If the reviewer has any further concerns, we would be glad to address them. We greatly appreciate your time and consideration, and hope that this alleviates your concerns.
References:
[1] Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret, 2024.
[2] Quantum Multi-Armed Bandits and Stochastic Linear Bandits Enjoy Logarithmic Regrets, 2022.
This work considers the online reinforcement learning problem of controlling an unknown MDP to optimize the long-term average reward. As opposed to the classical setting, the authors consider a quantum framework (where quantum states carry richer information) and propose an optimistic-based approach. In this setting, they demonstrate an (at first surprising) regret guarantee of order $O(\mathrm{poly}\log T)$, in sharp contrast with the classical $\tilde{\Theta}(\sqrt{T})$ rate (for which there exist both upper and lower bounds).
On the technical side, this work extends the result for quantum MDP from episodic to infinite horizon with average reward and leverages the seminal UCRL approach. Reviewers acknowledged the novelty and interest of some of the technical derivations, in particular the momentum-based updates.
The paper also suffers from several shortcomings:
- the motivation behind the "quantum" MDP framework and its real-world applicability, counterbalanced by the theoretical nature of the contribution.
- the presentation/positioning of the paper, which somehow compares the result in the classical setting and the quantum setting. While there is no superiority of one result over the other, it demonstrates how richer information (of the quantum states) can translate into better learning rates.
- the incrementality of the derivation compared to previous works. Yet, some aspects have been recognized as novel and of independent interest.
Given the authors' engagement during the discussion phase, I am relatively confident that the above weaknesses can be addressed in a revised version. Thus, I recommend acceptance if the authors can 1) strengthen the pedagogical introduction to quantum concepts and stress the difference in information richness compared to the classical setting, and 2) detail the comparison with previous works and clearly highlight the technical challenges and novelty in extending them.