PaperHub
Score: 6.8 / 10 · Spotlight · 4 reviewers
Ratings: 5, 4, 4, 4 (min 4, max 5, std 0.4)
Confidence: 3.3 · Novelty: 3.3 · Quality: 2.5 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Counteractive RL: Rethinking Core Principles for Efficient and Scalable Deep Reinforcement Learning

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Keywords
reinforcement learning, sample efficiency, scalable deep RL, data efficiency, exploration, sample-efficient reinforcement learning, temporal difference learning

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a novel method called CoAct TD Learning. While conventional reinforcement learning selects actions that maximize the state-action value function, this work demonstrates both theoretically and experimentally that selecting actions that minimize the state-action value function (counteractive actions) yields higher TD errors, and the authors show that this leads to improved learning efficiency.
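As a rough illustration of the rule described above (an illustrative sketch, not the authors' code), the only change relative to $\epsilon$-greedy is that the exploratory branch takes the argmin of the current Q-estimates instead of a uniformly random action:

```python
import numpy as np

def coact_action(q_values: np.ndarray, epsilon: float, rng: np.random.Generator) -> int:
    """Counteractive action selection (illustrative sketch, not the authors' code).

    With probability epsilon, take the action that MINIMIZES the current
    Q-estimates (epsilon-greedy would instead take a uniformly random action);
    otherwise act greedily with respect to the Q-estimates.
    """
    if rng.random() < epsilon:
        return int(np.argmin(q_values))   # counteractive (minimizing) action
    return int(np.argmax(q_values))       # greedy action

# Hypothetical usage for a state with four actions
rng = np.random.default_rng(0)
print(coact_action(np.array([0.1, -0.3, 0.7, 0.2]), epsilon=0.1, rng=rng))
```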

Strengths and Weaknesses

Strengths

S1. Novelty and Problem Importance: The paper presents a theoretically grounded method addressing the "poor initial learning efficiency" that deep RL researchers commonly experience. The authors construct theory that holds even during the early stages of learning, when the Q-function is unreliable (they claim that until convergence, Q behaves almost like a random function with little correlation to immediate rewards), and show that counteractive actions maximize the TD error in this regime. Specifically, Theorem 3.4 proves that minimizing actions achieve a TD error that is $D(s) - 2\delta - \eta$ higher than that of average actions (restated schematically below).

S2. High Practicality: The fact that this method requires zero additional computational cost and can be easily integrated into existing TD learning-based algorithms is extremely valuable in practice.

S3. Validation from Theory to High-dimensional Experiments: The effectiveness of the method is validated across multiple scales, from basic verification in a Chain MDP to large-scale experiments in the Arcade Learning Environment. Direct measurement of the temporal difference is also conducted.
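Schematically, the bound described in S1 can be written as follows (a paraphrase of the reviewer's description in placeholder notation; the paper's exact statement may differ):

```latex
% Paraphrase of the reviewer's description of Theorem 3.4 (placeholder notation):
% under the \eta-uninformed and \delta-smooth assumptions, the counteractive
% (Q-minimizing) action is claimed to yield a TD error at least
% D(s) - 2\delta - \eta larger than that of an average (uniformly drawn) action.
\mathbb{E}\left[\left|\delta_{\mathrm{TD}}\big(s, a^{\min}_{Q_\theta}(s)\big)\right|\right]
\;\geq\;
\mathbb{E}_{a \sim \mathrm{Unif}(A)}\left[\left|\delta_{\mathrm{TD}}(s, a)\right|\right]
+ D(s) - 2\delta - \eta
```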

Weaknesses

W1. The assumptions of η-uninformed and δ-smooth are very strong, and verification of whether these hold in actual high-dimensional environments is insufficient. In particular, these assumptions may break down as learning progresses, but the impact of this is not adequately discussed.

W2. No theoretical guarantees for convergence of the proposed method are provided. The selection of minimization actions may bias exploration and potentially hinder convergence to the optimal policy, which requires investigation.

Minor Comment: The text in the figures is too small and should be enlarged.

Questions

Q1. To clarify how well the theory holds experimentally, could you plot how long $D(s) - 2\delta - \eta > 0$ continues to hold from the early stages of learning?

Q2. Looking at Figure 2, the learning curve for ε-greedy appears not to have converged. When iterations are increased further, how do the final convergence scores compare between ε-greedy and CoAct TD Learning? While I understand that the proposed method accelerates convergence in early learning, I believe the convergence values are also important in practice.

Q3. As an application area for reinforcement learning, LLM post-training is gaining attention. Post-training starts from a state where the LLM has completed pre-training, but would this method still be effective for improving learning efficiency in such cases? If effective, I believe this would further enhance the value of this research.

Limitations

Theory-Practice Gap: The theoretical results are based on strong assumptions, and it is unclear to what extent these assumptions hold in actual complex environments.

Bias Toward Short-term Learning: Experimental evaluation is mainly concentrated on short-term periods (100K interactions), and stability and effectiveness in long-term learning are not sufficiently verified.

Environment Dependency: Verification is centered on the specific domain of Atari games, and generalizability to continuous control tasks and other complex domains is unclear.

Lack of Convergence Guarantees: No theoretical convergence guarantees for the proposed method are provided, leaving questions about reliability in practical applications.

Final Justification

As I mentioned in the strengths section, this paper addresses the important problem of poor initial learning efficiency in deep reinforcement learning both theoretically and experimentally, and I strongly recommend the acceptance of this paper. The authors' response has also dispelled concerns about exploration bias.

Formatting Issues

Nothing

Author Response

We deeply appreciate your thorough and insightful review, particularly your recognition of the paper's novelty, practicality, and rigorous validation. Thank you very much for finding our work extremely valuable and novel, and for noting that the effectiveness of our method is validated across multiple scales, from theory to high-dimensional large-scale experiments. Your feedback offers clear and constructive insight into how our work provides a significant, practical, and theoretically sound advancement for the deep RL community.

  1. "To clarify how well the theory holds experimentally, could you plot how long D(s)2δη>0D(s) - 2 \delta -\eta >0 continues from the early stages of learning?"

Regarding how well the theory holds experimentally, Figure 3 in Section 5 reports the temporal difference throughout the entire training, i.e. both early-stage and late-stage training. The results reported in Figure 3 demonstrate that CoAct TD learning achieves a higher temporal difference throughout training compared to canonical methods, as predicted by the theoretical results in Theorem 3.4 and Theorem 3.6.
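For concreteness, a minimal sketch of one way such a temporal-difference measurement can be logged alongside any TD-learning loop (the batch interface, shapes, and averaging here are illustrative assumptions; the paper's exact measurement may differ):

```python
import numpy as np

def td_error_batch(q, q_target, batch, gamma=0.99):
    """Mean per-transition TD error magnitude for a sampled batch (illustrative sketch).

    `q` and `q_target` are callables mapping a state array of shape (B, obs_dim)
    to Q-values of shape (B, num_actions); `batch` is a dict of numpy arrays.
    These interfaces are assumptions for illustration, not the paper's code.
    """
    s, a, r, s_next, done = (batch[k] for k in ("s", "a", "r", "s_next", "done"))
    q_sa = q(s)[np.arange(len(a)), a]                       # Q(s, a)
    bootstrap = (1.0 - done) * gamma * q_target(s_next).max(axis=1)
    td = r + bootstrap - q_sa                               # TD error per transition
    return np.abs(td).mean()                                # scalar to log each step
```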

  1. "Looking at Figure 2, the learning curve for ϵ\epsilon-greedy appears not to have converged. When iterations are increased further, how do the final convergence scores compare between ϵ\epsilon-greedy and CoAct TD Learning? While I understand that the proposed method accelerates convergence in early learning, I believe the convergence values are also important in practice."

The cumulative return reached by CoAct TD at iteration 84 is the optimal value for the Chain MDP. At your suggestion, we ran the experiment for longer and found that $\epsilon$-greedy converged to this same optimal value at iteration 176.

  1. "As an application area for reinforcement learning, LLM post-training is gaining attention. Post-training starts from a state where the LLM has completed pre-training, but would this method still be effective for improving learning efficiency in such cases? If effective, I believe this would further enhance the value of this research."

Thank you very much for this intriguing and insightful question. Yes, indeed, because the pre-training of an LLM does not involve the same reinforcement learning structure as post-training. Currently, reinforcement learning is used either to enhance the mathematical and scientific reasoning capabilities of LLMs or for alignment, as in RLHF. In both of these cases, post-training the large language model is the start of reinforcement learning training on a newly defined task, and hence, yes indeed, we can expect CoAct TD to be effective in improving learning efficiency in both of these cases.

Comment

Thank you for your response. I now understand that CoAct TD reaches the optimal value in Chain MDP. Regarding Figure 2, if both methods converge to the same value, it would be helpful to extend the x-axis until ε-greedy converges and indicate the number of iterations required for convergence to the optimal value. In tasks where the optimal value cannot be reached, wouldn't the selection of minimizing actions introduce exploration bias compared to ε-greedy, potentially hindering convergence to the optimal policy?

Regarding the effectiveness for LLM post-training, since this involves additional training from a model that already has a certain policy, the effectiveness of this method remains unclear without experiments.

Comment

In deep reinforcement learning the optimal value is indeed not reachable in the tasks of Section 5, and if we revisit Figure 1, Figure 4, and Figure 5, we see that CoAct TD consistently performs substantially better. Hence, it does not introduce exploration bias compared to $\epsilon$-greedy.

LLM post-training with reinforcement learning focuses on training the model on a previously unknown task. For instance, a large language model right after pre-training is not able to solve any scientific or mathematical questions and will often not even follow simple instructions. The performance of the model on these new tasks is quite close to zero. Hence, when we start post-training with reinforcement learning on these newly introduced tasks, the performance of the policy is extremely low, quite similar to what we observe at the start of training on standard deep reinforcement learning benchmarks.

Official Review
Rating: 4

The paper introduces a straightforward yet impactful modification to experience collection in deep reinforcement learning: periodically selecting the action with the lowest Q-value. This approach, called CoAct TD Learning, accelerates learning and improves sample efficiency without adding computational overhead. The method is supported by theoretical analysis, is simple to implement, and is empirically validated on various benchmark tasks.

Strengths and Weaknesses

Strengths

  • The paper is well written and easy to follow.
  • The idea of leveraging actions with minimum state-action values for experience collection is novel and intriguing. If the theoretical and empirical claims are substantiated, this approach could have significant impact on the field.

Weaknesses

  • Some of the core assumptions underlying the proposed method, particularly those introduced in Section 3, require further explanation and justification.
  • The presentation of the experimental results should be reorganized to improve clarity.

Questions

Assumptions in Section 3:

  • Line 128-130: The statement that "During the large portion of the training ... while interacting with new experiences in high-dimensional MDPs as the learning continues" is accurate for the early training stage. However, it is unclear whether this remains true as training progresses. Is there empirical or theoretical evidence to support this claim beyond the initial phase?
  • Line 131-133: The statement that "In particular, in high-dimensional environments in a significant portion ... has little correlation with the immediate rewards." also needs supporting evidence or citation.
  • Theorem 3.4 depends on the assumptions that $Q_\theta(s, a)$ is $\eta$-uninformed and $\delta$-smooth. Can you provide proof or empirical evidence that these assumptions hold, especially beyond the very start of training? It appears these conditions may only be valid at initialization and not throughout training.

Experimental results

  • Figure Order and Baselines: Figure 4 (canonical temporal difference learning baseline) should be presented before Figure 1 (QRDQN results), since the canonical method is a more relevant and familiar baseline for most readers. Additionally, QRDQN, while showing a larger performance gap, is less central to the paper’s main claims.
  • Figure 1 Interpretation: The performance gap between canonical $\epsilon$-greedy and CoAct TD emerges midway through training (after about $10^6$ environment steps), which seems at odds with the theoretical foundation that relies on random initialization and uniform action selection. This discrepancy should be addressed.
  • Figure 2 Clarity: The four subplots in Figure 2 lack informativeness; the specific $\epsilon$ values used are not visible or clear. Furthermore, Figure 2 should appear earlier in the paper, as it is discussed before Figures 1 and 4.
  • Additional Baseline: It would strengthen the empirical evaluation to include a comparison with a purely greedy baseline, where $\epsilon$-greedy is not used (i.e., the agent always selects the action with the maximum Q-value).

Limitations

See weaknesses and questions for more detail.

Formatting Issues

N/A

Author Response

We sincerely thank you for your insightful and in-depth review. We are truly glad you found our work impactful and straightforward, and that you noted our paper is well-written and easy to follow. We are especially glad to hear that you consider our core idea both novel and intriguing, with the potential for significant impact on the field.

  1. "(I) Assumptions: Line 128-130: ... It is unclear whether this remains true as training progresses. Is there empirical or theoretical evidence to support this claim beyond the initial phase? (II) Line 131-133: The statement that 'In particular, in high-dimensional environments in a significant portion ... has little correlation with the immediate rewards.' also needs supporting evidence or citation. (III) Theorem 3.4 depends on the assumptions that $Q_\theta(s,a)$ is $\eta$-uninformed and $\delta$-smooth. Can you provide proof or empirical evidence that these assumptions hold, especially beyond the very start of training? It appears these conditions may only be valid at initialization and not throughout training."

All of these questions concern the definitions introduced in Section 3 and their relation to Theorem 3.4 and Theorem 3.6. Yes, there is indeed evidence to support this. In particular, Figure 3 in Section 5 reports the temporal difference throughout the entire training, i.e. both early-stage and late-stage training. The results reported in Figure 3 demonstrate that CoAct TD learning achieves a higher temporal difference throughout training compared to canonical methods, as predicted by the theoretical results in Theorem 3.4 and Theorem 3.6.

  2. "Experimental results"

We highly appreciate your insightful suggestions regarding where the content and figures should appear and how the presentation of the experimental results should be organized to provide a better reading experience. These suggestions are now reflected in the camera-ready version of our paper.

  3. "Additional Baseline: It would strengthen the empirical evaluation to include a comparison with a purely greedy baseline, where $\epsilon$-greedy is not used (i.e., the agent always selects the action with the maximum Q-value)."

At your suggestion, we ran the experiment with a purely greedy baseline. This algorithm completely fails to learn, never achieving a reward above 0.2. In contrast, as shown in Figure 2, $\epsilon$-greedy achieves a reward of 8 after 100 iterations, and CoAct TD achieves a reward of 11. We can add this purely greedy baseline to the reported results as well.

Comment

Thank you for the responses from the authors. However, your explanation regarding the validity of the fundamental assumptions remains unconvincing. Rather than simply comparing the baseline and the proposed method, you should provide theoretical proofs or empirical evidence to support the existence of the hypothesized problem in existing methods. The improvement achieved by the proposed method alone does not directly demonstrate that the baseline suffers from the identified issue. In addition, my questions regarding the experimental results were not addressed.

Comment

The theoretical analysis of Theorem 3.4 and Theorem 3.6 shows that $a_{\min}$ will result in a higher temporal difference than $a_{\textrm{random}}$. The results in Figure 3 plot the temporal difference for CoAct TD and $a_{\textrm{random}}$. Hence, these results demonstrate that $a_{\min}$ indeed achieves a higher temporal difference, as predicted by the theoretical analysis of Theorem 3.4 and Theorem 3.6. Regarding your question on the interpretation of Figure 1, the part you observe corresponds to the initial warm-up phase of training in deep reinforcement learning, particularly in the 100K setting.

Official Review
Rating: 4

The paper proposes a novel plug-and-play module for online reinforcement learning algorithms, called CoAct. CoAct intentionally selects the action with the lowest action value to collect novel experiences and gain information from the environment. Theoretical analysis shows that sampling the action with the minimum value increases the temporal difference (TD) error compared to uniform action sampling, leading to larger updates. Empirical results demonstrate that the method outperforms $\epsilon$-greedy, UCB, and Noisy Networks.

Strengths and Weaknesses

Strengths:

  • The paper provides both theoretical analysis and empirical evaluation to support the intuition and effectiveness of the method.

  • The proposed method is novel and easy to use. As a plug-and-play module, CoAct can be integrated into any existing online reinforcement learning algorithm.

  • The empirical analysis reports both the 80th percentile and the median, offering a detailed overview of the agent’s performance. The parameter settings are reported for reproducibility. 

Weaknesses

  • The explanation remains unclear regarding how a larger TD error contributes to more effective learning. While the theoretical analysis demonstrates that selecting the action with the lowest action value leads to a high TD error, the paper lacks further analysis of how a larger TD error affects the value estimation of the true optimal action, which is ultimately what the agent should learn. After asking "how could minimizing the state-action value function accelerate learning", the paper states that "the realization of the intrinsic difference between $a_{Q_\theta}^{\min}(s)$ and $a_{Q^*}^{\min}(s)$ with regard to $Q_\theta$ and $Q^*$ provides a valuable insight on how counteractive actions do in fact increase temporal difference." It is good that the paper attempts to answer this question, but it would be helpful to provide a more detailed explanation of how this insight further contributes to searching for the true $a_{Q^*}^{\max}$.

  • The application of the method may be limited by the properties or constraints of the environment. Taking $a_{Q_\theta}^{\min}(s)$ may not always be applicable in practice. In online settings, every chosen action affects the environment. Since the action value estimate reflects how good an action is, taking the action with a low estimate introduces concerns about whether it causes irreversible damage to the environment, especially in real-world applications like robotics and autonomous driving. In such domains, collecting data by following $a_{Q_\theta}^{\min}(s)$ could be unacceptable due to safety or cost concerns.

Questions

  • It would be great if the authors could explain the intuition behind learning from actions with the lowest value estimate. What the agent needs to learn in the end is the action with the highest value, in order to derive a good policy. Explaining the intuition more might help readers understand the methodology.

  • Could the authors please clarify how the algorithm is evaluated? In the empirical evaluation, it is not clear whether the reported performance reflects the agent's performance during online interactions or is based on a fixed policy rollout, i.e., whether the agent still follows $a_{Q_\theta}^{\min}(s)$ as the left side of Algorithm 1 indicates. Given that online actions affect the environment, I think it is important to specify this point. If following $a_{Q_\theta}^{\min}(s)$ is part of the learning process, then reporting the return during interactions would be more informative than only evaluating the final policy, as it captures the consequence of taking actions that are considered to be poor choices.

Limitations

Yes

Final Justification

After reading the authors' response and other reviewers' comments, I would like to raise my score, as the response addresses my concerns regarding the intuition, performance reporting, and the role of $\epsilon$. However, it would strengthen the paper to include further discussion on the method's sensitivity to $\epsilon$, which would help readers better understand its practical applicability.

Formatting Issues

N/A

Author Response

We are truly glad to hear that you found our work novel and easy to use, and that our analysis offers a detailed overview of the agent's performance. We are also sincerely glad you highlighted that our paper provides both theoretical analysis and empirical evaluation, supporting the method's intuition and effectiveness. We highly appreciate your positive and valuable feedback.

  1. "It would be great if the authors could explain the intuition of learning for actions with the lowest value estimation. What the agent needs to learn in the end is the action with the highest value, for deriving a good policy. Explainning the intuition more might be helpful for readers to understand the methodology."

This insightful question is highly related to understanding how a higher temporal difference results in effective learning, and to the core counterintuition section of our paper that starts at Line 173. In particular, our theoretical analysis shows that taking the lowest-value action in fact yields a higher temporal difference compared to canonical action selection, and the temporal difference update is essentially the update that leads to learning an optimal policy, i.e. the action with the highest value. Hence, our method shows that by taking the action $a_{\min}$ we can in fact increase the temporal difference, and hence accelerate learning the highest-value action.
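To make the dependence explicit: in tabular Q-learning the per-step change to a value estimate is proportional to the TD error, so a behaviour policy that induces larger TD errors yields larger corrections per sample. A minimal textbook sketch (standard Q-learning, shown only for illustration, not the authors' algorithm):

```python
def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update; Q maps state -> {action: value}.

    The correction applied to Q[s][a] is alpha * td_error, so larger TD errors
    (as argued for counteractive actions) translate into larger updates.
    Standard textbook update, shown only for illustration.
    """
    td_error = r + gamma * max(Q[s_next].values()) - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```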

  2. "The explanation remains unclear regarding how a larger TD error contributes to more effective learning."

Thank you for this insightful question. It is well known in the reinforcement learning literature that a larger temporal difference results in more effective learning [1]. Although we already cite this work [1] on page 3, we can further add more explanation regarding larger TD errors and effective learning.

[1] Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 1993.

  1. "Could the authors please clarify more about how the algorithm is evaluated? In empirical evaluation, it is not clear whether the reported performance reflects the agent’s performance during online interactions or is based on a a fixed policy rollout, i.e., whether the agent still follows amin_Q_θ(s)a^{\textrm{min}}\_{Q\_\theta} (s) as the left side of Algorithm 1 indicates. Given that online actions affect the environment, I think it is important to specify this point. If following amin_Q_θ(s)a^{\textrm{min}}\_{Q\_\theta} (s) is part of the learning process, then reporting the return during interactions would be more informative than only evaluating the final policy, as it captures the consequence of taking actions that are considered to be poor choices."

The reported performance reflects the agent's performance when $a_{Q_\theta}^{\min}(s)$ is taken.

  1. "The application of the method may be limited by the properties or constraints of the environment. Taking amin_Q_θ(s)a^{\mathrm{min}}\_{Q\_\theta} (s) may not always be applicable in practice. In online settings, every chosen action affects the environment. Since the action value estimation reflects how good an action is, taking the action with a low estimation introduces concerns about whether it causes irreversible damage to the environment, especially in real-world applications like robotics and autonomous driving. In such domains, collecting data following amin_Q_θ(s)a^{\textrm{min}}\_{Q\_\theta} (s) could be unacceptable due to safety or cost concerns."

We would like to emphasize that this question is again highly related to the core counterintuition part of our paper starting at Line 173. The core counterintuition introduced in our paper is the realization of the critical difference between $a_{Q_\theta}^{\min}(s)$ and $a_{Q^*}^{\min}(s)$. In particular, from the perspective of $Q^*$, the action $a_{Q_\theta}^{\min}(s)$ is in fact not the true minimum of the optimal state-action value function, i.e. $a_{Q_\theta}^{\min}(s) \neq \operatorname{argmin}_{a \in A} Q^*(s,a)$. Thus, the damage that can be done to the environment is governed by the true rewards obtained by the agent, and hence by $Q^*$ rather than $Q_\theta$. This is indeed very subtle and requires a bit of counterintuitive thinking. We can also provide more textual explanation of this between Lines 173 and 187.

Comment

I sincerely thank the authors for the further clarifications provided in the response. However, my concern regarding the applicability of the method remains. Indeed, $a_{Q_\theta}^{\min}(s)$ does not necessarily equal $a_{Q^*}^{\min}(s)$, especially in the early learning stage, as $Q_\theta$ could be inaccurate and far from $Q^*$, but as learning progresses, $Q_\theta$ is expected to become more accurate and approach $Q^*$. Given that the action-value function reflects the expected return for a given action, taking $a_{Q_\theta}^{\min}(s)$ introduces the risk of taking the action expected to have the lowest return, especially when $Q_\theta$ is well learned. This choice is problematic in risk-sensitive scenarios in the real world.

Comment

We sincerely thank you for your kind response on our rebuttal.

To explain this point better, perhaps we can revisit the results reported in Figure 3. The results of Figure 3 demonstrate that the temporal difference obtained by CoAct TD is substantially higher throughout training, i.e. both early and late training. Hence, the results of Figure 3 demonstrate that throughout training there is a significant difference between $a_{Q_\theta}^{\min}(s)$ and $a_{Q^*}^{\min}(s)$.

Another point to clarify here is that $\epsilon$ is not a static parameter. $\epsilon$ has a decay factor: throughout training it decreases exactly as in $\epsilon$-greedy. Hence, the closer the policy gets to $Q^*$, the less likely it is to exhibit the behaviour executed with probability $\epsilon$. Thus, if you are in a highly risk-sensitive scenario, you can set your decay factor so that $\epsilon = 0$ is reached when your main training phase ends. Furthermore, if you are in a scenario such as autonomous driving and have a constrained set of actions due to risk sensitivity, you can also take minimum-value actions subject to these constraints.
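A minimal sketch of the kind of schedule described above (the linear form and the specific endpoints are illustrative assumptions, not necessarily what the paper uses):

```python
def epsilon_schedule(step, eps_start=1.0, eps_end=0.0, decay_steps=100_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps.

    Setting eps_end = 0 with decay_steps equal to the length of the main
    training phase gives the risk-sensitive variant discussed above; the
    linear form itself is an illustrative assumption.
    """
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```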

Comment

I appreciate the authors' response and the emphasis on the role of $\epsilon$. Tuning $\epsilon$ across different environments raises additional considerations regarding environment accessibility and computational resources. Therefore, it may be worthwhile to include a discussion or experiment on the sensitivity to $\epsilon$. That said, the authors' response addresses my concern, and I will raise my score accordingly.

Official Review
Rating: 4

This paper proposes CoAct TD, an exploration strategy that selects the action that minimizes its current Q-value function to improve the sample efficiency of DRL. The authors provide a theoretical proof that the "counteractive" action generates a higher expected TD error, thereby accelerating the learning process. This theory is validated by extensive experiments on the ALE, where the method demonstrates significant performance gains, such as a 248% improvement over a standard baseline on the ALE 100K benchmark. Also, CoAct TD adds no additional computational cost and is designed as a "plug-and-play" module for any algorithm that relies on TD learning.

Strengths and Weaknesses

Strengths

  • Originality and Significance. The core idea of using the action that minimizes the Q-function for exploration is novel. It addresses the critical problem of sample efficiency with a simple, practical solution. The reported 248% performance improvement on the ALE 100K benchmark highlights the method's potential impact.

  • Quality. The paper has a strong logical structure that flows from theory to practice. It provides a clear theoretical justification (Section 3) for its counterintuitive claim and then rigorously validates this theory with both a simple, motivating example (Section 4) and large-scale, standard benchmarks (Section 5).

  • Clarity. The paper is well-written and easy to follow.

Weaknesses

  • The theoretical proof relies on the Q-function being "η-uninformed" and "δ-smooth", assumptions that hold during the initial phases of training. However, these assumptions no longer hold during the later stages of training. This raises unanswered questions regarding the robustness of the method and its long-term convergence properties.

  • The experiments are limited to discrete-action environments employing DQN-family algorithms. The assertion of being a "fundamental improvement to temporal difference learning" is overly broad; the paper does not present evidence of its relevance to other crucial domains of reinforcement learning, such as actor-critic methods for continuous control (e.g., SAC, TD3), which also depend on TD learning.

Questions

  • The theory for CoAct TD relies on the Q-function being uninformed, a condition primarily met during early training. What is the method's behavior in late-stage training when this assumption no longer holds? A brief analysis demonstrating that the agent does not get stuck in known undesirable states would increase my confidence in the method's robustness.
  • The paper's claim of being a "fundamental improvement" is broad, yet experiments are confined to DQN-family algorithms. Could the authors briefly discuss how the argmin Q principle could be extended to continuous control algorithms (e.g., SAC)?
  • The paper links higher TD error to better exploration. Can the authors provide any direct evidence of improved exploration quality, such as higher state visitation entropy compared to baselines?

Limitations

yes

Final Justification

The authors' responses have addressed many of my concerns, so I raise my score to 4.

Formatting Issues

None

Author Response

Thank you very much for finding the core idea of our paper novel and the paper well-written, and for noting that it addresses a critical problem with a strong logical structure that flows from theory to practice, with rigorous validation of the theory on both motivating examples and large-scale standard benchmarks. We highly appreciate your valuable and insightful review.

  1. "The theory for CoAct TD relies on the Q-function being uninformed, a condition primarily met during early training. What is the method's behavior in late-stage training when this assumption no longer holds? A brief analysis demonstrating that the agent does not get stuck in known undesirable states would increase my confidence in the method's robustness."

Thank you for this thoughtful question. Figure 3 reports the temporal difference throughout the entire training, i.e. both early-stage and late-stage training. The results reported in Figure 3 demonstrate that CoAct TD learning achieves a higher temporal difference throughout training compared to canonical methods, as predicted by the theoretical results in Theorem 3.4 and Theorem 3.6. When we investigate the results reported in Figure 3 together with the results in Figure 1 and Figure 5, we see that the agents robustly outperform the canonical methods throughout the entire benchmark.

  1. "The paper's claim of being a "fundamental improvement" is broad, yet experiments are confined to DQN-family algorithms. Could the authors briefly discuss how the argmin Q principle could be extended to continuous control algorithms (e.g., SAC, TD3)?"

Thank you for this insightful question. The policy update in SAC is performed by taking a gradient step with respect to the loss $J(\phi) = D_{\mathrm{KL}}\big(\pi_\phi(\cdot \mid s_t) \,\|\, \exp(Q_\theta(s_t, \cdot))/Z_\theta(s_t)\big)$. Here we can take the gradient of the cost function $\hat{J}(\phi)$ with $-Q_\theta$ to obtain $\pi_{\phi'}$, and act with $\pi_{\phi'}$ with probability $\epsilon$ to obtain the analogue of the $\operatorname{argmin} Q$ principle.
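One simple way to approximate such an argmin-Q collection rule in continuous action spaces, as a rough alternative to training a separate policy head against $-Q_\theta$ (our own simplification with assumed interfaces, not SAC's API and not the authors' method):

```python
import numpy as np

def counteractive_continuous_action(policy_sample, q_value, state, epsilon,
                                    num_candidates=16, rng=None):
    """Hypothetical argmin-Q analogue for continuous control (a simplification).

    With probability epsilon, sample several candidate actions from the current
    policy and return the one with the LOWEST critic value; otherwise return an
    ordinary policy sample. `policy_sample(state, n) -> (n, act_dim)` and
    `q_value(state, actions) -> (n,)` are assumed interfaces for illustration;
    the discussion above instead describes training a second policy against -Q.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        candidates = policy_sample(state, num_candidates)
        values = q_value(state, candidates)
        return candidates[int(np.argmin(values))]   # counteractive candidate
    return policy_sample(state, 1)[0]               # regular on-policy action
```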

  1. "The paper links higher TD error to better exploration. Can the authors provide any direct evidence of improved exploration quality, such as higher state visitation entropy compared to baselines?"

Many thanks for this intriguing question. The rather well-known link between higher temporal difference and better exploration was investigated in quite early work, starting with [1], and the main connecting evidence between higher TD error, exploration quality, and faster learning in our paper is the significantly improved performance of our method on the ALE benchmark.

[1] Andrew W Moore and Christopher G Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 1993.

We highly appreciate once again that you recognized our efforts to produce a rigorous analysis.

Comment

I appreciate the authors for their responses. In response to Question 1, the authors claim that the results in this paper illustrate the proposed method robustly outperforming baselines in the test environments. However, my concerns about the theory's scope and the method's generality remain due to the lack of a theoretical analysis about the gap between the initial theory and the late-stage training.

Comment

The training curve and the late stage of training tell us how much the agent has learned about $a_{\max}$. However, they do not tell us how much the agent has learned about the value of $a_{\min}$. Although the agent learns fairly good estimates of the values of $a_{\max}$ in the late stage of training, the value estimates for $a_{\min}$ still correspond to initial-stage training levels, solely due to the number of updates that $a_{\max}$ and $a_{\min}$ respectively receive during training. Please note that Definition 3.1 (uninformedness) does not assume or define anything about $a_{\max}$; it only concerns $a_{\min}$ and $a_{\textrm{random}}$. Figure 3 shows how this theoretical analysis predicts, and in fact generalizes to, agents that learn a policy in high-dimensional environments, where the temporal difference of CoAct TD is higher, which is the exact theoretical conclusion of Theorem 3.4 and Theorem 3.5. We would be happy to add this explanation and discussion to the paper.

Comment

Thanks for your responses, which have addressed some of my concerns.

Comment

Dear Reviewers,

We sincerely thank each and all of you for the time you have invested in providing a review for our paper. We are truly glad that

Reviewer XeS3 found the core idea of our paper novel and the paper well-written, while addressing a critical problem with a strong logical structure that flows from theory to practice, with rigorous validation of the theory in both motivating examples and large-scale standard benchmarks.

Reviewer mnA4 found our method novel and easy to use, and noted that our analysis offers a detailed overview of the agent's performance, further highlighting that our paper provides both theoretical analysis and empirical evaluation to support the intuition and effectiveness of the method. We also sincerely thank the reviewer for the thoughtful discussion we had and for highlighting the value of our work, as reflected in their accepting score.

Reviewer LB6J found our work impactful and straightforward, and further noted that our paper is well-written and easy to follow. We are glad to hear that you consider our core idea both novel and intriguing, with the potential for significant impact on the field. We are glad to see you valuing our work highly across all the criteria of quality, clarity, significance, and originality, and we sincerely thank the reviewer for highlighting the value of our paper with an accepting score.

Reviewer EGYp found our work extremely valuable and novel, and noted that the effectiveness of our method is validated across multiple scales, from theory to high-dimensional large-scale experiments. We are truly glad to hear your recognition of our paper's novelty, practicality, and rigorous validation, and we thank you for recognizing the value of our work with a 5 (accept) score.

We sincerely thank all of you.

Final Decision

This paper introduces an approach to increasing the efficiency of exploration, in which the action with lowest current estimated Q-value is chosen whenever the algorithm explores. The key insight (lower Q-value actions will tend to have high TD errors) is proven formally, and both the theoretical claims and more rapid learning than epsilon-greedy and UCB benchmarks are demonstrated empirically.

The reviewers (and I) agree that this approach is simple, surprising, and original. The empirical evaluations are quite convincing. I believe this work is likely to have significant impact.