Reinforcement Learning with Action Chunking
We propose Q-chunking, a simple, effective offline-to-online RL method that uses action chunking to improve value propagation and exploration via temporally coherent actions.
Abstract
Reviews and Discussion
This paper proposes Q-Learning with Action Chunking (Q-LAC), a reinforcement learning method that models action chunking by enabling the policy to output a sequence of actions from a given state. The critic is trained to evaluate the value of a state along with the $h$-step action sequence. The approach is first trained on an offline dataset and then further trained through additional online learning, following an offline-to-online setting.
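In symbols (my own notation; $h$ denotes the chunk length), the actor and critic described above take the form

$$\pi_\theta(a_t, a_{t+1}, \dots, a_{t+h-1} \mid s_t), \qquad Q_\phi(s_t, a_t, a_{t+1}, \dots, a_{t+h-1}),$$

i.e., the policy predicts an $h$-step action chunk from a single state and the critic scores the state together with the whole chunk.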
Strengths and Weaknesses
Strengths
- The paper is clearly written and well-organized.
- The idea of action chunking policy and a chunk-based Q-function is interesting and rarely explored in the RL context.
- The literature review is comprehensive.
Weaknesses
- The claimed advantages of action chunking over standard 1-step action policies and skill-based approaches are not convincing. Action chunking breaks the Markov property and weakens the tight feedback loop between policy and environment. Let's denote the action chunking policy as ACP and the normal 1-step action policy as N1P for convenience. Intuitively, ACP appears less favorable than N1P: it sacrifices reactivity and granularity, and may even be viewed as a degenerate case of N1Ps:
  - N1P has finer granularity and higher real-time responsiveness. In ACP, no matter how the future states change, the policy blindly executes the pre-predicted chunk. There's no guarantee that the latter actions in the chunk are reasonable for the future states. In contrast, N1P reacts to every state, which is clearly more rational.
  - If ACP is to be responsible for future h-step actions, under the same model scale, it requires the policy to have some foresight and predictive capability about the environment's transition model, i.e., some form of internal model-based reasoning. This increases modeling difficulty.
  - Under the Markovian assumption, $\pi(a_t \mid s_t)$, the policy outputs a single action based on the current state (given less information, do less behavior); under the non-Markovian assumption, $\pi(a_t \mid s_1, \dots, s_t)$, the policy outputs a single action based on the history of states (given more information, do less behavior). In contrast, ACP outputs multiple actions based on one state, $\pi(a_t, a_{t+1}, \dots, a_{t+h-1} \mid s_t)$ (given less information, do more behavior). Clearly, ACP is the least favorable case.
  - Furthermore, compared to skill-based methods (as mentioned around Line 116), while skill-based methods are still step-wise policies internally, they can be interrupted or switched at any time; ACP, on the other hand, enforces a hard commitment to an action sequence and cannot be interrupted, which is less flexible than skill-based methods.
- Limited Contribution. The proposed method can be considered as adding action chunking to the FQL baseline [1]. However, as noted in point 1, the theoretical justification and claimed benefits of action chunking are not well established. Simply integrating action chunking into FQL, without a solid theoretical grounding, offers limited innovation and contribution. Moreover, the reported performance comparisons with FQL deviate notably from those reported in the original FQL paper (see detailed Question 1 below), raising my concern about the claimed improvements empirically.
- In Figure 2, we can see that AC performs worse than the RLPD baseline in online RL, validating the above concerns. To address this, the paper introduces offline training and behavior regularization, leading to the results in Figure 3. However, these results don't convincingly demonstrate that Q-LAC's performance is due to action chunking or behavior regularization. It seems more likely that the offline training provides a better starting point for the policy (Q-LAC starts from around 0.3, while RLPD starts from 0), but RLPD quickly reaches the same performance. From this perspective, one main limitation of Q-LAC becomes its heavy reliance on offline data and behavior regularization. Therefore, I tend to view the use of Q-LAC in offline-to-online scenarios as more of a necessary compromise than a design choice.
- Experimental comparisons raise several concerns:
  - As shown in Figure 4, in about half of the tasks, RLPD achieves performance equal to or better than Q-LAC. This again suggests that Q-LAC's advantage likely comes from offline pretraining. For example, in the puzzle environment, Q-LAC already reaches its performance peak after offline training. I don't think this convincingly demonstrates the advantage of Q-LAC, as it only shows improved sample efficiency in online learning, at the cost of more offline training.
  - The core contribution of the paper is action chunking, but this point is not well supported by the experiments. If the impact of offline pretraining could be isolated, the results would be more convincing. Worryingly, Figure 2 shows that Q-LAC performs worse than the RLPD baseline in pure online settings; in Figure 4, in half of the environments (puzzle, robomimic, cube-quadruple), FQL-n performs better than FQL; and in all tasks, RLPD consistently outperforms RLPD-AC5. These results seem to contradict the claimed benefit of action chunking.
  - The number of baselines (3) and evaluated tasks (6) is relatively small. For example, one of the baselines, FQL, was compared against 10 baselines on dozens of domains, which is better experimental practice. Besides, the experiments in Figures 5-7 only include one or two tasks. More comprehensive comparisons would strengthen the paper.
In summary, I think the action chunking policy proposed in this paper is an interesting idea, but it lacks sufficient theoretical support regarding MDPs, the Markov property, the Bellman equation, and policy convergence or optimality under this paradigm. The experimental results also fail to show clear advantages of action chunking itself. Therefore, I believe the contribution of this paper is limited.
[1] Park, Seohong, Qiyang Li, and Sergey Levine. "Flow Q-Learning." ICML, 2025.
Questions
- How were the baselines fine-tuned in the experiments to ensure they are performing properly? For example, in the original FQL paper (see Figure 12), FQL achieves near-optimal performance (close to 1.0) on tasks such as cube-double and scene. However, in this paper's Figure 4, FQL only reaches around 0.6 on these tasks. What accounts for this difference in performance?
Limitations
Yes.
Final Justification
I acknowledge that the paper presents extensive experimental evaluations. I would like to raise my score from 2 to 3, but I still believe the work falls below the NeurIPS acceptance bar. My main concerns are as follows:
- The primary claimed contribution is introducing action chunking to the RL community and validating it through extensive experiments. In my view, this contribution is rather limited in novelty, since action chunking is not a new concept and is closely related to macro-actions, skills, and options in RL.
- The authors’ attempt to apply chunking to online RL was not successful, leading them to turn to the offline-to-online RL setting. I still think this seems more like a necessary compromise rather than a deliberate design choice.
- The authors emphasize additional designs, such as using expressive policies with behavior constraints. However, this is not novel either, as it has already been proposed in FQL, and thus cannot be counted as an original contribution. In essence, Q-LAC is equivalent to FQL + chunking.
- In the rebuttal, the authors noted that chunking can also improve other algorithms, such as BFN and IFQL. However, both are based on flow policies, suggesting that the observed improvement may still stem from the synergy with flow policy rather than chunking itself.
In summary, the work explores the application of action chunking to several RL algorithms, identifying cases where it improves performance and others where it does not, ultimately combining multiple existing strategies to achieve a cumulative enhancement. While the exploration is good, I don't see sufficiently novel or original contributions (e.g., action chunking, flow policy, and behavior constraint are all proposed by prior works), therefore, I give my score to be 3.
Formatting Issues
N.A.
Thank you for your insightful comments and suggestions on our paper. We addressed your main concern about our experiments by evaluating our method and all our baselines (11 baselines, including 3 new ones) on 25 OGBench tasks (5 tasks for each of the 5 domains). In contrast, FQL evaluates 15 tasks in the offline-to-online setting (1 task for each domain). Our method outperforms all baselines across the board. In addition, we apply action chunking to two of our new baselines (BFN and IFQL), which yields two new variants of our method (Q-LAC-BFN, Q-LAC-IFQL) that also significantly improve the corresponding baselines. In particular, Q-LAC-BFN achieves strong performance across the board, on par with and sometimes better than Q-LAC. Please see the summary of our new results in our response to Reviewer nCbd.
We also believe there are several misunderstandings of key results in our paper, which we would like to clarify below:
“What accounts for this difference in performance for FQL compared to the original FQL paper?”
Our results in Figure 4 were an aggregation of both task2 and task4. In contrast, for the offline-to-online RL experiments, the FQL paper only reported performance on the representative task for each domain (e.g., task2 for cube-double). The significantly harder task (e.g., task4) was not evaluated. For the rebuttal, we reran all the experiments on all 5 tasks for each domain, and our reported performance matches the FQL paper on the same task (e.g., task2 on cube-double).
“In Figure 4, in half of the environments, FQL-n performs better than FQL … These results seem to contradict the claimed benefit of action chunking.” “As shown in Figure 4, in about half of the tasks, RLPD achieves performance equal to or better than Q-LAC.”
The results showing that FQL-n performs better than FQL on these environments do not contradict the claimed benefit of action chunking. Instead, they are evidence that n-step returns do help speed up value learning, and Q-LAC uses chunking to enjoy a similar benefit to the n-step return backup. In our new results, Q-LAC performs significantly better than FQL-n, especially on the hardest domains (cube-triple and cube-quadruple). As for RLPD, while it is true that RLPD is a competitive baseline in that it can match the online performance on the simpler tasks, it fails on the hardest cube-quadruple domain. In our newest experiments, both FQL and RLPD fail to solve all but one task in the cube-quadruple domain. In contrast, Q-LAC solves 4 out of 5 tasks on cube-quadruple.
“Q-LAC performs worse than the RLPD baseline in pure online settings” / “Results don't convincingly demonstrate that Q-LAC's performance is due to action chunking or behavior regularization. It seems more likely that the offline training provides a better starting point for the policy”
On the hardest domain, cube-quadruple, Q-LAC exhibits strong performance online (solving 4/5 tasks) with an offline pre-trained performance of 0. On the other hand, RLPD methods do not succeed on any of these tasks. This provides strong evidence that the effectiveness of Q-LAC also comes from its ability to learn online effectively, rather than just the advantage of being able to pre-train offline as suggested by the reviewer. We would also like to clarify a couple of misunderstandings: (1) None of our methods are evaluated in the pure online setting; all the methods leverage an offline dataset. (2) Figure 2 shows that Q-LAC outperforms RLPD, RLPD-AC and RLPD-AC-BC after an initial pre-training phase, which is the opposite of what you seem to suggest in your review. (3) Our paper focuses on the offline-to-online setting, in which it is common to use a pre-training phase [1-3].
| | Q-LAC (ours) | Q-LAC-BFN (ours) | RLPD | RLPD-AC | RLPD-AC-BC |
|---|---|---|---|---|---|
| cube-double-task1 | → | → | → | → | → |
| cube-double-task2 | → | → | → | → | → |
| cube-double-task3 | → | → | → | → | → |
| cube-double-task4 | → | → | → | → | → |
| cube-double-task5 | → | → | → | → | → |
| cube-triple-task1 | → | → | → | → | → |
| cube-triple-task2 | → | → | → | → | → |
| cube-triple-task3 | → | → | → | → | → |
| cube-triple-task4 | → | → | → | → | → |
| cube-triple-task5 | → | → | → | → | → |
| cube-quadruple-100M-task1 | → | → | → | → | → |
| cube-quadruple-100M-task2 | → | → | → | → | → |
| cube-quadruple-100M-task3 | → | → | → | → | → |
| cube-quadruple-100M-task4 | → | → | → | → | → |
| cube-quadruple-100M-task5 | → | → | → | → | → |
[1] "Cal-QL: Calibrated offline RL pre-training for efficient online fine-tuning."
[2] "Flow Q-learning."
[3] "Adaptive policy learning for offline-to-online reinforcement learning."
“RLPD consistently outperforms RLPD-AC5. These results seem to contradict the claimed benefit of action chunking.”
RLPD-AC5 is not our method; the figure shows that naively adding action chunking (RLPD-AC5) to an online RL algorithm (RLPD) does not work well. Our key insight is that we need to use an expressive policy with a behavior constraint to regularize the RL training (see Section 4.2). RLPD-AC5 (1) uses a Gaussian policy (much less expressive than the flow policy used in Q-LAC), and (2) does not use a behavior constraint, whereas our method, Q-LAC, is based on FQL, which employs a behavior constraint.
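To make these two ingredients concrete, here is a minimal sketch (PyTorch-style, with hypothetical `policy`, `critic`, and `batch` objects, and a simplified DDPG+BC-style objective rather than our exact FQL-based loss) of a behavior-constrained update for an action-chunking actor:

```python
import torch
import torch.nn.functional as F

def chunked_actor_loss(policy, critic, batch, alpha=1.0):
    """Sketch of a behavior-constrained update for an action-chunking actor.

    batch["obs"]:          (B, obs_dim)      states s_t
    batch["action_chunk"]: (B, h, act_dim)   dataset actions a_t, ..., a_{t+h-1}
    `policy` maps s_t to a flattened h-step action chunk; `critic` scores (s_t, chunk).
    """
    obs = batch["obs"]
    data_chunk = batch["action_chunk"].flatten(1)            # (B, h * act_dim)

    pred_chunk = policy(obs)                                  # expressive policy outputs the whole chunk
    q_value = critic(torch.cat([obs, pred_chunk], dim=-1))    # chunked critic Q(s_t, a_t, ..., a_{t+h-1})

    bc_loss = F.mse_loss(pred_chunk, data_chunk)              # behavior constraint toward dataset chunks
    return -q_value.mean() + alpha * bc_loss                  # maximize Q while staying close to the data
```

Dropping either ingredient, the expressive chunk-level `policy` or the `bc_loss` term, roughly recovers the RLPD-AC5-style setup that fails in Figure 2.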
“[action chunking policy] sacrifices reactivity and granularity, and may even be viewed as a degenerate case of [1-step action policies]” / “[1-step action policy] has finer granularity and higher real-time responsiveness”
Our action chunking policy is not a degenerate case of 1-step action policies. It cannot be fully represented by a 1-step action policy, because an action chunking policy can produce non-Markovian behavior that a 1-step action policy cannot. While a 1-step action policy has higher real-time responsiveness, it also exhibits worse temporal coherency in our experiments (see our response to Reviewer kQoc). Temporally correlated/non-Markovian noise has been known to improve online exploration [1, 2], and our method provides a convenient way to generate temporally coherent actions while also being informed by offline data as a prior. While it is true that our action chunking policy can only represent a subset of all non-Markovian policies, in practice we found that using an action chunking policy (Q-LAC) still consistently outperforms a 1-step action policy (e.g., FQL).
[1] "Pink noise is all you need: Colored noise exploration in deep reinforcement learning."
[2] "Continuous control with deep reinforcement learning."
Action chunking policies are limited as they cannot be switched, and history-conditioned policies are more favorable. / “Advantage of action chunking over 1-step action policies and skill-based approaches are not convincing”
While there are limitations to action chunking policies, these do not diminish their strong empirical performance across the board (see the table in our response to Reviewer nCbd). We show in our new experiments that Q-LAC performs better than all methods that use 1-step policies (FQL, RLPD, BFN) and the skill-based approach (SUPE-GT; see the description in our submission). We additionally include two variants of Q-LAC, Q-LAC-BFN and Q-LAC-IFQL, obtained by applying action chunking to two newly added offline-to-online RL baselines, BFN and IFQL (see more details in our response to Reviewer nCbd). As shown in the result table, action chunking consistently improves upon the non-chunking counterpart, supporting our claim that introducing action chunking into RL leads to improved performance.
“action chunking policy proposed in this paper is an interesting idea, but it lacks sufficient theoretical support”
We are happy to hear that you think the action chunking policy proposed in this paper is an interesting idea. To provide more theoretical rigor and as recommended by other reviewers, we have included a theorem and proof below that shows the value backup for our chunked Q-function is unbiased. Please refer to our response to Reviewer joyS for more details on the theorem and discussions.
Thank you again for your constructive feedback and detailed reviews. Please let us know if you have any other concerns or questions. If we have successfully addressed all your concerns, could you kindly raise your rating?
Thanks for the authors' response.
The most important concern from my end is still in my initial review, regarding the contribution and novelty, which is also raised by reviewers nCbd and kQoc. As stated in the paper and rebuttal, Q-LAC is essentially equivalent to introducing the concept of action chunking into FQL, and action chunking itself has already been proposed in prior work.
(Line 261 in paper) Our algorithm builds on top of FQL, and this baseline is a direct ablation for the effect of action chunking.
(Response to Reviewer nCbd) The main difference between FQL and Q-LAC lies in the use of chunking in the actor and the critic.
Although the authors explain that bringing action chunking into FQL involves additional design and adaptation (such as using an expressive policy, discovering that direct online RL fails, thus requiring offline RL), this still falls below the bar of contribution and novelty for NeurIPS.
Besides, the authors’ response still leaves some important points unclear.
- Still unconvincing advantages of AC policy
In my initial review, I raised two points that remain unaddressed:
If ACP is to be responsible for future h-step actions, under the same model scale, it requires the policy to have some foresight and predictive capability about the environment's transition model, i.e., some form of internal model-based reasoning. This increases modeling difficulty.
Under the Markovian assumption, the policy outputs a single action based on the current state (given less information, do less behavior); Under non-Markovian assumption, the policy outputs a single action based on the history of states (given more information, do less behavior). In contrast, ACP outputs multiple actions based on one state (given less information, do more behavior). Clearly, ACP is the least favorable case.
In the rebuttal, the authors also admit:
While there are limitations to action chunking policies, they do not diminish their strong empirical performances across the board.
This leaves me confused. If chunking has some inherent limitations, why does it still show strong empirical performance? Without theoretical or intuitive explanation, these experimental results become even more confusing. We need clearer justification or evidence showing why an AC policy can outperform a step-wise policy. From my understanding, an AC policy could save inference-time cost, thus being advantageous in high-dimensional visual control robotics or tasks requiring ultra-fast reaction. However, disregarding inference cost, purely from a performance perspective, I believe the step-wise policy should be the upper bound for the AC policy, i.e., making a decision at every state should be at least as good as making an h-step decision per state. If the authors can provide theoretical support (requested in my initial review) or intuitive explanations to clarify this, such as action chunking's MDP formulation, performance gain, and convergence properties, it would strengthen the paper.
- Comparison with RLPD
Some issues from my initial review remain only partially addressed:
(1) It is known that directly applying action chunking in online RL is ineffective or even harmful (as Figure 2 in the paper shows). Therefore, the authors introduce offline pre-training to address this, hence my question:
It seems more likely that the offline training provides a better starting point for the policy.
The authors’ explanation is:
On the hardest domain cube-quadruple, Q-LAC exhibits strong performance (solving 4/5 tasks) online with offline pre-trained performance being 0.
Thus, I interpret this as: even if Q-LAC’s offline stage produces a 0-reward policy, the subsequent online phase can still outperform RLPD. My question is: does this 0-reward policy after offline training essentially equal starting from a randomly initialized policy? If so, since it is already shown that pure online RL + AC performs poorly, why would online RL + AC starting from a 0-reward policy suddenly work well?
(2) The comparison with RLPD still seems somewhat unfair. In the paper, the authors note:
(Line 265) Despite its (RLPD) simplicity and a lack of pretraining stage, RLPD remains a strong baseline for the offline-to-online setting.
(Line 269) This baseline (RLPD-AC5) also does not have a pretraining stage and directly learns online.
But in the rebuttal, they state:
(1) None of our methods are evaluated in the pure online setting. All the methods leverage an offline dataset.
(3) Our paper focuses on the offline-to-online setting which is common to use a pre-training phase.
This makes me think that comparing with an RLPD variant without offline pretraining is unfair.
Even so, according to Figure 4, RLPD still matches or even outperforms Q-LAC in 4 out of 6 domains. Notably, I have several observations:
(a) In the cube-double domain, Q-LAC starts at around 0.3 after offline training while RLPD starts at 0, yet during online learning RLPD reaches its peak after 3 points, whereas Q-LAC only catches up after 5 points;
(b) In the scene domain, Q-LAC reaches ~0.7 after offline training, but decreases to ~0.6 in online learning, without any improvement, while RLPD starts at 0 and improves to ~0.9;
(c) In the cube-quadruple-100M domain, both Q-LAC and RLPD start at 0, but RLPD learns nothing, while Q-LAC improves to 1.0.
These observations are very confusing to me. Does action chunking help online learning? Not really, because RLPD performs better when considering only the online part, and online learning even decreases Q-LAC's performance. Does action chunking help with offline learning? Seemingly yes, because after offline training, Q-LAC has a better starting point than other methods. But why, starting from the same 0-reward point, can Q-LAC achieve significantly better performance than RLPD in online learning? I wonder what would happen if we started from Q-LAC's offline policy but followed the RLPD online learning process.
Thanks for the follow-up questions! We appreciate your engagement, as it allows us to make further clarifications.
In the following thread, we will address several key concerns raised (and a short summary of how we address them individually):
- Contribution and Novelty:
We demonstrate the significant performance gains achieved by action chunking across multiple existing offline-to-online methods, and show that a naive application of action chunking to Gaussian policies fails (e.g., RLPD-AC). This highlights the importance of the two key design principles in Q-LAC: expressive policies and behavior constraints. We believe that showing how a simple modification (action chunking) can lead to significant gains across multiple offline-to-online baselines is a meaningful contribution to the RL and ML community.
- Limitations of Action Chunking Policies:
We discuss the limitations of both step-wise and action chunking policies. Specifically, step-wise policies exhibit worse temporal coherency (as we show in our empirical analysis) and biased n-step return backups. While action chunking policies have worse responsiveness and can be sub-optimal in certain cases, in practice they perform much better than step-wise policies (e.g., Q-LAC vs. FQL, Q-LAC-BFN vs. BFN, Q-LAC-IFQL vs. IFQL) due to:
- (A) better temporal coherency in actions for more effective online exploration, and
- (B) unbiased n-step return backups for better value learning.
- Clarification on RLPD comparison
We first explain that direct offline pre-training of RLPD is infeasible since the algorithm is not designed for offline RL. Pre-training it with Q-LAC is also not possible due to the policy class mismatch: RLPD uses a Gaussian policy whereas Q-LAC uses a flow policy. To ensure a fair comparison with an offline-to-online method similar to RLPD, we instead compare Q-LAC with ReBRAC, an offline-to-online RL algorithm that uses Gaussian policies and DDPG-style [3] policy extraction (just like RLPD). Notably, ReBRAC outperforms RLPD on the hardest domain (cube-quadruple), showing the benefits of pre-training. Despite this, Q-LAC still outperforms ReBRAC on three harder domains (cube-double/triple/quadruple) and performs similar to ReBRAC on the two easy domains (scene, puzzle).
- Additional Confusions Regarding Empirical Results:
We provide further empirical results and intuitions to explain the effectiveness of Q-LAC, providing insights for why Q-LAC outperforms prior methods as well as why sometimes RLPD performs better.
In the thread below, we will be referring to the new result table as shown below.
| | Q-LAC (ours) | FQL | Q-LAC-BFN (ours) | BFN | Q-LAC-IFQL (ours) | IFQL | FQL-n | BFN-n | RLPD | RLPD-AC | RLPD-AC-BC | ReBRAC | IQL | SUPE-GT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| puzzle-3x3-sparse (5 tasks) | → | → | → | → | → | → | → | → | → | → | → | → | → | → |
| scene-sparse (5 tasks) | → | → | → | → | → | → | → | → | → | → | → | → | → | → |
| cube-double (5 tasks) | → | → | → | → | → | → | → | → | → | → | → | → | → | → |
| cube-triple (5 tasks) | → | → | → | → | → | → | → | → | → | → | → | → | → | → |
| cube-quadruple-100M (5 tasks) | → | → | → | → | → | → | → | → | → | → | → | → | → | → |
| aggregated (25 tasks) | → | → | → | → | → | → | → | → | → | → | → | → | → | → |
Note: please see our response to Reviewer nCbd for more implementation details and descriptions of the baselines and our Q-LAC methods.
Regarding the contribution and novelty
While action chunking might seem simple, our experiments show that it is highly impactful and can be applied to multiple different RL algorithms (see the table above). In addition, naively using action chunking without an expressive policy / behavior constraint does not work well (e.g., RLPD-AC). We believe that such a significant performance gain from action chunking in the offline-to-online RL setting is a valuable contribution to the machine learning and reinforcement learning community, and we have been especially delighted to see how well our method works on top of multiple existing offline-to-online RL algorithms: Q-LAC (FQL), Q-LAC-BFN (BFN).
In my initial review, I raised two points that remain unaddressed.
In our response, we attempted to address these two points by discussing the trade-offs between action chunking policies and step-wise policies (in response to your claim that action chunking policies are a degenerate case of step-wise action policies). We expand the discussion below for better clarity:
While step-wise action policies have better responsiveness and can be more optimal than action chunking policies, they also have limitations. First, a step-wise policy cannot produce non-Markovian behavior, and it exhibits worse temporal coherency in our experiments (see our response to Reviewer joyS, "How does action chunking policy’s actions deviate from non-chunked policy?"), which can hurt online exploration effectiveness. Temporally correlated/non-Markovian noise has been known to improve online exploration [1, 2], and our method provides a convenient way to generate temporally coherent actions while also being informed by offline data as a prior. In addition, multi-step value backups are known to speed up learning (see more discussion in our background section), yet learning step-wise policies with n-step return backups is biased. In contrast, learning action chunking policies with the n-step return backup is equivalent to learning step-wise policies in an extended action space (the entire action chunk as a single action) with a 1-step return, eliminating the bias issue (also see our response to Reviewer joyS ["No formal analysis is provided about unbiased value backup"] for a formal statement of the unbiased n-step backup of Q-LAC); the two backups are contrasted explicitly right after the references below.
[1] "Pink noise is all you need: Colored noise exploration in deep reinforcement learning."
[2] "Continuous control with deep reinforcement learning."
... Without theoretical or intuitive explanation, these experimental results become even more confusing.
We show strong empirical evidence (see the table above) that the inherent limitations of the action chunking policy design are empirically benign considering the benefits that it brings, namely, (A) temporally coherent actions for better online exploration (see our response to Reviewer joyS, "How does action chunking policy’s actions deviate from non-chunked policy?"), and (B) unbiased n-step backups for better value learning (see our response to Reviewer joyS, "No formal analysis is provided about unbiased value backup"). In particular, our experiments provide evidence that directly supports the benefits that Q-LAC brings:
- (A) aligns with our empirical finding that, although Q-LAC, RLPD, and all other baselines start at 0 performance at the beginning of online training on cube-quadruple-100M, Q-LAC successfully solves 4 out of 5 tasks whereas the other baselines fail to do so. This aligns with our claim that Q-LAC produces more temporally coherent actions that are beneficial for online exploration, and thus better online sample efficiency.
- For (B), we can see that both FQL-n and Q-LAC exhibit strong initial learning progress, likely due to the value learning speedup from the n-step backup. However, FQL-n suffers from an online learning stability issue that is possibly due to the biased n-step return backup that it uses. In contrast, Q-LAC uses an unbiased n-step backup and does not suffer from this issue empirically (see our response to Reviewer nCbd, “Could a different choice of n [for FQL-n] match the performance of Q-LAC?” for a more comprehensive analysis of the performance of Q-LAC vs. FQL-n with different action chunk lengths).
Overall, our empirical findings align with the two benefits that we claim Q-LAC exhibits: (A) temporally coherent actions for better online exploration, and (B) unbiased n-step backups for better value learning. We hope this provides better insight into why Q-LAC is more effective than the baselines.
Does this 0-reward policy after offline training essentially equal starting from a randomly initialized policy? If so, since it is already shown that pure online RL + AC performs poorly, why would online RL + AC starting from a 0-reward policy suddenly work well?
We would like to clarify that RLPD uses a Gaussian policy and is different from Q-LAC, which uses a flow policy. RLPD-AC works poorly likely because it is missing both of the key designs that make Q-LAC work: an expressive policy and a behavior constraint. Q-LAC learns well online because it is able to explore in a more temporally coherent manner (please see more details in our response to Reviewer joyS, "How does action chunking policy’s actions deviate from non-chunked policy?"). This allows Q-LAC to find the correct behaviors that lead to success more quickly, explaining its better online sample efficiency. We will make sure to clarify this in the next revision of the paper.
Comparing with an RLPD variant without offline pretraining is unfair
We would first like to clarify that RLPD does not have a pre-training phase by design and cannot be directly used to do offline RL pre-training. If we naively ran RLPD on the offline data without new online data coming in, it would completely fail offline as it is not designed for offline RL pre-training and would produce out-of-distribution actions that harm TD backup. However, we would be happy to run such a baseline if the reviewer believes this is a good comparison to see.
Instead, for this rebuttal, we compared with the offline-to-online baseline in the literature that is closest to RLPD, ReBRAC [1]. ReBRAC uses Gaussian policies and DDPG-style [3] policy extraction just like RLPD, and it has been shown to be one of the most competitive offline/offline-to-online RL baselines among existing methods that use Gaussian policies [1, 2]. In our table above, we show that ReBRAC achieves strong performance on the easier tasks (100% for both puzzle and scene) and outperforms RLPD on the hardest domain, cube-quadruple-100M, achieving a 20% success rate at the end of online training! However, ReBRAC still lags behind Q-LAC and Q-LAC-BFN, which both use expressive flow policies. This provides evidence that the ineffectiveness of approaches that use Gaussian policies is not due to the lack of offline pre-training, but rather due to the lack of expressivity in the policy.
[1] Tarasov, Denis, et al. "Revisiting the minimalist approach to offline reinforcement learning." Advances in Neural Information Processing Systems 36 (2023): 11592-11620.
[2] Park, Seohong, Qiyang Li, and Sergey Levine. "Flow q-learning." arXiv preprint arXiv:2502.02538 (2025).
[3] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971 (2015).
These observations are very confusing to me.
The offline-to-online setting is a somewhat complex problem setting where the results can be hard to interpret. On one hand, the online policies should stay close to the offline pre-trained policies so that they could generate better data for online fine-tuning. On the other hand, staying too close to the possibly sub-optimal offline pre-trained policy could limit online improvements. How to balance this trade-off is still an open question in this problem setting.
Your observation in (a) is an instance of such a trade-off between online RL with prior data (RLPD), which does not rely on an offline pre-trained policy, and other offline-to-online RL methods (FQL, IDQL, our methods) that do start from an offline pre-trained policy. In some cases, it is better not to start from an offline pre-trained policy, as the pre-trained policy may provide a prior that hurts online exploration, especially if the tasks are relatively easy. This is not unique to the environments we consider in our paper, as it has been commonly observed in prior work. To name a few examples,
- Table 1 in the IDQL paper [1] shows that RLPD achieves >=99 success rate on antmaze-umaze-diverse, antmaze-medium-play while IDQL (IDQL-Max) only achieves 85.6 and 97.8 respectively.
- Figure 12 in the FQL paper [2] shows that RLPD peaks earlier than FQL on door-cloned and hammer-cloned, and even outperforms FQL completely on puzzle-4x4.
- Figure 9 in another recent offline-to-online RL paper [4] shows that RLPD performs better than Cal-QL (Cal-QL first does offline pre-training and then online fine-tuning)
One implementation detail that could potentially exacerbate this is the fact that during offline pre-training and online fine-tuning, we use the same behavior cloning coefficient, following the prior work FQL [2]. While there have been many prior efforts to use a different training procedure for online learning initialized from an offline pre-trained policy [3, 4], these are orthogonal to the focus of our paper, as our main goal is to demonstrate the effectiveness of action chunking for improving existing offline-to-online RL methods.
For your observation (b), the scene and puzzle-3x3 environments had misalignment issues between the returns and the success rates (where a higher return could lead to a lower success rate). We have fixed this by re-running all our methods and baselines using a sparse reward function (where the reward is only given upon task success). Please see our response to Reviewer kQoc, “scene task results Q-LAC exhibits the highest variance among baselines, potentially undermined its online performance,” for more details.
Overall, as shown in our new table above, Q-LAC outperforms RLPD on scene, cube-triple and cube-quadruple-100M (3 domains) and is competitive with RLPD on both puzzle-3x3 and cube-double (2 domains) in terms of the performances at the end of the 1M offline training + 1M online training. Q-LAC (along with another variant of our approach Q-LAC-BFN) achieved the highest average success rate of 86% (averaged across 25 tasks), outperforming all RLPD baselines (max. RLPD at 67%) as well as the offline-to-online RL baseline that is the closest to RLPD, ReBRAC (at 50%).
[1] Hansen-Estruch, Philippe, et al. "IDQL: Implicit q-learning as an actor-critic method with diffusion policies." arXiv preprint arXiv:2304.10573 (2023).
[2] Park, Seohong, Qiyang Li, and Sergey Levine. "Flow q-learning." arXiv preprint arXiv:2502.02538 (2025).
[3] Nakamoto, Mitsuhiko, et al. "Cal-QL: Calibrated offline rl pre-training for efficient online fine-tuning." Advances in Neural Information Processing Systems 36 (2023): 62244-62269.
[4] Zhou, Zhiyuan, et al. "Efficient online reinforcement learning fine-tuning need not retain offline data." arXiv preprint arXiv:2412.07762 (2024).
Does action chunking help with offline learning or does it help with online learning?
We do not have direct evidence that action chunking helps with offline RL. From the table, we can see that, aggregated over 25 tasks, in terms of offline RL performance Q-LAC performs almost the same as FQL (38 vs. 37), Q-LAC-BFN performs almost identically to the base algorithm BFN (52 vs. 51), and Q-LAC-IFQL performs similarly to the base algorithm IFQL (39 vs. 36). The main focus of our paper is the offline-to-online setting, where we evaluate all the methods at the end of the 1M offline + 1M online training steps, and we see that all of our methods improve upon the base algorithms significantly online: Q-LAC vs. FQL (86 vs. 58), Q-LAC-BFN vs. BFN (86 vs. 63), and Q-LAC-IFQL vs. IFQL (59 vs. 47). This provides another piece of evidence to support our claim that Q-LAC explores more effectively online (with higher temporal coherency in actions; please see more details in our response to Reviewer joyS, "How does action chunking policy’s actions deviate from non-chunked policy?"). While it is possible that action chunking can help with offline RL on harder tasks beyond the benchmark that we consider, we leave that to future work.
I wonder if starting from the Q-LAC's offline policy, but following the RLPD online learning process, what would happen?
This would be a really nice comparison to run, but unfortunately it is not possible because RLPD uses Gaussian policies and our method uses flow policies. It is not a trivial task to convert the flow policies into Gaussian policies while preserving their effectiveness. Along this line, we additionally compared our methods with the closest offline-to-online RL baseline we could find in prior work that shares many design choices with RLPD: ReBRAC (Gaussian policy, DDPG-style policy extraction). We found that even with offline pre-training, ReBRAC fails to outperform Q-LAC and only achieves a final aggregated success rate of 50% (vs. Q-LAC/Q-LAC-BFN at 86%).
Please let us know if you have any other concerns or questions. If we have successfully addressed all your concerns, could you kindly raise your rating? If there are specific experiments/comparisons that you would like to see, please let us know!
Thanks for the authors’ response.
I acknowledge that the paper presents extensive experimental evaluations. I would like to raise my score from 2 to 3, but I still believe the work falls below the NeurIPS acceptance bar. My main concerns are as follows:
- The primary claimed contribution is introducing action chunking to the RL community and validating it through extensive experiments. In my view, this contribution is rather limited in novelty, since action chunking is not a new concept and is closely related to macro-actions, skills, and options in RL.
- The authors’ attempt to apply chunking to online RL was not successful, leading them to turn to the offline-to-online RL setting. I still think this seems more like a necessary compromise rather than a deliberate design choice.
- The authors emphasize additional designs, such as using expressive policies with behavior constraints. However, this is not novel either, as it has already been proposed in FQL, and thus cannot be counted as an original contribution. In essence, Q-LAC is equivalent to FQL + chunking.
- In the rebuttal, the authors noted that chunking can also improve other algorithms, such as BFN and IFQL. However, both are based on flow policies, suggesting that the observed improvement may still stem from the synergy with flow policy rather than chunking itself.
In summary, the work explores the application of action chunking to several RL algorithms, identifying cases where it improves performance and others where it does not, ultimately combining multiple existing strategies to achieve a cumulative enhancement. While the exploration is good, I don't see many novel or original contributions (e.g., action chunking, flow policy, and behavior constraint are all proposed by prior works), therefore, I give my score to be 3.
Thanks for your response and raising the score! We are glad that our previous response was able to address most of your concerns! For the remaining concerns, we address below:
Novelty concern
While (1) expressive policies and (2) behavior constraints are not novel by themselves, applying them to action chunking in the context of offline-to-online RL is new and drives the performance of multiple existing offline-to-online RL methods up significantly (Q-LAC vs. FQL [86 vs. 58], Q-LAC-BFN vs. BFN [86 vs. 63], and Q-LAC-IFQL vs. IFQL [59 vs. 47]). The fact that we have identified these two general design principles (expressive policies and behavior constraints) that make action chunking effective in offline-to-online RL is a valuable insight for the community, with the potential to generalize to offline-to-online RL methods beyond the three we tested on (FQL, BFN and IFQL). This aligns with what Reviewer nCbd noted with a positive stance on the paper: "the fact that action chunking didn't work with RLPD is enough for the current submission to be interesting".
The authors’ attempt to apply chunking to online RL was not successful, leading them to turn to the offline-to-online RL setting.
We believe that the observation that naively applying action chunking to online RL does not work is a valuable insight for the community (as Reviewer nCbd also found this interesting). Our work also shows that action chunking does not directly apply to existing purely online RL methods because they do not use (1) expressive policies and (2) behavior constraints.
More importantly, our work focuses on the offline-to-online RL setting (rather than proposing a general RL algorithm). Our main goal is to demonstrate the effectiveness of action chunking for improving existing offline-to-online RL methods. Our choice to specialize in offline-to-online RL is not unique in the literature; there are many existing methods that are also specifically designed for this setting [1, 2]. While we focus on the offline-to-online RL problem setting, we believe that the key design principles we identified will be valuable insights for future work on developing effective offline-to-online, and potentially even purely online, RL algorithms. We believe our work is a valuable step in the direction of tractably learning non-Markovian policies (in the form of action chunking policies) in RL for effective online exploration and improved sample efficiency.
[1] Nakamoto, Mitsuhiko, et al. "Cal-QL: Calibrated offline rl pre-training for efficient online fine-tuning." Advances in Neural Information Processing Systems 36 (2023): 62244-62269.
[2] Zhou, Zhiyuan, et al. "Efficient online reinforcement learning fine-tuning need not retain offline data." arXiv preprint arXiv:2412.07762 (2024).
Thank you again for your insightful feedback and detailed review! Hope this response addresses your remaining concerns!
This paper proposes Q-LAC, an offline-to-online RL algorithm that employs action chunking to accelerate temporal-difference value updates and promote temporally coherent exploration. Action chunking is a common technique in the field of embodied intelligence, and it is highly valuable to study its role in the domain of reinforcement learning. Empirical evaluations across a suite of challenging long-horizon, sparse-reward manipulation tasks demonstrate that Q-LAC outperforms prior approaches in both the offline and online RL phases, particularly excelling in sample efficiency during fine-tuning.
Strengths and Weaknesses
Strengths:
- The motivation behind this article is excellent. It is highly meaningful to investigate the role of action chunking, which is commonly used in embodied intelligence algorithms, within the realm of reinforcement learning.
- The proposed algorithm is concise and effective, with strong versatility.
Weaknesses: The overall discussion and analysis in this article are not sufficiently deep, and there are many points that could be expanded upon.
- There is a lack of theoretical analysis regarding the impact of action chunking on online RL. Additionally, while the speedup and unbiased value backup claims are explained in Section 4.3, no formal analysis or bounds are provided.
- When the action chunking size is 5, how much deviation is there between the five actions obtained by the chunked policy $\pi(a_t, \dots, a_{t+4} \mid s_t)$ and the five actions obtained by querying a 1-step policy $\pi(a_{t'} \mid s_{t'})$ at each step? How does this deviation affect the final performance and the two claimed benefits?
- The article mentions that the learning of value is accelerated, but there is no quantified indicator or experiment to measure the extent of this acceleration.
- The ablation experiment on action chunking size is incomplete. Intuitively, as the size increases, the subsequent actions will become less accurate. Therefore, finding this critical point is crucial.
- The article lacks analysis on whether action chunking will bring additional memory or time overhead.
Questions
Please see weaknesses.
Limitations
The topic of the article is highly valuable, yet many aspects could be explored more profoundly. Should the authors address my concerns, I would be more than willing to consider a higher rating.
Final Justification
Considering that most of the concerns I raised have been addressed, I have accordingly increased my score from 3 to 4.
Formatting Issues
No
Thanks for your detailed review and insightful comments. We addressed your concern about the lack of analysis and discussion by adding (1) a theoretical result showing that Q-LAC performs unbiased n-step return backups, (2) an empirical analysis showing that Q-LAC’s policy exhibits better temporal coherency than the non-chunked baseline, highlighting the effectiveness of action chunking for online exploration, (3) an empirical analysis quantifying the value speedup effects of n-step returns, (4) a more complete ablation study on the effect of action chunk size, where we show when an overly large action chunk size starts to hurt performance, and (5) a detailed analysis of the CPU/GPU memory overhead and computational cost overhead of our action chunking method Q-LAC.
“No formal analysis is provided about unbiased value backup”
To address your concern, we have included a theorem with proof below that shows the value backup for our chunked Q-function is unbiased.
[Theorem: Q-LAC performs unbiased n-step backup]
Let $(s_t, a_t, s_{t+1}, a_{t+1}, \dots, s_{t+n})$ be a trajectory segment generated by following a data collection policy $\mu$ and $r_t, r_{t+1}, \dots, r_{t+n-1}$ be the rewards received at each corresponding time step. Let $V^\pi(s)$ be the value for a policy $\pi$ starting from state $s$. Let $Q^\pi(s_t, a_t, a_{t+1}, \dots, a_{t+n-1})$ be the Q-value for a policy $\pi$ starting from state $s_t$ and executing the action sequence $a_t, a_{t+1}, \dots, a_{t+n-1}$.
Under any policy $\pi$ (possibly different from $\mu$), the $n$-step return estimate $\sum_{t'=t}^{t+n-1} \gamma^{t'-t} r_{t'} + \gamma^n \hat{V}(s_{t+n})$ is an unbiased estimate of $Q^\pi(s_t, a_t, \dots, a_{t+n-1})$ as long as $\hat{V}(s_{t+n})$ is an unbiased estimate of $V^\pi(s_{t+n})$.
[Proof]
From the definition of $Q^\pi$, we can write it as an expectation over the environment transitions $T$:

$$Q^\pi(s_t, a_t, a_{t+1}, \cdots, a_{t+n-1}) = \mathbb{E}_{s_{t'+1} \sim T(\cdot \mid s_{t'}, a_{t'})}\left[\sum_{t'=t}^{t+n-1}\gamma^{t'-t}r(s_{t'}, a_{t'}) + \gamma^n V^\pi(s_{t+n})\right]$$

$$= \mathbb{E}_{s_{t+n}, r_{t+1}, r_{t+2}, \cdots, r_{t+n-1}}\left[\sum_{t'=t}^{t+n-1}\gamma^{t'-t}r_{t'} + \gamma^n \hat{V}(s_{t+n})\right].$$

The second line uses the fact that the rewards $r_{t'} = r(s_{t'}, a_{t'})$ are exactly those recorded in the trajectory segment, and that $\hat{V}(s_{t+n})$ is an unbiased estimate of $V^\pi(s_{t+n})$. The right-hand side is the expectation of the $n$-step return estimate, which proves the claim.
The implication of this theorem is that if we run the n-step return backup for the chunked Q-function (with a value estimate $\hat{V}$) to convergence, it converges to the ground-truth $Q^\pi$-value under the same policy $\pi$. In contrast, the n-step return backup on the original action space is known to not converge to the correct $Q^\pi$-value [1, 2].
This might be unsurprising because we are effectively treating the action chunk as the new action space and running RL in this transformed action space. Logically, if an RL algorithm is guaranteed to converge to the optimal 1-step policy in the original action space, running the same algorithm on this transformed action space will also converge to the optimal action chunking policy. While it is true that the optimal action chunking policy cannot represent the optimal 1-step policy in some cases, in our experiments we have found action chunking policies to be sufficient (see the new results in our response to Reviewer nCbd).
[1] "Revisiting Peng’s Q() for Modern Reinforcement Learning."
[2] "Revisiting fundamentals of experience replay."
How does action chunking policy’s actions deviate from non-chunked policy?
To address your question, we analyze the temporal coherency of the actions generated by the Q-LAC policy and the FQL policy at the beginning of online fine-tuning on cube-triple-task2 (4 seeds). In particular, we look at the distance between the end-effector positions five time steps apart (e.g., $\lVert p_{t+5} - p_t \rVert$, where $p_t$ is the end-effector position at step $t$) for our Q-LAC policy compared to the FQL policy. A higher distance suggests higher temporal coherency in the actions taken by the policy. As shown in the table below, Q-LAC exhibits higher temporal coherency, allowing it to explore the environment more effectively and achieve better online sample efficiency (see our new results in our response to Reviewer nCbd).
| Online Environment Steps | Q-LAC | FQL |
|---|---|---|
| Step 5000 | | |
| Step 10000 | | |
| Step 15000 | | |
| Step 20000 | | |
| Step 25000 | | |
| Step 30000 | | |
| Step 35000 | | |
| Step 40000 | | |
| Step 45000 | | |
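For reference, such a coherency metric can be computed from logged end-effector positions along the lines of the following sketch (hypothetical array names; not our exact evaluation script):

```python
import numpy as np

def mean_displacement(ee_positions: np.ndarray, gap: int = 5) -> float:
    """Average distance between end-effector positions `gap` steps apart.

    ee_positions: (T, 3) array of end-effector xyz positions along one rollout.
    Larger values indicate more temporally coherent (less jittery) motion.
    """
    diffs = ee_positions[gap:] - ee_positions[:-gap]          # (T - gap, 3)
    return float(np.linalg.norm(diffs, axis=-1).mean())
```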
“The ablation experiment on action chunking size is incomplete.”
We ran a complete ablation study on more tasks with longer chunk sizes. In our original submission, we only ran on cube-triple-task4. Here we show an aggregation of all five cube-triple tasks to give a more complete picture. As shown in the figure, the final performance saturates at around . When the action chunk size is too large (), learning fails. This is to be expected, as predicting longer chunks of actions is harder, and the TD backup with an n-step return for a large $n$ reduces the stitching capability of Q-learning (e.g., the extreme case is Monte Carlo regression, where the algorithm can no longer do any stitching).
| cube-triple (5 tasks) | (our main results) | | | | | |
|---|---|---|---|---|---|---|
| Q-LAC | → | → | → | → | → | → |
“The article lacks analysis on whether action chunking will bring additional memory or time overhead.”
To address your concern, we analyzed the time and memory overhead that action chunking brings and found that it incurs almost no overhead in CPU and GPU memory, and a modest but manageable overhead in computational cost, especially for online learning, where environment steps remain the main bottleneck.
Memory
Action chunking brings little memory overhead in both CPU and GPU memory. For CPU memory, our replay buffer stores transitions normally, and we sample chunks of observations and actions when updating our agent. For GPU memory, action chunks make the inputs to the policy and the critic larger, which makes the first layer of the network (and the last layer of the flow policy network) larger. We include the parameter counts below:
- Q-LAC: ~5M parameters
- FQL: ~4.9M parameters
- RLPD: ~17.2M parameters
Time
In terms of time, we report the time required for each step during the offline phase and the online phase below (in milliseconds). Compared to FQL, Q-LAC is ~40% more expensive offline and only ~10% more expensive in the online phase, as environment steps take most of the time during online training.
| Methods | Q-LAC | FQL | RLPD |
|---|---|---|---|
| online time (ms/step) | | | |
| Methods | Q-LAC | FQL |
|---|---|---|
| offline time (ms/step) | | |
“The article mentions that the learning of value is accelerated, but there is no quantified indicator or experiment to measure the extent of this acceleration.”
To address your question, we analyzed the convergence of the average Q-value of the state-action pairs in the dataset over the course of offline training for Q-LAC, FQL-n and FQL on one of our hardest tasks, cube-quadruple-100M-task2 (averaged over 4 seeds). To gain insight into the convergence rate of the Q-value, we compute the difference in Q-value from the previous time step and highlight the difference if it is greater than 10. Please refer to our response to Reviewer kQoc for these results.
Both Q-LAC and FQL-n (FQL with n-step returns) have large Q-value differences in the first 100K to 150K steps and then have small Q-value differences for the rest of offline training. In contrast, FQL has large Q-value differences for a longer time, up to almost halfway through offline training (~400K steps). This suggests that both Q-LAC and FQL-n enjoy faster convergence of the Q-value compared to FQL, providing evidence that the n-step backup does accelerate value learning. It is worth highlighting that while this quantifies the value learning speed, it does not quantify how well the value is learned. In fact, in our experiments, even though FQL-n exhibits faster value learning, it suffers from an online collapse problem (as shown here), possibly due to the bias in its n-step backup. In contrast, Q-LAC does not suffer from this issue, possibly due to its unbiased n-step backup (see more discussion in Section 4.3, Benefit #1).
Thank you again for your constructive feedback and detailed reviews. Please let us know if you have any other concerns or questions. If we have successfully addressed all your concerns, could you kindly raise your rating?
I sincerely thank the authors for their thorough rebuttal and the extensive additional experiments—producing so much new evidence in such a short time is no small feat.
My earlier concerns have been largely resolved; please incorporate these new analyses and results into the final version, as they will substantially strengthen both the theoretical and empirical contributions.
Consistent with my prior commitment, I am raising my score to 4.
Thank you for raising the score, and we appreciate again for your detailed review and insightful comments, which helped strengthen our paper! We will incorporate the additional analyses and results into the final revision!
This paper proposes a method integrating action chunking within an Actor-Critic framework. The approach demonstrates improvements in value backup efficiency for accelerated learning while generating temporally coherent actions to enable effective online exploration. Experimental results provide evidence for the method's sample efficiency compared to baselines, notably in offline-to-online RL settings.
Strengths and Weaknesses
Pros:
- The paper is well-structured and effectively presents the main idea.
- Integrating action chunking into both the critic learning and policy optimization processes demonstrates its effectiveness in improving sample efficiency, especially in the online learning phase.
Cons:
-
The core contribution appears largely incremental, primarily combining action chunking with FQL. Rigorous ablation studies are needed to isolate the individual efficacy of each design component. (e.g., Does action chunking generalize as a performance enhancer when integrated with other offline-to-online RL algorithms?)
-
The computational demands appear significant, yet evaluation was restricted to limited tasks per OGbench environment. Comprehensive performance metrics (e.g., aggregated scores across all tasks) are absent, precluding a robust assessment of Q-LAC's overall effectiveness.
-
It seems like in offline "scene" task results Q-LAC exhibits the highest variance among baselines, potentially undermined its online performance, yet this critical limitation lacks sufficient analysis regarding its cause and impact.
- The manuscript lacks substantive discussion or illustration explaining why action chunking is particularly well-suited to non-Markovian settings or why it adequately addresses potential side effects.
- Formal theoretical analysis is absent. Such analysis would significantly strengthen claims regarding the method's effectiveness and robustness.
Questions
- Why does the action chunking formulation yield limited performance improvements in strongly Markovian or fully Markovian settings?
- Equation 8 claims that action chunking avoids bias in value estimation. What is the associated impact on estimation variance, and how does this affect the stability and convergence of the learning process?
- How does Q-LAC scale to environments with high-dimensional action spaces? Does effective Bellman updating with chunked high-dimensional inputs necessitate more complex Q-network architectures (e.g., deeper networks, attention mechanisms), and what computational trade-offs arise?
Limitations
Yes, the authors have included a limitation discussion in the appendix.
Final Justification
My final recommended score is 4.
The paper's primary contribution is introducing an effective action chunking mechanism to RL, thoroughly demonstrating its high efficacy in boosting sample efficiency and ease of deployment. One limitation is that the idea of action chunking has been explored in prior work, which diminishes its novelty.
Formatting Issues
There are no major formatting issues in this paper.
Thanks for the insightful feedback and detailed review. We discuss the novelty of our contribution below. Our main contribution is to introduce effective action chunking into RL, and while this modification might seem simple, our experiments show that it is highly impactful. Furthermore, during the rebuttal, we additionally show that the same novel contribution can be applied to multiple different RL algorithms, further supporting our claim that introducing action chunking into RL leads to improved performance. Please see our response to nCbd for the summary table of our new results.
“The core contribution appears largely incremental”
Our main contribution is to introduce effective action chunking into RL, and while this modification might seem simple, our experiments show that it is highly impactful. For example, FQL fails to solve all but one task in cube-triple and cube-quadruple domains. In contrast, Q-LAC solves 4 out of 5 tasks on both domains. In addition, getting action chunking to work in RL is not trivial. Furthermore, during the rebuttal, we additionally show that the same novel contribution can be applied to multiple different RL algorithms (Q-LAC [FQL], Q-LAC-BFN [BFN], Q-LAC-IFQL [IFQL]), further supporting our claim that introducing action chunking into RL leads to improved performance.
“Does action chunking generalize as a performance enhancer when integrated with other offline-to-online RL algorithms?”
Yes, it does! For this rebuttal, we include two variants of Q-LAC: Q-LAC-BFN, which applies the same principles to a new offline-to-online RL algorithm (BFN), and Q-LAC-IFQL, which applies the same principles to the offline-to-online RL baseline IFQL (as considered in the FQL paper). Action chunking consistently improves over its non-chunking counterpart (e.g., Q-LAC vs. FQL, Q-LAC-BFN vs. BFN, Q-LAC-IFQL vs. IFQL), supporting our claim that introducing action chunking into RL leads to improved performance. Please see the summary of our new results in our response to Reviewer nCbd.
“Comprehensive performance metrics (e.g., aggregated scores across all tasks) are absent”
Our summary table (see our response to Reviewer nCbd) now contains the aggregated scores across all tasks.
“scene task results Q-LAC exhibits the highest variance among baselines, potentially undermined its online performance.”
We have fixed this in our newest experiments, where Q-LAC reaches a 100% success rate on all scene and puzzle-3x3 tasks online. In general, we have found that the original OGBench scene and puzzle-3x3 domains have high variance because the success rate is misaligned with the reward function (a shaped reward that provides some reward for each sub-task finished). In some cases, succeeding in a task requires the agent to undo some of the sub-tasks. To eliminate any confounding factor from poor reward shaping, in all our new experiments we use a sparse reward function (-1/0) for both scene and puzzle-3x3 domains, where a reward of -1 is given as long as the overall task has not been completed (regardless of sub-task completion). With this modification, all methods exhibit much smaller variance. In addition, we now test on all five tasks rather than only two tasks for each domain.
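For concreteness, the sparse-reward modification can be implemented as a thin environment wrapper along the following lines. This is a minimal sketch rather than our exact code, assuming a gymnasium-style environment interface; in particular, the `task_success` info key is a hypothetical name standing in for the benchmark's task-completion flag.

```python
import gymnasium as gym


class SparseTaskReward(gym.Wrapper):
    """Replace a shaped, sub-task-based reward with a sparse -1/0 reward.

    The agent receives -1 at every step until the overall task is complete,
    regardless of how many sub-tasks have been finished.
    """

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # `task_success` is a placeholder for the env's success flag.
        reward = 0.0 if info.get("task_success", False) else -1.0
        return obs, reward, terminated, truncated, info
```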
“The manuscript lacks substantive discussion or illustration explaining why action chunking is particularly well-suited to non-Markovian settings.”
To address your concern, we analyze the temporal coherency of the actions generated by Q-LAC and FQL. In particular, we look at the distance between end-effector positions five time steps apart under the Q-LAC policy compared to the FQL policy; a larger distance indicates higher temporal coherency in the executed actions. Please see our response to joyS ("How does action chunking policy’s actions deviate from non-chunked policy?") for the result table. As shown in the table, Q-LAC exhibits higher temporal coherency, allowing it to explore the environment better and achieve better online sample efficiency.
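As a concrete illustration, the coherency statistic can be computed from logged end-effector positions roughly as follows (a sketch; the 5-step gap comes from the analysis above, while the array name and shapes are illustrative):

```python
import numpy as np


def temporal_coherency(ee_positions: np.ndarray, gap: int = 5) -> float:
    """Mean Euclidean distance between end-effector positions `gap` steps apart.

    ee_positions: array of shape (T, 3), the end-effector position at each
    environment step of a rollout. A larger value indicates that the policy
    covers more distance per unit time, i.e., its actions are more
    temporally coherent.
    """
    deltas = ee_positions[gap:] - ee_positions[:-gap]
    return float(np.linalg.norm(deltas, axis=-1).mean())
```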
“Formal theoretical analysis is absent. Such analysis would significantly strengthen claims regarding the method's effectiveness and robustness.”
For theoretical rigor, we added a theorem that shows our method performs unbiased n-step return backup. Please refer to our response to Reviewer joyS for more details.
“Why does the action chunking formulation yield limited performance improvements in strongly Markovian or fully Markovian situation?”
We are unsure how to best interpret this question, as it is unclear to us which limited performance improvements you are referring to. Here is our interpretation: "Why does action chunking yield limited performance improvements on scene and puzzle-3x3?" In our new experiments (see our response to nCbd), we found that most baselines already perform quite well on both scene and puzzle-3x3, at around a 100% success rate for most of the tasks. We believe the limited performance improvements are due to the fact that these two domains are already saturated by existing algorithms. If we have interpreted the question incorrectly, could you please clarify?
“Equation 8 claims that action chunking avoids bias in value estimation. What is the associated impact on estimation variance, and how does this affect the stability and convergence of the learning process?”
To address your question, we analyzed the convergence of the average Q-value of the state-action pairs in the dataset over the course of offline training for Q-LAC, FQL-n, and FQL on one of our hardest tasks, cube-quadruple-100M-task2 (averaged over 4 seeds). We compute the difference in Q-value from the previous logged step and highlight the difference in red if it is greater than 10.
[Table: change in the average Q-value of dataset state-action pairs from the previous logged step, reported every 50K steps from step 5,000 to step 805,000 for Q-LAC, FQL-n, and FQL; differences greater than 10 are highlighted in red. Numeric entries are not reproduced here.]
Both Q-LAC and FQL-n (FQL with n-step returns) show large Q-value differences in the first 100K to 150K steps and small differences for the rest of offline training. In contrast, FQL shows large Q-value differences for much longer, up to almost halfway through offline training (~400K steps). This suggests that both Q-LAC and FQL-n enjoy faster convergence of the Q-value compared to FQL, providing evidence that the n-step backup does accelerate value learning. It is worth highlighting that while this quantifies the value learning speed, it does not quantify how well the value is learned. In fact, in our experiments, even though FQL-n exhibits faster value learning, it suffers from an online collapse problem (as shown here), possibly due to the bias in its n-step backup. In contrast, Q-LAC does not suffer from this issue, possibly due to its unbiased n-step backup (see more discussion in Section 4.3, Benefit #1).
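For clarity, the diagnostic in the table above boils down to the following computation over the logged average Q-values (a sketch; `q_log` is a hypothetical array holding the dataset-averaged Q-value recorded at each logging interval):

```python
import numpy as np


def flag_large_q_changes(q_log: np.ndarray, threshold: float = 10.0):
    """Compute per-interval changes in the dataset-averaged Q-value and mark
    the intervals whose absolute change exceeds `threshold`.

    q_log: 1D array of the average Q-value over dataset state-action pairs,
    recorded at regular intervals during offline training.
    """
    deltas = np.diff(q_log)
    return deltas, np.abs(deltas) > threshold
```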
“How does Q-LAC scale to environments with high-dimensional action spaces? Does effective Bellman updating with chunked high-dimensional inputs necessitate more complex Q-network architectures, and what computational trade-offs arise?”
While it is possible that the chunked action space requires a more complex Q-network, in our experiments we find that using the same architecture as non-chunked FQL (4 hidden layers with 512 units each) is sufficient to solve all of our tasks. To provide more insight into the scalability of our method, we analyzed the time and memory overhead that action chunking introduces and found that it incurs almost no overhead in CPU and GPU memory and some, but manageable, computational overhead, especially for online learning, where environment steps are still the main bottleneck. This suggests that Q-LAC can be easily applied to existing RL algorithms without a large overhead in memory or time, making it scalable to more complex problems. Please refer to our response to Reviewer joyS for more details on the memory and time overhead of action chunking.
Thank you again for your constructive feedback and detailed reviews. Please let us know if you have any other concerns or questions. If we have successfully addressed all your concerns, could you kindly raise your rating?
I thank the authors for their clarifications, which have addressed some of my initial concerns. However, I still have the following concerns:
- Why does the online learning approach sometimes degrade performance relative to offline training, as observed in the scenario case illustrated in Figure 4?
- How do we ensure the reliability of sampled multi-step action sequences? While previous experiments demonstrate Q-LAC's unbiased Q-value learning, the reliability of sampled action chunks remains unclear.
- More fundamentally: How does Q-LAC maintain stable learning when performing TD backups using multiple actions sampled from the current policy? Clarifying this mechanic would further enhance the key aspect of the proposed method.
Overall, I still think the novelty is somewhat limited. Therefore, I will maintain my current score.
Thanks for the follow-up questions! On the novelty front, we would like to highlight that while action chunking might seem simple, our experiments show that it is highly impactful and can be applied to multiple different RL algorithms (see the table below). In addition, naively using action chunking without an expressive policy and a behavior constraint does not work well (e.g., RLPD-AC).
1. Why does the online learning approach sometimes degrade performance?
In Figure 4, FQL-n drops in performance during online fine-tuning. We believe this might be due to the biased value backup that FQL-n uses. In contrast, Q-LAC is unbiased and does not suffer from this issue. We also conducted additional experiments showing that this trend holds across multiple action chunk sizes. See our response to Reviewer nCbd ("Could a different choice of n [for FQL-n] match the performance of Q-LAC?") for more details on these experiments.
2. How do we ensure the reliability of sampled multi-step action sequences?
This is a good question! We would like to first clarify that all evaluations in our experiments sample action chunks from our policy and execute the actions one by one in the environment. The fact that Q-LAC's performance is higher than the non-chunked baselines shows that the sampled action chunks are reliable for solving the tasks we consider. In addition, the key design decision that makes this work is imposing a behavior constraint using an expressive flow policy. This ensures that our policy outputs action chunks similar to those in the offline prior data, yielding higher temporal coherency in actions compared to non-chunked methods. Consequently, this leads to better exploration online and better sample efficiency. Please see our response to Reviewer joyS ["How does action chunking policy’s actions deviate from non-chunked policy?"] for more details on our empirical observations on the temporal coherency of Q-LAC (chunked) vs. the baseline FQL (non-chunked).
[Table: success rates (%) reported as offline → online for Q-LAC (ours), FQL, Q-LAC-BFN (ours), BFN, Q-LAC-IFQL (ours), IFQL, FQL-n, BFN-n, RLPD, RLPD-AC, RLPD-AC-BC, ReBRAC, IQL, and SUPE-GT on puzzle-3x3-sparse, scene-sparse, cube-double, cube-triple, and cube-quadruple-100M (5 tasks each), plus the aggregate over all 25 tasks. Numeric entries are not reproduced here.]
3. How does Q-LAC maintain stable learning when performing TD backups using multiple actions sampled from the current policy?
This is a good question! The key design that allows Q-LAC to maintain stable learning when performing TD backups is the same as in our answer to the previous question: imposing a behavior constraint using an expressive flow policy. The behavior constraint keeps Q-LAC close to the behavior distribution, so the Q-value-maximizing actor loss cannot erroneously 'exploit' the Q-function by simply producing out-of-distribution actions (that the Q-function has not been trained on). We will clarify this in the next revision of the paper.
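To make this mechanism concrete, below is a simplified PyTorch-style sketch of the chunked actor objective. It is illustrative rather than our exact implementation: the distillation term mirrors the FQL-style behavior constraint described in the paper, and all names, shapes, and the coefficient `alpha` are placeholders.

```python
import torch


def chunked_actor_loss(actor, flow_bc_policy, critic, obs, noise, alpha=1.0):
    """Illustrative actor objective over h-step action chunks.

    actor(obs, noise)          -> proposed action chunk, shape (B, h, A)
    flow_bc_policy(obs, noise) -> behavior-cloned chunk from an expressive
                                  flow policy fit to the offline prior data
    critic(obs, chunk)         -> Q-value of the state and the h-step chunk
    noise                      -> Gaussian noise of shape (B, h, A)
    """
    chunk = actor(obs, noise)
    q_term = -critic(obs, chunk).mean()                 # maximize chunked Q
    bc_chunk = flow_bc_policy(obs, noise).detach()      # behavior reference
    distill_term = ((chunk - bc_chunk) ** 2).mean()     # stay in-distribution
    return q_term + alpha * distill_term
```

The distillation term is what keeps the proposed chunks close to the behavior distribution, so the Q-maximization term cannot exploit the critic with out-of-distribution action sequences.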
Please let us know if you have any other concerns or questions. If we have successfully addressed all your concerns, could you kindly raise your rating?
I appreciate the authors' response, which addresses my concerns. I suggest incorporating the previously discussed content into the final manuscript. Accordingly, I will raise the score from 3 to 4.
Thank you for raising the score, and again for your insightful feedback. We will incorporate the discussed content into the final manuscript!
This paper presents an actor-critic method for offline-to-online reinforcement learning. The proposed method, Q-learning with Action Chunking (Q-LAC), uses action chunks or macro-actions for learning (macro) action-values and acting in the environment.
The authors demonstrate that using action chunks with a relatively more straightforward actor-critic approach (RLPD) can lead to a deterioration in performance. Instead, they suggest building on top of Flow Q-learning (FQL), where action chunks can be constrained to be similar to the behavior observed in the offline data.
Experiments in the offline-to-online setting show that the proposed approach outperforms prior methods (RLPD and FQL) on the more challenging domains from the OG-bench.
Strengths and Weaknesses
Strengths
- The paper presents a relatively simple extension of FQL to incorporate temporally extended actions.
- Further, this extension (Q-LAC) shows promising results on the OG-bench domains considered. In particular, it performs significantly better than relevant baselines (FQL, RLPD) on the more challenging tasks, where a long horizon and sparse reward pose the greatest challenge.
Weaknesses
W1. A weakness of the present submission is in terms of novelty. Q-LAC feels like an incremental contribution on top of FQL. Perhaps more problematically, this connection is not really addressed before sections 4.2 and 5, even after which the exact relationship remains unclear. Could the authors clearly specify the differences between FQL and Q-LAC? Using the terminology of the paper, would it be correct to say Q-LAC = FQL-AC5?
W2. Also related to novelty, the benefits of macro-actions in terms of more temporally coherent exploration and improved credit assignment are generally known (e.g. [1, 2, 3]), limiting the novelty of section 4.3. Lines 35 and 36 claim that chunking has not been widely used in RL; however, it has been employed through macro-actions (as noted in previous references) and for credit assignment [4].
W3. There are some aspects of the empirical analysis that would benefit from clarification.
The authors consider an FQL-n baseline that uses (biased) n-step returns. Given the difference between using n for the backup vs. for action chunking, it would also be helpful to see the performance of FQL-n for other choices of n. Could a different choice of n match the performance of Q-LAC? FQL-5 also starts off quite well on cube-quadruple-100M; is there a reason for the sudden collapse?
Another question I had was regarding RLPD vs FQL. The FQL paper’s results show that FQL outperforms RLPD in the offline-to-online case, whereas the results in this paper (cube-double, cube-triple, robomimic from Figure 4, and aggregated results from Figure 1) show the opposite trend. Could the authors explain why this is the case?
---
On a minor note, it would be helpful to include some more details about FQL in the main paper’s background section or section 4.2. Perhaps sections 4.3, 5.1, and 5.2 could be compressed, and some of those details could be moved to the appendices.
Overall, I appreciate the simplicity of the proposed method and the promising results (strengths), but I think the paper should adjust some claims related to novelty and further discuss related work (W1/2). I also have concerns related to the empirical analysis (W3).
I remain open to increasing the score if the weaknesses and questions are adequately addressed and clarified.
Typos: in the legend of Figure 7, D^comp instead of D^comb.
References
[1] McGovern, Amy, and Richard S. Sutton. "Macro-actions in reinforcement learning: An empirical analysis." (1998)
[2] Durugkar, Ishan P., et al. "Deep reinforcement learning with macro-actions." arXiv preprint arXiv:1606.04615 (2016).
[3] Tan, Aaron Hao, et al. "Deep reinforcement learning for decentralized multi-robot exploration with macro actions." IEEE Robotics and Automation Letters 8.1 (2022)
[4] Ramesh, Aditya, et al. "Sequence compression speeds up credit assignment in reinforcement learning." ICML (2024).
Questions
Included in “Strengths and Weaknesses”
Limitations
Limitations are touched on in the Discussion section, but could be elaborated upon.
Final Justification
Please see my response to the authors:
I would like to thank the authors for their detailed response. I greatly appreciate the time and effort spent to conduct the additional experiments. The response addresses my main concerns and I have increased my rating to "accept".
It's great to see that action chunking can be combined with other offline-to-online approaches. Similarly, Q-LAC performing better than FQL-n for other choices of n also supports the benefit of action chunking/Q-LAC.
The remaining criticism pertains to the novelty of the approach being low (action chunking + FQL). The authors point out that introducing action-chunking to work with these approaches is not straightforward as it doesn't work with RLPD and needs behavioral constraints and an expressive policy. In my opinion, the fact that action chunking didn't work with RLPD is enough for the current submission to be interesting.
Formatting Issues
No concerns
Thanks for your detailed review and insightful comments. We addressed your main concern about (W3) the empirical analysis by conducting a more comprehensive set of experiments (11 baselines and 25 tasks in total) along with the additional ablation studies you requested on FQL-n, and your concerns about (W1-W2) novelty by clarifying the key principles behind our method, Q-LAC, that differentiate it from prior offline-to-online RL methods (e.g., FQL) and from skill-based/macro-action-based methods.
New results! (3 Q-LAC methods + 11 baselines on 25 OGBench tasks)
The new experiments are summarized below (3 seeds for SUPE-GT and RLPD-AC-BC, 4 seeds for everything else). We report the success rate (%) after 1M offline training steps followed by 1M additional steps of online fine-tuning (offline → online). Our methods, Q-LAC and Q-LAC-BFN, outperform the other baselines by a large margin across the board.
For each row, we bold a number if its confidence interval overlaps with the best method’s confidence interval. Due to the character limit, we omit the confidence intervals and per-task performance from the table, but we are happy to provide them during the discussion period.
[Table: success rates (%) reported as offline → online for Q-LAC (ours), FQL, Q-LAC-BFN (ours), BFN, Q-LAC-IFQL (ours), IFQL, FQL-n, BFN-n, RLPD, RLPD-AC, RLPD-AC-BC, ReBRAC, IQL, and SUPE-GT on puzzle-3x3-sparse, scene-sparse, cube-double, cube-triple, and cube-quadruple-100M (5 tasks each), plus the aggregate over all 25 tasks. Numeric entries are not reproduced here.]
Note: for ReBRAC, IQL, and IFQL, we use the same implementation and the same hyperparameters as the official FQL codebase. SUPE-GT is a baseline from our submission; we additionally ran it on all 25 OGBench tasks.
Novelty / What are the differences between FQL and Q-LAC?
The main difference between FQL and Q-LAC lies in the use of chunking in the actor and the critic. While this distinction might seem simple, it results in significant empirical improvements (see our new results above). For example, FQL fails to solve all but one task in the cube-triple and cube-quadruple domains. In contrast, Q-LAC achieves ~53% average success rate for cube-triple and ~77% average success rate for cube-quadruple.
Getting action chunking to work in RL is not trivial. As also shown in the table, naïvely applying action chunking to a standard online RL method (RLPD) does not lead to good performance (RLPD-AC, RLPD-AC-BC): all RLPD variants obtain a low success rate (~10%) on the hardest domain (cube-quadruple). In our paper, we identify two principles that make action chunking work in RL: (1) learn both a chunked policy and a chunked critic and perform TD-learning in the temporally extended action space (Section 4.1); (2) use an expressive policy with a behavior constraint (Section 4.2).
For this rebuttal, we show that these principles generalize beyond Q-LAC (based on FQL). We include two variants of Q-LAC, Q-LAC-BFN and Q-LAC-IFQL, a result of applying the same principles to two new offline-to-online RL baselines (BFN and IFQL, see below for more details on these two base algorithms). As shown in the table above, action chunking consistently improves the non-chunking counterpart (e.g., Q-LAC vs. FQL, Q-LAC-BFN vs. BFN, Q-LAC-IFQL vs. IFQL), supporting our claim that introducing action chunking into RL leads to improved performance.
We now describe the base algorithms of Q-LAC-BFN and Q-LAC-IFQL:
- BFN is built on top of an existing algorithm EMaQ [1] by replacing its behavior policy with a more expressive/modern flow BC policy. We call it best-of-n (BFN) because it works by sampling multiple actions from the BC policy and then picking the action that maximizes the critic value for online exploration and the TD target computation (see below for a detailed algorithm description). This type of implicit policy extraction is also used in IDQL [2] with the difference being that IDQL only uses it for evaluation and online exploration, and does not use it for TD target computation because it decouples value learning and policy learning. Despite it being extremely simple to implement, BFN is on par with FQL in our experiments.
- IFQL is an implementation of IDQL where the diffusion policy is replaced by a flow policy. It uses the IQL expectile loss to train Q-function and value-function. It also uses best-of-n sampling, but only for evaluation and online exploration. We use the implementation from the official FQL codebase.
[1] “EMaQ: Expected-max q-learning operator for simple yet effective offline and online rl."
[2] “IDQL: Implicit q-learning as an actor-critic method with diffusion policies."
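For clarity, best-of-n selection is conceptually as simple as the following sketch (illustrative names; `flow_bc_policy.sample` stands in for drawing candidate actions or action chunks from the BC flow policy):

```python
import torch


@torch.no_grad()
def best_of_n(flow_bc_policy, critic, obs, n=32):
    """Sample n candidates from the BC flow policy for each observation and
    return the candidate with the highest critic value.

    obs: (B, obs_dim). In BFN, the same selection rule is used for online
    exploration and for computing the TD target; in IFQL it is used only for
    evaluation and online exploration.
    """
    batch = obs.shape[0]
    obs_rep = obs.repeat_interleave(n, dim=0)            # (B * n, obs_dim)
    candidates = flow_bc_policy.sample(obs_rep)          # (B * n, ...)
    q = critic(obs_rep, candidates).reshape(batch, n)    # (B, n)
    best = q.argmax(dim=1)                               # (B,)
    candidates = candidates.reshape(batch, n, *candidates.shape[1:])
    return candidates[torch.arange(batch), best]
```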
“Could a different choice of n [for FQL-n] match the performance of Q-LAC?”
This is a great question! The short answer: No.
We conducted new experiments with different choices of n on cube-quadruple-task4 (one of the hardest tasks that we consider), where we report the performance at the end of offline training, online at 200K steps, online at 500K steps, and online at 1M steps, respectively (separated by arrows, i.e., offline → online@200K → online@500K → online@1M).
[Table: success rates on cube-quadruple-100M-task4 for Q-LAC (with chunking) and FQL-n (without chunking) across several choices of n (including the value used in our main results), each reported as offline → online@200K → online@500K → online@1M. Numeric entries are not reproduced here.]
When n = 5, we see a collapse in FQL-5 while Q-LAC continues to learn. This might be due to the biased value backup that FQL-5 uses, whereas Q-LAC is unbiased (see more discussion in Section 4.3, Benefit #1). When we increase the horizon (n = 7, n = 9), Q-LAC becomes even more sample-efficient online while the collapse worsens for FQL-7 and FQL-9. This demonstrates Q-LAC’s better scaling with respect to the horizon length.
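To spell out the distinction, the two TD targets compared above look roughly as follows (a sketch with illustrative names; `h` is the backup horizon / chunk length, `rewards` is the (B, h) array of dataset rewards along the segment, and `gamma` is the discount factor):

```python
import torch


@torch.no_grad()
def fql_n_target(critic_1step, policy_1step, rewards, obs_h, gamma, h):
    """FQL-n-style target: h-step discounted dataset rewards plus a 1-step
    critic bootstrap at s_{t+h}. Biased, because the intermediate dataset
    actions generally do not come from the current policy."""
    disc = gamma ** torch.arange(h, dtype=rewards.dtype, device=rewards.device)
    n_step_return = (rewards * disc).sum(dim=1)
    bootstrap = critic_1step(obs_h, policy_1step(obs_h))
    return n_step_return + (gamma ** h) * bootstrap


@torch.no_grad()
def chunked_target(critic_chunk, policy_chunk, rewards, obs_h, gamma, h):
    """Q-LAC-style target: the same discounted reward sum, but the rewards
    correspond exactly to the action chunk being evaluated, and the bootstrap
    uses a chunked critic on a fresh chunk from the current chunked policy,
    so the backup introduces no off-policy bias at the chunk level."""
    disc = gamma ** torch.arange(h, dtype=rewards.dtype, device=rewards.device)
    n_step_return = (rewards * disc).sum(dim=1)
    bootstrap = critic_chunk(obs_h, policy_chunk(obs_h))
    return n_step_return + (gamma ** h) * bootstrap
```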
Additional references for macro-actions
Thanks for the additional references! Macro-actions are indeed semantically equivalent to action chunks, and we will include this discussion in our paper and adjust our claims accordingly. However, our work leverages such temporally extended actions differently from prior work [1, 2]. These works use a discrete set of macro-actions (action sequences), either heuristically extracted from prior experience or given in advance, among which a policy chooses. In contrast, we use an expressive flow-matching-based policy to directly parameterize a continuous space of action sequences, which is trained and fine-tuned online with RL. Conceptually, the way prior work uses macro-actions is similar to approaches based on skills or options, which we discuss in more detail in Section 2.
[1] "Deep reinforcement learning with macro-actions."
[2] "Deep reinforcement learning for decentralized multi-robot exploration with macro actions."
In the FQL paper, RLPD underperforms FQL. In our submission, FQL underperforms RLPD. Why is that the case?
The FQL paper uses a custom RLPD implementation, which we have found to underperform the official implementation, so we used the official RLPD codebase instead. In addition, FQL uses 2 critics rather than the 10 critics used in the original RLPD codebase. To stay faithful to the original RLPD design, we maintained the use of 10 critics in our runs.
Writing feedback
We have restructured Section 4 to include more details about FQL as well as better motivate the principles behind our method. We will also fix the typo you pointed out in our next revision.
Thank you again for your constructive feedback and detailed reviews. Your feedback has helped us improve our paper a lot. Thank you!! Please let us know if you have any other concerns or questions. If we have successfully addressed all your concerns, could you kindly raise your rating?
I would like to thank the authors for their detailed response. I greatly appreciate the time and effort spent to conduct the additional experiments. The response addresses my main concerns and I have increased my rating to "accept".
It's great to see that action chunking can be combined with other offline-to-online approaches. Similarly, Q-LAC performing better than FQL-n for other choices of n also supports the benefit of action chunking/Q-LAC.
The remaining criticism pertains to the novelty of the approach being low (action chunking + FQL). The authors point out that introducing action-chunking to work with these approaches is not straightforward as it doesn't work with RLPD and needs behavioral constraints and an expressive policy. In my opinion, the fact that action chunking didn't work with RLPD is enough for the current submission to be interesting.
Thank you for raising the score, and again for your detailed review and insightful comments! Your feedback has helped us strengthen our paper, and we will include the additional results in the final revision!
This paper presents a simple yet effective actor-critic RL algorithm called Q-LAC. The key idea is to use temporally extended actions (i.e., action chunks) for learning (macro) action-values and for acting in the environment. The proposed method is implemented on top of FQL and tested on a range of long-horizon, sparse-reward offline-to-online tasks, demonstrating strong performance. I think this is a timely study, especially since action chunking (ACT) has already become a default design choice for many embodied AI models. Developing an effective and simple offline-to-online fine-tuning RL method that is compatible with action chunking can provide new insights for the community.
The paper initially received mixed reviews. The reviewers raised concerns regarding its novelty, its dependence on FQL, and the experimental evaluation. During the rebuttal, the authors did a good job of convincing most reviewers by providing detailed discussion and comprehensive supplementary results. Most reviewers raised their evaluations during the rebuttal phase and voted for acceptance. Although one reviewer still holds a negative opinion, I think the paper has reached the bar for NeurIPS.
I encourage the authors to further polish the paper in their final version based on the provided suggestions, especially,
- Include the new results and ablations during rebuttal.
- Discuss more about its dependency on FQL in the main text. The current version hides too many details in the Appendix.
- I hope the authors can add more discussion of, or conduct explorations on, non-FQL implementations. An FQL-based algorithm may be fine for small-scale tasks, but it can be hard to use on real-world, large-scale embodied AI models/tasks. Although the authors provided BFN and IFQL implementations during the rebuttal, these are still somewhat restrictive (still heavily dependent on a flow BC policy). It would be more meaningful if Q-LAC could work with more flexible backbone algorithms.