Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation
Abstract
Reviews and Discussion
The paper proposes using backward reasoning to generate/predict action chunks, instead of the mainstream approach of next-token prediction in a forward manner, claiming that this enforces a global-to-local structure and better realizes the final goal.
Strengths and Weaknesses
Strengths:
- The assumption that reasoning backward from the keyframe to the current state is intuitively rational: in a forward manner, the errors of an autoregressive model compound, making predictions less and less accurate as the horizon grows.
- The reported results support the claim that the proposed latent-space regularization, backward reasoning, reverse ensembling, and multi-token prediction can all boost performance to some extent.
Weaknesses:
- The main simulation results in Table 1 do not appear to follow a standard benchmark. As is well known, the 18 tasks from PerAct are the standard reference for RLBench, so why do the authors evaluate on 10 self-selected tasks? The authors also mention reducing the number of variations to 1, which seems questionable compared with the standard RLBench benchmark.
- The main results in Table 1 are compared only against ACT and DP, which seems insufficient. How does the method compare with the state-of-the-art OpenVLA, a well-known autoregressive method that reasons in a forward manner?
- The experiment and ablation sections devote considerable space to exploring the spatial distribution, which confuses me. First, the description of the experiments in Section 5.3 (the first two parts) does not make clear what the authors did: is it all about choosing and evaluating on different demonstrations? How are the object distributions controlled when generating the demos? What exactly do interpolation and extrapolation mean? Most importantly, I cannot intuitively relate the spatial properties of the policy to the proposed backward reasoning; the two do not seem to have a causal relationship, and I am convinced by neither Figures 5 and 6 nor the conceptual rationale.
Questions
First, please refer to the weaknesses for the three main concerns. Beyond those, I have some smaller questions:
- I would like to understand the authors' view on predicting only the target keyframe instead of predicting action chunks from the keyframe back to the current state in reverse. The whole paper relies on the claim that keyframe-centered, goal-conditioned backward reasoning keeps the model focused on its short-horizon local goal, instead of gradually losing attention and the goal in a forward manner. In that sense, accuracy would depend heavily on predicting the keyframe given the current state. So what about predicting only the keyframe and repeatedly calling go-to-pose until the robot gets there? What is the advantage of CoA compared to that?
- For inference, how exactly is the reverse ensemble implemented? Does it aggregate n-step predictions and smooth them? How many steps are actually executed per prediction call?
- What is the computational burden of the autoregressive model? Since the model must cover the longest keyframe segment, what is its exact context length?
Limitations
yes
Final Justification
The authors have addressed my major concern about the incomplete simulation results and further explained the task design. I decide to raise my score to borderline accept.
Formatting Issues
I did not find any evident formatting issues.
We appreciate your insightful questions and suggestions. Below, we provide detailed responses to your concerns:
w1 [Why use 60 tasks rather than 18 tasks]
Thank you for the valuable suggestion. We would like to kindly clarify a possible misunderstanding regarding our evaluation protocol. Our full evaluation covers 60 diverse RLBench tasks rather than just the commonly used 18 standard tasks. While the 18 tasks are often used as a benchmark, they are neither the only nor a required standard for evaluation. In fact, several recent works—including Point Cloud Matters (NeurIPS 2024), Conditional Flow Matching (CoRL 2024), and Generative Image as Action Models (CoRL 2024)—also adopt customized subsets of RLBench tasks tailored to their specific settings.
The rationale behind this is that the standard 18 tasks are frequently employed to evaluate hierarchical policies that rely on high-precision 3D inputs and motion planners. Many of these tasks are quite challenging for RGB-only visuomotor policies, resulting in uniformly low success rates and limited discriminative power. For this reason, we excluded some of these tasks, although 7 out of the 18 standard tasks are still included in our larger 60-task set.
Moreover, to directly address your concern, we now provide results on the 18 standard tasks. The results highlight two key points:
- A large performance gap exists between RGB-based imitation learning methods (CNN/ViT-BC, CoA, ACT) and 3D keypoint-based hierarchical models (RVT), which validates the difficulty of these tasks under 2D perception.
- CoA still achieves the best performance among RGB-based imitation learning methods, demonstrating its advantage within this category.
- The results for CNN-BC, ViT-BC, and RVT-2 are directly reported from the paper "RVT-2: Learning Precise Manipulation from Few Demonstrations", while the CoA and ACT results are reproduced by us.
| Task | CNN-BC (Image) | ViT-BC (Image) | RVT-2 (Point cloud, SOTA on this 18 task benchmark, goal pose prediction + motion planner) | CoA (Image) | ACT (Image) |
|---|---|---|---|---|---|
| Close Jar | 0 | 0 | 100.0 ± 0.0 | 0 | 0 |
| Drag Stick | 0 | 0 | 99.0 ± 1.7 | 0 | 0 |
| Insert Peg | 0 | 0 | 40.0 ± 0.0 | 0 | 0 |
| Meat off Grill | 0 | 0 | 99.0 ± 1.7 | 88 | 32 |
| Open Drawer | 4 | 0 | 74.0 ± 11.8 | 88 | 52 |
| Place Cups | 0 | 0 | 38.0 ± 4.5 | 0 | 0 |
| Place Wine | 0 | 0 | 95.0 ± 3.3 | 80 | 56 |
| Push Buttons | 0 | 0 | 100.0 ± 0.0 | 28 | 32 |
| Put in Cupboard | 0 | 0 | 66.0 ± 4.5 | 8 | 0 |
| Put in Drawer | 8 | 0 | 96.0 ± 0.0 | 88 | 60 |
| Put in Safe | 4 | 0 | 96.0 ± 2.8 | 80 | 36 |
| Screw Bulb | 0 | 0 | 88.0 ± 4.9 | 0 | 0 |
| Slide Block | 0 | 0 | 92.0 ± 2.8 | 64 | 36 |
| Sort Shape | 0 | 0 | 35.0 ± 7.1 | 0 | 0 |
| Stack Blocks | 0 | 0 | 80.0 ± 2.8 | 0 | 0 |
| Stack Cups | 0 | 0 | 69.0 ± 5.9 | 0 | 0 |
| Sweep to Dustpan | 0 | 0 | 100.0 ± 0.0 | 92 | 1 |
| Turn Tap | 8 | 16 | 99.0 ± 1.7 | 56 | 36 |
| Average | 1.33 | 0.89 | 81.2 | 37.33 | 18.94 |
w2 [Compare with OpenVLA]
While it would be beneficial to include comparisons to OpenVLA if time and resources permit, it is important to note that OpenVLA targets a different problem setting. It is a large-scale Vision-Language-Action model (~7 billion parameters) pretrained on the extensive Open X-Embodiment dataset, requiring substantial computational resources and data.
In contrast, our work focuses on visuomotor policy learning with relatively small-scale models (under 0.5 billion parameters), without reliance on massive pretraining datasets. As such, a direct comparison to OpenVLA would largely reflect differences in model scale and data availability rather than algorithmic advances.
w3[Understanding of experiments]
Thank you for pointing this out. Our goal in Section 5.3 is to evaluate how well CoA generalizes under spatial shifts in object positions. To this end, we evaluate performance with respect to object position. This evaluation protocol is also used in prior work such as Generative Image as Action Models (CoRL 2024) and DemoGen: Synthetic Demonstration Generation for Data-Efficient Visuomotor Policy Learning (RSS 2025).
It is important to clarify that RLBench primarily varies object positions between training and test scenes, while other attributes such as texture, shape, and color remain fixed. As a result, task success rates serve as a direct indicator of a policy’s spatial generalization capability.
In our setup, we generate 150 object configurations per task. Among these, we select the 100 samples nearest to the center and compute their convex hull. We then randomly sample:
- 50 configurations inside the hull as the training set,
- another 50 configurations within the hull as the interpolation test set, and
- the remaining 50 outside the hull as the extrapolation test set (see the code sketch below).
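For concreteness, a minimal sketch of this split procedure (illustrative code only, assuming 2D table-top positions and using SciPy's Delaunay triangulation for the point-in-hull test):

```python
import numpy as np
from scipy.spatial import Delaunay

def split_configurations(positions, seed=0):
    """Split 150 sampled object positions into train / interpolation /
    extrapolation sets as described above. positions: (150, 2) array."""
    rng = np.random.default_rng(seed)
    center = positions.mean(axis=0)
    # The 100 samples nearest to the center define the convex-hull region.
    order = np.argsort(np.linalg.norm(positions - center, axis=1))
    core, outer = positions[order[:100]], positions[order[100:]]

    # Randomly split the in-hull samples into 50 training and 50 interpolation.
    idx = rng.permutation(len(core))
    train, interpolation = core[idx[:50]], core[idx[50:]]

    # Treat the 50 farthest samples as extrapolation candidates;
    # Delaunay.find_simplex returns -1 for points outside the hull,
    # so we keep only those that truly fall outside it.
    hull = Delaunay(core)
    extrapolation = outer[hull.find_simplex(outer) < 0]
    return train, interpolation, extrapolation
```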
Section 5.3 is dedicated to analyzing how well CoA generalizes across these spatial regimes. Specifically:
- Interpolation refers to test object positions within the convex hull of the training distribution.
- Extrapolation refers to positions outside this region, requiring out-of-distribution generalization.
The core insight is that backward reasoning allows the policy to anchor each local action on the final task goal. This global-to-local structure mitigates compounding errors in long-horizon tasks and enhances environment-aware decision-making, especially under spatial shifts. Figures 5 and 6 provide both quantitative and qualitative evidence to support this claim.
q1[Compare with goal+motion planner]
This is a great question. The strategy of first predicting a goal pose and then using a motion planner to reach it is a well-established approach, commonly adopted in hierarchical modeling. We review such methods in our Related Work section, including: “A Multi-Task Transformer for Robotic Manipulation”, “RVT: Robotic View Transformer for 3D Object Manipulation”, “RVT-2: Learning Precise Manipulation from Few Demonstrations”, and “Coarse-to-Fine Q-Attention with Learned Path Ranking.”
CoA differs from these approaches in three key ways:
- Closed-loop execution: Rather than relying on an open-loop planner, CoA generates action chunks that are continually updated based on real-time observations.
- Context-aware control: Tasks with path constraints (e.g., sweeping dust or articulated object manipulation) often fail under naive go-to-pose strategies.
- 3D dependence: hierarchical goal-plus-planner pipelines typically rely on 3D observations, whereas CoA is an RGB-only policy.
To further investigate this, we conducted a controlled study where we replaced CoA’s predicted trajectory with direct execution of its keyframe actions (i.e., the goal poses) via a motion planner:
- Open Box: Success rate drops from 72% to 0%, as the task requires executing a specific rotational trajectory to lift the hinged lid.
- Sweep Dust: Success rate drops from 92% to 8%, since merely reaching the end pose fails to guarantee coverage along the intended sweeping path.
We note that some hierarchical models perform well on these tasks using motion planners. However, they often benefit from high-fidelity 3D inputs, intensive computation to accurately infer the goal pose, finer-grained trajectory decomposition, and curved trajectory generation through specific motion planners, all of which differ significantly from our integrated and lightweight design.
q2[Reverse temporal ensemble]
Our reverse ensemble is conceptually similar to ACT’s temporal ensemble, but aligns predictions using the keyframe action as anchor, rather than forward time alignment.
During inference, we generate the trajectory autoregressively from the keyframe backward. Each method—including CoA and baselines—executes one action per inference call, enabling closed-loop control.
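For clarity, here is a minimal, illustrative sketch of such keyframe-anchored ensembling (not our exact implementation): rollouts from successive inference calls are aligned by their offset from the keyframe and averaged with exponential weights, analogous to ACT's temporal ensemble.

```python
import numpy as np

def reverse_temporal_ensemble(rollouts, m=0.1):
    """rollouts: list of arrays of shape (T_i, action_dim), ordered oldest
    first, each starting at the keyframe (index 0) and moving back toward
    the robot. Because offsets are counted from the keyframe, all rollouts
    share the same anchor point."""
    max_len = max(len(r) for r in rollouts)
    ensembled = []
    for k in range(max_len):                      # k = steps away from the keyframe
        preds = [r[k] for r in rollouts if len(r) > k]
        w = np.exp(-m * np.arange(len(preds)))    # exponential weights, as in ACT
        ensembled.append(np.average(preds, axis=0, weights=w))
    return np.stack(ensembled)
```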
q3[computation overhead]
The inference cost is proportional to the predicted trajectory length, which depends on the distance between the gripper and the goal. During training, we determine a maximum sequence length T_max (see Implementation Details and Algorithm 2). Most tasks have a sequence length of 50–100 steps; since CoA is a single-task imitation learning method, the exact maximum length is task-specific.
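For intuition, a minimal sketch of the backward rollout loop that produces this variable, task-dependent cost; `policy.predict_keyframe`, `policy.predict_next`, and the `eps` threshold are hypothetical placeholders rather than our actual interfaces:

```python
import numpy as np

def backward_rollout(policy, obs, current_pose, t_max, eps=0.01):
    """Generate a variable-length trajectory backward from the keyframe.
    Generation stops at T_max or once the predicted position comes within
    eps of the current gripper position (dynamic stopping). The first three
    action dimensions are assumed to be the end-effector position."""
    chain = [policy.predict_keyframe(obs)]            # goal-anchored first action
    while len(chain) < t_max:
        nxt = policy.predict_next(obs, chain)         # one more step toward the robot
        chain.append(nxt)
        if np.linalg.norm(nxt[:3] - current_pose[:3]) < eps:
            break                                     # close to current state: stop
    return chain[::-1]                                # execute from current state to goal
```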
Dear Reviewer, I hope this message finds you well. I just wanted to gently follow up on our discussion, in case my response was overlooked.
If anything remains unclear, we’d be more than happy to continue the discussion.
Thanks for the rebuttal. The authors have addressed my main concern about the incomplete simulation results and also provided an insightful analysis of the goal+motion-planner design. I decide to raise my score to borderline accept.
Dear Reviewer,
We sincerely thank you for taking the time to re-evaluate our work after reading the rebuttal. We also appreciate your feedback, and we will update the manuscript accordingly based on the additional clarifications we provided.
This paper introduces Chain-of-Action (CoA), a trajectory autoregressive model for robotic manipulation, which generates action sequences backward from goal keyframes to initial states, aimed at mitigating compounding errors and improving spatial generalization. The method is empirically demonstrated to outperform several baselines on the RLBench benchmark and real-world tasks.
Strengths and Weaknesses
Strengths:
- The core idea of backward trajectory generation is interesting and potentially effective for addressing compounding errors in robotic manipulation tasks.
- The authors demonstrate strong empirical results on extensive simulation benchmarks (60 RLBench tasks) and real-world tasks.
- The approach is well-explained, with clearly presented visualizations.
Weaknesses:
- The integration-based stopping criterion appears effective for pick-and-place tasks but may be challenging for tasks involving frequent or prolonged contacts, such as locomotion or dexterous manipulation.
- Previous works like ACT succeeded with only an L1 loss over actions. It is unclear why CoA specifically requires the temporal consistency loss to function effectively. Can the authors clarify why their method underperforms without this additional constraint, compared to existing methods?
- Although the paper includes some ablation studies, it remains difficult to pinpoint which components of CoA are primarily responsible for its improved performance over baselines. Several elements (multi-token prediction, continuous actions, and the temporal consistency loss) are known techniques. In my view, the truly novel aspects are the backward prediction and the use of variable-length action chunks. Table 2 partially addresses the backward prediction by comparing with a forward model, but it would be helpful to also include an ablation that combines forward prediction with fixed-length action chunks. This would better isolate the contribution of the reverse, variable-length modeling.
- No statistical confidence / error bars. Results are means over 25 trials, but variance or CI is never reported.
Questions
- The abstract claims an "action-level Chain-of-Thought (CoT) process," yet the method primarily predicts flexible-length action chunks without explicit or latent reasoning steps. Could the authors clarify this discrepancy or better justify the labeling of their approach as CoT?
- One of the unique properties of CoA is the flexible execution segment prediction. However, there is no analysis of that part. The average length of the execution segments is not clearly stated. Given that longer segments could introduce larger accumulative errors, it would be helpful to see an analysis of how segment length affects success rates across tasks.
- What kinds of RLBench tasks still fail (see tails of Fig 4)? Are there qualitative patterns (e.g., high-precision alignments, deformable objects)?
Limitations
yes
Final Justification
The idea of generating the chain of thought in reverse is intriguing, and the experimental results suggest it leads to improved performance. While some of the individual components have been explored in prior work, the specific combination presented here is novel and likely to be of interest to the community. That said, since all of the core ideas have been previously studied and the main contribution lies in how they are combined, I am keeping my score at "borderline accept."
Formatting Issues
All is well
We sincerely appreciate your thoughtful questions and valuable suggestions. Below, we provide detailed responses to address your concerns:
w1[Limitation]
We thank the reviewer for the valuable suggestion. Indeed, the stopping criterion of CoA might face issues when generalized to other scenarios, e.g., dexterous manipulation. However, as a proof of concept, CoA focuses on quasi-static tasks and aims to boost spatial generalization capabilities. We demonstrated its performance on 60 RLBench tasks. In future work, we will extend this paradigm by learning the stopping signal.
w2 [Why latent consistency loss]
Overall, our goal is to establish a robust representation for autoregressive modeling in continuous action spaces. While ACT also uses a Transformer decoder, its action predictions are generated in parallel rather than autoregressively. In contrast to LLMs and VLMs, which operate over a fixed, discrete vocabulary, our scenario involves continuous action prediction, where the output cannot be constrained to a predefined token set. As a result, even small drifts in action embeddings can accumulate and amplify over autoregressive steps. To address this, we introduce a latent consistency loss that explicitly enforces the stability of action embeddings across time steps.
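As a rough illustration only (the exact loss is defined in the paper and may differ), one plausible form of such a consistency regularizer penalizes drift between the embeddings of adjacent autoregressive steps:

```python
import torch
import torch.nn.functional as F

def latent_consistency_loss(action_embeddings: torch.Tensor) -> torch.Tensor:
    """Illustrative sketch: discourage drift between embeddings of adjacent
    autoregressive steps. action_embeddings: (batch, T, dim) decoder states.
    The previous-step embedding is detached so it acts as a stable target."""
    prev, curr = action_embeddings[:, :-1], action_embeddings[:, 1:]
    return F.mse_loss(curr, prev.detach())
```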
w3[Ablation of keyframe action + forward action chunk]
wait for the result
w4[statistical confidence]
This is a valuable question as it concerns whether the method truly works or merely reflects randomness. Variance is indeed reported in many prior works. We fully understand its importance and would like to clarify why we do not include it in our results. Our primary goal is to evaluate the generalization ability of the method across a diverse set of tasks. Therefore, we extend the evaluation to cover 60 different tasks and report the average performance. Conducting multiple training runs for all these tasks would be computationally expensive. To address variance, we perform ablation studies on 10 representative tasks, which can similarly help reduce the variance of the average success rate. In summary, this is a trade-off between computational cost and statistical rigor.
q1[The rationale of name]
Thank you for pointing this out. We agree that the classical notion of Chain-of-Thought (CoT) in language models typically refers to multi-step intermediate reasoning expressed in discrete text. However, the connotation of CoT has been evolving with technological advances, and recent works support our naming rationale.
For example, recent works such as CoT-VLA and Robotic Control via Embodied Chain-of-Thought Reasoning have extended the CoT concept beyond text, conditioning action prediction on image goals.
Similarly, another related work from NeurIPS 2022, Chain of Thought Imitation with Procedure Cloning, introduces procedure cloning without relying on intermediate discrete text reasoning.
Therefore, we follow this broader community understanding and apply the term “chain-of-action” to denote multi-step, intermediate action-level reasoning in continuous action spaces, which we believe is a reasonable and natural extension of the CoT paradigm.
q2[Correlation between segment length and success rate]
The average segment length of CoA across the 60 RLBench tasks is approximately 72, with most trajectories consisting of 2 execution segments. The Pearson correlation coefficient between average segment length and task success rate is −0.27 (p ≈ 0.037), indicating a statistically significant negative correlation: tasks with longer execution segments tend to have lower success rates, likely due to accumulated prediction errors.
This matches a similar observation in Figure 5: we analyze task success rates under varying levels of spatial variance, which reflects the spread of object positions across demonstrations. Tasks with higher spatial variance generally require longer execution segments. The success rates of all methods decline as spatial variance increases, which is expected given the increased difficulty and longer required trajectories.
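The correlation itself is computed in the standard way; a small sketch with synthetic stand-in data (the real per-task values are not reproduced here):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic stand-ins for the 60 per-task statistics: average execution-segment
# length and success rate (placeholders, not the actual experimental data).
avg_segment_len = rng.uniform(30, 120, size=60)
success_rate = np.clip(1.0 - 0.004 * avg_segment_len + rng.normal(0, 0.15, 60), 0, 1)

r, p = pearsonr(avg_segment_len, success_rate)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")   # on the real data: r ≈ -0.27, p ≈ 0.037
```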
q3[Failure case]
In summary, we did not observe consistent failure patterns across specific categories of tasks (e.g., high-precision or deformable-object tasks). However, distinct failure modes can often be identified at the individual task level, suggesting that performance limitations are more task-specific than type-specific.
Regarding tasks where CoA shows lower performance compared to ACT, DP, or Octo:
- Octo with a pretrained model and a different action space: CoA underperforms relative to Octo on this task. This is largely because Octo is a fine-tuned, pretrained model and operates in a delta joint action space, which is particularly advantageous for tasks requiring precise rotational control. Therefore, direct comparisons on such tasks are of limited fairness.
- Incidental fluctuations due to training variability: For individual tasks such as Reach Target, Press Switch, and Sweep Dust, the relatively small performance gaps (<12%) can be attributed to non-systematic factors such as optimization variability. That is, re-training the model with the same hyperparameters may yield slightly different results—sometimes even better—and we report results from batch training without any manual selection over individual tasks. Such variability is commonly observed in imitation learning, especially with behavior cloning, and may lead to minor performance shifts across runs. In some cases, CoA may even outperform other methods on the same task under different training seeds. To mitigate these fluctuations, we report average performance over a broad task set (10 tasks for ablations and 60 for overall evaluation), ensuring that our conclusions reflect stable, aggregated trends. Similar instability has also been discussed in prior work, such as "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation".
I thank the authors for their responses and would like to comment on some of them: W2 - Thank you for the clarification, this is indeed a major difference between your work and ACT that I didn’t fully comprehend at the time. W3 - I think you missed part of your rebuttal. W4 - To provide statistical confidence, you do not need to perform multiple training runs over all tasks. If you have multiple evaluation runs over 60 different tasks, that should be more than enough data points for a confidence interval.
We thank the reviewer for the constructive feedback and the opportunity to clarify and strengthen our submission. Please find our detailed responses below:
W3
We apologize for the previous oversight in result reporting. We now include an additional ablation. Specifically, the “Goal+Chunk” variant first predicts the goal pose and then generates a complete action chunk conditioned on the predicted goal. The chunking mechanism follows the original implementation used in ACT.
We observe that the backward paradigm still achieves the best performance, as it naturally models a global-to-local structure for trajectory generation. This new comparison further highlights the effectiveness of our backward generation strategy beyond what was shown in the original submission.
| Method | Success Rate (%) |
|---|---|
| Backward | 75.60 |
| Forward | 66.80 |
| Hybrid | 60.00 |
| Goal+Chunk | 65.40 |
W4
Thank you for the suggestion. If what you are referring to is the variance of success rates across tasks, we can report the following:
| Method | Mean Success Rate | Variance |
|---|---|---|
| CoA | 0.5520 | 0.0932 |
| ACT | 0.3893 | 0.0949 |
| DP | 0.3260 | 0.0957 |
While CoA achieves the highest average success rate, the comparable variances across methods indicate that the improvement of CoA is not driven by a few outlier tasks but is consistently observed across the task suite. These additional statistics further support the robustness and broad effectiveness of our method, and we will include them in the revised manuscript.
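Computing such statistics over per-task success rates is straightforward; a small sketch of the mean, variance, and a Student-t confidence interval over tasks (illustrative code, not our actual evaluation scripts):

```python
import numpy as np
from scipy import stats

def summarize(success_rates, confidence=0.95):
    """success_rates: per-task success rates (e.g. 60 values in [0, 1]).
    Returns the mean, sample variance, and a Student-t confidence interval
    over tasks, as the reviewer suggests."""
    x = np.asarray(success_rates, dtype=float)
    mean, var = x.mean(), x.var(ddof=1)
    half = stats.t.ppf(0.5 + confidence / 2, df=len(x) - 1) * x.std(ddof=1) / np.sqrt(len(x))
    return mean, var, (mean - half, mean + half)
```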
Dear Reviewer, I hope this message finds you well. I just wanted to gently follow up on our discussion, in case my response was overlooked.
If anything remains unclear, we’d be more than happy to continue the discussion.
Dear authors, thank you for providing the additional results. I have no further questions regarding the paper.
This paper presents Chain-of-Action (CoA), a novel trajectory-level autoregressive modeling framework for robotic manipulation. Unlike conventional forward action generation, CoA introduces backward trajectory prediction, starting from a goal-anchored keyframe and generating actions in reverse. Empirical results show that CoA significantly outperforms prior methods such as ACT and Diffusion Policy on 60 RLBench tasks and 8 real-world kitchen tasks.
Strengths and Weaknesses
Strengths:
- Strong empirical results: CoA achieves strong performance across both simulation and real-world settings, outperforming several baselines.
- Comprehensive ablation and case studies: The reviewer appreciates the in-depth ablation study, which provides valuable insights into the contribution of each design choice.
Weaknesses:
- Motivation: It is unclear what the biggest motivation is for favoring the reverse autoregressive structure over the forward autoregressive structure. Is it because of the compounding error? If so, it is worth investigating methods that tackle similar problems.
- Incremental novelty: This paper essentially proposes an improvement on top of ACT, with goal-conditioning and reverse autoregression. Goal-conditioning is a well-explored topic, and it is not surprising that it can improve performance. The reverse autoregressive idea is interesting; however, the paper lacks theoretical grounding for the proposed idea and sufficient discussion of the "chain-of-action".
- Writing: The writing can be improved. It would be helpful to explain the overall architecture and then dive deep into each component. Also, many terms are not properly defined; for example, what do the start-of-sequence (SOS) token and the final keyframe action refer to in line 166?
- Related work: It would be helpful to include more literature on goal-conditioned learning.
- Baseline: It would be interesting to see how a goal-conditioned forward autoregressive model and a goal-conditioned diffusion model perform. These are important baselines to evaluate the contribution of this work.
Questions
- What is the reason that the proposed model failed on Press Switch, Reach Target, Sweep Dust and Open Box?
- Does the learned token refer to the latent variable learned from the encoder? If the proposed network follows the design of ACT, what is the distribution of this latent variable?
- Can the proposed method generalize to other model architectures other than CVAE?
Limitations
Yes.
Final Justification
The authors provided high-quality responses during rebuttal and addressed most of my concerns. This work presents a novel approach to handling compounding error for autoregressive models, which is important for robotic manipulation tasks. I do think that 1) improving the writing of the manuscript and 2) providing a more comprehensive benchmark could strengthen the paper. However, given the responses provided during rebuttal, I would like to raise my score for this submission.
Formatting Issues
N/A.
We appreciate your insightful questions and suggestions. Below, we provide detailed responses to your concerns:
w1[Motivation]
To clarify our motivation more precisely, we decompose it into two aspects:
- Goal conditioning helps mitigate compounding errors.
- Backward prediction can seamlessly incorporate goal-conditioned generation, whereas forward prediction cannot do so as naturally.
Evidence that goal conditioning reduces compounding errors:
Waypoint-Based Imitation Learning for Robotic Manipulation (CoRL 2023) demonstrates that decomposing trajectories into intermediate waypoints effectively mitigates cumulative errors. Similarly, hierarchical modeling approaches discussed in related work support this idea. For example, Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation (CVPR 2024) leverages keyframes and a diffusion-based planner, achieving higher success rates than ACT and Diffusion Policy.
Why goal conditioning with backward prediction is better:
Backward prediction can incorporate goal conditioning following the natural trajectory order, preserving the spatial dependencies between waypoints, which leads to better performance. Of course, forward prediction can also incorporate goal conditioning, as shown by the setting Hybrid in Table 2, where the goal is predicted first and then actions are generated in a forward manner. However, its performance is even worse than pure Forward. This is because backward prediction imposes a natural global-to-local constraint on the predicted actions. In contrast, the hybrid approach directly jumps from the goal prediction to the current action, lacking spatial causal continuity.
w2[Novelty]
Since our technical approach has not been extensively explored before, there may be concerns about its rationale. To address this, we highlight two possibly related works that partially support our idea:
- There is a concurrent work titled Efficient Robotic Policy Learning via Latent Space Backward Planning (accepted to ICML 2025, posted on arXiv on May 11, four days before the NeurIPS deadline), which performs backward planning in the latent space by predicting future latent states from current observations. This is fundamentally different from our approach, as we conduct backward planning directly in the action space, enabling explicit autoregressive generation of action sequences from goal to start.
- Another related work is Chain of Thought Imitation with Procedure Cloning, which introduces procedure cloning and conditions action prediction only on goal poses. However, it does not incorporate backward trajectory prediction.
In summary, to the best of our knowledge, ours is the first method to perform explicit reverse-order prediction in the action space.
On the naming rationale of “chain-of-action”: We believe the community widely accepts that “chain-of-…” terminology refers to the presence of an intermediate reasoning or procedural process between inputs and outputs, not necessarily limited to symbolic or language-based reasoning like in large language models (LLMs). For instance, recent work like CoT-VLA simply uses an image goal as a condition for action prediction, yet the naming convention is broadly accepted. This supports our use of “chain-of-action” to describe our method’s autoregressive backward action reasoning.
w3[Writing]
We apologize for the unclear description in the current manuscript and will update it. To clarify, the start-of-sequence (SOS) token serves as a query input to the transformer decoder and is decoded into the keyframe action, which initiates the autoregressive prediction sequence.
w4[Related work]
We would like to clarify that our work is specifically related to subgoal-conditioned policy learning in robotic manipulation, rather than the broader class of general goal-conditioned methods. Prior works that use goal poses as subgoals predominantly depend on explicit motion planners and 3D state observations; these methods have been comprehensively reviewed in "Hierarchical Modeling in Robotic Manipulation."
For approaches utilizing alternative subgoal modalities—such as language, bounding boxes, or visual traces—we have already discussed relevant works in our review of CoT-style methods in robotic manipulation.
Apart from these, we note additional works that employ subgoals for manipulation, including:
- Waypoint-Based Imitation Learning for Robotic Manipulation
- Hierarchical Foresight: Self-Supervised Learning of Long-Horizon Tasks via Visual Subgoal Generation
- Subgoal Diffuser: Coarse-to-Fine Subgoal Generation to Guide Model Predictive Control for Robot Manipulation
w5[Baseline]
We will improve our writing for clearer expression. In fact, we have already presented results for goal-conditioned forward prediction (the Hybrid setting in Table 2), as well as for ACT with keyframe action conditioning. As mentioned in Appendix D, adding keyframe action conditioning improves ACT’s performance from 48.8% to 51.6%. Additionally, we conducted an experiment showing that Diffusion Policy improves from 42.0% to 44.8% with keyframe conditioning. We will update these results accordingly in the manuscript.
q1[Failure case]
In summary, we did not observe consistent failure patterns across specific categories of tasks (e.g., high-precision or deformable-object tasks). However, distinct failure modes can often be identified at the individual task level, suggesting that performance limitations are more task-specific than type-specific.
Regarding tasks where CoA shows lower performance compared to ACT, DP, or Octo:
- Octo with a pretrained model and a different action space: CoA underperforms relative to Octo on this task. This is largely because Octo is a fine-tuned, pretrained model and operates in a delta joint action space, which is particularly advantageous for tasks requiring precise rotational control. Therefore, direct comparisons on such tasks are of limited fairness.
- Incidental fluctuations due to training variability: For individual tasks such as Reach Target, Press Switch, and Sweep Dust, the relatively small performance gaps (<12%) can be attributed to non-systematic factors such as optimization variability. That is, re-training the model with the same hyperparameters may yield slightly different results—sometimes even better—and we report results from batch training without any manual selection over individual tasks. Such variability is commonly observed in imitation learning, especially with behavior cloning, and may lead to minor performance shifts across runs. In some cases, CoA may even outperform other methods on the same task under different training seeds. To mitigate these fluctuations, we report average performance over a broad task set (10 tasks for ablations and 60 for overall evaluation), ensuring that our conclusions reflect stable, aggregated trends. Similar instability has also been discussed in prior work, such as "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation".
q2/q3[About CVAE and latent token]
We would like to clarify a potential misunderstanding and will revise the writing to better reflect our approach. As shown in Figure 1, we do not use the CVAE component from ACT, so there is no latent variable from a CVAE. The SOS token simply serves as a start token for the autoregressive transformer decoder.
CoA shares only the number of Transformer encoder and decoder layers with ACT. As detailed in Section 4 (Implementation Details – Network Architecture), our model does share architectural similarity with ACT, as both employ a 4-layer Transformer encoder and a 7-layer Transformer decoder. However, we intentionally removed the CVAE module to maintain a minimalist design. We will further clarify this distinction explicitly in our manuscript to avoid confusion.
Dear Reviewer, I hope this message finds you well. I just wanted to gently follow up on our discussion, in case my response was overlooked.
If anything remains unclear, we’d be more than happy to continue the discussion.
I would like to thank the authors for the detailed responses. I appreciate the efforts to address all my questions. I think the authors have addressed most of my questions. Motivation and novelty - Thanks for the clarification. If the claim "ours is the first method to perform explicit reverse-order prediction in the action space" is accurate, I agree that this work provides enough contribution to the field. And thanks for providing a comprehensive related work. I do have a few follow-up questions on "Why goal conditioning with backward prediction is better":
- Maybe I am missing something, but for the same predicted goal, why does the backward prediction outperform the forward prediction? Is the global-to-local constraint not available for forward prediction? Do you have a perspective that is not performance-driven?
- Could you also provide some clarification on "the hybrid approach directly jumps from the goal prediction to the current action, lacking spatial causal continuity"?
Thank you for raising this insightful question; it highlights a key advantage of CoA.
Summary
Regarding Q1, as briefly explained at line 289, while the Hybrid (goal + forward) paradigm does impose a global-to-local constraint, it lacks spatial continuity. This significantly increases modeling difficulty, leading to weaker fitting ability compared with our Reverse paradigm. This is confirmed by the training loss analysis below, where Reverse consistently achieves lower loss than Hybrid. Additionally, the Reverse paradigm facilitates dynamic stopping more effectively than Hybrid.
Regarding Q2: the terms continuity, spatial continuity, and spatial causal continuity are used interchangeably in our response; they all refer to the same concept. A precise definition is provided below for clarity.
Modeling Comparison
| Paradigm | Goal-conditioned | Spatial Continuity |
|---|---|---|
| Forward | ✘ | ✔️ |
| Reverse (CoA) | ✔️ | ✔️ |
| Hybrid | ✔️ | ✘ |
Why Reverse Performs Best:
CoA combines two essential structural properties:
- Goal Conditioning: The entire trajectory is anchored at the goal (the keyframe action), guiding all subsequent predictions and reducing compounding errors. Its benefits have already been established in our earlier discussion.
- Spatial Continuity (also referred to as spatial causal continuity in the previous clarification): predictions follow naturally adjacent actions, maintaining smooth and coherent transitions. We elaborate on this aspect using formal equations in the next section.
Both Forward and Reverse maintain spatial continuity, but only Reverse also leverages goal conditioning. While Hybrid is goal-conditioned, it breaks the spatial continuity.
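Schematically (illustrative notation only; see Formulation 1 in the paper for the exact definition), with $o$ the current observation, $a_{1:T}$ the trajectory, and $a_T$ the keyframe (goal) action, the three paradigms factorize as:

```latex
\begin{aligned}
\text{Forward:} \quad & p(a_{1:T}\mid o)=\prod_{t=1}^{T} p(a_t \mid a_{1:t-1},\, o)
  && \text{adjacent steps, no goal anchor}\\
\text{Reverse (CoA):} \quad & p(a_{1:T}\mid o)=p(a_T\mid o)\prod_{t=T-1}^{1} p(a_t \mid a_{t+1:T},\, o)
  && \text{adjacent steps, anchored at } a_T\\
\text{Hybrid:} \quad & p(a_{1:T}\mid o)=p(a_T\mid o)\prod_{t=1}^{T-1} p(a_t \mid a_{1:t-1},\, a_T,\, o)
  && \text{first step jumps from } a_T \text{ to } a_1
\end{aligned}
```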
Spatial Continuity Analysis:
Reverse (refer to Formulation 1, lines 118–119): each prediction step is spatially adjacent, enabling local and coherent transitions anchored by the goal.
Hybrid:
- While it is goal-conditioned, predicting the first action immediately after the goal introduces a long-range spatial jump, which breaks spatial continuity. Intuitively, predicting an action far from its conditioning context is clearly harder than predicting a spatially adjacent one, since the latter preserves spatial continuity. This influence propagates through the entire sequence, degrading the model's fitting ability, as evidenced by the training loss analysis below.
- Moreover, the lack of continuity makes dynamic stopping difficult to implement effectively (in the Hybrid paradigm, the stopping criterion is modified to stop when the predicted pose approaches the goal pose rather than the start pose). This is because in Hybrid a single incorrect stop can break the entire trajectory generation. In contrast, for Reverse, even if stopping occurs too early, the trajectory remains coherent and anchored to the goal. This is supported by our visualization of trajectories (which unfortunately cannot be shown here).
These two issues significantly increase the difficulty of generating a spatially continuous trajectory.
Training loss analysis:
This table shows the total training loss across 10 representative tasks. We observe that Reverse consistently achieves lower training loss compared to Hybrid on all tasks, with the average loss reduced from 0.00376 (Hybrid) to 0.00160 (Reverse).
| Task Name | Reverse | Hybrid |
|---|---|---|
| Average | 0.00160 | 0.00376 |
| turn_tap | 0.00194 | 0.00453 |
| take_lid_off_saucepan | 0.00159 | 0.00528 |
| sweep_to_dustpan | 0.000917 | 0.00184 |
| stack_wine | 0.00268 | 0.00452 |
| reach_target | 0.00106 | 0.00236 |
| push_button | 0.00153 | 0.00466 |
| press_switch | 0.00152 | 0.00307 |
| pick_up_cup | 0.00124 | 0.00451 |
| open_drawer | 0.00145 | 0.00172 |
| open_box | 0.00196 | 0.00411 |
We appreciate the opportunity to clarify this fundamental insight and will further emphasize this distinction in the revised manuscript.
Thanks for the response. I think the latest response does provide clarity on the key contribution of the submission. I don't have any further questions, and since the authors addressed all my concerns during the rebuttal. I would like to raise my score.
Dear Reviewer
Your feedback has been invaluable in helping us improve the paper. We’re glad to hear that our responses addressed your concerns. We sincerely appreciate your support and thoughtful feedback throughout the review process.
The paper presents Chain-of-Action (CoA), a novel visuo-motor policy architecture for robotic manipulation, built upon trajectory autoregressive modeling with a reverse generation paradigm. CoA generates the actions by reasoning backward from a goal-specific keyframe action to the initial state. This backward generation enforces a global-to-local structure, ensuring that every local action remains tightly aligned with the task objective and significantly mitigating compounding errors. The authors introduce four key design elements to realize the CoA framework:
- Continuous action token representation to avoid quantization errors.
- Multi-token prediction to improve local action coherence.
- Dynamic stopping for flexible trajectory lengths.
- Reverse temporal ensemble to aggregate multiple backward rollouts. Experimental results on the RLBench benchmark (60 simulation tasks and 10 representative tasks) demonstrate that CoA achieves state-of-the-art performance, outperforming strong baselines such as ACT and Diffusion Policy (DP), with average improvements of 16% and 23% respectively. Notably, CoA shows superior spatial generalization, maintaining higher success rates even as the spatial variance of the environment increases, and proves more robust in out-of-distribution scenarios. Ablation studies confirm the importance of each design component. Real-world experiments on a Fetch robot further validate CoA’s effectiveness, achieving a 15% higher average success rate than ACT on eight kitchen manipulation tasks.
Strengths and Weaknesses
Strengths
- The proposed algorithm is both simple and intuitive. By leveraging a backward, goal-conditioned trajectory generation paradigm, the approach is conceptually easy to understand and implement, yet powerful in practice.
- The method achieves strong empirical performance, consistently outperforming existing baselines across a wide range of simulated and real-world robotic manipulation tasks.
- The paper provides extensive experimental analysis, including comprehensive comparisons to widely-used methods, thorough ablation studies, and in-depth exploration of spatial generalization and robustness, which together offer convincing evidence for the effectiveness of the proposed approach.
Weaknesses
- The method is limited to manipulation tasks where actions correspond directly to end-effector poses. Both the keyframe extraction heuristic and the distance-based stopping mechanism rely on this correspondence, making it difficult to generalize the approach to broader classes of robotic tasks (e.g., locomotion, multi-agent coordination, or tasks with more abstract action spaces).
- While the algorithm’s simplicity and intuitive design are strengths, it raises the question of whether reverse-order planning is truly novel, or if similar ideas have been explored in previous work. A more thorough review of related literature on backward or reverse-order trajectory generation would strengthen the paper.
Questions
- I am curious whether there are truly no existing works that adopt a similar reverse-order trajectory generation approach. If your response (or further discussion among reviewers) can clarify the novelty of CoA relative to prior literature, I would be inclined to rate the paper more favorably.
- I would also like to better understand why the proposed method achieves better performance than Diffusion Policy. As I understand, Diffusion Policy also considers future states by denoising current actions, which seems to have a similar effect to the reverse-order generation in CoA. Is there a specific reason for CoA’s superior performance? For example, could it be due to the action sequence length? According to Figure 7, CoA generates sequences of about 80 steps, whereas Diffusion Policy typically generates only around 20 steps per trajectory. Is it possible that DP does not fully account for the keyframe action or long-horizon dependencies as effectively due to this shorter sequence length?
Limitations
yes
Final Justification
The authors demonstrated the novelty of their proposed method and its superiority over existing approaches through their rebuttal. Accordingly, I raised my rating.
Formatting Issues
There is no major formatting issue
We appreciate your insightful questions and suggestions. Below, we provide detailed responses to your concerns:
w1[Limitations in applicability]
We acknowledge that this limitation exists in our implementation in this paper. However, the core idea of CoA can be extended to other domains through domain-specific heuristics and design choices. For example, in quadruped locomotion, foot touchdown events can naturally serve as keyframes. Moreover, the dynamic stopping mechanism could also be learned depending on the specific scenario, rather than relying on a distance-based rule.
w2/q1[Regarding novelty]
While backward planning has been extensively studied in the classical motion planning and optimization communities, to the best of our knowledge, it has not been explored within learning-based visuomotor policies prior to our work.
However, we did find a concurrent work titled "Efficient Robotic Policy Learning via Latent Space Backward Planning" (accepted to ICML 2025, released on arXiv on May 11, four days before the NeurIPS deadline) that also adopts a backward planning strategy—but in the latent space, by predicting future latent states and latent subgoals from current observations. This approach is fundamentally different from ours: we perform backward planning directly in the action space, enabling explicit autoregressive generation of the whole trajectory. Our advantages are that (1) the modeling of CoA can serve as an effective alternative to action chunking, since it is highly compatible with other existing methods, and (2) CoA offers better interpretability, as its generated trajectories can be visualized to reveal the underlying decision-making logic.
Another related work is "Chain of Thought Imitation with Procedure Cloning" (NeurIPS 2022), which proposes procedure cloning conditioned only on the final goal pose. However, it does not involve backward prediction of the full trajectory.
In summary, our method is the first manipulation imitation learning policy to perform backward prediction directly in the action space.
q2[Comparison with Diffusion Policy with longer action horizon]:
Thank you for raising this insightful point. To directly address your question, we conducted additional experiments to explicitly compare CoA with Diffusion Policy (DP) under longer action horizons. Specifically, we evaluated DP on 10 representative RLBench tasks (as previously used in Table 1's ablation studies) with extended action horizons of 100 and even 180 steps. Note that 180 exceeds the maximum full trajectory length observed across these 10 tasks (178.96 steps), as detailed below:
Full Trajectory Lengths (10 representative tasks):
| Task | Full Trajectory Length |
|---|---|
| turn_tap | 146.48 |
| take_lid_off_saucepan | 98.80 |
| sweep_to_dustpan | 118.08 |
| stack_wine | 178.96 |
| reach_target | 43.88 |
| push_button | 84.52 |
| press_switch | 116.84 |
| pick_up_cup | 97.28 |
| open_drawer | 105.28 |
| open_box | 152.36 |
The resulting success rates for DP with extended action horizons are summarized below:
| Action Horizon | Success Rate (%) |
|---|---|
| 20 | 41.6 |
| 100 | 42.0 |
| 180 | 42.0 |
| 100 (w/ Goal) | 44.8 |
We observed that extending DP’s action horizon from its original length of 20 steps to 100 or even 180 steps yielded minimal improvement (from 41.6% to only 42%), indicating that DP does not effectively leverage longer action horizons. This finding aligns with the original Diffusion Policy paper, which noted optimal performance at around 8-step horizons, beyond which performance deteriorates.
Thus, the superior performance of CoA compared to DP can be attributed to CoA's ability to effectively handle long-horizon trajectories and explicitly condition each local action sequence on the final goal via backward reasoning. In contrast, despite denoising steps implicitly considering future states, DP inherently struggles to capture long-horizon dependencies when the action horizon is significantly increased.
I appreciate the authors for addressing my concerns about the novelty and the comparison with existing algorithms. Accordingly, I have raised my rating.
We sincerely appreciate your recognition and the rating adjustment. Thank you again for your valuable feedback.
Dear Reviewer, I hope this message finds you well. I just wanted to gently follow up on our discussion, in case my response was overlooked.
If anything remains unclear, we’d be more than happy to continue the discussion.
The paper mentions in the abstract: "Empirically, we observe CoA achieves the state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks". For the claim of SOTA performance to be validated, SOTA policies in the benchmark should be compared against, such as RVT-2 and 3D diffuser actor. Could the authors please show the results side by side of these methods and theirs, to see what is the gap in performance, and if there is any?
Key Differences in Methodology
| Aspect | 3D-based Hierarchical Methods | Image-based Visuomotor Policies |
|---|---|---|
| Typical Methods | PerAct, RVT-2, 3D Diffuser Actor | ACT, Diffusion Policy, CoA |
| Input Modality | 3D point cloud / RGB-D | RGB-only |
| Pipeline | Two-stage: keyframe + motion planning | End-to-end trajectory prediction |
| Execution Mode | Open-loop between keyframes | Closed-loop |
Why Our Main Benchmark is 60 Tasks
The RLBench-18 subset originates from PerAct (CoRL 2022) and is tailored to 3D-based pipelines. Our 60-task RGB benchmark:
- Covers a broader and more diverse range of tasks (including 7 from the 18-task set).
- Better reflects the strengths and weaknesses of image-based policies. The 18 tasks are often tailored for 3D-based hierarchical pipelines. For RGB-only policies, several of these tasks are too hard and frequently result in zero success, reducing their evaluative value.
- Recent literature has adopted customized task subsets. For example, Point Cloud Matters (NeurIPS 2024), Conditional Flow Matching (CoRL 2024), and Generative Image as Action Models (CoRL 2024) all adopt their own task selections based on their model scope and evaluation needs.
In summary, we will revise the abstract to avoid overclaiming, and we have provided the requested side-by-side comparison. While CoA is not SOTA when compared to 3D-based hierarchical pipelines, it achieves the strongest performance within image-based visuomotor policies, which is the intended scope of our claim.
Dear Area Chair,
Thank you very much for raising this point.
We acknowledge that the current abstract might unintentionally imply a broader SOTA claim than intended. Our actual intention was to highlight CoA’s empirical strength within the scope of image-based visuomotor policies. We will revise the abstract to the following phrasing to avoid misunderstanding:
"Empirically, we observe that CoA outperforms representative imitation learning algorithms such as ACT and Diffusion Policy across 60 RLBench tasks and 8 real-world manipulation tasks."
Direct Comparison with RVT-2 and 3D Diffuser Actor
On the RLBench-18 subset used by RVT-2 and 3D Diffuser Actor, we report the results side-by-side below. As shown, 3D-based hierarchical methods indeed achieve higher overall success rates. However, within image-based visuomotor policies, CoA achieves the strongest performance, showing the advantage of reverse modeling.
| Task | 3D Diffuser Actor (3D hierarchical) | RVT-2 (3D hierarchical) | Image-BC (CNN) | Image-BC (ViT) | DP (image) | ACT (image) | CoA (image) |
|---|---|---|---|---|---|---|---|
| Average | 81.3 | 81.4 | 1.33 | 0.89 | 17.33 | 24.44 | 37.33 |
| Close Jar | 96.0 ± 2.5 | 100.0 ± 0.0 | 0 | 0 | 0 | 0 | 0 |
| Drag Stick | 100.0 ± 0.0 | 99.0 ± 1.7 | 0 | 0 | 0 | 0 | 0 |
| Insert Peg | 65.6 ± 4.1 | 40.0 ± 0.0 | 0 | 0 | 0 | 0 | 0 |
| Meat off Grill | 96.8 ± 1.6 | 99.0 ± 1.7 | 0 | 0 | 16 | 32 | 88 |
| Open Drawer | 89.6 ± 4.1 | 74.0 ± 11.8 | 4 | 0 | 44 | 52 | 88 |
| Place Cups | 24.0 ± 7.6 | 38.0 ± 4.5 | 0 | 0 | 0 | 0 | 0 |
| Place Wine | 93.6 ± 4.8 | 95.0 ± 3.3 | 0 | 0 | 56 | 56 | 80 |
| Push Buttons | 98.4 ± 2.0 | 100.0 ± 0.0 | 0 | 0 | 0 | 32 | 28 |
| Put in Cupboard | 85.6 ± 4.1 | 66.0 ± 4.5 | 0 | 0 | 0 | 0 | 8 |
| Put in Drawer | 96.0 ± 3.6 | 96.0 ± 0.0 | 8 | 0 | 40 | 60 | 88 |
| Put in Safe | 97.6 ± 2.0 | 96.0 ± 2.8 | 4 | 0 | 24 | 36 | 80 |
| Screw Bulb | 82.4 ± 2.0 | 88.0 ± 4.9 | 0 | 0 | 0 | 0 | 0 |
| Slide Block | 97.6 ± 3.2 | 92.0 ± 2.8 | 0 | 0 | 0 | 36 | 64 |
| Sort Shape | 44.0 ± 4.4 | 35.0 ± 7.1 | 0 | 0 | 0 | 0 | 0 |
| Stack Blocks | 68.3 ± 3.3 | 80.0 ± 2.8 | 0 | 0 | 0 | 0 | 0 |
| Stack Cups | 47.2 ± 8.5 | 69.0 ± 5.9 | 0 | 0 | 0 | 0 | 0 |
| Sweep to Dustpan | 84.0 ± 4.4 | 100.0 ± 0.0 | 0 | 0 | 100 | 100 | 92 |
| Turn Tap | 99.2 ± 1.6 | 99.0 ± 1.7 | 8 | 16 | 32 | 36 | 56 |
Note:
- Results for CoA, ACT, and DP are from our implementations.
- All others (3D Diffuser Actor, RVT-2, Image-BC) are from RVT-2, 3D Diffuser Actor, and PerAct.
- For image-based policies, some tasks have near-zero success; this reflects task difficulty, not evaluation error.
- Image-BC results are from PerAct (CoRL 2022).
- Point Cloud Matters (NeurIPS 2024) also reported 0 success for ACT and DP on many tasks.
The paper introduces Chain-of-Action (CoA), a novel visuo-motor policy architecture for robotic manipulation based on trajectory autoregressive modeling with a reverse-generation paradigm. CoA generates actions by reasoning backward from a goal-specific keyframe action to the initial state. The method outperforms standard 2D policies. All reviewers responded positively, citing its strong empirical performance and comprehensive evaluation and ablations. The paper is recommended for publication.