PaperHub
6.0 / 10
Poster · 4 reviewers
Ratings: 2, 5, 4, 4 (min 2, max 5, std 1.1)
Confidence: 4.0
Novelty: 2.5 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.3
NeurIPS 2025

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
End-to-end Autonomous Driving · Vision Language Action Model · Reinforcement Finetuning

Reviews and Discussion

Official Review (Rating: 2)

The authors propose a novel autoregressive framework for end-to-end automated driving, using a VLA with tokenized, discretized actions. The actions are tokenized to rule out infeasible outputs. The model works with a dual-thinking approach: slow thinking produces a chain-of-thought (CoT) as well as a set of action tokens for the next 5 seconds of planning, while fast thinking skips the CoT and returns only the action tokens. The authors first train with supervised fine-tuning, using both ground-truth planning data and reasoning data for CoT, where the reasoning data is distilled from a larger VLM. After SFT, the authors propose an RFT approach using GRPO to enhance model efficacy and performance. The model is tested on multiple benchmarks and shows competitive results.

Strengths and Weaknesses

  1. Quality - The paper is well written, and all the claims it formulates are properly addressed and argued. The methods used to test the model are appropriate and in line with traditional benchmarks. However, the paper does not provide a real performance test: runtime is mentioned only once, with no further analysis beyond stating that the model achieves near real-time performance using many GPUs. The experimental results are presented as is, without any real discussion beyond restating what the graphs show; there is no deeper analysis of the results or discussion of the method's potential drawbacks.
  2. Clarity - While the main methodological and experimental approaches are defined precisely, the main drawback I see is that the actual process of discretizing and tokenizing the actions is not explained in as much detail as other, less significant parts, even though it is one of the paper's main contributions. A better overview of how the action space was discretized, why short-term spatial positions are used, why K-disk clustering was chosen (and how it was set up), etc., would benefit the paper greatly.
  3. Significance - The results are promising but not as good as those of simpler methods competing against the proposed model in the benchmarks. The proposed model is better in only one category, and that is when using best-of-N planning, which the other models do not have access to. Meanwhile, the one-shot and post-RFT models underperform, landing in the lower half of the metrics when ranked against the provided models.
  4. Originality - The main insight the paper provides, from my perspective, is that tokenizing the action space rules out nonsensical outputs. This is more a product of the method than an insight: since the vocabulary contains only feasible actions, the model can only output feasible actions. Other parts of the model, pipeline, and framework are not novel and do not provide new insights; they are rather logical consequences of the steps taken in the paper. The paper does provide an overview of a framework, but there is no real scientific insight gained here; it is more an engineering assessment of a framework that, in the end, does not provide inherent performance benefits over others.

Questions

  1. Please provide a more detailed explanation of the exact procedure for tokenizing the action space. Why do you tokenize in this way? Why do you use K-disk clustering and not something else? Why do you use short-term segments and not a different approach, etc.?
  2. Why is the method worse in some categories than other benchmark models? Please discuss and analyze the results, not only present them.

Limitations

The limitations subsection only discusses the real-world feasibility of the method, without touching on any other limitations of the model or discussing the experimental results, which were on average below the other methods, in some cases by a large margin.

Final Justification

Although the authors could clarify several of our and the other reviewers' questions, the main issues of low novelty and questionable significance, as also remarked by another reviewer, remain. Thus I keep my rating.

Formatting Concerns

None.

Author Response

1. Performance test and discussions

We thank the reviewer for highlighting the need for a more thorough analysis and discussion of the experimental results. We note that we have provided a more detailed analysis in the supplementary materials, and we provide additional analysis and discussions below.

First, we would like to clarify that our near real-time inference results were obtained using a single GPU, not multiple GPUs as the reviewer noted. In the following sections, we include a detailed summary of model size, inference speed (in FPS), and hardware configurations, in comparison with other methods.

Further, following the reviewer’s suggestion, we conducted additional experiments to analyze runtime performance under various training settings. The results show that our proposed post-training method (RFT with CoT length penalty) can significantly improve inference efficiency by reducing slow thinking. These results show that RFT effectively shifts model behavior toward faster inference modes without sacrificing planning performance.

| Method | Minimum (s) | Maximum (s) | Avg. Runtime (s) | Fast Thinking (%) | Slow Thinking (%) |
|---|---|---|---|---|---|
| SFT | 0.997 | 13.706 | 3.951 | 66.8 | 33.2 |
| RFT (w/o CoT length penalty) | 1.002 | 13.471 | 3.840 | 68.4 | 31.6 |
| RFT (w/ CoT length penalty) | 0.999 | 12.964 | 1.312 | 96.6 | 3.4 |

We also break down runtimes by thinking mode. As shown in the table below, fast thinking yields much faster and more stable inference compared to slow thinking. Our training approach encourages the model to use fast thinking in simple scenarios and reserve slow reasoning only when necessary.

| Thinking Mode | Min Runtime (s) | Max Runtime (s) | Avg. Runtime (s) |
|---|---|---|---|
| Fast Thinking | 0.997 | 1.116 | 1.072 |
| Slow Thinking | 7.607 | 13.706 | 10.518 |

2. Process of discretizing and tokenizing actions

We thank the reviewer for pointing out the need for greater clarity regarding our action tokenization process. We have provided more implementation details in the supplementary materials. To address the reviewer's comments, we include the following elaborations:

  1. Discretization Strategy: We first want to emphasize the structural differences between robotic manipulators and autonomous vehicles. Robot arms are fully actuated, holonomic systems capable of executing independent actions per dimension. However, vehicles represent classical underactuated, nonholonomic systems with three degrees of freedom (i.e., position $x$, $y$, and heading) controlled indirectly via acceleration and steering rate. Therefore, we discretize the action space by representing each action token as a short-term, kinematically feasible trajectory segment (0.5 seconds). This tokenized representation encodes motion primitives that are reachable under nonholonomic constraints, serving as building blocks for long-term trajectories.
  2. Use of Short-term Spatial Positions: Short-term trajectory segments provide a compact and expressive representation of immediate vehicle movement, which aligns well with the autoregressive generation of our model. This strategy reduces the complexity of long-horizon prediction and improves training efficiency.
  3. Why K-Disk Clustering: This method groups similar short-term state transitions into discrete tokens while preserving continuity and ensuring dense coverage of diverse motion patterns across three degrees of freedom (i.e., position $x$, $y$, and heading). A minimal sketch of the covering and assignment procedure is given below.
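To make item 3 concrete, below is a minimal sketch of a greedy K-disk covering in Python. It is an illustration under assumptions: the distance metric, heading weighting, and radius schedule are not taken from the paper, and `build_codebook`/`tokenize` are hypothetical helper names rather than the authors' code.

```python
import numpy as np

def build_codebook(segments, radius):
    """Greedy K-disk covering: a segment becomes a new token only if it lies
    outside the radius-disk of every token selected so far.
    segments: (N, T, 3) array of short (x, y, heading) trajectory pieces."""
    codebook = [segments[0]]
    for seg in segments[1:]:
        # distance to a token = maximum pointwise deviation over the segment
        dists = [np.abs(seg - tok).max() for tok in codebook]
        if min(dists) > radius:
            codebook.append(seg)
    return np.stack(codebook)

def tokenize(segment, codebook):
    """Nearest-neighbor assignment of a segment to its token index."""
    return int(np.argmin(np.linalg.norm(codebook - segment, axis=(1, 2))))
```

In such a scheme, the radius would be tuned (for example, by bisection) until the codebook reaches the target vocabulary size, e.g., K = 2048.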

[1] Brohan et al. "RT-1: Robotics transformer for real-world control at scale." RSS 2023.

[2] Pertsch et al. "Fast: Efficient action tokenization for vision-language-action models." RSS 2025.

We compared our K-disk tokenization against RT-1 (action bins) and FAST (tokenizing trajectories based on discrete cosine transform) using a fixed codebook size of 2048. The results below report the accuracy of trajectory reconstruction based on ground-truth tokens. Our method achieves the lowest reconstruction error.

| Tokenization | ADE (m) ↓ | FDE (m) ↓ |
|---|---|---|
| RT-1 (Action Bins) | 0.101 | 0.178 |
| FAST | 0.028 | 0.031 |
| K-disk (Ours) | 0.018 | 0.020 |
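For reference, reconstruction accuracy of this kind can be measured by snapping ground-truth segments to their nearest tokens and comparing the round trip against the originals. The sketch below is an assumed evaluation, not the paper's code; the segment shapes and nearest-neighbor decoder are illustrative choices.

```python
import numpy as np

def ade_fde(pred, gt):
    """Average and final displacement error between (T, 2) waypoint arrays."""
    d = np.linalg.norm(pred - gt, axis=-1)
    return d.mean(), d[-1]

def reconstruction_error(segments, codebook):
    """Snap each ground-truth segment to its nearest codebook token and
    measure the resulting error. segments: (N, T, 2); codebook: (K, T, 2)."""
    errs = []
    for seg in segments:
        idx = np.argmin(np.linalg.norm(codebook - seg, axis=(1, 2)))
        errs.append(ade_fde(codebook[idx], seg))
    ade, fde = np.mean(errs, axis=0)
    return ade, fde
```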

To assess downstream impact, we compare our model with the FAST tokenization method (without RFT) on the nuPlan (NAVSIM) test data. The table below shows that our method significantly outperforms FAST in planning metrics. This is largely due to our tokens directly representing physically meaningful vehicle motions, whereas FAST tokens encode more abstract frequency components, which are harder to learn from limited data.

| Tokenization | PDMS ↑ | No At-Fault Collision ↑ | Progress ↑ |
|---|---|---|---|
| FAST | 65.04 | 92.23 | 65.56 |
| Ours | 80.54 | 96.89 | 75.82 |

Thus, our discretization and tokenization strategy not only improves trajectory reconstruction accuracy but also translates into better planning performance.

3. Analysis of benchmark results

We thank the reviewer for their feedback on the significance of our results. While we acknowledge that our model does not achieve absolute SOTA performance, we would like to clarify several key aspects of our framework that we believe underscore its contribution and broader significance.

  • Generality and Simplicity. Unlike prior works that are often optimized for specific benchmarks and benefit from privileged modalities such as LiDAR, HD maps, and object-level annotations, our framework is designed to be general, scalable, and end-to-end, relying only on camera inputs, language, and trajectory supervision. We have listed the comparison of different methods in the nuPlan (NAVSIM) benchmark in the table below. Despite the absence of such privileged inputs, our model achieves competitive results in the benchmark without dataset-specific engineering. In addition, our model is competitive across multiple datasets and tasks (nuPlan, nuScenes, Waymo, and CARLA). We consider this cross-domain consistency a key contribution. Our results are also promising and suggest the viability of a unified formulation of planning and reasoning in autonomous systems.
  • Scalability with Data. We observe clear data scaling trends in our experiments, indicating that model performance improves substantially with increased training data. Due to the limited size of available open-source datasets, we believe that the full potential of our approach has not yet been realized.

Best-of-N Analysis and RFT. The best-of-N planning evaluation is intended to illustrate the performance ceiling achievable by leveraging the model’s inherent trajectory diversity. Although competing baselines do not include best-of-N evaluation, we emphasize that our RFT method achieves performance close to this upper bound while operating in a single-shot manner.

| Methods | Input | Training | PDMS ↑ |
|---|---|---|---|
| TransFuser | Cameras, LiDAR | Object-level Annotation, GT Trajectory, HD Map, Semantic Map | 83.88 |
| DiffusionDrive | Cameras, LiDAR | Object-level Annotation, GT Trajectory, HD Map | 88.10 |
| Hydra-MDP | Cameras, LiDAR | Object-level Annotation, GT Trajectory, Scoring | 91.26 |
| Centaur | Cameras, LiDAR | Object-level Annotation, GT Trajectory, Human Feedback | 92.10 |
| TrajHF | Cameras, LiDAR | Object-level Annotation, GT Trajectory, Human Feedback | 93.95 |
| AutoVLA | Cameras, Language | GT Trajectory, CoT, Reward Feedback | 89.11 |
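Returning to the best-of-N protocol mentioned above, a minimal sketch is given below. `sample_trajectory` and `pdm_score` are hypothetical stand-ins for the model's stochastic decoder and the NAVSIM scorer, and the value of n is illustrative.

```python
import numpy as np

def best_of_n(model, scene, n=32):
    """Sample n candidate trajectories for a scene and keep the one with the
    highest planning score. Both helpers used here are hypothetical."""
    candidates = [model.sample_trajectory(scene) for _ in range(n)]
    scores = [pdm_score(scene, traj) for traj in candidates]
    return candidates[int(np.argmax(scores))]
```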

4. Novelty

We appreciate the reviewer for the opportunity to clarify the contributions and insights presented in our work. We respectfully argue that the originality of our paper lies not only in the action tokenization method but also in the following key contributions:

  1. Unified Vision-Language-Action Planning Framework: Our framework bridges vision, language, and action in a unified learning setup for autonomous driving tasks, where discrete, physically-plausible trajectory tokens form the basis for autoregressive prediction. Although individual components (e.g., trajectory learning or language reasoning) have been studied in isolation, their integration under the VLA paradigm is novel and offers new potential for end-to-end, generalizable planning.
  2. Reinforcement Fine-Tuning for Trajectory Planning: To the best of our knowledge, this is the first application of GRPO for trajectory-level policy refinement in autonomous driving with vision-language models. While RFT has been extensively explored in natural language generation, our work demonstrates that it can also be effectively applied to trajectory prediction and refinement, yielding measurable improvements in planning quality. This establishes a new direction for aligning planning policies with preference-based or task-driven feedback. A minimal sketch of the group-relative update is given after this list.
  3. Fast-Slow Planning Paradigm: We provide new insights into how SFT (mixture thinking-mode training) and RFT can be leveraged to enable an adaptive fast-slow planning paradigm for VLA models. Empirically, this strategy improves overall planning performance and significantly enhances runtime efficiency, as the fast-thinking mode can suppress language-based reasoning outputs in straightforward scenarios, but keep reasoning capabilities in challenging situations.
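As a sketch of the group-relative update in point 2: GRPO samples a group of trajectories for the same scene, normalizes each rollout's reward by the group statistics, and applies a PPO-style clipped surrogate. The code below is a minimal illustration under assumptions (scalar per-sequence log-probabilities, PDMS-style scalar rewards, no KL term), not the paper's training code.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: center and scale each rollout's reward by
    the statistics of its group (all rollouts sampled for one scene).
    rewards: (G,) tensor of scalar rewards, e.g., a PDMS per trajectory."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate over per-sequence log-probabilities."""
    ratio = (logprobs - old_logprobs).exp()
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```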
Comment

First, I thank the authors for their response.

I want to point out again that the main issue regarding the results and comparison was not the results themselves, but the fact that there is no discussion in the paper explaining why these results are as they are compared to other methods.

However, my main concern is still novelty, which remains despite the rebuttal, which is why I will not change my rating.

Comment

We sincerely thank the reviewer for their feedback and engagement throughout the review process. We greatly appreciate the opportunity to further clarify the contributions and significance of our work.

1. Discussion of Results

We acknowledge the reviewer’s concern regarding the discussion on the experimental results. Due to space constraints and the breadth of our experiments, some in-depth analyses were provided in the supplementary materials.

We have provided additional discussion and analysis in the rebuttal, such as Section 3 - Analysis of benchmark results. As part of a revision, we are committed to restructuring the content to allocate more space for discussing performance, shortcomings, and comparisons across benchmarks.

To reiterate the key results and discussions we aim to highlight:

  • Cross-domain Performance: Our model was evaluated across diverse benchmarks (nuPlan, nuScenes, Waymo, CARLA), in both open- and closed-loop settings. This broad evaluation supports the model’s generality.

  • Generality and Simplicity: Unlike many baselines relying on privileged inputs (LiDAR, HD maps, object-level labels), our approach depends solely on camera inputs, language, and trajectory supervision, achieving competitive results without dataset-specific engineering or additional sensors.

  • Strength in Long-Tail Scenarios: Our framework effectively integrates language inputs and transfers extensive world knowledge from large language models to driving actions, leading to improved performance in challenging long-tail scenarios and achieving the highest RFS (Spotlight) scores on the Waymo E2E driving dataset (the most difficult cases).

We will revise the paper to emphasize these analyses and incorporate a clearer discussion of where our model performs well or falls short compared to existing methods, and why.

2. Novelty and Contributions

We respectfully disagree with the assessment that the work lacks novelty, and would like to further clarify our contributions:

  • Action Tokenization for Driving Systems: As emphasized in our rebuttal, existing action tokenization methods designed for fully actuated, holonomic robotic systems cannot be directly applied to underactuated, nonholonomic vehicle systems. Our action tokenization method not only ensures feasible action outputs but also effectively represents the action space of underactuated, nonholonomic vehicle systems, bridging vision, language, and action into a unified learning paradigm.

  • Unified End-to-End VLA Planning Framework with Fast-Slow Thinking Paradigm: Our framework presents a novel, general, and scalable end-to-end approach that integrates planning and reasoning in a single autoregressive process. This stands in contrast to hierarchical end-to-end models [1][2] or modular structures [3]. The fast-slow thinking paradigm further enables adaptive reasoning, improving both efficiency and performance.

  • Effective Training Paradigm: Our training pipeline (SFT + RFT) and the proposed VLA framework provide new insights into using RFT to enhance planning performance and runtime efficiency in autonomous driving. To the best of our knowledge, this is the first application of GRPO for trajectory-level policy refinement in autonomous driving with vision-language models.

[1] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models, Conference on Robot Learning (CoRL), 2024

[2] Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving, Arxiv, 2024

[3] AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning, Arxiv, 2025

We believe these points across action representation, vision-language planning paradigm, and training methodology offer new insights and tools for both the VLA and autonomous driving communities.


We are grateful for the reviewer’s feedback, which has helped us better articulate the significance and novelty of our work. We sincerely hope the reviewer could reconsider their rating in light of these clarifications, as we believe our paper offers meaningful insights and contributions to the field.

Thank you once again for your time and constructive input.

Comment

I thank the authors again for the response. However, repeating more or less the same points again is not helpful. I think we have come to the end of this discussion. Similar to reviewer 5HFo, I still have concerns with respect to novelty and significance. Tokenization has already been applied very similarly in robotics, and AD is just a special form of robotics from my point of view. A unified E2E VLA with fast-slow thinking isn't new, and you have referenced previous works that do exactly what you propose. The same goes for the effective training paradigm: you are using training paradigms that have been applied similarly in such systems before. As for the discussion of results, I do not think the presented results show the claimed generality, since you rely on cameras, as you state yourselves.

Thus, I will maintain my rating, as explained before, and will now finalize the review.

Comment

Thank you once again for the additional clarification. While we understand that you are finalizing your review, we appreciate the opportunity to respond now that your specific concerns are more clearly stated. We would like to respectfully highlight several points of clarification that we believe underscore the novelty and significance of our work. Our intention was not to reiterate without substance, but to ensure key contributions were properly communicated. In this response, we aim to provide clearer differentiation from prior work and emphasize the technical insights of our approach.

  1. Action Tokenization in Robotics vs. Autonomous Driving

Discrete tokenization has become a common technique in VLA models to integrate actions into VLMs for outputting control commands. This approach has been used in robotics, as seen in works like RT-1 [1], OpenVLA [2], and FAST [3]. Although autonomous vehicles are a specialized form of robotics, they differ significantly in terms of system dynamics and control constraints. Specifically, autonomous driving involves nonholonomic, underactuated systems with different requirements than robotic manipulators.

We show that tokenization methods developed for robotics do not perform well in autonomous vehicles. Therefore, we utilize a domain-specific tokenization scheme (K-Disk) that accounts for the vehicle’s kinematics and trajectory feasibility. As shown in our experiments, this results in lower reconstruction errors and better downstream planning performance compared to RT-1 and FAST tokenization. We believe this domain adaptation and its impact represent a meaningful novelty in VLA models for autonomous vehicles.

[1] RT-1: Robotics Transformer for Real-World Control at Scale. RSS 2023.

[2] OpenVLA: An Open-Source Vision-Language-Action Model. CoRL 2024.

[3] FAST: Efficient Action Tokenization for Vision-Language-Action Models. RSS 2025.

  2. Fast-Slow Planning Paradigm: Implementation Difference

The fast-slow thinking paradigm is an established concept in cognitive science and AI systems [4]. We referred to one previous work (DriveVLM [5]) using this concept for autonomous vehicles. However, we emphasize that our implementation is significantly different from it. In DriveVLM, fast and slow processes are handled by two separate models, for example, a conventional BEV-based system for the fast component and a VLM for the slow component. This dual-model approach can introduce conflicts when switching between modes, require more training resources, and have significant engineering complexity.

In contrast, our method integrates both fast and slow thinking within a unified autoregressive framework. We also use RL post-training to enable the model to improve its behaviors and adaptively switch between modes based on reward feedback. This unified design not only simplifies deployment but also provides improved performance and inference efficiency.

[4] Thinking fast and slow in AI. AAAI 2021.

[5] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. CoRL 2024.

  3. Effective Training Paradigm and Use of RFT

While post-training of VLMs/LLMs using GRPO-style reinforcement learning has been explored in language generation, to the best of our knowledge, we are the first to apply reinforcement fine-tuning to trajectory-level planning in VLA models for autonomous driving.

Our results show that RFT significantly improves planning performance and enables behavior shaping (e.g., encouraging fast thinking where possible). This adaptation of a language modeling paradigm to the AD domain is non-trivial and demonstrates practical utility.

  4. Generality and Cross-Domain Generalization

We clarify that the generality of our approach is demonstrated by the ability of a single model to work effectively across multiple datasets and benchmarks (nuPlan, Waymo [Vision-only End-to-End Driving], nuScenes, CARLA [Closed-Loop Driving]) without requiring architecture changes or domain-specific tuning. The same model generalizes to different environments, tasks, and datasets, which is rarely achieved or showcased in existing end-to-end driving models. While we rely on camera input, this is a deliberate design choice to ensure scalability and accessibility. Prior methods often rely on LiDAR, HD maps, or object-level annotations, which we explicitly avoid to ensure generality across varied datasets and settings.


We sincerely hope that these clarifications help better communicate the contributions and novelty in our work. Again, we appreciate the time and consideration the reviewer has given to our work. We hope that the above clarifications help position our contributions in a more precise light, even if they do not change the final evaluation.

Thank you again for your time and consideration.

Official Review (Rating: 5)

The paper introduces AutoVLA, a novel Vision-Language-Action (VLA) framework for end-to-end autonomous driving. AutoVLA integrates reasoning and action generation within a single autoregressive model, supporting dual thinking modes: "fast thinking" for direct trajectory generation in straightforward scenarios and "slow thinking" with chain-of-thought (CoT) reasoning for complex situations. The training involves supervised finetuning (SFT) with trajectory and reasoning data, followed by reinforcement finetuning (RFT) to optimize planning performance and reduce unnecessary reasoning. The first highlight of the paper is its strong planning/driving performance on both real-world (nuPlan, nuScenes, Waymo) and simulated (CARLA) datasets. The second highlight is its ability to improve planning/driving performance when reasoning is involved, which is a capability not demonstrated by previous VLM-based frameworks.

Strengths and Weaknesses

Strengths:

  1. AutoVLA unifies reasoning and action generation in a single model, addressing limitations of prior models that often produce infeasible actions or rely on complex intermediate representations.

  2. The model can dynamically switch between fast and slow thinking modes to balance efficiency and accuracy. This is a feature not explored in previous work.

  3. The paper evaluates AutoVLA across various benchmarks (nuPlan, nuScenes, Waymo, CARLA), demonstrating strong planning and driving performance in both open-loop and closed-loop settings.

  4. AutoVLA also achieves significantly stronger performance when incorporating reasoning in the language space, as shown in Figure 6 and Table 1. This suggests that the planning and language branches are closely related and complementary. This is a relationship that previous papers have not explored.

Weaknesses:

  1. While the paper emphasizes the reasoning capabilities of AutoVLA, it does not quantitatively evaluate AutoVLA's performance on language-related benchmarks like DriveLM. Since DriveLM is also based on nuScenes and CARLA, including such results would further strengthen the paper and better validate the model’s reasoning ability.

  2. The previous problem raises a second question. The reinforcement fine-tuning (RFT) process penalizes long language outputs with the $r_\text{CoT}$ term, which improves efficiency but may compromise reasoning depth. A more systematic evaluation or additional case studies would help clarify the trade-offs involved.

  3. Since AutoVLA is trained on the Carla-Garage dataset for Table 2 according to Line 230 (which adopts PDMLite as the expert planner), it would be more appropriate to include comparisons with stronger baselines trained on PDMLite-gathered datasets (e.g., TF++ and SimLingo), rather than only comparing against weaker baselines trained on the Bench2Drive dataset.

  4. As the model employs a large vision encoder to process multiple camera views across several frames, this inevitably results in high computational cost and inference latency. A latency up to 1–2 seconds would make the model impractical for real-world deployment. It would also be helpful to report latency statistics in more detail (such as maximum, minimum, and average latencies for the two thinking modes) rather than only providing an average value for the entire dataset.

Questions

Overall, I think AutoVLA is a good paper. Please refer to the weaknesses section for questions. I only have two extra questions regarding the teacher model: how does the teacher model Qwen2.5-VL-72B handle the vehicle's ego status information? Are there any cases where the teacher model fails to make a safe decision based on the textual velocity, acceleration, and historical actions?

Limitations

Please refer to the weakness section.

Final Justification

The authors have provided detailed comparisons with other baselines and ablation studies addressing efficiency concerns, and I will therefore maintain my current rating of accept.

Formatting Concerns

I have no concerns regarding the formatting.

Author Response

1. Evaluation on language-related benchmarks

We thank the reviewer for the suggestion of evaluating the model’s reasoning ability on language-related benchmarks. While we agree that this could offer additional insights, we would like to clarify that our model is not trained for VQA tasks typically assessed by DriveLM. Instead, we repurpose the DriveLM dataset to enable structured CoT reasoning in support of action planning. As a result, direct evaluation using the DriveLM benchmark is not well aligned with the functions of our model.

Moreover, language reasoning performance in our setting is primarily determined by the underlying VLM and its training corpus. Thus, it does not serve as a key differentiator in our work. Our contribution lies in augmenting VLMs with planning capabilities for autonomous driving, rather than improving general language understanding.

2. Evaluation of RFT on reasoning and planning

We thank the reviewer for raising this question regarding the trade-off between reasoning depth and efficiency in the RFT process.

To address the reviewer’s concern, we clarify that the CoT penalty during RFT is applied using a sigmoid-based normalization function, with a tolerance length $L_{tol}$ calibrated based on the average length of reasoning annotations in our training data. This design ensures that excessively long CoT outputs receive meaningful penalties, while preserving the flexibility for necessary CoT reasoning. The goal is to improve inference latency in time-sensitive planning tasks.
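A minimal sketch of one such sigmoid-based penalty is shown below. The sharpness parameter and the weighting against the task reward are assumptions; the paper's exact formula is not reproduced here.

```python
import numpy as np

def cot_length_penalty(num_cot_tokens, l_tol, sharpness=0.05):
    """Sigmoid-shaped penalty: near 0 for reasoning shorter than the
    tolerance length l_tol, saturating toward 1 for much longer chains.
    sharpness is an assumed hyperparameter controlling the ramp."""
    return 1.0 / (1.0 + np.exp(-sharpness * (num_cot_tokens - l_tol)))

# Assumed combination with the task reward, e.g.:
#   reward = pdms - lam * cot_length_penalty(len_cot, l_tol)
```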

To evaluate the trade-off, we conducted an ablation study comparing the model’s performance with and without the CoT penalty. Due to time limitations during the rebuttal period, we trained for 1,500 steps and compared against a model trained with the same settings plus the CoT length penalty. The results in the table below show that adding a CoT length penalty in RFT yields better planning performance and significantly reduces inference latency.

| Model | PDMS | Avg. Runtime (s) |
|---|---|---|
| No CoT Penalty | 85.21 | 3.840 |
| CoT Penalty | 86.65 | 1.312 |

We will include expanded results and analysis in the revised paper.

3. Comparison with stronger methods on CARLA

We thank the reviewer for pointing out the need for more comparisons. We compare AutoVLA with methods trained on PDMLite-gathered datasets. We also carefully tuned the low-level controller, which we found to be a critical factor in tracking the planned trajectory. As a result, AutoVLA’s driving score improved from 78.84 to 80.12. Our updated results demonstrate that AutoVLA is comparable to these state-of-the-art models.

Specifically, while Transfuser++ employs LiDAR inputs and multimodal feature fusion, AutoVLA is a vision-only model. Similarly, SimLingo utilizes approximately 650k samples from different sources, whereas AutoVLA is trained with only 289k samples. Despite these differences, AutoVLA achieves comparable performance in closed-loop driving.

| Method | Driving Score | Success Rate | Efficiency | Comfortness |
|---|---|---|---|---|
| TCP-traj | 63.45 | 37.79 | 228.46 | 30.76 |
| Transfuser++ | 84.21 | 67.27 | -- | -- |
| SimLingo | 85.07 | 67.27 | 259.23 | 33.67 |
| AutoVLA | 80.12 | 61.36 | 175.65 | 38.88 |

4. Computational cost and inference latency

We acknowledge the concern regarding inference latency, which may be a limiting factor of our current system. However, we note that the current system has not been optimized for runtime efficiency, as our primary focus in this work is to demonstrate the model's reasoning and planning capabilities. We plan to apply model quantization and other techniques to improve inference speed, which have already been widely explored for LLMs. We believe that with such runtime optimizations, the model can achieve near real-time performance even with slow reasoning.

We report more latency statistics over 1,500 test samples from the nuPlan dataset. The results show that the base SFT model relies more heavily on slow thinking, resulting in higher latency. Our proposed RFT method effectively reduces unnecessary slow reasoning while preserving its use in complex cases, leading to significantly improved runtime efficiency.

| Method | Minimum (s) | Maximum (s) | Avg. Runtime (s) | Fast Thinking (%) | Slow Thinking (%) |
|---|---|---|---|---|---|
| SFT | 0.997 | 13.706 | 3.951 | 66.8 | 33.2 |
| RFT (w/ CoT length penalty) | 0.999 | 12.964 | 1.312 | 96.6 | 3.4 |

Additionally, we provide a breakdown of runtimes by thinking mode. The results show that fast thinking is significantly faster and more stable than slow thinking.

| Thinking Mode | Minimum (s) | Maximum (s) | Avg. Runtime (s) |
|---|---|---|---|
| Fast Thinking | 0.997 | 1.116 | 1.072 |
| Slow Thinking | 7.607 | 13.706 | 10.518 |

5. Teacher model

We appreciate the reviewer’s suggestion to provide more details regarding our teacher model. Detailed descriptions of the teacher model and visualizations of the reasoning data annotations are included in the Supplementary Material. Our teacher model incorporates ego vehicle status (velocity, acceleration, historical actions) as language-based inputs.

Due to rebuttal time constraints, we randomly selected 60 Waymo scenarios to conduct an ablation study comparing inference accuracy with and without ego status input. Human verification revealed that excluding ego status decreased accuracy from 86.7% to 78.3%. Supplementary Material Figure S3 (scenario 1) exemplifies a situation where ego states are essential: the vehicle has already stopped at a stop sign, so it should accelerate and execute a left turn rather than remain stopped.

Furthermore, our method integrates ground-truth driving actions as explicit hints to guide the generation of causal explanations. These ground-truth actions are generated by manually defined rules based on ground-truth trajectories. Consequently, the final decisions of our model can strictly adhere to ground-truth driving actions, achieving a human-verified accuracy of 93.6% across a randomly selected set of 3,862 Waymo scenarios. The error comes from the inherent limitation of the rule-based approach, particularly in handling complex multi-stage behaviors commonly encountered in challenging Waymo scenarios, such as accelerating from a stop sign and subsequently decelerating for pedestrians. Incorrect predictions identified during human verification are manually removed or revised to ensure accuracy.

Comment

Thanks for the rebuttal. The authors have addressed my major concerns, and I will maintain my current rating.

Comment

We sincerely appreciate your thoughtful and constructive feedback. Thank you for recognizing the key strengths of AutoVLA, including its unified reasoning and planning framework, dual thinking modes, and strong performance on benchmarks.

We are glad that our responses have addressed your concerns, and we are grateful for your support in recommending the acceptance of our work.

We will revise the final version of the paper carefully in line with your suggestions.

Official Review (Rating: 4)

The paper presents AutoVLA, an end-to-end vision–language–action (VLA) architecture for autonomous driving that (i) tokenises continuous vehicle motion into a learned “physical-action codebook” and feeds those tokens directly into a pretrained vision–language model (Qwen2.5-VL-3B) so that one autoregressive decoder jointly produces chain-of-thought (CoT) reasoning and a 5 s trajectory; (ii) trains with mixed supervision—trajectory-only (“fast thinking”) and trajectory + CoT (“slow thinking”); and (iii) applies reinforcement fine-tuning with Group-Relative Policy Optimisation (GRPO), using nuPlan PDMS and an explicit CoT-length penalty, to improve planning performance and suppress unnecessary reasoning. Experiments on nuPlan-NAVSIM, nuScenes, Waymo-E2E and CARLA Bench2Drive show competitive or SOTA open-loop and closed-loop metrics; ablations demonstrate gains from CoT, data scaling, action tokenisation and RFT.

Strengths and Weaknesses

Strengths

  1. Contributes a 2048-entry motion codebook and integrates it as additional vocabulary inside the LLM, avoiding extra decoders (Sec. 3.1, Table 3). Produces physically feasible trajectories directly, outperforming text-waypoint outputs in L2/collision and latency.

  2. Mixes reasoning-augmented and reasoning-free samples (Sec. 3.3) so the same model can adaptively emit CoT when needed (Fig. 6c). Ablation (Fig. 4) shows that with sufficient data (>50 k) CoT improves PDMS and collision metrics.

  3. Extends RFT beyond meta-actions (AlphaDrive) to full trajectory tokens (Sec. 3.4), achieving +10.6 % PDMS and −66 % runtime on NAVSIM (Fig. 6a). Provides evidence that penalising token length indeed triggers “fast thinking”.

  4. Four real/sim datasets; both open-loop (nuPlan, nuScenes) and closed-loop (Bench2Drive). AutoVLA (Best-of-N) tops NAVSIM PDMS (92.1) and achieves best driving score & success rate on Bench2Drive (Table 2).

Weaknesses

  1. The authors only validate the effectiveness of AutoVLA in the autonomous driving scenario and do not consider robotic scenarios, which would more prominently highlight the importance of the action tokenization proposed in the paper.
  2. The size of the action codebook (K=2048) is empirically chosen, without conducting relevant ablation studies or verifying its dynamic feasibility.
  3. More meticulous and in-depth experiments and analyses are needed for the fast and slow systems. There are no relevant quantitative results in the main text, and the results in the appendix do not show significant advantages.

Questions

Please refer to the "Weakness"

Limitations

yes

Final Justification

The authors have provided analysis and explanations related to application scenarios, parameter experiments of the codebook, and ablation studies of the fast-slow system. However, they have not yet provided experiments in robotic scenarios, although this does not affect my judgment. I hope there could be some preliminary results in robotic scenarios to support the reliability of the method.

Formatting Concerns

None

Author Response

1. Robotic scenario

We thank the reviewer for raising this point. While our work focuses on autonomous driving, the underlying ideas of AutoVLA are indeed inspired by and relevant to broader robotic scenarios.

We would like to clarify that the motivation for our action tokenization arises from the unique challenges posed by vehicle dynamics, which differ significantly from typical manipulation robots (e.g., RT-1 [1], FAST [2]). Most robotic platforms in those settings are holonomic and fully actuated, allowing for independent per-dimension action control at each timestep. In contrast, vehicles are underactuated, nonholonomic systems with kinematic constraints that prevent arbitrary motion, such as instantaneous lateral movements.

To address this, our action tokenization method encodes short, dynamically feasible trajectory segments in terms of position and heading, enabling smooth and physically consistent vehicle motion. Nevertheless, we agree that evaluating AutoVLA in general robotic scenarios would further highlight the broader utility of the proposed framework. Extending our framework to robot navigation represents a promising direction, and we plan to pursue this in future work.

[1] Brohan et al. "RT-1: Robotics transformer for real-world control at scale." RSS 2023.

[2] Pertsch et al. "Fast: Efficient action tokenization for vision-language-action models." RSS 2025.

2. Action codebook size

We appreciate the reviewer’s question regarding codebook size. The codebook size $K=2048$ used in our AutoVLA framework was selected based on empirical evaluations that balance trajectory reconstruction accuracy with training stability. We conducted comparative evaluations of reconstruction accuracy using various codebook sizes across multiple tokenization methods, including RT-1, FAST, and our K-disk tokenization. The results demonstrate that our K-disk method consistently achieves the best reconstruction accuracy across different codebook sizes.

When the codebook size is small ($K=256$), the reconstruction error is significantly higher due to limited expressiveness. As the codebook size increases beyond 1024, the improvements in accuracy become marginal, and the risk of redundant or overlapping motion patterns also increases. To strike a balance between accuracy and codebook efficiency, we selected $K=2048$ as a practical choice.

| Tokenization | Codebook Size | ADE (m) ↓ | FDE (m) ↓ |
|---|---|---|---|
| RT-1 (Action Bin) | 256 | 0.1440 | 0.2942 |
| RT-1 (Action Bin) | 1024 | 0.1052 | 0.1883 |
| RT-1 (Action Bin) | 2048 | 0.1014 | 0.1775 |
| RT-1 (Action Bin) | 4096 | 0.1001 | 0.1739 |
| FAST (DCT) | 256 | 0.1708 | 0.2137 |
| FAST (DCT) | 1024 | 0.0522 | 0.0588 |
| FAST (DCT) | 2048 | 0.0281 | 0.0309 |
| FAST (DCT) | 4096 | 0.0149 | 0.0161 |
| K-disk (Ours) | 256 | 0.0687 | 0.1034 |
| K-disk (Ours) | 1024 | 0.0253 | 0.0282 |
| K-disk (Ours) | 2048 | 0.0182 | 0.0203 |
| K-disk (Ours) | 4096 | 0.0141 | 0.0155 |

We also evaluated downstream performance using the PDMS metric on the nuPlan (NAVSIM) test data. Our results show that $K=2048$ yields the highest score among the tested configurations. While increasing the size from 1024 to 2048 leads to an improvement, expanding the size from 2048 to 4096 yields a marginal decrease. The result indicates that a codebook size of 2048 is sufficient to capture the majority of meaningful motion primitives, and that a larger codebook introduces increased redundancy, making further expansion unnecessary.

Based on analyses of trajectory reconstruction accuracy and driving performance across various $K$ values, we selected $K=2048$ as our codebook size.

| Codebook Size | PDMS ↑ |
|---|---|
| 1024 | 79.47 |
| 2048 | 80.54 |
| 4096 | 80.39 |

Lastly, we emphasize that the discrete trajectory segments encoded in our codebook are derived from real-world driving data, inherently satisfying vehicle kinematic constraints. This ensures that the learned tokens are not only expressive but also physically feasible.

3. Quantitative results for the fast-slow system

We thank the reviewer for highlighting the need for a more in-depth analysis of the fast-slow thinking system. We have conducted additional experiments and provided detailed quantitative results to support the effectiveness of our approach. The advantage of the fast-slow thinking system is its ability to maintain or improve planning performance while significantly reducing unnecessary slow reasoning, thereby enabling efficient inference. Our results in the main paper demonstrate that the RFT with the CoT length penalty training achieves better planning performance while substantially reducing average inference time.

We add results for RFT without the CoT length penalty, which shows slightly worse planning performance and inferior runtime efficiency due to excessive slow reasoning. This highlights the advantage of our training method for fast-slow systems: by incorporating a CoT length penalty, we guide the model to use slow reasoning only when necessary, achieving better overall runtime efficiency.

| Model | PDMS | Avg. Runtime (s) |
|---|---|---|
| SFT | 80.54 | 3.951 |
| RFT (w/o CoT length penalty) | 85.21 | 3.840 |
| RFT (w/ CoT length penalty) | 86.65 | 1.312 |

We further analyze the distribution of runtimes and mode usage across test samples. The results demonstrate that our proposed post-training method (RFT w/ CoT length penalty) significantly improves inference efficiency by reducing reliance on slow thinking while preserving reasoning capability when needed.

| Method | Minimum (s) | Maximum (s) | Avg. Runtime (s) | Fast Thinking (%) | Slow Thinking (%) |
|---|---|---|---|---|---|
| SFT | 0.997 | 13.706 | 3.951 | 66.8 | 33.2 |
| RFT (w/o CoT length penalty) | 1.002 | 13.471 | 3.840 | 68.4 | 31.6 |
| RFT (w/ CoT length penalty) | 0.999 | 12.964 | 1.312 | 96.6 | 3.4 |

We also provide a breakdown of runtime by thinking mode across the testing samples. Fast thinking is significantly faster and more stable than slow thinking, while slow thinking, though more time-consuming, can be kept for complex scenarios.

| Thinking Mode | Minimum (s) | Maximum (s) | Avg. Runtime (s) |
|---|---|---|---|
| Fast Thinking | 0.997 | 1.116 | 1.072 |
| Slow Thinking | 7.607 | 13.706 | 10.518 |

These results demonstrate the advantage of our fast-slow thinking system. The ability to switch between fast and slow thinking enables AutoVLA to respond efficiently in straightforward scenarios while retaining the capacity for deeper reasoning when required.

Comment

The authors have effectively alleviated my concerns. I maintain the rating. Besides, it would be appreciated if some experiments in robotic scenarios could be provided, even if they are preliminary.

Comment

We thank the reviewer for the positive evaluation and are glad that our responses have addressed your concerns. We agree that robotic experiments would further strengthen our work. Due to current hardware and integration constraints, we are unable to include them at this stage. We consider this an important next step and plan to conduct real-world robotic deployments as part of our future work. We greatly appreciate your time and thoughtful feedback throughout the review process.

Official Review (Rating: 4)

This paper presents an end-to-end autoregressive model for autonomous driving. The proposed model is built upon a pre-trained VLM backbone that performs adaptive reasoning and predicts trajectories, where each trajectory is tokenized into a sequence of physically feasible tokens. The authors first adopt supervised fine-tuning to enable the model to perform both reasoning and planning tasks, and then apply reinforcement fine-tuning to train the model to avoid unnecessary reasoning. Experiments conducted on multiple datasets validate the proposed method.

Strengths and Weaknesses

Strengths:

  1. The paper is well-written and well-organized, and it is easy to follow and understand.
  2. The proposed VLA is a straightforward and technically sound extension of the VLM. The experiments are comprehensive and demonstrate that the method is promising in certain cases.
  3. The proposed adaptive reasoning is interesting and practical in terms of improving efficiency.

Weaknesses:

  1. The ablation study on the text action output is not sufficient to demonstrate the effectiveness of the proposed action tokenization method. There are multiple advanced action tokenization schemes, such as the per-dimension, per-timestep binning schemes of [1] and [2], which are not compared.
  2. In reinforcement fine-tuning, the authors formulate the planning problem as a VQA task, where the output actions are not executed and therefore do not affect subsequent observations. This formulation is misleading and potentially limits the model's closed-loop planning performance. It is also unclear why the authors did not use RFT in closed-loop experiments.
  3. The significance of the proposed framework is difficult to assess. Considering the model size and inference time, the model doesn't bring much improvement and sometimes performs worse in the benchmark evaluations. It would be helpful if the authors could provide the model sizes and FPS of all compared methods. Moreover, the proposed CoT training degrades performance in simple scenarios like nuScenes, which is quite counterintuitive. Regardless of efficiency, CoT should help the model make better decisions, so why does it hurt performance in this case? It is also unclear to me whether RFT can mitigate this issue.

[1] Brohan et al. "RT-1: Robotics transformer for real-world control at scale." arXiv 2022.

[2] Pertsch et al. "Fast: Efficient action tokenization for vision-language-action models." RSS 2025.

Questions

As noted in the weaknesses section, I would appreciate it if the authors could address the following questions:

  1. How do the authors justify the effectiveness of the proposed action tokenization method, given that it is only compared to a simple text action output?
  2. Why is RFT not used in closed-loop experiments? Could RFT improve performance on the nuScenes dataset when using CoT?
  3. Can the authors provide additional evidence to support the significance of the proposed framework?

Limitations

yes

Final Justification

The work makes a certain contribution by presenting a straightforward extension of VLM for planning tasks, supported by extensive experiments. However, it still has limitations that cannot be easily addressed or overlooked, even though the authors have provided explanations. Therefore, I will keep my rating as borderline accept.

Formatting Concerns

no

Author Response

1. Action tokenization

We thank the reviewer for the comment regarding the evaluation of our action tokenization method. As suggested, we compare our tokenization approach with other advanced tokenization methods used in robotic control, including action bins in RT-1 [1] and the DCT-based method in FAST [2]. The table below summarizes the accuracy of each method. Specifically, we evaluate trajectory reconstruction accuracy (ADE/FDE) across 120K trajectories from the nuPlan dataset, using a consistent codebook size of 2048.

[1] Brohan et al. "RT-1: Robotics transformer for real-world control at scale." RSS 2023.

[2] Pertsch et al. "Fast: Efficient action tokenization for vision-language-action models." RSS 2025.

It is important to first emphasize the structural differences between robotic manipulators (e.g., in RT-1 and FAST) and autonomous vehicles. Robotic arms are typically fully actuated, holonomic systems capable of executing independent actions per dimension at each timestep. However, vehicles represent classical underactuated, nonholonomic systems with three degrees of freedom (i.e., position $x$, $y$, and heading) controlled indirectly via acceleration and steering rate. These systems are constrained by kinematics and cannot perform arbitrary motions. Therefore, our approach instead represents actions as short, feasible trajectory segments encompassing position $x$, $y$, and heading, enabling the generation of smoother and physically feasible trajectories for vehicle systems.

For the RT-1 tokenization method, we discretize acceleration and steering rate using uniform action bins and apply a kinematic model to reconstruct the long-term trajectory. However, since we only have trajectory-level data and must derive control actions from it indirectly, this binning approach leads to the highest reconstruction error. In contrast, our method achieves the lowest ADE and FDE, demonstrating its ability to represent trajectories accurately. The FAST method performs comparably in terms of reconstruction accuracy, but it uses variable-length token sequences (typically 1 to 25 tokens per trajectory), which complicates the selection of appropriate token lengths and makes fixed-horizon trajectory prediction more challenging.

| Tokenization | ADE (m) ↓ | FDE (m) ↓ |
|---|---|---|
| RT-1 (Action Bin) | 0.101 | 0.178 |
| FAST | 0.028 | 0.031 |
| K-disk (Ours) | 0.018 | 0.020 |
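To illustrate the indirect reconstruction step described above, the sketch below rolls binned (acceleration, steering-rate) controls through a kinematic bicycle model; the wheelbase, timestep, and zeroed initial pose are assumed values, not the paper's configuration.

```python
import numpy as np

def rollout_bicycle(accels, steer_rates, v0, wheelbase=2.8, dt=0.1):
    """Integrate per-step (acceleration, steering-rate) controls through a
    kinematic bicycle model to recover an (x, y, heading) trajectory."""
    x = y = yaw = steer = 0.0
    v = v0
    states = []
    for a, sr in zip(accels, steer_rates):
        v += a * dt                                 # longitudinal dynamics
        steer += sr * dt                            # steering integration
        yaw += v * np.tan(steer) / wheelbase * dt   # heading update
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        states.append((x, y, yaw))
    return np.array(states)
```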

We also evaluate planning performance using the FAST tokenization method (without RFT) on the nuPlan (NAVSIM) data. The table below shows that our method outperforms FAST in planning metrics, including PDMS, No At-fault Collision, and Ego Progress. This is largely due to our tokens directly representing meaningful vehicle motions, whereas FAST tokens encode more abstract frequency components, which are harder to learn from limited data.

| Tokenization | PDMS ↑ | No At-Fault Collision ↑ | Progress ↑ |
|---|---|---|---|
| FAST | 65.04 | 92.23 | 65.56 |
| K-disk (Ours) | 80.54 | 96.89 | 75.82 |

2. RFT in closed-loop experiments

We thank the reviewer for the insightful comment and agree that closed-loop RFT would provide a better assessment of long-term planning performance. However, closed-loop fine-tuning using existing open-source driving simulators (e.g., CARLA) is computationally prohibitive due to their limited simulation throughput and system bottlenecks. Furthermore, the primary goal of our RFT approach is to improve trajectory planning accuracy and running efficiency, rather than to explicitly address robustness to distributional shifts. In future work, we plan to enhance the model architecture and develop more efficient simulation environments to enable closed-loop RFT and validate its capability.

3. Benchmark evaluation and significance

We appreciate the reviewer’s concern regarding the significance of our proposed framework. We need to clarify that the primary goal of our work is to propose a general and scalable framework that can operate across a wide range of autonomous driving tasks, rather than optimizing for a single benchmark or dataset. Accordingly, our evaluations are intentionally broad, covering diverse datasets such as nuScenes, nuPlan, Waymo, and CARLA, without relying on dataset-specific engineering or privileged modalities. Our method consistently demonstrates competitive results across different datasets and benchmarks, which we argue is a meaningful indicator of its generalization ability.

It is important to note that many competing methods in the benchmarks are specialized and benefit from LiDAR sensors or privileged information (e.g., object-level annotations and map priors). In contrast, our approach is purely end-to-end, relying only on raw visual inputs and language and action supervision, which aligns with our goal of developing a more unified and scalable autonomous driving paradigm.

Moreover, our results reveal clear data scaling trends: the performance of our model improves with increased data, suggesting that its full potential has not yet been realized under the current limitations of available open-source datasets.

Comparison of model sizes and FPS

Many baseline methods do not report model parameters or FPS, making direct comparison difficult. We have included our model's size and runtime performance in the table below, and will add those results in the revised paper. We acknowledge that model size and inference speed are limitations. However, we note that the current system has not been optimized for efficiency, as our primary focus is to demonstrate the model's reasoning and planning capabilities. We also show that RFT enables adaptive thinking modes and improves performance and inference speed.

| Methods | Input | Training | PDMS ↑ | Model Size | FPS | GPU |
|---|---|---|---|---|---|---|
| TransFuser | Cameras, LiDAR | Object-level Annotation, GT Trajectory, HD Map, Semantic Map | 83.88 | 641.3 MB | 6.67 | NVIDIA A100 |
| DiffusionDrive | Cameras, LiDAR | Object-level Annotation, GT Trajectory, HD Map | 88.10 | -- | -- | -- |
| Hydra-MDP | Cameras, LiDAR | Object-level Annotation, GT Trajectory, Scoring | 91.26 | -- | 2.80 | NVIDIA A100 |
| Centaur | Cameras, LiDAR | Object-level Annotation, GT Trajectory, Human Feedback | 92.10 | -- | 3.58 | NVIDIA A100 |
| TrajHF | Cameras, LiDAR | Object-level Annotation, GT Trajectory, Human Feedback | 93.95 | -- | -- | -- |
| AutoVLA | Cameras, Language | GT Trajectory, CoT, Reward Feedback | 89.11 | 7.6 GB | 0.77 | NVIDIA L40S |

Notably, our training method is also applicable to smaller VLMs, which can improve runtime performance in deployment. We also plan to explore LLM quantization in future work to enable near real-time inference even with slow reasoning.

4. CoT training in nuScenes

Thank you for pointing out the performance degradation with CoT training. After careful examination, we found that this issue stems from the generation parameters used across different settings. For models trained with CoT, we adopted more diverse and random generation parameters (Temp=0.8, Top-p=0.5, Top-k=20) to encourage more reasoning outputs. In contrast, the action-only models employed more deterministic settings (Temp=0.01, Top-p=0.001, Top-k=1). However, these diversity-promoting generation parameters introduce increased randomness at inference, which negatively impacts trajectory accuracy metrics such as L2 error on the nuScenes dataset.

We conducted an additional experiment to examine the influence of generation parameters. The results indicate that when CoT models are evaluated using the same generation parameters as the action-only models, the performance becomes comparable, with no significant difference. We will revise the paper accordingly to clarify this explanation and include the new experimental results.

| Model | Generation Parameters | L2 Error (m) ↓ | Collision Rate (%) ↓ |
|---|---|---|---|
| Action-only | Temp=0.01, Top-p=0.001, Top-k=1 | 0.70 | 0.31 |
| Action-only | Temp=0.8, Top-p=0.5, Top-k=20 | 0.83 | 0.38 |
| CoT | Temp=0.8, Top-p=0.5, Top-k=20 | 0.85 | 0.35 |
| CoT | Temp=0.01, Top-p=0.001, Top-k=1 | 0.71 | 0.30 |
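For concreteness, the two decoding configurations correspond to standard sampling arguments, shown here in Hugging Face `generate`-style keyword form; the surrounding model and input setup are omitted, and the call is illustrative rather than the authors' evaluation script.

```python
# Near-deterministic decoding used for the action-only evaluation
action_only_cfg = dict(do_sample=True, temperature=0.01, top_p=0.001, top_k=1)

# Diversity-promoting decoding used for the CoT evaluation
cot_cfg = dict(do_sample=True, temperature=0.8, top_p=0.5, top_k=20)

# e.g.: output_ids = model.generate(**inputs, max_new_tokens=512, **cot_cfg)
```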

Regarding RFT on nuScenes, the core of RFT is a verifiable reward function. Unlike traditional displacement-based metrics (e.g., ADE), which merely enforce adherence to reference trajectories, the reward function necessitates comprehensive evaluation encompassing comfort, safety, and efficiency. The NAVSIM platform within nuPlan provides such a comprehensive evaluation platform with the PDMS metric. Conversely, nuScenes (designed for perception tasks) lacks a comparable planning evaluation framework, limiting the exploration of RFT. Additionally, nuPlan provides a larger and more diverse driving dataset (178K samples), whereas nuScenes contains only 25K samples from two urban locations that nuPlan also covers (Boston and Las Vegas). Therefore, we argue that RFT's performance benefits can generalize from nuPlan to the nuScenes dataset.

Comment

I appreciate the authors' time and effort in addressing my questions. While many of my concerns have been resolved, the concern regarding the significance of the work still remains. Specifically, the authors highlight the generalization ability of the proposed framework across different datasets, but it seems that the model still needs to be trained separately for each dataset. The authors also argue that other methods require object-level annotations for training. However, the proposed method also relies on CoT for training, which is not more cost-effective in my view. Lastly, considering the slow inference speed and the underwhelming performance, the potential impact of this work appears to be quite limited. Therefore, I will keep my current rating.

Comment

We sincerely appreciate your feedback and your recognition of our efforts in addressing the earlier concerns. We respectfully respond to the remaining points regarding the significance and potential impact of our work, with the aim of further clarifying the contributions of our framework.

  1. Generalization and Dataset-Specific Training

While we fine-tune the model on individual datasets to improve performance for specific benchmarks, we emphasize that the architecture, training method, and action tokenization remain consistent across datasets. This stands in contrast to many existing approaches that require extensive dataset-specific engineering, such as custom sensor fusion modules, object detection pipelines, or HD map integration.

In our work, a single unified model is evaluated across diverse datasets and benchmarks (including nuPlan, Waymo Vision-only End-to-End Driving, nuScenes, and CARLA Closed-loop Driving) without architectural modifications or domain-specific tuning. This level of cross-domain generalization is rare among end-to-end driving models, and significantly reduces engineering effort while enhancing scalability.

  2. Cost-Effectiveness and CoT Training

Although CoT training still introduces a form of supervision, it is fundamentally different from object-level annotations or HD maps, which require costly manual labeling and dedicated infrastructure.

  • Our CoT supervision is generated automatically using capable vision-language models, with only minimal human verification of textual descriptions, rather than manual annotation across different tasks and sensor configurations.

  • Notably, language (CoT) can be a general abstraction of driving perception and policy, allowing adaptation across various driving scenarios and sensor modalities.

  • Furthermore, our framework effectively integrates language-based instructions, aligning world knowledge in large language models with physical driving actions.

We show that this leads to improved performance in long-tail scenarios, and achieves the highest RFS (Spotlight) score on the Waymo End-to-End Driving benchmark, which contains the most challenging cases.

  3. Significance and Potential Impact

We acknowledge the current limitation in inference speed and have discussed it explicitly. However, we would like to emphasize that this work is an important step toward bridging large vision-language models with trajectory planning for autonomous driving, unifying perception, reasoning, and planning in a single model.

  • Our RFT training paradigm shows strong potential for improving both planning performance (11% increase in planning score) and efficiency (67% decrease in runtime), and for grounding physical actions in language-based reasoning.

  • Additionally, our results show clear data scaling trends, suggesting further performance gains with larger and more diverse datasets.

We believe these aspects indicate substantial long-term promise despite current limitations.


We hope these clarifications help better convey the significance and contributions of our work. We would be very grateful if you would consider these additional points in evaluating your assessment.

Thank you again for your time and consideration.

Comment

I thank the authors for their clarification. I acknowledge the contribution of the work, which presents a straightforward extension of VLM for planning tasks with extensive experiments. Thus, I am inclined to borderline accept the work. However, the work still has limitations that cannot be easily addressed or overlooked, even though the authors provided explanations. Therefore, I cannot raise my score further.

Comment

We sincerely thank the reviewer for acknowledging the contributions of our work and for the positive feedback. We are committed to addressing the current limitations, particularly in runtime and deployment efficiency. We greatly appreciate your time, constructive comments, and engagement throughout the review and discussion process.

Comment

Dear Reviewers,

As the author-reviewer discussion period will end soon (Aug 6, 11:59 PM AoE), please take a moment to read the authors’ responses and post a reply - either to acknowledge their clarifications or to raise any remaining concerns.

Thank you for your time and contributions to the review process.

Best regards,

AC

Final Decision

This paper proposes AutoVLA, a vision-language-action framework for end-to-end autonomous driving that integrates trajectory planning and reasoning in a single autoregressive model. The authors introduce action tokenization, dual thinking modes (fast and slow), and reinforcement fine-tuning to improve efficiency and adaptability.

The final ratings are mixed, ranging from reject to accept (1 reject, 2 borderline accepts, 1 accept). Reviewer xmFE (accept) acknowledged the unified reasoning-planning design and found the results strong on several benchmarks. Reviewer Tsqu (borderline accept) appreciated the design of action tokenization and RFT, but requested further evaluation in broader robotic scenarios. Reviewers 5HFo (borderline accept) and zoc9 (reject) remained concerned after the rebuttal, particularly regarding the novelty, practical impact, and inference efficiency of the system.

Overall, the AC finds this paper a good attempt to integrate large vision-language models into end-to-end autonomous driving, and a well-engineered system with comprehensive experiments across multiple benchmarks. However, concerns remain regarding its practical significance: given the large model size, and unlike previous fast-slow systems that use two separate models, the dual-thinking design still lacks real-time applicability. While some concerns remain, the paper still presents a well-executed contribution toward applying VLA in autonomous driving, and the AC recommends acceptance.