Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)
A Successful Reinforcement Learning Attempt for End-to-End Autonomous Driving in CARLA v2
Abstract
Reviews and Discussion
This paper introduces Raw2Drive, the first model-based reinforcement learning framework designed for end-to-end autonomous driving using raw sensor inputs. It addresses the inefficiencies of imitation learning and the training difficulties of RL with high-dimensional sensory data. The core contribution is a dual-stream architecture where a privileged world model guides the training of a raw sensor world model via a proposed Guidance Mechanism. The method achieves results that outperform prior approaches on the CARLA v2 and Bench2Drive benchmarks.
Strengths and Weaknesses
Strengths:
- The dual-stream MBRL architecture and the Guidance Mechanism are novel and straightforward.
- Experimental results on challenging benchmarks like CARLA v2 and Bench2Drive demonstrate both efficacy and generalizability.
- Architecture and training pipeline are well-detailed with clear figures.
Weaknesses:
- While innovative, the training phase relies on data (privileged inputs) that may not always be scalable in real-world applications: currently all data come from simulation, so there may be a sim2real gap.
- Some details of the method need more explanation: how is the stochastic state sampled from a one-hot distribution? What does the stochastic state contain?
- Despite mentioning practical implications, real-world feasibility is only lightly discussed; real-time inference constraints or transfer to real data are not tested.
- Perhaps more analysis and discussion is needed: the paper says that the multi-head Raw Sensor World Model struggles to achieve stable convergence during training. Does the Privileged World Model have the same issue?
Questions
- How does Raw2Drive perform with out-of-distribution data or in other simulators or real-world scenarios?
- How can the privileged stream be practically realized in non-simulated settings where ground-truth semantics or HD-maps are not perfectly available?
- How is the stochastic state sampled from a one-hot distribution? What does the stochastic state contain?
- The paper says that the multi-head Raw Sensor World Model struggles to achieve stable convergence during training. Does the Privileged World Model have the same issue?
Limitations
yes
Final Justification
All in all, I appreciate the authors' detailed responses. If the authors can resolve the sim2real gap, it would greatly enhance the value of this work, and I would give it an accept rating. In its current state, I can only give a borderline rating.
Formatting Issues
no
Thanks for your acknowledgement and kind advice. Regarding your concerns, we give responses below:
Weaknesses 1 & 3, Question 1, Discussion on Transferability in Real-World Autonomous Driving
-
We appreciate the reviewer’s concerns regarding the Transferability in Real-world Autonomous Driving.
-
About the Transferability of RL
-
Recent breakthroughs of reinforcement learning (RL) in domains such as LLMs[1], VLMs[2], and AlphaFold[3] have shown RL's unique advantages compared to SFT/imitation learning. While real-world deployment of RL remains challenging, the community has made significant efforts in high-fidelity simulation generation[4] and reconstruction[5,6], though their readiness for large-scale RL deployment remains limited.
-
Our work builds upon this trend of RL, serving as a preliminary study of applying RL to end-to-end autonomous driving. Due to the unavailability of a real-world-level authentic simulator, we use CARLA, but we argue that the scientific problem studied and the discoveries of this early exploration are general and worth sharing with the community.
-
-
About the Transferability of Raw2Drive
-
Raw2Drive utilizes a dual-stream world model to effectively decouple perception and decision-making. On the decision side, trajectory prediction is conducted based on structured perception outputs (e.g., BEV or compact representations)—a widely adopted practice in both academia and industry. On the perception side, the emphasis is on enhancing the accuracy of these BEV-based representations.
-
By leveraging the world model, our paradigm enables a fully differentiable pipeline while preserving a separation between perception and planning.
-
Furthermore, with recent progress in high-fidelity scene generation, reconstruction, and simulation technologies[4,5,6], a practical deployment path is to pretrain the world model via imitation learning and fine-tune the policy with reinforcement learning—an increasingly feasible strategy in real-world and industrial settings.
-
[1] 2024, Technical Report, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
[2] 2025, Arxiv, VLM-R1: A stable and generalizable R1-style Large Vision-Language Model
[3] 2021, Nature, Highly accurate protein structure prediction with AlphaFold
[4] 2025, ICCV, HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
[5] 2025, ICCV, BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
[6] 2025, Arxiv, ReSim: Reliable World Simulation for Autonomous Driving
Weakness 2, Question 3, Explanation of Stochastic State
- In our model, the stochastic latent state plays a key role in capturing uncertainty in world model predictions. It consists of multiple categorical variables (num=32), each represented as a one-hot vector (dim=32). These variables are parameterized by logits output from the world model. During training, the stochastic state is sampled using a Gumbel-Softmax estimator to maintain differentiability. At inference, the mode (argmax) is used to select one-hot values. The final stochastic state is the concatenation of these one-hot vectors, serving as the discrete latent representation of the environment dynamics.
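For concreteness, below is a minimal sketch of this sampling procedure, assuming a PyTorch-style implementation; the function name, shapes, and the Gumbel-Softmax temperature are illustrative assumptions rather than the authors' actual code.

```python
# Minimal sketch of the discrete stochastic latent described above (assumption-based,
# not the authors' implementation): 32 categorical variables, each a 32-way one-hot.
import torch
import torch.nn.functional as F

NUM_VARS, NUM_CLASSES = 32, 32

def sample_stochastic_state(logits: torch.Tensor, training: bool = True) -> torch.Tensor:
    # logits: (batch, NUM_VARS, NUM_CLASSES), predicted by the world model
    if training:
        # Gumbel-Softmax with hard=True gives one-hot samples while keeping
        # gradients flowing to the logits (straight-through estimator).
        one_hot = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
    else:
        # At inference, take the mode (argmax) of each categorical variable.
        one_hot = F.one_hot(logits.argmax(dim=-1), NUM_CLASSES).float()
    # Concatenate all one-hot vectors into one flat discrete latent state.
    return one_hot.reshape(logits.shape[0], NUM_VARS * NUM_CLASSES)

# Example: a batch of 4 latent states, each 32 x 32 = 1024-dimensional.
z = sample_stochastic_state(torch.randn(4, NUM_VARS, NUM_CLASSES))
```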
Weakness 3, Results about Real-time Inference
- Thank you for the suggestion. We have added latency analysis for each module. In end-to-end autonomous driving, the perception backbone (e.g., surround-view image encoder) typically dominates the overall latency. As shown below, our world model and policy are highly efficient (each under 2ms), while the raw sensor stream is mainly bottlenecked by the vision encoder (e.g., BEVFormer).
| Method | Modality | Encoder Latency (ms) | World Model Latency (ms) | Policy Latency (ms) |
|---|---|---|---|---|
| Privileged Stream | BEV State | 2 (5×Conv) | 2 | 2 |
| Raw Sensor Stream | Multi-view images | 600 (BEVFormer) | 2 | 2 |
Weakness 4 and Question 4, Discussion on Stability of World Model Training (Privileged vs. Raw Sensor)
-
We appreciate the reviewer’s concern. The observed training instability mainly affects the Raw Sensor World Model, which learns from high-dimensional, partial visual inputs. In contrast, the Privileged World Model is trained on structured BEV states that are low-dimensional, semantically aligned, and less ambiguous—leading to much faster and more stable convergence without perceptual noise or occlusion.
-
As shown in Figure 7 (Appendix F), removing rollout guidance causes the Raw Sensor model’s loss to oscillate and fail to converge. We further provide a visualization in Appendix G.2, where adjacent frames appear visually similar but correspond to drastically different rewards. This semantic ambiguity makes it hard for the Raw Sensor World Model to capture consistent dynamics, reinforcing the need for privileged guidance during training.
Question 2, Discussion on Privileged Input in non-simulated settings
- Thank you for the question. In non-simulated settings, privileged inputs are typically obtained from perception outputs—either generated by state-of-the-art detection models[7][8] or manually annotated—both of which are common practices in industry. While ground-truth semantics or HD maps may not be perfectly available, this is also the case in many real-world trajectory prediction setups[9], where models are trained on perception outputs that are imperfect but sufficiently reliable.
Our framework follows this convention and can incorporate such perception-derived privileged inputs accordingly. This makes our privileged stream design practical and compatible with real-world deployment pipelines.
[7] 2024, CVPR, UniMODE: Unified Monocular 3D Object Detection
[8] 2024, ICLR, LaneSegNet: Map Learning with Lane Segment Perception for Autonomous Driving
[9] 2025, CVPR, Truncated Diffusion Model for Real-Time End-to-End Autonomous Driving
I thank the authors for their detailed response. Some concerns are addressed, but my primary concern remains the sim2real gap (it seems I was not the only reviewer concerned about this gap). This gap is one of the key factors affecting the value of this paper. I'll keep my score; if it's accepted, please add the contents of the rebuttal to the paper.
Thank you again for your valuable comments. We hope our responses have effectively addressed the concerns raised. If there are any remaining concerns, we would be grateful for further clarification.
We sincerely thank you for your valuable comments, which have greatly helped us improve the quality and clarity of our work. We fully agree with your insightful perspective on the sim2real gap—a critical challenge for RL-based driving systems. While CARLA remains the only viable closed-loop simulator suitable for RL research at present, our work focuses on policy learning and introduces a dual-stream world model design that is conceptually decoupled from the underlying simulator. As generative or reconstruction-based simulators continue to evolve, we believe that exploring end-to-end RL within these emerging frameworks will become increasingly practical and impactful. We will incorporate these valuable discussions into the revision. Thank you again for your thoughtful feedback!
This paper proposes a novel framework that leverages reinforcement learning with world model pretraining to achieve robust and sample-efficient end-to-end driving in the CARLA simulator. The key contribution lies in the introduction of an aligned world model (AWM) that is jointly trained with a latent policy to ensure representation consistency between pretraining and downstream RL fine-tuning. By enforcing representation alignment through shared encoders and temporal consistency, Raw2Drive effectively bridges the gap between offline world model learning and online policy optimization. The method demonstrates superior performance and generalization across diverse CARLA v2 towns and weather conditions, outperforming existing model-based and model-free baselines in both success rate and driving smoothness.
Strengths and Weaknesses
Strengths:
-
The introduction of the Aligned World Model (AWM) with shared encoders and temporal consistency regularization effectively bridges the gap between offline world modeling and online RL, which improves policy stability and sample efficiency.
-
The method achieves state-of-the-art performance in CARLA v2 across challenging generalization settings.
-
By leveraging pretrained world models and aligned representations, the approach reduces the data required during the online reinforcement learning phase.
Weaknesses:
-
The evaluation is confined to the CARLA simulator, and it remains unclear how well the approach generalizes to real-world driving scenarios with noisy sensors and domain shift. How can an RL policy trained well in simulation contribute to real-world end-to-end driving?
-
While alignment is emphasized as a core novelty, the paper could benefit from more extensive ablations on the individual contributions of alignment components (e.g., encoder sharing vs. temporal loss).
-
Aligning representation spaces does not guarantee the world model remains useful during online RL updates. If the policy diverges during exploration, the pretrained world model might become misaligned or irrelevant, leading to degraded performance or instability.
-
Lack of comparison with recent RL-based methods like [1]. There is no source code.
[1] AdaWM: adaptive world model based planning for autonomous driving, ICLR, 2025
Questions
-
Is representation alignment truly necessary for performance gains, or could simpler techniques (e.g., fine-tuning with frozen encoders) suffice?
-
Does the aligned world model retain predictive accuracy throughout policy fine-tuning? World models may degrade as the policy distribution shifts.
-
Could forcing a single encoder to serve both the world model and RL agent lead to suboptimal representations for either task?
-
How fair are the comparisons to IL baselines in terms of total environment interactions and training time?
-
Does the approach generalize across input modalities or sensor configurations?
-
Is Raw2Drive compatible with safety constraints or explainability requirements for real-world deployment?
Limitations
yes
Final Justification
Raw2Drive offers interesting methodological insights into transferring the world model capability from perfect BEV perception to an end-to-end autonomous driving model. This is one of the most practical problems to be solved in the e2e autonomous driving area. Regarding my previous concerns about baseline comparison, sim2real (alignment) capabilities, and real-world AD applications, the authors have resolved my questions with additional results. Hence, I recommend maintaining the Borderline Accept rating, with confidence towards the acceptance of this paper.
Formatting Issues
N/A.
Thanks for your acknowledgement and kind advice. Regarding your concerns, we give responses below:
Weakness 1, Question 6, Discussion on Transferability and Explainability in Real-world Autonomous Driving
-
We appreciate the reviewer’s concerns regarding the Transferability and Explainability in Real-world Autonomous Driving.
-
Recent breakthroughs of reinforcement learning (RL) in domains such as LLMs[1], VLMs[2], and AlphaFold[3] have shown RL's unique advantages compared to SFT/imitation learning. While real-world deployment of RL remains challenging, the community has made significant efforts in high-fidelity simulation generation[4] and reconstruction[5,6], though their readiness for large-scale RL deployment remains limited.
-
Our work builds upon this trend of RL, serving as a preliminary study of applying RL to end-to-end autonomous driving. Due to the unavailability of a real-world-level authentic simulator, we use CARLA, but we argue that the scientific problem studied and the discoveries of this early exploration are general and worth sharing with the community.
-
Regarding explainability (Question 6), as shown in Figure 8, we reconstruct BEV representations instead of expensive RGB images. This not only provides supervision for training the world model but also enhances interpretability by offering a structured and semantically meaningful intermediate representation.
[1] 2024, Technical Report, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
[2] 2025, Arxiv, VLM-R1: A stable and generalizable R1-style Large Vision-Language Model
[3] 2021, Nature, Highly accurate protein structure prediction with AlphaFold
[4] 2025, ICCV, HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
[5] 2025, ICCV, BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
[6] 2025, Arxiv, ReSim: Reliable World Simulation for Autonomous Driving
Weakness 2, Additional Ablations and Discussion on Alignment
-
Thanks for your suggestions. In fact, Table 7 presents ablations on the three types of latent states (Abstract-State Alignment).
-
About Spatial-Temporal Alignment
- In addition, we add results on Spatial-Temporal Alignment (based on Dev10). Spatial Alignment ensures consistency between image representations and BEV representations—removing it is similar to “driving blind.” Temporal Alignment maintains the consistency of future predictions across time. Both components are essential and complementary for stable world model rollout; a schematic sketch of the two alignment terms follows the table below.
| Spatial Alignment | Temporal Alignment | DS | SR(%) |
|---|---|---|---|
| ✗ | ✗ | 0.0 | 0.0/10 |
| ✓ | ✗ | 13.6 | 1.2/10 |
| ✗ | ✓ | 9.24 | 0.8/10 |
| ✓ | ✓ | 83.5 | 7.5/10 |
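For illustration only, here is a hypothetical sketch of these two alignment terms; the concrete losses in the paper may differ (e.g., they could be KL terms on the categorical states rather than MSE), and all names and shapes are placeholders we supply.

```python
# Hypothetical sketch (assumption, not the paper's exact losses): align the raw-sensor
# stream with the frozen privileged stream, per timestep (spatial) and across the
# imagined future rollout (temporal).
import torch
import torch.nn.functional as F

def spatial_alignment_loss(raw_bev_feat: torch.Tensor, priv_bev_feat: torch.Tensor) -> torch.Tensor:
    # Consistency between the image-derived BEV representation and the privileged
    # BEV representation at the same timestep; gradients stop at the teacher.
    return F.mse_loss(raw_bev_feat, priv_bev_feat.detach())

def temporal_alignment_loss(raw_rollout: torch.Tensor, priv_rollout: torch.Tensor) -> torch.Tensor:
    # Consistency of future predictions: both rollouts have shape (horizon, batch, dim).
    return F.mse_loss(raw_rollout, priv_rollout.detach())

def alignment_loss(rb, pb, rr, pr, w_spatial=1.0, w_temporal=1.0):
    return w_spatial * spatial_alignment_loss(rb, pb) + w_temporal * temporal_alignment_loss(rr, pr)
```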
- About Encoder Sharing
- Notably, because the privileged and raw sensor streams have different inputs, their encoders cannot share parameters.
Weakness 3, Clarification on World Model Alignment during Online RL Updates
-
We appreciate the reviewer’s concern, but we respectfully disagree with this point. During online RL updates, there is no misalignment between the world models.
-
In the first stage, similar to standard model-based RL approaches [7], the Privileged World Model and Privileged Policy are updated alternately. In the second stage, as shown in Figure 4, the Privileged World Model is frozen, ensuring a stable supervision signal for optimizing the Raw World Model. Therefore, no misalignment occurs during training.
[7] 2025, Nature, Mastering Diverse Domains through World Models
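For concreteness, below is a hypothetical sketch of this two-stage schedule; the callables and their signatures are placeholders supplied for illustration, not the authors' actual training code.

```python
# Hypothetical sketch of the two-stage schedule described above; all callables
# (collect, update_wm, update_policy, freeze) are user-supplied placeholders.
def train_two_stage(collect, update_wm, update_policy, freeze,
                    priv_wm, priv_policy, raw_wm, raw_policy,
                    steps_stage1=1000, steps_stage2=1000):
    # Stage 1: alternately update the Privileged World Model and Privileged Policy,
    # as in standard model-based RL [7].
    for _ in range(steps_stage1):
        batch = collect(priv_policy)              # privileged BEV observations
        update_wm(priv_wm, batch, teacher=None)
        update_policy(priv_policy, priv_wm)       # policy learns on imagined rollouts

    # Stage 2: freeze the Privileged World Model so it provides a stable
    # supervision signal, then train the Raw Sensor World Model under its guidance.
    freeze(priv_wm)
    for _ in range(steps_stage2):
        batch = collect(raw_policy)               # multi-view images + IMU
        update_wm(raw_wm, batch, teacher=priv_wm)

    # Finally, freeze both world models and fine-tune the raw-sensor policy
    # on imagined rollouts (cf. the response to Question 2 below).
    freeze(raw_wm)
    update_policy(raw_policy, raw_wm)
    return raw_policy
```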
Weakness 4, Clarification on the Comparison with AdaWM
- In fact, as shown in Table 1, we include a comparison with AdaWM[8]. It is important to note that AdaWM is not an end-to-end autonomous driving method. Its benchmark, ROM03, is a derivative of CARLA v1 that lacks the complex corner cases of CARLA v2 (e.g., Bench2Drive), and it is not open-sourced yet.
[8] 2025, ICLR, AdaWM: adaptive world model based planning for autonomous driving
Questions 1, Clarification on the Necessity of Representation Alignment
-
As described in Sec. 3.1 and Sec. 3.2, the dual-stream world model processes two distinct input modalities: privileged input consists of time-sequenced BEV semantic masks, while raw sensor input includes multi-view images and IMU data. Due to their fundamentally different input formats and encoder architectures, simple fine-tuning between the two streams is not feasible.
-
As shown in Figure 7 (Appendix F), removing rollout guidance causes the Raw Sensor model’s loss to oscillate and fail to converge. Since the world model is used for future prediction, representation alignment is essential to ensure consistency between the two streams during rollout.
Question 2, Clarification on World Model Stability During Policy Fine-Tuning
- In our dual-stream design, the first stage follows a standard model-based RL paradigm[7], where the Privileged World Model and Privileged Policy are updated alternately to ensure stable training. In the second stage, as shown in Figure 5, both World Models are frozen during raw policy fine-tuning. As a result, the predictive accuracy of the world model remains unaffected, and there is no degradation due to policy distribution shift.
Questions 3, Clarification on Single Encoder to Serve both the World Model and RL Agent
- This is a common practice in model-based RL[7,8,9]. During training, the RL agent interacts with the world model instead of the simulator, significantly reducing simulation overhead and improving training efficiency. One of the key roles of the world model is to predict the next observation, which is then used by the RL agent for decision-making. Therefore, using a single shared encoder for both the world model and the RL agent is not only reasonable but necessary to maintain consistency in representation and ensure effective policy learning.
[8] 2020, ICLR, Dream to Control: Learning Behaviors by Latent Imagination
[9] 2021, ICLR, Mastering Atari with Discrete World Models
Questions 4, Discussion on Fairness of Comparison with IL Methods and Interaction Nums
-
As noted in Sec. 4.1, we adopt the same 1000 training routes from Bench2Drive to ensure fairness. However, as an online learning paradigm, RL naturally collects a larger number of interactions—both successful and failed—to iteratively refine its policy. This trial-and-error process is a core advantage of RL over offline imitation learning.
-
In Bench2Drive-Base, the 1000 routes yield a total of 250k samples. To further analyze interaction efficiency, we also include additional ablations. Similar to early work such as MaRLn [10], which trained and tested on the same set of routes (albeit with limited generalization), we show that while training on fixed routes is feasible, generalization remains a challenge.
-
Notably, Raw2Drive enables better generalization to unseen scenarios by learning to predict future dynamics—this requires more interactions. As the number of interactions increases from 250k to 750k and 1M, we observe performance gains. Both successful and failed trajectories are crucial for learning a robust world model. However, we acknowledge that training time also increases accordingly.
| Method | Training Route Num | Test Route Num | Interactions | DS | SR(%) | Training Time |
|---|---|---|---|---|---|---|
| Ours(RL) | 1000 | 220 | 250k | 58.34 | 30.00 | 8 GPU Days |
| Ours(RL) | 1000 | 220 | 750k | 68.45 | 47.73 | 30 GPU Days |
| Ours(RL) | 1000 | 220 | 1M | 71.34 | 50.24 | 40 GPU Days |
| Ours(RL) | 220 | 220 | 250k | 66.82 | 45.45 | 8 GPU Days |
| UniAD(IL)[11] | 1000 | 220 | 250k | 45.81 | 16.36 | 30 GPU Days |
[10] 2020, CVPR, End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances
[11] 2023, CVPR, Planning-oriented Autonomous Driving
Questions 5, Discussion and Ablation on Generalization Across Input Modalities and Sensor Configurations
- Thanks for your suggestions. Please refer to the answer to Reviewer #ps71 (Question b, Additional Results on LiDAR-Based Study).
Thank you again for your valuable comments. We hope our responses have effectively addressed the concerns raised. If there are any remaining concerns, we would be grateful for further clarification.
Thanks for the detailed response and discussion regarding my questions. I will keep my score and recommend the acceptance.
Thanks again for your time and advice!
This paper introduces Raw2Drive, a reinforcement learning algorithm for end-to-end autonomous driving from raw image inputs. A key component of the approach is a world model trained from pixels, which learns under the guidance of a privileged world model trained on Bird’s Eye View (BEV) representations. The core novelty lies in the distillation mechanism that transfers structured knowledge from the BEV-based model to the raw-sensor-based model, enabling more efficient learning and improved driving performance.
Strengths and Weaknesses
The paper demonstrates strong empirical performance, supported by thorough experiments on well-established benchmarks. The comparisons with baselines are well-structured and convincingly highlight the practical relevance and effectiveness of the proposed method. However, from an algorithmic standpoint, the novelty is relatively limited. The student-teacher structure at the core of the method is a well-explored paradigm in the literature. For example, similar ideas have been investigated in: Zhang J., Huang Z., Ohn-Bar E. Coaching a Teachable Student. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7805–7815. While I acknowledge that many of these prior approaches focus on imitation learning, the current paper would benefit from a more comprehensive Related Work section. This should not only clarify the distinction between IL and RL settings, but also deepen the algorithmic comparison with relevant methods that use similar distillation or privileged information paradigms. A richer contextualization would help better situate the algorithmic contributions of this work.
Questions
a. In Section 2 (Problem Formulation & Related Works), the authors mention that raw observations are collected using cameras, LiDAR, and IMU. However, the proposed algorithm relies solely on images, effectively discarding the data from the other modalities. What is the rationale behind this design choice? Would the authors consider incorporating multimodal sensors (e.g., LiDAR or IMU) to further improve performance or robustness?
b. In the conclusion, the authors claim that “our proposed paradigm is not limited to camera-only systems and is applicable to guiding the learning of any sensor through the privileged input.” While this is an appealing prospect, learning directly from raw point-clouds (e.g., LiDAR) is significantly more challenging. It would be valuable if the authors could elaborate on how their approach could generalize to such modalities, and whether any specific architectural adaptations would be required.
c. The performance gains reported in Table 4 are substantial, especially when compared to IL-based baselines. Can the authors analyze the underlying causes of these improvements? Specifically, why does MBRL seem to consistently outperform IL approaches in this setting?
Limitations
Limitations of the proposed method are only briefly addressed in the Appendix and focus mainly on broader issues—such as the general challenges of real-world RL and the nature of the privileged information—rather than on limitations specific to the proposed algorithm. A more transparent and detailed discussion of the method’s limitations (e.g., dependency on privileged BEV data during training, transferability to diverse real-world environments, or failure cases) would greatly improve the overall rigor and self-awareness of the work.
Final Justification
I appreciate the authors' detailed response, as they have addressed most of my concerns. However, a common issue raised by several reviewers is the ability to generalize beyond CARLA. Overall, I believe the paper is borderline, and to clearly justify acceptance, the approach must be validated to demonstrate its generalizability. I will slightly raise my score.
Formatting Issues
N/A
We would like to first extend our sincere gratitude for your time and effort in evaluating our manuscript. Your thorough evaluation and insightful comments are greatly appreciated. We will address your questions point by point and hope to resolve your concerns effectively.
Weakness 1, Discussion and Clarification on Novelty and Student-Teacher Paradigm
-
Thank you for the helpful suggestion. Our core focus is reinforcement learning in end-to-end autonomous driving. With recent progress in model-based RL [1,2], our work introduces two key novelties:
- Successfully achieving end-to-end RL: To the best of our knowledge, Raw2Drive is the first to successfully achieve end-to-end RL with performance significantly better than contemporaneous end-to-end IL methods.
- Dual-stream World Model as a Bridge: We use a dual-stream world model to decouple perception and decision-making. This design simplifies the joint optimization problem, which is often difficult to converge in end-to-end autonomous driving.
- Dual-stream World Model Alignment: Some student-teacher methods such as CaT [3] and DriveAdapter [4] are based on imitation learning. While they achieve strong performance in CARLA v1, they struggle in corner cases under the more challenging Bench2Drive (CARLA v2) benchmark. Moreover, unlike standard teacher-forcing strategies, as shown in Figure 6 in the manuscript, future prediction in driving involves both stochastic (e.g., other agents’ intentions) and deterministic (e.g., ego state) components over time and space. We propose Spatial-Temporal Alignment and Abstract-State Alignment to ensure consistency between the two streams during rollout.
-
We will revise the final version of the paper accordingly, adding citations to better emphasize these contributions and improving the discussion of related methods.
[1] 2020, ICLR, Dream to Control: Learning Behaviors by Latent Imagination
[2] 2021, ICLR, Mastering Atari with Discrete World Models
[3] 2023, CVPR, Coaching a Teachable Student
[4] 2023, ICCV, DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
Questions a, Clarification on Raw Sensor Inputs
- Thank you for the question. As stated in Sec. 3.2 (Line 97), the raw sensor input includes multi-view images and IMU data. For visual simplicity, the IMU input is omitted from the figure but is fully incorporated in the model implementation. We will update in the final version.
Questions b, Additional Results on LiDAR-Based Study
-
Regarding the LiDAR sensor, this highlights another advantage of our method: extensibility. Our dual-stream world model decouples perception and decision-making, where the perception module focuses on enhancing the accuracy of BEV-based representations. To further validate this extensibility, we include additional experiments using LiDAR inputs. Specifically, we replace BEVFormer[5] with BEVFusion[6]. Incorporating LiDAR leads to some performance improvement, particularly in reconstructing occluded objects, but the gains are marginal as perception accuracy is already near saturation. Moreover, most occluded objects are distant and have limited impact on immediate decision-making.
-
Due to the latest NeurIPS submission policy, we are unable to include visualization links, but we will provide them in the final camera-ready version.
| Modality | Encoder | DS | SR(%) |
|---|---|---|---|
| Camera | BEVFormer[5] | 71.36 | 50.24 |
| Camera+LiDAR | BEVFusion[6] | 72.43 | 50.91 |
[5] 2022, ECCV, BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
[6] 2023, ICRA, BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation
Questions c, Analyzing the Performance Gains of MBRL over IL
-
Thank you for raising this important analysis. As noted in the Introduction, imitation learning (IL) relies on expert data and often struggles with out-of-distribution generalization. It is also prone to issues such as causal confusion (e.g., in traffic light scenarios, where the model may mistakenly associate “other vehicles move” with “I should move”) and compounding errors during long-horizon rollouts. In contrast, reinforcement learning (RL) learns through trial and error, optimizing behavior directly from reward signals.
-
Recent breakthroughs in RL across domains such as LLMs [7], VLMs [8], and AlphaFold [9] have demonstrated RL’s distinct advantages over supervised fine-tuning or imitation learning, particularly in discovering policies that go beyond demonstrator behavior.
-
In summary, while IL provides a strong initialization from expert behavior, it is fundamentally limited by demonstration quality and generalization capacity. Our MBRL framework leverages privileged guidance and learned world models to go beyond imitation, enabling robust and reward-driven policy learning that better handles long-horizon reasoning and distributional shifts. We will incorporate this analysis in the final version to clarify the advantages of MBRL over IL.
[7] 2024, Technical Report, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
[8] 2025, Arxiv, VLM-R1: A stable and generalizable R1-style Large Vision-Language Model
[9] 2021, Nature, Highly accurate protein structure prediction with AlphaFold
Limitations 1, Discussion on Method Limitations
- Thank you for the valuable suggestion. As a reinforcement learning framework, our method requires trial-and-error interaction, which incurs additional computational cost compared to imitation learning that directly learns from expert demonstrations. Moreover, due to the lack of highly realistic real-world simulators, we conduct experiments in CARLA. While CARLA has limitations in fidelity, we believe the core scientific questions and early-stage insights presented in this work are broadly applicable and valuable to the community. We will explicitly discuss these limitations in the final version.
Limitation 2, Discussion on Transferability in Real-World Autonomous Driving
-
We appreciate the reviewer’s concerns regarding the practicality of transferability in real-world autonomous driving.
-
About the Transferability of RL
-
While real-world deployment of RL remains challenging, the community has made significant efforts in high-fidelity simulation generation[10] and reconstruction[11,12], though their readiness for large-scale RL deployment remains limited.
-
Our work builds upon this trend of RL, serving as a preliminary study of applying RL to end-to-end autonomous driving. Due to the unavailability of a real-world-level authentic simulator, we use CARLA, but we argue that the scientific problem studied and the discoveries of this early exploration are general and worth sharing with the community.
-
-
About the Transferability of Raw2Drive
-
Raw2Drive utilizes a dual-stream world model to effectively decouple perception and decision-making. On the decision side, trajectory prediction is conducted based on structured perception outputs (e.g., BEV or compact representations)—a widely adopted practice in both academia and industry. On the perception side, the emphasis is on enhancing the accuracy of these BEV-based representations.
-
By leveraging the world model, our paradigm enables a fully differentiable pipeline while preserving a separation between perception and planning.
-
Furthermore, with recent progress in high-fidelity scene generation, reconstruction, and simulation technologies[10,11,12], a practical deployment path is to pretrain the world model via imitation learning and fine-tune the policy with reinforcement learning—an increasingly feasible strategy in real-world and industrial settings.
-
[10] 2025, ICCV, HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
[11] 2025, ICCV, BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
[12] 2025, Arxiv, ReSim: Reliable World Simulation for Autonomous Driving
Thank you again for your valuable comments. We hope our responses have effectively addressed the concerns raised. As the rebuttal period is ending soon, we’d sincerely appreciate a score update if the concerns are resolved. If there are any remaining concerns that prevent a higher score, we would be grateful for further clarification.
I appreciate the authors' detailed response, as they have addressed most of my concerns. However, a common issue raised by several reviewers is the ability to generalize beyond CARLA. Overall, I believe the paper is borderline, and to clearly justify acceptance, the approach must be validated to demonstrate its generalizability. I will slightly raise my score.
Thank you for your valuable comments, which have greatly helped us improve the quality and clarity of our work. We fully agree with your insightful perspective on the sim2real gap—a critical challenge for RL-based driving systems. While CARLA remains the only viable closed-loop simulator suitable for RL research at present, our work focuses on policy learning and introduces a dual-stream world model design that is conceptually decoupled from the underlying simulator. As generative or reconstruction-based simulators continue to evolve, we believe that exploring end-to-end RL within these emerging frameworks will become increasingly practical and impactful. We will incorporate these valuable discussions into the revision. Thank you again for your thoughtful feedback!
The paper introduces an end-to-end trained RL architecture for the CARLA autonomous driving simulator. The method is evaluated on the CARLA Leaderboard v2 as well as on Bench2drive (also in CARLA).
Strengths and Weaknesses
Strengths:
- The authors focus on closed-loop evaluation
- Getting strong results with end2end RL for AD seems novel. The authors claim it's the first, which I am not sure about, but may at least be the first one with good results.
- The authors include a long video example of the agent driving
Weaknesses:
- Scope/Transfer: The main weakness is already admitted in the title, which says "Autonomous Driving (in CARLA v2)". The main claim is end2end RL where the authors train from camera in CARLA and they evaluate in CARLA. Although probably different scenarios in CARLA, I presume those still use mostly the same visual assets. This means that it is unclear how the end2end part would transfer to the real world (which looks and behaves differently).
- Normally for IL approaches you compete on the amount of data needed to reach a certain performance, and it's also easier to collect in the real world while a human drives the car. Having RL drive a car during training does not seem realistic. I cannot find how many interactions were needed to train this e2e architecture in the main paper, but I presume it was a lot. That leaves sim2real, or how do you envision one would use this type of model?
- The paper framing, structure and writing could overall be improved.
- The intro is rather negative of IL while pushing the virtues of RL, but RL also has weaknesses (like large number of world interactions needed, sim2real).
- The related work has been relegated to the appendix, which makes it difficult to evaluate what is novel.
- What is novel in the method is rather unclear. The listed contributions only claim to be the first AD e2e RL. The method part contains very few references, so it is unclear if this is all novel or from related work. Is the WM alignment part novel? If so, I suspect that could have been a contribution on its own.
- The intro takes aim at beating the CARLA v2 Leaderboard, but when we get to results it just says this uses bad metrics and focuses on Bench2Drive (also in CARLA) without explaining why. I had to read [40] to understand why, but [40] also introduces a new metric that purports to fix the problems, so why did the authors not use this?
Questions
-
Clarify what is novel in the method. Is the privileged WM training approach novel?
-
How many interactions were needed to train the RL policy, and what RL method did you use exactly? The main paper only describes the WM training. I only found a brief mention in the appendix of the policy training being an actor-critic approach.
-
How would this model transfer to outside of CARLA? Even to another simulator? Maybe NAVSIM?
Limitations
Sim2real transfer is unclear and I can't find a discussion of this. Otherwise ok.
Final Justification
So far the rebuttal clarified some things but the fundamental weaknesses remain: no generalization tests outside of CARLA limits its potential as an AD paper, and while the dual stream RL approach is interesting, it is only tested on CARLA/AD so we do not know how well it would generalize to other problems. I still think that the approach is interesting and it already does have CARLA in the title, so if these limitations (and framing of the contributions) were made clearer in the paper I would be OK with this being published.
EDIT: I updated my score to borderline accept after further discussion with the authors and finding the baseline was also just RL in CARLA and still got accepted to ECCV'24, but my reservations about sim2real stand and I hope they can address this in future work.
Formatting Issues
-
As already mentioned above, the presentation could be improved.
-
Slight overuse of bold
-
Typo on L118 "the fisrt type"
We would like to first extend our sincere gratitude for your time and effort in evaluating our manuscript. Your thorough evaluation and insightful comments are greatly appreciated. We will address your questions point by point and hope to resolve your concerns effectively.
Weakness 1, Question 3, Discussion on Practicality of RL Training and Transfer
-
We appreciate the reviewer’s concerns regarding the practicality of RL training. Recent breakthroughs in reinforcement learning (RL) in domains such as LLMs[1], VLMs[2], and AlphaFold[3] have shown RL's unique advantages compared to SFT/imitation learning. While real-world deployment of RL remains challenging, the community has made significant efforts in high-fidelity simulation generation[4] and reconstruction[5,6], though their readiness for large-scale RL deployment remains limited.
-
Our work builds upon this trend of RL, serving as a preliminary study of applying RL to end-to-end autonomous driving. Due to the unavailability of a real-world-level authentic simulator, we use CARLA, but we argue that the scientific problem studied and the discoveries of this early exploration are general and worth sharing with the community.
[1] 2024, Technical Report, DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
[2] 2025, Arxiv, VLM-R1: A stable and generalizable R1-style Large Vision-Language Model
[3] 2021, Nature, Highly accurate protein structure prediction with AlphaFold
[4] 2025, ICCV, HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
[5] 2025, ICCV, BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
[6] 2025, Arxiv, ReSim: Reliable World Simulation for Autonomous Driving
Weakness 1, Question 2, How Many Interactions Were Needed to Train This End-to-End Architecture?
-
As noted in Sec. 4.1, we adopt the same 1000 training routes from Bench2Drive to ensure fairness. However, as an online learning paradigm, RL naturally collects a larger number of interactions—both successful and failed—to iteratively refine its policy. This trial-and-error process is a core advantage of RL over offline imitation learning.
-
In Bench2Drive-Base, the 1000 routes yield a total of 250k samples. To further analyze interaction efficiency, we also include additional ablations. Similar to early work such as MaRLn [7], which trained and tested on the same set of routes (albeit with limited generalization), we show that while training on fixed routes is feasible, generalization remains a challenge.
-
Notably, Raw2Drive enables better generalization to unseen scenarios by learning to predict future dynamics—this requires more interactions. As the number of interactions increases from 250k to 750k and 1M, we observe performance gains. Both successful and failed trajectories are crucial for learning a robust world model. However, we acknowledge that training time also increases accordingly.
| Method | Training Route Num | Test Route Num | Interactions | DS | SR(%) | Training Time |
|---|---|---|---|---|---|---|
| Ours(RL) | 1000 | 220 | 250k | 58.34 | 30.00 | 8 GPU Days |
| Ours(RL) | 1000 | 220 | 750k | 68.45 | 47.73 | 30 GPU Days |
| Ours(RL) | 1000 | 220 | 1M | 71.34 | 50.24 | 40 GPU Days |
| Ours(RL) | 220 | 220 | 250k | 66.82 | 45.45 | 8 GPU Days |
| UniAD(IL)[8] | 1000 | 220 | 250k | 45.81 | 16.36 | 30 GPU Days |
[7] 2020, CVPR, End-to-End Model-Free Reinforcement Learning for Urban Driving using Implicit Affordances
[8] 2023, CVPR, Planning-oriented Autonomous Driving
Weakness 1, Limitations 1, Discussion on How to Use This Paradigm in the Real World
-
Thanks for your suggestions. Raw2Drive utilizes a dual-stream world model to effectively decouple perception and decision-making. On the decision side, trajectory prediction is conducted based on structured perception outputs (e.g., BEV or compact representations)—a widely adopted practice in both academia and industry. On the perception side, the emphasis is on enhancing the accuracy of these BEV-based representations.
-
By leveraging the world model, our paradigm enables a fully differentiable pipeline while preserving a separation between perception and planning.
-
Furthermore, with recent progress in high-fidelity scene generation, reconstruction, and simulation technologies[4,5,6], a practical deployment path is to pretrain the world model via imitation learning and fine-tune the policy with reinforcement learning—an increasingly feasible strategy in real-world and industrial settings.
Weakness 2, Question 1, Contribution and Novelty of Raw2Drive
-
Thank you for the helpful suggestion. Due to page limitations, we moved the related work section to the appendix. Nonetheless, as shown in Table 1 of the Introduction, we compared prior work, including both RL-based and world-model-based approaches. Our core focus is reinforcement learning in end-to-end autonomous driving. Early work such as MaRLn [7] (CVPR 2020) encountered significant challenges due to the difficulty of learning directly from raw images, which made policy optimization unstable.
-
With recent progress in model-based RL [9, 10], our work introduces two key novelties:
- Successfully achieving end-to-end RL: To the best of our knowledge, Raw2Drive is the first to successfully achieve end-to-end RL with performance significantly better than contemporaneous end-to-end IL methods.
- Dual-stream World Model as a Bridge: We use a dual-stream world model to decouple perception and decision-making. This design simplifies the joint optimization problem, which is often difficult to converge in end-to-end autonomous driving.
- Dual-stream World Model Alignment: Unlike standard teacher-forcing strategies [11, 12], future prediction in driving involves both stochastic (e.g., other agents’ intentions) and deterministic (e.g., ego state) components over time and space. We propose Spatial-Temporal Alignment and Abstract-State Alignment to ensure consistency between the two streams during rollout.
-
We will revise the final version of the paper accordingly to better highlight these contributions.
[9] 2020, ICLR, Dream to Control: Learning Behaviors by Latent Imagination
[10] 2021, ICLR, Mastering Atari with Discrete World Models
[11] 2021, ICCV, End-to-End Urban Driving by Imitating a Reinforcement Learning Coach
[12] 2022, NeurIPS, Teacher Forcing Recovers Reward Functions for Text Generation
Weakness 2, Clarification on the Bench2Drive Metric
- In fact, as noted in Sec. 5.1 of [13], their evaluation metric is based on Bench2Drive, which is consistent with ours. Furthermore, [13] modifies the base penalty coefficients for CP (crossing red lights) and CV (vehicle collisions), reducing them specifically to address issues introduced by their early stopping strategy.
[13]2025, Arxiv, Hidden Biases of End-to-End Driving Datasets
Question 2, More Introduction of the RL Method
- We follow DreamerV3 [14], a representative model-based RL method, which employs the REINFORCE[15] algorithm for policy training. As detailed in our Appendix B and E, we employ a latent imagination-based actor-critic framework. Unlike standard actor-critic methods [16] that rely on real environment rollouts, we train the actor and critic entirely within a learned latent space using trajectories imagined by the world model. The actor is optimized to maximize expected long-term reward under these imagined rollouts, while the critic learns to predict value estimates in the same latent space, enabling sample-efficient and stable training. A simplified sketch of this loop is given after the references below.
[14] 2025, Nature, Mastering Diverse Domains through World Models
[15] 1992, Machine Learning, Simple statistical gradient-following algorithms for connectionist reinforcement learning
[16] 2017, Arxiv, Proximal Policy Optimization Algorithms
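As an illustration of this loop, here is a heavily simplified, assumption-based sketch of actor-critic learning on imagined latent rollouts in the spirit of DreamerV3 [14]; the model interfaces (predict_next, predict_reward), the plain discounted return (DreamerV3 uses lambda-returns), and all names are placeholders rather than the authors' implementation.

```python
# Simplified sketch (assumptions, not the authors' code): REINFORCE-style actor update
# with a learned critic, trained entirely on rollouts imagined by the world model.
import torch

def imagined_actor_critic_losses(world_model, actor, critic, start_state,
                                 horizon=15, gamma=0.99):
    states, log_probs = [start_state], []
    # Roll out in latent space only; no simulator calls are made here.
    for _ in range(horizon):
        dist = actor(states[-1])
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        states.append(world_model.predict_next(states[-1], action))

    rewards = [world_model.predict_reward(s) for s in states[1:]]
    values = [critic(s) for s in states]

    # Discounted returns, bootstrapped with the critic's final value estimate.
    returns, ret = [], values[-1].detach()
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.insert(0, ret)

    # Critic regresses onto the returns; actor follows REINFORCE with a value baseline.
    critic_loss = torch.stack([(v - g.detach()).pow(2).mean()
                               for v, g in zip(values[:-1], returns)]).sum()
    advantages = [g - v.detach() for g, v in zip(returns, values[:-1])]
    actor_loss = -torch.stack([(lp * adv).mean()
                               for lp, adv in zip(log_probs, advantages)]).sum()
    return actor_loss, critic_loss
```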
Question 3, Clarification on NAVSIM simulator
- Thank you for the question. NAVSIM[17] is a non-reactive simulator designed for benchmarking perception and planning, but not for training interactive policies. It does not support closed-loop interactions or provide observations after interaction, making it unsuitable for reinforcement learning, which relies on trial-and-error learning with visual feedback. We will clarify this limitation in the revised version.
[17] 2024, NeurIPS, NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
About Paper Formatting Concerns
- Thank you for pointing this out. We apologize for the oversight and will carefully correct all typos and formatting issues in the final submission.
Thank you again for your valuable comments. We hope our responses have effectively addressed the concerns raised. As the rebuttal period is ending soon, we’d sincerely appreciate a score update if the concerns are resolved. If there are any remaining concerns that prevent a higher score, we would be grateful for further clarification.
I thank the authors for the clarifications, please also include them in the paper if it gets accepted.
I have two remaining concerns that your rebuttal did not fully address:
- In particular I still find the lack of experiments on generalization performance beyond CARLA a problem for an RL AD approach trained in CARLA. You do have CARLA in the title, but this limitation could be clearer in the conclusions/future work. To be honest, I am still not sure why you could not at least compare on IL benchmarks with camera input. I suggested Navsim, which is even a limited simulator of sorts with non-reactive agents, but anything would be good because we currently have no evidence that this transfers outside of CARLA, which diminishes the impact as an AD application paper.
- As for performance in CARLA:
a) Can you clarify what you mean by UniAD being trained on 250k "interactions". Does this mean 250k unique data samples from an expert policy in the simulator? I don't recall UniAD having official results for CARLA or Bench2Drive, so which data was used to train it?
b) How was your "250k" policy trained? Do you use the same 250k data generated by your own policy to train both of your world models, or is it first 250k interactions for the privileged WM and later 250k interactions for the Raw Data WM? Since they are aligned via your "guidance" objective, your Raw-Data WM is to some extent learned via supervised learning from the privileged WM.
c) Have you tried ablating against directly training an end-to-end policy via IL from Think2Drive as an expert policy? Similar methods have been used to great effect in robotics [1*]. Without this ablation it diminishes your contribution claims for the dual stream RL training method.
[1*] Miki, Takahiro, et al. "Learning robust perceptive locomotion for quadrupedal robots in the wild." Science robotics 7.62 (2022): eabk2822.
Thanks for your reply.
Further Clarification on Concern 1
-
Thank you for the suggestion. First, we would like to clarify that NAVSIM[1] does not support end-to-end reinforcement learning (RL). As an online RL agent must learn from trial and error, NAVSIM[1], being non-reactive and pseudo-closed-loop, cannot provide RGB frames outside of the dataset (e.g., collisions or going off-road). Thus, it is not suitable for training end-to-end RL.
-
In contrast, CARLA remains the most widely adopted closed-loop simulator for end-to-end RL. Numerous prior works have conducted experiments solely within CARLA, such as TCP [2], InterFuser [3], ReasonNet [4], ThinkTwice [5], TransFuser++ [6, 7], DriveAdapter [8], LMDrive [9], ReasonPlan [10], SimLingo [11], ORION [12], and ETA [13]. In our work, we thoroughly validated our method in two standard closed-loop benchmarks, Leaderboard 2.0 and Bench2Drive. If the reviewer is aware of other benchmarks that support end-to-end RL, we would humbly appreciate a reference for our consideration.
-
Additionally, as generative[14] or reconstruction-based simulators[15, 16] become more mature, it may be feasible to explore end-to-end RL in such frameworks. However, our current focus is on policy learning, which is decoupled from the fidelity of rendering or simulation.
-
Finally, we would like to emphasize that Table 5 of Bench2Drive includes only imitation learning baselines, except for ours. This highlights our core contribution: successfully enabling end-to-end RL that outperforms IL methods.
[1] 2024, NeurIPS, NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
[2] 2022, NeurIPS, Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline
[3] 2022, CoRL, InterFuser: Safety-Enhanced Autonomous Driving Using Interpretable Sensor Fusion Transformer
[4] 2023, CVPR, ReasonNet: End-to-End Driving with Temporal and Global Reasoning
[5] 2023, CVPR, Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving
[6] 2021, CVPR, Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
[7] 2023, TPAMI, TransFuser: Imitation with Transformer-Based Sensor Fusion for Autonomous Driving
[8] 2023, ICCV, DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving
[9] 2024, CVPR, LMDrive: Closed-Loop End-to-End Driving with Large Language Models
[10] 2025, CoRL, ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving
[11] 2025, CVPR, SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
[12] 2025, ICCV, ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
[13] 2025, ICCV, ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
[14] 2025, ICCV, HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
[15] 2025, ICCV, BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting
[16] 2025, Arxiv, ReSim: Reliable World Simulation for Autonomous Driving
Further Clarification on Concern 2
-
Question a, Clarification on UniAD being trained on 250k “interactions”
- Bench2Drive has become a widely adopted benchmark, with official implementations of imitation learning methods such as UniAD [17] and VAD [18]. The “250k interactions” mentioned for UniAD correspond to the official Bench2Drive dataset, which includes 1000 routes totaling approximately 250k steps.
[17] 2023, CVPR, Planning-oriented Autonomous Driving
[18] 2023, ICCV, VAD: Vectorized Scene Representation for Efficient Autonomous Driving
-
Question b, Clarification on “How was your ‘250k’ policy trained”
-
While UniAD and VAD are trained using the high-quality 250k expert trajectories provided in Bench2Drive, our method generates 250k data points through online reinforcement learning interactions. These trajectories include failure cases such as collisions and traffic violations, making them significantly noisier and lower in quality compared to expert data. However, this is precisely the strength of reinforcement learning—it learns from trial and error.
-
As described in Appendix B (Details of Training Pipeline), our training consists of two stages. We first train a privileged WM and a privileged policy using privileged inputs; then, we leverage this privileged WM to guide the learning of the raw sensor WM and policy, enabling effective RL policy training from raw sensor data.
-
-
Question c, Clarification on "directly training an end-to-end policy via IL from Think2Drive as an expert policy"
- As shown in Table 5, all compared methods—TCP-traj*, ThinkTwice*, and DriveAdapter* (* denotes expert feature distillation)—are imitation learning (IL) approaches similar to those discussed in [19], where an RL expert policy is distilled into an IL end-to-end model. Additionally, UniAD is a standard IL method that learns directly from expert demonstrations.
- In contrast, our core contribution lies in successfully enabling expert-guided RL end-to-end model through a dual-stream WM design and alignment mechanism, achieving superior performance compared to IL-based approaches.
[19] 2022, Science Robotics, Learning robust perceptive locomotion for quadrupedal robots in the wild
Concern 1:
"Thank you for the suggestion. First, we would like to clarify that NAVSIM[1] does not support end-to-end reinforcement learning (RL). As an online RL agent must learn from trial and error, NAVSIM[1], being non-reactive and pseudo-closed-loop, cannot provide RGB frames outside of the dataset(e.g., collisions or going off-road). Thus, it is not suitable for training end-to-end RL."
But you could still evaluate the actions your agent produces on real-world camera input against those of an expert, as is done for IL papers in AD. This is not perfect (there might be several good choices of driving actions), but I suggested Navsim because it tries to account for this multi-modality in its scoring function.
Additionally, for IL papers it is easy to collect more real world data as this is done passively while a human drives. Training RL in the real world seems extremely challenging from a safety perspective. Hence how it transfers from sim2real, or at least something else, is key for your approach to work in AD. Maybe we will have fully realistic simulators in the future and all training could be done on sim data, but we don't know that, and current approaches are far from fully realistic or as diverse as the real world.
Even if we did have simulators that good, would your 250k limitation on the number of interactions be relevant? I would argue that the correct metric then should be success rate with as much data as needed, as compute is likewise getting cheaper, and your approach hinges on there being a privileged expert policy (from which one can generate more IL data). In that case it might be more interesting to compare against end-to-end PPO and end-to-end IL (from the privileged policy) trained on 10M or 100M sim interactions and see which performs best. If you do not cap the number of sim interactions, model-free approaches like PPO might perform best.
Of the papers you list that were only evaluated in CARLA, all of them were IL approaches. However, I note that your baseline Think2Drive is also an RL approach that is only evaluated in CARLA and it apparently got accepted to ECCV'24, so there is some precedent. I will raise my score to borderline accept as the results in CARLA look good and I thought the dual-WM approach itself was interesting.
Thank you for your time and valuable suggestions. We fully agree with your perspective on the importance of sim-to-real transfer and the value of NAVSIM in evaluating real-world driving scenarios. However, NAVSIM currently is non-reactive and does not provide RGB frames beyond the pre-collected dataset, so we are unable to conduct RL training and, consequently, cannot validate RL-based approaches in this environment.
While CARLA remains the only viable closed-loop simulator for RL research at present, our work focuses on policy learning and introduces a dual-stream world model design that is conceptually decoupled from the specific simulator.
We fully agree with your perspective that, with access to high-fidelity simulators and sufficient computational resources, comparing final performance with unlimited data is indeed a valid and important metric. That said, end-to-end RL methods like PPO remain highly challenging for autonomous driving, due to the complexity of handling surround-view, high-dimensional, and temporally correlated visual inputs. In this context, developing sample-efficient RL approaches is still critical. In the LLM community, both supervised fine-tuning (SFT) and reinforcement learning (e.g., RLHF) have their own strengths—SFT is more stable and scalable, while RL offers better alignment with long-horizon objectives. Similarly, in vision-based decision-making, the trade-off between IL and RL presents an interesting avenue for future exploration. Once again, thank you for your insightful feedback. We will incorporate these discussions into the revised version.
Dear reviewers,
As the Author–Reviewer discussion period concludes in a few days, we kindly urge you to read the authors’ rebuttal and respond as soon as possible.
-
Please review all author responses and other reviews carefully, and engage in open and constructive dialogue with the authors.
-
The authors have addressed comments from all reviewers; each reviewer is therefore expected to respond, so the authors know their rebuttal has been considered.
-
We strongly encourage you to post your initial response promptly to allow time for meaningful back-and-forth discussion.
Thank you for your collaboration, AC
This paper introduces Raw2Drive, a model-based reinforcement learning framework for end-to-end autonomous driving in CARLA using raw image inputs. The core idea is to leverage a privileged BEV-based world model to guide the training of a raw-sensor world model through a distillation and alignment mechanism. The approach is evaluated extensively on CARLA Leaderboard v2 and Bench2Drive, demonstrating superior performance over baselines.
The main strengths of the paper lie in its emphasis on closed-loop evaluation on established benchmarks and its strong empirical performance, achieving state-of-the-art results. While the scope is limited to CARLA, raising questions about transferability to real-world driving, reviewers recognize that demonstrating robust end-to-end reinforcement learning from raw sensory inputs is an important advance. Overall, the paper is considered a strong contribution and is recommended for acceptance.