Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving
LeapAD, a new autonomous driving paradigm inspired by human cognition, improves adaptability and interpretability in complex scenarios through dual-process decision-making and continuous learning from past experiences.
Abstract
Reviews and Discussion
The paper introduces LeapAD, an interesting paradigm for autonomous driving inspired by human cognitive processes, addressing the limitations of prevailing data-driven methods in complex scenarios. LeapAD incorporates a dual-process decision-making module consisting of an Analytic Process (System-II) for logical reasoning and experience accumulation, and a Heuristic Process (System-I) for quick, empirical decision-making based on the learned knowledge from System-II. By emulating human attention to focus on critical objects, LeapAD simplifies environmental interpretation and mitigates decision-making complexities. The system is tested in the CARLA simulator, demonstrating superior performance over camera-only methods with less labeled data. The Heuristic Process shows continuous improvement through a reflection mechanism and a growing memory bank, indicating the effectiveness of the dual-process approach.
Strengths
The paper presents several notable strengths across dimensions of originality, quality, clarity, and significance:
Originality
- Dual-Process Decision-Making: The combination of an Analytic Process (System-II) and a Heuristic Process (System-I) emulates human cognitive functions, offering a biologically inspired framework for autonomous driving.
Quality
The technical quality of the work is solid:
- Continuous Learning: The reflection mechanism and growing memory bank enable continuous learning and improvement, showcasing the adaptability of the proposed system.
Clarity
The paper is well-written, with clear and concise explanations of complex concepts. The dual-process framework and its components are described in detail, making the methodology accessible to a broad audience.
Significance
- Advancing the Field: By introducing a dual-process decision-making framework, the paper opens avenues for research in autonomous driving and artificial intelligence, potentially influencing future developments in the field.
Weaknesses
While the paper presents some interesting contributions, there are areas where improvements could be made:
Methodological Concerns
While I appreciate the design of the Analytic Process and the Heuristic Process, does the paper clearly distinguish between the two? My understanding is that the Analytic Process uses LLMs, while the Heuristic Process uses a lightweight language model. Why can the latter be called the Heuristic Process? It would be better to clearly state why these components can be called the Heuristic Process and the Analytic Process.
Experimental Limitations
- Quantitative Metrics: The paper's experimental results are primarily based on the CARLA simulator, lacking real-world experiments. CARLA scenarios are still too simple. It would be better to report results that can comprehensively evaluate the performance of LeapAD, such as using the real-world dataset nuScenes.
Clarity and Presentation
- Technical Details: This paper is based on the Qwen VLM. It is not clear whether the performance improvement is due to Qwen-VL or the two-system design. It would be better to include more ablation studies exploring the influence of the VLM, such as trying LLaVA.
By addressing these weaknesses, the authors can provide a more thorough and robust evaluation of LeapAD.
Questions
There are areas where improvements could be made:
Methodological Concerns:
How does the paper distinguish between the Analytic Process and the Heuristic Process? How are these processes defined, and why is the Heuristic Process called such if it uses a lightweight language model?
Experimental Limitations:
Can results be reported to evaluate the performance of LeapAD using a real-world dataset like nuScenes?
Clarity and Presentation:
It is not clear whether the performance improvement is due to Qwen-VL or the two-system design. Can you report ablation studies to explore the influence of the VLM, for example by also trying LLaVA?
By addressing these questions, the authors can provide a more thorough and robust evaluation of LeapAD.
Limitations
Yes, the limitations are discussed in Section 6.
Dear Reviewer:
Thank you for your constructive comments. We provide discussions and explanations about your concerns as follows.
Q1: How does the paper distinguish between the Analytic Process and the Heuristic Process? How are these processes defined, and why is the Heuristic Process called such if it uses a lightweight language model?
A1: As explained in the introduction, our approach is inspired by the dual-process theory of human intelligence. The distinction between the Analytic Process and the Heuristic Process lies primarily in their roles within our system. The Analytic Process is rational, slow, and excels at logical reasoning and creativity across various domains, while the Heuristic Process is quick, empirical, and domain-specific. Although we use a lightweight language model, it performs a function in our dual-process decision module that aligns with the role of a Heuristic Process.
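To make this division of labour concrete, here is a minimal, self-contained sketch of the split (the callables `heuristic_lm` and `analytic_llm`, and the word-overlap retrieval, are illustrative stand-ins, not our actual implementation):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class MemoryBank:
    # (scene description, reasoning, decision) triples distilled by System-II.
    entries: List[Tuple[str, str, str]] = field(default_factory=list)

    def add(self, scene: str, reasoning: str, decision: str) -> None:
        self.entries.append((scene, reasoning, decision))

    def retrieve(self, scene: str, k: int = 3) -> List[Tuple[str, str, str]]:
        # Toy word-overlap similarity; the real system uses a vector database.
        overlap = lambda e: len(set(scene.split()) & set(e[0].split()))
        return sorted(self.entries, key=overlap, reverse=True)[:k]

def drive_step(scene: str,
               heuristic_lm: Callable[[str, list], str],
               bank: MemoryBank) -> str:
    """Heuristic Process (System-I): quick and domain-specific. Runs every
    frame, conditioned on a few similar past experiences."""
    return heuristic_lm(scene, bank.retrieve(scene))

def accumulate_experience(scene: str,
                          analytic_llm: Callable[[str], Tuple[str, str]],
                          bank: MemoryBank) -> None:
    """Analytic Process (System-II): slow, rational reasoning. Invoked to
    distil reasoning and decisions into the shared memory bank."""
    reasoning, decision = analytic_llm(scene)
    bank.add(scene, reasoning, decision)
```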
Q2: The paper's experimental results are primarily based on the CARLA simulator, lacking real-world experiments. CARLA scenarios are still too simple. It would be better to report results that can comprehensively evaluate the performance of LeapAD, such as using the real-world dataset nuScenes.
A2: It is important to note that our primary focus is on exploring continuous learning in closed-loop autonomous driving based on a dual-process approach. Closed-loop autonomous driving involves interactions between agents and the environment, while publicly available datasets like nuScenes only provide open-loop evaluations. Open-loop evaluations have certain inherent limitations: (1) the current decisions do not influence subsequent navigation, making it impossible to assess cumulative errors; (2) there is no interaction between agents, leading to a lack of dynamic behavior; and (3) there is no global closed-loop evaluation metric. We believe that closed-loop experiments in real-world environments are primarily constrained by the availability of high-fidelity simulators. Currently, there is no well-established high-fidelity simulator, but developing such a simulator is an area we are actively working on.
Q3: This paper is based on the Qwen VLM. It is not clear whether the performance improvement is due to Qwen-VL or the two-system design. It would be better to include more ablation studies exploring the influence of the VLM, such as trying LLaVA.
A3: Our scene understanding module (Qwen-VL) and decision-making module (the dual-process design) are relatively independent components. Qwen-VL provides scene descriptions, which are fed into our dual-process decision-making module for driving reasoning and decision-making. Directly applying Qwen-VL to the decision-making process does not work well: we found experimentally that its output cannot align well with the data format required for decision reasoning.
Empirically, accurate scene understanding positively correlates with the effectiveness of subsequent decision-making. However, since our focus is on exploring the dual-process approach for autonomous driving rather than scene understanding per se, we have not extensively investigated the network architecture of the VLM. We selected Qwen-VL as the scene understanding module due to its demonstrated strengths in visual understanding and grounding on various public benchmarks, aligning well with our requirements.
As suggested by the reviewer, we also added ablation studies to explore the influence of Qwen-VL and LLaVA. These include scene understanding evaluation results on the Rank2Tell (real) and CARLA (simulated) datasets, as well as closed-loop performance, detailed below. We provide Grounded scores, including precision, recall, and F1 score, to assess the models' grounding performance, as well as Chat scores, including a language score (ROUGE) and a GPT score (GPT-4-turbo), to evaluate the models' reasoning and question-answering capabilities (a minimal sketch of the grounding metric follows the tables below). For the closed-loop experiments, we test performance on the first eight routes of the Town05 Short benchmark. From these results, it is evident that Qwen-VL exhibits superior grounding abilities and better DS scores in closed-loop settings.
(1) Evaluations on the collected CARLA data.
| VLMs | Precision | Recall | F1 Score | ROUGE | GPT Score |
|---|---|---|---|---|---|
| LLaVA-1.5-7B | 34.95 | 32.00 | 33.41 | 78.04 | 58.09 |
| Qwen-VL-7B | 51.41 | 47.14 | 49.18 | 83.24 | 63.01 |
(2) Evaluations on the Rank2Tell test dataset.
| VLMs | Precision | Recall | F1 Score | ROUGE | GPT Score |
|---|---|---|---|---|---|
| LLaVA-1.5-7B | 28.49 | 25.22 | 26.75 | 69.26 | 65.20 |
| Qwen-VL-7B | 46.70 | 37.37 | 41.52 | 70.23 | 66.59 |
(3) Closed-loop experiments on Town05 Short benchmark
| VLMs | DS (Driving Score) | RC (Route Completion) | IS (Infraction Score) |
|---|---|---|---|
| LLaVA-1.5-7B | 78.87 | 86.12 | 92.75 |
| Qwen-VL-7B | 88.25 | 100.00 | 88.25 |
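For clarity, the Grounded scores above can be understood through the following minimal sketch of set-level precision, recall, and F1 with greedy IoU matching (the 0.5 threshold and the greedy matching rule are illustrative simplifications, not necessarily our exact protocol):

```python
def grounding_prf(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Set-level precision/recall/F1 for critical-object grounding.
    Boxes are (x1, y1, x2, y2); greedy one-to-one matching by IoU."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union else 0.0

    unmatched = list(gt_boxes)
    tp = 0
    for p in pred_boxes:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_thresh:
            tp += 1
            unmatched.remove(best)
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```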
Thanks for your rebuttal. My concerns have been resolved.
Dear Reviewer kf7r:
We sincerely thank you for your valuable feedback and for acknowledging our efforts.
Your time and thorough review of our work are greatly appreciated.
This paper presents LeapAD, a dual-process closed-loop autonomous driving system.
LeapAD first uses a VLM to analyze the scene by selecting and locating critical objects in the scene, and then it uses a dual-process learning approach to learn driving behaviors.
The dual-process learning system contains an Analytical Process and a Heuristic Process. The Analytical Process is strong but expensive to run. It is used to summarize the driving experience into the Memory Bank. The Heuristic Process is more lightweight and is used to generate controls to control the vehicle. The Heuristic Process is trained with data in the Memory Bank.
The Analytical Process can also reflect from collision events in previous simulation runs. It will analyze the cause of the collisions and save the knowledge in the Memory Bank.
The authors evaluated the LeapAD method in closed-loop simulation with the CARLA simulator. They used the Qwen models as the VLMs and GPT-4 for the Analytical Process.
The evaluation result shows that LeapAD surpasses the performance of the other camera-only models on the CARLA Town05 benchmark.
Strengths
- The dual-process idea is neat and thought-provoking. It equips the autonomous driving system with the ability to learn from past experiences.
- The method achieves stronger performance than state-of-the-art methods in CARLA closed-loop simulation.
- This paper is well-written and provides sufficient details for reproducing the approach.
Weaknesses
- The performance improvement is not very significant compared to the baseline.
Questions
N/A
Limitations
Yes
Dear Reviewer:
Thanks a lot for your acknowledgement, and we appreciate the time and effort you dedicated to enhancing the quality and clarity of our manuscript.
Q: The performance improvement is not very significant compared to the baseline.
A: Thanks for your feedback. As you mentioned in the summary, LeapAD proposes a new paradigm for autonomous driving that addresses the limitations of current data-driven methods in complex scenarios. We focus on verifying the superiority of this dual-process system. At the same time, as mentioned in the limitations section, we are actively researching several directions, such as integrating temporal inputs, enabling the VLM to participate in the reflection process, and developing real-world simulations, to further improve the performance of our system.
Thank you for your response. I will keep my rating.
Dear Reviewer zRJR:
We sincerely thank you for your valuable feedback and for recognizing our work. Your time and thorough review of our work are greatly appreciated.
This paper introduces a paradigm for designing an annotation-efficient end-to-end autonomous driving system that harnesses the power and generalizability of open-source LLMs. It shows that critical frame/instance selection is essential to training the decision-making module. The method is evaluated by closed-loop testing in CARLA and achieves SOTA performance among camera-based methods.
Strengths
- The core idea is straightforward.
- Achieves SOTA results.
- Provides adequate ablation studies to support its claims.
Weaknesses
- No quantitative benchmark on its VLM module on simulation and the real world. Only some samples are listed in the paper.
- The paper only presents an overall benchmark on the system but no failure case analysis.
- The result relies on the foundation model performance and the paper does not show a way to fill the gap between the simulation and the real world, which limits its impact.
Questions
- Why decouple the system into two separate modules, scene understanding and decision-making?
- The scene understanding section mentions that the motion direction is one of the outputs. However, since the input sensor data is single-frame based, how does the model know the motion direction?
- It is not clear how the interaction with GPT4 completes in the reflection mechanism. It would be better to provide more details.
Limitations
See weaknesses.
Dear Reviewer:
Thank you for your constructive comments. We will discuss and explain your concerns as follows.
Q1: No quantitative benchmark on its VLM module on simulation and the real world. Only some samples are listed in the paper.
A1: Thank you for your valuable suggestions. We have added evaluation results for the VLM module on the Rank2Tell (real) and CARLA (simulated) datasets. We evaluated the performance of Qwen-VL-7B on these two datasets, and also evaluated LLaVA-1.5-7B for comparison. We provide Grounded scores, including precision, recall, and F1 score, to assess the models' grounding performance, as well as Chat scores, including a language score (ROUGE) and a GPT score (GPT-4-turbo), to evaluate the models' reasoning and question-answering capabilities (a minimal sketch of the ROUGE-L score follows the tables below). Our experiments show that Qwen-VL-7B demonstrates stronger grounding abilities.
(1) Evaluations on the collected CARLA data.
| VLMs | Precision | Recall | F1 Score | ROUGE | GPT Score |
|---|---|---|---|---|---|
| LLaVA-1.5-7B | 34.95 | 32.00 | 33.41 | 78.04 | 58.09 |
| Qwen-VL-7B | 51.41 | 47.14 | 49.18 | 83.24 | 63.01 |
(2) Evaluations on the Rank2Tell test dataset.
| VLMs | Precision | Recall | F1 Score | ROUGE | GPT Score |
|---|---|---|---|---|---|
| LLaVA-1.5-7B | 28.49 | 25.22 | 26.75 | 69.26 | 65.20 |
| Qwen-VL-7B | 46.70 | 37.37 | 41.52 | 70.23 | 66.59 |
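For reference, the language score above is based on ROUGE; below is a minimal sketch of ROUGE-L (F1) over whitespace tokens (in practice we use a standard evaluation package, which may differ in tokenization and variant details):

```python
def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1: longest common subsequence over whitespace tokens."""
    c, r = candidate.split(), reference.split()
    # Standard LCS dynamic programme.
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i, ct in enumerate(c):
        for j, rt in enumerate(r):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if ct == rt
                                else max(dp[i][j + 1], dp[i + 1][j]))
    lcs = dp[len(c)][len(r)]
    p = lcs / len(c) if c else 0.0
    rec = lcs / len(r) if r else 0.0
    return 2 * p * rec / (p + rec) if p + rec else 0.0
```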
Q2: The paper only presents an overall benchmark on the system but no failure case analysis.
A2: Thanks for your advice. We have included two typical failure cases in the uploaded PDF.
(1) “Running a red light,” as shown in Figure 1 of the uploaded PDF. In this scenario, the system lacks temporal information about the yellow light's remaining duration, making it difficult to decide whether to accelerate through or stop. When the light is yellow, the system cautiously issues a “DC” command, causing the vehicle to cross the stop line slowly. When the light turns red, CARLA interprets this as running a red light, even though a “STOP” command has already been issued.
(2) “Collision,” as shown in Figure 2 of the uploaded PDF. In this case, the VLM did not detect the car at the left rear edge of the field of view due to the camera's field-of-view limitation. Furthermore, in the CARLA setting, other vehicles do not proactively yield to the ego vehicle, leading to collisions caused by other vehicles.
Q3: The result relies on the foundation model performance and the paper does not show a way to fill the gap between the simulation and the real world, which limits its impact.
A3: We use the VLM to observe and interpret the driving environment and provide scene descriptions. The LLM then performs driving reasoning and makes decisions based on these descriptions. Consequently, the domain gap between simulated and real-world scenarios primarily affects the scene descriptions generated by the VLM. Notably, the VLM demonstrates strong generalization capabilities. The data used for fine-tuning includes both simulated and real-world data, which enables our VLM to generate accurate scene descriptions in both contexts. This is illustrated by the cases shown in Figures 6 and 7 of the paper, and further supported by the quantitative experiments (A1) we have included.
While there is an inherent gap between simulation experiments and real-world closed-loop scenarios, we believe the main limitation is the current lack of a high-fidelity simulator in the industry, as highlighted in the limitations.
Q4: Why decouple the system into two separate modules, scene understanding and decision-making?
A4: We have divided the system into separate modules for the following reasons:
(1) The modular design enables easy replacement and upgrading of individual components, particularly the scene understanding module and the Analytic Process.
(2) We adopt the scene understanding module to generate scene descriptions that effectively encode the environment, while the dual-process decision-making module generates the driving reasoning and decisions. Historical scene descriptions and reasoning can be conveniently encoded into a vector database (memory bank) for rapid retrieval of similar scenes, guiding the Heuristic Process toward accurate decisions through a few-shot approach (see the retrieval sketch after this list).
(3) As noted in the limitations section of our paper, the VLM’s inability to participate in the reflection mechanism hinders further system improvements. Addressing this limitation will be a key focus for future development.
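As a minimal illustration of the retrieval step in reason (2) above (the hashing encoder is a toy stand-in for a real sentence encoder, and the interface is assumed rather than taken from our code):

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing encoder so the sketch runs without model weights."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

class VectorMemoryBank:
    def __init__(self):
        self.vecs = []      # unit-normalised scene embeddings
        self.records = []   # (scene description, reasoning, decision)

    def add(self, scene: str, reasoning: str, decision: str) -> None:
        self.vecs.append(embed(scene))
        self.records.append((scene, reasoning, decision))

    def retrieve(self, scene: str, k: int = 3):
        """Cosine similarity (dot product of unit vectors) -> top-k
        experiences, used as few-shot exemplars for the Heuristic Process."""
        sims = np.stack(self.vecs) @ embed(scene)
        return [self.records[i] for i in np.argsort(-sims)[:k]]
```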
Q5: Since the input sensor data is single-frame based, how does the model know the motion direction?
A5: Indeed, accurately assessing vehicle motion based on a single image alone is challenging. Interestingly, we have found that, due to the remarkable generalization abilities of large vision language models, it is possible to infer the motion direction of vehicles simply based on the vehicle’s heading. As noted in the limitations section of our paper, the current version relies solely on single-frame input and lacks temporal information. Incorporating temporal cues to more accurately assess the motion of surrounding objects will be a focus for future work.
Q6: It is not clear how the interaction with GPT4 completes in the reflection mechanism. It would be better to provide more details.
A6: In the appendix (Figure 16), we provide a detailed example of the reflection mechanism and explain it thoroughly in Section F. Specifically, when a traffic incident occurs during the Heuristic Process, the reflection mechanism is triggered. During reflection, information from historical frames is fed into the Analytic Process to identify and correct any potentially erroneous reasoning decisions. These corrections are then added to the memory bank to further enhance the accuracy of the Heuristic Process. For more detailed information, please refer to the relevant sections in the appendix.
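A minimal sketch of this interaction is given below (the prompt wording and the `chat` wrapper are illustrative; the exact prompts are in Figure 16 of the appendix):

```python
from typing import Callable, List

def reflect_on_incident(frame_logs: List[dict],
                        chat: Callable[[str], str]) -> str:
    """frame_logs: recent frames, each with the scene description and the
    decision the Heuristic Process took; `chat` wraps a GPT-4-style API."""
    history = "\n".join(
        f"Frame {i}: scene: {f['scene']} | decision: {f['decision']}"
        for i, f in enumerate(frame_logs))
    prompt = ("A traffic incident just occurred. Review the frames below, "
              "identify the frame with the erroneous reasoning or decision, "
              "and output the corrected reasoning and decision.\n" + history)
    # The returned correction is what gets written to the memory bank,
    # improving the Heuristic Process on similar scenes in the future.
    return chat(prompt)
```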
Thanks for your rebuttal.
Dear Reviewer eSfn:
We sincerely thank you for your valuable feedback and for acknowledging our efforts.
We appreciate the time and effort you spent reviewing our work.
The paper "LeapAD" introduces a new approach to autonomous driving that addresses key challenges in adaptability and interpretability. It draws inspiration from human cognition to enhance decision-making processes in complex environments. The system incorporates two complementary processes:
- Analytic Process: Provides thorough analysis and reasoning, accumulating driving experience through logical reasoning.
- Heuristic Process: Employs swift, empirical processing and learns from the Analytic Process through supervised fine-tuning. This dual-process setup enhances adaptability and performance.
Closed-loop testing in the CARLA simulator demonstrates that LeapAD outperforms methods relying solely on camera input. The Heuristic Process can inherit knowledge from an Analytic Process powered by GPT-4, leading to continuous performance improvements as the memory bank expands.
Strengths
- The paper is generally well-written and easy to follow. The introduction provides good motivation for the model design.
- I like the problem setup: how can we design AV systems that continually learn from their mistakes?
- The experimental results seem to support the authors' claims.
Weaknesses
I overall liked the idea of a closed-loop autonomous driving approach that emulates the critical attention mechanisms required for smooth driving in safety-critical scenarios. The notion of heuristic and analytical processes for executing actions in robotics is a novel approach. However, my primary concern lies in the setup of data and models for generating scene descriptions as text to identify critical objects. Operating in the text domain, which requires subsequent interpretation and tokenization by the analytical and heuristic modules, seems less efficient than using a direct vectorized representation. For instance, representing an object with parameters such as {v = 0.2 m/s, s = 3 m, class = Car} is likely more efficient and robust than the text output "The car is 3 m away from the ego vehicle and is moving at 0.2 m/s." This textual approach could lead to inefficiencies, especially in scenarios with multiple dynamic actors.
Questions
- The authors should detail the data generation process for complex driving scenarios like intersections, lane changes, and overtaking. Based on my understanding, the current model primarily focuses on simpler scenarios involving a single-lane and limited interaction with other actors.
- I recommend evaluations in more dynamic settings such as intersections and scenarios involving lane changes and overtaking, where multiple actors interact and cooperate for safety.
- A comparison with traditional vectorized or feature-based planning systems, such as Wayformer or Precog, would be beneficial. These systems process scenes as images and convert data into vectors instead of text, which might offer insights into efficiency and performance.
- I see DriveLM also has good performance in the case of VLM-based driving. Is there any reason why it has not been included as a baseline, considering the similarity in dataset generation processes?
I look forward to seeing how these suggestions might be incorporated to further enhance the robustness and applicability of the proposed approach in more complex driving scenarios.
Limitations
N/A
Dear Reviewer:
Thank you for your constructive comments. We provide discussions and explanations about your concerns as follows.
Q1: My primary concern lies in the setup of data and models for generating scene descriptions as text to identify critical objects. Operating in the text domain, which requires subsequent interpretation and tokenization by the analytical and heuristic modules, seems less efficient than using a direct vectorized representation.
A1: Indeed, adopting vectorized representations can significantly compress data, thereby accelerating inference speed. However, by logically integrating target attributes (such as orientation, distance, and category) into coherent natural language statements, we not only clarify the meaning of these physical quantities but also better align with the data distribution of pre-trained large language models. This enables us to leverage the embedded world knowledge in these models to enhance their understanding of autonomous driving environments while maintaining their generalization capabilities.
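As a simple illustration of this serialization step (the field names and template wording here are assumptions for illustration, not our exact templates):

```python
def describe(obj: dict) -> str:
    """Serialise one critical object's attributes into a sentence."""
    return (f"The {obj['cls']} is {obj['dist_m']:g} m away from the ego "
            f"vehicle and is moving at {obj['speed_mps']:g} m/s.")

print(describe({"cls": "car", "dist_m": 3, "speed_mps": 0.2}))
# -> The car is 3 m away from the ego vehicle and is moving at 0.2 m/s.
```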
It is also worth noting that, as shown in Table 1 of our paper, compared to traditional vectorized or feature-based methods such as VAD, InterFuser, and TransFuser, the amount of labeled data we use is typically one to two orders of magnitude smaller. Furthermore, our dual-process decision-making process requires no human intervention and is capable of continuous improvement, clearly demonstrating the effectiveness of knowledge-driven methods.
Q2: Evaluations in more dynamic settings such as intersections and scenarios involving lane changes and overtaking, where multiple actors interact and cooperate for safety.
A2: In fact, the closed-loop benchmark in CARLA Town05 involves numerous dynamic settings, such as intersections, traffic lights, STOP signs, and pedestrians or cyclists suddenly appearing from the side. Our method achieves strong performance across these diverse scenarios, demonstrating its adaptability to dynamic environments. Additionally, it is important to emphasize that our approach can handle complex situations such as intersections and some corner cases. As shown in Figures 13 and 14 in the appendix, our method is able to handle intersections with multiple interacting actors (vehicles and traffic lights, Figure 13) and react appropriately to unexpected events, such as a cyclist suddenly appearing from the side (Figure 14), by making timely decisions like slowing down.
Moreover, the project page linked in our paper's abstract includes several demos that further illustrate our method's adaptability in dynamic environments. For example, the first video shows our method slowing down and coming to a stop when a cyclist suddenly appears at an intersection (between the 5-15 second and 35-40 second marks). The second video, between the 25-45 second marks, shows our method navigating a complex intersection with multiple participants, and the third video demonstrates our approach stopping briefly at an intersection with a STOP sign.
Q3: A comparison with traditional vectorized or feature-based planning systems, such as Wayformer or Precog, would be beneficial. These systems process scenes as images and convert data into vectors instead of text, which might offer insights into efficiency and performance.
A3: It is important to note that our primary focus is on exploring continuous learning in closed-loop autonomous driving based on a dual-process approach, using image inputs and a knowledge-driven method. Unlike our approach, Wayformer does not use image inputs but rather relies on ground truth sparse abstract state descriptions of the world, while Precog requires LiDAR input. Both of these methods are evaluated in an open-loop setting, which has certain inherent limitations: (1) the current decisions do not influence subsequent navigation, making it impossible to assess cumulative errors; (2) there is no interaction between agents, leading to a lack of dynamic behavior; (3) and there is no global closed-loop evaluation metric. In fact, we have already compared our method with traditional vectorized or feature-based systems. As shown in Table 1, methods like InterFuser, TransFuser, and VAD use neural networks to represent scenes as implicit vectors. Our method outperforms most of these approaches while only requiring one to two orders of magnitude less labeled data, and it also has the capability for continuous learning. This further demonstrates the effectiveness of leveraging knowledge from large models.
Q4: I see DriveLM also has good performance in the case of VLM based driving. Is there any reason why that has not been put as a baseline, considering the similarity in dataset generation processes.
A4: Our LeapAD features a dual-process approach for closed-loop autonomous driving, whereas DriveLM only reports open-loop performance without providing closed-loop metrics. Additionally, DriveLM has not released the CARLA dataset, the fine-tuned network weights on CARLA, and the inference code on CARLA, making it difficult to conduct closed-loop experiments using DriveLM. Furthermore, DriveLM is based on the BLIP-2 architecture, while Qwen-VL, which we use, supports higher resolutions and has demonstrated better grounding and vision-language understanding capabilities across multiple benchmarks, even with a similar parameter count. Finally, the output from the original DriveLM is quite verbose. For our dual-process decision method, the scene understanding-related data in DriveLM is more important. Therefore, we refined DriveLM's data and combined it with Rank2Tell and collected CARLA data to fine-tune Qwen-VL as our scene understanding module.
I would like to keep my score.
This is where I disagree. In the case of, say, 5 vehicles and 5 pedestrians, the text generation part will be noisy and inaccurate. In this work, the model gets good results because, in most of the test cases, you only have uncluttered traffic signs and 1-2 actors. So describing the situation in text is very easy compared to dense urban driving. Textual in-context learning will do poorly there compared to vectorized methods.
The authors have also not responded to my question 1, where I requested details on the data generation process for complex driving scenarios like intersections, lane changes, and overtaking. This would have cleared my doubts about whether this method will work in urban driving.
Dear Reviewer:
Thank you for your response.
Regarding the text generation for complex scenarios that you mentioned:
- First, we also recognize the complexity of driving scenarios. Therefore, as emphasized in our paper, we focus exclusively on describing the key objects that may influence driving decisions, reducing complexity and enhancing the efficiency of subsequent decision-making. In practice, we only need to consider traffic participants who might interact with the ego car, such as nearby vehicles or pedestrians who may cross the road. For example, in the case of, say, 5 vehicles and 5 pedestrians, only one or two of each are actually important for driving decision-making (a toy filtering sketch follows this list).
- Secondly, we provide textual descriptions with grounding boxes for each key target to further identify and differentiate these traffic participants; please refer to Section 4.1 and the case study for more details.
- Finally, as shown in our experiments (specifically Figure 8 of our paper), our method effectively grounds these critical traffic participants, even in complex scenes with many people (>5).
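The toy sketch below illustrates the filtering idea in the first point above (the distance threshold and the `closing` flag are illustrative assumptions, not our actual selection criteria):

```python
def critical_objects(actors, max_dist_m=30.0):
    """Keep only actors that are nearby AND on a potentially conflicting
    course with the ego vehicle; everything else is omitted from the
    scene description."""
    return [a for a in actors
            if a["dist_m"] <= max_dist_m and a["closing"]]

actors = [
    {"cls": "car", "dist_m": 8.0, "closing": True},        # nearby, conflicting
    {"cls": "car", "dist_m": 80.0, "closing": False},      # far, parallel
    {"cls": "pedestrian", "dist_m": 5.0, "closing": True},
    {"cls": "pedestrian", "dist_m": 60.0, "closing": False},
]
print(critical_objects(actors))  # only the two nearby, conflicting actors
```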
Regarding the data generation process in question 1, we provided a detailed explanation in Section 4.1 of the paper, as well as in Appendices B and C. In fact, the data generation process is applicable across various complex scenarios. For scene understanding, we primarily describe the semantic attributes, spatial motion properties, and behavioral reasoning of key targets; please refer to Figure 9 for more details. For more prompt details for the decision module, please refer to Figures 10 and 11.
Dear Reviewers,
Thank you very much for taking the time to review this manuscript and for helping to improve our work. I greatly appreciate all your comments and suggestions. Please find my detailed responses below.
As suggested by Reviewer eSfn (Q2), we have included visualizations of the failure cases in the uploaded document. Additionally, due to the page limitation, please refer to A2 for detailed failure case analyses.
The paper "LeapAD" introduces a novel approach to autonomous driving, drawing inspiration from human cognitive processes to enhance decision-making in complex environments. The system integrates two complementary processes: an Analytic Process (System-II) and a Heuristic Process (System-I). The Analytic Process provides thorough, logical reasoning and accumulates driving experience, while the Heuristic Process enables quick, empirical decision-making by leveraging knowledge learned from the Analytic Process. LeapAD's design is centered on the efficient use of labeled data, leveraging open-source large language models (LLMs) to analyze and interpret scenes by focusing on critical objects. The system's dual-process framework allows it to balance the thoroughness of the Analytic Process with the speed of the Heuristic Process, enhancing both adaptability and performance. Evaluation of LeapAD in closed-loop testing using the CARLA simulator shows that it outperforms existing camera-only methods, achieving state-of-the-art performance. The Heuristic Process benefits from continuous improvement through a reflection mechanism that allows the Analytic Process to analyze collision events and update the system's memory bank, further refining decision-making capabilities over time. This approach highlights LeapAD's potential to address the challenges of adaptability and interpretability in autonomous driving.
The paper is well-written and easy to follow. The authors and reviewers engaged in a fruitful discussion in which most of the issues could be resolved. The authors provided additional quantitative experimental evidence (e.g., on failure cases and on the VLM module).
There was some discussion about the fact that LeapAD uses text to encode the scene (rather than a vectorized representation), which is still left open. However, in summary, all reviewers voted for acceptance. I therefore recommend accepting the paper and encourage the authors to use the feedback provided to improve the paper for its final version, and to add some discussion on the open point of the textual representation.