PaperHub
Rating: 6.3/10 · Poster · 3 reviewers
Scores: 7, 6, 6 (min 6, max 7, std 0.5)
Average confidence: 3.3
COLM 2024

ScenicNL: Generating Probabilistic Scenario Programs from Natural Language

OpenReview · PDF
Submitted: 2024-03-23 · Updated: 2024-09-05
TL;DR

A Compound AI System that constructs Probabilistic Scenario Programs from DMV crash reports, where each sample yields a possible reconstruction of the accident in simulation.

Abstract

Keywords
probabilistic programming languages, cyber physical systems, autonomous vehicles, scenarios, domain specific languages, low resource languages

Reviews and Discussion

Review
Rating: 7

This paper describes a system to generate a 3D animated scene of vehicle accidents from written descriptions. The processing pipeline takes a description as input and passes it to a large language model that generates a program to control the CARLA visualization system, which produces the animation.

The authors collected a corpus of accident narratives that they classified according to their interpretation difficulty. Given a text input, the authors used an LLM and a sequence of prompting strategies to identify the actors in the text, infer unknown conditions such as the weather when not described in the text, and generate the code to create the scene and animate the actors. The authors used LMQL, an SQL-like LLM query language, to create this code from a set of constraints on the actors. Finally, the code serves as input to the CARLA API to produce the animation.
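A minimal Python sketch of the pipeline stages as described; every function here is a hypothetical stub standing in for an LLM prompt or a CARLA call, not the authors' API:

```python
def extract_actors(report: str) -> list[str]:
    # Stub: an LLM prompt in the real system.
    return ["Vehicle 1", "Pedestrian 1"]

def infer_missing_conditions(report: str) -> dict[str, str]:
    # Stub: inferred by the LLM when the report omits a condition.
    return {"weather": "clear"}

def generate_scenic_program(actors: list[str], conditions: dict[str, str]) -> str:
    # Stub: the real system uses LMQL-constrained decoding here.
    return "param weather = 'clear'\nego = new Car\n"

def report_to_animation(report: str) -> str:
    actors = extract_actors(report)
    conditions = infer_missing_conditions(report)
    program = generate_scenic_program(actors, conditions)
    return program  # the real system hands the program to the CARLA API

print(report_to_animation("Vehicle 1 struck Pedestrian 1 in a crosswalk."))
```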

The authors evaluated their system on the easy part of their dataset with a few metrics, but not the final visualization.

Reasons to Accept

  • Interesting and ambitious application
  • Smart aggregation of tools: LLMs and CARLA visualization to produce 3D animations
  • Partial evaluation that hints at promising performance

Reasons to Reject

  • The system seems to be at an early development stage
  • The part describing the generation of code with LMQL and HyDE would be impossible to reproduce from the description in the article
  • The authors' contribution is sometimes difficult to evaluate, as they merely glue together existing programs and resources
  • The system is only partially evaluated. Notably, the authors do not address the validity of the final visualization
  • The authors used a curious wording in the introduction to describe the problem they solve. See the paragraph starting with "Given the critical need for simulation based testing..."

Questions for the Authors

  • Did you think of an evaluation of the final generation?
  • Will you release your code or instructions, especially the part involving LMQL?
  • How did you create the example Scenic programs from Fig. 1?
Author Response

Did you think of an evaluation of the final generation?

We refer the reviewer to Section 4, where we did perform a full evaluation. As part of our evaluations, we had 3 people familiar with Scenic read the output programs to judge their Accuracy, Relevance, and Expressiveness. Details of the rubric used will be provided in the appendix. This is one of the main benefits of using Scenic as an abstraction over simulators, i.e., Scenic eliminates the need to sit through hours of video to judge a simulation. Consider the example of an object smaller than a car blocking the path of that car. If the camera angle of the simulator is set to follow the vehicle, no one would ever know what object is blocking it. By reading the program, however, one would immediately see that there is an object located in the same lane as the vehicle.

In Section 3.5, we experimented with using a Vision-Language Model to determine whether such models could automate part of the human evaluations. However, as mentioned, this is left as future work because several problems prevent VLMs from replacing our human evaluators: (i) how many camera angles, and which ones, are needed to capture all objects? (ii) when the VLM fails to understand spatial relationships between objects, how do we proceed? (iii) should the feedback be incorporated into the natural language reasoning of the LLM judges or directly into the programs?

Will you release your code...

Yes, by the camera-ready deadline.

How did you create the example scenic programs from Fig. 1?

The example programs used in Figure 1 were sourced from the publicly available programs in Scenic's GitHub repository, specifically those using CARLA as a backend simulator: https://github.com/BerkeleyLearnVerify/Scenic/tree/main/examples/carla

The authors' contribution is sometimes difficult to evaluate...

We refer the reviewer to the several existing works we cited, which have either (i) used a similar data source and performed reconstructions in simulation by paying many crash reconstruction experts to do it by hand (Scanlon et al. with Waymo), or (ii) solved a limited set of scenarios that were not real-world crash scenarios but rather simple driving examples, e.g., TARGET by Deng et al. Moreover, these methods relied on a template-based approach, e.g., ADEPT by Wang et al.

Comment

I found the rebuttal convincing and I increased my score.

Comment

Thanks for the authors' reply! I increased my score by 1 because I like this work's application and the methods seem to achieve good performance.

Review
Rating: 6

Due to the scarcity of real-world recorded car crash videos and sensor data, there is a heavy demand for simulation-based data. This paper proposes to use LLMs to generate Scenic programs based on natural language descriptions from car crash reports. These programs can then be simulated and used to train more robust autonomous driving systems.

They combine many prompt engineering techniques and propose ScenarioNL, which achieves 90% program syntactic correctness and receives high human ratings (4.3 out of 5 on average).

Reasons to Accept

  1. They apply LLMs in an interesting, novel, and challenging scenario.

  2. They thoroughly explore and combine many prompt engineering techniques.

  3. Their results show that their final method (ScenarioNL) works very well compared to any single prompt engineering method.

  4. They show that, through careful prompt engineering, LLMs can be used to generate code in domain-specific languages that LLMs have not seen and for which little finetuning data exists.

Reasons to Reject

  1. The paper is more like a technical report for solving a domain-specific code generation task via prompt engineering. I find limited research interest in it.

  2. Though the paper tries to claim that their methods are not unique to Scenic, the generalizability is concerning. ScenarioNL is more like a task-specific recipe; I don't see how it can easily generalize to other domain-specific languages.

  3. As an application oriented paper, the discussion and evaluation of how useful the LLM-generated programs are for downstream training of autonomous driving systems is quite thin.

Questions for the Authors

  1. Do you plan to release your code and the generated programs? Because the whole system involves many steps, there might be reproducibility issues without a code release.
Author Response

Do you plan to release your code and the generated programs...

We will release our code by camera ready.

The paper is more like a technical report...

Our baseline evaluations demonstrate that LLMs cannot generalize to writing code for domain-specific languages out of the box. However, as shown in our work, the combination of filtering, prompting, and reasoning can help guide an LLM towards writing code in new settings. The space of ways one can use an LLM is huge and continuously growing; our work offers one successful method in a space full of many unsuccessful paths.

Concerning limited research interest, we kindly refer the reviewer to the introduction which cites several works from both industry & academia that have attempted to solve the same problem but in a limited manner.

Though the paper tries to claim that their methods are not unique...

One of our motivations was to study simulation of multi-agent systems in physical environments. This motivation is shared by industry, e.g., OpenAI's SORA is their proposal for "general purpose simulators of the physical world" (https://openai.com/index/video-generation-models-as-world-simulators/). Unlike OpenAI, we wanted an interpretable, controllable representation with precise semantics for simulation generation and control. Scenic was a natural fit, as evidenced by the related works that have also tried to generate simulations using Scenic. As per Scenic's documentation, it supports several simulators apart from CARLA and is used in other domains such as aviation and robotics. While these other domains are interesting, they do not offer a dataset of recorded failures. As mentioned in the introduction, several regulatory bodies record these interesting failure cases for autonomous vehicles. Finally, as mentioned in Section 3.3, we keep all reasoning and soliciting of information in natural language. The only portions that include Scenic-specific statements are the Few-Shot and Constrained Decoding prompts, which help generate syntactically correct code.

As an application oriented paper...

Simulation-based testing is a popular evaluation method for industry leaders such as Waymo. One potential use is for the companies whose accidents we replicated to retest their simulated vehicles using our programs, or to use them as an evaluation benchmark. Another possibility, as mentioned in the conclusion, is to leverage the simulations in the training of RL-based agents (also an area where Scenic has been used).

Review
Rating: 6

This paper introduces ScenarioNL, an AI system that creates scenario programs from natural language in police crash reports. It uses a Probabilistic Programming Language called Scenic to handle uncertainties about the scenarios. The system, which combines LLMs, prompting strategies, a compiler, and a simulator, has been tested on autonomous vehicle crash reports from California.

Reasons to Accept

  • Very clear and compelling motivation that helps the reader understand the problem of creating scenario programs to model CPS.
  • The processed AV DMV crash report dataset is a useful contribution for the research community.
  • The methods underlying ScenarioNL highlight some limitations of LLMs in out-of-domain settings and show how to circumvent them.

Reasons to Reject

  • There are few details of the evaluation setup. For example, how many experts were used to evaluate the outputs? Were the outputs anonymized?
  • The system behind ScenarioNL is interesting but has several components that are not presented in detail. It would be beneficial to have a more thorough description of exactly which components are used in ScenarioNL and which ones contribute the most to the overall performance. In particular, are all the components needed?
  • The RAG/RAG+HyDE performance is very high compared to the rest. Why is that? How did you ensure that the data store used does not contain examples that are similar to the solution?

Questions for the Authors

  • Why is expressiveness interesting to track and how is it correlated with desirable properties? Wouldn’t accuracy or semantic similarity to the language prompt be a more interesting metric?
  • See questions regarding the evaluation setup
  • See questions about RAG results
  • Have you tried providing the Scenic documentation in a prompt? It would be interesting to know the performance of this simple approach, to understand the degree to which LLMs are capable of relying on a description of a DSL
Author Response

Why is expressiveness interesting to track...

Expressiveness is a property that is exercised more heavily in probabilistic programs than in imperative programs. For example, a police report may indicate that a vehicle was traveling at "approximately 35 mph". A low-expressiveness program would assign that speed directly; a high-expressiveness program would assign the speed as a distribution (around 35 mph). This is valuable downstream because each sample of the program would result in a different vehicle speed, which could lead to different behaviors and different findings about how the autonomous driving systems react.
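As an illustration, a minimal Python sketch of the two cases; the 2.5 mph spread is an assumed value, and in Scenic itself this would be written with a distribution such as Range(30, 40) or Normal(35, 2.5):

```python
import random

# Low expressiveness: copy the point estimate straight from the report.
ego_speed = 35.0  # mph, identical in every sample

# High expressiveness: model "approximately 35 mph" as a distribution,
# so every sample yields a slightly different reconstruction.
ego_speed = random.gauss(35.0, 2.5)  # mph; the 2.5 spread is assumed
print(f"sampled ego speed: {ego_speed:.1f} mph")
```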

Accuracy and Relevance are also human-evaluated measures recorded in our work. The full rubric for the three proposed measures will be provided in the Appendix.

There are few details of the evaluation setup. ...

Three people familiar with Scenic evaluated a random subset of the outputs for each experiment. Please let us know if further questions remain.

The RAG/RAG+HyDE performance is very high compared to the rest. ...

The examples used are the handful of programs publicly available in the Scenic GitHub repository. These example programs are not related to autonomous vehicle accidents. In fact, to the best of our knowledge there is no prior work that leverages the crash reports of the California DMV; therefore, none of the scenarios used were previously seen. Could you kindly clarify the first part of the question? The compilation rate of RAG was 1% and RAG+HyDE was 7%, as detailed in Table 1, which in fact was not the highest among the baseline methods.

Have you tried providing the Scenic documentation in a prompt? ...

We tried this but did not run a full baseline evaluation due to poor performance. All Scenic documentation is on GitHub. We took the full documentation (64,265 tokens using the Llama tokenizer), chunked it, and loaded it into a Pinecone vector DB. We then tried using RAG with this database alone, as well as a combined approach of RAG from this database and RAG with a single retrieved Scenic program (drawn from all available Scenic programs, including those using different simulators in different domains, for a total of 73,825 tokens), but found that in either case performance was poor. This is most likely because the documentation explains how things are implemented under the hood (no pun intended) rather than offering examples and explaining proper usage of Scenic features and syntax.
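A minimal sketch of this chunk-embed-upsert-query flow; the index name, embedding model, chunker, and file path below are illustrative assumptions, not the setup used in the paper:

```python
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("scenic-docs")  # hypothetical pre-created index (dim 384)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def chunk(text: str, size: int = 1000) -> list[str]:
    # Naive fixed-width chunking; the rebuttal does not specify the chunker.
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = open("scenic_docs.txt").read()  # hypothetical local dump of the docs
index.upsert(vectors=[
    {"id": f"doc-{i}", "values": embedder.encode(c).tolist(),
     "metadata": {"text": c}}
    for i, c in enumerate(chunk(docs))
])

# At generation time, retrieve the top-k documentation chunks for the
# crash narrative and prepend them to the code-generation prompt.
query = "Vehicle 1 rear-ended Vehicle 2 while merging in light rain."
res = index.query(vector=embedder.encode(query).tolist(), top_k=5,
                  include_metadata=True)
context = "\n\n".join(m["metadata"]["text"] for m in res["matches"])
```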

Comment

Thank you for your responses! I stand by my score as I think there are some missing important details in the evaluation setup and system that need to be described and analyzed further.

Final Decision

This is an interesting paper, with a bit of an out-of-the-ordinary feel to it relative to current LLM work, so it could complement the conference's program well. The reviewers overall seemed to think the paper is in good shape for publication (following a discussion), even if it has some faults that could potentially be corrected.