The necessity of making PADriver an MLLM for HighwayEnv is unclear.

The most important contribution of this paper is the proposed PADriver, an MLLM trained for HighwayEnv. As stated at Line 044, the main motivation is that a text description alone cannot capture enough scene information. The authors therefore use vision inputs in the form of BEV rasterizations.

However, the BEV image from HighwayEnv provides no information beyond the locations of the agents in the scenario, which can easily be extracted as floating-point numbers and provided either as text input or as vectorized low-dimensional features. It is therefore unclear to me why vision input is necessary and what benefit it provides for HighwayEnv. It seems to me that text inputs, or simply agent-location vectors, would be sufficient.
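To make this point concrete, here is a minimal sketch of how agent states could be passed as a flat vector or as text. The `AgentState` container, its field names, and the example values are my own assumptions for illustration; they are not taken from the paper or from HighwayEnv's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentState:
    # Hypothetical container; field names are illustrative assumptions.
    x: float   # longitudinal position (m)
    y: float   # lateral position (m)
    vx: float  # longitudinal speed (m/s)
    vy: float  # lateral speed (m/s)


def to_vector(agents: List[AgentState]) -> List[float]:
    """Flatten all agent states into one low-dimensional feature vector."""
    vec: List[float] = []
    for a in agents:
        vec.extend([a.x, a.y, a.vx, a.vy])
    return vec


def to_text(agents: List[AgentState]) -> str:
    """Render the same states as a plain-text scene description."""
    lines = []
    for i, a in enumerate(agents):
        # By assumption, the first entry is the ego vehicle.
        name = "ego" if i == 0 else f"vehicle {i}"
        lines.append(
            f"{name}: x={a.x:.1f} m, y={a.y:.1f} m, "
            f"vx={a.vx:.1f} m/s, vy={a.vy:.1f} m/s"
        )
    return "\n".join(lines)


# Made-up two-vehicle scene, used only to exercise the functions.
scene = [AgentState(0.0, 4.0, 25.0, 0.0), AgentState(30.0, 8.0, 22.0, 0.0)]
scene_vector = to_vector(scene)
scene_text = to_text(scene)
```

Both representations carry the same positions and velocities that a BEV rasterization would depict, which is why I would expect them to be a reasonable baseline input.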

In short, I agree that vision input might be important for more realistic environments such as CARLA. However, since this work only conducts experiments on HighwayEnv, the purpose of using an MLLM is unclear.

The necessity of training the MLLM PADriver instead of directly prompting SOTA MLLMs is unclear.

Related to the previous point, PADriver is trained on a large amount of driving data generated by rule-based policies as well as human annotations. However, because the BEV rasterization in HighwayEnv is a very simple and intuitive representation of the scenario, it is quite likely that current MLLMs (e.g., GPT-4o) can already solve the PAD problem with relatively good performance.

DiLu has already shown that text-only GPT models perform well in HighwayEnv with text input. I believe that, with proper prompting and input formatting, it would be straightforward to adapt DiLu's prompts so that GPT-4o takes the image as input and outputs personalized driving behaviors. To demonstrate that training an MLLM is necessary, I believe it is important to include GPT-4o's performance.

The use of personalized driving prompts is very limited.

Although the paper emphasizes "personalized driving prompts", the personalization is limited to "slow, medium, and fast" and does not include detailed personalized prompts such as "always try to drive on the leftmost side", "try to keep a distance of 10m from the car in front", or "do not exceed acceleration of XX level".

Detailed personalized prompts are expected because PADriver is an MLLM, so it is natural for it to accept free-form, highly informative language descriptions of personalized needs. The three speed levels currently supported could simply be fed to the model as a 3-dimensional one-hot vector, without any text involved.
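As a minimal sketch of what I mean by the one-hot alternative, the three levels could be encoded as follows. The function name `one_hot_speed` and the ordering of the levels are my own assumptions, not something specified in the paper.

```python
from typing import List

# The three personalization levels named in the paper; the ordering is assumed.
SPEED_LEVELS = ("slow", "medium", "fast")


def one_hot_speed(level: str) -> List[int]:
    """Encode a speed preference as a 3-dimensional one-hot vector."""
    if level not in SPEED_LEVELS:
        raise ValueError(f"unknown speed level: {level!r}")
    return [1 if level == s else 0 for s in SPEED_LEVELS]
```

For example, `one_hot_speed("medium")` returns `[0, 1, 0]`. A conditioning signal this small does not require a language interface, which is why richer free-form prompts seem necessary to justify the MLLM design.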

As personalized driving is the core focus of this paper, I consider this limitation an important drawback of the work.