PaperHub
Average Rating: 4.0 / 10 (Rejected · 3 reviewers)
Individual Ratings: 6, 3, 3 (min 3, max 6, std 1.4)
Confidence: 3.7 · Correctness: 2.3 · Contribution: 2.7 · Presentation: 2.0
ICLR 2025

SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We propose SafeAuto, which includes a specialized PDCE loss for low-level control to improve precision and safety, and which enhances high-level action prediction by integrating past driving experiences and precise traffic rules into multimodal models.

Abstract

Traditional autonomous driving systems often struggle to harmonize high-level reasoning with low-level control, leading to suboptimal and even unsafe driving behaviors. The emergence of multimodal large language models (MLLMs), capable of processing visual and textual data, presents an opportunity to unify perception and reasoning tasks within a single framework. However, integrating precise safety knowledge into MLLMs for safe autonomous driving remains a significant challenge. To address this, we propose SafeAuto, a novel framework that enhances MLLM-based autonomous driving systems by incorporating both unstructured and structured knowledge. In particular, we first propose the Place-Dependent Cross-Entropy (PDCE) loss function, which is specifically designed to enhance the accuracy of low-level control signal predictions when treating numerical values as text. To explicitly integrate precise safety knowledge into the MLLM to enable safe autonomous driving, we build a reasoning component for SafeAuto, which first parses driving safety regulations into first-order logic rules (e.g., "red light $\implies$ stop") and then integrates these rules into a probabilistic graphical model, such as a Markov Logic Network (MLN). The environment attributes, identified by attribute recognition models (e.g., detecting a red light), are used to form the predicates in MLN. In addition, the environmental attributes utilized for reasoning are also considered factors in retrieval to construct a Multimodal Retrieval-Augmented Generation (RAG) model, which aims to learn from past similar driving experiences more effectively. Extensive experiments demonstrate that SafeAuto significantly outperforms baselines across multiple datasets. By bridging the gap between high-level reasoning and low-level control, SafeAuto paves the way for more accurate, reliable, and safer autonomous driving, facilitating systems that learn effectively from experience, adhere to traffic regulations, and execute precise control actions.
Keywords
Autonomous Driving; Multimodal Large Language Models; Multimodal Retrieval-Augmented Generation; Probabilistic Graph Model

Reviews and Discussion

Review
Rating: 6

The paper presents "SafeAuto," a framework for enhancing autonomous driving through multimodal foundation models. It focuses on high-level action prediction and justification, as well as low-level motion prediction, leveraging datasets like BDD-X and DriveLM. The authors claim improvements in performance metrics such as BLEU, CIDEr, and METEOR. However, the paper largely reads like a technical report, lacking significant theoretical contributions or novel methodologies, which raises questions about its overall impact and originality.

Strengths

  1. The paper provides a comprehensive set of experimental results demonstrating improvements in performance metrics, which is a solid contribution.
  2. The experimental design and results are well-organized and clearly presented.
  3. The plug-and-play nature of the framework allows it to be integrated with existing multimodal learning methods.

Weaknesses

  1. The paper appears to rely on established methodologies without introducing substantial theoretical advancements, limiting its originality and depth.
  2. While the components such as the PDCE loss and post-safety verification are interesting, they do not significantly diverge from existing techniques.
  3. There is little discussion on how different predicates are selected or their impact on the overall prediction performance.
  4. The evaluation does not seem to adequately address edge cases or scenarios where traditional models might fail. A deeper examination of how the framework handles uncommon but critical driving situations could provide insights into its robustness and safety.

Questions

  1. Can the authors elaborate on the specific mechanisms used to extract and integrate environmental predicates into the retrieval process? How do these predicates influence decision-making during high-level action predictions?
  2. The paper mentions a gradual increase in the value of σ during training for the PDCE loss. How was this approach determined, and what empirical evidence supports its effectiveness compared to alternative strategies?
  3. How does the framework handle potential conflicts between high-level predictions and low-level control signals?
Comment

We are deeply grateful to the reviewer for their insightful and thorough feedback, and we appreciate the recognition of our work's contribution! Your suggestions and comments have significantly helped us improve the quality of our work.

Q1: While the components such as the PDCE loss and post-safety verification are interesting, they do not significantly diverge from existing techniques.

Thank you for your valuable feedback! We apologize for any confusion regarding the PDCE loss and post-safety verification components.

PDCE Loss: The standard Cross-Entropy (CE) loss evaluates predictions at the token (digit) level, which can lead to situations where numerically closer predictions do not correspond to lower loss values. For instance, predicting "11.99" for the target "12.46" may result in a higher CE loss than predicting "14.46," even though "11.99" is numerically closer to "12.46." This occurs because CE loss emphasizes the correctness of individual digits over overall numerical proximity. To our knowledge, our proposed PDCE loss is the first attempt to address this limitation by bridging the gap between CE and MSE losses for numerical string predictions. PDCE ensures that numerically closer predictions correspond to lower loss values through a combination of positional weighting and a Gaussian-based probability distribution.
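To make the digit-level effect concrete, here is a small self-contained Python sketch (our own illustration, not code from the paper). It uses a toy model that puts probability p on each digit it predicts and spreads the remainder uniformly, and shows that the numerically closer string "11.99" receives a higher token-level CE than "14.46":

```python
import math

def toy_digit_ce(pred: str, target: str, p: float = 0.9, vocab: int = 11) -> float:
    """Token-level cross-entropy under a toy model that puts probability p on
    its own predicted token and spreads (1 - p) over the rest of a vocabulary
    of ten digits plus the decimal point."""
    assert len(pred) == len(target)
    off = (1.0 - p) / (vocab - 1)  # probability given to any non-predicted token
    # CE only looks at the probability assigned to the *target* token.
    return sum(-math.log(p if a == b else off) for a, b in zip(pred, target))

target = "12.46"
print(toy_digit_ce("11.99", target))           # 3 wrong digits -> high CE (~14.0)
print(toy_digit_ce("14.46", target))           # 1 wrong digit  -> lower CE (~5.0)
print(abs(11.99 - 12.46), abs(14.46 - 12.46))  # yet 11.99 is numerically closer
```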

Post-Safety Verification: While post-safety verification has been explored in other domains, its application in autonomous driving, particularly in conjunction with LLMs, is limited. In our work, we first emphasize the critical role of post-safety verification and introduce a concrete framework based on Markov Logic Networks. This framework enhances the safety and reliability of autonomous driving systems by systematically verifying safety constraints after the model's predictions.

We will make our contributions clearer in the revision; thank you for bringing these points to our attention!

Q2: There is little discussion on how different predicates are selected or their impact on the overall prediction performance

Thank you for your insightful comment! Our approach to selecting predicates begins with crawling the California Driver Book. We map all natural language descriptions into first-order logical rules, such as “red light” ⇒ “stop”. From these rules, we extract and collect all the predicates used in the formulas. We then select the predicates that can be reliably detected by tools like YOLOv8 or are already provided in the dataset, such as DoubleDashedWhiteLineLeft, SingleZigzagWhiteLineLeft, and SingleSolidYellowLineRight. Detailed information about this process is provided in Appendix A.

The impact of different predicates on overall prediction performance is reflected in the weight of the formulas they participate in as shown in Figure 1. Predicates that are part of higher-weighted formulas will usually have a more significant influence on the prediction outcomes.

In the BDD-X dataset, the most critical traffic rule is expressed as SolidRedLight(x) ⇒ ¬Accelerate(x) ∧ ¬LeftPass(x) ∧ ¬Yield(x), while for the DriveLM dataset, the key traffic rule is RedYieldSign(x) ⇒ ¬Fast(x). Thus, the predicates appearing in these rules are the most important.
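To illustrate how such weighted first-order rules can be checked against detected predicates, here is a minimal sketch (our own simplification for exposition; the predicate names mirror those in the paper, but the weights are hypothetical, and the actual system performs full MLN inference with learned weights):

```python
# Weighted rules of the form (weight, antecedent predicate, forbidden actions).
RULES = [
    (5.0, "SolidRedLight", {"Accelerate", "LeftPass", "Yield"}),
    (4.0, "RedYieldSign", {"Fast"}),
    (2.0, "DoubleDashedWhiteLineLeft", {"LeftPass"}),
]

def violation_penalty(detected: set, action: str) -> float:
    """Sum the weights of all rules whose antecedent holds in the scene and
    whose consequent the proposed action violates. This is a crude stand-in
    for MLN inference, which scores entire possible worlds instead."""
    return sum(w for w, pred, banned in RULES
               if pred in detected and action in banned)

detected = {"SolidRedLight"}  # e.g., from a YOLOv8 attribute-recognition model
for action in ["Accelerate", "Stop"]:
    print(action, violation_penalty(detected, action))
# Accelerate incurs penalty 5.0 under a red light; Stop incurs 0.0.
```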

To further quantitatively illustrate this, we conducted additional experiments on the number of predicates used in the BDD-X dataset. The results are summarized in the table below:

| Top-k Environmental Predicates (EP) | Accuracy (%) |
| --- | --- |
| Top 5 EP | 87.75 |
| Top 10 EP | 92.10 |
| Top 15 EP | 92.18 |
| All predicates | 92.18 |

Top-k refers to selecting the top k predicates with the highest average weights based on the formulas they are involved in. As shown, increasing the number of environmental predicates leads to improved high-level action prediction accuracy.

We will include a more detailed discussion and exploration of predicate selection and its impact on prediction performance in our revision. Thank you for bringing this to our attention!

Comment

Q3: The evaluation does not seem to adequately address edge cases or scenarios where traditional models might fail. A deeper examination of how the framework handles uncommon but critical driving situations could provide insights into its robustness and safety.

Thank you for your insightful suggestion. In our work, we focus on enhancing both high-level and low-level action predictions from MLLMs in autonomous driving by introducing the PDCE loss for improved numerical precision in low-level control and implementing post-safety verification using Markov Logic Networks to ensure that any unsafe actions generated by the MLLM are identified and corrected before execution. Additionally, our multimodal RAG approach leverages previous driving scenarios to reduce hallucinations and enhance prediction reliability.

While we have included an example in Figure 5 of Appendix A.1 demonstrating the rejection and correction of aggressive behavior in one critical driving scenario, we acknowledge the need for a more comprehensive examination of how our framework handles uncommon but critical driving situations. Thank you for helping us improve our work.

Q4: Can the authors elaborate on the specific mechanisms used to extract and integrate environmental predicates into the retrieval process? How do these predicates influence decision-making during high-level action predictions?

Thank you for the insightful question. To extract and integrate environmental predicates into the retrieval process, we first concatenate all environmental predicates into a single binary vector. This vector is then processed by an MLP to map it into a hidden embedding that aligns with embeddings from other modalities, such as video and control signals.

We found that incorporating these binary vectors is crucial for enhancing retrieval performance, as demonstrated by our ablation study in Table 6. The results show that the average action accuracy improves by 30% when this binary compact information is included. This significant improvement indicates that the original video embeddings contain considerable noise, and integrating environmental predicates helps achieve more accurate and reliable retrieval. By leveraging these predicates, our model can make more informed and precise high-level action predictions. We will emphasize this point more in our revision.
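As an illustration of this mechanism, here is a minimal PyTorch sketch (our own reconstruction; the layer sizes and the similarity-based retrieval step are illustrative assumptions, not the authors' exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredicateEncoder(nn.Module):
    """Map a binary environmental-predicate vector into the shared
    retrieval embedding space (layer sizes here are illustrative)."""
    def __init__(self, num_predicates: int = 32, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_predicates, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, preds: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(preds), dim=-1)

enc = PredicateEncoder()
query = enc(torch.randint(0, 2, (1, 32)).float())   # current scene's predicates
bank = enc(torch.randint(0, 2, (100, 32)).float())  # past driving experiences
scores = query @ bank.T                             # cosine similarity
print(scores.topk(3).indices)                       # retrieve 3 nearest cases
```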

Q5: The paper mentions a gradual increase in the value of σ during training for the PDCE loss. How was this approach determined, and what empirical evidence supports its effectiveness compared to alternative strategies?

Thank you for the insightful question! Initially, we used a fixed value of $\sigma$ for training. Although this approach already outperformed the baseline that employs pure Cross-Entropy loss, we observed that the model struggled to learn the target distribution effectively at the beginning (since $\sigma$ is relatively large). To address this, we experimented with gradually (exponentially) increasing $\sigma$ from 0.01 to 0.35 during training, transitioning from an easier-to-learn distribution to the target distribution. This gradual increase resulted in a more stable training process and improved performance, as demonstrated by our empirical results below:

| Method | RMSE on Speed | RMSE on Course |
| --- | --- | --- |
| Base | 0.76 | 4.18 |
| Fixed $\sigma$ = 0.35 | 0.72 | 3.95 |
| Gradually increased $\sigma$ from 0.01 to 0.35 | 0.64 | 3.89 |

The results indicate that the gradual increase in $\sigma$ leads to further reductions in RMSE for both speed and course predictions compared to the fixed-$\sigma$ approach. Therefore, we adopted this strategy, as it proved to enhance model performance more effectively. Thank you for pointing this out; we will add these statistics to our revision for better clarity!
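For reference, one exponential schedule consistent with this description can be sketched as follows (our own reconstruction; the exact schedule granularity used in training may differ):

```python
import math

def sigma_schedule(step: int, total_steps: int,
                   sigma_start: float = 0.01, sigma_end: float = 0.35) -> float:
    """Exponentially interpolate sigma from sigma_start to sigma_end."""
    t = step / max(total_steps - 1, 1)
    return sigma_start * math.exp(t * math.log(sigma_end / sigma_start))

for s in [0, 250, 500, 750, 999]:
    print(s, round(sigma_schedule(s, 1000), 4))
# 0 -> 0.01, 999 -> 0.35, growing exponentially in between
```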

Q6: How does the framework handle potential conflicts between high-level predictions and low-level control signals?

Thank you for your valuable question. Currently, we do observe some potential conflicts between high-level predictions and low-level control signals, although these cases are relatively rare. At present, we mitigate these conflicts implicitly by introducing more relevant context through retrieval from the multimodal RAG, which helps the LLM reduce potential conflicts. Additionally, our experiments on larger internal datasets indicate that increasing the amount of training data can also gradually decrease the frequency of these conflicts. We recognize that currently we can only mitigate the issue but cannot fully solve it. Addressing these conflicts more explicitly is indeed an important direction for our future work and warrants further exploration.

Comment

Thanks for the response. Some of my concerns have been addressed. After reading the responses and the comments from other reviewers, I would maintain the original score.

Comment

Thank you for your positive feedback, and we are glad to see that we have addressed most of your concerns. We sincerely appreciate the time and effort you have dedicated to helping us refine and improve our work. If you have any further questions or concerns, please feel free to let us know!

Review
Rating: 3

This paper proposes a framework to enhance multimodal large language model based autonomous driving using structured and unstructured knowledge. The main contribution of the paper is the proposal of a new loss function, which the authors call Place-Dependent Cross-Entropy loss, that is supposed to enhance the accuracy of text-based control instructions. They also propose building a reasoning component for converting safety regulations into first-order logic rules. These rules are then used in a probabilistic graphical model along with environmental features to construct a RAG model.

Strengths

This paper proposes a number of new methods for enhancing autonomous driving. It addresses multiple problems that are difficult to solve and pose a challenge for using large language models for autonomous driving tasks. Understanding numerical information from text is a common problem that this paper proposes to solve with a new loss function. Reasoning is also a challenging task for these systems; here the paper proposes converting safety regulations into logic rules within a graphical model. The RAG model is also meant to produce better driving by learning from past experience.

Weaknesses

This paper proposes a few different changes, and the goals of the proposals are not always clear. If the goal is to produce a better driver, there should be a comparison with other autonomous driving works in a standard NuScenes-like evaluation, both open and closed loop. But driving comparison with standard models like UniAD and VAD is not performed (Table 3).

The PDCE loss formulation is unclear. Any new loss function should be defined clearly with notation; here the description of the loss is unclear. The properties of the loss need to be studied in more varied settings. The authors just produce a graph for a specific setting to motivate this loss, but better experiments need to be planned to demonstrate its behavior.

The action generation step from the LLM is not grounded in any way, so it is not clear how the approach would deal with invalid responses or hallucination, where it might generate unsafe actions.

It is not discussed how this approach could be used in practice: do the authors see this running in real time in a vehicle, or is it just used for data labeling offline? We need some runtime information for questions like this. Also, not all the tasks they want to solve would make sense in either case.

Questions

  1. Formulate the loss function in mathematical terms
  2. Perform experiments on loss behavior with error bars and in more varied situations
  3. Provide runtime information
  4. Discuss inference strategy
  5. Produce more experiments on driving performance, e.g., a NuScenes-like setup with UniAD / VAD.
Comment

Many thanks to the reviewer for the thoughtful and detailed feedback. The expertise and time invested in this work have been instrumental in enhancing its quality!

Q1: This paper proposes a few different changes and the goal of the proposals are not always clear. If the goal is to produce a better driver, there should be comparison with other autonomous driving works in standard NuScenes like evaluation, both open and closed loop. But, driving comparison with standard models like UniAD and VAD is not performed (Table 3). Please produce more experiments with driving performance like NuScenes setup and UniAD / VAD.

Thank you for your insightful feedback! Our primary objective is to develop a better unified driver model using the MLLM, which can provide both high-level descriptions of the current driving scenario and the corresponding low-level control signals. Currently, these aspects are usually handled separately in existing methods.

Additionally, we have indeed utilized a subset of the NuScenes dataset, specifically the DriveLM dataset, in our work. The original NuScenes dataset only includes low-level action information. Therefore, we chose DriveLM, which selects a representative subset of NuScenes and supplements it with corresponding high-level description annotations. As shown in Table 2, our approach not only improves high-level prediction accuracy from 61.60% to 74.60% but also reduces the Average Displacement Error (ADE) from 1.51 to 0.80, which is comparable to the 0.84 achieved by UniAD. It is important to note that UniAD, while proficient in generating low-level control signals, cannot provide any high-level descriptions of driving scenarios.

Regarding Table 3, which focuses on the BDD-X dataset, this dataset is commonly used as a benchmark for evaluating high-level explanations of driving scenarios. It does not include any trajectory information for the ego or surrounding agents, which is necessary for deploying models like UniAD. Therefore, in BDD-X evaluations, UniAD is usually not included as a baseline, consistent with all other SOTA methods tested on BDD-X, such as ADAPT, DriveGPT4, and RAGDriver.

In summary, our goal is to provide a unified model that offers accurate high-level descriptions for driving scenarios and narrows the performance gap in low-level control signals compared to models like UniAD that rely solely on regression.

Q2: The action generation step from the LLM is not grounded in any way, so it is not clear how the approach would deal with invalid responses or hallucination, where it might generate unsafe actions.

Thank you for the insightful question! We acknowledge that ungrounded action generation by the LLM can sometimes lead to invalid responses or hallucinations, potentially resulting in unsafe actions. To address this, we introduce SafeAuto-Reasoning in Section 3.2 of our paper. SafeAuto-Reasoning verifies each generated action against our predefined traffic rules via a Markov Logic Network (MLN) to ensure compliance and safety. Specifically, after MLN inference, it outputs the safest action for the current driving scenario, so the action generation step is in fact grounded within our framework. For example, as shown in Figure 5 in the Appendix, the LLM initially suggests a hallucinated unsafe action, "The car accelerates." After verification by the MLN, a corrected action, "stop," is inferred, which is then mapped to the corresponding high-level description, "The car slows to a stop." We then use this grounded high-level action to overwrite the original unsafe acceleration.

By implementing SafeAuto-Reasoning, we ensure that all generated actions adhere to safety rules, effectively mitigating the risks associated with invalid or unsafe outputs. We believe this approach enhances both the reliability and overall safety of the system. Thank you again for highlighting this important aspect!

Comment

Q3: The PDCE loss formulation is unclear. Any new loss function should be defined clearly with notation; here the description of the loss is unclear. The properties of the loss need to be studied in more varied settings. The authors just produce a graph for a specific setting to motivate this loss, but better experiments need to be planned to demonstrate its behavior. Please formulate the loss function in mathematical terms, and also perform experiments on loss behavior with error bars and in more varied situations.

Thank you for your valuable feedback regarding the formulation and clarity of our PDCE loss. We apologize for any confusion caused and appreciate the opportunity to clarify our approach comprehensively.

To address the limitations of the standard Cross-Entropy loss in handling numerical values, we introduce the PDCE loss. Below is the precise mathematical formulation, which is also introduced in detail in lines 204 to 246 of the paper:

$$\mathcal{L}_{\text{PDCE}} = \sum_{i=1}^{n} w_i \cdot \mathrm{KL}\left(\mathcal{P}_i \parallel \mathcal{D}(\mu_i, \sigma)\right)$$

where:

  • $n$ is the number of digits in the numerical string.
  • $\mathcal{P}_i$ is the predicted probability distribution for the $i$-th digit.
  • $\mathcal{D}(\mu_i, \sigma)$ is a digit-level discrete Gaussian distribution centered at the true digit $\mu_i$ with standard deviation $\sigma$; this ensures that digits closer to $\mu_i$ have higher probabilities, mimicking the behavior of MSE loss in promoting numerical closeness.
  • $w_i$ is the positional weight for the $i$-th digit, decreasing with the digit's position (i.e., higher weights for more significant digits). These weights are derived using a Gaussian distribution to emphasize the importance of higher-order digits.
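For concreteness, here is a minimal PyTorch sketch of this loss (our own reconstruction from the formulas above, not the authors' released code). It assumes one token per digit position, takes the KL divergence with the discrete Gaussian target $\mathcal{D}$ as the reference distribution (the direction under which the $\sigma \to 0$ limit recovers standard CE), and uses a simple Gaussian positional decay for the weights $w_i$, whereas the paper derives them via the procedure in its Fig. 2:

```python
import torch
import torch.nn.functional as F

DIGITS = 10  # classes per position: '0'..'9' (the decimal point is at a fixed slot)

def discrete_gaussian(mu: int, sigma: float) -> torch.Tensor:
    """Discrete Gaussian target over digits 0..9 centered at the true digit mu."""
    d = torch.arange(DIGITS).float()
    logits = -((d - mu) ** 2) / (2 * sigma ** 2 + 1e-12)
    return torch.softmax(logits, dim=-1)

def pdce_loss(pred_logits: torch.Tensor, target_digits: torch.Tensor,
              sigma: float) -> torch.Tensor:
    """pred_logits: (n, DIGITS) per-position logits; target_digits: (n,) ints.
    Computes sum_i w_i * KL(D(mu_i, sigma) || P_i)."""
    n = target_digits.numel()
    log_p = F.log_softmax(pred_logits, dim=-1)
    # Illustrative positional weights: more significant digits weigh more.
    w = torch.exp(-0.5 * (torch.arange(n).float() / max(sigma * n, 1e-6)) ** 2)
    loss = torch.zeros(())
    for i in range(n):
        target = discrete_gaussian(int(target_digits[i]), sigma)
        loss = loss + w[i] * F.kl_div(log_p[i], target, reduction="sum")
    return loss

# Toy check: a prediction whose probability mass is centered nearer the true
# digit incurs a lower loss, unlike plain CE, which only scores the exact digit.
tgt = torch.tensor([2])
for center in [2, 3, 7]:
    logits = -((torch.arange(DIGITS).float() - center) ** 2).unsqueeze(0)
    print(center, round(pdce_loss(logits, tgt, sigma=0.35).item(), 3))
```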

Intuition and Motivation

The standard CE loss evaluates predictions at the token (digit) level, which can lead to scenarios where numerically closer predictions do not necessarily correspond to lower loss values. For example, predicting "11.99" for the target "12.46" may incur a higher CE loss than predicting "14.46," despite "11.99" being numerically closer to "12.46." This discrepancy arises because CE loss prioritizes the correctness of individual digits over the overall numerical proximity.

By contrast, our PDCE loss incorporates a Gaussian-based weighting mechanism that emphasizes the significance of higher-order digits and allows for a degree of flexibility in lower-order digits. This alignment with MSE loss ensures that predictions closer to the target in a numerical sense result in lower loss values.

Properties of PDCE Loss

  • Numerical Closeness: PDCE loss ensures that numerically closer predictions yield lower loss values when we represent them as strings, akin to MSE loss.
  • Positional Weighting: Higher-order digits are given more weight, reflecting their greater impact on the numerical value.

Experimental Validation

We have also conducted extensive experiments to validate the effectiveness of the PDCE loss:

  1. Ablation Studies:

    • PDCE vs. CE Loss: As shown in Table 13 in Appendix B.1, replacing the standard CE loss with PDCE resulted in a significant reduction in Root Mean Squared Error (RMSE), demonstrating improved numerical prediction accuracy.
  2. Hyperparameter Analysis:

    • We explored different values of $\sigma$ in the PDCE loss (refer to Fig. 4), observing that PDCE consistently outperformed the CE loss across various settings, further confirming its robustness.
  3. Behavioral Analysis:

    • Fig. 3(b) illustrates the loss landscape after fine-tuning with PDCE, showing a bell-shaped distribution that aligns with the desired numerical-proximity behavior.

In summary, the PDCE loss effectively bridges the gap between CE and MSE losses for numerical string predictions by ensuring that numerically closer predictions correspond to lower loss values. This is achieved through a combination of positional weighting and a Gaussian-based probability distribution.

We will enhance the clarity of our PDCE loss formulation and its properties in our revision. Should you have any further questions or require additional clarifications, please feel free to let us know!

Comment

Q4: It is not discussed how this approach could be used in practice: do the authors see this running in real time in a vehicle, or is it just used for data labeling offline? We need some runtime information for questions like this. Also, not all the tasks they want to solve would make sense in either case. Please provide runtime information.

Thank you for your insightful question! We have indeed deployed an MLLM internally within an autonomous vehicle. This deployment required significant engineering effort to integrate the model effectively, and the latency of each invocation is very low. However, due to confidentiality policies, we are unable to share specific runtime details for this deployment.

There are, however, publicly available references that demonstrate similar implementations. For example, the blog posts Wayve AI: Lingo 2 Driving with Language and Nuro: Lambda - The Nuro Driver's Real-Time Language Reasoning Model both provide valuable insights into deploying MLLMs in real AD scenarios, highlighting the fast inference speeds achievable with such models. These examples illustrate that integrating MLLMs into autonomous driving systems is indeed becoming a viable and trending approach.

In addition to our internal deployment, we have also conducted offline evaluations to provide further context on the inference speeds of our approach. Using a single NVIDIA A6000 GPU, we measured the average running times as follows:

  • BDD-X Dataset:
    • Full conversation with Retrieval-Augmented Generation (RAG) context (including both high-level and low-level questions): 2.06 seconds per case.
    • Full Safe-Reasoning component, including the detection of environmental predicates: 0.27 seconds per case.
    • Total average time per case: 2.33 seconds.
  • DriveLM Dataset:
    • Full conversation with RAG context (including both high-level and low-level questions): 3.09 seconds per case.
    • Full Safe-Reasoning component, including the detection of environmental predicates: 0.41 seconds per case.
    • Total average time per case: 3.50 seconds.

These results demonstrate that our approach achieves fast inference times, making it suitable for both offline data labeling and potential real-time applications in autonomous vehicles.

Q5: Discuss inference strategy

Thank you for pointing that out! Regarding our inference strategy, we use KV-caching to accelerate the MLLM’s inference during multi-turn conversations, ensuring faster and more efficient responses. Additionally, we consistently employ greedy decoding to generate the responses. We will also release our code once the paper is accepted. We apologize for not including these details earlier and appreciate your understanding.

Comment

We greatly appreciate the time and effort you've invested in reviewing our work! With the rebuttal deadline nearing, we would truly value any additional feedback you could provide. We are keen to resolve any remaining issues you might identify to improve our submission further; your suggestions are crucial to refining our work!

Comment

I thank the authors for their explanation on my comments. However, I still do not believe that my comments have been addressed. Specifically,

Regarding Q1: The original comment was about driving performance comparison using NuScenes setup. I do understand DriveLM is derived from NuScenes, however, the task of planning is different from motion prediction. The authors have pointed to results in prediction in their response, whereas my comment was about driving (planning) performance.

Regarding Q2: It is not clear from their example how this post-processing step helps ensure safe decisions, as without a timely response it is not possible to make safe decisions. Here is what the authors prescribed: LLM response at a red light => accelerate. Post-process using rule: red light == stop => re-prompt LLM to get the response: 'car slows to a stop'. So, coming to a stop at a red light required two LLM calls; how do you ensure that there is enough time to actually stop?

Regarding Q3: The loss formulation is still not explained or motivated in a mathematical way. There are experiments about hyper-parameter choice; however, for a key contribution like a new loss function, it should be scientifically motivated. Also, the experiments do not have error bars; we need to understand how much of the improvement is actually due to the new loss vs. noise.

Regarding Q4: It is clear that the inference times the authors mentioned are not sufficient. 3.5 sec is not acceptable latency in AD.

So, after reviewing the revised manuscript and going through the authors comments, I maintain my original rating as I believe this work is not ready for publication in the current form.

Comment

Thanks for your time and expertise in helping polish our work! We apologize for the confusion and for missing some details in our previous comments, and we hope the following explanations help resolve them.

Regarding Q1

Sorry for the confusion. By the task of planning, do you mean the task Planning (P3): possible safe actions of the ego vehicle, as defined in the DriveLM [1] paper? (Please correct us if we have misunderstood.) We indeed have this question in our data, but in DriveLM the planning task is used as context for the subsequent, more important behavior and motion prediction tasks, so the final performance of the planning task is actually reflected in the performance of behavior and motion prediction, as shown in Tables 2 and 3 of DriveLM, where behavior accuracy refers to the high-level action predicting what safe action should be taken next (e.g., turn left or accelerate). Therefore, we use the same consistent setting: as shown in Table 2 of our paper, we improve safe action prediction on speed (e.g., accelerate or decelerate) from 65.40% to 81.61% and action prediction on steering (e.g., turn left or turn right) from 81.61% to 81.90%.

Regarding Q2

Sorry for omitting the technical details. We employ Key-Value (KV) caching in our experiments to efficiently manage the context (driving videos and control signals) across multiple LLM calls. The second LLM call reuses the computation from the first via the KV cache, which reduces the processing time of the subsequent call to only about 0.2 to 0.3 seconds on an A6000 GPU. This optimized processing time ensures that the system remains highly responsive and capable of making timely decisions, such as stopping at a red light safely and effectively.
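For reference, the KV-cache reuse pattern looks like the following minimal sketch with Hugging Face transformers (GPT-2 stands in for the actual MLLM here; the point is only that the second call reuses the first call's cached key/value states rather than reprocessing the shared context):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")           # stand-in for the MLLM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

with torch.no_grad():
    # First call: process the long shared context once, keeping the KV cache.
    ctx = tok("Video and control-signal context ...", return_tensors="pt")
    out = model(**ctx, use_cache=True)
    past = out.past_key_values

    # Second call (e.g., after MLN correction): only the new tokens are
    # processed; attention reuses the cached keys/values from the first call.
    follow_up = tok(" The ego vehicle should stop.", return_tensors="pt")
    full_mask = torch.ones(
        1, ctx.input_ids.shape[1] + follow_up.input_ids.shape[1], dtype=torch.long
    )
    out2 = model(input_ids=follow_up.input_ids, attention_mask=full_mask,
                 past_key_values=past, use_cache=True)

print(out2.logits.shape)  # logits only for the newly processed tokens
```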

Regarding Q3: Experiments on hyper-parameter choices are presented, but the new loss function lacks scientific justification. Additionally, error bars are needed to differentiate improvements from noise.

We appreciate your feedback on the hyper-parameter choices in the PDCE loss and regret any confusion caused by our initial explanation. To clarify, the PDCE loss has only a single hyper-parameter, $\sigma$, as detailed in our previous formulation $\mathcal{L}_{\text{PDCE}} = \sum_{i=1}^{n} w_i \cdot \mathrm{KL}(\mathcal{P}_i \parallel \mathcal{D}(\mu_i, \sigma))$, and the $w_i$ here is determined by $\sigma$, as shown in Fig. 2. The impact of varying $\sigma$ is thoroughly analyzed in Figure 4 of our paper, demonstrating its influence on model performance.

Regarding your concerns about the absence of error bars and the potential influence of noise on our results, we acknowledge the importance of robust statistical validation. To address this, Figure 4 presents a comprehensive comparison across different $\sigma$ values, consistently showing that both the speed and course Root Mean Square Error (RMSE) remain lower than those obtained using the conventional Cross-Entropy (CE) loss. It is important to note that the benchmark figures for the original CE loss represent the best outcomes from multiple runs with CE loss, ensuring their reliability.

We hope this explanation reassures you of the validity and robustness of our findings, affirming that the improvements attributed to the PDCE loss are not merely due to random variations but are a result of a significant enhancement in the model's predictive accuracy.

Regarding Q4

Sorry for the confusion; the 3.5 seconds was obtained with only one A6000 GPU, while real deployment involves more engineering work on the machine-learning-systems side, which is beyond the scope of our paper. We can only say that it is much faster than 3.5 seconds in our real deployment, but we cannot report specific figures here due to internal confidentiality policies. As a reference, however, when the model is deployed at Together AI, the inference time can also be reduced far below 3.5 seconds.

We understand the reviewer's concern about how to put MLLMs to practical use, which is indeed an insightful question worth exploring. From our perspective, the advantages of MLLMs are their generalizability and conditionality. Therefore, in practice, we can still use models from imitation learning for normal driving scenarios and only use the MLLM to take over control when the uncertainty of the prediction is high. This approach allows us to balance inference latency and practicality more effectively. We hope our work can provide some insight into making AD safer with the increasing deployment of MLLMs.

Summary:

Thank you very much for your insightful review and the valuable time you've invested in enhancing our paper. We truly appreciate your efforts to help us improve. If you still have any additional concerns or suggestions, please let us know!

Review
Rating: 3

The authors propose SafeAuto, a framework that can be incorporated into multimodal large language models and that consists of 3 components: (i) a new loss function called Place-Dependent Cross-Entropy, (ii) a reasoning module based on Markov Logic Networks, (iii) a multimodal Retrieval-Augmented Generation model. The Place-Dependent Cross-Entropy loss aims to better capture numerical proximity between numeric strings generated by the large language models indicating low-level control numerical signals (e.g., speed). The Markov Logic Network-based module serves as a post-processor that flags the large language model's outputs which do not satisfy pre-defined knowledge rules (written in first-order logic and capturing relations between predicates representing actions of the ego-vehicle, such as stop or accelerate, and objects from the environment, such as a stop sign). The multimodal Retrieval-Augmented Generation model aims at helping the decision process by retrieving driving experiences similar to the current scenario, utilising information gathered from the image embeddings, control signals and a vector containing the values of binary predicates (that capture whether certain objects, e.g. a stop sign, are present in the given image or not). The authors conduct an experimental analysis (on two datasets), showing (i) how these three components impact the performance when compared to state-of-the-art baselines and (ii) how background knowledge can be used to enhance the predictions of multimodal large language models.

Strengths

Good referencing of prior works related to most of the involved components, including recent studies.

Additionally, the paper clarifies the relation to prior works and explains how the proposed approach differs from the previous ones.

The work is clearly motivated in the paper, the research context and goals are also clearly stated.

The method is described in a way that is easy to follow.

The experimental setting is well-explained.

Weaknesses

There is no discussion on how the proposed Place-Dependent Cross-Entropy Loss behaves when there is a difference in the order of magnitude between the input and the target values. From the description of the method, it appears that the Place-Dependent Cross-Entropy Loss does not take into account the decimal point position (and, for example, this means that the target probability distributions from Figure 2 for numeric string "2.69" will be the same as for numeric string "26.9"). I would appreciate it if the authors could clarify this point and the statement in lines 247-248: "Notice that when $\sigma$ is set to 0, the loss reduces to the original definition of joint CE loss for the entire numeric string", which again is unclear as the cross-entropy loss takes into account the decimal point position.

The method offers no guarantee that reprompting the multimodal large language models will generate a new high-level action that does not violate the knowledge rules, since such models are data-driven.

The paper is in the context of knowledge-enhanced autonomous driving and argues that the underlying models are limited as they do not account for any available knowledge when making their decision. Specifically, the authors capture the knowledge rules in the form of logical formulae, yet the related work section does not discuss prior works that would be relevant here (e.g., representing and integrating background knowledge into neural network models).

Improvements in clarity and readability are needed. Please see the detailed comments below.

Figure 2: contains pseudocode and could be moved to supplementary material, with its contents described more concisely in natural language in the main text.

Formatting: cluttered text, very little space between captions and main text (e.g., lines 73 and 237), all affecting readability. All abbreviations should be introduced, e.g. "QA" is used without doing so.

Many typos:

  • in the abstract, "the the Place-Dependent".
  • lines 92, 93: "contextssuch", "signalsto"
  • lines 105, 106: "scenariowhich", "informationwe"
  • line 144: "knowledgespecifically", "rulesinto"
  • and other such instances of words not being separated by a space when they should be.
  • a space before each citation would improve the readability (line 108 "BDD-X(Kim et al., 2018) and DriveLM(Sima et al., 2023)")
  • abbreviations are introduced multiple times: e.g., "cross-entropy (CE)" (lines 53, 127, 180); "Place-Dependent Cross-Entropy (PDCE)" (lines 95, 170, 397), similarly for "Markov Logic Networks", etc.
  • line 277/278: "all formula"
  • the title of section 6 is inconsistent with the naming convention of the rest (i.e., "Limitation.", rather than "Limitation").

Other comments:

  • in Figure 2, the naming of "cumulative_probs" (line 231) could be confusing/misleading as it suggests traditional cumulative probabilities. A clearer term, such as "weights," would align with the main text (line 238).
  • repetitions: e.g., lines 83-84: "regarding high-level action prediction, a significant limitation of current methods in high-level action prediction"
  • line 193: "with temperature as 1.0." it is unclear what is meant here, as the temperature was not yet introduced (only later on line 374 it is introduced).

Questions

One of the main components of the proposed framework is a post-processor designed to identify predictions made by the large language model that do not adhere to the knowledge rules. It would be helpful to understand how often violations of the rules were observed in the conducted experiments.

How does the method deal with cases where the large language model keeps generating a new high-level action (after reprompting) that violates the knowledge rules?

As a suggestion (also mentioned earlier, along with detailed comments), the paper would need to be improved in terms of clarity and readability.

Comment

We are deeply grateful to the reviewer for their thorough and insightful feedback. Your contribution of time and expertise has significantly enriched the development of our research!

Q1: how the proposed Place-Dependent Cross-Entropy Loss behaves when there is a difference in the order of magnitude between the input and the target values. From the description of the method, it appears that the Place-Dependent Cross-Entropy Loss does not take into account the decimal point position (and, for example, this means that the target probability distributions from Figure 2 for numeric string "2.69" will be the same as for numeric string "26.9")

Thank you for pointing this out! As detailed in the Experimental Details paragraph in Section 4, we consistently format all numeric strings to the same number of digits during training. For example, in the BDD-X dataset, we represent numbers using five digits, formatting "2.69" as "02.690" and "26.9" as "26.900." This ensures that the Place-Dependent Cross-Entropy Loss can effectively balance the loss for each numeric string. We apologize for any confusion and will clarify this point further in the revision.

Q2: I would appreciate it if the authors could clarify this point and the statement in lines 247-248: "Notice that when $\sigma$ is set to 0, the loss reduces to the original definition of joint CE loss for the entire numeric string", which again is unclear as the cross-entropy loss takes into account the decimal point position.

Apologies for any confusion. As shown in the paper, the PDCE loss is defined as $\sum_{i=1}^{n} w_i \cdot \mathrm{KL}(\mathcal{P}_i \parallel \mathcal{D}(\mu_i, \sigma))$. When $\sigma$ is set to 0, the distribution $\mathcal{D}(\mu_i, \sigma)$ simplifies to $\mathcal{D}(\mu_i, 0)$, which places a probability of 1 on the target digit $\mu_i$ (i.e., the hard label for digit $\mu_i$). Consequently, all $w_i$ are also 1, according to the pseudo-code shown in Fig. 2. As a result, the loss reduces to computing the divergence between the predicted probability distribution over possible digits and a hard label $\mu_i$, which is exactly the standard cross-entropy loss. Sorry for the confusion; we will clarify this further in our revision.

Q3: The method offers no guarantee that reprompting the multimodal large language models will generate a new high-level action that does not violate the knowledge rules, since such models are data-driven.

Thank you for the insightful question. We acknowledge the potential confusion arising from our brief mention of reprompting in the main text, with more detailed explanations deferred to Appendix A.5.

To clarify, during safety verification, the Markov Logic Network (MLN) outputs a corrected action, such as 'stop.' However, 'stop' is a predicate from the MLN, not a natural high-level description. Therefore, "reprompting" in this context involves mapping the MLN-provided 'stop' back to a high-level action. For example, we prepend "The ego vehicle should stop." to the query "What is the action of the ego car?" resulting in the new prompt "The ego vehicle should stop, then what is the action of the ego car?" as demonstrated in Fig 5 on Page 13. This simple mapping effectively transforms "stop" into a natural description like "The car slows to a stop," which is then used to overwrite the original erroneous response.

Thus, reprompting here refers to mapping an inferred action, such as 'stop,' into an appropriate high-level description, rather than re-asking the multimodal large language model (MLLM) the same question to obtain a different response. We apologize for any confusion and will integrate these details into the main text of the revision.
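A minimal sketch of this mapping step follows (the helper and the predicate-to-phrase table are hypothetical illustrations of our understanding; the actual prompts follow Fig. 5):

```python
# Map an MLN output predicate to a prefixed query that elicits a natural
# high-level description from the MLLM (entries here are illustrative).
MLN_TO_PHRASE = {
    "Stop": "stop",
    "Decelerate": "slow down",
    "KeepSpeed": "keep its speed",
}

def build_reprompt(mln_action: str,
                   question: str = "What is the action of the ego car?") -> str:
    return f"The ego vehicle should {MLN_TO_PHRASE[mln_action]}. {question}"

print(build_reprompt("Stop"))
# -> "The ego vehicle should stop. What is the action of the ego car?"
# The MLLM's answer (e.g., "The car slows to a stop.") then overwrites the
# original unsafe response.
```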

Comment

Q4: The paper is in the context of knowledge-enhanced autonomous driving and argues that the underlying models are limited as they do not account for any available knowledge when making their decision. Specifically, the authors capture the knowledge rules in the form of logical formulae, yet the related work section does not discuss prior works that would be relevant here (e.g., representing and integrating background knowledge into neural network models).

Thank you for your valuable feedback! We apologize for the oversight in our related-work section. Initially, our intention was to keep the discussion focused and concise, centered primarily on the application of MLLMs in autonomous driving. As a result, we inadvertently placed less emphasis on the integration of knowledge into neural network models. Realizing this might confuse some readers, we will revise the section to include more pertinent related work on this topic; below is the part related to MLNs.

“Specifically, Markov Logic Networks are commonly used to represent and integrate background knowledge into neural network models. In [1], the authors first explore the possibility of encoding background knowledge, such as the shape of a stop sign for image classification, and derive the corresponding strict robustness guarantees for the predictions. Building on this foundation, [2] scales the approach by incorporating Graph Convolutional Networks to encode the logical formulas, thereby enhancing the model's ability to capture complex relational structures. Additionally, they leverage variational inference techniques to efficiently approximate the posterior distributions, facilitating scalable and robust learning.”

We will also include more related work on this topic, such as probabilistic circuits [3], in our revision. Thank you again for your thorough review; your comment is greatly appreciated!

[1] Yang, Z., Zhao, Z., Wang, B., Zhang, J., Li, L., Pei, H., ... & Li, B. (2022). Improving certified robustness via statistical learning with logical reasoning. Advances in Neural Information Processing Systems, 35, 34859-34873.

[2] Zhang, J., Li, L., Zhang, C., & Li, B. (2023, February). CARE: Certifiably robust learning with reasoning via variational inference. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (pp. 554-574). IEEE.

[3] Choi, Y., Vergari, A., & Van den Broeck, G. (2020). Probabilistic circuits: A unifying framework for tractable probabilistic models. UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf

Q5: Improvements in clarity and readability are required. For example, in Figure 2, the term "cumulative_probs" (line 231) could be confusing as it suggests traditional cumulative probabilities, whereas "weights" might be a clearer term, aligning with the main text (line 238). Additionally, there are repetitions, such as in lines 83-84: "regarding high-level action prediction, a significant limitation of current methods in high-level action prediction." Also, in line 193: "with temperature as 1.0," it is unclear what is meant here, as the temperature concept is introduced later on line 374.

Thank you for your careful and attentive review; we truly value it! We apologize for the issues with readability. We found that the incorrect word spacing resulted from a LaTeX symbol that failed to compile. This has been corrected in our revision.

Additionally, we acknowledge the potential confusion caused by the term "cumulative_probs" and have replaced it with "weights" in the revision to enhance clarity. We have also addressed the issue of repetition within the text.

Regarding the term "temperature," the usage in line 193 actually refers to the standard sampling temperature for large language models, whereas in line 374, it pertains to the distillation temperature used in learning the rankings from text embeddings. Although both instances use the term "temperature," they relate to different contexts and processes. We will clarify these distinctions more thoroughly in our revision to prevent any confusion. We greatly appreciate your detailed feedback!

Comment

Q6: It would be helpful to understand how often violations of the rules were observed in the conducted experiments.

Thank you for your valuable question. We evaluated the frequency of violations across 474 selected safety-critical scenarios in the BDD-X dataset, which represent situations involving stop signs, red lights, and similar scenarios. The observed violation rates are summarized in the table below:

| Method | Violation Rate (%) |
| --- | --- |
| Base | 11.64 |
| PDCE | 8.44 |
| RAG + PDCE | 5.90 |

As shown, our proposed RAG also helps reduce the violation rate further, and we will add this statistic to our paper for clarification. Thank you for bringing this to our attention!

If you have any further questions or concerns, please feel free to let us know!

Comment

Thank you very much for the time and effort you've devoted to reviewing our work! As the rebuttal deadline is quickly approaching, we would be grateful for any further suggestions you might have. We are eager to address any additional concerns to enhance our work further; your feedback is invaluable in helping us polish it!

Comment

I thank the authors for their response. I have read their answers and the revised PDF version of their paper. Below are my comments.

Regarding Q1. In the response on how inputs are represented, the authors mention that they "will clarify this point further in the revision". Perhaps I have missed it, however, apart from lines 421-423 in the paper (which were present in the original version as well), there is no additional explanation.

The description should fully explain how the inputs are represented, with examples serving as illustrations rather than substitutes for a formal explanation. For the BDD-X dataset, the paper specifies that numbers are formatted to five digits (e.g., that 8.1 is represented as 08.100). However, for the DriveLM dataset, the paper only mentions a "four-digit format" without even providing an example. Thus, the description is vague and leaves room for multiple interpretations: for instance, 8.1 could be represented as "8.100", "08.10", or even "008.1", depending on the intended measure and units. It is important to state any such choices to ensure clarity and reproducibility. Further, a choice on the format raises additional questions, such as what is the maximum number that can be represented (e.g., 99.999 using the BDD-X dataset input format, but unclear for the DriveLM dataset) and how the method handles outputs exceeding these limits.

Regarding Q4. My concerns about related work on how knowledge is represented and integrated in neural network models were only partially addressed by the authors' response. Further, the revised version of the paper does not include any updates on this. Given the focus and title of the paper, it is essential to have a discussion of related works in this area.

Regarding Q6. I appreciate the authors' response and additional analysis, which provides information on how often the proposed methods' outputs violate the provided knowledge compared to the baseline. However, again, although the authors mentioned in their answer that they "will add this statistic to [their] paper for further clarification", I could not find it in the revised PDF.

General feedback. In general, I found that the authors' responses were not adequately reflected in the revised PDF. While the responses almost always indicated that changes will be included in the revised version, the actual revision includes only minimal updates. Although the authors corrected typos, little effort appears to have been made to improve the readability, clarity or coherence of the paper, with one of the main contributions (the proposed loss function) still being insufficiently formalised. Given these problems, I do not believe the paper is ready for publication at a conference like ICLR in its current form.

Comment

Thank you so much for your effort and expertise in helping us polish our work! Although the rebuttal discussion period has been extended, the final revision deadline remained Nov. 26, so we are unable to upload a new revision at this time. Instead, we will share the changes we have made in our final revision and hope you understand. We assure you that our discussions here fully align with the final revision.

Regarding Q1

Sorry for missing the clarification; here is our final revised paragraph:

'''For example, in the BDD-X dataset, each number is formatted to five digits, such as representing 8.1 as "08.100" during training, whereas the DriveLM dataset uses a four-digit format like "08.10". The differing number of digits is due to the signals used in each dataset: in BDD-X, speed, acceleration, and turning angle range from –99.99 to 99.99 with two decimal precision, while course ranges from –180 to 180 and is scaled by dividing by two to fit within –99.99 to 99.99. To maintain the same precision, an additional decimal place is added, resulting in five digits. In contrast, all signals in the DriveLM dataset are within –99.99 to 99.99 with two decimal precision, allowing for a consistent four-digit representation.'''
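For clarity, the stated convention can be reproduced with standard Python format specifiers (a sketch of our reading of the convention; sign handling for negative values is omitted for brevity):

```python
def fmt_bddx(x: float) -> str:
    """Five-digit BDD-X format: width 6 including the point, zero-padded, 3 decimals."""
    return f"{x:06.3f}"  # 8.1 -> "08.100"

def fmt_drivelm(x: float) -> str:
    """Four-digit DriveLM format: width 5 including the point, zero-padded, 2 decimals."""
    return f"{x:05.2f}"  # 8.1 -> "08.10"

print(fmt_bddx(8.1), fmt_drivelm(8.1))  # 08.100 08.10
```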

Regarding Q6

Thank you for your suggestions. We have added the new table to Appendix A. We had initially wanted to add it immediately after your comment, as we found your review quite helpful in polishing our work, and we wanted to check, based on your further feedback, whether the statistic was insightful enough; however, the revision period had already passed. Based on your current positive feedback, we assure you that the new statistical table, along with corresponding clarifications, has been added to Appendix A in our current revision.

Comment

Regarding Q4

Sorry for the confusion. The reason we did not include it in the main paper is the lengthy discussion of the knowledge component, which would cause our paper to exceed the page limit; inserting it would force us to move other important parts to the appendix. For example, as you mentioned, Figure 2 contains pseudocode and could be moved to supplementary material. However, according to Reviewer uk4o, the pseudocode is important as it clarifies the mathematical formulas presented in our paper. We now agree with your point that the pseudocode can indeed be moved to the appendix. Therefore, we have deferred it to Appendix C in our final revision and included the following complete discussion in the related-work section:

'''Recent advancements in the representation and integration of background knowledge into neural network models have focused on robust, scalable, and interpretable solutions. Markov Logic Networks (MLNs) have traditionally been pivotal in this endeavor. In their work, [1] explores encoding background knowledge, such as geometric and textural features of a stop sign for image classification, into these networks, providing a foundation for certifiable robustness in model predictions. Building on this foundation, [2] introduces Graph Convolutional Networks to encode logical formulas, enhancing the network's capacity to capture complex relational structures. They also leverage variational inference to approximate posterior distributions, facilitating scalable and robust learning. Extending these concepts, Neural Markov Logic Networks (NMLNs) [3] represent a significant step forward. These networks integrate ideas from MLNs but employ neural architectures to implicitly encode logical rules, thus avoiding the need for explicitly specified first-order logic rules. Similarly, Probabilistic Logic Neural Networks (pLogicNet) [4] combine the strengths of Markov logic and knowledge graph embeddings, utilizing a variational EM algorithm to efficiently learn and reason over knowledge graphs.

Further, the work by [5] on probabilistic circuits introduces a unifying framework for tractable probabilistic models, which is integral to our understanding of efficient and scalable inference mechanisms in complex networks. Frameworks like HyperSPNs [6] further introduce compact and expressive models that integrate neural networks to generate parameters, enhancing the ability to model complex probability distributions. Additionally, Probabilistic Neural Circuits (PNCs) [7] strike a balance between the tractability of probabilistic circuits and the expressive power of neural networks, enabling efficient probabilistic inference while maintaining robustness in capturing intricate data patterns. These methodologies collectively advance the integration of structured background knowledge into neural architectures, promoting more robust and interpretable learning systems.''

[1] Yang, Z., Zhao, Z., Wang, B., Zhang, J., Li, L., Pei, H., ... & Li, B. (2022). Improving certified robustness via statistical learning with logical reasoning. Advances in Neural Information Processing Systems, 35, 34859-34873.

[2] Zhang, J., Li, L., Zhang, C., & Li, B. (2023, February). CARE: Certifiably robust learning with reasoning via variational inference. In 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML) (pp. 554-574). IEEE.

[3] Marra, G., & Kuželka, O. (2021, December). Neural markov logic networks. In Uncertainty in Artificial Intelligence (pp. 908-917). PMLR.

[4] Qu, M., & Tang, J. (2019). Probabilistic logic neural networks for reasoning. Advances in neural information processing systems, 32.

[5] Choi, Y., Vergari, A., & Van den Broeck, G. (2020). Probabilistic circuits: A unifying framework for tractable probabilistic models. UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf

[6] Shih, A., Sadigh, D., & Ermon, S. (2021). Hyperspns: Compact and expressive probabilistic circuits. Advances in Neural Information Processing Systems, 34, 8571-8582.

[7] Dos Martires, P. Z. (2024, March). Probabilistic Neural Circuits. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 15, pp. 17280-17289). '''

Summary

Overall, we greatly appreciate the reviewer's time and effort in helping to polish our work. We have learned a lot from the reviews, which were quite helpful, and we have incorporated these updates into our new revision. We hope the reviewer finds that mapping the revisions above into the new version is straightforward (essentially copy and paste), while the detailed clarifications and new experiments themselves took considerably more time and effort. We appreciate your understanding, and thank you for the insightful review once again!

If you still have concerns about our new revision, please let us know; your feedback is important to us.

Comment

We would like to thank all the reviewers for their valuable feedback and suggestions. We are pleased that the reviewers found our paper well-written and recognized it as providing a novel and systematic approach to autonomous driving via MLLMs, featuring three innovative components: the PDCE loss, post-safety verification via MLN, and the multimodal RAG for learning from similar driving experiences, all of which significantly outperform other baselines. Following the reviewers' suggestions, we have made the following main revisions to further enhance our work.

  • Following Reviewer jpQM's suggestion, we further clarified the re-prompting of the MLLM for mapping predicates to the corresponding high-level actions.
  • We added related work that is relevant to representing and integrating background knowledge into neural network models, as suggested by Reviewer jpQM.
  • We corrected all typos and addressed the comments pointed out by Reviewer jpQM.
  • We further explained the intuition and formulation of our PDCE loss following guidance from Reviewer uk4o.
  • We provided detailed running time information as suggested by Reviewer uk4o.
  • At the suggestion of Reviewer 1Fdq, we conducted an additional ablation study exploring the impact of the number of selected predicates on final prediction performance.
  • Following Reviewer 1Fdq's feedback, we added additional empirical evidence supporting the effectiveness of a gradually increasing $\sigma$ value over other strategies.

We sincerely appreciate the reviewers for dedicating their time to review our paper and look forward to any further discussion and suggestions to help improve the quality of our work.

AC Meta-Review

This paper develops SafeAuto, a novel framework that enhances MLLM-based autonomous driving systems by incorporating both unstructured and structured knowledge. The paper has the strengths of clear motivation and some innovation in using LLMs for autonomous driving. The paper has the weaknesses of lacking a detailed discussion of the proposed loss function, an unclear goal, and no discussion of how the approach can be used in practice.

After discussion, the final scores from the reviewers are 3, 3, 6. Despite the discussions between the authors and the reviewers, the two negative reviewers remain negative on the submission and do not believe the comments have been well addressed.

The AC has checked the submission, the reviews, the rebuttal, and the discussion, sided with the two negative reviewers, and decided that this work is not ready for publication in its current form. Thus, a rejection is recommended.

Additional Comments from Reviewer Discussion

There were discussions between the authors and reviewers. After the discussions, the two negative reviewers expressed explicitly that their concerns were not fully addressed and that they think this work is not ready for publication. Many issues need further revision and investigation. The AC has checked the claims and issues and agreed with the two reviewers.

Final Decision

Reject