Better Language Model Inversion by Compactly Representing Next-Token Distributions
This paper introduces PILS, a novel language model inversion method that leverages the low-dimensionality of next-token distributions, enabling their lossless compression over multiple generation steps for markedly improved prompt recovery.
Abstract
Reviews and Discussion
This paper proposes a new method, PILS, to improve large language model (LLM) inversion, which aims to recover the input prompt, such as the user or system prompt, given access to the output probability distributions of the LLM. More specifically, PILS losslessly compresses the output distributions into a low-dimensional subspace via a linear transformation. An inverter then recovers the input prompt from the low-dimensional embeddings. Experimental results show that PILS outperforms the evaluated baselines.
Strengths and Weaknesses
The strengths of the paper are listed as follows.
- The authors provide a theoretical analysis to show that the outputs of language models are losslessly compressible.
- The authors evaluated their methods on both the in-distribution test data and out-of-distribution test data.
The weaknesses of the paper are listed as follows.
- The proposed method requires the LLM to provide access to the output logits, which are not accessible in many closed-source LLMs, limiting its impact.
- Some important technical details are missing.
a. What is the formula for the alr transform?
b. How to train the inverter?
c. Why use the encoder-decoder architecture for the inverter?
- It would be better if the authors could move the related work section to after the introduction, since this would help readers understand the novelty before delving into the technical details.
- It is not clear why compressing the output probability distribution saves API cost, since it still requires the API to provide the full distribution for compression.
- The abbreviation O2P has no explanation when it first appears in the paper, which may cause confusion.
- It would be better if the authors could compare the training cost of PILS and the other related works.
- The evaluation should be improved.
a. In Table 4, the baselines are inconsistent across the different target models.
b. In Tables 1 and 2, it would be better if the authors could specify the target model in addition to the Llama family.
c. The authors should provide a separate introduction to the baselines. Given the current content, it is not clear whether the authors include all the important existing works in LLM inversion.
Questions
The questions are listed as follows.
- The example of the OpenAI API in Section 2.2 is confusing. If the API can only provide the logprob of the most-likely token, how could the proposed method get the full probability distribution for the multi-step output? Besides, could you give more explanation of the relationship between the logit bias provided to the API and the minimum logit bias?
- In Table 3, what is the meaning of the setting of GPT-3.5 if there is no evaluation of PILS in this setting?
- In Tables 1 and 2, what are the baselines called prompt (avg.) and prompt (top)?
- The training of the inverter requires 100 epochs of fine-tuning of the T5-base model. Is this computational cost acceptable?
- What is the size of the test sets?
Limitations
No. My comments on weaknesses and questions are a good reference for further improvement.
Justification for Final Rating
I recommend borderline acceptance of this paper, since the authors addressed all of my concerns and questions one by one in their rebuttal. I appreciate the effort the authors made to provide a detailed response.
In summary, I think this paper has a novel idea in the research area of language model inversion. The technique is solid and will inspire future work in this area. For the revision of the paper, I hope the authors follow the responses they provided to improve its readability, such as clarifying the choice of baselines in Table 4 and the placement of the related work section. I believe these improvements to the writing will also help highlight the importance of this work.
Paper Formatting Concerns
N/A
Thank you for your valuable feedback. We hope we can address your concerns.
- Though many current LLM APIs do not provide logprobs/logits, some do, and understanding the implications of providing logprobs can inform practitioners making decisions about APIs in the future. As an aside, logprobs can still be extracted from APIs that do not expose them directly if they provide a logit bias parameter (Morris et al., 2024).
2.a (ALR formula) The formula for the alr transform is given on line 80. Using Python indexing over a probability vector p of length V, alr(p) = log(p[:-1]) - log(p[-1]).
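As a minimal NumPy sketch (the function names are illustrative, and the last vocabulary entry is taken as the reference component):

```python
import numpy as np

def alr(p):
    # Additive log-ratio: V-1 log-ratios against the last (reference) entry.
    return np.log(p[:-1]) - np.log(p[-1])

def alr_inverse(y):
    # Re-append the reference component and renormalize onto the simplex.
    z = np.append(np.exp(y), 1.0)
    return z / z.sum()
```

Since alr_inverse(alr(p)) recovers p exactly (up to floating point), no information is lost.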
2.b (How to train inverter) We train the inverter as a standard encoder-decoder model, with the source sequence being the logprob sequence and the target sequence being the hidden prompt, using cross-entropy loss. Section 4 describes how we train our model, and additional information is provided in Appendix D. The supplemental material contains the training code.
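For a rough idea of the shape of one training step, here is an illustrative sketch; the projection layer lifting logprob vectors into the T5 embedding width, and all names and shapes, are assumptions rather than our exact code:

```python
import torch
from transformers import T5ForConditionalGeneration

D = 4096  # assumed width of a compressed logprob vector per generation step
model = T5ForConditionalGeneration.from_pretrained("t5-base")
proj = torch.nn.Linear(D, model.config.d_model)  # map logprobs to T5 width

def train_step(src, prompt_ids, optimizer):
    # src: (batch, steps, D) compressed logprob sequence (source)
    # prompt_ids: (batch, len) token ids of the hidden prompt (target)
    out = model(inputs_embeds=proj(src), labels=prompt_ids)  # cross-entropy
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```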
2.c (Why encoder-decoder?) At a high level, inversion takes a sequence (the LM output) and returns a sequence (the hidden prompt). Since this is a sequence-to-sequence (seq2seq) task over natural language, we believe that encoder-decoder transformers are the most obvious and appropriate architecture to use. This has precedent in previous work on inverting text embeddings (Morris et al., 2023), which found that encoder-decoder models significantly outperform decoder-only ones.
3/5. (Related work placement) Thank you for pointing this out. We will consider either moving the related work section or making it clearer in our introduction which aspects of our work are novel before going into technical details. In doing so we can also resolve your concern about the lack of an introduction for O2P.
4. (Compression requires full logprob outputs?) I believe this is a slight misunderstanding. Our compression only requires access to a number of logprobs equal to the hidden size D of the model (plus one), as opposed to the V logprobs a naïve method would need. In particular, we choose a list of D tokens plus a reserved token, get the logprobs of those tokens, and subtract the logprob of the reserved token from the D other logprobs. This requires only D+1 logprobs, and you can check that this is exactly the “recovered hidden state” on line 116. For an API that charges per logprob (e.g., gives the logprob of a single token per query), this leads to significant savings versus a method that requires V logprobs per generation step.
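A sketch of this bookkeeping, with illustrative names rather than our exact procedure:

```python
import numpy as np

def compressed_state(logprobs, basis_tokens, reserved_token):
    # logprobs: dict mapping token id -> logprob obtained from the API
    # Returns the D-dimensional representation: each basis-token logprob
    # minus the reserved token's logprob (D+1 logprobs total).
    ref = logprobs[reserved_token]
    return np.array([logprobs[t] - ref for t in basis_tokens])
```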
6. (Training cost comparison) Our method has a training cost similar to L2T, since the only difference is the shape and size of the model input, the latter of which ranges from smaller to larger than L2T's, depending on the number of output steps we train on. Since training efficiency is not the focus of our method and remains similar to that of a previous method, we do not believe that an efficiency analysis would be particularly interesting.
7.a (Baselines in Table 4) You are correct that the baselines are different for the different models. As mentioned on line 240, the L2T transfer method is incompatible with out-of-family models, so it cannot be compared for Mistral models. For O2P, we are restricted to the numbers reported in their paper, and they do not report transfer to Llama 2 13B. We chose not to replicate the O2P transfer experiments to obtain these numbers due to computational constraints, as this is a relatively less important part of the paper. We expect O2P to perform at least as well for Llama 2 13B as it would on Mistral, since it was trained on Llama 2 7B. Confirming this would only further strengthen our conclusion from Section 5.5 that O2P is best for model transfer.
7.b (Tables 1 & 2 target models) The target models here are Llama 2 7B and Llama 3 8B, as mentioned in Section 4, paragraph 1. We can add this information to the captions as well.
7.c (Intro to baselines) We are confident that our baselines include all important existing language model inversion methods, but we agree that these could be better introduced. We will incorporate an overview of existing methods in the revised related work section.
Questions:
- (How to get logprobs from OpenAI API) As an illustrative example, observe in the table below how applying a logit bias of 2 causes token 2 to become the most likely token.
| Logit bias on token 2 | Most likely token |
|---|---|
| 0 | 1 |
| 2 | 2 |
If we know that token 1 has logprob -0.5, then it must be the case that token 2 has a logprob between -2.5 and -0.5. You can continue to bisect this range until you have token 2’s logprob within a small error tolerance. You can read more about this method in Section 5 of Morris et al. (2024). “Minimum logit bias” refers to the smallest logit bias you must provide to the API to make token 2 the most likely token.
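Schematically, with most_likely_token standing in for an API query (it is a hypothetical helper, not a real client call), the bisection looks like:

```python
def extract_logprob(target, anchor_logprob, most_likely_token,
                    lo=0.0, hi=100.0, tol=1e-3):
    # Invariant: with bias `hi` applied to `target` it becomes the argmax,
    # with bias `lo` it does not (assumes `hi` is large enough to flip it).
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if most_likely_token({target: mid}) == target:
            hi = mid
        else:
            lo = mid
    # The minimum bias b satisfies logprob(target) ≈ anchor_logprob - b.
    return anchor_logprob - hi
```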
- (GPT-3.5 baseline in Table 3) The goal of Table 3 is to compare O2P to PILS for inverting system prompts. Since the O2P paper only provides results with O2P+GPT-3.5, we use these as our baselines. To help rule out the possibility that different target models explain the performance gap, we include an O2P+Llama 2 baseline in the non-finetuning setting, and see that the results are highly similar to O2P+GPT-3.5. We will clarify this in the final version of the paper.
- (Prompt baselines) Prompt (avg.) is the average performance of a set of adversarial prompts designed to get the model to reveal its hidden prompt. Prompt (best) is the performance of the best prompt from this pool. These are obtained from Morris et al. (2024), as mentioned at the end of Section 4; we can make this clearer.
- (Acceptability of training cost) Training our model takes 3-7 days on four RTX A6000 GPUs, so it is not prohibitively expensive, as detailed in Appendix D. We would like to point out that training an inverter is a one-time cost per target model, and it is likely that this cost could be significantly reduced with some engineering effort.
- (Test set sizes) For Tables 1, 2, and 4, we used 1000 samples from each dataset, as was done in previous work. We will make sure to add this detail to the final draft. For Table 3, as mentioned in Appendix D.2, we used 103 samples for the Awesome dataset and 29 for the Store dataset, as was done in previous work.
Thanks to the authors for the detailed rebuttal. I have read all of the content and believe that the rebuttal addressed most of my concerns and questions. Therefore, I tend to raise my original rating to borderline accept this paper if the authors make revisions to clarify the writing in the final version.
We very much appreciate your valuable feedback for improving our paper. We will certainly incorporate all of it into our final version. We are glad our response was able to address most of your concerns/questions.
The authors present an incremental improvement over previous work on language model inversion by proposing PILS, a method for prompt inversion from logprob sequences. The improvement stems from taking advantage of the information contained in the full output sequence of a language model, rather than just the first generated token. A clever dimensionality reduction technique substantially lowers the computational requirements of PILS. The final performance is remarkable, with up to 60% of prompts reconstructed exactly.
Strengths and Weaknesses
Personally, I believe language model inversion is an underexplored topic, and the authors' contribution is welcome, albeit slightly incremental in nature.
The paper is professionally written and clear to follow. I only found Theorem 1 relatively irrelevant to the paper's contribution, as the authors' work is mainly experimental.
The positive results on generalization across output length (Figure 4) and the negative results on generalization across different target models (Table 4) are intriguing.
There are a few minor typos. In Line 96, "an language model" should be "a language model". The header of Section 5.5 is not capitalised correctly.
Questions
Line 127 claims that the authors' model requires all V logprobs for each position, even though the rest of Section 3, Section 2.2, and Figure 1 mention that only D+1 logprobs are necessary. I would appreciate a clarification of this seemingly contradictory statement.
How long are the prompts being inverted in the experiments? I would suspect longer prompts are more difficult to invert, but a more detailed analysis would be a great addition to the paper.
Tables 3 and 4 do not report all experimental metrics: exact matches and BLEU (in Table 4) are missing. I presume this is because the scores are too low to be informative, but it would be good to report them anyway for completeness. Furthermore, why are the authors not always comparing with both O2P and L2T there?
Limitations
Yes
Justification for Final Rating
The authors have promised to clarify a few minor points in the final version of the paper. I still recommend acceptance.
Paper Formatting Concerns
None
Thank you for your valuable feedback and positive evaluation. Your input will help us to improve our paper.
(Confusion about logprobs): To simplify the analysis in Section 2 and the start of Section 3, we assume access to all V logprobs for each output. Line 127 makes the observation that we don’t actually need all of these logprobs to calculate the recovered hidden state---we only need the first D logprobs and the logprob of the reserved token.
(Prompt Length) We limited the inversion of prompts to 64 tokens during both training and testing, consistent with L2T. While our qualitative examples (Appendix E) show prompts of varying lengths, we agree that a systematic analysis of how prompt length affects inversion performance would be interesting. We will consider adding such an analysis to our final version.
(Missing Metrics): Your assumption is correct that these scores are too low to be informative. Exact match scores for system prompt recovery (Table 3) are indeed very low (near 0%), and similarly, exact match and BLEU scores for model transfer (Table 4) are substantially lower than the reported Token F1 scores. Additionally, L2T does not report BLEU and exact match scores for their transfer experiments. We are happy to include these numbers for completeness.
(Missing Baselines): For Table 3 (system prompt recovery), the L2T paper did not conduct any system prompt inversion experiments, so we only compare against O2P. Since O2P generally outperforms L2T, we did not deem it necessary to run these experiments ourselves. For Table 4 (transfer performance), the baseline coverage reflects methodological constraints:
- L2T's model transfer only works for models with the same tokenizer (we report results for Llama 2 13B which shares a tokenizer with Llama 2 7B, but cannot transfer to Mistral 7B)
- O2P did not report numbers for Llama 2 13B in their original work, so we only use their reported Mistral 7B results. We do not believe that adding this would be particularly informative or change our conclusion that O2P outperforms our method in the transfer setting.
Thanks for the response. It would be good if you could clarify the minor points we discussed in the final version of the paper.
This work presents a language model inversion method (i.e., reconstructing information about the inputs of a model based on its behavior) that relies on sequences of compressed logprobs (i.e., not requiring the full logprob vectors). This allows not only a faster way (in terms of the number of queries) to construct the data used for the inversion, but also much better performance compared to previous work.
Strengths and Weaknesses
Clarity
Strengths
- Paper is well written and very easy to follow
- Illustrations are clear and helpful
Weaknesses
- The grid plots such as in Figure 3 are difficult to understand and parse.
Quality
Strengths
- The paper shows strong results, with massive improvements compared to previous approaches on certain experiments.
- The PILS approach seems practical as a way to extract user information contained in a prompt.
Weaknesses
- The choice of models used should be justified, especially given that Llama 2 can be considered an old model, now.
Significance & Originality
Strengths
- This work demonstrates a practical language model inversion attack and seems to be the first to leverage working from a (transformed) sequence of logprobs to gain more information.
- The lossless compression approach suggested is interesting and seems to indicate not only a low-dimensional space of interest in which attacks can be successful, but also potential for transferability (as illustrated) across models.
Weaknesses
- This work builds upon previous work (L2T), but the improvements are substantial enough to be considered novel.
Questions
- Typo line 96: "an language model"
- I am slightly confused regarding what your intuition for your approach to work could be. You write, line 227: "This is likely because post-training discourages target models from revealing system message". But it doesn't seem natural to me that information about the prompt couldn't be extracted from logprobs of responses that do not try to repeat the system prompt. What do you believe is the mechanism at play behind the success of logprob-based inversion (particularly yours, which leverages sequences of logprobs)?
- Could you try and make your Figure 3 clearer? What exactly is given to the model and how does that evolve with the sequence of prediction? Why are there so many tokens with such a high probability? How should they be read? What does it mean for a token to be recoverable? Why do the filled red grid cells matter?
- Why do you think model providers still allow for access to models' logprobs?
Limitations
Limitations seem addressed to me.
Justification for Final Rating
The authors addressed my concerns and I found the discussions with other reviewers interesting too. I maintain my score.
Paper Formatting Concerns
No paper formatting concern.
Thank you for your positive review and valuable feedback. We will make sure to improve the paper accordingly. We address your questions and concerns as best we can below.
(Grid plots and Figure 3) In Figure 3, the target model receives the prompt (x-tick labels) and generates an output (y-tick labels) one token at a time. At each generation step, we feed the outputs so far to the inverter and force it to generate the hidden prompt, measuring the probability of each prompt token according to the inverter. These probabilities form a single row of the grid. The rows with the most high-probability tokens (mostly gray) are the ones where the model is closest to correctly guessing the hidden prompt. A token is “recoverable” when its probability is high. Filled red grid cells mean that the target model had to generate the hidden token before the inverter was able to guess what it was. We will try to make these figures easier to parse in the final draft.
(Choice of model) We agree that Llama 2 is a weaker model, so we also provide Llama 3 results in Tables 1 and 2 for the benefit of future work. We used Llama 2 models in order to make our results comparable to previous work without replicating all of their experiments on Llama 3. We will make sure that this is explicit in the final draft.
(Intuition for why our method works) You are correct that logprob-based inversion can recover the prompt even when it does not appear in the text outputs of the model. Line 227 is only trying to convey that inversion success appears to be correlated with how likely the model is to generate the secret prompt, even for logprob-based methods. As for the mechanism behind our method's success: compared to text-based inversion, logprob-based inversion accesses richer information about the prompt directly from the model’s internal representation, and compared to previous logprob-based inversion, our method gives the inverter access to information that only surfaces later in the generation process. We are happy to discuss these intuitions further.
(Why do model providers allow logprob access) We are not sure about all the reasons, but we presume that logprobs and logit bias are useful to API users for one reason or another (e.g., fine-grained control of generations), and so removing these options from the API would cost providers in some way. It may be that these costs outweigh the risks of providing logprobs, at least for now.
Thank you for providing clarification in this rebuttal. I read it with great attention.
Regarding Figure 3, your explanations are helping. It seems like the explanation for this plot could be improved in the manuscript. Allow me to suggest:
- Change the colormapping to one that is easier to read. Shades of gray are hard to differentiate, and it's not clear what the threshold is (numerically and visually) for a token to be considered "recoverable" or not.
- Why is it important whether the inverter can or cannot recover a token before the target model outputs it?
You addressed my concerns and I would appreciate if you could improve the figures similar to Figure 3 with better colormapping and clearer explanations in the manuscript.
We appreciate your reply and engagement with our rebuttal. We will be sure to update the figures in the final draft.
This paper introduces Prompt Inversion from Logprob Sequences (PILS), a method to recover hidden prompts from a language model. The key contribution is a novel technique to losslessly compress sequences of next-token probability distributions into compact, low-dimensional vectors using the Additive Log-Ratio (ALR) transform. This allows the inverter to leverage information from multiple generation steps, a significant departure from prior work. PILS achieves state-of-the-art results, improving exact prompt recovery rates by 2-3.5x over previous methods. The paper also reports a surprising "length generalization" capability, where the inverter's performance improves on longer sequences than it was trained on.
Strengths and Weaknesses
Strengths:
- State-of-the-Art Performance: The method achieves a 2-3.5x improvement in exact prompt recovery over strong baselines, a significant leap that elevates logprob-based inversion from a theoretical to a practical security threat.
- Novelty and Technical Contribution: The core ideas are highly original: inverting a sequence of probability distributions and using a principled, lossless compression (ALR transform) to make this feasible. The discovery of length generalization is also a valuable scientific finding.
- Rigorous Evaluation: The claims are supported by comprehensive experiments against multiple baselines, across several LLMs and diverse datasets, including insightful qualitative analysis.
Weaknesses:
- Poor Model Transfer: The proposed method for transferring the inverter to a new model family performs poorly, significantly lagging behind simpler text-based attacks. This limits the attack's scalability, as a new inverter must be trained for each target model family.
- Ad-Hoc Compression Implementation: The choice to use a random set of tokens as a basis for the ALR transform to avoid "degenerate compression" is not well-justified and lacks analysis of alternatives (e.g., CLR, ILR), raising questions about the method's robustness.[1, 2, 3]
Questions
- ALR Implementation: Your use of a random token set for the ALR transform seems ad-hoc. How sensitive is performance to this choice, and did you evaluate more standard, stable alternatives like Centered or Isometric Log-Ratio (CLR/ILR) transforms?[2, 3]
- Model Transfer: The proposed transfer method underperforms text-based baselines significantly. Why are logprob-based features less transferable? Have you considered more advanced alignment techniques like Centered Kernel Alignment (CKA)?[4]
- Length Generalization: To substantiate your claim that relative position embeddings (RPEs) cause length generalization, can you provide an ablation study where an inverter with absolute position embeddings fails to generalize?
- Practical Cost: Can you provide a concrete cost analysis (in dollars and time) for executing a PILS attack against a commercial API, factoring in the bisection search for logprobs?
Limitations
Yes
Paper Formatting Concerns
None
Thank you for your positive score and valuable feedback. We appreciate your comments and will use them to improve our paper. Regarding the listed weaknesses, we hope you agree that poor model transfer is a limitation of the method but not a weakness of the paper, since an honest evaluation of the limitations should be part of any analysis of a new method. See the answer below to question 2 for more details on our transfer experiments. Similarly, we believe that our answer below to your question 1 addresses the second listed weakness.
Questions:
- (ALR implementation) In our preliminary experiments, we observed that tokens whose corresponding unembedding submatrices are low-rank consistently underperformed compared to those with full-rank submatrices. This makes sense because linearly dependent tokens provide redundant information about the hidden state. Empirically, we find that the first D tokens tend to be low rank (this may depend on the model), but a random choice of D+100 tokens usually has high rank, and therefore performs well. We do not observe sensitivity to the choice of random tokens, and choosing random tokens is much easier than searching for a full-rank submatrix. We also explored using other transforms like CLR, but observed no significant performance improvements over our proposed approach. We adopted ALR primarily due to its efficiency advantages: while CLR requires computing the mean over the entire logprob vector (necessitating access to all vocabulary tokens), ALR operates with only D+101 logprobs, which is substantially smaller. This design choice significantly reduces API costs (as measured in logprobs) without any impact on performance.
- (Model Transfer) We experimented with several variations on our logprob alignment technique and did not find any improvements, though this does not rule out the possibility that other techniques like CKA might work. We report these findings despite their limitations because our method is the first ever cross-family logprob-based inverter transfer method, and could precipitate better results in future work. We speculate in the paper (line 257) that transfer suffers because the inverter “overfits” to model-specific features from the source model that are not present in logprobs from other models.
- (Positional embedding generalization ablation) This would be an interesting experiment to run, though the rebuttal period is not long enough to accommodate it. We will consider adding this for the camera ready. Since this claim is speculation, and not central to the findings of the paper, we can consider removing it, though we think that it is an interesting avenue for future work.
- (Practical Cost): Great idea, we will add this to the paper. Using some back-of-the-envelope calculations, let’s estimate the cost of a T-step attack on an API model with hidden size D and maximum logit bias B. We need $T(D+1)$ logprobs, and each logprob requires about $\log_2(B/\varepsilon)$ queries to bisect to precision $\varepsilon$. Each query has up to N input tokens and 1 output token. Suppose the input cost is $c_{\text{in}}$ per token and the output cost is $c_{\text{out}}$ per token. So the cost will be roughly $T(D+1)\cdot\log_2(B/\varepsilon)\cdot(N c_{\text{in}} + c_{\text{out}})$. If we are using GPT-4.1 mini (one size up from the cheapest model, GPT-4.1 nano), this model has an unknown hidden size D, but perhaps we can assume it is similar to GPT-3.5-Turbo's, which is less than 4600. The max logit bias is 100, the output cost is $1.60/1M tokens, and we will ignore caching, though that could reduce the cost further. We will require logprobs with a precision of $\varepsilon = 0.001$, i.e., about 17 queries per logprob. This gives us a cost of a few dollars per prompt. For GPT-4.1 nano, this cost is $1.38. Using the batch API, both of these costs can be halved. This cost per prompt is relatively low, and creating a small dataset to fine-tune an inverter model to target GPT-4.1 would be feasible.
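For concreteness, a small script implementing this formula; all concrete values below are illustrative assumptions (they are not the exact parameters behind the $1.38 figure):

```python
import math

# Illustrative assumptions, not the authors' exact parameters
T = 16                    # generation steps observed
D = 4600                  # assumed hidden size (GPT-3.5-Turbo estimate)
B, eps = 100.0, 1e-3      # max logit bias and target logprob precision
N = 64                    # assumed input tokens per query
c_in, c_out = 0.10 / 1e6, 0.40 / 1e6  # GPT-4.1 nano prices, $ per token

queries_per_logprob = math.ceil(math.log2(B / eps))   # ~17 bisection steps
total_queries = T * (D + 1) * queries_per_logprob
cost = total_queries * (N * c_in + c_out)
print(f"~{total_queries:,} queries, about ${cost:.2f} per hidden prompt")
```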
Thanks for the rebuttal. It solves my question. I will maintain the positive score.
Dear Reviewers sEEm and vtuB,
Thank you for your valuable contributions.
The authors have provided rebuttals to your reviews. Please carefully read their responses as soon as possible and indicate whether your concerns have been addressed. You are also encouraged to engage in discussion with fellow reviewers.
Best,
AC
This paper studies language model inversion and proposes compact representations of next-token distributions to improve input reconstruction. The approach is simple and clearly motivated, with theoretical analysis and empirical validation across model scales and datasets. Reviewers appreciated the clarity of writing and the technical soundness, while raising concerns about the breadth of evaluation and real-world impact. The rebuttal addressed methodological questions and provided additional analysis, which satisfied most reviewers. While the contribution is incremental in scope, it is technically solid and easy to implement, and it could be of interest to the community working on inversion attacks. Overall, I recommend acceptance of this paper.