We sincerely thank you for your constructive and helpful comments, as well as your recognition of the strengths of our work, including our motivation to optimize overlooked system prompts and our comprehensive experiments, supporting the efficacy of our method. We have carefully considered and faithfully responded to your comments. --- > W1. The paper lacks theoretical insights into why meta-learning is effective for system prompt optimization and how it enables performance improvements. We appreciate your feedback regarding the desire for a deeper theoretical understanding. However, as you acknowledged as strengths of our work, we provide comprehensive experiment results with detailed analyses, which support all the claims that we make: the system prompt learned from meta-learning improves generalization and accelerates adaptation. Specifically, we show consistent performance improvements across 14 unseen tasks in 5 domains, including cross-domain and cross-model generalization (Table 3 and Figure 5), demonstrating that meta-learning enables generalization over diverse domains and models. Also, the optimized system prompt via meta-learning enables the rapid and scalable adaptation, requiring fewer iterations and data at test time (Figure 6). Therefore, while a formal theoretical analysis is indeed an interesting direction for future work, we strongly believe that the lack of such insight should not be viewed as a limitation, given that our empirical evidence sufficiently provides concrete support for the efficacy of our method. --- > W2. The tasks included are limited to deterministic problems; thus, the effectiveness of the framework on subjective tasks remains unclear (while system prompts are also used in such chat scenarios). Thank you for raising this concern, and to address this, we evaluate the efficacy of the global system prompt optimized by MetaSPO on the MT-Bench dataset [1], which is designed for multi-turn, open-ended subjective tasks, such as Writing, Humanities, and Roleplay. Then, as shown in Table B.1 below, we observe that MetaSPO consistently outperforms baselines (including SPRIG and CoT) across tasks, with GPT-4o mini for evaluation. In other words, these findings provide strong empirical evidence that MetaSPO can apply to a diverse range of tasks, including deterministic and subjective tasks. Table B.1. Result of MetaSPO on MT-Bench [1] |Base Model|Method|Humanities|Roleplay|Writing|Avg.| |---|---|:---:|:---:|:---:|:---:| |Llama3.2 (3B)|Default|7.65|7.60|7.55|7.60| ||CoT|**7.70**|7.60|7.50|7.60| ||Service|7.65|6.75|7.60|7.33| ||SPRIG|7.60|7.30|7.60|7.50| ||MetaSPO|**7.70**|**7.65**|**7.85**|**7.73**| |Llama3.1 (8B)|Default|8.20|7.50|**7.75**|7.82| ||CoT|8.20|7.90|7.70|7.93| ||Service|7.80|7.75|7.65|7.73| ||SPRIG|**8.25**|7.85|7.60|7.90| ||MetaSPO|8.20|**8.05**|**7.75**|**8.00**| [1] Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, Neurips D&B Track 2023 --- > Q1. Do the authors conduct any case studies to illustrate how the model output changes from incorrect to correct? Is the improvement mainly driven by format-related instructions in the optimized system prompt, as suggested by its inclusion of output format in Appendix? This is an excellent suggestion, and based on it, we conduct additional qualitative case studies provided below, to show how MetaSPO corrects errors. Specifically, in a reasoning task (called Colored Object), the model with the default system prompt yields an incorrect answer (“red”) despite correct formatting, while the model with the system prompt from MetaSPO produces the correct answer (“brown”) by reasoning through the item order. Notably, many optimized system prompts (such as other Medical and Reasoning domain examples in Table 18) do not include format instructions, which confirms that performance gains stem not primarily from output formatting alone, but rather from improved task understanding and reasoning behavior encouraged by the optimized system prompt. **Case 1: Colored Object (Reasoning)** ```python User Prompt with question: Read the following questions carefully and provide the correct answers based on the details supplied in each input. Ensure your answers accurately reflect the conditions described and count or describe the objects as requested. On the floor, you see several items arranged in a row: a brown paperclip, a mauve necklace, a turquoise dog leash, a red jug, and a teal pen. What is the color of the item furthest from the pen? At the end present your answer in and . Ground Truth: Brown MetaSPO Response(Correct): To determine the color of the item furthest from the pen, we need to consider the arrangement of the items. Since the items are arranged in a row, we can deduce the order from left to right. The items in the order are: 1. Brown paperclip (...) 5. Teal pen The pen is at the end of the row, so the item furthest from the pen is the brown paperclip. Brown Default Response (Wrong): To determine the color of the item furthest from the pen, we need to consider the arrangement of the items. Since the items are arranged in a row, we can count from the pen to the other end. 1. The pen is at one end. 2. The teal pen is closest to the pen. (...) 6. The red jug is at the farthest end. So, the item furthest from the pen is the red jug. Red ``` **Case 2: Dental (Medical)** ```python User Prompt with question: For each of the following questions, select the correct answer from the provided options and indicate your choice clearly. Functional appliances in skeletal class II are indicated when: Options: A. Maxilla is normal, mandible is retrognathic B. Maxilla is prognathic C. Mandible is retrognathic in a mature individual D. When severe crowding is present At the end present your answer in and . Ground Truth: A MetaSPO Response(Correct): A Rationale: Functional appliances in skeletal class II are (...) Default Response(Wrong): C ``` --- > Q2. Have the authors tried applying their method to more capable models (e.g., GPT-4o-latest or o3) or more challenging tasks (e.g., AIME 2024 or LiveBench)? In response to your question, we additionally conduct experiments with GPT-4o, evaluating the system prompt optimized from GPT-4o mini to GPT-4o without further optimization, and as shown in Table B.2 below, MetaSPO achieves the highest average score over baselines, indicating its strong transferability to even one of the strongest models. Also, for challenging tasks, while our evaluation already includes multiple challenging reasoning benchmarks, which require multi-step inference and compositional reasoning, we further conduct the suggested experiments on LiveBench [2], evaluating the system prompt optimized for the reasoning domain using GPT-4o mini and reporting its performance on the reasoning category of LiveBench [2]. The results in Table B.3 then show that MetaSPO outperforms baselines, highlighting its effectiveness on complex reasoning scenarios. Table B.2. Unseen generalization results of GPT-4o with system prompts optimized from GPT-4o mini ||Review|Analysis|||Reasoning||| |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| ||Electronics|Pet|Sports|Count|Epistemic|Color|Avg.| |Default|73.4|75.0|69.2|78.2|81.6|98.0|79.2| |CoT|73.0|76.0|70.6|80.0|80.0|98.4|79.7| |Service|70.6|75.0|66.8|82.0|81.6|98.0|79.0| |SPRIG|73.0|73.8|68.0|81.8|81.8|98.0|79.4| |MetaSPO|76.9|75.4|77.3|82.9|81.1|98.9|**82.1**| Table B.3. Result of MetaSPO on LiveBench (Nov 25, 2024) [2] ||LiveBench| |:---:|:---:| |Default|35.3| |CoT|35.8| |Service|36.1| |SPRIG|31.7| |MetaSPO|**39.1**| [2] LiveBench: A Challenging, Contamination-Limited LLM Benchmark, ICLR 2024 --- > Q3. Could the authors provide detailed results using only ten user prompts (e.g., in the form of a table similar to Table 1) to isolate the impact of system prompt optimization? Thank you for your insightful suggestion. As suggested, we evaluate performance with only the ten user prompts without any system prompt (to isolate its effect), and report the results in the “None” row of Table B.3. From this, we observe that it shows comparable performance with the default system prompt and substantially lower performance than our optimized MetaSPO prompt. We will make sure to include this result in the revision. Table B.4: Results on the unseen generalization scenario, where None refers to using only the user prompts without any system prompt, compared against Default and MetaSPO. |Domain||Medical||||Review|Analysis||Reasoning||Safety||Grounding||| |:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:| |Methods|Ana.|Ped.|Den.|Sur.|Ele.|Pet|Spo.|Cou.|Epi.|Col.|A.H.|Eth.|N.Q.|Web.|Avg.| |Default|36.1|38.9|25.8|32.3|41.3|41.5|29.3|43.5|28.3|56.6|21.2|28.7|15.1|11.6|32.2| |None|40.0|36.1|31.0|33.9|38.5|39.4|27.7|42.0|34.7|59.0|19.1|27.8|13.3|10.5|32.4| |MetaSPO|**48.9**|**46.7**|**36.4**|**40.0**|**61.8**|**64.9**|**61.5**|**47.1**|**43.0**|**66.6**|**29.1**|**43.9**|**19.1**|**13.7**|**44.5**| --- > Q4. In Figure 6, why does the performance of the default system prompt vary across different iterations and example counts? We observe that the performance of the default system prompt generally improves with more iterations and higher example counts, as it increases the opportunities to adapt and incorporate relevant information from more computation. However, the performance does not increase monotonically, which we attribute to the inherent variance in LLM outputs and the stochastic nature of the prompt optimization technique (namely, ProTeGi) that may yield instability. Specifically, we found that it has an average standard deviation of 0.015 across different example counts, which reflects the variance in the results of Figure 6.