PaperHub
Overall rating: 6.4/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 4.0
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

Matryoshka Pilot: Learning to Drive Black-Box LLMs with LLMs

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

M-Pilot uses a small, controllable language model to guide a large black-box language model through complex tasks, improving its reasoning, planning, and personalization capabilities.

Abstract

Keywords
Large Language Models · Black-Box LLMs · LLM Reasoning · LLM Planning · LLM Personalization

Reviews and Discussion

Review (Rating: 4)

The paper proposes the use of an LLM, trained through iterative direct preference optimisation, to generate the prompts that help a black-box LLM solve tasks that typically require guidance to achieve good performance. Experiments are reported in summarisation, reasoning, and planning domains.

Strengths and Weaknesses

Quality

The proposed method is sound and relies on a combination of different techniques to generate the data to train the LLM controller, e.g., Behavioural Cloning to initialise the controller, RLHF-style data generation, and iterative DPO for training. The experimental results are reported in detail and well supported by the appendix.

I would like more detail on the training overhead, and an ablation study on the components in Figure 3. Some data can be found in the Appendix regarding training resources, but it would greatly benefit the paper if these comments were included in the main paper instead.

Clarity

Overall, the paper is well written, but Section 2 (Multi-turn) and Section 3 could be improved. Figures 2 and 3 need to be expanded with more detail and referenced in Sections 2 and 3.3. This would help put all the pieces together. As it is, Section 3.3, a key section of the paper, is hard to follow.

Minor:

  • Table 4 needs space between the caption and the text
  • Set notation is used in lines 83 and 75, but in the former it refers to a sequence.

Significance - Originality

The paper provides a novel approach to generate guidance: instead of being hand-crafted, this is provided by a trained LLM. The idea is novel, and results show some advantage. The only caveat is the use of a hand-crafted prompt for each task. See questions as I need clarification.

Questions

  • In Table 4, why is the success rate only 11.11% on the Exam task without optimising the controller? What makes this task particularly challenging?

  • The Controller starts with an initial prompt hand-crafted for each task. Is this given to all baselines as well? Otherwise, it seems like the baselines are solving a different task. Furthermore, why is this not generated automatically by the controller?

  • How applicable is this approach when facing a new task? Which components would require input from human feedback?

Limitations

Yes

Final Justification

I thank the authors for their response; they addressed all my questions and added valuable extra experiments of interest.

Formatting Issues

N/A

Author Response

We are grateful for the reviewer’s constructive remarks and suggestions. We respond to each point in the following:

Weakness: Clarity and Completeness of Method Presentation

Thank you for the insightful suggestion.

Regarding the missing details and ablation study, we have included the training overhead in Appendix Section C.7 and Table 16. In addition, we present ablation studies in the appendix: Section C.4 introduces various iterative training methods, and Section C.5 compares different controller model backbones. We agree that presenting these results in the main paper would improve readability. Accordingly, we will revise the main text to briefly summarize the training cost across different tasks and highlight the effectiveness of different controller architectures and training methods.

For clarification of sections and figures, we will explicitly reference the relevant components in Figures 2 and 3 within Sections 2 and 3.3 to improve clarity and navigation for readers.

Regarding other formatting issues, we will increase the spacing between the caption of Table 4 and the body text to enhance visual clarity. As for the set notation in line 83, we acknowledge that it is not strictly accurate: while sets imply unordered, distinct elements, the guidance steps g_i in our case represent an ordered sequence that may contain repetitions. To address this, we will rephrase it as "a sequence of T intermediate guidance steps g_1, g_2, …, g_T" to better reflect the intended meaning.

Question 1: Exam Task Difficulty without Optimization

We manually reviewed all the tasks in the Exam benchmark and identified the primary reasons why the unoptimized controller fails while the optimized controller succeeds.

To recap, the controller model is responsible for decomposing the task into intermediate steps, which are then used to guide the black-box LLM in invoking functions from the Agent class to solve each step.

In many failure cases, the unoptimized controller produces incorrect step formulations. For example, when the task requires examining a pencil using a desklamp, the unoptimized controller might output the step: "examine the pencil with the desklamp." This leads to an error since the Agent class has no function named examine. In contrast, the optimized controller reformulates the step as "take the pencil with the desklamp," which correctly guides the black-box LLM to place the pencil under the desklamp, thus effectively performing the examination without triggering an error. This type of semantic mismatch accounts for most of the unoptimized controller's failures.

Additionally, the unoptimized controller sometimes fails to produce sufficiently detailed decompositions. For instance, if the task is to look at an alarm clock under a desklamp, the correct sequence of steps should be: (1) take the alarm clock, (2) turn on the desklamp, and (3) look at the alarm clock under the light. However, the unoptimized controller may directly attempt to “look at the alarm clock under the desklamp” without taking the alarm clock or turning on the light, resulting in an execution error.
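To make this failure mode concrete, below is a minimal, hypothetical sketch of an AlfWorld-style Agent interface; the method names are illustrative assumptions rather than the benchmark's actual API. A step phrased with an unknown verb such as "examine" has no matching function and fails, while the reformulated step maps onto existing functions.

```python
class Agent:
    """Hypothetical, simplified stand-in for the executor the black-box LLM drives;
    method names here are illustrative, not the benchmark's actual API."""

    def goto(self, receptacle: str) -> str:
        return f"You arrive at {receptacle}."

    def take(self, obj: str, receptacle: str) -> str:
        return f"You pick up the {obj} from the {receptacle}."

    def toggle(self, obj: str) -> str:
        return f"You turn on the {obj}."


def execute(agent: Agent, action: str, *args: str) -> str:
    # Steps are executed by dispatching on the action name, so a step phrased
    # with an unknown verb (e.g. 'examine') fails exactly as described above.
    fn = getattr(agent, action, None)
    if fn is None:
        raise AttributeError(f"Agent has no function named '{action}'")
    return fn(*args)


agent = Agent()
print(execute(agent, "take", "pencil 1", "desk 1"))   # valid step
try:
    print(execute(agent, "examine", "pencil 1"))      # unoptimized phrasing
except AttributeError as err:
    print(err)  # -> Agent has no function named 'examine'
```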

Question 2: Consistency and Automation of the Initial Prompt

Thank you for the insightful question. We want to clarify this question from two perspectives:

  1. There are no "hand-crafted" prompts: the initial prompt is simply used to present the task, and it is the same across all baselines.
  2. The inputs and outputs to the black-box LLM follow the same prompt format in both our method and the baselines, ensuring that all approaches are solving the same task under consistent conditions.

Below we elaborate on each point.

First, we emphasize that the initial prompt is not hand-crafted for specific tasks; rather, it serves as a general instruction to indicate the problem. Specifically, it can be simple while still providing sufficient context. In AlfWorld, for instance, we use a simple instruction:

Decompose the task into steps. First give a general plan of how you would solve the task, then for each step you plan to take, mark with '[Step xx]'.

We accompany this with a single one-shot example. Despite its simplicity and lack of task-specific hand-crafting, this initial prompt, when paired with our optimized controller, enables the white-box model to effectively guide the black-box LLM and achieve strong performance.

Additionally, regarding the difference between our pipeline and direct prompting baselines, we would like to clarify that the intermediate prompt (i.e., the input to the black-box LLM) closely resembles the format used in direct prompting. For example, in AlfWorld, the black-box LLM is required to annotate each action step before generating code, effectively guiding the code generation process. Similarly, the intermediate prompt produced by our controller model serves the same purpose and follows the same format. As a result, both approaches are solving the same task under equivalent prompt structures.
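As an illustration of this point, the sketch below shows how controller-generated guidance could be spliced into the same prompt structure that direct prompting uses. The templates and helper names are hypothetical placeholders, not the paper's actual prompts; only the initial instruction string is taken from the text above.

```python
# Hypothetical sketch of the two prompting stages discussed above; the templates
# and helper names are illustrative placeholders, not the paper's actual prompts.

INITIAL_INSTRUCTION = (
    "Decompose the task into steps. First give a general plan of how you would "
    "solve the task, then for each step you plan to take, mark with '[Step xx]'."
)

def controller_prompt(task: str, one_shot_example: str) -> str:
    # Stage 1: the white-box controller receives the generic instruction,
    # a single one-shot example, and the task description.
    return f"{INITIAL_INSTRUCTION}\n\n{one_shot_example}\n\nTask: {task}"

def blackbox_prompt(task: str, guidance: str) -> str:
    # Stage 2: the black-box LLM is queried in the same annotated-step format
    # used by the direct-prompting baseline; only the source of the annotated
    # steps (controller-generated vs. self-generated) differs.
    return f"Task: {task}\n\nAnnotated steps:\n{guidance}\n\nNow produce the solution."

guidance = "[Step 1] find a mug\n[Step 2] clean it at a sinkbasin\n[Step 3] put it in the coffee machine"
print(blackbox_prompt("put a clean mug in coffee machine", guidance))
```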

Question 3: Generalization and Human Dependency in New Tasks

We appreciate your insightful suggestion regarding extending the framework to new tasks requiring human feedback. To this end, we additionally implement the framework on code generation, using the APPS [1] benchmark, a large-scale and challenging code generation benchmark comprising 10,000 difficult programming problems. While this allows for scalable and fully automated preference annotation, it is worth noting that the test cases themselves are human-authored, and thus implicitly reflect human judgment regarding correctness.

In this setup, the controller model is tasked with generating high-level observations of the problem, which are then provided as guidance to a black-box LLM to assist in generating the final code solution. Due to time constraints, we were unable to optimize the intermediate prompts; however, our approach still outperforms strong baselines. Specifically, we compare against the following: (1) direct prompting, (2) a black-box + black-box pipeline using the same prompt as our method, and (3) an unfinetuned controller + black-box setting. For both our method and direct prompting, we also incorporate a reflection strategy to further boost performance. All experiments use Qwen2.5-7b-instruct as the controller model and GPT-4o-mini as the black-box LLM, with the controller trained via DPO on collected preference data.

| Method | Strict Accuracy | Test Coverage |
| --- | --- | --- |
| GPT-4o-mini (Try=1) | 0.367 | 0.528 |
| GPT-4o-mini (Try=4) | 0.398 | 0.560 |
| GPT-4o-mini + GPT-4o-mini | 0.364 | 0.528 |
| Qwen2.5-7b-instruct + GPT-4o-mini | 0.367 | 0.528 |
| M-Pilot + GPT-4o-mini (Try=1) | 0.374 | 0.539 |
| M-Pilot + GPT-4o-mini (Try=4) | 0.411 | 0.572 |

Here, Strict Accuracy refers to solving a problem by passing all test cases, while Test Coverage denotes the average pass rate across test cases. Despite the lack of optimal intermediate prompt design, our method surpasses the strongest baseline by 0.7% in the single-pass setting, and this margin increases to 1.3% with four reflection attempts.
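For reference, here is a small sketch of how these two metrics can be computed from per-test-case results; the input layout (a list of per-problem boolean pass/fail lists) is an illustrative assumption, not the benchmark's actual evaluation code.

```python
# Minimal sketch of the two metrics as defined above; the input format
# (a list of per-problem boolean test results) is an illustrative assumption.
def strict_accuracy(results: list[list[bool]]) -> float:
    # A problem counts only if every one of its test cases passes.
    return sum(all(tests) for tests in results) / len(results)

def test_coverage(results: list[list[bool]]) -> float:
    # Average pass rate across test cases, averaged over problems.
    return sum(sum(tests) / len(tests) for tests in results) / len(results)

results = [[True, True, True], [True, False, True], [False, False, False]]
print(strict_accuracy(results))  # 1/3 ≈ 0.333
print(test_coverage(results))    # (1.0 + 2/3 + 0.0) / 3 ≈ 0.556
```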

Compared to simpler domains such as planning, code generation poses significantly greater complexity and requires more careful design of intermediate prompts. We believe that with more refined prompt engineering, our framework could yield even stronger results in this setting.

[1] Measuring Coding Challenge Competence with APPS. (Hendrycks et al., 2021)

We hope our responses address the reviewer’s concerns and further highlight the value of our work. Please feel free to let us know if any questions remain.

Comment

I thank the authors for their response; they addressed all my questions and added valuable extra experiments of interest.

Comment

Thank you for your feedback on our additional experiments and for your positive assessment. Your constructive suggestions throughout the review process have really improved our work. We appreciate your time and effort.

Review (Rating: 4)

This paper proposes Matryoshka Pilot (M-Pilot), which utilizes a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Technically, M-Pilot is trained with DPO to pivot the outputs of the black-box LLM into alignment with preferences during iterative interaction.

Strengths and Weaknesses

Strengths

  1. The authors conducted experiments across multiple domains, including personalization, reasoning, and planning, demonstrating the effectiveness of their method.
  2. The proposed method, i.e., driving a black-box LLM using a white-box LLM by regarding the black-box LLM as an environment, to the best of my knowledge, is novel.

Weaknesses

  1. Iteratively annotating preferences for black-box LLM outputs can be labor-intensive.

Questions

  1. Have you ever considered using online RL to train the white-box LLM on domains that have verifiable rewards, like math reasoning?
  2. Is training the white-box LLM to steer the black-box LLM toward desired outputs really easier than training the white-box LLM to produce the desired outputs directly? How can this be shown? Do you have any explanation or intuition?

Limitations

yes

Final Justification

Several of my concerns have been addressed.

Formatting Issues

N/A

Author Response

We are grateful for the reviewer’s constructive remarks and suggestions. We respond to each point in the following:

Weakness 1: Preference Annotation Cost

We first emphasize that our method is compatible with single-turn DPO, iterative DPO, and other RL algorithms for training the white-box LLM, using either human rankings or an automatic verifier. In fact, in our experiments we already exploit a verifier during learning.

For example, in planning tasks like AlfWorld, preferences are determined automatically based on whether the generated plan leads to successful task execution. In reasoning tasks such as GSM8K, preferences can be inferred by comparing the final answer against the ground truth. In personalization tasks like LaMP-1/2M/2N, preferences are derived by comparing the final answer against the ground truth; for LaMP-3/4, preferences are derived by comparing MAE/Rouge-L scores.

This design enables scalable and reproducible preference annotation without requiring manual human supervision, thereby making our framework more practical and efficient for real-world applications.
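A schematic sketch of how such verifier-based preference pairs could be assembled is given below; the function names, data structures, and pair format are assumptions for exposition only, not the released implementation.

```python
# Illustrative sketch of turning automatic verifier signals into DPO preference
# pairs; the function names and pair format are assumptions for exposition only.
from itertools import combinations

def verifier_score(task_type: str, output: dict, reference) -> float:
    if task_type == "planning":          # AlfWorld: did the guided plan execute successfully?
        return 1.0 if output["success"] else 0.0
    if task_type == "reasoning":         # GSM8K: exact match against the ground-truth answer
        return 1.0 if output["answer"] == reference else 0.0
    if task_type == "personalization":   # LaMP: MAE (assumed negated so higher is better) or Rouge-L
        return output["score"]
    raise ValueError(f"unknown task type: {task_type}")

def build_preference_pairs(prompt: str, guidances: list, outputs: list,
                           task_type: str, reference) -> list:
    scored = [(g, verifier_score(task_type, o, reference)) for g, o in zip(guidances, outputs)]
    pairs = []
    for (g_a, s_a), (g_b, s_b) in combinations(scored, 2):
        if s_a != s_b:                   # keep only pairs with a clear winner
            chosen, rejected = (g_a, g_b) if s_a > s_b else (g_b, g_a)
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs
```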

Question 1 & Question 2: Comparison with Online RL and Direct Output Training

Thank you for the question. We have included additional evaluations of the LLaMA3-8B-Instruct model on the GSM8K and GSM-Hard datasets. Following the same experimental settings as outlined in the paper, we employed a three-shot prompt design and incorporated closed-loop refinement via PAL during the inference stage.

We evaluated the performance of the vanilla LLaMA3-8B-Instruct model, the DPO fine-tuned LLaMA3-8B-Instruct model, and the SFT fine-tuned LLaMA3-8B-Instruct model. As shown in Table 1, even when utilizing a relatively less powerful black-box model such as GPT-3.5, our white-box + black-box framework outperforms the powerful SFT fine-tuned white-box model by 6.4% on the GSM8K dataset and 6.3% on the GSM-Hard dataset. These results strongly demonstrate that our black-box + white-box approach is significantly more effective than directly fine-tuning the white-box model.

Table 1: Comparison with fine-tuned white-box models on GSM task

| Method | GSM8K | GSM-Hard |
| --- | --- | --- |
| LLaMA3-8B-Instruct | 0.841 | 0.661 |
| SFT fine-tuned LLaMA3-8B-Instruct | 0.847 | 0.675 |
| DPO fine-tuned LLaMA3-8B-Instruct | 0.832 | 0.641 |
| M-Pilot (LLaMA3-8B-Instruct) + GPT-3.5 | 0.911 | 0.738 |
| M-Pilot (LLaMA3-8B-Instruct) + GPT-4o-mini | 0.956 | 0.779 |

The reason is that solving the task often requires advanced reasoning capabilities, which are typically beyond the capacity of a white-box LLM at the 8B scale. In contrast, black-box LLMs have already acquired extensive knowledge during pre-training. Through reinforcement learning-based tuning, the white-box LLM can more effectively learn how to elicit and leverage this embedded knowledge from the black-box LLM, leading to improved accuracy.

Thank you once again for your insightful review. We appreciate your feedback on our work. Feel free to let us know if you have any further questions, and we are happy to discuss further.

Comment

Dear Reviewer CPkc,

Thank you for taking the time to review our work and for providing such insightful suggestions! We will make sure to incorporate the additional claims and experimental designs in future revisions.

During the rebuttal phase, in the Preference Annotation Cost section, we clarified that our framework is compatible with various training strategies, including multi-turn DPO training and different reinforcement learning algorithms with diverse reward formulations. To further demonstrate the practicality of our approach, we conducted case studies across all three tasks in the main paper, showing that our method remains efficient and feasible in real-world applications.

In the Comparison with Online RL and Direct Output Training section, we added new experiments comparing our framework with direct training of white-box LLMs using both SFT and DPO. These results show that, even when using a relatively less powerful black-box LLM like GPT-3.5, our method outperforms the strongest white-box fine-tuning baseline by 6.3%. We also provided analysis explaining the underlying reasons for this advantage, further highlighting the effectiveness and promise of our white-box controller guiding black-box LLMs framework.

Thank you again for your thoughtful and constructive feedback!

Please let us know if you have any further questions. If you feel our response has addressed your concerns, we would be grateful if you would consider raising your rating for our paper. We truly appreciate your consideration.

Best Regards,

Authors of Matryoshka Pilot

Review (Rating: 4)

The paper introduces Matryoshka Pilot (M-Pilot), a framework that employs a lightweight white-box LLM as a controller to guide a black-box LLM through complex tasks such as reasoning, planning, and personalization. The key idea is to decompose complex tasks into intermediate outputs, which the controller generates to steer the black-box LLM toward better performance. The framework leverages iterative direct preference optimization (IDPO) to refine the controller's guidance based on feedback from the black-box LLM's outputs. Empirical results demonstrate significant improvements over baseline methods across multiple tasks, including reasoning (GSM8K), planning (ALFWorld), and personalization (LaMP).

Strengths and Weaknesses

Strengths

  1. The paper presents a well-designed framework with thorough empirical validation across diverse tasks. The experiments are comprehensive, and the results are statistically significant, demonstrating consistent improvements over strong baselines.
  2. The paper is well-structured and clearly written. The methodology is explained in detail, and the figures (e.g., Figure 2, 3) effectively illustrate the framework.
  3. The idea of using a white-box controller to guide a black-box LLM via intermediate outputs is novel. The integration of IDPO for iterative refinement is a creative approach to improving guidance quality.

Weaknesses

  1. While the method shows strong performance on selected benchmarks, it is unclear how well it generalizes to other tasks or domains not covered in the paper.
  2. The iterative optimization process (IDPO) may introduce additional computational costs, which are not thoroughly analyzed.
  3. Some baseline comparisons such as against recent adaptive prompting methods like DSP or APE, which could strengthen the claims, are missing.
  4. The paper does not deeply analyze the quality or interpretability of the intermediate guidance generated by the controller, which could be important for trust and debugging.

Questions

  1. How generalizable is M-Pilot? How does it perform on tasks outside reasoning, planning, and personalization (e.g., creative writing, code generation)? Are there task types where this approach might struggle?
  2. Have the authors considered comparing against recent adaptive prompting methods or retrieval-augmented LLMs?
  3. Could you perform intermediate guidance analysis? Could the authors provide examples or a qualitative analysis of the intermediate guidance generated by the controller? How does its quality vary across tasks?
  4. Under what conditions does M-Pilot fail to improve black-box LLM performance? Are there cases where the controller’s guidance degrades output quality?

Limitations

The authors acknowledge that the method relies on the quality of the white-box controller and that the framework may not generalize to all black-box LLMs equally. However, they could further discuss:

  • The sensitivity of M-Pilot to the choice of white-box LLM, for example, would a smaller controller still work?
  • Potential negative societal impacts, such as misuse of the framework for generating deceptive or biased outputs.

Final Justification

I recommend accept

Formatting Issues

N/A

Author Response

Thank you for your insightful comments and for taking the time to review our paper. We respond to your suggestions as follows:

Weakness 1 & Question 1: Generalization to Unseen Tasks

We appreciate your suggestion to extend the framework to other task domains. We emphasize that our method is widely applicable to different problems, and we additionally test it on code generation using the APPS [1] benchmark, a large-scale dataset with 10,000 challenging programming problems.

In this setup, the controller generates high-level observations of the problem, which are then used to guide a black-box LLM in producing the final code. Due to time constraints, we were unable to optimize the intermediate prompts; however, our approach still outperforms strong baselines. Specifically, we compare against: (1) direct prompting, (2) a black-box + black-box pipeline using the same prompt as our method, and (3) an unfinetuned controller + black-box setting. For both our method and direct prompting, we also incorporate a reflection strategy to further boost performance. All experiments use Qwen2.5-7b-instruct as the controller model and GPT-4o-mini as the black-box LLM, with the controller trained via DPO on collected preference data.

| Method | Strict Accuracy | Test Coverage |
| --- | --- | --- |
| GPT-4o-mini (Try=1) | 0.367 | 0.528 |
| GPT-4o-mini (Try=4) | 0.398 | 0.560 |
| GPT-4o-mini + GPT-4o-mini | 0.364 | 0.528 |
| Qwen2.5-7b-instruct + GPT-4o-mini | 0.367 | 0.528 |
| M-Pilot + GPT-4o-mini (Try=1) | 0.374 | 0.539 |
| M-Pilot + GPT-4o-mini (Try=4) | 0.411 | 0.572 |

Here, Strict Accuracy refers to solving a problem by passing all test cases, while Test Coverage denotes the average pass rate across test cases. Despite the lack of optimal intermediate prompt design, our method surpasses the strongest baseline by 0.7% in the single-pass setting, and this margin increases to 1.3% with four reflection attempts.

Compared to simpler domains such as planning, code generation poses significantly greater complexity and requires more careful design of intermediate prompts. We believe that with more refined prompt engineering, our framework could yield even stronger results in this setting.

[1] Measuring Coding Challenge Competence with APPS. (Hendrycks et al., 2021)

Weakness 2: Computational Overhead of Iterative Optimization

We acknowledge that iterative DPO introduces additional computational overhead compared to single-turn DPO.

We first emphasize that our method is compatible with both single-turn DPO and iterative DPO, and other RL algorithms for the white-box LLM training.

Our experiments suggest that even a single iteration of DPO is sufficient to yield strong performance. For example, as shown in Table 4, the second iteration only brings a marginal improvement of 1.49% over the first, while a single-turn DPO already outperforms the strongest baseline by 6.72%. This strong performance of single-turn DPO is also evident in both reasoning and personalization tasks, as demonstrated in Table 3 and Table 1, respectively.

We adopted iterative DPO in our setting to make the algorithm compatible with both the offline setting (a single turn on the provided samples) and the on-policy setting (with data collected through the current policy).
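Schematically, the iterative variant only wraps the single-turn procedure in an outer loop over on-policy data collection. In the sketch below, `collect_pairs` and `dpo_update` are caller-supplied stand-ins for preference collection and one DPO optimization round; they are placeholders, not the actual training code.

```python
# Schematic only: how single-turn and iterative DPO relate in this setting.
def train_controller(controller, tasks, blackbox, collect_pairs, dpo_update,
                     num_iterations: int = 1):
    for _ in range(num_iterations):
        # Data is collected with the *current* controller policy, so a single
        # iteration corresponds to the offline, single-turn setting, while
        # additional iterations give the on-policy, iterative variant.
        pairs = collect_pairs(controller, tasks, blackbox)
        controller = dpo_update(controller, pairs)
    return controller
```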

Weakness 3 & Question 2: Missing Comparison with Adaptive Prompting Methods

We appreciate the suggestion to include additional comparisons with adaptive prompting methods. In response, we now compare our approach with APE [2] and PHP [3] on GSM8K reasoning tasks, as shown in Table 1. It is worth noting that DSP [4] involves additional tuning via SFT and/or RL; thus, we include PHP as a more comparable baseline.

Table 1: Comparison with baselines on GSM8K

| APE | PHP | M-Pilot (Ours) |
| --- | --- | --- |
| 0.67 | 0.85 | 0.93 |

All experiments are conducted using GPT‑3.5‑Turbo as the black-box LLM. Our method consistently outperforms baseline approaches on GSM8K.

APE optimizes prompts by searching over a candidate pool generated by an LLM, while PHP incrementally uses previously generated answers as hints to guide the model toward the correct solution. Although these methods share our goal of prompt optimization via LLMs, they both rely on training-free, in-context learning strategies.

In contrast, DSP utilizes a small policy model to generate static clues, without modeling multi-turn interactions between the controller and the black-box LLM. Our method introduces a trainable white-box controller that not only performs prompt optimization but also engages in multi-turn interaction with the black-box model. This design offers a more powerful and flexible mechanism for guiding the black-box LLM and further enhancing its capabilities. We will incorporate these comparisons into our revised work.

[2] Large language models are human-level prompt engineers. (Zhou et al., 2022)

[3] Progressive-hint prompting improves reasoning in large language models. (Zheng et al., 2023)

[4] Guiding Large Language Models via Directional Stimulus Prompting (Li et al., 2023)

Weakness 4 & Question 3: Limited Analysis of Intermediate Guidance

We appreciate the helpful suggestion. We have included a case study on intermediate guidance for three tasks in Appendix Section I. Here, we provide a more focused and detailed analysis of the quality of intermediate guidance across the three task domains.

In AlfWorld, the controller plays a key role in generating a high-level plan for the given task, which is then used to guide a black-box LLM in executing the required actions. For example, given the task "put a hot potato in the garbage can", the controller produces the following decomposition:

General plan: Retrieve a list of receptacles to locate a potato, bring it to a microwave for heating, and then place it into a garbage can.
Step 1: Get a list of likely receptacles containing a potato.
Step 2: Search through the receptacles until a potato is found.
Step 3: Pick up the potato.
Step 4: Go to a microwave and heat the potato.
Step 5: Go to a garbage can and dispose of the potato.

With this structured guidance, the black-box LLM is able to successfully complete the task.

In contrast, when prompted to solve the task directly without decomposition, the LLM tends to follow an incorrect default action. For example, "go to a stoveburner to heat the potato", which consistently fails with the error: "I cannot heat the potato 2 using the stoveburner."

In GSM8K, the controller is responsible for generating intermediate reasoning steps that guide the black-box LLM to arrive at the correct final answer. Take the following math problem as an example:

"Every day, Wendi feeds each of her chickens three cups of mixed chicken feed... In the morning, she gives her flock 6,887,483 cups of feed, and another 25 cups in the afternoon. If the flock contains 20 chickens, how many cups of feed does she need to give in the final meal of the day?"

Without decomposition, the black-box LLM incorrectly attempts to reverse-engineer the feed-per-chicken values from the morning and afternoon feed, resulting in a negative and invalid final answer (e.g., -8.0).

In contrast, with controller-generated intermediate guidance:

1. Calculate the total amount of feed given in the morning and afternoon.
2. Determine the total daily feed requirement per chicken.
3. Subtract the given feed from the total to find the final meal's requirement.

The black-box model is able to reason in the correct direction, yielding the correct solution structure.

In LaMP-1, the task for the controller is "Write a summary, in English, of the research interests and topics of a researcher who has published the following papers." The black-box model then answers a question about the author's reference choice based on this summary, so the performance depends heavily on the quality of the summary.

An example of the summary: “The researcher's work spans several domains within wireless sensor networks, network communications, multimedia, and social network analysis. Specifically, the research interests include:...” High-quality summaries generated by the controller are able to provide good and direct guidance for the answer.

In summary, M-Pilot is able to generate high-quality intermediate responses, and thus improve the overall performance on various tasks.

Question 4: Failure Cases of Controller Guidance

We manually reviewed all intermediate guidance across the AlfWorld and GSM-Hard evaluation sets. In AlfWorld, we observed that whenever M-Pilot fails, direct prompting also consistently fails, highlighting the strong quality of the intermediate guidance.

In GSM-Hard, we identified 11 out of 1,319 cases where M-Pilot fails while direct prompting succeeds. For example, consider the question:

Last Friday, 13 of the 82 teachers at Rydell Elementary School were sick. There were 2,655,185 substitute teachers called in to help. How many teachers were at school that day?

M-Pilot decomposes this into:

1. How many teachers were present at school that day?
2. How many teachers were not sick?
3. How many teachers were not called in as substitutes?

We found that the third question is ill-posed and semantically meaningless, which misleads the black-box LLM and leads to an incorrect answer.

Overall, we find that M-Pilot only degrades performance when it generates clearly misleading intermediate guidance, which occurs infrequently in our experiments, and is limited to extremely challenging tasks like reasoning. In the aggregate, M-Pilot still brings a significant improvement across the entire dataset.

Thank you again for your review. We hope our response could address your concerns. If you have any further questions, we would be happy to discuss them further.

Comment

Dear Reviewer V28U,

Thank you for taking the time to review our work and for providing many constructive comments! We will make sure to incorporate the additional claims, baseline comparisons, and experimental designs in future revisions.

During the rebuttal phase, in the Generalization to Unseen Tasks section, we extended our framework to the code generation domain. Despite the limited time, we were able to demonstrate consistent improvements over strong baselines, further supporting the effectiveness and generalizability of our method.

In the Computational Overhead of Iterative Optimization section, we highlighted that even a single round of RL training can already outperform the strongest baselines. Moreover, our framework is compatible with multiple RL algorithms and supports both online and offline settings, reinforcing the robustness and flexibility of our approach.

In the Comparison with Adaptive Prompting Methods section, we conducted additional experiments on GSM8K, where our method outperformed the strongest adaptive prompting baseline by 8% using GPT-3.5-Turbo. We also provided a clear distinction between our approach and these baselines, emphasizing that the controller model in our framework engages in multi-turn interaction with the black-box LLM, which traditional prompting cannot do.

To further illustrate the effectiveness of our intermediate guidance, we included case studies across all three tasks, analyzing how the controller helps guide black-box LLMs toward more effective task completion. We also examined failure cases, noting that they are extremely rare. For example, on AlfWorld, we did not observe a single instance where direct prompting succeeded while our method failed. On the more challenging GSM-Hard task, only 11 out of 1,319 cases exhibited such behavior, and we provided explanations for these exceptions. In aggregate, these results demonstrate the significant and reliable performance gains enabled by our method.

Thank you again for your thoughtful and constructive feedback!

Please let us know if you have any further questions. If you feel our response has addressed your concerns, we would be grateful if you would consider raising your rating for our paper. We truly appreciate your consideration.

Best Regards,

Authors of Matryoshka Pilot

Review (Rating: 4)

  1. The paper addresses the challenge that improving black-box LLMs is difficult because additional training on their inaccessible parameters is infeasible.
  2. It introduces Matryoshka Pilot (M-Pilot), where a separate, trainable white-box LLM acts as a controller to guide a black-box LLM by generating intermediate outputs.
  3. The method improves performance by 3.19% in reasoning accuracy, 7.46% in planning success rate, and 5.82% in personalization accuracy on average.

Strengths and Weaknesses

Strengths:

  • The framework is validated by showing strong performance against various baselines, including in-context learning and adapter-based approaches (Sec 4).
  • M-Pilot is also shown to be more effective than using a black-box LLM as a controller (Appendix C.6), which is consistent with the core motivation of the paper.

Weaknesses:

  • While the paper shows good sample efficiency on the planning task, its effectiveness in a true few-shot setting across all tasks remains unclear. It would be beneficial to see a true few-shot learning experiment where the controller is trained on a variety of small numbers of preference pairs (e.g., 1, 2, 4, 8, 16, …).
  • The white-box controller's own capability might be a potential bottleneck. If a problem is too complex or requires niche knowledge to even decompose, the small model may fail to generate useful guidance. It would improve the effectiveness of the method if the paper shows some results on tasks from highly complex and specialized domains.
  • Crucial examples of the intermediate guidance for each task type (e.g., Figure 5) should be included in the main content to improve the readability of the paper.

Questions

  1. The paper could provide a more detailed justification for why using a trainable white-box model is superior to using a highly engineered prompt to instruct the black-box model to first generate its own detailed chain-of-thought (i.e., guidance) and then provide an answer.
  2. Do the white-box model training and multi-step process of using a controller to generate guidance introduce significant latency compared to a single direct prompt?

Limitations

yes

Final Justification

The authors provide a comprehensive response during the rebuttal addressing most my concerns.

I would encourage the authors to include these findings in a future version of the paper. I will maintain my initial score of 4.

Formatting Issues

n/a

Author Response

Thank you so much for your positive and encouraging feedback. Below, we address your comments and suggestions in detail:

Weakness 1: Few-shot Learning Experiments

Thank you for the suggestion. Below, we present few-shot experiments conducted on GSM8K, AlfWorld, and LaMP. In these experiments, the controller model is fine-tuned with varying numbers of preference pairs (1, 2, ..., 128). For comparison, we also include the performance of the unfine-tuned controller as well as the controller fine-tuned on the full dataset. The results are summarized as follows:

AlfWorld:

| # Pairs | 0 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | Full |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AlfWorld | 0.813 | 0.791 | 0.836 | 0.821 | 0.821 | 0.828 | 0.836 | 0.866 | 0.881 | 0.948 |

GSM:

| # Pairs | 0 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | Full |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K | 0.954 | 0.956 | 0.960 | 0.958 | 0.958 | 0.959 | 0.959 | 0.961 | 0.961 | 0.964 |
| GSM-Hard | 0.780 | 0.801 | 0.800 | 0.796 | 0.799 | 0.798 | 0.799 | 0.803 | 0.801 | 0.801 |

LaMP: Accuracy for LaMP-1/2M/2N, MAE for LaMP-3 (lower is better), Rouge-L for LaMP-4

| # Pairs | 0 | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | Full |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LaMP-1 | 0.608 | 0.616 | 0.620 | 0.619 | 0.619 | 0.611 | 0.621 | 0.622 | 0.624 | 0.640 |
| LaMP-2M | 0.499 | 0.501 | 0.502 | 0.501 | 0.499 | 0.504 | 0.506 | 0.506 | 0.506 | 0.509 |
| LaMP-2N | 0.777 | 0.778 | 0.774 | 0.780 | 0.781 | 0.783 | 0.786 | 0.799 | 0.805 | 0.808 |
| LaMP-3 (↓) | 0.313 | 0.312 | 0.312 | 0.311 | 0.312 | 0.311 | 0.311 | 0.309 | 0.307 | 0.301 |
| LaMP-4 | 0.139 | 0.139 | 0.139 | 0.140 | 0.141 | 0.141 | 0.142 | 0.144 | 0.146 | 0.153 |

We use Qwen2.5-7B-Instruct as the controller model across all tasks. For the GSM8K and LaMP series, GPT-4o-mini serves as the black-box LLM, while GPT-3.5-turbo-0125 is used for AlfWorld.

From the results, we observe that on unfamiliar datasets such as AlfWorld, increasing the number of preference pairs significantly boosts performance, with a +9.0% gain when scaling from 1 to 128 pairs. In contrast, on tasks the controller is already familiar with, such as GSM8K, the performance improvement under the few-shot setting is marginal (less than 1.0%).

For LaMP-1, training on even a single sample noticeably improves performance, while for the other tasks performance improves gradually as the number of data points increases.

This can be explained by the content generated by the controller. Compared to other LaMP tasks, in LaMP-1, the controller generates a summary, which has a greater impact on the black-box model’s ability to solve the problem. Even with very few samples, the controller can still significantly improve performance through training.

Weakness 2: Controller Capacity on Complex Tasks

Thank you for the insightful suggestion. We agree that the white-box model's capability might be a potential bottleneck in extremely complicated situations. However, we conjecture that in most cases, a medium-sized LLM is sufficient for task decomposition.

To this end, we conducted additional experiments on APPS [1], a large-scale and challenging code generation benchmark consisting of 10,000 difficult programming problems.

In this setup, the controller model is tasked with generating high-level observations of the problem, which are then provided as guidance to a black-box LLM to assist in generating the final code solution. Due to time constraints, we were unable to optimize all the intermediate prompts; however, our approach still outperforms strong baselines. Specifically, we compare against the following: (1) direct prompting, (2) a black-box + black-box pipeline using the same prompt as our method, and (3) an unfinetuned controller + black-box setting. For both our method and direct prompting, we also incorporate a reflection strategy to further boost performance. All experiments use Qwen2.5-7b-instruct as the controller model and GPT-4o-mini as the black-box LLM, with the controller trained via DPO on collected preference data.

| Method | Strict Accuracy | Test Coverage |
| --- | --- | --- |
| GPT-4o-mini (Try=1) | 0.367 | 0.528 |
| GPT-4o-mini (Try=4) | 0.398 | 0.560 |
| GPT-4o-mini + GPT-4o-mini | 0.364 | 0.528 |
| Qwen2.5-7b-instruct + GPT-4o-mini | 0.367 | 0.528 |
| M-Pilot + GPT-4o-mini (Try=1) | 0.374 | 0.539 |
| M-Pilot + GPT-4o-mini (Try=4) | 0.411 | 0.572 |

Here, Strict Accuracy refers to solving a problem by passing all test cases, while Test Coverage denotes the average pass rate across test cases. Despite the lack of optimal intermediate prompt design, our method surpasses the strongest baseline by 0.7% in the single-pass setting, and this margin increases to 1.3% with four reflection attempts.

Compared to simpler domains such as planning, code generation poses significantly greater complexity and requires more careful design of intermediate prompts. We believe that with more refined prompt engineering, our framework could yield even stronger results in this setting.

[1] Hendrycks, Dan, et al. "Measuring coding challenge competence with apps." arXiv preprint arXiv:2105.09938 (2021).

Weakness 3: Missing Intermediate Guidance Examples in Main Content

We appreciate the suggestion and agree that including key examples of intermediate guidance can enhance the clarity of our approach. Accordingly, we will move Figure 5 into the main paper to improve readability and better illustrate the guidance process across different task types.

Question 1: Justification for White-box Controller vs. Engineered Prompting using Black-box LLM

Thank you for the insightful suggestion! Below we first provide a general explanation, followed by a detailed example from AlfWorld to support our response.

Overall Benefit – Effective Problem Decomposition with Feedback:

As described in our abstract, we treat the black-box LLM as an environment and the white-box LLM as a controller. The white-box model decomposes tasks and queries the black-box model. The black-box model then either interacts with the environment or compares its outputs to the ground truth, returning feedback. This feedback serves as a supervisory signal to train the controller, allowing it to learn which decompositions are most effective. As a result, the trained white-box controller produces decompositions that better guide the black-box model to solve tasks. This is something the black-box model alone, without feedback, cannot achieve.
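Before turning to the concrete example below, here is a minimal sketch of this controller-as-policy, black-box-as-environment loop; all object interfaces and function names are hypothetical placeholders, not the released implementation.

```python
# Minimal sketch of the controller-as-policy / black-box-as-environment loop
# described above; all object interfaces here are hypothetical placeholders.
def rollout(controller, blackbox, env, task):
    guidance = controller.generate(task)             # white-box: decompose the task
    trajectory = blackbox.act(task, guidance, env)   # black-box: carry out the steps
    reward = env.evaluate(trajectory)                # feedback: success or ground-truth match
    return guidance, reward

def collect_feedback(controller, blackbox, env, task, num_samples: int = 4):
    # Several decompositions are sampled per task; the returned rewards indicate
    # which decompositions steer the black-box model best. This is the supervisory
    # signal that a prompt-only, black-box-only pipeline never receives.
    return [rollout(controller, blackbox, env, task) for _ in range(num_samples)]
```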

Example – AlfWorld Task: "Put a clean mug in coffee machine" Given this task and the environment state:

White-box Controller's plan:

General plan: I need to get a list of receptacles to find a mug, take the mug to a sinkbasin, clean it, and put it in a coffee machine.
Step 1: get a list of receptacles where a mug is likely to appear.
Step 2: go to each receptacle in the list until seeing a mug.
Step 3: identify the mug I just found and take it.
Step 4: go to a sinkbasin to clean the mug.
Step 5: go to a coffee machine and put the mug in it.

Black-box Controller's plan:

General plan: I need to find a clean mug and put it in the coffee machine.
Step 1: Get a list of receptacles where a clean mug is likely to be found.
Step 2: Go to each receptacle in the list until finding a clean mug.
Step 3: Identify the clean mug and take it.
Step 4: Go to the coffee machine.
Step 5: Put the clean mug in the coffee machine.

The black-box Controller mistakenly assumes a clean mug is readily available, skipping the cleaning step. In contrast, the white-box Controller correctly decomposes the task to include cleaning, leading to a successful outcome. This highlights the value of decomposition guided by environment feedback.

Question 2: Latency Introduced by Controller Pipeline

Thank you for the insightful question regarding the latency of our framework. We acknowledge that our method introduces additional latency to some degree. Specifically, training the white-box controller on ~2k preference pairs takes approximately 1 hour using two GPUs. For inference, the framework is roughly 2× slower than direct prompting. For example, on GSM8K without any parallelization, our evaluation pipeline takes 159 minutes, compared to 75 minutes for direct GPT-4o-mini prompting.

However, this added cost brings substantial benefits: the training adapts the LLM's outputs in a specific direction with far less computation than full post-training, while the multi-step process provides more refined control than a single direct prompt. In fact, our framework significantly enhances the performance of the black-box LLM, yielding improvements of 3.19% in reasoning, 7.46% in planning, and 5.82% in personalization. In this sense, the increased time can be viewed as a trade-off for better performance.

Importantly, the white-box controller only needs to be trained once. We also show that it generalizes well and can be transferred to new tasks in a plug-and-play manner, further improving overall efficiency.

Thank you again for your review. We hope our response could address your concerns. If you have any further questions, we would be happy to discuss them further.

Comment

Dear Reviewer qqQB,

Thank you for taking the time to review our work and for providing such valuable feedback! We will make sure to incorporate the additional claims and experimental designs in future revisions.

During the rebuttal phase, in the Few-shot Learning Experiments section, we conducted additional experiments across all three tasks—GSM8K, AlfWorld, and LaMP. The results show that even a small number of preference examples can lead to significant improvements in unfamiliar domains like AlfWorld, whereas the impact is less on more familiar tasks such as GSM8K.

In the Controller Capacity on Complex Tasks section, we extended our approach to the code generation domain. Despite the limited rebuttal time, we demonstrated consistent performance gains over strong baselines, further validating the generalizability and effectiveness of our method.

To clarify the advantage of our approach over traditional black-box prompting, we emphasized that our method can leverage supervisory signals from the environment, which direct prompting cannot. We also provided a detailed example on the AlfWorld task to illustrate how this advantage translates into more effective problem decomposition.

In the Latency section, we conducted a quantitative analysis comparing runtime and effectiveness between our approach and black-box baselines. We explicitly highlighted the trade-off between latency and performance, demonstrating that the increase in computation is justified by the substantial improvement in results.

Thank you again for your constructive suggestions!

Please let us know if you have any further questions. If you feel that our response addresses your concerns, we would be grateful if you would consider raising your rating for our paper. We truly appreciate your consideration.

Best Regards,

Authors of Matryoshka Pilot

Comment

Thank you for the comprehensive response and additional results. I would encourage the authors to include these findings in a future version of the paper. I will maintain my initial score of 4.

Final Decision

The paper proposes M-Pilot, a framework that employs a lightweight white-box LLM as a controller to guide a black-box LLM through complex tasks such as reasoning, planning, and personalization. The key idea is to decompose complex tasks into intermediate outputs, which the controller generates to steer the black-box LLM toward better performance. The framework leverages iterative direct preference optimization (IDPO) to refine the controller's guidance based on feedback from the black-box LLM's outputs. Empirical results demonstrate significant improvements over baseline methods across multiple tasks, including reasoning (GSM8K), planning (ALFWorld), and personalization (LaMP).

Strengths:

  • The method of iterative DPO using one lightweight LLM as a controller is novel.
  • The method achieves strong results compared to existing methods.

Weaknesses:

  • Generalization of the method's performance; the authors provided an additional experiment on a code generation task.
  • The sample efficiency is unclear.

The authors' responses are satisfactory. The authors are expected to incorporate these new results and analysis in the revision.