Prompted Policy Search: Reinforcement Learning through Linguistic and Numerical Reasoning in LLMs
Prompted Policy Search uses LLMs for RL policy search, outperforming common methods across diverse tasks.
Abstract
Reviews and Discussion
Current RL algorithms rely solely on scalar reward signals for policy optimisation, overlooking the rich semantic information about the environment that can be expressed through text. While some recent works incorporate textual cues by augmenting parts of the learning algorithm with LLMs, the authors of this paper propose a novel approach where they make the LLM the sole optimiser, fully integrating it into the optimisation process without relying on traditional RL components. They do this via iterative prompting, where the LLM has to generate the parameters of the policy based on past generations and performance scores. Additionally, they show that incorporating domain-specific information (hints) can improve the sample efficiency of the optimisation procedure.
Strengths and Weaknesses
The paper is well-written and easy to read. The core idea of using a large language model (LLM) as a policy optimiser is novel, and the experiments demonstrate that LLMs can effectively perform direct policy optimisation in reinforcement learning settings. The finding that incorporating semantic knowledge—such as environment information and hints—can enhance optimisation performance is a notable contribution. It highlights the potential of leveraging textual information and the LLM’s inductive biases to improve RL performance and reduce sample complexity. Finally, the generalisation results from fine-tuning the LLM using GRPO on new environments are promising.
That said, there are limitations with the proposed approach/paper:
- The paper would benefit from more explicitly clarifying whether the goal is to demonstrate that LLMs "can" serve as policy optimisers, or whether they are aiming to achieve state-of-the-art performance in the specific settings considered (e.g., tabular MDPs and linear policy classes). If the former is the focus, the use of LLMs as optimisers for tabular MDPs is indeed novel; however, optimising the weights of a linear network has been shown in prior work [3], which limits the novelty of this aspect of the results.
- The proposed method closely resembles evolutionary approaches to reinforcement learning, such as Evolution Strategies (ES) [2]. Therefore, I strongly recommend that the authors include ES as a baseline for comparison in their experiments.
- Similar to ES, the proposed method appears to scale poorly with the number of parameters. In addition, I suppose ProPS also scales poorly with the decimal precision needed to predict the parameters, making this worse. I recommend that the authors discuss these limitations in more detail in the paper.
- Why optimise over policy parameters rather than represent the policy as code, as done in [1]? This could offer greater expressiveness compared to a linear policy. I strongly recommend including this as a baseline for comparison.
Minor Clarifications:
- (Line 328) What do you mean by differentiable in this context?
- (Line 201) Are "context length" and "in-context history" the same thing? (The "and" between them caused some confusion.)
- (Line 144) "... language embedding in latent space ..." – what do you mean by this? Are the prompts being embedded before passing them to the LLM?
- How crucial is adding the iteration number and total number of iterations in the context to the performance?
[1] Bosio, Carlo, and Mark W. Mueller. "Synthesizing interpretable control policies through large language model guided search." arXiv preprint arXiv:2410.05406 (2024).
[2] Salimans, Tim, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. "Evolution strategies as a scalable alternative to reinforcement learning." arXiv preprint arXiv:1703.03864 (2017).
[3] Yang, Chengrun, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. "Large language models as optimisers." arXiv preprint arXiv:2309.03409 (2024).
Questions
- Can you speak more to the practicality of the approach (in real-world / Deep RL settings) as discussed in Weakness (3)?
- Why is representing the policy as a neural net a better approach than representing it as code?
- Can you benchmark your algorithm against ES [2] and [1] (which I believe are alternatives to the proposed approach)?
- Do the baseline methods also employ a linear policy/value network architecture? If yes, I would recommend adding results with the recommended architecture (e.g., a 2-layer MLP) to give a sense of the optimality gap of the proposed approach (and the baselines).
- In the experiments in Appendix C, why is searching for params in a random k-dim subspace of R^d better than directly searching in the space of R^k? Can you run an ablation for this?
- I didn't follow Figure 7: what does time vs. steps per episode mean (isn't the horizon fixed per env)? Also, can you share the wall-clock time for the entire training (8000 episodes)?
In its current form, I believe the work should be rejected, primarily due to concerns about the practicality of the approach. There is insufficient evidence that it scales to more complex environments. Addressing the above questions could lead me to raise my score.
Limitations
As noted in Weakness (3), I believe the method scales poorly with both the number of parameters and the precision required for the parameter values. I strongly recommend that the authors discuss this limitation in the paper.
Justification for Final Rating
The authors have addressed all my concerns, and I believe the paper makes valuable contributions to the community. I have therefore raised my score to “Accept.”
Formatting Issues
none
We sincerely thank the reviewer for the insightful review, and for finding the paper well-written and easy to follow. We are especially grateful for their recognition of the core contributions of our work, namely, the novelty of using an LLM as a direct policy optimizer, the demonstrated performance gains from incorporating semantic knowledge, and the promising generalization results from our fine-tuning experiments. We appreciate the reviewer’s feedback, and will address the questions in detail below.
W1: Clarifying paper goal
Our goal is to demonstrate a novel capability: LLMs can serve as competent numerical optimizers over RL policies while also being able to integrate knowledge that is available in natural language, e.g. semantic knowledge from books, library documentation, human hints, etc. While achieving SotA is not our primary aim, we compare against standard RL baselines to assess our approach. We respectfully disagree that our work is a straightforward extension of prior approaches like [3]. Our method differs in both scope and formulation, particularly in the fact that OPRO performs prompt optimization rather than numerical optimization. Due to space considerations, we ask the reviewer to please check our response to Reviewer UFYh (W1). We truly appreciate it!
W2/Q3: Similarity to ES approaches
This is an interesting observation, which we only partially agree with. In ES, a key concept is the "population". With each update step a new population is generated and the old one discarded. The population can be seen as an estimate of the distribution of likely locations of an optimum: good candidates remain, bad candidates slowly vanish (survival of the fittest). Our approach does not have such a concept and rather just keeps a history of all previous solutions, even the bad ones. It is up to the LLM to pick and choose where in the search space to continue searching based on the history. In that sense it may be closer to algorithms such as Tabu search. Our approach is able to combine multiple previous solutions together, which may work similarly to recombination in traditional ES methods. However, note that the paper [2] does not use any explicit recombination operator. Instead, in [2] learning is the result of updating the parameters of a distribution over the parameters (a Gaussian) and then sampling from it. We argue that this is very different since we do not have an explicit parametric distribution. To summarize: (a) we do not have populations but rather a history, (b) we only generate one next candidate, and (c) we do not explicitly update or sample from a parametric distribution.
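To make the distinction concrete, below is a minimal sketch of the kind of history-based update step described above: the full history of (parameters, reward) pairs is serialized into the prompt, and a single new candidate is parsed from the reply. The prompt wording and the `call_llm`/parsing helpers are illustrative assumptions, not the exact prompt or API used in the paper.

```python
# Minimal sketch of a single history-based update step in the spirit of ProPS.
# The prompt wording and the call_llm interface are illustrative assumptions.
import re

def format_history(history):
    # Keep the full history -- good and bad candidates alike -- no ranking or pruning.
    return "\n".join(f"params: {list(p)}, reward: {r:.3f}" for p, r in history)

def propose_next_params(history, num_params, call_llm):
    prompt = (
        "You are optimizing the parameters of a linear RL policy.\n"
        "All previously evaluated parameter vectors and their returns:\n"
        f"{format_history(history)}\n"
        f"Propose ONE new vector of {num_params} floats that you expect to achieve "
        "a higher return. Answer in the form: params: [v1, v2, ...]"
    )
    reply = call_llm(prompt)  # exactly one candidate is generated per update
    values = re.search(r"params:\s*\[([^\]]+)\]", reply).group(1)
    return [float(v) for v in values.split(",")]
```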
As suggested by the reviewer, we conducted a comprehensive comparison against OpenAI-ES [2]. Based on the discussion above, we include two other relevant baselines: (a) Tabu search, a history-based method, and (b) (μ + λ)-ES, an evolution strategy that has an explicit recombination operator. The results across our 15 benchmark tasks are presented in the table below, showing that our methods are highly competitive. For example, ProPS+ achieves the top score in 8 out of 15 tasks, whereas OpenAI-ES achieves the top score in 5, and Tabu search is third best with 4 (note that in some tasks multiple algorithms achieve the top score). These results reiterate that using an LLM for policy search is a viable and powerful approach.
| Task | ProPS+ | Tabu Search | (μ + λ)-ES | OpenAI-ES |
|---|---|---|---|---|
| Mount. Car (C) | 89.16±29.72 | -0.03±0.01 | -0.18±0.01 | -0.20±0.03 |
| Inv. Pend. | 1000.00±0.00 | 992.72±10.30 | 994.98±7.11 | 969.57±11.03 |
| Inv. Dbl. Pend. | 148.39±48.65 | 476.56±307.86 | 175.79±7.19 | 268.31±36.77 |
| Reacher | -18.15±22.06 | -10.41±1.27 | -13.32±1.85 | -10.95±0.51 |
| Swimmer | 227.30±56.23 | 353.13±1.50 | 321.79±9.83 | 349.86±0.90 |
| Hopper | 356.22±292.35 | 950.06±34.60 | 550.36±48.61 | 1009.38±2.10 |
| Walker | 126.75±136.44 | 261.02±40.80 | 355.12±124.32 | 938.56±43.50 |
| FrozenLake | 0.19±0.05 | 0.35±0.27 | 0.68±0.02 | 0.78±0.04 |
| CliffWalking | -96.40±22.90 | -186.28±42.53 | -187.67±58.36 | -115.08±9.42 |
| Maze | 0.97±0.00 | -2.22±0.00 | -2.22±0.00 | -2.22±0.00 |
| Nim | 0.97±0.09 | 0.17±0.52 | 1.00±0.00 | 1.00±0.00 |
| Mount. Car (D) | -116.71±15.20 | -186.77±18.71 | -193.53±4.84 | -172.51±20.82 |
| Navigation | 2779.55±270.65 | 2266.62±270.83 | 1193.28±137.70 | 2596.58±73.39 |
| Pong | 2.99±0.03 | 2.70±0.29 | 2.32±0.17 | 2.29±0.14 |
| Cartpole | 500.00±0.00 | 500.00±0.00 | 499.38±0.87 | 498.21±2.54 |
Please note we do not claim a new SotA optimizer that uniformly surpasses all existing methods. Rather, our central contribution is to introduce and validate the novel phenomenon that LLMs can serve as direct policy optimizers, unifying numerical and linguistic reasoning within a single framework. This opens a new door for RL research, enabling more transparent and human-aligned optimization through natural language guidance and interpretable textual justifications.
W3/Q1: Scalability and precision
The ability to optimize policies with up to ~100 parameters already makes ProPS directly applicable to a wide range of meaningful, real-world RL problems, including Dynamic Motor Primitives. Following the reviewer's suggestions, we also optimized neural network policies with parameter counts from 80 to 435, without using random projection, and observed substantial improvements. A detailed discussion of the practicality of the approach, as well as of extending LLM optimization to higher-dimensional spaces, is provided in our response to reviewer JLQP (W2/Q1).
Regarding the concern about decimal precision, we have empirically found this is not a practical limitation. The LLM can generate and reason over floating-point parameters with sufficient precision. We observed the LLM could generate values with up to three decimal places when needed. This adaptability reflects a strength of the approach, allowing the model to refine policy precision as necessary.
Please refer to our response to reviewer UFYh (Q2) for the optimization trace of Mountain Car Continuous, where we show the LLM's responses over several iterations, illustrating parameters at different precisions.
W4/Q2: Code as Policy Representation
Traditionally, the RL community largely uses tables, matrices or neural networks as policy representations. Using code as a policy representation is a very recent concept [7] that is not yet widely used. In fact, most RL algorithms do not support such a representation, due to the combinatorial nature of the underlying optimization task and the complexity of handling language (e.g. maintaining correct syntax and semantics). However, we would like to note that ProPS would be able to support such code-as-policy representations. We look forward to investigating this and thank the reviewer for the great suggestion!
Q4: Baselines and comparison to 2 Layer MLP
Yes, all baseline methods in our experiments employed linear policies, ensuring a strict apples-to-apples comparison. We agree that comparing with nonlinear policies would be informative. However, due to compute constraints, we were unable to conduct these additional experiments within the review timeline.
That said, relevant results with nonlinear neural network policies (2-layer MLPs with [64, 32] units) have been reported in [4], using 10× more environment interactions (~3M). We summarize these in the table below for Swimmer, Hopper, and Walker2D. While nonlinear policies clearly improve rewards in Hopper and Walker2D, it's notable that ProPS with linear policies and far fewer interactions outperforms these strong baselines on Swimmer, achieving >220 compared to <150 for traditional RL.
| Environment | PPO | TRPO | DDPG | SAC | ProPS (Linear, ~300k environment interactions) |
|---|---|---|---|---|---|
| Swimmer | 120 | 90 | 140 | 45 | >220 |
| Hopper | 2400 | 2400 | 2000 | 3500 | ~350 |
| Walker2D | 1800 | 1500 | 2500 | 5400 | ~140 |
We emphasize that we are not claiming state-of-the-art performance with nonlinear DNN policies; this is an important future research direction for us. But the early results with linear policies suggest that ProPS is already surprisingly competitive on certain tasks, even under constrained settings.
Q5: Random Projections
Random projections for neural network optimization are grounded in strong theory and supported by practical results. The Johnson-Lindenstrauss lemma ensures that high-dimensional parameter spaces can be effectively embedded in much lower dimensions with minimal distortion. In neural networks, it’s well-established that the intrinsic dimensionality of many optimization problems is far lower than the full parameter count. Prior work [5,6] has demonstrated that random projections can significantly aid optimization, making them a principled and effective choice in our setting.
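For concreteness, the sketch below shows the kind of fixed random projection we have in mind: candidates are proposed in a low-dimensional subspace and lifted to the full parameter vector before evaluation. The dimensions and Gaussian scaling are illustrative assumptions rather than the exact settings used in our experiments.

```python
# Minimal sketch of optimizing in a random k-dimensional subspace of the full
# d-dimensional parameter space; dimensions and scaling are illustrative assumptions.
import numpy as np

d, k = 345, 20                                # full vs. projected dimensionality
rng = np.random.default_rng(0)
P = rng.normal(size=(d, k)) / np.sqrt(k)      # fixed random projection matrix

def lift(z):
    """Map a low-dimensional candidate z in R^k to full policy weights in R^d."""
    return P @ z

# The optimizer only ever proposes and reasons over z in R^k; the environment
# is always evaluated with the lifted weights theta = lift(z).
z = rng.normal(size=k)
theta = lift(z)
```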
Q6: Figure 7 details
The vertical axis "Time (s)" represents total wall-clock time over 8000 training episodes. The horizontal axis "Steps per Episode" reflects the average episode length per environment. This plot highlights how training time varies with the task horizon. For PPO and TRPO, runtime scales with the number of environment interactions, leading to longer training times (~1e4s) for long-horizon tasks. In contrast, ProPS performs a fixed 400 policy updates (LLM calls), yielding a consistent runtime (1500–2500s) across tasks. We will revise the figure caption and text to clarify this.
References:
[1] Bosio, C., and Mueller, M. W. "Synthesizing interpretable control policies through large language model guided search." arXiv:2410.05406 (2024).
[2] Salimans, T., et al. "Evolution strategies as a scalable alternative to reinforcement learning." arXiv:1703.03864 (2017).
[3] Yang, C., et al. "Large language models as optimizers." arXiv:2309.03409 (2024).
[4] OpenAI. "Spinning Up in Deep RL."
[5] Ijspeert, A. J., et al. "Learning attractor landscapes for learning motor primitives." NeurIPS, 2003.
[6] Carvalho, J., et al. "Adapting object-centric probabilistic movement primitives with residual reinforcement learning." IEEE-RAS, 2022.
[7] Liang, J., et al. "Code as Policies: Language Model Programs for Embodied Control." ICRA, 2023.
I appreciate the authors’ detailed response—it addressed several of the questions I raised.
Random Projections
I would assume that effective compression of data into lower-dimensional spaces requires some inherent structure in the data. To my knowledge, the full space of neural networks lacks such structure---even hypernetworks are typically trained on optimal network weights, which introduces some regularity. It would be helpful to see how ProPS or ProPS⁺ performs when applied to a network optimized in its full parameter space versus one whose parameters are projected to a lower-dimensional space using random projections.
Figure 7 details
Could you clarify why the training times of PPO, TRPO, and DQN increase with the horizon length? For PPO, is this due to the cost of Generalized Advantage Estimation (GAE)? And what accounts for the increase in DQN? A discussion on the contributing factors would be helpful.
Practical Applicability
I acknowledge the authors' point regarding the applicability of ProPS/ProPS⁺ to DMP problems in its current form. However, in the absence of empirical results or strong (ideally state-of-the-art) performance, questions about its applicability in real-world domains remain.
Practical Applicability:
We thank the reviewer for raising this important concern! To evaluate the real-world applicability of ProPS in complex robotics domains, we conducted an additional experiment using Dynamic Motor Primitives (DMPs) in a challenging robotic table tennis environment. This task is well-established in the literature [4–6] and has been used to benchmark motor primitive learning in realistic, high-speed settings.
We used the public simulation from [5] (ICML 2024), in which a 7-DoF robot arm learns to return an incoming table tennis ball to a target point [x: −0.6, y: −0.4] on the opponent's side of the table. The policy is a DMP with 70 basis functions, and the simulation includes full robot and ball physics. We applied ProPS using the same configuration as our other experiments (400 iterations).
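For readers less familiar with DMPs, the sketch below illustrates, for a single degree of freedom, how a vector of basis-function weights (the quantities ProPS searches over here) shapes the resulting trajectory. The gains, basis placement, and time constants are illustrative assumptions, not the exact settings of the table tennis setup.

```python
# Minimal sketch of a 1-D discrete DMP whose shape is set by N basis-function
# weights w; gains, basis placement, and time constants are illustrative assumptions.
import numpy as np

def dmp_rollout(w, y0=0.0, g=1.0, tau=1.0, dt=0.002,
                alpha=25.0, beta=6.25, alpha_x=3.0):
    n = len(w)
    c = np.exp(-alpha_x * np.linspace(0, 1, n))          # basis centers along the phase
    h = n / c                                            # basis widths
    y, v, x, traj = y0, 0.0, 1.0, []
    for _ in range(int(1.0 / dt)):
        psi = np.exp(-h * (x - c) ** 2)                  # RBF activations
        f = (psi @ w) / (psi.sum() + 1e-10) * x * (g - y0)   # learned forcing term
        v += dt / tau * (alpha * (beta * (g - y) - v) + f)   # transformation system
        y += dt / tau * v
        x += dt / tau * (-alpha_x * x)                   # canonical system (phase decay)
        traj.append(y)
    return np.array(traj)

trajectory = dmp_rollout(np.zeros(10))   # zero weights -> plain point attractor to g
```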
Preliminary results (detailed in the paper) show that:
- ProPS successfully learns a table tennis stroke that returns the ball to the target area.
- The resulting motion demonstrates a coordinated, pre-strike wind-up, illustrating that ProPS can generate naturalistic robot behaviors.
- When compared to baselines (PPO and OpenAI-ES), ProPS achieves better performance in terms of return and task success.
For this result, we report the average distance from the goal position over 20 episodes for the best evaluated policy in every experiment (lower is better).
| | ProPS | OpenAI-ES | PPO |
|---|---|---|---|
| Distance from the goal (m) | 0.07 ± 0.02 | 0.40 ± 0.34 | 0.53 ± 0.36 |
This experiment supports the claim that ProPS can be applied with minimal effort to complex, real-world domains and can optimize non-trivial motor skills. Combined with our results on (a) standard numerical optimization benchmarks and (b) common RL tasks, we believe this helps demonstrate both the generality and scalability of our approach.
We will include full results, robot/ball trajectories, and implementation details in the final version of the paper. Furthermore, we commit to expanding the evaluation to additional robotic domains to further validate real-world applicability.
References
[1] Arora, S., et al. "Stronger generalization bounds for deep nets via a compression approach." ICML. PMLR, 2018.
[2] Neyshabur, B., et al. "Exploring generalization in deep learning." NeurIPS 30 (2017).
[3] Frankle, Jonathan, and Michael Carbin. "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." ICLR (2019).
[4] Mülling, Katharina, et al. "Learning to select and generalize striking movements in robot table tennis." The IJRR 32.3 (2013): 263-279.
[5] Celik, Onur, Aleksandar Taranovic, and Gerhard Neumann. "Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts." ICML PMLR, 2024.
[6] D'Ambrosio, et al. “Robotic Table Tennis: A Case Study into a High Speed Learning System.” RSS. (2023)
I thank the reviewers for their detailed response; all of my concerns have been addressed. I will raise my score to "Accept".
Thank you so much! The reviewer’s feedback and comments have been very constructive. Based on the reviewer’s suggestions, we added more experiments including more ES baselines, scaling up to larger neural networks, and adding a realistic robotic motor skill task. We’d like to extend our deep appreciation for the genuine interest of the reviewer and for this valuable discussion which definitely makes our paper better!
Random Projections:
We thank the reviewer for raising this important point. We agree that effective compression generally benefits from inherent structure in the data or model class. In the context of neural networks, there is considerable evidence suggesting that the intrinsic dimensionality of the solution space is significantly lower than the number of parameters. Prior work on compression bounds [1], PAC-Bayesian generalization [2], and the lottery ticket hypothesis [3] all support the notion that deep networks can often be effectively represented or optimized in much lower-dimensional subspaces. To further investigate this in our setting, we ran additional experiments with ProPS on MLP architectures both with and without random projections. Interestingly, we found that the performance in the projected space was comparable to, and in some cases better than, optimization in the original parameter space, as illustrated in the table below. Note that randomly projecting the 345-parameter network to lower-dimensional subspaces (80 and 20 dimensions) yielded both higher rewards and a markedly lower standard deviation than optimizing in the original parameter space. Moreover, optimization in these lower-dimensional subspaces consistently yields rewards comparable to those of the smaller, 80-parameter network, with similarly low variance.
| Hidden Layers, [Neurons] | Params | Rndm Proj Params | Navigation: Mean Best ± Std |
|---|---|---|---|
| 1, [10] | 80 | N/A | 2722.81 ± 409.43 |
| 2, [15, 15] | 345 | N/A | 2325.09 ± 925.27 |
| 2, [15, 15] | 345 | 80 | 2401.25 ± 134.48 |
| 2, [15, 15] | 345 | 20 | 2718.03 ± 268.19 |
These results suggest that ProPS and ProPS⁺ are indeed capable of operating effectively in reduced-dimensional spaces, consistent with the aforementioned findings in the literature. We will add these empirical findings and a discussion of relevant theoretical foundations to the revised version of the manuscript.
Figure 7 details:
Thank you for the insightful question. The primary reason for the increased training times for PPO, TRPO, and DQN in Figure 7 is that we train for a fixed number of episodes. A longer horizon means the agent takes more total steps in the environment, which proportionally increases the number of learning updates required.
- For PPO and TRPO: These on-policy algorithms perform an update after collecting a set of episode rollouts. More total steps directly lead to more of these computationally intensive update cycles. While the cost of GAE in PPO does scale with the horizon, the main factor is the increased frequency of these updates.
- For DQN: In this off-policy algorithm, a longer horizon results in more transitions being added to the replay buffer. This leads to more frequent mini-batch sampling and Q-network updates, which drives up the total training time.
This paper introduces Prompted Policy Search (ProPS), an innovative Reinforcement Learning (RL) methodology that integrates numerical reward signals with natural language input by embedding a Large Language Model (LLM) at the core of the policy optimization loop. This approach enables LLMs to execute in-context numerical optimization, fostering sample-efficient learning through the incorporation of semantic signals like goals and domain knowledge. Empirically, ProPS outperformed seven widely-adopted RL algorithms on 8 out of 15 diverse Gymnasium tasks. A significant feature is the LLM's capacity to provide natural language justifications for policy updates, termed 'textual gradients', thereby enhancing transparency.
Strengths and Weaknesses
TL;DR: While the work has clear practical limitations, it provides a solid foundation for future research on LLM-based optimization of RL policies.
Strengths:
- This research pioneers a novel reinforcement learning paradigm by embedding a large language model directly into the policy optimization loop, transforming it into an in-context reasoning problem. This unique integration allows high-level natural language guidance to actively shape and refine the RL algorithm's optimization process, fostering greater transparency through explicit textual explanations and leading to more efficient learning.
- The work's experimental analysis transparently identifies key limitations in utilizing LLMs for direct policy updates, notably their constrained scalability to high-dimensional search spaces and susceptibility to misleading inductive biases from semantic inputs, all of which are thoroughly acknowledged in the paper's discussion.
Weaknesses:
- ProPS and ProPS+ show inconsistent (rather unstable) performance across tasks (in Table 2). This might suggest that the LLM's base knowledge and interpretation of the task strongly influence policy updates. For example, in the FrozenLake environment, adding semantic context degraded performance—despite a prompt referencing "slippery" dynamics, the LLM assumed deterministic behavior. This illustrates how pre-trained priors can introduce misleading biases when domain nuances are misinterpreted. Although the study analyzes prompt variants, it does not directly examine how the LLM's base knowledge affects its ability to handle diverse tasks.
- Please fix minor grammatical issues:
- Line 56: for better policy search “perfomance” : Change to “performance”
- Line 169: as “illustarted” in Appendix E. : Change to “illustrated”
- Figure 15: The title should be “Mountain Car (C): Ablation Study” (remove “e”)
- Line 87: Change to “Transformers that are trained…”
- Line 236 - 237: Change to “ProPS significantly outperforms the baseline algorithms”
- Line 855: Change to “ProPS then performs”
Questions
- Scaling Effects: Has there been any analysis of how method performance changes as the parameter size of an open-source LLM scales (e.g., 7B, 14B, 32B) without fine-tuning? Additionally, how does the parameter size of the RL agent affect the LLM’s reasoning and performance on downstream tasks?
- Pre-training Priors: Is there any evidence that the LLM’s pre-trained knowledge impacts its performance across different tasks?
- Stability Concerns: The standard deviation of results for both ProPS and ProPS+ in Section 4.2 suggests greater instability compared to the baselines. Could you elaborate on the possible reasons for this?
- Ablation Study (Appendix F): Could you expand on the ablation study in Appendix F regarding the effectiveness of providing hints? I'm particularly interested in whether the conclusions drawn about hint effectiveness hold for other RL tasks as well.
Limitations
The authors acknowledged work’s limitations including LLM scalability to high-dimensional Deep RL policies, the need for a deeper theoretical understanding of LLM-based optimization, and the sensitivity of ProPS+ to policy parameterization or the hints provided for its policy updates.
Justification for Final Rating
I will keep my score of "Accept," as the work's contribution is novel and well-articulated. The paper cleverly uses in-context learning as a meta-process for RL optimization in a style reminiscent of a hypernetwork.
Initially, I was concerned about the instability in the results and the limited insights from the ablation studies, which could hinder reproducibility. Discussions with the authors clarified that this instability is inherent to the domain, and they have since expanded their ablation data with additional results and insights.
My final recommendation is that the authors add a qualitative analysis of what makes a hint effective. This addition is essential for the method's reproducibility and would provide a framework for others to generalize the approach.
Formatting Issues
No formatting issues identified
We sincerely thank the reviewer for their thoughtful and thorough evaluation of our work. We are especially grateful for the recognition of the novelty in framing policy optimization as an in-context reasoning problem enabled by large language models, and for highlighting the transparency and efficiency benefits of our proposed ProPS method. We appreciate the reviewer’s acknowledgment of the strengths of our empirical evaluation and our efforts to openly discuss current limitations such as scalability and semantic bias. These insights align with our motivation to establish a foundation for future work on LLM-driven policy optimization, and we are encouraged that the reviewer sees this contribution as a promising step in that direction. We address the reviewer’s comments below, and we will include insights and results based on the reviewer’s inquiries in the revised manuscript.
Q1: Scaling effects:
We evaluated the performance of our method for different LLM sizes, starting with 0.5B up to 14B. Our findings show that LLM models below 7B show signs of life but cannot consistently optimize RL policies, since they do not follow instructions well. In other words, some experiments may fail due to a non-compliant response. While some of these issues can be alleviated to an extent by fine-tuning, they still lack the ability to effectively optimize an RL policy. The baseline results (no finetuning) for several models are presented below (when experiments failed for <7B models, we re-ran them):
| Environment | 3B | 7B | 14B |
|---|---|---|---|
| Mountain Car Continuous | -44.32 ± 44.32 | 0.0±0.0 | 8.11 ± 24.24 |
| Cartpole | 19.4 ± 9.65 | 20.74±8.82 | 162.4 ± 144.12 |
| Inverted Pendulum | 28.18 ± 0.93 | 33.1 ± 7.6 | 55.64 ± 36.5 |
| Pong | 0.45±0.09 | N/A | 0.605 ± 0.473 |
- In addition, starting with 7B, we see some improvements in RL optimization when finetuned, and even greater improvements with 14B models. The results for the 14B model were presented in the paper.
- We did not perform baseline evaluation or finetuning of the 32B model due to lack of computational resources but will explore it in the future.
We did not evaluate different policy sizes with the Qwen models; however, they were evaluated with the proprietary models. Experiments confirm substantial improvement even for neural networks with 435 parameters. For a detailed discussion, please see our response to reviewer UFYh (Q1, LLM scaling with neural networks).
Q2/W1: Pre-training Priors:
We thank the reviewer for raising this important point. We believe the LLM’s pre-trained knowledge does contribute to its performance, particularly through general reasoning abilities rather than specific domain knowledge. This is evident in two ways:
- Consistent performance of ProPS (no semantics): Even without task-specific semantic details, ProPS performs robustly across tasks and models, suggesting that LLMs can leverage general reasoning skills learned during pretraining for effective optimization.
- Improvement with ProPS+ (with semantics): Incorporating semantic domain knowledge further improves performance in most environments, showing that the model can effectively use both pretraining priors and task-specific cues. The only exceptions are two edge cases (Frozen Lake and Reacher) where performance is slightly lower but not significantly degraded. Even in these edge cases, the inclusion of semantic knowledge does not lead to a catastrophic failure.
To disentangle the role of pretraining further, we ran new experiments on FrozenLake using a reasoning-oriented model (GPT-o3-mini). With the same ProPS+ prompt, GPT-o3-mini achieved a mean maximum reward of 0.82 ± 0.05, outperforming all baselines. This suggests that differences in performance across LLMs are more attributable to their reasoning capabilities—developed during pretraining—than to memorized task-specific knowledge.
Q3: Stability Concerns:
We appreciate the reviewer’s thoughtful observation. In our analysis, the standard deviations for ProPS and ProPS+ are not consistently higher than those of the baselines. Actually, our methods exhibit lower or comparable variance in 8 out of the examined domains, and slightly higher in the remainder. High variability is an expected behavior in stochastic search algorithms used in RL policy optimization. Despite this, the overall stability of ProPS and ProPS+ is comparable to that of standard RL methods, indicating that our approach remains competitive in terms of reliability.
Q4: Ablation study:
We thank the reviewer for their interest in the ablation study. The main takeaway from Appendix F is that the inclusion of natural language hints significantly improves the performance of ProPS+, especially in tasks where hints provide actionable guidance. This is reflected in higher rewards, faster convergence, and reduced variance. To test the generality of this finding, we conducted additional experiments on three more tasks: Reacher, Walker, and Cliff Walking. Our conclusions held across these tasks as well:
- Cliff Walking: ProPS+ with hints consistently discovers optimal policies in all runs, often in the first iteration. Without hints, success drops to 90%, requiring ~27 iterations. This demonstrates how informative hints can dramatically accelerate convergence.
- Reacher: With hints, 70% of runs achieve rewards > –10, compared to only 30% without. Hints reduce the average iterations from ~160 to ~17 by providing better initialization.
- Walker: Hints improved overall reward quality (60% of runs >120 points with hints vs. 20% without). However, initialization was less effective due to variance in random seed performance, leading to longer convergence in some runs.
| Algorithm | ProPS+ Reward | ProPS+ Std Dev | ProPS+ with Hints Reward | ProPS+ with Hints Std Dev | Percent Change Reward | Percent Change Std Dev |
|---|---|---|---|---|---|---|
| Mountain Car (C) | 89.16 | 29.72 | 98.70 | 0.89 | 10.70% | -97.01% |
| Navigation | 2779.55 | 270.65 | 3022.66 | 135.54 | 8.75% | -49.92% |
| Inv. Dbl. Pend. | 148.39 | 48.65 | 161.07 | 52.28 | 8.54% | 7.45% |
| CliffWalking | -96.4 | 22.9 | -54.6 | 1.18 | 43.36% | -94.85% |
| Reacher | -18.15 | 22.06 | -9.66 | 1.33 | 46.78% | -93.97% |
| Walker | 126.75 | 136.44 | 205.47 | 91.6 | 62.11% | -32.86% |
In addition, we conducted new ablation studies based on the reviewer’s feedback. We selectively removed informative components from the hints and observed performance drops:
- In Cliff Walking, omitting directional guidance reduced the success rate to 50% and decreased rewards to –114.60.
- In Reacher, removing behavioral hints lowered average max reward to –12.37 and success rate to 40%.
- In Walker, removing stability-related phrases dropped rewards to 144.06, with only 20% of runs exceeding 120 points.
These results confirm that both the presence and quality of hints significantly influence performance. We will include these expanded findings in Appendix F of the revised manuscript.
I thank the authors for their clear response and for conducting new experiments.
Q1: Scaling effects
I have reviewed the new data for the 3B, 7B, and 14B models. The results indicate that a certain capability threshold is necessary, as the authors mention that models smaller than 7B struggle to optimize RL policies consistently. However, the data does not show a clear pattern where larger models consistently outperform on every task.
This suggests that performance is not solely a function of model size. It also highlights a practical limitation of the method: it appears to require either very large open-source models, which demand significant computational resources, or closed-source models, which may introduce concerns about data privacy and API latency. This could limit the accessibility of this approach for researchers with fewer resources. I would be interested to see the authors' perspective on these practical trade-offs.
Q2/W1: Pre-training Priors:
To clarify, my initial question about "priors" referred to the knowledge and biases learned from the model's original training data and fine-tuning strategy, not merely its abstract reasoning ability. The authors suggest that good performance stems from "general reasoning abilities" and provided a new experiment with GPT-o3-mini as evidence.
In my view, a single experiment with a different model is insufficient to generalize this conclusion. My hypothesis remains that the best-performing models are not just more capable in general, but were specifically trained or fine-tuned in a way that makes them adept at integrating new textual instructions. The authors' own results on hint-based learning, where adding hints "significantly improves the performance," underscore the importance of this skill and lead me to question the role of the training history in the top-performing models.
Q3: Stability Concerns:
I accept the authors' point regarding the method's stability. They demonstrated that the method's variance is "not consistently higher than those of the baselines." After reviewing their analysis, I agree with their assessment, and my initial concern on this point is resolved.
Q4: Ablation study:
The new experiments on hints for Reacher, Walker, and Cliff Walking were very insightful. They clearly demonstrate that effective hints can dramatically improve performance and accelerate learning, as seen with the Cliff Walking task.
The logical next step for this line of research would be to investigate what constitutes an effective hint. It would be valuable to see studies that compare the impact of general versus specific hints, or analyze how the model responds to slightly incorrect or ambiguous instructions. Such work would provide deeper insights into the model's true semantic understanding.
This paper establishes a solid foundation for future research and represents a promising step in leveraging LLMs for complex optimization tasks. However, I do encourage the authors to explore the issues discussed here.
Q1: Scaling effects
We thank the reviewer for this important observation. We agree that current reliance on large LLMs (whether open- or closed-source) poses practical limitations related to compute, accessibility, and latency. However, we see these as transitional rather than fundamental barriers.
The open-source ecosystem is evolving rapidly. Recent models like DeepSeek-VL, Phi-3/4, and Qwen demonstrate that smaller, more efficient models can match or exceed the performance of older, larger LLMs. As the field progresses, we anticipate that models suitable for RL applications will become more widely available and resource-efficient.
In the current work, we primarily evaluate existing models in a post-hoc manner (i.e., without training from scratch). However, our results suggest new research opportunities:
- (a) RL benchmarks should be included during LLM development (as done with GPQA or PlanBench for reasoning and planning).
- (b) LLMs could be explicitly trained or fine-tuned for RL and optimization tasks.
- (c) The community could release RL-specialized checkpoints, making LLM-based RL more accessible.
To support this, we will release our learning traces (ProPS trajectories) to facilitate fine-tuning and benchmarking of smaller models. In parallel, we are actively exploring distillation, pruning, and model compression to bring our approach to lower-resource settings. We will revise the manuscript to reflect these practical trade-offs and clearly outline our roadmap for improving accessibility.
Q2: Pre-training priors:
Thank you for this important clarification! We now better understand that your question concerned the learned priors from pretraining and fine-tuning, rather than general reasoning ability alone. We absolutely agree that data and fine-tuning have substantial impact and that this needs a much deeper exploration. The question of such priors is quite complex and needs a more systematic investigation in future work!
We agree that a single experiment is insufficient to fully disentangle the influence of model architecture vs. training history. In fact, your hypothesis that top-performing models may owe their success to instruction tuning or exposure to optimization-relevant data is well-taken and consistent with our own results. As you suggested, our hint-based learning experiments indicate that models with stronger instruction-following abilities benefit significantly from textual guidance.
However, interestingly, even without any task-specific semantic input (i.e., purely numerical interaction), ProPS performs surprisingly well. This suggests that some implicit optimization prior has emerged during pretraining. Yet, pinpointing which part of the training data or fine-tuning process enables this behavior remains an open question.
We agree this is a rich direction for future research, requiring controlled experiments on (a) different pretraining corpora and (b) varied instruction-tuning strategies. In this work, we focus on establishing the existence and utility of LLM-based optimization capabilities, but we see your point as critical for advancing the field and plan to explore it further!
Q4: Ablation Study:
We thank the reviewer for their suggestions for the future research directions. We truly appreciate the fairly specific advice/suggestion by the reviewer:
"It would be valuable to see studies that compare the impact of general versus specific hints, or analyze how the model responds to slightly incorrect or ambiguous instructions."
These are specific experiments that we will conduct going forward to better understand the aspect of hint-quality. We aim to further study these points in our follow-up research. For now, we will add the details with the effective hint generation discussion in the paper, and open the problem to the broader community for further evaluations. Again, thank you very much for the constructive feedback!
The authors' responses have satisfactorily addressed my concerns regarding the method's stability and the practical trade-offs of model scaling.
My final recommendation is for the authors to integrate the insights from our discussion on pre-training priors (Q2) and hint effectiveness (Q4) into the final manuscript. While the quantitative results are compelling, the paper's long-term impact will be defined by this qualitative analysis. Understanding what makes a hint effective is essential for the method's reproducibility and provides the necessary framework for other researchers to generalize this approach beyond ad-hoc prompt engineering.
I stand by my 'Accept' recommendation. This is a strong contribution, and I am confident that emphasizing this qualitative dimension will elevate the work from an interesting finding to a foundational study in this area.
We sincerely thank the reviewer for their thoughtful and constructive feedback throughout the review process. We fully agree with the importance of integrating the discussed insights. Particularly, those regarding scaling and fine-tuning smaller LLM models, pre-training priors, and the characteristics of effective hints, and we commit to incorporating them into the final manuscript as suggested. We deeply appreciate the opportunity to engage in this meaningful exchange, which has undoubtedly enhanced the clarity, depth, and long-term impact of our work.
This work proposes an algorithm, Prompted Policy Search (ProPS), that uses LLMs as optimizers to improve RL models. The authors show that LLMs can perform numerical optimization in context, proposing policy updates based on both reward feedback and natural language input. Compared to previous methods combining LLMs and RL, this approach shows potential innovation.
Strengths and Weaknesses
Strengths:
- This is the first work to use LLMs as optimizers for RL policy optimization.
- Although LLM calls incur some cost, faster iteration efficiency leads to improved learning efficiency, and it can run in CPU environments, supporting small-scale offline or near-real-time offline scenarios.
- Through LLM-driven text interactions, it enhances interpretability during RL policy training.
Weaknesses:
- Although this is the first work that applies LLMs as optimizers in RL, similar algorithms like OPRO already exist. Considering that such methods rely on relatively simple prompt engineering, it is hard to claim substantial contribution or novelty for this kind of "A+B" work.
- The algorithm is severely limited to simple tasks and low-dimensional decision spaces. Although the authors provide an example using low-dimensional projections to train neural networks, it still cannot demonstrate that the algorithm can handle more complex tasks.
- While the appendix shows LLM performance on pure numerical optimization benchmarks, the mechanism explaining why pretrained models have such optimization capabilities remains at the hypothesis stage. There is still a lack of rigorous mechanism analysis or theoretical framework to support it, which may affect interpretability and generalizability.
- Semantic-enhanced prompts are beneficial in most cases, but if prompts ignore environment stochasticity or contain misleading information (e.g., misunderstanding determinism in FrozenLake), they can degrade performance. Prompt design must be careful and requires ongoing domain-expert maintenance; otherwise robustness is insufficient.
Questions
- How can LLMs, without relying on overly simplified random projections, directly or indirectly search effectively over thousands of deep network parameters? How to ensure search efficiency and performance?
- Is this algorithm, compared to traditional optimizers, more akin to a heuristic grid search?
- Traditional optimizers often come with mathematical proofs clarifying their optimization capabilities and boundaries. Can the authors provide similar proofs for LLM-based optimization, or at least a more rigorous mechanism analysis?
Limitations
- The paper clearly conducts main experiments only on linear/tabular policies with parameter dimensions ≤100; the deep network scenario is only preliminarily explored via random projections, and it cannot be guaranteed to search effectively in high-dimensional parameter spaces.
- It requires writing detailed environment descriptions and expert hints for each environment, which is labor-intensive; if descriptions are inaccurate or omit environment characteristics (e.g., stochasticity), performance may degrade.
- The algorithm currently lacks further theoretical validation. While theoretical proofs may not be mandatory for algorithms tackling complex tasks or yielding excellent performance, they are crucial for what is, at present, a more toy-level approach.
Justification for Final Rating
Thanks for the further clarification. If all the revisions and discussions can be incorporated, then it seems safe to accept this paper. Since I did not learn much insight from the authors, I would like to maintain a borderline rating.
Formatting Issues
No
We thank the reviewer for their insightful feedback and for highlighting the core contributions of our work, including its novelty in using LLMs as RL optimizers, its practical efficiency on CPU environments, and its enhanced interpretability. We are happy to clarify the questions brought up by the reviewer!
W1: Similar algorithms like OPRO already exist
We disagree. OPRO is certainly an important paper and we draw inspiration from and cite them. But, there are many critical differences between OPRO and ProPS at both the high-level goal and implementational details.
At a high level: OPRO focused mainly on prompt optimization and did not focus on numerical optimization. A 2-dimensional linear regression example was used only as a motivating example. This is acknowledged by both the reviewers and the authors of OPRO, as can be seen in the public reviews.
Review: (1) One can plausibly consider the process of prompt selection as "optimization", but in order to make a claim on the general area of optimization [...] The claim on linear regression as an important result is a relevant but very limited result” […] (3) “the authors may want to emphasize that the paper is mainly about prompt optimization”
Author Response: "First, we would like to clarify that the focus of our work is not to solve all kinds of classic optimization problems […] Instead, our main task is prompt optimization which is a known challenge for classic optimization methods, where the optimization space is in natural language."
In addition, the OPRO paper does not include the following:
- A reinforcement learning setup with policies
- > 2 parameters for numerical optimization
- The addition of semantic domain information improve the search process
- The inclusion of human hints/guidance to improve search process
- Application of optimization to a wide variety of tasks in RL (15) and numerical optimization (5)
- Comparisons with SotA baselines in numerical optimization and RL
- Empirical evidence that LLMs can surprisingly compete with SotA algorithms for up to 100 dimensions
At a lower level: OPRO is very different from our method where "In each step, we prompt an instruction-tuned LLM with a meta-prompt that includes the best 20 (w, b) pairs in history and their sorted objective values […] We prompt the meta-prompt 8 times to generate at most 8 new (w, b) pairs in each step." Hence, the LLM generates 8 candidates in every update, which are evaluated outside of the LLM, then ranked, and the N best are selected and fed back to the LLM. This is akin to hill-climbing with a population of 8. However, in ProPS there is (a) no outside ranking, (b) no outside sub-selection of 20 candidates, and (c) we only generate a single candidate per update. As highlighted in the paper, there is no external component in ProPS: all steps needed for optimization (other than function evaluation) are performed inside the LLM. To summarize: the OPRO paper introduces several great insights that address prompt optimization by using an LLM to generate the next candidates (exploration), whereas the exploitation part was driven by external ranking and selection. Our work addresses a related but very different challenge, i.e. numerical + linguistic optimization for RL, and provides critical new insights: (a) no external component for optimization is needed, (b) LLM-based RL already solves famous benchmarks, (c) LLMs show competitive performance in non-deep RL, (d) semantic domain information nearly always improves performance, (e) human language hints can be incorporated, and (f) substantial empirical results are provided. All of these are non-trivial new insights and contributions.
W2/Q1: Simple tasks and low-dimensional spaces
We respectfully disagree with the assertion that our algorithm is limited to only “simple tasks” or “low-dimensional” spaces. Our method demonstrates strong performance in optimization problems with up to ~100 dimensions, which already covers a wide range of meaningful and practical applications in modern reinforcement learning, including Dynamic Motor Primitives (DMPs), a class of linear policies commonly used in robot learning. DMP-based RL continues to be a major research direction, with over 3600 papers published in 2025 alone (per Google Scholar).
Regarding training deep networks: the main bottleneck is that training DNNs typically requires tens or even hundreds of thousands of iterations, which clashes with the limited context length. In addition, LLMs have not been trained on DNN weights. However, both of these are resolvable: in the future we envision (a) training/fine-tuning with DNN weights, as done in hyper-network architectures, and (b) that the context length will keep growing beyond the current limitations.
Based on the reviewer’s suggestion to investigate ProPS's effective dimensionality, we optimized neural network policies with parameter counts from 80 to 435, without using random projection. Despite current LLM context and capacity bottlenecks, ProPS demonstrated significant improvement in all tested cases. Due to space considerations we ask the reviewer to please check our response to reviewer JLQP (W2/Q1).
W3/Q3/L3: Lacking Theoretical Basis for LLM Optimization
Our work provides the empirical evidence for an emergent capability, similar to early LLM research, e.g., Chain-of-Thought, provided empirical justification first, opening up the later research for its theoretical validation. We note that recent theory has already begun to provide plausible mechanisms for our findings, and we offer a more comprehensive discussion on this in our response to Reviewer JLQP (Q2/W3).
W4/L2: Semantic-enhanced environment prompt might be misleading or labor-intensive
Our approach is robust to prompt variations, as shown by a new empirical study on Mountain Car Continuous. We tested ProPS+ using the original prompt and using three variations of the original prompt generated by Gemini 2.5 Pro, demonstrating consistent performance across different task descriptions.
The results below show that ProPS+ is very robust to these variations:
| Prompt Source | Performance |
|---|---|
| Original | 89.16 ± 29.72 |
| Var 1 | 98.96 ± 0.13 |
| Var 2 | 77.15 ± 27.90 |
| Var 3 | 98.92 ± 0.16 |
These findings lead to two important conclusions.
- ProPS+ is not brittle; it successfully finds effective policies across all prompt variations.
- Success of the LLM-generated prompts confirms our claim that the prompt design need not be a labor-intensive manual process. In fact, two of them achieved better performance than the original version.
Regarding the exception of FrozenLake, we hypothesize that the underperformance of ProPS+ is not due to misleading prompting, but to the limited reasoning capabilities of GPT-4o. Thus, we performed additional experiments using GPT-o3-mini, which achieved a mean maximum reward of 0.82 ± 0.05, outperforming all baselines. This suggests that differences in performance across LLMs are more attributable to their reasoning capabilities, rather than misleading prompts.
Q2: Is this algorithm more akin to a heuristic grid search?
No, ProPS is not akin to heuristic grid search. While heuristic grid search estimates the cost to the goal from a given state [1], the optimization trace of ProPS (attached) reveals a fundamentally different pattern: an initial exploratory phase with high variance followed by an exploitation phase, where the search becomes more directional and slowly inches toward the optimum.
Attached: Optimization trace of Mount. Car (C). Note that it is downsampled and the trailing 0s are removed. We encourage plotting these for better visualization.
params: [1, 1, 1], reward: -38.543
params: [3, -2, 1], reward: -163.211
params: [0, 6, 3], reward: -900.406
params: [0, 6, 0.5], reward: 86.19
params: [0, -5, -0.5], reward: -24.933
params: [0, 6, -0.002], reward: 84.22
params: [0, 5, 1], reward: 48.218
params: [0, 6, 0.011], reward: 84.208
params: [0, 5, -0.015], reward: -1.499
params: [0, 6, 0.026], reward: 99.022
params: [-1, 6, -0.5], reward: -18.696
params: [-3, 6, 0], reward: -29.069
params: [1, 6, 0.047], reward: -55.013
params: [1, 6, 0.054], reward: -55.418
params: [0, 6, 0.063], reward: 98.785
params: [0, 6, 0.072], reward: 88.797
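To visualize this trace, a small sketch such as the following could be used; the file name is a placeholder and the line format is assumed to match the listing above.

```python
# Small sketch for plotting the reward trace above; the file name is a
# placeholder and the line format is assumed to match the listing.
import re
import matplotlib.pyplot as plt

rewards = []
with open("mountain_car_trace.txt") as f:   # hypothetical file holding the trace
    for line in f:
        m = re.search(r"reward:\s*(-?\d+(?:\.\d+)?)", line)
        if m:
            rewards.append(float(m.group(1)))

plt.plot(rewards, marker="o")
plt.xlabel("LLM update step")
plt.ylabel("Episode return")
plt.title("ProPS optimization trace: Mountain Car (Continuous)")
plt.show()
```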
L1/L3: Experiments only on linear/tabular policies … what is, at present, a more toy-level approach.
Our choice of linear and tabular policies with ≤100 parameters is deliberate: these settings are widely used in the RL literature for carefully controlled evaluation and allow us to isolate the optimization behavior of LLMs. As mentioned above there are various vibrant RL fields that use such representations, e.g. RL for Dynamic Motor Primitives, RL for tuning controllers (MPC, LQR).
Importantly, the diverse set of 15 problems are not toy problems: they are widely-used standard RL benchmarks that are still the main points for comparison in the literature. For example see NeurIPS 2024 paper [2] which uses nearly the same set of domains e.g. CliffWalking, Hopper, Swimmer, Reacher, InvertedDoublePendulum etc.
While DNN policies are not our main focus, we provide early evidence that our method generalizes to larger models via random projections. This is not an arbitrary approximation: there is a rich body of theoretical work (e.g., Johnson-Lindenstrauss lemma, intrinsic dimensionality in neural networks) supporting the use of random projections for high-dimensional optimization.
References:
[1] Kirilenko, D. et al. Transpath: Learning heuristics for grid-based pathfinding via transformers. AAAI. Vol. 37. No. 10. 2023.
[2] Zhang, H., et al. Exploiting the replay memory before exploring the environment: enhancing RL through empirical MDP iteration. NeurIPS 37 (2024): 85658-85692.
Thank you again for the thoughtful and constructive review. We hope that our response addresses your concerns. If you have a chance to provide any follow-up feedback on our above rebuttal, we would greatly appreciate it; especially with the rebuttal deadline approaching. We're happy to clarify any remaining inquiries or questions. Thank you again for your time and consideration!
We would like to provide the reviewer with a quick update on results from experiments we have run since our rebuttal, based on a request by reviewer Kxcv.
To address the concern regarding task complexity, we conducted new experiments evaluating ProPS on a more challenging, real-world inspired domain: robotic table tennis. This is a well-established benchmark used to evaluate the learning of Dynamic Motor Primitives (DMPs) in fast, realistic environments [1-3].
Using the same ProPS configuration as in our paper (400 iterations), we trained a policy to control a 7-DoF robotic arm using a DMP with 70 basis functions. The goal was to return an incoming ball to a target location on the opponent’s side [x: -0.6, y: -0.4], using the simulated setup from [2].
Performance was measured as the average distance to the target across 20 rollouts, with lower values indicating better accuracy:
| | ProPS | OpenAI-ES | PPO |
|---|---|---|---|
| Distance from the goal (m) | 0.07 ± 0.02 | 0.40 ± 0.34 | 0.53 ± 0.36 |
These results highlight three key points:
- ProPS learned successful return policies in this complex, dynamic task.
- ProPS outperformed PPO and OpenAI-ES, both in accuracy and success rate.
- The resulting behavior included naturalistic motion patterns, such as wind-up phases before striking the ball.
This experiment reinforces that ProPS can solve sophisticated RL problems (including motor control) with minimal effort, supporting both its generality and real-world applicability.
We plan to include full details and trajectories in the final version of the paper, along with further evaluations in diverse robotic domains. We hope the reviewer finds this update useful, and we would be very grateful for any additional thoughts or feedback.
Thank you again for your consideration.
References
[1] Mülling, Katharina, et al. "Learning to select and generalize striking movements in robot table tennis." The IJRR 32.3 (2013): 263-279.
[2] Celik, Onur, Aleksandar Taranovic, and Gerhard Neumann. "Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts." ICML PMLR, 2024.
[3] D'Ambrosio, et al. “Robotic Table Tennis: A Case Study into a High Speed Learning System.” RSS. (2023)
Thank the authors for the detailed rebuttal. While the rebuttal successfully differentiates the work from OPRO and demonstrates prompt robustness, it fails to address the fundamental limitations: restricted to simple/low-dimensional problems with no clear path to practical deep learning applications, and lacking any theoretical foundation. The work appears to be an interesting empirical observation about LLM capabilities rather than a practical RL optimization method. The rebuttal does not fully resolve the core concerns that prevent acceptance. The limited scope and lack of theoretical grounding still remain as barriers.
We thank the reviewer for acknowledging our responses on prior work and robustness. However, we strongly disagree with the two remaining claims:
“restricted to (1) simple/low-dimensional problems with no clear path to practical deep learning applications, and (2) lacking any theoretical foundation”.
We would like to address these points below:
(1a) On “simple” domains:
This claim is factually incorrect and ignores key evidence we provided. Policies with ~100 parameters (e.g., DMP-based policies) are widely used in robotics and other fields, including manipulation, locomotion, and HRI tasks. Dismissing them as “toy-level” ignores a substantial portion of active RL and robotics research.
Moreover, we applied ProPS to robot table tennis, a highly challenging and competitive benchmark. The task involves real-time 7-DoF control to return a fast-moving ball to a precise location. This problem has been the focus of top labs and venues [1-5] (RSS 23, ICML 24, CoRL 23, NeurIPS 23, ICRA 25), yet remains unsolved at expert level. ProPS achieved state-of-the-art results within one day, demonstrating targeted hitting with ~7 cm accuracy, a recognized challenge in the field (see GoalsEye [1]).
(1b) On “no path to deep RL”:
We explicitly train MLPs with 2×15 neurons without random projection, as shown in our response to Reviewer JLQP. While we acknowledge scaling limitations due to LLM context windows, this is a technical bottleneck, not a fundamental one. We outlined a clear path forward:
- Fine-tuning LLMs with RL data and network weights
- Integrating HyperNetwork-style learning
- Extending LLM context windows to allow longer training iterations
These are actionable and concrete steps toward deep RL with ProPS.
(2) On theoretical foundations:
We would like to respectfully note that much of the current literature on LLMs, including many high-impact papers, does not yet provide formal theoretical guarantees. This reflects the state of the field, where empirical exploration often precedes theoretical understanding, especially given the complexity and emergent nature of LLM behavior. For example, the Chain-of-Thought (CoT) paper [6] introduced an LLM phenomenon empirically, and our methodology is very similar: CoT was introduced without a comprehensive theoretical explanation, yet it has since become a cornerstone of prompting techniques and has inspired significant follow-up work, both theoretical and practical. Since 2022, the CoT paper has amassed more than 17,900 citations. There is now a concerted community effort to develop a strong theoretical basis, but that effort is still ongoing. In our case, we provide the empirical grounding and testable hypotheses that are necessary prerequisites for future theoretical work. Additionally, in our rebuttal we pointed to recent theoretical work [7] that shows promise as a possible methodology for establishing a theoretical foundation.
In sum, our work (i) tackles hard domains, (ii) shows promise for deep RL with a clear roadmap, and (iii) contributes novel empirical insights.
References
[1] Ding, T., et al. (2022). Goalseye: Learning high speed precision table tennis on a physical robot. arXiv preprint arXiv:2210.03662. RSS (2023)
[2] Celik, Onur, et al. "Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts." ICML PMLR, 2024.
[3] Abeyruwan, S. W., et al. (2023, March). i-sim2real: Reinforcement learning of robotic policies in tight human-robot interaction loops. In Conference on Robot Learning (pp. 212-224). PMLR.
[4] Blessing, D., et al. (2023). Information maximizing curriculum: A curriculum-based approach for learning versatile skills. NeurIPS, 36, 51536-51561.
[5] D'Ambrosio, D. B., et al. (2024). Achieving human level competitive robot table tennis. In 2025 IEEE ICRA
[6] Wei, J., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS (2022).
[7] Dherin, B., et al. Learning without training: The implicit dynamics of in-context learning. arXiv:2507.16003 (2025).
We thank the reviewer for their feedback and for acknowledging our responses regarding prior work and prompt robustness. However, we must respectfully disagree with the remaining assessment that our work is fundamentally limited to "simple/low-dimensional problems" and lacks a "theoretical foundation." We believe our new results demonstrate that ProPS is a practical and impactful method that opens a valid and promising new research direction, and we hope the reviewer will reconsider the significance of our contribution.
This paper introduces a method called ProPS that integrates numerical and linguistic reasoning from LLMs into the core of the RL policy optimization loop. ProPS has LLMs propose policy updates by leveraging reward feedback and natural language input, integrating numerical feedback with prior semantic knowledge. ProPS outperforms standard RL algorithms on various RL benchmarks.
优缺点分析
Strengths
- The paper is well-written and the motivation of using “human-centric” knowledge from linguistic reasoning capabilities of LLMs seems reasonable
- The idea of LLMs reasoning about numerical optimization in RL seems genuinely novel
- The approach seems to be generalizable to general numerical optimization rather than just RL
Weakness
- The baselines seem to be underperforming unexpectedly – E.g. Fig 4 shows no improvement whatsoever for strong algorithms like PPO and TRPO on the swimmer task
- The tasks in the experiments (Walker, Cliff walking) are quite basic, and may not necessarily motivate the usage of LLMs for RL optimization
- There is little to no theoretical guarantees associated with the proposed approach in using LLMs for RL numerical optimization
问题
- How might this algorithm scale with the complexity of the environment/task?
- It is not clear to me what contribution the LLM provides in reasoning about optimization parameters for reinforcement learning – can this be quantified in any manner, or is it purely an empirical contribution?
- Relatedly, it seems that a significant amount of prompt engineering would be required – are there experiments demonstrating how sensitive ProPS's performance is to modifying the prompt?
- Have the authors considered using this approach on any other numerical optimization problem than just RL?
局限性
yes
最终评判理由
My concerns were about the baselines' performance and task complexity and the potential noise due to prompt engineering. During the rebuttal, the authors provided justifications through experiments, and after reading through the other reviews I am increasing my score.
格式问题
n/a
We thank the reviewer for their insightful feedback and positive assessment of our work. We are happy that the reviewer recognized the core strengths of our paper, including its clear motivation, the genuine novelty of using LLMs for numerical optimization in RL, and the promising generalizability of the approach. We appreciate these valuable comments and will address the reviewer's questions in detail below.
W1: Underperforming Baselines
We used the gold-standard library of RL baseline implementations, stable_baselines3, with its best reported hyperparameter settings per domain. Traditional RL algorithms such as PPO and TRPO typically require a large number of environment interactions and updates to converge to an optimal policy. For instance, publicly available resources such as OpenAI's Spinning Up report that PPO on the Swimmer task typically needs on the order of a few million (10^6) timesteps to achieve meaningful performance improvements.
In our experiments, however, all methods were evaluated using only 400 iterations, corresponding to approximately 4×10^5 timesteps, significantly fewer than these algorithms generally require to show substantial improvements. Under these constraints, PPO with a neural network achieves only about 50-75 reward points, according to OpenAI Spinning Up.
W2/Q1: Extension to Complex Environment/Tasks
We respectfully disagree with the assertion that our experiments are limited to “basic tasks”. We evaluate on a diverse set of the most common RL benchmarks, widely used in SotA papers for comparing approaches. Our method demonstrates strong performance in optimization problems with up to ~100 dimensions, which already covers a wide range of meaningful, complex, and practical applications in modern reinforcement learning:
As an example, our method directly applies to high-impact real-world use cases such as Dynamic Motor Primitives (DMPs), a class of linear policies commonly used in robot learning. Despite the complexity of tasks like manipulation [1], prosthetics [2], and robot table tennis [3], DMPs typically operate in parameter spaces under 100 dimensions, which our method handles robustly. For example, a 7-DoF robot arm with 10 basis functions yields 70 parameters, well within our demonstrated capabilities.
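To make this parameter layout concrete, the following is a simplified sketch of such a DMP-style linear policy (the full DMP transformation system is omitted; the basis centers and widths are illustrative only):

```python
# Simplified sketch of a DMP-style linear policy: each joint's forcing term is
# a weighted sum of radial basis functions of the movement phase. This only
# illustrates the 7 x 10 = 70 parameter layout that the optimizer would search
# over; it is not the exact DMP formulation used in the paper.
import numpy as np

n_dof, n_basis = 7, 10
centers = np.linspace(0.0, 1.0, n_basis)   # basis centers over the phase
width = 0.5 * n_basis ** 2                 # illustrative bandwidth

def basis(phase: float) -> np.ndarray:
    psi = np.exp(-width * (phase - centers) ** 2)
    return psi / psi.sum()

weights = np.zeros((n_dof, n_basis))       # the 70 parameters to optimize

def forcing_term(phase: float) -> np.ndarray:
    """Per-joint forcing term at a given phase in [0, 1]."""
    return weights @ basis(phase)          # shape: (7,)
```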
DMP-based RL continues to be a major research direction, with over 3600 papers published in 2025 alone (per Google Scholar). Our method can be immediately applied to this domain, which would allow users to incorporate semantic context, such as robot structure (e.g., URDFs) or human language hints; this is something traditional optimizers cannot do.
Nevertheless, following the reviewer's suggestion to investigate ProPS's effective dimensionality, we optimized neural network policies with parameter counts from 80 to 435, without using random projection. Despite current LLM context and capacity bottlenecks, ProPS demonstrated significant improvement over random initialization in all tested cases. Specifically, an 80-parameter network's score improved from an initial reward of ~20 to 2722.81 ± 409.43. Networks with 102 and 180 parameters also achieved substantial gains, reaching average scores of 661.31 ± 588.83 and 763.84 ± 957.55, with peak performances of 1489.575 and 2105.975, respectively. These results demonstrate that ProPS is capable of effective optimization in these higher-dimensional spaces.
| Hidden Layers, [Neurons] | Navigation Params | Navigation Best / Mean ± Std | Swimmer Params | Swimmer Best / Mean ± Std | Hopper Params | Hopper Best / Mean ± Std |
|---|---|---|---|---|---|---|
| 1, [10] | 80 | 3183.9 / 2722.81 ± 409.43 | 100 | 163.83 / 85.12 ± 55.78 | 140 | 237.28 / 178.69 ± 41.79 |
| 2, [10, 4] | 102 | 1489.575 / 661.31 ± 588.83 | 128 | 37.257 / 35.05 ± 2.28 | 162 | 250.88 / 223.62 ± 23.31 |
| 2, [10, 10] | 180 | 2105.975 / 763.84 ± 957.55 | 200 | 80.25 / 42.82 ± 26.52 | 240 | 185.44 / 163.27 ± 19.57 |
| 2, [15, 15] | 345 | 3563.675 / 2325.09 ± 925.27 | 375 | 55.26 / 38.41 ± 12.09 | 435 | 303.52 / 217.06 ± 62.84 |
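For reference, the parameter counts in the table are consistent with bias-free fully connected layers; the short check below reproduces them under that assumption (the Navigation observation/action dimensions of 5/3 are inferred, while Swimmer 8/2 and Hopper 11/3 match the standard Gym environments):

```python
# Sanity check of the parameter counts in the table above, assuming bias-free
# fully connected layers. Observation/action dimensions are inferred from the
# counts; Navigation 5/3 is an assumption on our part.
def mlp_params(obs_dim: int, act_dim: int, hidden: list[int]) -> int:
    dims = [obs_dim, *hidden, act_dim]
    return sum(a * b for a, b in zip(dims[:-1], dims[1:]))

envs = {"Navigation": (5, 3), "Swimmer": (8, 2), "Hopper": (11, 3)}
for hidden in ([10], [10, 4], [10, 10], [15, 15]):
    counts = [mlp_params(o, a, hidden) for o, a in envs.values()]
    print(hidden, counts)   # e.g. [10] -> [80, 100, 140], matching the table
```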
Q2/W3: Theoretical Guarantees and Analysis on LLM behavior
We agree that rigorous analysis of LLM-based RL is a crucial long-term goal. However, given the nascent state of theory surrounding LLMs, such foundations are currently limited and will require a broader community effort over time. Nevertheless, we believe that empirical characterization of emergent capabilities is a vital first step, especially when a phenomenon is not yet well understood. For instance, the seminal Chain-of-Thought (CoT) paper [4] demonstrated step-by-step reasoning improvements purely through empirical experiments, prior to any theoretical justification, yet it had a transformative impact on the field. We extend this tradition by systematically studying the LLM's capacity to perform numerical optimization in policy search, through extensive experimentation across diverse RL tasks.
Moreover, recent theoretical progress already offers early support. For example, [5] provides a formal interpretation of in-context learning as an implicit low-rank gradient descent step, demonstrating that Transformers inherently apply weight updates via example-conditioned computation. These findings suggest that LLMs may be inductively biased toward optimization-like behaviors, aligning with our observed results. While this does not constitute a complete theory, it offers a promising conceptual bridge between LLM architecture and optimization. We hope the reviewer recognizes that our contribution, like other foundational empirical studies, lays essential groundwork for the theoretical advances yet to come.
Q3: Prompt Engineering
The robustness of our system to prompt quality and the effort required for prompt design are indeed crucial for practical application. To address this directly, we conducted two new empirical studies on the Mountain Car Continuous task to evaluate the sensitivity of ProPS and ProPS+ to variations in prompt design.
First, for ProPS, we evaluated sensitivity to the core numerical optimization prompt. Crucially, to challenge the notion that prompts require intensive human expertise, we generated three variations automatically using another large language model (Gemini-2.5-pro) with a simple directive to rephrase the optimization objective. The results show that ProPS is remarkably robust:
| Prompt Source | Performance |
|---|---|
| Original | 87.21 ± 29.28 |
| Var 1 | 94.99 ± 5.77 |
| Var 2 | 92.98 ± 7.61 |
| Var 3 | 98.32 ± 0.48 |
Second, for ProPS⁺, we tested the sensitivity of the domain description prompt, again using Gemini to automatically generate three rephrased task descriptions. The performance remained consistently high:
| Prompt Source | Performance |
|---|---|
| Original | 89.16 ± 29.72 |
| Var 1 | 98.96 ± 0.13 |
| Var 2 | 98.92 ± 0.16 |
| Var 3 | 77.15 ± 27.90 |
These findings lead to two important conclusions.
- ProPS is not brittle; it successfully finds effective policies across all prompt variations. Even the lowest-performing variant (77.15 ± 27.90) represents a competent policy, directly mitigating the concern that an imperfect prompt leads to performance collapse.
- The success of the LLM-generated prompts challenges the claim that prompt design must be a labor-intensive, expert-driven process. The automatically generated prompts even led to higher average scores and more stable performance than our original versions, suggesting the process can be partially automated, reducing the need for meticulous human engineering (a minimal sketch of such automation is given below). We will incorporate these new findings into our final manuscript and thank the reviewer for prompting this valuable investigation.
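A minimal sketch of how this automation could look is shown below; `call_llm` is a hypothetical placeholder for whatever chat-completion client is used (e.g., a Gemini wrapper) and is not part of our released code:

```python
# Hypothetical sketch of automated prompt-variant generation. `call_llm` is a
# placeholder for any chat-completion client; it is not a real API from this
# paper or from a specific library.
from typing import Callable

REPHRASE_DIRECTIVE = (
    "Rephrase the following optimization prompt so that its meaning is "
    "preserved but the wording differs:\n\n{prompt}"
)

def generate_prompt_variants(
    base_prompt: str,
    call_llm: Callable[[str], str],
    n_variants: int = 3,
) -> list[str]:
    """Ask a second LLM to produce n_variants rephrasings of the base prompt."""
    return [
        call_llm(REPHRASE_DIRECTIVE.format(prompt=base_prompt))
        for _ in range(n_variants)
    ]
```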
Q4: Other applications of Numerical Optimization
While the paper focuses on RL (and several traditional numerical optimization tasks in the appendix), the core mechanism of ProPS can be applied to a variety of tasks. We are excited about the broader applications and see this work as a foundational step toward a new class of intelligent optimizers. One possible extension based on DMPs has been discussed in the answer to Q1. We also envision ProPS being highly effective in domains such as Code Optimization and Code-as-Policy: ProPS can be extended to directly optimize policy code, aligning with recent advances in LLM-based code generation [6,7]. This would open a new line of work on discovering interpretable and potentially novel algorithms, enabling more flexible and expressive policies than fixed-parameter networks.
Ultimately, we believe this work pioneers a new paradigm where optimization is not just about crunching numbers but about understanding the problem's context. Information from textbooks, tutorials, library documentations, human users etc. may be exploited by the RL algorithm to improve efficiency and responsiveness to the user’s goals. Our focus on RL was to demonstrate this capability in a notoriously difficult domain, but the potential is far broader, opening up exciting future possibilities in numerous scientific and engineering fields. We thank the reviewer for prompting this important discussion.
Reference:
[1] Carvalho, J., et al. Adapting object-centric probabilistic movement primitives with residual RL, IEEE-RAS, 2022.
[2] Huang, L. et al. A Lower Limb Exoskeleton Adaptive Control Method Based on Model-free RL and Improved Dynamic Movement Primitives. JINT 111.1 (2025): 24.
[3] Muelling, K., et al. Learning table tennis with a mixture of motor primitives. 2010 10th IEEE-RAS.
[4] Wei, J., et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS (2022).
[5] Dherin, B., et al. Learning without training: The implicit dynamics of in-context learning. arXiv:2507.16003 (2025).
[6] Liang, J., et al. Code as Policies: Language Model Programs for Embodied Control. 2023 ICRA.
[7] Bosio, C., et al. Synthesizing Interpretable Control Policies through LLM Guided Search. arXiv:2410.05406 (2024).
Thank you again for the thoughtful and constructive review. We hope that our response addresses your concerns. If you have a chance to provide any follow-up feedback on our above rebuttal, we would greatly appreciate it; especially with the rebuttal deadline approaching. We're happy to clarify any remaining inquiries or questions. Thank you again for your time and consideration!
We are excited to share new results from an experiment we ran on a more complex, high-dimensional robotics task. We hope this provides further evidence of our method's capabilities.
W2/Q1: Task Complexity
We tested ProPS in a challenging robotic table tennis environment, a well-established benchmark (see [1-3]) for learning high-speed motor skills. Specifically, we used the public simulation from [2] (ICML 2024), which features a 7-DoF robot arm tasked with returning a ball to a specific point [x: -0.6, y: -0.4] on the table. The policy is a Dynamic Motor Primitive (DMP) with 70 basis functions, and the simulation includes realistic robot and ball physics. We used the same configuration as in our previous experiments.
Our experimental results are positive: ProPS successfully learned a table tennis stroke that returns the ball to the target area. Notably, the learned policy demonstrates a natural, coordinated "wind-up" motion before striking, highlighting the ability of ProPS to generate complex, realistic behaviors.
To quantify this, we compared the performance against strong baselines. The table below shows the average distance from the target position over 20 episodes for the best-learned policy (a lower value is better). ProPS outperforms both OpenAI-ES and PPO.
| | ProPS | OpenAI-ES | PPO |
|---|---|---|---|
| Distance from the goal (m) | 0.07 ± 0.02 | 0.40 ± 0.34 | 0.53 ± 0.36 |
These results support our claim that ProPS applies effectively to complex, realistic domains and can optimize non-trivial motor skills as an RL task with minimal tuning.
We will include the details of this experiment in our final version of the paper, including the full results, ball/robot trajectories, and implementation decisions. We commit to finding further empirical evidence of our method on more robotic tasks and eventually real-world applications. Thank you again for your time and consideration!
References:
[1] Mülling, Katharina, et al. "Learning to select and generalize striking movements in robot table tennis." The IJRR 32.3 (2013): 263-279.
[2] Celik, Onur, Aleksandar Taranovic, and Gerhard Neumann. "Acquiring Diverse Skills using Curriculum Reinforcement Learning with Mixture of Experts." ICML PMLR, 2024.
[3] D'Ambrosio, et al. “Robotic Table Tennis: A Case Study into a High Speed Learning System.” RSS. (2023)
Thank you for the detailed rebuttal; I have increased my score.
We want to extend our sincere gratitude for your constructive feedback in the rebuttal process. Your insightful questions, particularly regarding the complexity of our experimental tasks and prompt sensitivity, prompted us to conduct new experiments that have significantly strengthened our paper. We are very grateful that you took the time to consider our detailed response and the new results. Thank you for helping us improve our submission.
Summary
This paper introduces Prompted Policy Search (PPS), a novel framework that integrates reinforcement learning (RL) with large language models (LLMs) through a hybrid reasoning approach combining linguistic and numerical search. Instead of relying solely on gradient-based optimization, PPS formulates prompt selection as a policy search problem, enabling iterative refinement of LLM behaviors using structured feedback.
Strengths
- Novelty: The formulation of prompt engineering as policy search is highly innovative, providing a principled bridge between LLM reasoning and RL optimization.
- Integrating linguistic actions and numerical policies yields more interpretable decision processes.
- Ablation studies confirm the contribution of each component — prompt search, structured reward modeling, and adaptive exploration.
Weaknesses
- Limited performance on complex tasks: the results on the MuJoCo Hopper and Walker tasks are not strong.
Discussion
During the rebuttal period:
- Reviewer R1 questioned the lack of comparison with alternative prompt optimization methods. The authors responded by adding results for two additional baselines, which partially addressed this concern.
- Reviewer R2 raised concerns about the reward design; the authors clarified their methodology and included additional discussion on its limitations.
- Reviewer R3 requested more details on generalization to out-of-domain tasks, which the authors addressed with supplementary experiments.
Overall, the rebuttal effectively resolved most critical concerns. While the evaluation scope remains somewhat limited, the authors’ clarifications increased confidence in the technical soundness and broader relevance of PPS. Recommendation: Accept.