Breaking the Performance Ceiling in Reinforcement Learning requires Inference Strategies
Using search strategies at inference time can provide a massive performance boost on numerous complex reinforcement learning tasks, within only a couple of seconds of execution time.
Abstract
Reviews and Discussion
This paper investigates several test-time inference strategies for solving complex multi-agent tasks using RL. Over an extensive suite of chosen benchmarks, it is shown that test-time inference consistently improves over baselines that do not use the technique, with a diversity-driven approach previously proposed for combinatorial optimization performing best overall, both in task scores and in scalability.
Strengths and Weaknesses
Strengths
- The paper is clearly written and easy to follow.
- The empirical performance of the approaches is very impressive, especially given the wide-ranging selection of tasks.
- The analysis experiments in Figures 6 and 7 give very practical takeaways for future usage of this technique.
- The experimental results in Figure 5 have good statistical significance.
Weaknesses
- The algorithmic novelty is rather limited, as it simply applies a now widely known principle from work on LLM inference-time compute. The application domain, however, is novel.
- The experimental results in Figure 4 do not report any quantification of statistical significance.
Questions
- Please provide more experimental details and/or report error bars for Figure 4.
- Since the experiments were only done on multi-agent domains, the title is perhaps a somewhat overstated claim. Changing it to emphasize the multi-agent aspect would be preferred; alternatively, include additional single-agent RL experiments.
Limitations
Yes
Justification for Final Rating
This was a very solid empirical paper to begin with. Both of my issues, the "single-agent claim" and the lack of uncertainty quantification in Figure 4, have been properly addressed; I gave both issues equal weighting. The authors' provided context on the single-agent setting was nicely done as well. Hence I recommend a rating of 5.
Formatting Issues
n/a
We thank the reviewer for their feedback and for highlighting the strengths of our paper.
> W1: The algorithmic novelty is rather limited, as it is simply applying a now widely-known principle from works on LLM inference-time compute.
Our main message is that practitioners often miss an opportunity to achieve significant performance improvements. This message is made stronger by the fact that these methods are already accessible yet often ignored, despite having been published at top-tier conferences several years ago: online fine-tuning [5] at ICLR 2017, SGBS [6] at Neurips 2022, and COMPASS [3] at Neurips 2023.
The methods presented in the paper are not all applicable to LLMs (e.g. online fine-tuning is intractable for LLMs) and are more closely related to the RL-for-CO literature. Even though the algorithmic contribution is not the core contribution of our paper, it is worth mentioning that we are the first to generalise COMPASS to the multi-agent setting.
In summary, we agree with the reviewer that our algorithmic novelty is rather limited; however, we emphasize that this aspect reinforces the significance of our main message.
[3] Combinatorial Optimization with Policy Adaptation using Latent Space Search, Neurips 2023
[5] Neural Combinatorial Optimization with Reinforcement Learning, ICLR 2017
[6] Simulation-guided Beam Search for Neural Combinatorial Optimization, Neurips 2022
> Q1: Please provide more experiment details and/or report error bars for Figure 4.
We thank the reviewer for pointing this out. Please see the table below, where we give the mean and 95% bootstrap confidence intervals. The 20M timestep values are aggregated over 10 independent runs and computed directly from the benchmarking data provided along with the Sable [2] paper; for the policy trained to convergence and for COMPASS, we give results aggregated over 128 independent runs.
To avoid overcrowding Figure 4 in the main paper, we propose to include these detailed confidence intervals and error bars in a dedicated table and an extended version of the plot, both placed in the Appendix. We will clearly reference this supplementary information in the caption of Figure 4.
[2] Sable: a Performant, Efficient and Scalable Sequence Model for MARL, ICML 2025
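For reference, here is a minimal sketch of how such percentile-bootstrap confidence intervals can be computed from per-run scores; the function name and the example scores are illustrative placeholders, not our actual evaluation code.

```python
import numpy as np

def bootstrap_ci(scores, num_resamples=10_000, alpha=0.05, seed=0):
    """Mean and 95% percentile-bootstrap confidence interval over independent runs."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample the runs with replacement and record the mean of each resample.
    resampled_means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(num_resamples)
    ])
    lower, upper = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lower, upper)

# Illustrative usage with 128 hypothetical per-run normalised returns:
# mean, (lo, hi) = bootstrap_ci(np.random.default_rng(1).uniform(0.90, 0.99, size=128))
```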
Values and error bars of results reported in Figure 4:
| Scenario | 20M Timesteps | Converged | COMPASS |
|---|---|---|---|
| con-10x10x10a | 0.388 (0.372, 0.404) | 0.500 (0.414, 0.586) | 0.953 (0.914, 0.984) |
| con-15x15x23a | 0.075 (0.067, 0.083) | 0.078 (0.039, 0.125) | 0.867 (0.805, 0.922) |
| large-4ag | 0.246 (0.200, 0.283) | 0.642 (0.618, 0.664) | 0.949 (0.940, 0.957) |
| large-4ag-hard | 0.175 (0.113, 0.225) | 0.609 (0.581, 0.633) | 0.933 (0.922, 0.943) |
| large-8ag | 0.203 (0.197, 0.208) | 0.496 (0.464, 0.526) | 0.953 (0.947, 0.959) |
| large-8ag-hard | 0.209 (0.202, 0.216) | 0.516 (0.489, 0.541) | 0.957 (0.950, 0.963) |
| medium-4ag | 0.376 (0.346, 0.396) | 0.666 (0.642, 0.686) | 0.947 (0.941, 0.953) |
| medium-4ag-hard | 0.253 (0.220, 0.282) | 0.701 (0.677, 0.721) | 0.949 (0.943, 0.955) |
| medium-6ag | 0.240 (0.227, 0.250) | 0.702 (0.668, 0.733) | 0.958 (0.954, 0.963) |
| smacv2_10_units | 0.648 (0.623, 0.675) | 0.852 (0.789, 0.914) | 0.977 (0.945, 1.000) |
| smacv2_20_units | 0.599 (0.548, 0.639) | 0.719 (0.641, 0.797) | 0.969 (0.938, 0.992) |
| small-4ag | 0.384 (0.243, 0.484) | 0.728 (0.706, 0.746) | 0.963 (0.956, 0.969) |
| small-4ag-hard | 0.349 (0.327, 0.371) | 0.707 (0.687, 0.725) | 0.955 (0.948, 0.962) |
| tiny-2ag-hard | 0.664 (0.646, 0.684) | 0.810 (0.799, 0.821) | 0.978 (0.967, 0.989) |
| tiny-4ag-hard | 0.548 (0.529, 0.566) | 0.783 (0.756, 0.806) | 0.958 (0.951, 0.964) |
| xlarge-4ag | 0.193 (0.118, 0.261) | 0.485 (0.465, 0.504) | 0.905 (0.893, 0.916) |
| xlarge-4ag-hard | 0.065 (0.001, 0.161) | 0.394 (0.373, 0.413) | 0.892 (0.879, 0.906) |
> Q2: Since the experiments were only done on multi-agent domains, perhaps the title is a bit of an overly stated claim.
We agree with the reviewer that the inclusion of single-agent experiments helps substantiate our title. In the following reply, we discuss published results on single-agent tasks, provide new empirical evidence on a complex single-agent RL task (Craftax), and reaffirm why our current benchmark is sufficient to support our main message.
First, we can refer to the experimental results reported in the seminal works that introduced the inference strategies used in our paper [3, 5, 6]. All of these evaluate on single-agent tasks (TSP, CVRP, JSSP, Knapsack, FFSP) and show that inference strategies bring statistically significant improvements over zero-shot performance. However, we recall the limitations we have identified in these experimental results (Section 2, lines 103 to 109): the limited breadth of the time/compute settings, the methods' reliance on problem-specific tricks, and saturated benchmarks where zero-shot performance is already very high.
That said, we provide additional results, obtained within the time allowed for the rebuttal, to support our core message in the single-agent case. Specifically, we trained a Sable base policy on Craftax [7] (ICML 2024) and matched the current state-of-the-art performance reported on the Craftax-1B leaderboard, which is 18.3% of the maximum score. We then ran a 30-second inference-time search with the simplest inference strategy (stochastic sampling). The results are reported in the table below. Strikingly, and similar to our main results, a 30-second inference-time search alone yields a 37% performance gain, reaching 25.8% of the maximum score. To put this into context, this performance boost is of the same magnitude as that of the last 300 million steps of training the base policy.
We are happy to add these discussions and the additional results on Craftax in a new section in the Appendix of the revised manuscript.
| Scenario | Zero-shot | Stochastic Sampling (30-second budget) |
|---|---|---|
| Craftax-1B | 18.8 (18.1, 19.5) | 25.8 (25.7, 26.1) |
We nevertheless believe that our current benchmark and experimental protocol provide enough evidence in the main paper to support our message because: (i) it is peer-reviewed [1] and recognised in the RL community (accepted at Neurips 2021, widely cited), (ii) even the current SOTA method Sable [2] (ICML 2025) is far from optimal on a third of this benchmark, hence providing a 17-task suite to test our hypothesis, and (iii) its formulation as Dec-POMDPs makes it generic and closer to real-world RL settings.
We appreciate the reviewer raising this important point and hope the additional single-agent results provided here demonstrate the generality of our main message.
[1] Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks, Neurips 2021
[2] Sable: a Performant, Efficient and Scalable Sequence Model for MARL, ICML 2025
[3] Combinatorial Optimization with Policy Adaptation using Latent Space Search, Neurips 2023
[5] Neural Combinatorial Optimization with Reinforcement Learning, ICLR 2017
[6] Simulation-guided Beam Search for Neural Combinatorial Optimization, Neurips 2022
[7] Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning, ICML 2024
Thank you for taking the time to give a well-crafted response. I will keep my confidence score, but since my concerns have been addressed, I will raise my rating from 4 to 5. Great work!
The work studies different inference strategies and their effectiveness in improving the performance of multi-agent RL algorithms on a variety of difficult benchmarks. To this end, the work curates a set of difficult MARL problems and proposes adaptations of inference strategies for the MARL setting.
Strengths and Weaknesses
The paper is well written and motivates the use of inference strategies for (MA)RL convincingly. The empirical evaluation is solid and highlights that the use of inference strategies can vastly improve MARL solutions, given that inference-time adaptation is permissible. Overall, I believe the work can be the basis of interesting discussions in the RL community at NeurIPS.
I only see small weaknesses. The restriction to the MARL case seems the most significant limitation to me. It would have been interesting to see how inference strategies perform in at least one single-agent case. Further, few-shot learning is often done in meta-RL, and a discussion of how this research is distinct from the use of inference strategies would be instructive. Similarly, offline RL often permits fine-tuning with real environment interactions after having learned from an offline dataset, and a discussion of how these lines of work are related would put the presented work better into the overall RL context.
Questions
- How do you see the relation to few-shot meta-RL approaches and fine-tuning in offline RL?
- Do you expect that inference strategies always need full episodes or would it be possible to only use short episode stubs?
- Can you provide real-world examples in which a user might permit inference time adaptation?
Limitations
See my comments on the relationship to offline RL and meta-RL methods.
Justification for Final Rating
The work was already easy to follow and convincingly showed the utility of using inference strategies for RL. The authors' rebuttal response better connects the work to related RL research areas and highlights its practicality in real-world scenarios. Further, the additional experiment in the single-agent case solidifies that these strategies are not just helpful in the MARL setting.
Formatting Issues
None
We are grateful for the reviewer's feedback, especially their observation regarding the potential discussions our work could spark within the RL community. Below, we address the choice of benchmark and the connection to meta-RL and offline RL, and we provide real-world examples, alongside proposed revisions for our manuscript.
> W1: Limitation to the MARL case. It would have been interesting to see for at least some single-agent case how inference strategies perform.
We agree that discussing the single-agent case is valuable. In the following reply, we discuss published results on single-agent tasks, provide new empirical evidence on a complex single-agent RL task (Craftax), and reaffirm why our current benchmark is sufficient to support our main message.
First, we can refer to the experimental results reported in the seminal works that introduced the inference strategies used in our paper [3, 5, 6]. All of these evaluate on single-agent tasks (TSP, CVRP, JSSP, Knapsack, FFSP) and show that inference strategies bring statistically significant improvements over zero-shot performance. However, we recall the limitations we have identified in these experimental results (Section 2, lines 103 to 109): the limited breadth of the time/compute settings, the methods' reliance on problem-specific tricks, and saturated benchmarks where zero-shot performance is already very high.
That said, we provide additional results, obtained within the time allowed for the rebuttal, to support our core message in the single-agent case. Specifically, we trained a Sable base policy on Craftax [7] (ICML 2024) and matched the current state-of-the-art performance reported on the Craftax-1B leaderboard, which is 18.3% of the maximum score. We then ran a 30-second inference-time search with the simplest inference strategy (stochastic sampling). The results are reported in the table below. Strikingly, and similar to our main results, a 30-second inference-time search alone yields a 37% performance gain, reaching 25.8% of the maximum score. To put this into context, this performance boost is of the same magnitude as that of the last 300 million steps of training the base policy.
We are happy to add these discussions and the additional results on Craftax in a new section in the Appendix of the revised manuscript.
| Scenario | Zero-shot | Stochastic Sampling (30-second budget) |
|---|---|---|
| Craftax-1B | 18.8 (18.1, 19.5) | 25.8 (25.7, 26.1) |
We nevertheless believe that our current benchmark and experimental protocol provide enough evidence in the main paper to support our message because: (i) it is peer-reviewed [1] and recognised in the RL community (accepted at Neurips 2021, widely cited), (ii) even the current SOTA method Sable [2] (ICML 2025) is far from optimal on a third of this benchmark, hence providing a 17-task suite to test our hypothesis, and (iii) its formulation as Dec-POMDPs makes it generic and closer to real-world RL settings.
We appreciate the opportunity to engage with this thoughtful discussion and hope that our response and additional results provide valuable insights.
[1] Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks, Neurips 2021
[2] Sable: a Performant, Efficient and Scalable Sequence Model for MARL, ICML 2025
[3] Combinatorial Optimization with Policy Adaptation using Latent Space Search, Neurips 2023
[5] Neural Combinatorial Optimization with Reinforcement Learning, ICLR 2017
[6] Simulation-guided Beam Search for Neural Combinatorial Optimization, Neurips 2022
[7] Craftax: A Lightning-Fast Benchmark for Open-Ended Reinforcement Learning, ICML 2024
> Q1: Relation to few-shot meta-RL and fine-tuning in offline RL.
We acknowledge that relating meta-RL and offline-to-online RL to inference strategies is an interesting addition to the paper, and we are happy to add these discussions to our related work (Section 2).
Inference strategies, just like meta-RL and offline-to-online RL, benefit from policy adaptation methods in order to produce better solutions. Consequently, methods developed in one of these fields can benefit the others. Interestingly, COMPASS shares similar concepts with VariBAD [8], a seminal work in meta-RL: both condition a policy on a latent space and search this space to adapt. Other inference strategies from the literature have close links to meta-learning. For instance, MEMENTO [9] shares similarities with RL^2 [10], using a memory to search (respectively, to adapt). DIMES [11] is an inference strategy that explicitly performs meta-RL (following MAML's methodology [12]). We do not compare to DIMES because it is highly domain-specific (only applicable to graphs, e.g. TSP and MIS). While we are not aware of inference strategies directly inspired by the offline-to-online RL literature, methods such as Cal-QL [13] and PA-RL [14] introduce adaptation mechanisms that would likely be beneficial during inference.
We will also recall some fundamental distinctions between these fields. In particular, inference strategies (i) care about the solutions found rather than the policy obtained, (ii) target the maximum rather than the average performance, and (iii) treat a single instance at a time, whereas most meta-RL and offline-to-online RL methods adapt from one distribution to another.
[8] VARIBAD: A Very Good Method for Bayes-Adaptive Deep RL via Meta-Learning, ICLR 2020
[9] Memory-Enhanced Neural Solvers for Efficient Adaptation in Combinatorial Optimization, arxiv (2024)
[10] RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning, arxiv (2016, cited 1000+)
[11] DIMES: A Differentiable Meta Solver for Combinatorial Optimization Problems. Neurips 2022
[12] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
[13] Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning, Neurips 2023
[14] Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone, ICLR 2025
> Q2: Would it be possible to use short episode stubs?
When per-timestep rewards are accessible, it is indeed possible to use short episode stubs instead of entire episode rollouts. We stayed as close as possible to the existing methods published in the literature; hence, we only ever used full-episode information. However, we see no restriction on extending these methods to use shorter episode stubs, since the benchmarks we use provide per-timestep rewards.
> Q3: Can you provide real-world examples in which a user might permit inference time adaptation?
Applying RL in real-world applications is a key driver of our research. Here are three real-world cases where we have first-hand experience; in all of them, the user would allow a time budget, and a proxy of the solution's quality is accessible to inform the search:
- Printed circuit board routing. The user submits the specification of a PCB they want to route. This task can take up to several days or weeks, even for an expert, depending on the complexity of the board. Therefore, users are typically completely satisfied with allowing an AI model at least a few minutes, or even hours, of search to obtain a better solution (a routed board). Additionally, the user will find it beneficial to allow more time and computation to improve the board further, which could make it easier to manufacture or reduce its production cost.
- Bin packing. A logistics company wants to load containers with objects that need to be shipped overseas. The operators typically have a few hours or days when planning which objects need to be transported, and can hence allow an inference-time search. Here as well, the energy and money gains from a better object dispatching (thus fewer containers to ship for the same amount of objects) will always justify a few hours of search (or the allowance of more compute spend).
- Train scheduling. Designing train schedules to meet some demand forecast while proposing a feasible schedule (e.g., no collisions between trains). This only happens once or twice a year and takes months, so there is no need to rely on a single rollout of the trained policy, and an inference-time search can be conducted over several days.
We appreciate the reviewer's suggestion; these examples will be added to a new appendix section to illustrate the link between this work and real-world applications.
Thank you very much for this very detailed response. The additional experiment is very helpful in highlighting the utility of the proposed approaches and the discussions of related RL concepts, as well as real world problems where this could/should be used makes the work even more understandable. I am more confident in my assessment now and will keep my score of accept.
The authors found that using small inference-time budgets can significantly improve the performance of MARL agents compared to the zero-shot setting. The typical RL literature usually focuses on a zero-shot setting, meaning the RL agent gets one attempt at each time step during evaluation, whereas making multiple attempts before giving a final result has become more popular in the LLM space.
The authors took inspiration from making multiple attempts during inference and applied it in MARL. They conducted experiments on a wide range of MARL tasks, from the Multi-Robot Warehouse and the StarCraft Multi-Agent Challenge to Connector. In all the experiments, the authors found that inference strategies can push the final performance higher.
Strengths and Weaknesses
Strengths:
Experiments seem comprehensive, covering a wide variety of tasks. The gains from inference strategies are solid. It is also surprising that even a small inference budget (30 seconds) could make a big impact on the model's performance. Charts seem reasonable, too. For example, Figures 5 and 6 show nicely how more parallel attempts can unlock greater performance.
Weaknesses:
It feels like the benchmark environments are already somewhat saturated.
Questions
I am not sure I follow how the inference strategies work in MARL. The model is performing actions at every time step. How do inference strategies work? Do the models consider multiple actions before deciding on a single action at each time step?
Limitations
N/A
Justification for Final Rating
The authors' responses are well written, and their explanation to Reviewer Ptix of why COMPASS is better is well elaborated. Given these clarifications, I am raising my score.
Formatting Issues
N/A
We thank the reviewer for their positive feedback and comments. We provide additional insight on the benchmark used and explanations concerning inference strategies, which we will include in the revised manuscript.
> W1: it feels like the benchmark environments are already somewhat saturated.
Our benchmark actually focuses on tasks from the literature that are still far from being saturated. We chose tasks from the benchmark introduced in [1] (Neurips 2021) on which the most recent state-of-the-art method, Sable [2] (ICML 2025), reports an average performance below 50% normalised return (see the grey crosses in Figure 4 of our main paper). To isolate the effect of a potential lack of training, we re-trained Sable until convergence on these tasks and observe that the zero-shot performance plateaus under 70% (see the orange crosses in Figure 4), reaching only 5% on the hardest task (con-15x15x23a), hence demonstrating that zero-shot performance hits a ceiling on these tasks.
We discuss this aspect of the benchmark in paragraph “Tasks” (Section 4, line 233) and paragraph “Training base policies” (Section 4, line 263/264). We can further clarify by updating the sentence on line 263/264: “... the converged policy stays below 70% normalised performance, demonstrating that the benchmark is still far from saturated”.
[1] Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks, Neurips 2021
[2] Sable: a Performant, Efficient and Scalable Sequence Model for MARL, ICML 2025
> Q1: How do inference strategies work? Do the models consider multiple actions before deciding on a single action at each time step?
Section 3 defines an inference strategy as a function that uses the base policy and any additional inference-time search, adaptation, storage, or optimisation methods under the time and compute budget to produce the best possible solution to the problem instance (lines 169 to 171). Inference strategies represent a broad concept, encompassing various approaches. The effectiveness of these strategies is contingent on specific contextual factors, such as the size of the pre-trained policy, the problem's inherent structure, the duration of the episode, the nature of the reward function, and the available time and computational budget.
Concerning the action selection, all four inference strategies studied in the paper use a pre-trained policy which outputs an action distribution at each time step. The next action is always selected by sampling from this action distribution. In other words, any (allowed) action can be selected, with a probability given by the pre-trained policy's action distribution (i.e. the softmax-normalised logits) for this action.
These inference strategies differ in the way they incorporate outcomes of previous attempts to impact the way future actions will be sampled. Online fine-tuning uses the past attempts to derive a policy gradient and update the policy’s parameters. COMPASS uses the past attempts to update the distribution used to sample the latent vectors which condition the policy. SGBS (tree search) re-starts from the partial solutions which led to the best performance so far (whereas other methods always re-sample solutions from scratch). In contrast, stochastic sampling does not take past attempts into account.
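To make this concrete, below is a minimal sketch of the simplest strategy, stochastic sampling, as a best-of-N search under a wall-clock budget. The `env` and `policy` interfaces are hypothetical placeholders (written single-agent style for brevity; in our setting the policy outputs a joint action distribution), not our actual implementation.

```python
import time
import numpy as np

def stochastic_sampling(env, policy, budget_seconds=30.0, seed=0):
    """Best-of-N inference: repeatedly roll out the stochastic base policy on the
    same problem instance and keep the highest-return trajectory found in time."""
    rng = np.random.default_rng(seed)
    best_return, best_trajectory = -np.inf, None
    start = time.time()
    while time.time() - start < budget_seconds:
        obs = env.reset()
        done, episode_return, trajectory = False, 0.0, []
        while not done:
            # The pre-trained policy outputs a probability over (allowed) actions;
            # the next action is sampled from this distribution, not taken greedily.
            probs = policy.action_distribution(obs)
            action = rng.choice(len(probs), p=probs)
            obs, reward, done, _ = env.step(action)
            episode_return += reward
            trajectory.append(action)
        if episode_return > best_return:  # only the best attempt is kept
            best_return, best_trajectory = episode_return, trajectory
    return best_trajectory, best_return
```

The other strategies replace the independent re-sampling in the outer loop with a mechanism that exploits previous attempts (policy-gradient updates, latent-space search, or restarting from promising partial solutions), as described above.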
We will add these details at the beginning of Section 3.2 in order to better link the methods under study with the definitions introduced in Section 3.1. We thank the reviewer for their comments which will help clarify the paper.
I thank the authors for the clarifications. It is good to know that some domains are not saturated, and I appreciate the authors' further explanation of the inference strategies.
The paper applies different search strategies to multi-agent RL and shows that COMPASS performs best compared to greedy decoding, stochastic sampling, tree search, and fine-tuning.
优缺点分析
Strengths:
- The paper is well motivated and the experimental results are promising.
- The paper is well written, and the contribution is fair.
Weaknesses:
- It would be good to explain the details of how the specialized policy in COMPASS is trained, and to show a comparison of the model sizes.
Questions
- It would be good to explain the details of how the specialized policy in COMPASS is trained, and to show a comparison of the model sizes.
- It would be good to explain the reason why COMPASS performs better.
Limitations
yes
Formatting Issues
no
We thank the reviewer for highlighting the contributions of our work and for their thoughtful comments, which we address in the responses below. We will update the manuscript to clarify the points raised in the review.
> Q1: Explain details of how the specialized policy in COMPASS is trained.
At each training step, a batch of latent vectors is sampled from the latent space, using a fixed random uniform distribution. The latent-conditioned policy is evaluated on the same problem instance, with each of the sampled latent vectors. The episode returns obtained for each latent vector are compared, and only the best trajectory is kept to train on. In line with the original implementation of COMPASS [3] (Neurips 2023), we use REINFORCE to estimate the policy gradient, using a batch of instances evaluated in parallel.
Only the best latent vector in the batch is used to train the policy, thus creating a specialisation within the latent space. In doing so, different areas of the latent space learn to specialise for different subdistributions of the training instances.
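As an illustration, here is a minimal sketch of one such training step. The `rollout_return` and `reinforce_update` interfaces, the batch size, the baseline, and the sampling interval of the prior are hypothetical placeholders standing in for the details given in Appendix E.2, not our actual implementation.

```python
import numpy as np

def compass_training_step(policy, instance, num_latents=16, latent_dim=16, seed=0):
    """One COMPASS-style training step on a single problem instance: evaluate a
    batch of latent vectors, keep only the best trajectory, and train on it."""
    rng = np.random.default_rng(seed)
    # Sample a batch of latent vectors from a fixed uniform prior
    # (the interval shown here is an assumption for illustration).
    latents = rng.uniform(-1.0, 1.0, size=(num_latents, latent_dim))
    # Evaluate the latent-conditioned policy on the same instance for each latent.
    returns = np.array([policy.rollout_return(instance, z) for z in latents])
    # Keep only the best latent/trajectory of the batch and apply REINFORCE to it,
    # so that regions of the latent space specialise to the instances they win on.
    best = int(np.argmax(returns))
    baseline = returns.mean()  # a simple baseline choice for this sketch
    policy.reinforce_update(instance, latents[best], advantage=returns[best] - baseline)
    return returns[best]
```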
This training process is explained in Appendix E.2 (lines 945 to 976), which we omitted to refer to in the main paper. We will revise the manuscript to add this reference at the end of the paragraph “Diversity-based approach” (line 204, Section 3.2).
We can also extend this paragraph if the reviewer(s) feel this would be useful, or use the additional page allowed for the camera-ready version to include it directly in the main paper.
[3] Combinatorial Optimization with Policy Adaptation using Latent Space Search, Neurips 2023
> Q2: Comparison of the model size.
Creating a COMPASS checkpoint from a pre-trained base policy does indeed require adding new parameters to process the latent vectors (line 953, App. E.2), hence increasing the model size. However, this increase is small, never exceeding 2% of the total base policy model size.
We report the parameter counts for all base policy checkpoints and all corresponding COMPASS checkpoints in the tables below. We observe that (i) for IPPO and MAPPO, the increase in model size stays below 1.6%, and (ii) for Sable, it typically adds 0.5% to the total parameter count. Consequently, this increase in model size has a minor impact on the memory and computational footprint of the method, which is confirmed by our study of the number of attempts achieved over time (reported in Appendix G).
We thank the reviewer for highlighting this aspect of COMPASS, which was not yet discussed in the paper. We will add these tables to our appendix, along with a paragraph commenting on these values.
> Q3: Explain the reason behind COMPASS being better.
We highlight three main reasons why COMPASS is better.
First, COMPASS is trained with a "multiple attempts" objective, unlike the other methods, which are trained for "single attempt" performance. Thus, COMPASS's training objective anticipates that what matters is the maximum performance over multiple attempts, rather than the average performance across these attempts.
Second, COMPASS creates a dense space of diverse and specialised policies, hence creating a strong exploration capacity. While most other methods rely on the entropy of the action distribution to explore (i.e. sampling stochastically from the pre-trained policy), COMPASS can use the wide diversity encoded in its latent space to build much more diverse policies.
Finally, the Covariance Matrix Adaptation mechanism (CMA-ES [4]) used by COMPASS at inference time enables it to navigate the latent space and identify the most suitable latent vector for the problem instance at hand. Using a CMA-ES search in a 16-dimensional space is more efficient and less prone to getting stuck in local optima than using gradient descent in the entire policy parameter space (i.e. online fine-tuning).
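To illustrate this last point, here is a minimal sketch of an evolutionary search over the latent space at inference time. For brevity it uses a simple diagonal update of the sampling distribution (closer to the cross-entropy method than to full CMA-ES, which additionally adapts a covariance matrix and step size [4]); `rollout_return` and all hyperparameters are hypothetical placeholders.

```python
import numpy as np

def latent_space_search(policy, instance, generations=50, pop_size=16,
                        latent_dim=16, elite_frac=0.25, seed=0):
    """Search the latent space of the conditioned policy: sample latents, evaluate
    them on the instance, and move the sampling distribution towards the best ones."""
    rng = np.random.default_rng(seed)
    mean, std = np.zeros(latent_dim), np.ones(latent_dim)
    best_return, best_latent = -np.inf, None
    n_elite = max(1, int(elite_frac * pop_size))
    for _ in range(generations):
        latents = mean + std * rng.standard_normal((pop_size, latent_dim))
        returns = np.array([policy.rollout_return(instance, z) for z in latents])
        elite = latents[np.argsort(returns)[-n_elite:]]  # best latents this generation
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
        if returns.max() > best_return:
            best_return = float(returns.max())
            best_latent = latents[int(np.argmax(returns))]
    return best_latent, best_return
```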
Some of these elements are discussed in a paragraph in Section 4.3 to interpret the scaling capacity of COMPASS. We will add a section in the Appendix to discuss the elements that were not yet mentioned.
[4] Completely derandomized self-adaptation in evolution strategies, Evolutionary Computation 2001
The following tables present the parameter counts for the checkpoints used in our experiments.
MAPPO:
| Scenario | Base Policy | COMPASS Policy | Increase |
|---|---|---|---|
| small-4ag-hard | 125061 | 127109 | 1.64% |
| medium-6ag | 125317 | 127365 | 1.63% |
| tiny-4ag-hard | 125061 | 127109 | 1.64% |
| xlarge-4ag | 125061 | 127109 | 1.64% |
| xlarge-4ag-hard | 125061 | 127109 | 1.64% |
| large-8ag | 125573 | 127621 | 1.63% |
| tiny-2ag-hard | 124805 | 126853 | 1.64% |
| large-8ag-hard | 125573 | 127621 | 1.63% |
| medium-4ag | 125061 | 127109 | 1.64% |
| large-4ag | 125061 | 127109 | 1.64% |
| large-4ag-hard | 125061 | 127109 | 1.64% |
| medium-4ag-hard | 125061 | 127109 | 1.64% |
| smacv2_20_units | 187417 | 189465 | 1.09% |
| smacv2_10_units | 151567 | 153615 | 1.35% |
| small-4ag | 125061 | 127109 | 1.64% |
| con-10x10x10a | 124293 | 126341 | 1.65% |
| con-15x15x23a | 125957 | 128005 | 1.63% |
IPPO:
| Scenario | Base Policy | COMPASS Policy | Increase |
|---|---|---|---|
| large-8ag-hard | 125573 | 127621 | 1.63% |
| medium-6ag | 125317 | 127365 | 1.63% |
| large-8ag | 125573 | 127621 | 1.63% |
| tiny-4ag-hard | 125061 | 127109 | 1.64% |
| small-4ag-hard | 125061 | 127109 | 1.64% |
| tiny-2ag-hard | 124805 | 126853 | 1.64% |
| xlarge-4ag-hard | 125061 | 127109 | 1.64% |
| large-4ag-hard | 125061 | 127109 | 1.64% |
| large-4ag | 125061 | 127109 | 1.64% |
| xlarge-4ag | 125061 | 127109 | 1.64% |
| medium-4ag | 125061 | 127109 | 1.64% |
| medium-4ag-hard | 125061 | 127109 | 1.64% |
| smacv2_20_units | 187417 | 189465 | 1.09% |
| smacv2_10_units | 151567 | 153615 | 1.35% |
| small-4ag | 125061 | 127109 | 1.64% |
| con-10x10x10a | 124293 | 126341 | 1.65% |
| con-15x15x23a | 125957 | 128005 | 1.63% |
Sable:
| Scenario | Base Policy | COMPASS Policy | Increase |
|---|---|---|---|
| large-8ag | 390096 | 392144 | 0.52% |
| smacv2_20_units | 43363 | 43875 | 1.18% |
| con-10x10x10a | 388422 | 390470 | 0.53% |
| con-15x15x23a | 389907 | 391955 | 0.53% |
| smacv2_10_units | 201435 | 202459 | 0.51% |
| medium-6ag | 389454 | 391502 | 0.53% |
| medium-4ag-hard | 389580 | 391628 | 0.53% |
| large-4ag-hard | 389004 | 391052 | 0.53% |
| large-8ag-hard | 389520 | 391568 | 0.53% |
| small-4ag-hard | 389004 | 391052 | 0.53% |
| tiny-4ag-hard | 1080524 | 1082572 | 0.19% |
| small-4ag | 389196 | 391244 | 0.53% |
| xlarge-4ag | 389004 | 391052 | 0.53% |
| medium-4ag | 389580 | 391628 | 0.53% |
| xlarge-4ag-hard | 1078796 | 1080844 | 0.19% |
| large-4ag | 389004 | 391052 | 0.53% |
| tiny-2ag-hard | 388746 | 390794 | 0.53% |
> If the distribution of "multiple attempts" is the same as "single attempt", why does it perform better?
A policy trained for “single attempt” optimises the average performance over attempts. At inference time, such a policy tends to produce many good solutions but rarely an excellent one.
In contrast, a policy trained for “multiple attempts” optimises the maximum performance over a batch of attempts. While its average solution quality may be lower, it has a higher probability of producing one excellent solution. This makes it inherently better suited for inference-time search where multiple attempts are allowed.
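Schematically, writing $R(\tau)$ for the return of a trajectory $\tau$ sampled from the policy $\pi_\theta$ and $K$ for the number of attempts in a batch, the two objectives can be contrasted as follows (a simplified formulation, not the exact loss used in COMPASS):

$$J_{\text{single}}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] \qquad \text{vs.} \qquad J_{\text{multi}}(\theta) = \mathbb{E}_{\tau_1, \dots, \tau_K \sim \pi_\theta}\Big[\max_{1 \le i \le K} R(\tau_i)\Big]$$

Maximising $J_{\text{multi}}$ rewards producing at least one excellent attempt among the $K$, which is exactly the quantity measured when several attempts are allowed at inference time.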
We will clarify this point in the updated manuscript.
> It would be good to explain more about how the selection of "z" is diversified during training.
The latent space is a fixed prior distribution; it is not explicitly diversified during training. Instead, the conditioned policy is trained to process samples from this prior in a way that promotes diversity in the generated solutions. Our reply to "Q1: Explain details of how the specialized policy in COMPASS is trained" describes the training mechanism that achieves this effect, and we are happy to elaborate further if any aspect remains unclear.
> Could the authors explain more on the contribution compared to "COMPASS" original paper?
Our main contribution is to show that many practitioners miss an opportunity to achieve significant performance gains by using inference-time search. We demonstrate this on a benchmark that is:
- Non-saturated: tasks mostly below 60% zero-shot performance (see Figure 4)
- Broad: covering 17 tasks without requiring domain-specific tricks
This contrasts with the original COMPASS evaluation, where most tasks considered already reached over 95% zero-shot performance and relied on domain-specific tricks (e.g., input augmentations based on symmetries in the problem formulation). These limitations are discussed in the Related Work section (lines 103 to 109).
We also contribute the largest study to date of inference strategies for complex RL, evaluating methods across a wide range of time and compute budgets (Section 4.2, Figure 6), whereas prior work (including COMPASS) typically focused on a narrow, arbitrary budget. Finally, even though the algorithmic contribution is not the core contribution of our paper, it is worth mentioning that we are the first to generalise COMPASS to the multi-agent setting.
We hope these clarifications strengthen the reviewer’s understanding of our work, and we will incorporate them to refine the manuscript. We are happy to elaborate further if needed.
It would be good to explain more about how the selection of "z" is diversified during training.
The latent space is a fixed prior distribution; it is not explicitly diversified during training. Instead, the conditioned policy is trained to process samples from this prior in a way that promotes diversity in the generated solutions. Our reply to "Q1: Explain details of how the specialized policy in COMPASS is trained" describes the training mechanism that achieves this effect, and we are happy to elaborate further if any aspect remains unclear.
Q: Could the authors explain more about the fixed prior distribution? What form does it take, how is it sampled, and is it conditioned on anything?
We use the same prior as in the original paper: a fixed 16-dimensional uniform distribution. The prior is not conditioned on any other variable and is purely random.
I am wondering: if the distribution of "multiple attempts" is the same as that of "single attempt", why does it perform better?
Regarding "COMPASS creates a dense space of diverse and specialised policies, hence creating a strong exploration capacity. While most other methods rely on the entropy of the action distribution to explore": I see that COMPASS comes from a prior reference, without a clear explanation of the latent vector "z" on which the policies are conditioned. It would be good to explain more about how the selection of "z" is diversified during training, even though it comes from another paper.
Could the authors explain more about the contribution compared to the original COMPASS paper?
The paper studies RL for combinatorial optimization problems, and convincingly demonstrates through extensive experiments that employing an inference phase with a time and compute budget allows finding significantly better solutions than those found zero-shot by a trained RL policy. All of the reviewers agreed that the experimental evidence is comprehensive and sound, the problem is very well motivated, and the exposition is clear. During the discussion phase, the authors contributed additional results on a single-agent RL problem (to go beyond only the multi-agent settings in the initial version), provided error bars on their main results using bootstrapping, and clarified the increase in parameter count required for the policy to process latent conditioning vectors. These results, and the discussion situating the work in the broader topic of RL policy adaptation (e.g. meta-learning), substantially strengthen the paper. All the reviewers agreed that the paper is above the bar for publication and likely to spark follow-up work and interesting conversations, so I recommend an oral slot for the paper.