Memory-Enhanced Neural Solvers for Routing Problems
MEMENTO improves neural routing solvers by using memory to adapt decisions at inference time, outperforming fine-tuning and search methods while pushing SOTA on 11 of 12 tasks.
Abstract
Reviews and Discussion
The paper proposes a new approach to solving routing problems with learning-based methods. Existing methods include constructive heuristics that learn a policy to construct a solution but do not leverage insights from solutions tried earlier on the same problem instance. Other existing methods do exploit such information, but with additional limitations such as being applicable only to heatmap-based policies or only learning embeddings. The proposed solution is architecture-agnostic: it learns to modify the logits generated by a 'base policy' based on a memory of choices made at the same node on the same instance. The entries in this memory are combined using a learned 'rule'. Experiments show this method is capable of improving the performance of the base method to the state of the art, even on large instances (500 nodes) and even when combined with a different base method at inference time.
Strengths and Weaknesses
Quality - Strengths
- The main claims in the paper are supported empirically. Experiments are well set up, thorough, and on relevant, subfield-appropriate environments, with a suitable choice and number of baselines (although only two different problems, TSP and CVRP, at different sizes, are used). Relevant analyses and ablations are shown, such as the comparison to REINFORCE, the zero-shot combination, and ablations on policy inputs.
Quality - Weaknesses
- Formulation: The formulation in 3.1 does not seem to accommodate the policy being conditioned on earlier samples, which does not seem to fit the approach proposed later.
- The comparison to REINFORCE could perhaps be stated a bit more carefully. I understand the comparison of two processes that, based on data, would change the logits of future actions. However, it should also be stated that these were defined for very different purposes: long-term gradual improvement of a policy versus immediate (and somewhat greedy) selection of an action. While the comparison is useful, this nuance could be better transmitted.
- The paper states that existing work on leveraging prior samples has additional limitations, such as being applicable only to heatmap-based policies or only learning embeddings. One exception that could be mentioned is "training graph neural networks with policy gradients to perform tree search" by Macfarlane et al. (NeurIPS Deep RL Workshop 2022).
Clarity - Strengths
- The structure of the paper is generally clear, the use of language is generally clear and appropriate, figures are used effectively to illustrate the methods and results.
Clarity - Weaknesses
- The abstract could make clearer that the approach leverages earlier attempts on the same instance, also for readers who wouldn't immediately understand what 'budget' refers to in this setting.
- Some discussion and presentation of the results is misplaced. The structure could be improved by moving Fig. 1 to Section 4 (it is currently on the first page!) and by removing the forward reference to Fig. 4 from Section 2. The metrics in Fig. 1 (left) should be better explained.
Significance
The proposed method achieves consistent performance across almost all conditions and environments investigated. The effect size is sometimes modest, which makes sense in these environments, where performance is slowly saturating. It was impressive to see good performance from the POMO family of auto-regressive methods on size-500 instances for the first time. Experiments are limited to TSP and CVRP, so it is not known how significant the proposed method is on other routing problems or even non-routing combinatorial problems. Overall, I would consider the method to be a quite significant contribution to this subfield.
Originality
The main idea of the paper is somewhat similar to those of existing work, but getting this to work at a state-of-the-art level required several original insights, design choices, and modifications. Overall I'd rate the originality as good.
Minor comments
- line 157 mentions that only (ii, iii, iv) are sufficient for the REINFORCE update, but this should include the state as well.
- sentence in line 293 lacks a subject
- it is helpful to include the metric shown in the figure caption or legend. This is missing from Figures 4 and 6, for example (it is explained in the text, but that makes it a bit harder to find).
Questions
- As a representation of the state, your method uses only the currently visited node. Have you experimented with other representations of the state (such as (an embedding of) the sequence so far), and why do you think using only the current node gives the best result?
- Why do you think it helps the method to have as input the remaining budget when the action in memory was taken? It seems to me that, to maximize the reward function (improvement over the best reward so far), this should not have an impact on the optimal policy. Perhaps it is a proxy for the "best reward so far" (which should help the policy)? Did you consider "best reward so far" as a feature for the memory module?
- How well do you expect this method to generalize to other routing problems and to non-routing optimization problems? What is the motivation for only considering TSP and CVRP?
Limitations
The checklist mentions that error bars are reported; however, none of the tables give error bounds (Fig. 5 does, though).
Justification for Final Rating
I will maintain my score (5: Accept) as I think even with the changes the paper does not reach the level of 6: Strong Accept (requiring, among others, "groundbreaking impact" and "exceptionally strong evaluation").
Formatting Issues
None
We thank the reviewer for their constructive comments and insightful suggestions regarding formulation clarity, methodological comparisons, and applicability to broader settings.
> W1: Section 3.1 – Policy conditioning and formulation
We agree that this could be made clearer. While the base policy is auto-regressive (i.e., conditioned on prior actions within the same trajectory), the adaptation mechanism introduced by MEMENTO is conditioned on prior trajectories, that is, on data from previous solution attempts. These prior attempts are stored and retrieved from the memory, and used to correct the action logits at each step.
To reflect this more precisely, we will clarify that the policy is conditioned not only on the current state but also on the retrieved memory, which encodes relevant prior experience for that state. The final logits are the sum of the base and memory-derived logits. This interaction will be made explicit in a revised paragraph and aligned with the formalism introduced in Section 3.
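For concreteness, one possible way to express this combined conditioning (the symbols below are illustrative and may not match the paper's exact notation) is

$$\pi(a_t \mid s_t, \mathcal{M}) \;\propto\; \exp\!\big(l_{\text{base}}(a_t \mid s_t) + l_{\text{mem}}(a_t \mid s_t, \mathcal{M}(s_t))\big),$$

where $\mathcal{M}(s_t)$ denotes the memory entries retrieved for the current state, $l_{\text{base}}$ the logits of the base policy, and $l_{\text{mem}}$ the correction produced by the memory module.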
> W2: Comparison with REINFORCE – clarification of intent and purpose
MEMENTO’s learned rule does differ in intent: it aims to refine the action distribution online during inference, whereas REINFORCE is typically used for parameter updates via stochastic gradient ascent over full episodes.
Our Figure 4 illustrates how MEMENTO’s learned rule differs structurally from REINFORCE, especially in how it leverages return and log-prob information differently at various stages of the budget. We will add this nuance when discussing Figure 4. Nonetheless, the structural similarity remains instructive: MEMENTO reinforces high-return, low-probability actions and discourages low-return, high-probability ones, akin to the REINFORCE direction. This shows that MEMENTO rediscovered a REINFORCE-like mechanism, while optimizing for max-over-returns rather than average return.
We also thank the reviewer for pointing us to the work by Macfarlane et al. (NeurIPS Deep RL 2022). We will include this in our related work section as a notable exception that explores policy gradient updates for tree search strategies, and clarify how our approach differs by operating directly at inference and modifying logits without gradient flow.
> W3: Clarifying “budget” and abstract wording
We agree this is important. We will update the abstract to state: “MEMENTO adapts the action distribution over repeated attempts on the same problem instance, leveraging data collected online.” We will also define ‘budget’ explicitly as the number of solution attempts allowed per instance, consistent with the usage in RL-based combinatorial optimisation literature.
> W4: Suggested structure and placement of figures
We acknowledge the reviewer’s point and will revise the paper layout accordingly. Specifically:
- Figure 1 will be moved to Section 4, where it is better aligned with the experimental results and does not disrupt the flow of the introduction.
- Forward references (e.g., to Fig. 4) will be removed from Section 2.
- The caption of Figure 1 (left) will clarify that "Number of times ranked best" is computed by comparing methods across 12 benchmark tasks under different evaluation budgets.
These changes should improve the structure and clarity of the presentation.
> Q1: Using richer representations of state (e.g., partial solution embeddings) instead of just nodes.
Yes, as discussed in Appendix I, we experimented with retrieving based on partial solution embeddings, for example, decoder attention outputs. This approach is theoretically attractive but proved prohibitively costly in practice, especially in batched inference settings. In contrast, retrieving data by node (i.e., which city or customer is currently being visited) yielded similarly relevant information at far lower cost. We find this strategy to be a strong proxy for local similarity. We will expand this discussion slightly in the main text and clarify the rationale.
> Q2: Why is the remaining budget used as an input feature to the MLP? Could other features like “best return so far” be helpful?
We find that the remaining budget is a strong signal for shaping the exploration–exploitation trade-off. As shown in Fig. 8, this allows MEMENTO to explore early and gradually refine later, an emergent behavior aligned with human intuition and meta-RL literature.
We did not consider "best return so far" because we believe that the normalisation we apply to the input values achieves a similar effect as giving access to the best return so far: it compares the return to the typical return obtained, through the mean and standard deviation of the returns, and can hence get a sense of the advantage over the other returns. We agree that comparing with the best return instead could be beneficial, although potentially more prone to instabilities. All in all, it is a direction worth exploring further.
> Q3: Generalization to other routing and non-routing problems
We agree that broader applicability is important. MEMENTO is fully compatible with other CO problems, including non-routing tasks. Its architecture-agnostic design, relying only on access to logits, makes it broadly applicable. The only change required for a new domain (e.g., JSSP) is to adapt the retrieval index (e.g., by job or machine id instead of node id).
We chose TSP and CVRP to align with standard benchmarks and focus on pushing scale (n = 500) rather than breadth. We are the first to train RL solvers at this scale for both problems and to release the resulting checkpoints. As mentioned in our conclusion, we view applications to other domains (e.g., scheduling, packing) as a promising avenue for future work, and we are happy to include additional discussion in the final version.
Thank you for the clarifications, they sound reasonable to me. The suggested changes to the improvements will make the paper stronger. I will maintain my score (5: Accept) as I think even with the changes the paper does not reach the level of 6: Strong Accept (requiring, among others, "groundbreaking impact" and "exceptionally strong evaluation").
The paper introduces MEMENTO, a memory-based method for online fine-tuning of neural combinatorial optimization solvers. MEMENTO enhances a pretrained constructive policy by incorporating a learned residual policy that adjusts action probabilities during inference. A memory module is used to store features from past solution trajectories—such as actions taken, associated rewards, and contextual signals—which are processed by an MLP to guide future decisions. An improvement-based reward signal enables MEMENTO to adapt the policy on-the-fly to each problem instance. Experiments on TSP and CVRP instances with up to 500 nodes demonstrate that MEMENTO consistently outperforms strong baselines, achieving effective instance-specific adaptation within constrained computational budgets.
Strengths and Weaknesses
Strengths
- The paper is well-written, with clear exposition and strong organization.
- A notable contribution of this work is its focus on incorporating computational budget constraints into the design of a construction-based neural CO solver.
Weaknesses
- Despite the considerable additional effort—such as introducing a memory module and fine-tuning the base model while training the auxiliary MLP concurrently—the performance gain of MEMENTO over the much simpler baseline, EAS, is not particularly compelling.
- To better highlight what I believe is one of the core strengths of the method—the incorporation of budget constraints—the authors should evaluate performance across a wide range of computational budgets: low, moderate, and high. If the authors' meta-learning approach is working as claimed, the results will show that the model trained for the large-budget setting performs worse than the low-budget model in the short run, but eventually surpasses it as more computation is used.
- The improved update rule illustrated in Figure 4 is intriguing, but the methodology behind generating the figure is not well explained. To plot such a graph, one would need to fix all other input variables to the MLP. Was the figure created by averaging over all nodes? All trajectories? Clarifying this process would help readers better understand the nature of the learned update behavior.
- Furthermore, if the authors can extract and interpret the underlying logic of the MLP that leads to improved update rules over REINFORCE—as Figure 4 suggests—this direction deserves deeper exploration. For example, does the MLP promote stronger exploration early in the process and gradually shift toward exploitation? Does it induce more aggressive updates at the beginning or closer to the end of the budget? Addressing such questions would not only help validate the effectiveness of the method but also provide valuable insights for the broader research community.
Questions
none
Limitations
none
Justification for Final Rating
I hope the authors will incorporate the revisions they indicated during the discussion period into their final manuscript.
Given that the neural networks in this work were not trained under different budget settings, it is likely that successful training depends on substantial manual tuning of hyperparameters and heuristics (e.g., learning rate scheduling). This could make it challenging for readers to easily apply the algorithm to other applications. Nevertheless, the originality of the ideas presented in the paper outweighs this shortcoming, and I do not oppose the paper’s acceptance.
Formatting Issues
none
We thank the reviewer for their thoughtful feedback highlighting both the strengths and areas for improvement in our work, and the opportunity to clarify design choices of our approach. Below, we address each concern.
> W1: Performance gain of MEMENTO over EAS
We understand the concern and agree that performance improvements must justify the added modeling effort. MEMENTO introduces a memory-based module, but this added complexity brings capabilities that static fine-tuning approaches like EAS cannot offer.
In particular, MEMENTO:
- Conditions on remaining budget, allowing it to shift behavior over time (e.g., explore early, refine later), unlike fixed-update methods.
- Operates without backpropagation at inference, making it significantly faster and more scalable in large-instance or large-model settings (Fig. 7).
- Can be used zero-shot with new base policies like COMPASS, highlighting its modularity, a generality not afforded by EAS.
These advantages are reflected empirically. On standard benchmarks (Table 1), MEMENTO consistently outperforms EAS across all 8 tasks. On large-scale tasks (n=500), it achieves state-of-the-art single-agent results. Moreover, Fig. 6 shows MEMENTO maintains its advantage even in low-budget or few-shot settings, where EAS fails to improve over the base.
While performance is one dimension, we believe MEMENTO’s broader modeling capacity and scalability make it a more robust and versatile tool for inference-time adaptation in neural combinatorial optimization.
> W2: Evaluation across a wide range of computational budgets
Indeed, MEMENTO was trained with an explicit goal to shift behavior over time, using the remaining budget as input. Figure 8 (Appendix G) is a training curve that illustrates this shift: early in the budget, MEMENTO promotes broader exploration; later, it focuses on exploiting promising trajectories. This is also reflected in the learned update rule (Figure 4), which shows asymmetry with respect to advantage: only actions with strictly positive normalized return are reinforced.
Additionally, we report performance across a wide range of sequential attempt budgets (e.g., 200 to 1200 shots) and batched settings (20–80) in Figure 6. In those plots, we observe that MEMENTO learns a strategy that performs moderately at first and progressively improves as the budget grows, contrasting with methods like EAS that optimize for single-shot performance and plateau earlier.
We will make this budget-sensitivity aspect more prominent in the main paper.
> W3: Clarification of Figure 4
We thank the reviewer for the interest in the learned update rule. Figure 4 was created by averaging the MLP’s logit correction output across 10,000 transitions, conditioning on two inputs: the normalized return and the log-probability of the action. All other features were fixed at their mean value. This allows us to isolate the learned update pattern across return-logprob space, similarly to what was done in Fig. 3 of the EAS paper. We will clarify this process in the caption and text.
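For reproducibility, a minimal sketch of this procedure is given below (assuming a trained `memory_mlp` callable and an array of stored transition features; the feature indices and names are hypothetical, and the real model takes the paper's full feature set):

```python
import numpy as np

# Hypothetical positions of the two swept features in the MLP input vector.
RETURN_IDX, LOGPROB_IDX = 0, 1

def update_rule_grid(memory_mlp, transitions, n_points=50):
    """Evaluate the learned logit correction over a (return, log-prob) grid,
    with all remaining features fixed at their mean over stored transitions."""
    feature_means = transitions.mean(axis=0)
    returns = np.linspace(-2.0, 2.0, n_points)      # normalized return axis
    log_probs = np.linspace(-6.0, 0.0, n_points)    # log-probability axis

    grid = np.zeros((n_points, n_points))
    for i, r in enumerate(returns):
        for j, lp in enumerate(log_probs):
            feats = feature_means.copy()
            feats[RETURN_IDX] = r
            feats[LOGPROB_IDX] = lp
            grid[i, j] = memory_mlp(feats)          # scalar logit correction
    return returns, log_probs, grid
```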
> W4: Interpretability of the learned update rule
While the MLP operates as a black box, we have conducted a structured analysis of its outputs that reveals interpretable behavior.
Specifically, we compute the MLP’s logit correction values over a grid of inputs by varying normalized return and log-probability, while fixing all other inputs at their mean (see Figure 4). This reveals that MEMENTO strongly upweights low-probability, high-return actions, a pattern that aligns with a form of advantage-weighted reinforcement, but with a sharper focus on exploitation.
Beyond this static view, we have also examined how the learned updates change over the course of the budget. When conditioning on the remaining budget (low, mid, and high), we observe a shift in the MLP’s response: early on, the MLP tends to assign nonzero weights even to uncertain or suboptimal transitions, supporting broad exploration; later, it sharply concentrates weights around high-return, high-confidence actions. We thank the reviewer for the additional analysis they suggested; we will add these dynamic response plots to the appendix and clarify the mechanisms behind this progression.
This progression is not hardcoded; it emerges from training and supports the idea that MEMENTO learns to adaptively balance exploration and exploitation through its memory module.
I thank the authors for their thoughtful and detailed responses. However, I believe there is still a disconnect in our understanding of the notion of budget, and I would like to clarify my concern more precisely.
In real-world deployments of combinatorial optimization, the available computational budget can vary widely depending on the application context. For example, Company A might operate under tight latency constraints and allow only 100 solution attempts, whereas Company B, facing fewer time restrictions, may permit up to 10,000 attempts for solving the exact same problem instance.
In such scenarios, it would be highly valuable to have an algorithm that can be explicitly pre-trained or tuned for a target budget regime. An algorithm optimized for low-budget inference may make different trade-offs (e.g., prioritizing fast exploitation) compared to one optimized for high-budget inference (which may favor early-stage exploration). Crucially, these trade-offs could be learned and embedded during training if the expected budget is known in advance.
However, as far as I can tell, all versions of MEMENTO reported in the paper are trained using a single, fixed budget setting (e.g., 1600 attempts). There is no evidence that MEMENTO has been evaluated or retrained under different budget expectations, nor is there any ablation that explores how the learned update policy generalizes to budget scales not seen during training.
While I appreciate that MEMENTO conditions on the remaining fraction of the budget during inference—and that this allows dynamic behavioral shifts within a run—this is not the same as having models explicitly trained for different total budget regimes. The use of relative budget as an input feature is orthogonal to my concern: namely, whether the total budget expectation used during training matches the actual budget available at inference time, and how sensitive the performance is to this alignment or mismatch.
I believe this axis of variation—how MEMENTO behaves when deployed under a different budget than what it was trained for—is important for both practical applicability and for validating the robustness of the proposed method. I encourage the authors to consider addressing this aspect more explicitly, either in future work or in the final version of the paper.
We sincerely thank the reviewer for this follow-up, and for clarifying their question about training under different budget regimes (e.g., 100 or 10,000 attempts).
In the current submission, although we evaluate MEMENTO at a standard total budget of 1600 attempts to match prior work in NCO (e.g., EAS, COMPASS) and ensure a fair comparison, we actually train the model using only 200 attempts. This design choice was driven by data efficiency (since each search yields only one effective gradient) and by our observation that the model still generalises effectively to much higher inference budgets: Table 1 reports inference-time search with 1600 sequential attempts, while Tables 3 and 4 report search with 500 (Low Budget) and 1000 (High Budget) sequential attempts. We agree that this distinction merits clearer emphasis and discussion in the paper.
We will revise the manuscript to highlight this aspect and, as suggested, include an ablation study in an appendix demonstrating how varying the training budget affects performance across different inference budgets.
We hope our revision clarifies the points the reviewer raised. Please let us know if there is anything else we should elaborate on.
Thank you for the detailed response. I’ve reviewed your explanations and the planned revisions carefully. I have no further questions at this time and appreciate your efforts. I will take your response into account when finalizing my rating.
This paper introduces MEMENTO (Memory-Enhanced Neural Solvers), a novel approach designed to enhance the search capabilities of neural solvers for Routing Problems, particularly at inference time. The core motivation is to improve the adaptability of learned heuristics to specific problem instances and better utilize available computational budgets, an area where existing Reinforcement Learning (RL) based methods often fall short compared to handcrafted heuristics.
MEMENTO achieves this by integrating a memory mechanism into the neural solver's inference process. This memory dynamically updates the action distribution based on the outcomes of previous decisions, allowing the solver to learn from its own search history within a single instance.
The key contributions of this work are:
- A novel memory-enhanced framework: MEMENTO leverages a dynamic memory to guide the search process of neural solvers at inference time.
- Dynamic action distribution update: The memory mechanism allows the model to dynamically update its action distribution based on previously visited nodes or partial solutions, enabling better adaptation to the current instance.
- Superior performance and adaptability: MEMENTO is validated on the Traveling Salesman Problem (TSP) and Capacitated Vehicle Routing Problem (CVRP), demonstrating superior performance compared to traditional tree-search methods and policy-gradient fine-tuning.
- Zero-shot combination with diversity-based solvers: The paper shows that MEMENTO can be effectively combined with diversity-based solvers, further improving solution quality.
- Scalability to large instances: The approach successfully trains RL autoregressive solvers on large instances, indicating its potential for real-world applications.
Strengths and Weaknesses
Strengths:
Quality: MEMENTO presents a well-conceived and technically sound approach to improve neural solvers. The integration of a memory mechanism at inference time is innovative and addresses a critical limitation of existing RL-based heuristics (lack of instance-specific adaptation). The experimental setup is robust, comparing MEMENTO against strong baselines including tree search (POMO with sampling), policy gradient fine-tuning, and combining it with diversity-based methods. The results consistently demonstrate MEMENTO's superiority across TSP and CVRP benchmarks, including larger instances, which strongly supports the claims.
Clarity: The paper is well-written and easy to follow. The problem statement regarding the limitations of current RL heuristics is clearly articulated. The MEMENTO framework, including the motivation for memory, the memory update mechanism, and its integration with the neural solver, is explained with excellent clarity. Figures (e.g., Figure 1 illustrating the framework) are highly effective and aid comprehension. The mathematical formulations are precise and enhance the understanding of the memory dynamics.
Significance: MEMENTO offers a significant contribution to the field of neural combinatorial optimization. By enabling instance-specific adaptation at inference time without requiring expensive retraining or complex tree-search algorithms, it bridges a crucial gap between learned heuristics and traditional handcrafted heuristics in terms of flexibility and budget utilization. This improvement in adaptability and quality, especially for larger instances, has immense practical implications for real-world routing problems. The ability to zero-shot combine it with diversity-based solvers further enhances its practical utility.
Originality: The core idea of "memory-enhanced" inference for neural combinatorial optimization is highly original. While memory networks exist in other domains, their specific application here to dynamically update action distributions in an autoregressive routing solver at inference time, based on an evolving partial solution, is novel. The method for deciding what to store and how to retrieve from memory is also an original design choice that leads to strong empirical results.
Weaknesses:
Memory Content and Retrieval Strategy: While the paper justifies the chosen retrieval strategy (retrieving from the same node), a more detailed exploration of alternative memory content (e.g., more complex partial solution representations beyond just the current node's embedding) and retrieval strategies (e.g., k-nearest neighbors based on solution embeddings, as briefly discussed but dismissed for cost reasons) could provide deeper insights. The current strategy might be optimal for the tested problems, but its generalizability to problems with richer state information is less clear.
Computational Overhead of Memory: While the paper states that the simplified retrieval approach improves scaling compared to k-nearest neighbors, a more explicit quantitative analysis of the computational overhead introduced by the memory operations (storage, retrieval, update) as a function of problem size (N) and number of attempts (K) would be beneficial.
Limitations of "Zero-Shot" Combination with Diversity: While MEMENTO can be combined zero-shot with diversity-based solvers, the paper could briefly discuss potential limitations or scenarios where this combination might not be as effective, or where a more integrated design might be preferable.
Questions
Question: The paper demonstrates that retrieving from the same node is an effective proxy for similarity. Have the authors explored more complex memory contents (beyond just the current node's representation) or retrieval strategies in detail, perhaps even briefly experimenting with them for smaller instances to understand their theoretical potential before dismissing due to computational cost? For example, storing more holistic partial solution embeddings or using different similarity metrics. Suggestion: Provide a brief qualitative or quantitative discussion (perhaps in the appendix) on the trade-offs involved in different memory content and retrieval strategies. While the chosen method is practical, demonstrating why others are less suitable (beyond just computational cost) would strengthen the design choices.
Question: The paper mentions that the simplified retrieval approach was chosen for better scaling. Could the authors provide a more explicit quantitative analysis of the computational overhead introduced by the memory update, storage, and retrieval operations within MEMENTO? How does this overhead scale with the number of nodes (N) and the number of search attempts (K)? Suggestion: Include a table or graph in the appendix showing the breakdown of inference time, distinguishing between the core neural solver's time and the overhead incurred by MEMENTO's memory operations, across different problem sizes and K values. This would offer a clearer picture of MEMENTO's practical efficiency. Criterion for Score Change: A detailed computational analysis of the memory operations would improve the Quality of the experimental evaluation and enhance the paper's practical Significance.
Limitations
Yes.
Justification for Final Rating
The authors addressed my concerns and provided additional context and clarification. Please ensure that all the promised improvements and additions, including the new table on inference time and the explicit discussion of design choices, are fully incorporated into the final version of the paper.
Formatting Issues
No.
We sincerely thank the reviewer for raising these thoughtful points, especially concerning memory design and practical efficiency.
> W1 & Q1: Memory content and retrieval strategy
We fully agree that alternative memory retrieval strategies are worth exploring. As discussed in Appendix I, we did consider using partial solution embeddings (e.g., from the decoder’s attention module) and k-nearest neighbor (k-NN) retrieval in that latent space. However, even when using approximate nearest neighbor search in JAX, this strategy incurred significantly higher compute and memory cost during inference, especially under batching. Surprisingly, we observed that retrieving from the same node as the current state consistently returned highly relevant transitions, serving as a strong proxy for similarity. Hence, the node-based retrieval strategy offered the best trade-off between relevance and computational efficiency.
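To make this trade-off concrete, a toy sketch of node-keyed storage and constant-time retrieval might look as follows (the class and method names are illustrative, not our actual implementation):

```python
import numpy as np

class NodeKeyedMemory:
    """Toy memory: transition features are stored per node and retrieved by node id,
    avoiding any nearest-neighbour search over latent embeddings."""

    def __init__(self, num_nodes, capacity, feature_dim):
        self.data = np.zeros((num_nodes, capacity, feature_dim), dtype=np.float32)
        self.counts = np.zeros(num_nodes, dtype=np.int64)
        self.capacity = capacity

    def write(self, node_id, features):
        # Overwrite the oldest slot once this node's buffer is full.
        slot = self.counts[node_id] % self.capacity
        self.data[node_id, slot] = features
        self.counts[node_id] += 1

    def retrieve(self, node_id):
        # Constant-time lookup of everything previously stored for this node.
        n = min(self.counts[node_id], self.capacity)
        return self.data[node_id, :n]
```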
That said, MEMENTO’s framework is modular. We encourage future work or domain-specific applications to instantiate richer retrieval mechanisms when appropriate. We will add a short discussion and a pointer to Appendix I in the main paper to make this design choice more explicit.
> W2 & Q2: Computational overhead of memory and scaling with problem size (N) and attempts (K).
We appreciate this suggestion to present a more quantitative analysis of the computational overhead. In summary, MEMENTO’s overhead scales linearly with the number of memory entries retrieved per action (typically 40) and is independent of the base model architecture.
For example, on CVRP100 with 100 nodes and 1600 attempts, the full memory size is ≈8MB (40 entries × 100 nodes × 5 float32 values × 100 starting points). The time to compute logit corrections via the MLP remains a small fraction of inference time for most architectures. For larger architectures (e.g., 10-layer decoders), MEMENTO becomes faster than EAS due to the latter’s backpropagation cost (see Fig. 7, Appendix A.4).
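For reference, the quoted footprint follows directly from a byte count, assuming 4 bytes per float32 value:

$$40 \times 100 \times 5 \times 100 \times 4 \ \text{bytes} = 8{,}000{,}000 \ \text{bytes} \approx 8 \ \text{MB}.$$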
We agree that adding an explicit table showing inference time broken into solver time and memory overhead across {100, 200, 500}-node instances and several K values will improve the experimental transparency. Therefore, we will include this table in the updated appendix.
> W3: Limitations of “zero-shot” combination with diversity-based solvers
We agree that while MEMENTO can be combined with COMPASS or other diversity-based solvers in a zero-shot manner, it may not be optimal in all regimes. For example, if the best pre-trained policy found by COMPASS differs structurally from the one MEMENTO was trained with, the effectiveness of the logit updates may decrease.
To mitigate this, we activated MEMENTO only after COMPASS’s CMA-ES search had mostly converged, as discussed in Appendix I and Section 4.2. This yielded consistent improvements across all CVRP tasks (Table 2). We agree that an integrated design (e.g., co-training or fine-tuning MEMENTO on the COMPASS policy family) could further improve results, and we will mention this promising direction for future work in the conclusion.
Thank you for your detailed and thoughtful responses. I appreciate you taking the time to address my concerns and for providing additional context and clarification. Given your thorough response and the planned revisions, I have raised my score accordingly.
Please ensure that all the promised improvements and additions, including the new table on inference time and the explicit discussion of design choices, are fully incorporated into the final version of the paper.
The paper proposes an RL agent to solve specific routing problems, which also utilizes the history of past decisions and their outcomes to craft better updates.
Strengths and Weaknesses
Strengths:
- I like the overall motivation of the approach to use the history of previous time steps to derive better 'update steps' for the CO solver.
- Extensive experimental evaluation shows promise.
Weaknesses:
- I would have liked more details about the method itself in the main paper. The content between Lines 129-195 is too abstract to get the exact details of the approach. I know there are more details in the Appendix, but let us not underestimate the importance of sharing 'research thoughts/ideas/explanation' and overemphasize the importance of experiments. Even the teaser figure on the first page is not so important.
- Similar to the above, Figure 3 does not add a lot of value. I would rather prefer exact details in text. Similarly, Lines 133-135 ("it can be combined with ...") should rather be elsewhere. I would like to know what is being done in this paper! I can keep going and give further examples, but I think the above ones are enough to get my point across. PS: I like Algorithm 1 in the Appendix; perhaps it can be put in the main paper and used as a means of describing the method?
Questions
- Do you see the proposed approach also being useful for other cases of CO (non-routing) problems?
- The word node occurs multiple times in the paper but it is not exactly defined anywhere. By 'node', do you mean a decision variable of the CO problem?
- How do base policy module and memory module interact to produce the resulting policy? (Seeing Figure 2).
Limitations
- Exposition of the paper needs to be improved.
Justification for Final Rating
Raising my score since my concerns are resolved. I will look forward to a revised version of manuscript. It is also good that the source code of the approach would be made available.
Formatting Issues
Line 127: harcoded -> hardcoded.
Line 391: No dot after Limitations and future work (for consistency).
We thank the reviewer for their detailed and thoughtful feedback.
> W1–W2: Clarity of the Method Section and Figures
We appreciate the reviewer’s emphasis on clarity and fully agree that the method section (Lines 129–195) should be more accessible and self-contained. We will revise this section to improve the logical flow of the presentation, reduce forward references (e.g., to Fig. 4, discussing zero-shot combination before introducing the method fully, etc.), and provide a clearer step-by-step explanation of the inference-time adaptation process.
While we initially placed Algorithm 1 in the appendix to manage space constraints, we understand the value of a concrete procedural description in the main text. Instead of moving the full pseudo-code into the main body, we plan to integrate its structure more directly into the revised Section 3 by aligning the narrative with its logic, i.e. using it as an implicit scaffold while keeping the main text compact. This allows us to retain the balance between conceptual clarity and narrative flow.
Regarding Figure 3, we respectfully believe that it provides critical insight into the learned behavior of our approach, distinguishing it from baseline methods. It is the only figure in the paper that explicitly visualizes the logit update rule learned by MEMENTO’s memory module, and shows how it varies with return, log-probability, and budget. We have received positive feedback on this aspect from other reviewers, as this level of interpretability is rarely available in the adaptive NCO literature. We therefore prefer to retain it in the main body, though we are open to simplifying its caption or reordering the section to reduce any perceived disruption. We thank the reviewer again for highlighting areas of improvement, and will ensure the revised draft foregrounds the method more clearly, with minimized indirection.
> Q1: Generalisation to non-routing CO Problems
MEMENTO’s design is not tied to routing problems. The mechanism of storing transition-level data and applying logit corrections via memory is agnostic to the CO domain. As long as the base policy produces logits over discrete actions and the solution can be constructed incrementally, our memory-based logit correction can be applied without modification. For example, in the case of JSSP or knapsack-type problems, the retrieval strategy would need to change (e.g., grouping memory by machine or item id), but the memory processing and update logic remain the same. We provide this example of how the retrieval strategy can be adapted in Appendix I. We are currently investigating such extensions and will outline them in our conclusion.
> Q2: Clarification on the term “Node”
Thank you for pointing out this ambiguity. In the context of routing problems like TSP and CVRP, a “node” refers to a city or customer location, i.e., a decision variable in the CO instance. More generally, it is the element selected by the action at each step. We will explicitly state this early in Section 3 to avoid any confusion and ensure consistency throughout the manuscript.
> Q3: Interaction between base policy and memory module
As described in Section 3.2 and illustrated in Figure 2, the base policy first outputs action logits from the neural decoder, as usual. In parallel, the memory module retrieves transition-level data from previous trajectories (grouped by current node) and uses an MLP to compute correction logits. These corrections are added to the base logits to yield the final logits used to sample the next action. No gradient flows into the base policy at inference, and the memory module operates independently, making this process lightweight. We will make this interaction clearer in the updated method section, in particular aligning Figure 2 more closely with Algorithm 1.
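As an illustration (not our actual code), one construction step could be sketched as follows, assuming the memory returns (action, features) pairs for the current node and the memory MLP maps a feature vector to a scalar correction:

```python
import numpy as np

def select_action(base_policy, memory_mlp, memory, state, rng):
    """One construction step: base logits plus memory-derived corrections."""
    base_logits = base_policy(state)                   # (num_actions,) from the frozen decoder
    corrections = np.zeros_like(base_logits)
    for action, features in memory.retrieve(state.current_node):
        # Each retrieved transition nudges the logit of the action it took.
        corrections[action] += memory_mlp(features)

    logits = base_logits + corrections                 # no gradient into the base policy
    probs = np.exp(logits - logits.max())              # stable softmax over combined logits
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)             # e.g. rng = np.random.default_rng()
```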
This work proposes MEMENTO, a memory-enhanced approach to improve the performance of neural combinatorial optimization methods for routing problems. The key idea of MEMENTO is to memorize the past attempts of the neural solvers and leverage them to adapt the solver's decisions at inference time. To be specific, it stores past solution-construction attempts and retrieves the information stored for the node the solver is currently at. It then builds and uses an auxiliary model to generate action logits conditioned on the past attempts as well as the remaining computational budget. Finally, the MEMENTO logits are added to the policy logits (generated by the original neural solver) to make the action decision. In this way, MEMENTO satisfies 4 desiderata: it is learnable, lightweight, agnostic to the model, and can leverage pre-trained memory-less policies.
The reviewers find this work well-motivated and well-written, the proposed MEMENTO method is novel and original, and the experimental results are thorough and strong. This paper initially received mixed ratings with major concerns on the clarity of proposed methods and figures, computational overhead, performance gain over simple baselines, and incorporation of budget constraints. The rebuttal successfully addressed most concerns, and two reviewers voted to accept this work, while the remaining two reviewers leaned toward weak acceptance.
I read the paper myself and fully agree with the reviewers that this work is a promising contribution and offers a potentially impactful approach to the neural combinatorial optimization community. Therefore, I recommend a clear acceptance of this work as a spotlight. I am also curious to see how MEMENTO will perform on problems other than TSP and CVRP, and whether MEMENTO can be incorporated with the recent (non-RL) construction-based neural CO solvers that achieved strong performance on much larger-scale instances.
Please make sure all the discussion and promised revisions are carefully incorporated into the final paper.