Solving Diverse Combinatorial Optimization Problems with a Unified Model
We explore whether a unified model with one single architecture and parameter set can be developed to solve diverse combinatorial optimization problems.
Abstract
Reviews and Discussion
This paper proposes a unified deep model to solve diverse CO problems. The model is motivated by the success of next-token-prediction in LLMs and is trained using a transformer backbone with tokenized data collected from problem solution trajectories. The main challenges in training such a model are the long token length due to the complex observation space of CO problems and the need to predict both observations and actions simultaneously. The paper introduces two key designs: a CO-prefix to reduce token length by aggregating static features of the problems, and a two-stage self-supervised learning scheme to account for the heterogeneity of state and action tokens within the MDP. The proposed model demonstrates robust problem-solving capabilities across nine diverse CO problems, as well as strong few-shot and zero-shot generalization abilities. This framework is expected to complement existing neural CO methods that focus on achieving optimal performance for individual CO problems.
Strengths
- This paper addresses an important and intriguing research topic.
- It appears that the performance of this paper surpasses that of Gato.
Weaknesses
- Although I have read them several times, I still cannot understand zero-shot and few-shot related parts.
- The method claims to benefit from two-stage training, but I do not understand the significant advantage of "Ours" over "Ours-DR."
- The report metrics include both score and time. Although the proposed method demonstrates a significant advantage in runtime, it does not achieve the best score on some tasks. For ATSP/MIS/FFSP, the baseline is random, making it difficult to assess the performance of the proposed method.
Questions
For Section 4.3,
- Should the second subplot in Figure 7 be ATSP?
- Why does the model have zero-shot capability, given that the state space and action space are different? The paper mentions, "Since TSP serves as a foundational version of many routing problem variants, our model, pre-trained on the other three problems, can directly generate semi-optimized solutions without any additional data for fine-tuning." I still cannot understand how such a solution could be constructed by the model.
- Could the definition of "epoch" in the paper be clarified further? Does each epoch use the same data?
- My understanding of "few-shot" typically refers to an extremely small amount of data. If I am not mistaken, each epoch in this paper actually contains many samples. Should this terminology be revised?
- I thought that for CO problems, solvers usually can find a solution based on a time budget. Do the expert method and existing methods involved in this paper have this capability?
- What is the bolding rule for Table 2?
Q6: What is the bolding rule for Table 2?
In our revised version of Table 2, the best results among all learning-based models are underlined, and the best results among all unified models are in bold.
References
[1] Schubert, I., et al. "A Generalist Dynamics Model for Control." arXiv 2023.
[2] Kwon, Y., et al. "POMO: Policy Optimization with Multiple Optima for Reinforcement Learning." NeurIPS 2020.
[3] Kool, W., et al. "Attention, learn to solve routing problems!" ICLR 2019.
[4] Kwon, Y., et al. "Matrix encoding networks for neural combinatorial optimization." NeurIPS 2021.
[5] Ahn, S., et al. "Learning what to defer for maximum independent sets." ICML 2020.
Q1: Should the second subplot in Figure 7 be ATSP?
We apologize for this mistake. The subplot title should be CVRP. We have updated Figure 7 as you noted.
Q2: Why does the model have zero-shot capability, given that the state space and action space are different? The paper mentions, "Since TSP serves as a foundational version of many routing problem variants, our model, pre-trained on the other three problems, can directly generate semi-optimized solutions without any additional data for fine-tuning." I still cannot understand how such a solution could be constructed by the model.
Thank you for your insightful question. The observed zero-shot capability of our model on TSP arises from the hierarchical relationship between TSP and other routing problem variants. Specifically, TSP has a minimal state space (city coordinates) and action space (city indices) compared to other routing problems, such as CVRP or OP, which include additional features like demands, capacities, or profits. These more complex problems inherently encapsulate TSP as a simplified case. Therefore, our model, trained on these more complex problems, learns to generalize to the reduced structure of TSP without requiring further fine-tuning. This is possible because the action space (selecting city indices) and the core problem structure (minimizing travel distance) of TSP are subsets of the richer, more diverse features encountered during pre-training on other tasks.
Q3: Could the definition of "epoch" in the paper be clarified further? Does each epoch use the same data?
Thank you for pointing this out. During training, each epoch consists of 400 batches, with each batch containing 128 trajectories. The trajectory data for each epoch is newly sampled from a mixed set of all 9 problems. Therefore, different epochs use different trajectories. We have revised the description of "epoch" in Section 4.2.
It is also important to note that, for each raw problem instance, multiple trajectories are collected as a form of data augmentation. Each trajectory that is finally fed into the model for training is a contiguous segment of the complete solution MDP. On average, the trajectories we process in each epoch correspond to approximately 500 raw problem instances. We apologize for the ambiguity in our previous manuscript, and have added the tokenization and data augmentation details in Appendix B.2.
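As an illustration of this segment-based augmentation, the following minimal Python sketch (the function name and parameters are our own, not from the paper's code) samples several contiguous token segments from one full solution trajectory:

```python
import random

def segment_trajectory(tokens, seg_len, n_segments):
    """Sample contiguous token segments from one full solution
    trajectory as a simple form of data augmentation."""
    if len(tokens) <= seg_len:
        return [tokens]
    starts = [random.randrange(0, len(tokens) - seg_len + 1)
              for _ in range(n_segments)]
    return [tokens[s:s + seg_len] for s in starts]

# Toy example: a 100-token trajectory split into four 32-token segments.
traj = list(range(100))
segs = segment_trajectory(traj, seg_len=32, n_segments=4)
```

Each raw instance thus yields several training sequences, which is one way to realize the roughly 500-instances-per-epoch figure mentioned above.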
Q4: My understanding of "few-shot" typically refers to an extremely small amount of data. If I am not mistaken, each epoch in this paper actually contains many samples. Should this terminology be revised?
Thank you for highlighting this concern. We apologize for the confusion in our previous manuscript. As discussed above, our model can adapt quickly to unseen problems after fine-tuning with minimal data. In each fine-tuning epoch, we use only around 500 problem instances, which constitutes approximately 0.67% of the data volume used for the main results reported in Table 2. Nevertheless, performance is already high after fine-tuning on such a small amount of data.
Such a design to evaluate the few-shot ability was also adopted in other unified and foundation models across multiple tasks, such as TDMs [1]. We intended to demonstrate how effectively our pre-trained model can generalize to unseen problems with minimal data. We hope this clarifies the intent behind the experiment and the terminology used. The revised analysis can be found in Section 4.4.
Q5: I thought that for CO problems, solvers usually can find a solution based on a time budget. Do the expert method and existing methods involved in this paper have this capability?
Most expert methods we used, such as Gurobi for OP and heuristic solvers for other problems, are capable of adjusting their solution quality based on a time budget. Specifically, exact methods (e.g., Gurobi) can theoretically guarantee the optimal solution if given unlimited time. In our experiments, we allowed Gurobi to run until it reached the true optimal solution. Meanwhile, non-exact heuristic methods do not guarantee the optimal solution even with extended runtime. Instead, we relied on implementations from previous literature [2,3] and allowed them to run until their standard convergence criteria were met. The only exception is that for FFSP, we used MatNet[4], a learning-based method, as the expert solver and it has no such capability.
As for the auto-regressive methods, including ours, GATO/DB1, and the other specialist baselines, they do not have this capability when using the greedy decoding strategy. The expert method choices for different CO problems are listed in Appendix A.
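For illustration, the difference between greedy and sampling decoding can be sketched as follows. This is a toy sketch, not the paper's implementation; `step_logits_fn` is a hypothetical stand-in for the model's per-step action logits:

```python
import numpy as np

def decode(step_logits_fn, steps, n_samples=1, greedy=True, seed=0):
    """Roll out a policy: greedy takes the argmax action at each step
    (one deterministic rollout); sampling draws n_samples independent
    stochastic rollouts, from which the best solution can be kept."""
    rng = np.random.default_rng(seed)
    n_rollouts = 1 if greedy else max(1, n_samples)
    rollouts = []
    for _ in range(n_rollouts):
        actions = []
        for t in range(steps):
            logits = step_logits_fn(actions, t)
            p = np.exp(logits - logits.max())
            p /= p.sum()
            a = int(p.argmax()) if greedy else int(rng.choice(len(p), p=p))
            actions.append(a)
        rollouts.append(actions)
    return rollouts
```

Greedy decoding produces a single solution with no quality/time trade-off, whereas x16 sampling trades 16x rollouts for better solutions, which is the setting reported in the revised Table 2.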
Dear reviewer 8fcW,
Thank you so much for your thoughtful review and overall positive comments. In general, we have made several major revisions to the manuscript as follows:
- Additional specialist NCO baselines in our main results in Table 2.
- Additional evaluation results with N=50 problem scales in Appendix C.1.
- Evaluation on the generalization ability to larger scales in Appendix C.2.
- Performance evaluation of our model with different problem combinations in Appendix C.3.
- Ablation studies on parameter scales and embedding dimensions in Appendix C.4.
- Details of tokenization and trajectory collection in Appendix B.
- A thorough revision to the entire paper to improve its presentation.
The updated sections are highlighted in blue for easy identification. We invite the reviewer to revisit these sections for further feedback.
Below we address the concerns specifically raised in your review:
W1: Although I have read them several times, I still cannot understand zero-shot and few-shot related parts.
We apologize for the ambiguity in our previous manuscript. Here, we clarify the concept of the experimental design. In Section 4.4 (in the new version), we selected four routing problems and trained four distinct unified models. Each model was trained using a leave-one-out approach, where one problem was excluded during the initial training phase and then gradually fine-tuned using data from the unseen problem. In each fine-tuning epoch, we used only 0.67% of the total data from the unseen problem, which is in line with a few-shot learning setup. We evaluated the performance of our model after such fine-tuning, which reflects the few-shot ability of our model. We have also updated Figure 7 by replacing the x-axis with the percentage of data used.
As for the zero-shot ability, we observe that a model that has never been trained or fine-tuned on TSP can directly generate solutions with approximately 48% optimality, using no TSP data at all. The reason is that the prefix and step token designs of TSP, which only include city coordinates, represent a subset of those of the more complex routing problems on which the model was pre-trained.
Such a design to evaluate the few-shot and zero-shot ability was also adopted in other unified and foundation models across multiple tasks, such as TDMs [1]. We intended to demonstrate how effectively our pre-trained model can generalize to unseen problems with minimal data. We hope this clarifies the intent behind the experiment and the terminology used. The revised analysis can be found in Section 4.4.
W2: The method claims to benefit from two-stage training, but I do not understand the significant advantage of "Ours" over "Ours-DR."
Thank you for your question about the advantage of our two-stage training process over the direct ablation baseline, "Ours-DR." The key distinction lies in how we address the heterogeneity of elements within the MDP trajectories for CO problems.
A full MDP trajectory consists of two distinct types of elements: states and actions. Training a model to predict both types simultaneously without differentiating their roles can increase complexity. Specifically:
- Predicting Observations is Easier: In deterministic CO environments, the next state is fully determined by the current state and action. This makes predicting observations a relatively straightforward task.
- Predicting Actions is Harder: Learning the policy, i.e., selecting the optimal next action given the current state, is significantly more challenging; learning this policy is the core problem in CO.
Our two-stage training design leverages this difference:
- Stage 1: Forward Dynamics Prediction: The model first learns to predict the easier objective (state transitions). This stage ensures the model gains a solid understanding of problem structure and dynamics, which simplifies subsequent learning tasks.
- Stage 2: Policy Generation: Once the forward dynamics are well-understood, the model focuses on the harder task of learning the optimal policy.
Results Supporting the Two-Stage Approach:
- Final Performance: As shown in Table 2 and the newly added Table 4, the two-stage training consistently results in better performance across all CO problems.
- Convergence Speed and Quality: Figure 6 highlights that "Ours" converges faster and achieves higher solution quality compared to "Ours-DR," underscoring the effectiveness of decomposing the training objectives.
By first tackling the easier component, our approach ensures a more efficient and stable learning process, ultimately benefiting the harder task of policy generation. This staged design helps to simplify the training landscape, improve convergence, and achieve superior performance.
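One simple way to realize such staged objectives is loss masking over the heterogeneous token types. The sketch below is our own illustration under that assumption, not the paper's implementation: the same per-token prediction losses are averaged over state tokens in stage 1 (forward dynamics) and over action tokens in stage 2 (policy):

```python
import numpy as np

def stage_loss(token_losses, is_action, stage):
    """Aggregate per-token losses for one training stage.

    stage 1: average over state (observation) tokens only,
             i.e., the forward-dynamics objective.
    stage 2: average over action tokens only,
             i.e., the policy objective.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    is_action = np.asarray(is_action, dtype=bool)
    mask = is_action if stage == 2 else ~is_action
    return float(token_losses[mask].mean())
```

With this decomposition, the transformer backbone and loss computation are shared across stages; only the mask changes, which keeps the two-stage scheme cheap to implement.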
W3: The report metrics include both score and time. Although the proposed method demonstrates a significant advantage in runtime, it does not achieve the best score on some tasks. For ATSP/MIS/FFSP, the baseline is random, making it difficult to assess the performance of the proposed method.
In addition to the baseline comparison with GATO/DB1, we have now included additional specialist learning baselines to provide a broader context for evaluating our model’s performance:
- TSP and Knapsack: POMO [2]
- CVRP, OP, PCTSP, SPCTSP: AM [3]
- ATSP, FFSP: MatNet [4] (used both as the expert solver and the learning baseline for FFSP)
- MIS: LwD [5]
The selected baselines are all auto-regressive NCO methods, which aligns with the foundational nature of our method.
Building on this expanded set of baselines, we also evaluated our model using 16-sample decoding (x16 sampling), in addition to the previous greedy decoding approach. We have included these new results in Table 2 and the table below, and have highlighted the best results—those marked with a * for all learning-based methods and in bold for the unified models.
The results with the sampling strategy show substantial improvements. Specifically, except for FFSP, all other problems achieve scores above 97.8%. Remarkably, our model even outperforms specialist learning baselines on 6 out of 9 problems, which was an unexpected and significant finding. We believe that these new results provide strong evidence of the performance gains achieved by our unified model.
Regarding the efficiency of our model, it is indeed slower compared to the specialist models. This is primarily due to the larger number of parameters in our model—75M parameters compared to 0.6M and 1.2M parameters for AM [2] and POMO [1], respectively. While we acknowledge that our model is less efficient in terms of inference time, we intentionally did not emphasize this as a key contribution of the paper. Instead, we focus on the generalizability and performance improvements that our model offers across a range of problems. For efficiency comparisons, we only provide them as a reference. The revised analysis can be found in Section 4.3.
| TSP | Knapsack | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| POMO single-traj | 3.84 | 0.07% | 99.98% | (22s) | 63.14 | 1.17% | 97.09% | (30s) | |
| POMO | 3.84* | 0.07%* | 99.99%* | (23s) | 63.79* | 0.16%* | 99.61%* | (31s) | |
| GATO/DB1-sampling | 3.86 | 0.49% | 99.70% | (15h) | 63.56 | 0.26% | 98.72% | (24m) | |
| Ours-sampling | 3.84* | 0.01%* | 99.99%* | (1h) | 63.53 | 0.56% | 98.60% | (8h) | |
| CVRP | OP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| AM-greedy | 6.38 | 4.40% | 96.12% | (7s) | 5.19 | 3.72% | 93.86% | (9s) | |
| AM-sampling | 6.29 | 2.96% | 97.40% | (14m) | 5.26 | 2.55% | 95.78% | (7m) | |
| GATO/DB1-sampling | 6.27 | 2.41% | 97.82% | (18h) | 5.30 | 1.56% | 97.42% | (10h) | |
| Ours-sampling | 6.27* | 2.40%* | 97.85%* | (2h) | 5.32* | 1.21%* | 98.01%* | (51m) | |
| PCTSP | SPCTSP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| AM-greedy | 3.18 | 0.85% | 99.57% | (13s) | 3.23 | -0.71% | 101.25% | (9s) | |
| AM-sampling | 3.16 | 0.13% | 99.97% | (12m) | 3.20 | -1.85% | 101.94% | (10m) | |
| GATO/DB1-sampling | 3.20 | 1.26% | 99.36% | (15h) | 3.28 | -0.90% | 100.47% | (16h) | |
| Ours-sampling | 3.15* | -0.27%* | 100.21%* | (2h) | 3.16* | -4.03%* | 102.89%* | (2h) | |
| ATSP | FFSP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| MatNet | 3.87 | 0.52% | 99.70% | (33s) | 27.31* | 0.00%* | 100.00%* | (30s) | |
| MatNet-augment | 3.85* | 0.03%* | 99.98%* | (7m) | - | - | - | - | |
| GATO/DB1-greedy | 10.47 | 171.95% | 0.30% | (32m) | 41.42 | 51.67% | 20.24% | (4h) | |
| GATO/DB1-sampling | 8.86 | 131.09% | 22.78% | (8h) | 41.01 | 50.16% | 22.56% | (65h) | |
| Ours-sampling | 3.96 | 3.04% | 98.15% | (4h) | 28.34 | 3.77% | 94.18% | (7h) | |
| MIS | |||||||||
| Method | Obj. | Gap | Score | Time | |||||
| LwD | 10.42 | 0.19% | 98.50% | (8m) | |||||
| GATO/DB1-greedy | 9.70 | 7.09% | 44.36% | (33m) | |||||
| GATO/DB1-sampling | 9.82 | 5.94% | 53.38% | (8h) | |||||
| Ours-sampling | 10.42* | 0.19%* | 98.50%* | (1h) |
I am generally satisfied with the revisions made to the paper and appreciate the author's rebuttal. I have increased the score.
Thank you again for your feedback and dedicated review!
This manuscript aims to develop a unified model for diverse combinatorial optimization problems, following the next-token-prediction concept. It first tokenizes different COPs with a CO-prefix that includes static features and trajectories. After tokenizing, it proposes a two-stage training paradigm in a self-supervised manner, learning to predict states and actions. Experiments are conducted on nine COPs, including TSP, KP, MIS, FFSP, etc., to evaluate the proposed method's zero-shot and few-shot generalization capabilities.
Strengths
- The idea of developing a unified model across diverse CO domains is interesting and will likely arouse the interest of the community.
- Code is provided.
Weaknesses
- This manuscript seems to be a rough draft. For instance, the appendix isn't finished; the format, figure indices, and references are in a great mess; the abstract is overly lengthy.
- The improvement of the proposed method over baselines is marginal, and the inference time remains high.
- There is a lack of baseline learning methods, and the experimental section would benefit from further expansion.
- The proposed method is only applied to COPs with a very small problem scale (N=20).
- The tokenization for different COPs still requires hand-crafted designs and the design seems tricky.
- It would be helpful to understand whether the selection of training problems affects the results. Have any experiments been conducted on this? The current problem set heavily features routing problems. It would be better to incorporate other problems.
- It would be better to add some experiments with different hyper-parameters.
- Adding examples to illustrate tokenization for different COPs would enhance clarity.
Questions
See weaknesses.
Details of Ethics Concerns
NA
Dear reviewer eH5T,
Thank you so much for your thoughtful review and overall positive comments. In general, we have made several major revisions to the manuscript:
- Additional specialist NCO baselines in our main results in Table 2.
- Additional evaluation results with N=50 problem scales in Appendix C.1.
- Evaluation on the generalization ability to larger scales with N=100 and 200 in Appendix C.2.
- Performance evaluation of our model with different problem combinations in Appendix C.3.
- Ablation studies on parameter scales and embedding dimensions in Appendix C.4.
- Details of tokenization and trajectory collection in Appendix B.
- A thorough revision to the entire paper to improve its presentation.
The updated sections are highlighted in blue for easy identification. We invite the reviewer to revisit these sections for further feedback.
Below we address the concerns specifically raised in your review:
W1. This manuscript seems to be a rough draft. For instance, the appendix isn't finished; the format, figure indices, and references are in a great mess; the abstract is overly lengthy.
Thank you for your valuable feedback. We apologize for the presentation issues in the initial version of the manuscript. We have thoroughly revised the entire manuscript, addressing the concerns about clarity and formatting. In particular, we have reorganized the appendix, corrected figure and table indices, and ensured proper referencing. Additionally, we have streamlined the abstract to make it more concise.
We have also added extensive experiments and additional explanations in the appendix to provide a more comprehensive understanding of our method. We hope the revisions meet the standards for clarity and completeness. Thank you again for your helpful suggestions.
W2. The improvement of the proposed method over baselines is marginal, and the inference time remains high.
In general, to better demonstrate the effectiveness of our method, we have further included additional specialist baselines, and added more inference results for comparison.
In addition to the baseline comparison with GATO/DB1, we have now included additional specialist learning baselines to provide a broader context for evaluation:
- TSP and Knapsack: POMO [1]
- CVRP, OP, PCTSP, SPCTSP: AM [2]
- ATSP, FFSP: MatNet [3] (used both as the expert solver and the learning baseline for FFSP)
- MIS: LwD [4]
The selected baselines are all auto-regressive NCO methods, which aligns with the foundational nature of our method.
Motivated by these baselines, we also evaluated our model using 16-sample decoding (x16 sampling), in addition to the previous greedy decoding approach. We have included these new results in Table 2 and the table below, and have highlighted the best results—those marked with a * for all learning-based methods and in bold for the unified models.
Results with the sampling strategy show substantial improvements. Except for FFSP, all other problems achieve scores above 97.8%. Remarkably, our model even outperforms specialist learning baselines on 6/9 problems, which was an unexpected and significant finding. Another significant improvement is the capability on ATSP, MIS, and FFSP compared to GATO/DB1. In our previous version, we did not report the performance of GATO/DB1 on these 3 problems, as it could hardly converge; we now include them as suggested by another reviewer. Considering that failures on these 3 problems may introduce too much noise for GATO/DB1 to learn effectively on the other 6, we trained two versions of GATO/DB1: one trained on all 9 problems and another trained only on the first 6, and for GATO/DB1 we report the better of the two results on each of the first 6 problems. Even with this carefully considered reporting, GATO/DB1 is still outperformed by our model on 8/9 problems. These new results provide strong evidence of the performance gains of our unified model.
Regarding the efficiency of our model, it is indeed slower compared to the specialist models. This is primarily due to the larger number of parameters in our model—75M parameters compared to 0.6M and 1.2M parameters for AM [2] and POMO [1], respectively. While we acknowledge that our model is less efficient in terms of inference time, we intentionally did not emphasize this as a key contribution of the paper. Instead, we focus on the generalizability and performance improvements that our model offers across a range of problems. The revised analysis can be found in Section 4.3.
| TSP | Knapsack | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| POMO single-traj | 3.84 | 0.07% | 99.98% | (22s) | 63.14 | 1.17% | 97.09% | (30s) | |
| POMO | 3.84* | 0.07%* | 99.99%* | (23s) | 63.79* | 0.16%* | 99.61%* | (31s) | |
| GATO/DB1-sampling | 3.86 | 0.49% | 99.70% | (15h) | 63.56 | 0.26% | 98.72% | (24m) | |
| Ours-sampling | 3.84* | 0.01%* | 99.99%* | (1h) | 63.53 | 0.56% | 98.60% | (8h) | |
| CVRP | OP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| AM-greedy | 6.38 | 4.40% | 96.12% | (7s) | 5.19 | 3.72% | 93.86% | (9s) | |
| AM-sampling | 6.29 | 2.96% | 97.40% | (14m) | 5.26 | 2.55% | 95.78% | (7m) | |
| GATO/DB1-sampling | 6.27 | 2.41% | 97.82% | (18h) | 5.30 | 1.56% | 97.42% | (10h) | |
| Ours-sampling | 6.27* | 2.40%* | 97.85%* | (2h) | 5.32* | 1.21%* | 98.01%* | (51m) | |
| PCTSP | SPCTSP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| AM-greedy | 3.18 | 0.85% | 99.57% | (13s) | 3.23 | -0.71% | 101.25% | (9s) | |
| AM-sampling | 3.16 | 0.13% | 99.97% | (12m) | 3.20 | -1.85% | 101.94% | (10m) | |
| GATO/DB1-sampling | 3.20 | 1.26% | 99.36% | (15h) | 3.28 | -0.90% | 100.47% | (16h) | |
| Ours-sampling | 3.15* | -0.27%* | 100.21%* | (2h) | 3.16* | -4.03%* | 102.89%* | (2h) | |
| ATSP | FFSP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| MatNet | 3.87 | 0.52% | 99.70% | (33s) | 27.31* | 0.00%* | 100.00%* | (30s) | |
| MatNet-augment | 3.85* | 0.03%* | 99.98%* | (7m) | - | - | - | - | |
| GATO/DB1-greedy | 10.47 | 171.95% | 0.30% | (32m) | 41.42 | 51.67% | 20.24% | (4h) | |
| GATO/DB1-sampling | 8.86 | 131.09% | 22.78% | (8h) | 41.01 | 50.16% | 22.56% | (65h) | |
| Ours-sampling | 3.96 | 3.04% | 98.15% | (4h) | 28.34 | 3.77% | 94.18% | (7h) | |
| MIS | |||||||||
| Method | Obj. | Gap | Score | Time | |||||
| LwD | 10.42 | 0.19% | 98.50% | (8m) | |||||
| GATO/DB1-greedy | 9.70 | 7.09% | 44.36% | (33m) | |||||
| GATO/DB1-sampling | 9.82 | 5.94% | 53.38% | (8h) | |||||
| Ours-sampling | 10.42* | 0.19%* | 98.50%* | (1h) |
W3. There is a lack of baseline learning methods, and the experimental section would benefit from further expansion.
Thank you for your comment. In our previous version, we mainly compared with GATO/DB1 that could solve multiple CO problems simultaneously. We have now reorganized and expanded Section 4 to provide a more comprehensive performance evaluation. As part of this expansion, we have now included additional baseline comparisons with specialist learning methods, as discussed in our response to W2. We believe this additional analysis strengthens the experimental section and provides a clearer context for evaluating the performance of our proposed model.
W4. The proposed method is only applied to COPs with a very small problem scale (N=20).
Thank you for your comment. We understand the concern regarding the problem scale in our original experiments. In response, we have conducted additional experiments to evaluate the performance of our model on larger problem scales.
- Performance on Larger Scales (N=50):
In addition to the main results where N=20, we now present results for problem scales of N=50 on five selected problems, as shown in Table 4. We selected five rather than nine problems due to time and resource limits during the rebuttal period. We also list the results below. These results demonstrate that our unified model performs consistently well even as the problem scale increases. Notably, our model outperforms the GATO/DB1 baseline and achieves performance comparable to single-model baselines. For the Knapsack problem, our model even outperforms the POMO baseline. These results highlight the scalability and robustness of our unified model, confirming that it is capable of handling problem instances at larger scales while maintaining high-quality performance across diverse CO problems. The complete analysis can be found in Appendix C.1.
| TSP | Knapsack | |||||||
|---|---|---|---|---|---|---|---|---|
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time |
| Random | 26.08 | - | 0.00% | (20s) | 85.31 | - | 0.00% | (1m) |
| Expert | 5.69 | 0.00% | 100.00% | (2h) | 161.99 | 0.00% | 100.00% | (26m) |
| POMO-singletraj | 5.73 | 0.70% | 99.80% | (37s) | 161.04 | 0.59% | 98.76% | (2m) |
| POMO | 5.70* | 0.10%* | 99.97%* | (1m) | 161.87 | 0.08% | 99.84% | (2m) |
| GATO/DB1-greedy | 6.25 | 9.86% | 97.22% | (4h) | 160.13 | 0.84% | 97.57% | (2h) |
| GATO/DB1-sampling | 5.96 | 4.64% | 98.68% | (62h) | 160.63 | 0.81% | 98.20% | (34h) |
| Ours-DR-greedy | 5.99 | 5.27% | 98.53% | (20m) | 160.36 | 1.01% | 97.80% | (6m) |
| Ours-greedy | 5.93 | 4.38% | 98.77% | (22m) | 160.68 | 0.81% | 98.20% | (7m) |
| Ours-sampling | 5.78 | 1.45% | 99.59% | (3h) | 161.93* | 0.04%* | 99.92%* | (1h) |
| CVRP | OP | |||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time |
| Random | 30.67 | - | 0.00% | (1m) | 3.14 | - | 0.00% | (1m) |
| Expert | 10.35 | 0.00% | 100.00% | (12h) | 16.59 | 0.00% | 100.00% | (5h) |
| AM-greedy | 10.97 | 5.88% | 97.00% | (20s) | 16.01 | 3.34% | 95.84% | (11s) |
| AM-sampling | 10.76* | 3.79%* | 98.06%* | (35m) | 16.55* | 1.61%* | 98.01%* | (12m) |
| GATO/DB1-greedy | 11.72 | 12.89% | 93.37% | (6h) | 14.66 | 11.57% | 85.68% | (3h) |
| GATO/DB1-sampling | 11.19 | 7.87% | 95.96% | (94h) | 15.91 | 4.08% | 94.94% | (49h) |
| Ours-DR-greedy | 11.68 | 12.82% | 93.42% | (34m) | 15.38 | 7.29% | 91.00% | (17m) |
| Ours-greedy | 11.61 | 12.14% | 93.77% | (34m) | 15.49 | 6.64% | 91.77% | (16m) |
| Ours-sampling | 11.06 | 6.80% | 96.50% | (5h) | 16.23 | 2.07% | 97.44% | (2h) |
| PCTSP | ||||||||
| Method | Obj. | Gap | Score | Time | ||||
| Random | 21.37 | - | 0.00% | (1m) | ||||
| Expert | 4.48 | 0.00% | 100.00% | (5h) | ||||
| AM-greedy | 4.58 | 2.30% | 99.37% | (13s) | ||||
| AM-sampling | 4.53* | 1.15%* | 99.69%* | (22m) | ||||
| GATO/DB1-greedy | 4.92 | 9.89% | 97.27% | (4h) | ||||
| GATO/DB1-sampling | 4.63 | 3.23% | 99.11% | (65h) | ||||
| Ours-DR-greedy | 4.79 | 6.92% | 98.16% | (29m) | ||||
| Ours-greedy | 4.76 | 6.30% | 98.27% | (29m) | ||||
| Ours-sampling | 4.54 | 1.36% | 99.63% | (4h) |
- Generalization to Larger Problem Sizes (N=100, N=200):
To further assess the model's generalization to larger scales, we used the pre-trained model from Table 2 and fine-tuned it on newly collected trajectory data for the TSP at scales N=100 and N=200. The fine-tuning was conducted for 10 epochs at each scale, and we compared the performance with the POMO baseline [1]. As shown in Figure 10, the results demonstrate that although POMO performs well without fine-tuning on larger scales, our model can adapt quickly. After just 3 and 5 fine-tuning epochs for N=100 and N=200, respectively, our unified model outperforms POMO. These results underscore the model’s ability to scale quickly owing to its efficient fine-tuning process. The detailed analysis is included in Appendix C.2.
W5. The tokenization for different COPs still requires hand-crafted designs and the design seems tricky.
Thank you for your comment. We understand the concern regarding the tokenization process for different CO problems. While the tokenization process does involve some design choices, these choices follow the principle of simplicity. The designs follow the previous learning-based NCO literature, in which one major contribution is to reduce hand-crafted rules and human effort.
- Data Processing and Token Design: Although our model can solve a wide variety of CO problems, it does require tokenization due to its reliance on a transformer-based architecture, which does not handle tokenization automatically as LLMs do. The token design in our approach follows a straightforward strategy: we aggregate the static information of each problem into the CO-prefix, while keeping the dynamic information in the per-step tokens. This division of static and dynamic features is simple and is based on established methods in the learning-based CO literature.
Specifically, the token-feature design for each problem follows the design principles laid out in previous work:
- TSP, VRP: Adopted from POMO [1] and AM [2]
- OP, PCTSP, SPCTSP: From AM [2]
- Knapsack: From POMO [1]
- FFSP: Based on MatNet [3]
- MIS and ATSP: Raw features, where the prefix token is the flattened adjacency matrix.
These designs are simple and were directly inspired by previous literature, which aimed to reduce the need for complex hand-crafted feature engineering. As a result, token and feature design for each problem can be easily adapted without significant manual adjustments.
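To make the static/dynamic split concrete, here is a minimal sketch of how one trajectory could be laid out as a token sequence. The separator IDs and the `tokenize` interface are illustrative assumptions of ours; the paper's exact layouts are given in its Appendices A and B.

```python
def build_trajectory_tokens(static_features, steps, tokenize):
    """Lay out one trajectory: CO-prefix first, then per-step tokens.

    static_features: problem-level info emitted once (e.g. node coordinates).
    steps: list of (dynamic_observation, expert_action) pairs.
    tokenize: maps a feature list to a list of token IDs.
    The separator IDs below are hypothetical placeholders.
    """
    SEP_PREFIX, SEP_OBS, SEP_ACT = 0, 1, 2
    seq = tokenize(static_features) + [SEP_PREFIX]   # static info, emitted once
    for obs, action in steps:
        seq += tokenize(obs) + [SEP_OBS]             # dynamic per-step state
        seq += [action, SEP_ACT]                     # expert action token
    return seq
```

The point of the layout is that the static block appears a single time at the front, so per-step tokens carry only what actually changes.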
- Tokenization Process:
The tokenization process itself is a general procedure that is applicable across all CO problems in our model. This is a common approach in many LLMs and foundation models, including GATO [6], where the continuous values are discretized and mapped to token IDs. The quantization process is adjustable and can be tailored to different problem distributions, ensuring flexibility and robustness.
In summary, while the tokenization process requires minimal design, it is based on straightforward, established methods and does not involve overly complex hand-crafted steps for each CO problem. The goal of this approach is to ensure generalizability and flexibility while minimizing the need for extensive manual design. To make this clearer, we have included specific token design examples for each CO problem in Appendix A. We have also added a dedicated section in Appendix B that details the tokenization process.
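As a concrete sketch of this discretize-and-map step, the following shows uniform quantization of continuous values into token IDs; the bin count, value range, and ID offset are illustrative assumptions rather than the paper's actual settings.

```python
def tokenize_continuous(values, low=0.0, high=1.0, n_bins=1024, offset=0):
    """Uniformly quantize continuous features in [low, high] into token IDs.

    n_bins and offset are hypothetical; offset shifts continuous tokens past
    any discrete-action token IDs so the two vocabularies do not collide.
    """
    out = []
    for v in values:
        v = min(max(v, low), high)                  # clip out-of-range values
        b = int((v - low) / (high - low) * n_bins)  # bin index
        out.append(min(b, n_bins - 1) + offset)     # fold v == high into last bin
    return out

def detokenize(token_ids, low=0.0, high=1.0, n_bins=1024, offset=0):
    """Invert the mapping up to quantization error (returns bin centers)."""
    return [low + (t - offset + 0.5) / n_bins * (high - low) for t in token_ids]
```

Adjusting `low`/`high` per feature is what allows the same quantizer to serve problems with different value scales.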
W6. It would be helpful to understand whether the selection of training problems affects the results. Have any experiments been conducted on this? The current problem set heavily features routing problems. It would be better to incorporate other problems.
Thank you for your valuable suggestion. To investigate the effect of training problem selection on our model’s performance, we have now conducted experiments with three different problem sets as suggested:
- All nine problems (including both routing and non-routing)
- Six routing problems (TSP, CVRP, OP, PCTSP, SPCTSP, ATSP)
- Three non-routing problems (Knapsack, MIS, FFSP)
The performance results from these experiments, evaluated using greedy decoding, are summarized in Table 5. We also list the results below.
Interestingly, we find that aggregating problem instances from structurally diverse problems can further boost the overall performance of the model. Except for ATSP, training on all nine problems together results in the best scores across all other problems compared to the other problem group combinations. These results align with previous research, such as GATO [6], which demonstrated that training on a more diverse set of tasks can improve the model's ability to generalize. Our experiments show that training on a mixture of problem types—both routing and non-routing—helps the unified model learn more robust features and generalize effectively across different types of combinatorial optimization problems.
In conclusion, the selection of training problems does influence the model's performance, and incorporating a diverse set of problem types enhances the generalization capacity of the unified model. This finding further strengthens our claim that a unified model can be trained to solve a broad range of CO problems. The newly added analysis can be found in Appendix C.3.
| | All Problems | | Routing Problems | | Non-Routing Problems | |
|---|---|---|---|---|---|---|
| | Obj. | Score | Obj. | Score | Obj. | Score |
| TSP | 3.87 | 99.55% | 4.03 | 96.75% | - | - |
| CVRP | 6.66 | 92.30% | 6.89 | 88.72% | - | - |
| PCTSP | 3.20 | 99.34% | 3.38 | 96.25% | - | - |
| OP | 5.06 | 90.72% | 4.74 | 80.02% | - | - |
| SPCTSP | 3.26 | 100.84% | 3.39 | 98.55% | - | - |
| ATSP | 4.22 | 94.43% | 4.11 | 95.80% | - | - |
| Knapsack | 61.99 | 92.62% | - | - | 61.95 | 92.47% |
| MIS | 10.35 | 93.23% | - | - | 10.28 | 87.97% |
| FFSP | 29.10 | 89.88% | - | - | 29.11 | 89.85% |
W7. It would be better to add some experiments for different hyper-parameters.
Thank you for the suggestion to explore the impact of hyper-parameters on our model's performance. To investigate this, we conducted experiments with different parameter scales, specifically adjusting the embedding dimensions of the transformer backbone, which directly affects the total number of parameters. These experiments were performed on five problems, and the results are summarized in Table 6. We also list the results below.
We observe that the performance of our model continues to improve as the total parameter scale increases. However, the rate of improvement gradually slows down when the total parameter scale reaches 75M and 131M, corresponding to embedding dimensions of 768 and 1024, respectively. Among these configurations, the model with 131M parameters outperforms the model with 75M parameters on three out of five problems.
While increasing the parameter scale generally improves performance, we find that further scaling the parameters beyond a certain point yields diminishing returns. This suggests that the current limitations are not solely related to parameter scale but may also be influenced by the number of problem types and the amount of data used for training. The results generally underscore the importance of balancing model capacity, problem diversity, and data size to maximize the potential of a unified model for combinatorial optimization problems. The newly added analysis can be found in Appendix C.4.
| | h=128 (2.7M params) | | h=256 (9M params) | | h=512 (34M params) | | h=768 (75M params) | | h=1024 (131M params) | |
|---|---|---|---|---|---|---|---|---|---|---|
| | Obj. | Score | Obj. | Score | Obj. | Score | Obj. | Score | Obj. | Score |
| TSP | 6.82 | 94.44% | 6.02 | 98.37% | 5.94 | 98.78% | 5.96 | 98.66% | 5.92 | 98.83% |
| CVRP | 12.55 | 89.13% | 11.75 | 93.12% | 11.59 | 93.87% | 11.59 | 93.89% | 11.68 | 93.41% |
| OP | 11.22 | 60.07% | 15.18 | 89.41% | 15.55 | 92.16% | 15.37 | 90.76% | 15.61 | 92.62% |
| PCTSP | 5.56 | 93.30% | 4.91 | 97.33% | 4.77 | 98.20% | 4.81 | 98.00% | 4.71 | 98.59% |
| Knapsack | 140.14 | 69.94% | 160.28 | 97.69% | 160.28 | 97.65% | 160.49 | 97.96% | 160.36 | 97.80% |
W8. Adding examples to illustrate tokenization for different COPs would enhance clarity.
Thank you for your valuable advice. As mentioned in W6, we have included specific token design examples for each CO problem in Appendix A. We have also added a dedicated section in Appendix B that details the tokenization process.
Reference
[1] Kwon, Y., et al. "Pomo: Policy optimization with multiple optima for reinforcement learning." NeurIPS 2020.
[2] Kool, W., et al. "Attention, learn to solve routing problems!" ICLR 2019.
[3] Kwon, Y., et al. "Matrix encoding networks for neural combinatorial optimization." NeurIPS 2021.
[4] Ahn, S., et al. "Learning what to defer for maximum independent sets." ICML 2020.
[5] Schubert I., et al. "A Generalist Dynamics Model for Control." Arxiv 2023.
[6] Reed, et al. "A Generalist Agent." Arxiv 2022.
I am still strongly concerned about the applicability of the proposed method, as its performance on even small problem sizes (N = 20, 50) exhibits no superiority compared to the most recent learning baselines. For example, its optimality gap on CVRP with 20 nodes is 2.40% and its inference time is 2h. Hence, I recommend rejection.
We respectfully disagree with the reviewer's assessment of the applicability of our approach. As we emphasize in Section 1 of our paper, the development of all CO algorithms, whether learning-based or not, must be understood within the context of the No Free Lunch Theorem [1], which asserts that no single optimization algorithm can be universally superior to all others across all possible problem instances. Since our work pursues the ability to generalize across a diverse range of CO problems, our unified model may inevitably exhibit optimization gaps relative to solvers specialized for individual problems.
The major concern raised by the reviewer is whether, given these occasional performance discrepancies, the significance and applicability of our work remain sufficient. We believe the answer is yes: the focus on developing a unified model is indeed of great significance. From a research perspective, as the first comprehensive study to investigate and develop a unified approach for solving diverse CO problems, our work is a critical step in advancing the field. It opens new avenues for the CO community to focus on developing foundational models capable of solving a wide range of CO problems, potentially paving the way for further breakthroughs in this domain. On the application side, a unified model significantly reduces the need for hand-crafted designs in both feature extraction and network architecture development for individual CO problems. Additionally, a unified model can quickly adapt to unseen problem types, eliminating the need to train individual models from scratch for each new problem. This results in a substantial reduction in model training and maintenance costs, a clear advantage in real-world usage.
Meanwhile, even in terms of per-problem comparisons, we believe the results presented in the revised version of the paper are compelling. As shown in Table 2, our model achieves performance competitive with all other baselines; in fact, it outperforms recent learning-based specialist methods on 6 out of 9 problems. The results in Figure 10 further demonstrate that our model exhibits superior generalization across scales compared to recent baselines [2]. Regarding computational efficiency, our model requires 16m/2h on 10,000 CVRP instances with a greedy/sampling strategy, as mentioned by the reviewer. While it is slightly slower than the corresponding learning-based methods, the increased time is a necessary trade-off for the enhanced generalization capability. Compared to non-learning expert solvers, our method shows clear advantages in inference speed.
In conclusion, we believe our research is of great significance in pioneering the development of unified models for solving a variety of CO problems. The performance of our model on individual problems is already strong, and its ability to generalize across problem types is a key advantage. We hope the reviewer can reconsider the evaluation and overall score of our work in light of these points.
Reference
[1] Wolpert, D., et al. "No free lunch theorems for optimization." IEEE Transactions on Evolutionary Computation, 1997.
[2] Kwon, Y., et al. "Pomo: Policy optimization with multiple optima for reinforcement learning." NeurIPS 2020.
Thank you for the response.
I acknowledge the value of this paper in developing a unified solver for multiple COPs, but the weaknesses of the proposed method are obvious:
- The inference time is excessively long when using sampling, with only marginal improvement compared to the results obtained with the greedy approach.
- Poor scalability: the optimality gap on CVRP with 20 nodes is 9.00% (greedy, 16 minutes) / 2.40% (sampling, 2 hours); on CVRP with 50 nodes, it is 12.14% (greedy, 34 minutes) / 6.80% (sampling, 5 hours); on TSP with 50 nodes, it is 4.38% (greedy, 22 minutes) / 1.45% (sampling, 3 hours). I don't think these poor results can be described as "occasional performance discrepancies." They seem to be consistently subpar. Furthermore, the performance using greedy is quite poor; although the author(s) argue that sampling improves performance, the inference time remains unacceptable, and the performance does not show significant promise (it can be surpassed by most recent learning or non-learning solvers). Additionally, the inference time appears to increase linearly with the problem scale, and the base time for smaller scales (e.g., 20 nodes) is already very high.
Therefore, I still lean towards rejection.
Although the reviewer acknowledges our paper's value in developing a unified model for CO problems, we believe the evaluation does not fully consider the unique goals and challenges of this topic. In particular, we feel that the No Free Lunch Theorem (NFLT) [1], which we emphasized in our manuscript and prior responses and which strongly shapes the appropriate evaluation standard, has not been adequately recognized in the reviewer's assessment.
The NFLT clearly states that no single optimization algorithm is superior to all others across all problem instances. Under this constraint, we consider the stated reasons for rejection, "no superiority compared to the most recent learning baselines" and "can be surpassed by most recent learning or non-learning solvers," to be unreasonable. In fact, as the first work to develop a unified solver without relying on problem-specific forms (such as a graph-structure hypothesis), our model is expected to deliver strong task generalization, while performance comparable to specialist solvers on each individual problem is adequate. Meanwhile, it already shows superiority on several problem types, which is largely beyond our expectations. Further criticism of the performance on the remaining problems relative to specialist solvers leads to an unfair evaluation, since it demands what the NFLT rules out. Specifically, we address the "not good enough" concern as follows:
- Optimality gap and scalability. While the reviewer criticizes our model for not universally outperforming specialist solvers, this critique overlooks the nature of unified problem solving under the NFLT. Developing a model that generalizes across CO problems inherently involves trade-offs in performance on individual problems. However, as shown in Table 2, our optimality gap is better than that of recent learning baselines on 6 out of 9 problems (0.011% vs. 0.013% on TSP, 2.40% vs. 2.96% on CVRP, 1.21% vs. 2.55% on OP, -0.27% vs. 0.13% on PCTSP, -4.03% vs. -1.85% on SPCTSP, and 0.192% vs. 0.194% on MIS against LwD), a majority of the total problem set. Such performance already exceeds expectation, while the overall picture aligns with the NFLT. As for scalability, we have shown strong scale generalization from N=20 to N=100/200 on TSP instances in Figure 10. Our model starts to outperform the recent learning baseline [2] after 3/5 epochs of fine-tuning (1500/2500 instances), showing strong performance on larger-scale instances after minimal fine-tuning.
- Inference time. The reviewer's concern about inference time does not account for the inherent trade-off between model versatility and computational efficiency. We analyze below why the current time cost is already close to the minimal requirement. For direct greedy decoding, the time complexity of constructive learning baselines and of our unified model is approximately $O(W \cdot N)$ and $O(W \cdot (N + S))$ respectively, since our model additionally generates state tokens to guarantee versatility across different problems. Here $W$ is the model parameter scale, $N$ is the problem/action token scale, and $S$ is the observation token scale, which our CO-prefix design has already reduced from $O(N^2)$ to $O(N)$. The remaining difference in inference speed mainly results from parameter scales, which differ by a large constant factor. Such a parameter budget is necessary, as discussed in the ablation study in Table 6, and aligns with best practices in foundation model research [3,4,5,6]. Within these necessary trade-offs, our model achieves inference speeds close to the minimal requirement: 9m vs. 22s on 10,000 TSP instances and 16m vs. 7s on 10,000 CVRP instances. With x16 sampling decoding, it takes 2h vs. 14m on 10,000 CVRP instances. Given the theoretical bounds and the nature of our model, these results are already close to optimal. Additionally, as noted by reviewers 2GjK and 8fcW, our model's inference efficiency is a significant advantage compared to non-learning-based solvers.
In conclusion, the performance gaps noted by the reviewer are well within the bounds predicted by the NFLT and reflect the inherent trade-offs of pursuing generalization. We strongly urge the reviewer to reconsider their evaluation in light of these bounds. Evaluating a unified model solely against specialist solvers on individual problems is inherently unfair and overlooks its broader contributions: the reduction of hand-crafted designs, cross-problem generalization, and the valuable insights it provides for future research in this domain.
Reference
[1] Wolpert, D., et al. "No free lunch theorems for optimization." IEEE Transactions on Evolutionary Computation, 1997.
[2] Kwon, Y., et al. "Pomo: Policy optimization with multiple optima for reinforcement learning." NeurIPS, 2020.
[3] Hao, M., et al. "Large-scale foundation model on single-cell transcriptomics." Nature Methods, 2024.
[4] Reed, et al. "A Generalist Agent." Arxiv 2022.
[5] Yuan, Y., et al. "UniST: A Prompt-Empowered Universal Model for Urban Spatio-Temporal Prediction." KDD, 2024.
[6] Hong, D., et al. "SpectralGPT: Spectral Remote Sensing Foundation Model." TPAMI, 2024.
This paper proposes a unified framework for addressing CO problems by modeling them as MDPs and using a two-stage self-supervised learning approach. The authors introduce a CO-prefix design to efficiently handle static information, improving model training. Tested across 9 CO problems, the model shows strong performance, approaching expert solutions and adapting well to unseen tasks with few-shot learning. This work demonstrates the potential of using a unified model to solve diverse CO problems.
Strengths
- This paper introduces a novel CO-prefix design that not only significantly reduces token lengths but also efficiently separates static and dynamic features, improving both the efficiency and scalability of the training process.
- The unified model proposed in this paper demonstrates the capability to solve different CO problems using a single framework, exhibiting robust generalization across diverse problem types. This versatility, including few-shot and zero-shot learning capabilities, highlights the model’s potential for broad application without the need for retraining.
- The authors conduct comprehensive experiments on 9 different CO problems, providing thorough evaluations that include ablation studies and meaningful comparisons with baseline models.
Weaknesses
- Limited Applicability to Non-Tail-Recursive Problems: Although this paper claims that the proposed method can address all CO problems, its approach is actually limited to problems with the tail recursion property, as noted in lines 239-243. Moreover, the paper lacks an explanation of how the approach could be adapted for CO problems without the tail recursion property. This limitation suggests an overstatement of the method's applicability.
- CO-Prefix Design Limitations in Dynamic Problems: While the CO-prefix design effectively reduces token lengths, its effectiveness may diminish for dynamic problems, such as online bin packing, which contain minimal static information. This notable limitation, however, is not discussed in the paper.
- Limited Baseline Comparisons: The baselines in this paper are relatively narrow. Including comparisons with neural solver methods specifically designed for tasks such as routing and MIS would provide a clearer view of the performance differences between the approach and specialized methods, helping to underscore the innovations presented in this work.
Questions
Please refer to the weaknesses.
Ethics Concerns Details
No
Dear reviewer tjFK,
Thank you for your thoughtful review and overall positive comments. Generally, to address the ambiguity for clarity and incorporate additional evaluations as suggested by all reviewers, we have made several major revisions to the manuscript:
- Additional specialist NCO baselines in our main results in Table 2.
- Additional evaluation results with N=50 problem scales in Appendix C.1.
- Evaluation on the generalization ability to larger scales in Appendix C.2.
- Performance evaluation of our model with different problem combinations in Appendix C.3.
- Ablation studies on parameter scales and embedding dimensions in Appendix C.4.
- Details of tokenization and trajectory collection in Appendix B.
- A thorough revision to the entire paper to improve its presentation.
The updated sections are highlighted in blue for easy identification. We invite the reviewer to revisit these sections for further feedback.
Below we address the concerns specially raised in your review:
W1: Limited Applicability to Non-Tail-Recursive Problems: Although this paper claims that the proposed method can address all CO problems, its approach is actually limited to problems with the tail recursion property, as noted in lines 239-243. Moreover, the paper lacks an explanation of how the approach could be adapted for CO problems without the tail recursion property. This limitation suggests an overstatement of the method's applicability.
Thank you for pointing out this concern. Upon revisiting our manuscript, we acknowledge that the phrasing in Section 1 may have led to some ambiguity. Specifically, the statement "the framework we propose can be applied to any CO problems where a feasible solution can be formulated as an MDP" was intended to convey that our approach applies to CO problems that can be framed as an MDP with the tail recursion property, but it inadvertently suggested broader applicability.
To clarify, we have revised this sentence to: "the framework we propose can be applied to any CO problems as long as a feasible solution can be formulated as an MDP," which better reflects the scope of the current work (line 120 in the revised draft). We agree with the reviewer that a more precise expression is important to avoid overstating the method's applicability.
As highlighted in the manuscript, our approach is primarily focused on CO problems with the tail-recursion property—those that can currently be formulated as MDPs. We do not claim that the method is universally applicable to all CO problems at this stage. The generalization to problems without the tail-recursion property remains a challenging open problem, which we acknowledge as part of our future work.
Once again, we thank the reviewer for their feedback, which has helped us reduce any ambiguity and improve the clarity of our manuscript.
W2: CO-Prefix Design Limitations in Dynamic Problems: While the CO-prefix design effectively reduces token lengths, its effectiveness may diminish for dynamic problems, such as online bin packing, which contain minimal static information. This notable limitation, however, is not discussed in the paper.
We thank the reviewer for raising this important point regarding dynamic problems. We agree that dynamic problems, such as online bin packing, present a unique challenge due to the lack of static features. In such cases, the CO-prefix design, which relies on static features to reduce token lengths, may indeed have limited effectiveness.
As the reviewer rightly notes, dynamic problems often involve state representations that change at each step without static features to aggregate in advance. For example, in online 3D binpacking, Zhao et al. [1] proposed using a Packing Configuration Tree (PCT) to represent the current packing state, where the internal nodes of the PCT describe the packing configurations of items, and the leaf nodes represent the potential placements for the current item. In this scenario, the action corresponds to selecting a leaf node, and since there are no static features, the CO-prefix remains empty, and only step tokens are used to process transitions.
In these cases, our unified model processes MDP transitions without the benefit of the CO-prefix, resulting in a simplified version of our approach, similar to the GATO framework, with only the two-stage training process. While this setup still works effectively for small-scale problems (as shown in Table 2 and Table 4), it becomes less efficient as the token length increases, especially for problems like 3D online bin packing, where the PCT descriptor grows linearly.
We have added this discussion to Appendix A.10 of the revised manuscript. Due to time and resource constraints, we were unable to evaluate the performance of our model without a CO-prefix on the online 3D bin packing problem at this stage, but we plan to conduct this experiment and include the results in future work.
W3: Limited Baseline Comparisons: neural solver baselines should be included.
We appreciate the reviewer's suggestion to expand the baseline comparisons. In our previous version, we mainly compared with GATO/DB1, which can solve multiple CO problems simultaneously. In response, we have now added comparisons with auto-regressive specialist NCO methods, which also use the MDP formulation for CO problems. Specifically, we include the following baselines:
- TSP and Knapsack: POMO [2]
- CVRP, OP, PCTSP, SPCTSP: AM [3]
- ATSP, FFSP: MatNet [4] (used both as the expert solver and the learning baseline for FFSP)
- MIS: LwD [5]
The selected baselines are all auto-regressive NCO methods, which aligns with the foundational nature of our method.
Additionally, we have extended the evaluation of our model and GATO/DB1 with an x16 sampling decoding strategy, as our previous results were based on greedy decoding. We have included the new results in Table 2 and summarize them below. Best results among all learning-based models are starred with *, and the best among all unified models are in bold.
- Our unified model shows generic problem solving ability, achieving performance comparable to specialist models.
- With greedy decoding, our model achieves scores above 90.7% on all problems, except for FFSP.
- With 16-sample decoding, the performance increases, reaching 97.8% except for FFSP.
- Notably, when using the sampling method, our model outperforms specialist learning baselines on 6 out of 9 problems.
We believe these additional comparisons provide a clearer view of the performance of our model and underscore the strength of our unified approach in solving a variety of CO problems.
| TSP | Knapsack | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| POMO single-traj | 3.84 | 0.07% | 99.98% | (22s) | 63.14 | 1.17% | 97.09% | (30s) | |
| POMO | 3.84* | 0.07%* | 99.99%* | (23s) | 63.79* | 0.16%* | 99.61%* | (31s) | |
| GATO/DB1-sampling | 3.86 | 0.49% | 99.70% | (15h) | 63.56 | 0.26% | 98.72% | (24m) | |
| Ours-sampling | 3.84* | 0.01%* | 99.99%* | (1h) | 63.53 | 0.56% | 98.60% | (8h) | |
| CVRP | OP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| AM-greedy | 6.38 | 4.40% | 96.12% | (7s) | 5.19 | 3.72% | 93.86% | (9s) | |
| AM-sampling | 6.29 | 2.96% | 97.40% | (14m) | 5.26 | 2.55% | 95.78% | (7m) | |
| GATO/DB1-sampling | 6.27 | 2.41% | 97.82% | (18h) | 5.30 | 1.56% | 97.42% | (10h) | |
| Ours-sampling | 6.27* | 2.40%* | 97.85%* | (2h) | 5.32* | 1.21%* | 98.01%* | (51m) | |
| PCTSP | SPCTSP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| AM-greedy | 3.18 | 0.85% | 99.57% | (13s) | 3.23 | -0.71% | 101.25% | (9s) | |
| AM-sampling | 3.16 | 0.13% | 99.97% | (12m) | 3.20 | -1.85% | 101.94% | (10m) | |
| GATO/DB1-sampling | 3.20 | 1.26% | 99.36% | (15h) | 3.28 | -0.90% | 100.47% | (16h) | |
| Ours-sampling | 3.15* | -0.27%* | 100.21%* | (2h) | 3.16* | -4.03%* | 102.89%* | (2h) | |
| ATSP | FFSP | ||||||||
| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time | |
| MatNet | 3.87 | 0.52% | 99.70% | (33s) | 27.31* | 0.00%* | 100.00%* | (30s) | |
| MatNet-augment | 3.85* | 0.03%* | 99.98%* | (7m) | - | - | - | - | |
| GATO/DB1-greedy | 10.47 | 171.95% | 0.30% | (32m) | 41.42 | 51.67% | 20.24% | (4h) | |
| GATO/DB1-sampling | 8.86 | 131.09% | 22.78% | (8h) | 41.01 | 50.16% | 22.56% | (65h) | |
| Ours-sampling | 3.96 | 3.04% | 98.15% | (4h) | 28.34 | 3.77% | 94.18% | (7h) | |
| MIS | |||||||||
| Method | Obj. | Gap | Score | Time | |||||
| LwD | 10.42 | 0.19% | 98.50% | (8m) | |||||
| GATO/DB1-greedy | 9.70 | 7.09% | 44.36% | (33m) | |||||
| GATO/DB1-sampling | 9.82 | 5.94% | 53.38% | (8h) | |||||
| Ours-sampling | 10.42* | 0.19%* | 98.50%* | (1h) |
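For reference, the greedy vs. x16-sampling decoding strategies compared above can be sketched as a best-of-n rollout loop; the `rollout`/`objective` interface below is our own illustrative assumption, not the model's actual API.

```python
import math
import random

def pick_action(logits, mode="greedy", rng=None):
    """Choose one action index from per-step logits: argmax or softmax sample."""
    if mode == "greedy":
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = [math.exp(l) for l in logits]
    z = sum(probs)
    return rng.choices(range(len(logits)), weights=[p / z for p in probs], k=1)[0]

def best_of_n(rollout, objective, n=16, seed=0):
    """x16-style sampling: run n stochastic rollouts and keep the best solution.

    rollout: builds one complete solution given an action-picking function.
    objective: lower is better (e.g. tour length).
    """
    rng = random.Random(seed)
    solutions = [rollout(lambda lg: pick_action(lg, "sampling", rng))
                 for _ in range(n)]
    return min(solutions, key=objective)
```

This makes the reported trade-off explicit: sampling costs roughly n times the greedy decoding time in exchange for keeping the best of n candidate solutions.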
Reference
[1] Zhao, H., et al. "Learning Efficient Online 3D Bin Packing on Packing Configuration Trees." ICLR 2022.
[2] Kwon, Y., et al. "Pomo: Policy optimization with multiple optima for reinforcement learning." NeurIPS 2020.
[3] Kool, W., et al. "Attention, learn to solve routing problems!" ICLR 2019.
[4] Kwon, Y., et al. "Matrix encoding networks for neural combinatorial optimization." NeurIPS 2021.
[5] Ahn, S., et al. "Learning what to defer for maximum independent sets." ICML 2020.
Dear reviewer tjFK,
Thanks to the rebuttal extension, we are now able to further evaluate our unified model on a fully dynamic problem, as mentioned in your review. We extended our model to the online 3D Bin Packing Problem (3D-BPP), following the settings of the learning-based Packing Configuration Tree (PCT) approach with RL [1].
In general, PCT employs a Packing Configuration Tree to represent the current packing state. The internal nodes describe the packing configurations of items, while the leaf nodes represent potential placements for the current item. In this context, the action corresponds to selecting a leaf node, and since this problem lacks static features, our CO-prefix remains empty; instead, we use step tokens to process transitions effectively. For expert supervision, we used PCT's publicly available implementation to generate MDP trajectories, and the state representation from PCT serves as the design for our step tokens. Regarding evaluation settings, we adopt Setting 2 and use Corner Point as the leaf node expansion scheme, as outlined in PCT [1]. For data generation, we randomly sample item lengths and set the bin length to 10. The objective is to maximize the bin utilization percentage until the next item can no longer be packed. The results are shown below.
| Method | Obj. | Gap | Score | Time |
|---|---|---|---|---|
| Random | 0.1739 | - | 0.00% | - |
| Expert | 0.8182 | 0.00% | 100.00% | (11m) |
| Ours-greedy | 0.8123 | 0.72% | 99.08% | (41m) |
| Ours-sampling | 0.8152 | 0.37% | 99.53% | (10h) |
Our model, trained on this dynamic CO problem, achieved 99.08% performance with a greedy decoding strategy and even higher scores when employing a sampling decoding strategy. This success highlights the versatility and robustness of our model, even when applied to entirely dynamic problems. Specifically:
- When static features exist, they are well-represented and effectively aggregated via the CO-prefix.
- When only dynamic features are present, our model continues to exhibit strong unified solving ability, maintaining its generalization across problem types.
Since only forum discussions are allowed at this stage, we plan to include these results and analyses in a future version of the manuscript. We sincerely hope that the concerns addressed in our previous response, along with these additional results, will encourage you to reconsider and possibly raise your overall score. We welcome any additional questions or suggestions you may have!
Reference
[1] Zhao, H., et al. "Learning Efficient Online 3D Bin Packing on Packing Configuration Trees." ICLR 2022.
This paper proposes a Transformer-based unified model for solving diverse combinatorial optimization (CO) problems. Inspired by previous works known as auto-regressive methods for neural combinatorial optimization (NCO), this paper formulates diverse CO problems as a Markov Decision Process (MDP) and generates optimal solutions in a sequential manner. However, unlike previous works that use reinforcement learning (RL) to generate an optimal sequence, this paper basically leverages imitation learning and focuses more on multi-task learning to solve diverse CO problems in a unified model. For multi-task imitation learning for CO, this paper first collects expert trajectories from an expert solver for each domain. Then, it trains a Transformer-based model on the collected expert trajectories using next-token prediction. For efficient multi-task imitation learning, this paper proposes two methods: (1) CO-prefix and (2) two-stage self-supervised learning. The key idea of CO-prefix is to consolidate static information as a prefix and maintain only dynamic information at each time step. This can reduce the redundancy in trajectories. In two-stage self-supervised learning, this paper first learns the dynamics of state transitions by predicting the next state, and then learns a policy by predicting an action given an observation. This paper evaluates the proposed model on 9 CO problems including TSP, CVRP, PCTSP, OP, SPCTSP, Knapsack, ATSP, MIS, and FFSP. The experiment results show that the proposed model provides better scores than GATO/DB1 on 6 of the 9 CO problems.
优点
S1. This paper empirically demonstrates that a Transformer-based model can learn to solve diverse CO problems by experimenting on representative 9 CO problems.
S2. The proposed methods, the CO-prefix and two-stage learning (dynamics learning followed by policy learning), seem simple but effective for multi-task imitation learning.
缺点
W1. [Method] One of the main limitations of this paper is that the proposed method may not address the generalizability needed to solve unseen CO problems. As mentioned in the Summary section of this review, this paper mainly focuses on imitating expert trajectories generated by an expert solver for each CO problem.
W2. [Method] This paper uses a unified tokenizer that converts discrete and continuous values in trajectories into tokens. However, I am not sure that this quantization is robust to the problem diversity. Different CO problems may have different scales in their values. Therefore, this kind of quantization may result in weak generalizability across different CO problems.
W3. [Experiments] This paper mainly compares the proposed model with Gato/DB1. However, the performance gain (i.e., the difference to an optimal solution) does not seem significant. According to Table 2, the more important thing seems that the proposed model is very efficient, significantly reducing inference time.
W4. [Experiments] Figure 5 (i.e., Performance on diverse problem types) may lead to misunderstanding. The authors did not perform experiments on ATSP, MIS, and FFSP, since Gato/DB1 can not properly process these CO problems. However, readers may think that there is a significant performance gap between Gato/DB1 and the proposed model on these CO problems.
W5. [Experiments] In Section 4.3 (i.e., Performances on Few-shot Ability), this paper provides experiment results on the effect of pre-training. However, the word "few-shot" in the title may lead to misunderstanding. This experiment seems more related to transfer learning rather than few-shot learning.
问题
Q1. The CO-prefix is interesting. Could you provide some examples of the CO-prefix for each CO problem? Such examples would help readers understand the proposed model more clearly.
Q2. This paper reports that GATO/DB1 can not properly process some CO problems such as ATSP, MIS, and FFSP. Could you provide more explanation for this?
W3. [Experiments] This paper mainly compares the proposed model with Gato/DB1. However, the performance gain (i.e., the difference to an optimal solution) does not seem significant. According to Table 2, the more important thing seems that the proposed model is very efficient, significantly reducing inference time.
In addition to the baseline comparison with GATO/DB1, we have now included additional specialist learning baselines to provide a broader context for evaluating our model’s performance:
- TSP and Knapsack: POMO [2]
- CVRP, OP, PCTSP, SPCTSP: AM [3]
- ATSP, FFSP: MatNet [4] (used both as the expert solver and the learning baseline for FFSP)
- MIS: LwD [5]
The selected baselines are all auto-regressive NCO methods, which aligns with the foundational nature of our method.
Building on this expanded set of baselines, we also evaluated our model using 16-sample decoding (x16 sampling), in addition to the previous greedy decoding approach. We have included these new results in Table 2 and the table below, and have highlighted the best results—those marked with a * for all learning-based methods and in bold for the unified models.
The results with the sampling strategy show substantial improvements. Specifically, except for FFSP, all other problems achieve scores above 97.8%. Remarkably, our model even outperforms specialist learning baselines on 6 out of 9 problems, which was an unexpected and significant finding. We believe that these new results provide strong evidence of the performance gains achieved by our unified model.
Regarding the efficiency of our model, it is indeed slower compared to the specialist models. This is primarily due to the larger number of parameters in our model—75M parameters compared to 0.6M and 1.2M parameters for AM [3] and POMO [2], respectively. While we acknowledge that our model is less efficient in terms of inference time, we intentionally did not emphasize this as a key contribution of the paper. Instead, we focus on the generalizability and performance improvements that our model offers across a range of problems. For efficiency comparisons, we only provide them as a reference. The revised analysis can be found in Section 4.3.
**TSP / Knapsack**

| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time |
|---|---|---|---|---|---|---|---|---|
| POMO single-traj | 3.84 | 0.07% | 99.98% | (22s) | 63.14 | 1.17% | 97.09% | (30s) |
| POMO | 3.84* | 0.07%* | 99.99%* | (23s) | 63.79* | 0.16%* | 99.61%* | (31s) |
| GATO/DB1-sampling | 3.86 | 0.49% | 99.70% | (15h) | 63.56 | 0.26% | 98.72% | (24m) |
| Ours-sampling | 3.84* | 0.01%* | 99.99%* | (1h) | 63.53 | 0.56% | 98.60% | (8h) |

**CVRP / OP**

| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time |
|---|---|---|---|---|---|---|---|---|
| AM-greedy | 6.38 | 4.40% | 96.12% | (7s) | 5.19 | 3.72% | 93.86% | (9s) |
| AM-sampling | 6.29 | 2.96% | 97.40% | (14m) | 5.26 | 2.55% | 95.78% | (7m) |
| GATO/DB1-sampling | 6.27 | 2.41% | 97.82% | (18h) | 5.30 | 1.56% | 97.42% | (10h) |
| Ours-sampling | 6.27* | 2.40%* | 97.85%* | (2h) | 5.32* | 1.21%* | 98.01%* | (51m) |

**PCTSP / SPCTSP**

| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time |
|---|---|---|---|---|---|---|---|---|
| AM-greedy | 3.18 | 0.85% | 99.57% | (13s) | 3.23 | -0.71% | 101.25% | (9s) |
| AM-sampling | 3.16 | 0.13% | 99.97% | (12m) | 3.20 | -1.85% | 101.94% | (10m) |
| GATO/DB1-sampling | 3.20 | 1.26% | 99.36% | (15h) | 3.28 | -0.90% | 100.47% | (16h) |
| Ours-sampling | 3.15* | -0.27%* | 100.21%* | (2h) | 3.16* | -4.03%* | 102.89%* | (2h) |

**ATSP / FFSP**

| Method | Obj. | Gap | Score | Time | Obj. | Gap | Score | Time |
|---|---|---|---|---|---|---|---|---|
| MatNet | 3.87 | 0.52% | 99.70% | (33s) | 27.31* | 0.00%* | 100.00%* | (30s) |
| MatNet-augment | 3.85* | 0.03%* | 99.98%* | (7m) | - | - | - | - |
| GATO/DB1-greedy | 10.47 | 171.95% | 0.30% | (32m) | 41.42 | 51.67% | 20.24% | (4h) |
| GATO/DB1-sampling | 8.86 | 131.09% | 22.78% | (8h) | 41.01 | 50.16% | 22.56% | (65h) |
| Ours-sampling | 3.96 | 3.04% | 98.15% | (4h) | 28.34 | 3.77% | 94.18% | (7h) |

**MIS**

| Method | Obj. | Gap | Score | Time |
|---|---|---|---|---|
| LwD | 10.42 | 0.19% | 98.50% | (8m) |
| GATO/DB1-greedy | 9.70 | 7.09% | 44.36% | (33m) |
| GATO/DB1-sampling | 9.82 | 5.94% | 53.38% | (8h) |
| Ours-sampling | 10.42* | 0.19%* | 98.50%* | (1h) |
W4. [Experiments] Figure 5 (i.e., Performance on diverse problem types) may lead to misunderstanding. The authors did not perform experiments on ATSP, MIS, and FFSP, since Gato/DB1 can not properly process these CO problems. However, readers may think that there is a significant performance gap between Gato/DB1 and the proposed model on these CO problems.
The reason we did not report GATO/DB1 in the previous manuscript was indeed that it struggled to converge effectively on ATSP, MIS, and FFSP under the given evaluation settings. To clarify and eliminate any ambiguity, we have now included these results in Table 2 and Figure 5 of the current version, as well as in the table below. Note that the performance scores on diverse problem types are clipped to a bounded range in Figure 5.
As for the results on the other 6 problems where GATO/DB1 could converge successfully, the trajectories of ATSP, MIS, and FFSP may have been too noisy for GATO/DB1 to learn from effectively when trained jointly on all problems. To address this, we trained two versions of GATO/DB1: one on all 9 problems and another on only the first 6 problems. For transparency, we report the better result of the two versions on each of the first 6 problems.
Results show that in either version, GATO/DB1 is outperformed by our unified model, which further demonstrates the superiority of our approach. The revised description of GATO/DB1's evaluation setting can be found in Section 4.3.
W5. [Experiments] In Section 4.3 (i.e., Performances on Few-shot Ability), this paper provides experiment results on the effect of pre-training. However, the word "few-shot" in the title may lead to misunderstanding. This experiment seems more related to transfer learning rather than few-shot learning.
We apologize for the ambiguity in the experimental design presented in our previous manuscript and clarify it here. In Section 4.4 (in the new version), we selected four routing problems and trained four distinct unified models. Each model was trained using a leave-one-out approach, where one problem was excluded during the initial training phase and then gradually fine-tuned using data from the unseen problem. In each fine-tuning epoch, we used only 0.67% of the total data from the unseen problem, which is in line with a few-shot learning setup. We evaluated the performance of our model after such fine-tuning, which reflects its few-shot ability. We have also updated Figure 7 by replacing the x-axis with the percentage of data used.
Such a design for evaluating few-shot ability was also adopted in other unified and foundation models across multiple tasks, such as TDMs [6]. We intended to demonstrate how effectively our pre-trained model can generalize to unseen problems with minimal data. We hope this clarifies the intent behind the experiment and the terminology used. The revised analysis can be found in Section 4.4.
Q1. CO-prefix is interesting. Could you provide some examples of CO-prefix for each CO problems? Those examples will help reader to understand the proposed model more clearly.
Thank you for your suggestion. To clarify the CO-prefix design for each problem, we have now included detailed examples in Appendix A of the revised manuscript. In this section, we provide details for both the CO-prefix and step token design for each CO problem. Furthermore, we also provide the tokenization details of CO-prefix in Appendix B. These examples should help readers better understand the structure and role of the CO-prefix in our model.
Q2. This paper reports that GATO/DB1 can not properly process some CO problems such as ATSP, MIS, and FFSP. Could you provide more explanation for this?
Thank you for your question. GATO/DB1 struggles with problems like ATSP, FFSP, and MIS due to its inability to handle large observation spaces efficiently and the absence of a structured CO-prefix. Without a prefix design, GATO/DB1 must tokenize the full observation at every step, which becomes computationally inefficient, especially when the observation space is large.
For instance, in both ATSP and MIS, the static information is represented by an adjacency matrix, which has O(N^2) entries for an N-node graph. This makes the model's processing at each step highly inefficient: GATO/DB1 can only fit one or two complete trajectory steps per training episode, and its convergence is significantly slower due to the problem's large and sparse observation space.
In comparison, our approach, which incorporates a structured CO-prefix, allows us to inject more relevant prior knowledge into the model and avoid processing the full observation at each step. This enables our model to handle these problems more efficiently, leading to better convergence. We also observe that, even for the remaining problems, GATO/DB1 converges slower than our model across all tasks, as shown in Figure 6. The revised analysis can be found in Section 4.3.
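The token-count gap described above can be made concrete with simple arithmetic. The sketch below is illustrative only: the symbols (number of nodes, number of steps, dynamic tokens per step) are assumptions, not values taken from the paper.

```python
# Illustrative comparison (not from the paper): re-encoding the full N x N
# adjacency matrix at every step (GATO/DB1-style) vs. encoding it once as a
# CO-prefix and carrying only dynamic tokens per step.

def tokens_without_prefix(n_nodes: int, n_steps: int, dyn_per_step: int) -> int:
    """Every step repeats the N*N static adjacency tokens plus dynamic tokens."""
    return n_steps * (n_nodes * n_nodes + dyn_per_step)

def tokens_with_prefix(n_nodes: int, n_steps: int, dyn_per_step: int) -> int:
    """Static adjacency is tokenized once as a prefix; steps carry only dynamics."""
    return n_nodes * n_nodes + n_steps * dyn_per_step

# For a hypothetical 20-node ATSP instance with 20 steps and 4 dynamic
# tokens per step:
# tokens_without_prefix(20, 20, 4) -> 8080
# tokens_with_prefix(20, 20, 4)   -> 480
```

Even at this small scale the per-trajectory sequence length differs by more than an order of magnitude, which is consistent with the convergence gap discussed above.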
Reference
[1] Reed, S., et al. "A Generalist Agent." arXiv 2022.
[2] Kwon, Y., et al. "POMO: Policy Optimization with Multiple Optima for Reinforcement Learning." NeurIPS 2020.
[3] Kool, W., et al. "Attention, Learn to Solve Routing Problems!" ICLR 2019.
[4] Kwon, Y., et al. "Matrix Encoding Networks for Neural Combinatorial Optimization." NeurIPS 2021.
[5] Ahn, S., et al. "Learning What to Defer for Maximum Independent Sets." ICML 2020.
[6] Schubert, I., et al. "A Generalist Dynamics Model for Control." arXiv 2023.
Dear reviewer 2GjK,
Thank you so much for your thoughtful review and overall positive comments. Generally, to address the ambiguity for clarity and incorporate additional evaluations as suggested by all four reviewers, we have made several major revisions to the manuscript:
- Additional specialist NCO baselines in our main results in Table 2.
- Additional evaluation results with N=50 problem scales in Appendix C.1.
- Evaluation on the generalization ability to larger scales in Appendix C.2.
- Performance evaluation of our model with different problem combinations in Appendix C.3.
- Ablation studies on parameter scales and embedding dimensions in Appendix C.4.
- Details of tokenization and trajectory collection in Appendix B.
- A thorough revision to the entire paper to improve its presentation.
The updated sections are highlighted in blue for easy identification. We invite the reviewer to revisit these sections for further feedback.
Below we address the concerns specially raised in your review:
W1. [Method] One of main limitations of this paper is that the proposed method may not address the generalizability that aims to solve unseen CO problems. As mentioned in the Summary section of this review, this paper mainly focuses on imitating expert trajectories generated by an expert solver for each CO problem.
We appreciate the reviewer’s concern regarding generalizability to unseen CO problems. In fact, pre-training on problems with expert trajectories helps the model generalize quickly to unseen problems. We have evaluated the few-shot generalization ability of our model on unseen problems in Section 4.4. Specifically, we selected four routing problems and trained four distinct unified models. Each model was trained using a leave-one-out approach, where one problem was excluded during the initial training phase and then gradually fine-tuned using data from the unseen problem. In each fine-tuning epoch, we used only 0.67% of the total data from the unseen problem. The results, presented in Figure 7, compare the performance of our model with that of a model trained from scratch on the corresponding problem.
Results demonstrate that our model exhibits strong few-shot generalization capabilities across all four problem settings, even with minimal data. Notably, the model achieves high solution quality after just one epoch of fine-tuning using only 0.67% of the available data. While our model primarily imitates expert trajectories during pre-training, these fine-tuning results show that the pre-trained unified model can be quickly adapted to an unseen problem. Such a design to evaluate the few-shot ability was also adopted in other unified and foundation models across multiple tasks, such as TDMs[6].
We believe this ability to adapt to unseen problems with limited data represents a major advantage of our approach, further extending its practical applicability.
W2. [Method] This paper uses a unified tokenizer that converts discrete and continuous values in trajectories into tokens. However, I am not sure that this quantization is robust to the problem diversity. Different CO problems may have different scales in their values. Therefore, this kind of quantization may result in weak generalizability across different CO problems.
Our tokenization method is inspired by the approach used in GATO [1], and its robustness comes from two key aspects:
- **Mu-law Encoding and Adjustable Hyperparameters:** To handle continuous values, we apply mu-law encoding to map them into a fixed range before assigning token IDs. This encoding, along with the assignment of token IDs for both discrete and continuous values, is highly flexible. The hyperparameters of mu-law, as well as the mapping of values to token IDs, can be adjusted during training to better fit the distribution of values specific to each problem. This allows our model to adapt to the unique scale and distribution of values across different CO problems.
- **Linear Transformations for Inference:** In many CO problems, the relationship between problem features is linear. During inference, as long as the problem is linear, even when the input instance’s values differ from the training distribution, we can apply a linear transformation to normalize the features before feeding them into the model. While inferring instances from a different distribution may lead to some performance degradation, this issue can be mitigated by further fine-tuning on the new data, allowing the model to adapt more effectively to the new problem instance.
Thus, the flexibility of our tokenization—via adjustable hyperparameters and linear transformations for linear problems—ensures that the model remains adaptable across different CO problems and scales. We believe this approach strikes a balance between robustness and generalizability. More details of our tokenization approach can be found in Appendix B.
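A GATO-style mu-law tokenizer of the kind described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the hyperparameters (`mu`, `m`, the bin count) and the bin-to-token mapping are assumptions chosen for the example.

```python
import math

def mu_law(x: float, mu: float = 100.0, m: float = 256.0) -> float:
    """Mu-law companding: compress a continuous value into [-1, 1].

    Hyperparameters mu and m are illustrative and would be tuned per the
    value distribution of each CO problem, as described above.
    """
    y = math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)
    return max(-1.0, min(1.0, y))

def to_token(x: float, n_bins: int = 1024, offset: int = 0) -> int:
    """Discretize the companded value into one of n_bins token IDs.

    `offset` is a hypothetical base ID to place continuous-value tokens
    after any discrete-value vocabulary.
    """
    y = (mu_law(x) + 1.0) / 2.0  # map [-1, 1] -> [0, 1]
    return offset + min(int(y * n_bins), n_bins - 1)

# With the defaults above, to_token(0.0) -> 512 (the middle bin), and the
# mapping is monotone: to_token(-1.0) < to_token(0.0) < to_token(1.0).
```

Because the companding is monotone and invertible, adjusting `mu` or rescaling inputs with a linear transformation (as discussed for inference) only shifts how bins are allocated, without breaking the ordering of values.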
Thank you for providing thoughtful responses to my comments. I could understand this paper more. I maintain my initial rating 6, and slightly raised the contribution score (2 to 3) and presentation score (2 to 3).
Thank you again for your feedback and thoughtful review!
Dear reviewer 2GjK,
For your information, we have now extended our unified model to evaluate its performance on a fully dynamic problem, as highlighted by another reviewer, thanks to the expanded rebuttal period. We believe these additional results further validate the strong generalization capability of our model, and we would like to share them with you as well.
We extended our model to the online 3D Bin Packing Problem (3D-BPP), following the settings of the learning-based Packing Configuration Tree (PCT) approach with RL [1]. In general, PCT employs a Packing Configuration Tree to represent the current packing state. The internal nodes describe the packing configurations of items, while the leaf nodes represent potential placements for the current item. In this context, the action corresponds to selecting a leaf node, and since this problem lacks static features, our CO-prefix remains empty. Instead, we use step tokens to process transitions effectively. For expert supervision, we used PCT’s publicly available implementation to generate MDP trajectories. The state representation from PCT serves as the design for our step tokens. As for data generation, we randomly select item lengths from and set the bin length to 10. The objective is to maximize the bin utilization percentage until the next item cannot be further packed. The results are shown below.
| Method | Obj. | Gap | Score | Time |
|---|---|---|---|---|
| Random | 0.1739 | - | 0.00% | - |
| Expert | 0.8182 | 0.00% | 100.00% | (11m) |
| Ours-greedy | 0.8123 | 0.72% | 99.08% | (41m) |
| Ours-sampling | 0.8152 | 0.37% | 99.53% | (10h) |
Our model, trained on this dynamic CO problem, achieved 99.08% performance with a greedy decoding strategy and even higher scores when employing a sampling decoding strategy. This success highlights the versatility and robustness of our model, even when applied to entirely dynamic problems. Specifically:
- When static features exist, they are well-represented and effectively aggregated via the CO-prefix.
- When only dynamic features are present, our model continues to exhibit strong unified solving ability, maintaining its generalization across problem types.
Since only forum discussions are allowed at this stage, we plan to include these results and analyses in a future version of the manuscript. We sincerely hope that the concerns addressed in our previous response, along with these additional results, will encourage you to reconsider and possibly raise your overall score. We welcome any additional questions or suggestions you may have!
Reference
[1] Zhao, H., et al. "Learning Efficient Online 3D Bin Packing on Packing Configuration Trees." ICLR 2022.
This paper presents a unified model for combinatorial optimization problems. The authors' claim is that by leveraging a CO-prefix design and two-stage self-supervised learning, the model can solve a diverse range of CO problems with good generalization capabilities. The model was evaluated on nine problems, showing some promise in performance and the ability to handle different problem types.
However, the paper has several major flaws. While the idea of a unified model is interesting, the novelty of the proposed techniques is not well-established. The performance improvements over baselines are marginal in many cases, and the inference time is a concern. There are also issues with the clarity of the manuscript and the experimental setup, such as the limited scope of baseline comparisons and the lack of in-depth analysis of the model's behavior on different problem scales. Overall, the paper does not meet the standards for acceptance as it fails to convincingly demonstrate the superiority and practicality of the proposed approach.
审稿人讨论附加意见
During the rebuttal period, the authors addressed some of the reviewers' concerns by adding more experiments and improving the clarity of the manuscript. They included additional baselines and evaluated the model on larger problem scales. However, these efforts did not fully address the fundamental weaknesses. The added experiments did not show a significant improvement in performance, and the issues with inference time and the lack of clear novelty remained. In the final decision, the weaknesses continued to outweigh the improvements, leading to the rejection.
Reject