Solving the Asymmetric Traveling Salesman Problem via Trace-Guided Cost Augmentation
Summary
Reviews and Discussion
This paper proposes a continuous relaxation of the ATSP's discrete variables and uses differentiable directed-acyclic-graph techniques together with an assignment-problem relaxation to enable gradient-based optimization. Since the relaxation is non-convex, the paper develops an adaptive-step-size projected gradient descent to find high-quality solutions; for cases where the relaxed optimization does not yield a valid tour, a simple greedy post-processing heuristic is introduced to eliminate subtours and construct feasible solutions.
Strengths and Weaknesses
Strengths
- The proposed method is reasonable and sound.
- Extensive experiments show that the proposed method generally outperforms LKH-3 on ATSP benchmarks and generalizes well across problem sizes.
Weaknesses
- It is unclear how often the method finds a valid solution directly versus finding one only after further continuous iterations.
- No ablation experiments are included to analyze the effectiveness of the proposed method components.
- The authors did not report the results without 2-opt.
- The article mentions that Gurobi and Concorde are not included as baselines because they run very slowly on large scales. What about small-scale problems?
Questions

- What are the results on ATSPLIB [1] and real-world ATSP datasets [2]?
- Table 7 is about small-scale symmetric TSP problems. How does the method perform on larger-scale symmetric TSP problems?
- Why is the proposed algorithm not as good as GOAL on ATSP20 (Table 2)?
- Why does the proposed algorithm perform worse than LKH-3 on small-scale problems but better on large-scale ones (Table 1)?
- The method in this paper does not involve learning. Why mention learning-based methods separately in the related work?
- Does the initial solution have a large impact on the solution quality?
- The headers of Tables 1 to 6 should say ATSP instead of TSP. Please do not omit this; it will confuse people at first glance, especially since Table 7 is about TSP.
- Also see the weaknesses.

[1] http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/atsp/
[2] Neural Combinatorial Optimization for Real-World Routing, arXiv 2025.
Limitations
Yes.
Final Justification
I thank the authors for their detailed response, which has addressed most of my concerns. Additionally, I would appreciate it if the author could add a discussion section on how to combine the proposed method with learning-based methods in the field of neural combinatorial optimization. Overall, it is an interesting work, and I would like to increase the score to 4. Nevertheless, since I am not particularly familiar with the field of this paper, I’ll defer to the AC and other reviewers.
Formatting Issues
Table captions should be placed above the tables, but this paper places them in the wrong position.
Thank you for your insightful comments. We appreciate the opportunity to clarify several aspects of our work. Below, we address the concerns raised by the reviewers and provide additional explanations and evidence where appropriate.
Results on real-world datasets
We evaluated our algorithm on large-scale ATSP benchmark instances from TSPLIB, particularly those derived from stacker crane problems. Our method consistently outperforms LKH-3 (1–100) in terms of solution quality while also requiring less computational time. These results demonstrate the practical effectiveness of our approach on real-world structured benchmarks, further validating its scalability and competitiveness beyond synthetic instances.
| Instance | Best Known Result | Algorithm | Running Time (s) | Results |
|---|---|---|---|---|
| rbg323 | 1326 | LKH(1–100) | 1.46 | 1388 |
|  |  | LKH(1–10000) | 3.38 | 1346 |
|  |  | LKH(10–10000) | 20.32 | 1346 |
|  |  | Ours | 0.42 | 1360 |
|  |  | Ours (no 2-opt) | 0.20 | 1365 |
| rbg358 | 1163 | LKH(1–100) | 1.67 | 1294 |
|  |  | LKH(1–10000) | 4.03 | 1175 |
|  |  | LKH(10–10000) | 25.16 | 1175 |
|  |  | Ours | 0.09 | 1180 |
|  |  | Ours (no 2-opt) | 0.04 | 1180 |
| rbg403 | 2465 | LKH(1–100) | 6.27 | 2536 |
|  |  | LKH(1–10000) | 26.99 | 2498 |
|  |  | LKH(10–10000) | 9.06 | 2498 |
|  |  | Ours | 1.53 | 2473 |
|  |  | Ours (no 2-opt) | 1.28 | 2473 |
| rbg443 | 2720 | LKH(1–100) | 5.44 | 2813 |
|  |  | LKH(1–10000) | 7.87 | 2762 |
|  |  | LKH(10–10000) | 30.06 | 2756 |
|  |  | Ours | 0.52 | 2760 |
|  |  | Ours (no 2-opt) | 0.27 | 2760 |
Q: Why does the proposed algorithm perform worse than LKH-3 on small-scale problems, but better on large-scale problems (Table 1)?
The performance difference arises primarily from the nature of the search strategies employed by our method and LKH-3. LKH-3 is a randomized heuristic algorithm whose performance depends on the number of search trials specified by its hyperparameters. On small-scale problems, the search space is relatively small, so even a modest number of trials allows LKH-3 to explore a substantial portion of the space, leading to high-quality solutions.
In contrast, our method can be viewed as a guided search strategy. Rather than exhaustively exploring the space, it prioritizes search efficiency by following a trace-guided optimization pathway. This allows our approach to scale more gracefully with problem size, but also means that on small instances, it may not explore the solution space as thoroughly as LKH-3 under high-budget settings.
As the problem size increases, the search space grows exponentially. On large-scale problems, the coverage of LKH-3's randomized search becomes sparse unless the number of trials is increased substantially, leading to diminishing returns. In such cases, our method's guided search provides a more effective exploration strategy, enabling it to find higher-quality solutions more consistently and efficiently. This explains why our approach begins to outperform LKH-3 on larger ATSP instances.
Q: The method in this paper does not involve learning. Why mention learning-based methods separately in the related work?
While our method is not learning-based, we include learning-based approaches in the related work because they represent a significant and actively growing direction in solving TSP and other combinatorial optimization problems. Many such methods build on heuristic frameworks, with some leveraging learning to optimize components like hyperparameters or search strategies—e.g., tuning parameters for LKH-3 or learning local improvement rules.
Moreover, reinforcement learning-based solvers often do not require access to ground-truth optimal tours, but their performance is heavily influenced by the quality of the heuristic signals they rely on during training. In this context, stronger heuristics—such as our proposed algorithm—can serve as better guidance or high-quality rollouts, potentially improving the training and generalization of learning-based methods. Therefore, our method could be complementary to learning-based approaches and facilitate their further advancement.
Q: Does the initial solution have a big impact on the solution quality?
We experimented with both random and fixed initialization strategies and found that the choice of initialization has minimal impact on the final solution quality. This is likely due to the stability and robustness of our optimization procedure, which consistently converges to high-quality solutions regardless of the starting point. For simplicity and reproducibility, we therefore adopt a fixed initialization rule in all our experiments.
Ablation studies
We have conducted ablation studies to evaluate the contribution of individual components of our method, particularly the 2-opt local refinement step. Additionally, we compared our approach against the exact solver Gurobi on a set of problem instances. Due to time constraints, these experiments were conducted on 10 randomly selected instances. Despite the limited scale, the results provide useful insights into the effectiveness of our design choices and confirm that our algorithm performs competitively even without the local refinement.
From the results and our earlier findings on TSPLIB ATSP problems, we observe that the 2-opt local search strategy significantly improves performance on symmetric TSP problems, but has limited impact on asymmetric TSP problems.
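For context on what the 2-opt refinement does, here is a minimal first-improvement 2-opt sketch. This is an illustrative baseline, not necessarily the exact variant used in the paper; note also that on asymmetric instances a segment reversal changes the cost of every reversed arc, which is one intuition for why its benefit is limited there.

```python
import math

def tour_cost(D, tour):
    # total cost of a closed tour under distance matrix D
    n = len(tour)
    return sum(D[tour[i]][tour[(i + 1) % n]] for i in range(n))

def two_opt(D, tour):
    # first-improvement 2-opt: repeatedly reverse a segment if it lowers the cost
    tour = list(tour)
    improved = True
    while improved:
        improved = False
        n = len(tour)
        for i in range(1, n - 1):
            for j in range(i + 2, n + 1):
                cand = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_cost(D, cand) < tour_cost(D, tour) - 1e-12:
                    tour, improved = cand, True
    return tour

# Example: four cities on a unit square; the crossing tour 0-2-1-3
# is uncrossed into the optimal square tour of cost 4.
pts = [(0, 0), (1, 0), (1, 1), (0, 1)]
D = [[math.dist(a, b) for b in pts] for a in pts]
refined = two_opt(D, [0, 2, 1, 3])
```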
Performance comparison on 100-node symmetric TSP problems
| Algorithm | Running Time (s) | Ratio to Optimal (Mean) | Ratio to Optimal (Median) |
|---|---|---|---|
| LKH(1–100) | 0.07 | 0.61% | 0.17% |
| LKH(1–10000) | 1.24 | 0.12% | 0.00% |
| LKH(10–10000) | 12.14 | 0.00% | 0.00% |
| Gurobi | 2.06 | 0.00% | 0.00% |
| Ours (no local search) | 0.03 | 15.2% | 14.4% |
| Ours (local search) | 0.05 | 1.47% | 1.35% |
Performance comparison on 500-node asymmetric TSP problems
| Algorithm | Running Time (s) | Ratio to Optimal (Mean) | Ratio to Optimal (Median) |
|---|---|---|---|
| LKH(1–100) | 0.87 | 2.22% | 0.17% |
| LKH(1–10000) | 2.82 | 0.45% | 0.00% |
| LKH(10–10000) | 20.23 | 0.18% | 0.00% |
| Gurobi | 6.36 | 0.00% | 0.00% |
| Ours (no local search) | 0.17 | 0.005% | 0.001% |
| Ours (local search) | 0.30 | 0.005% | 0.001% |
Results of GOAL
We carefully examined the implementation of GOAL and found that it is trained to solve the Open Loop TSP, where the tour does not require returning to the starting node. Additionally, the official evaluation code for GOAL omits the cost of returning to the origin, which leads to an underestimation of the total tour length.
As a result, the reported performance of GOAL is not directly comparable to other methods in our evaluation, which all solve the Closed Loop TSP (i.e., Hamiltonian cycles). We note this discrepancy to ensure a fair and transparent interpretation of the benchmark results.
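To make the discrepancy concrete, the difference is exactly the return arc: an open-loop evaluation omits the final edge back to the starting node. A small sketch (the matrix below is an arbitrary illustrative instance):

```python
def tour_cost(D, tour, closed=True):
    # sum of consecutive arc costs; a closed (Hamiltonian-cycle) evaluation
    # also charges the return arc from the last node back to the first
    cost = sum(D[tour[i]][tour[i + 1]] for i in range(len(tour) - 1))
    if closed:
        cost += D[tour[-1]][tour[0]]
    return cost

# Asymmetric 3-node example: the open-loop cost omits the expensive arc D[2][0]
D = [[0, 1, 4],
     [2, 0, 1],
     [7, 3, 0]]
open_cost = tour_cost(D, [0, 1, 2], closed=False)   # 1 + 1
closed_cost = tour_cost(D, [0, 1, 2])               # 1 + 1 + 7
```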
Table captions
We will update the captions in the revised version.
I thank the authors for their detailed response, which has addressed most of my concerns. However, I notice that Weakness 1 has not been addressed. Additionally, I would appreciate it if the authors could add a discussion section on how to combine the proposed method with learning-based methods in the field of neural combinatorial optimization. Overall, it is an interesting work, and I will increase the score to 4. Nevertheless, since I am not particularly familiar with the field of this paper, I’ll defer to the AC and other reviewers.
The authors present a local search algorithm for finding solutions to the TSP. The algorithm is based upon incorporating a differentiable formulation of the constraints into the objective and subsequently applying gradient descent to find solutions. Additionally, the resulting TSP formulation is relaxed to allow for continuous solutions. Improvements and details regarding runtime and numerical stability are discussed, as well as an approach to recover feasible solutions from invalid assignments. Evaluation is performed on randomly generated TSP instances, and the presented approach is compared to SotA local search heuristics and learning-based approaches.
Strengths and Weaknesses
Strengths
- The key concepts of the paper are laid out clearly and in an easy-to-follow manner.
- The algorithm shows convincing performance on the evaluated random instances with respect to solution quality and needed runtime in comparison to state-of-the-art methods, especially on large instances.
Weaknesses
- The evaluation is restricted to purely randomly generated TSP instances.
- The paper has several grammar and spelling mistakes, e.g.:
- line 71 "auto-aggressive way"
- line 72 "learning approach are applied"
- line 82 "Besides on those approaches"
- line 83 "choose learn better"
Questions
- How is the multiplier chosen? What impact does the choice have on the performance of the algorithm?
- How dependent is the output on the chosen initialization? Would the algorithm benefit from re-running with different initializations?
- How does the presented algorithm perform on benchmark instances from, e.g., TSPLIB?
- How does the approximate gradient computation affect the output of the algorithm? Is the purpose of the approximation purely to improve runtime or does it also improve the final output of the algorithm?
- On line 160 it is stated that the step width is increased to accelerate convergence. Usually in gradient descent, one increases the step width to achieve greater exploration, while decreasing it leads to convergence to a local minimum. Why does increasing the step width accelerate convergence for this algorithm? Furthermore, it is stated that the adaptive strategy helps to balance exploration and stability. It is not entirely clear what is meant by stability here. Does stability mean the ability of the algorithm to converge to a local solution?
- Proposition 1 states that any fractional matrix cannot satisfy the constraint. Does this not imply that the feasible solutions to the relaxed problem defined in Eq. (3) and of the original problem definition in Eq. (1) are the same, and therefore (3) is not a relaxation of (1), but (3) is an equivalent way of stating (1)?
- Additionally, on the small instances (TSP20 and TSP50) a comparison to the optimal solution found by an ILP solver would be interesting. This could also be useful in validating the experimental results as e.g. in Table 2 multiple algorithms consistently find the same solution (indicated by the avg. and median gap both being 0.0%), which can be a sign that they all have found the optimal solution. However, there exists an algorithm that consistently finds better solutions (GOAL in this case). Comparing to the optimal solution found by an ILP solver could be used to verify that no error has been made when taking these measurements.
Limitations
Yes.
Final Justification
I think this is an interesting work and I will maintain my score.
Formatting Issues
n/a
Thank you for your insightful comments. We appreciate the opportunity to clarify several aspects of our work. Below, we address the concerns raised by the reviewers and provide additional explanations and evidence where appropriate.
How is the multiplier chosen? What impact does the choice have on the performance of the algorithm?
Theoretically, the multiplier should be sufficiently large to enforce the acyclicity constraint. However, in practice, since candidate solutions are derived by solving linear assignment problems, the trace term becomes significantly large whenever a loop exists in the solution. As a result, even a moderate value of the multiplier is effective at penalizing cycles.
In our experiments, we fixed the multiplier to a single moderate value, as it provides a good balance between constraint enforcement and optimization stability. We also tested larger values and observed no significant difference in performance. This suggests that our algorithm is relatively insensitive to the exact value of the multiplier, as long as it is reasonably large.
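As an illustration of why a trace term blows up on subtours (the exact penalty in Eq. (6) is defined in the paper; the form below is a simplified stand-in for intuition only): for a permutation matrix $P$, $\mathrm{tr}(P^k)$ counts the nodes lying on cycles whose length divides $k$, so summing the traces of the first $n-1$ powers is zero exactly when $P$ encodes a single Hamiltonian cycle.

```python
import numpy as np

def subtour_penalty(P):
    # sum_{k=1}^{n-1} tr(P^k): zero iff the permutation P is one n-cycle,
    # strictly positive whenever P contains a shorter subtour
    n = P.shape[0]
    M = np.eye(n)
    total = 0.0
    for _ in range(n - 1):
        M = M @ P
        total += np.trace(M)
    return total

def perm_matrix(succ):
    # permutation matrix with arc i -> succ[i]
    n = len(succ)
    P = np.zeros((n, n))
    P[np.arange(n), succ] = 1.0
    return P

full_tour = perm_matrix([1, 2, 3, 0])   # single 4-cycle: penalty 0
two_loops = perm_matrix([1, 0, 3, 2])   # two 2-cycles: penalty > 0
```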
How dependent is the output on the chosen initialization? Would the algorithm benefit from re-running using different initializations?
We experimented with both random and fixed initialization strategies and found that the choice of initialization has minimal impact on the final solution quality. This is likely due to the stability and robustness of our optimization procedure, which consistently converges to high-quality solutions regardless of the starting point. For simplicity and reproducibility, we therefore adopt a fixed initialization rule in all our experiments.
How does the approximate gradient computation affect the output of the algorithm? Is the purpose of the approximation purely to improve runtime or does it also improve the final output of the algorithm?
We have experimentally evaluated the impact of using approximate versus exact gradients. In particular, when applying exact gradient computation on a 20-node symmetric TSP, we observed a performance gap of approximately 1.5% relative to LKH-3—significantly larger than the gap achieved when using the approximate gradient.
This suggests that the purpose of the approximation is not purely for runtime efficiency, but that it also has a positive effect on the final solution quality. One possible explanation is that the use of inexact gradients introduces a form of implicit regularization or stochasticity that encourages better exploration of the solution space. In contrast, exact gradients may lead to faster convergence but can get trapped in poorer local minima.
Therefore, while the approximation does speed up computation, it also empirically improves the robustness and effectiveness of the algorithm.
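For intuition about the hard assignments used here: a doubly stochastic iterate can be rounded to a permutation matrix by solving a linear assignment problem. A minimal sketch using SciPy's assignment solver (illustrative only; the paper's exact rounding rule, and how the hard matrix enters the gradient, are described there):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hard_assignment(X):
    # permutation matrix maximizing <P, X> over all permutations,
    # found by running the min-cost linear assignment solver on -X
    rows, cols = linear_sum_assignment(-X)
    P = np.zeros_like(X)
    P[rows, cols] = 1.0
    return P

X = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
P = hard_assignment(X)  # the diagonal carries the most mass here
```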
On step size adaptation and stability in the optimization process
Thank you for the thoughtful question. We clarify that in our method, the adaptive step size strategy is designed to balance exploration and convergence speed during optimization.
In our setting, decreasing the step size typically causes the algorithm to converge more quickly to a nearby local minimum. While this can improve numerical stability, it may lead to suboptimal solutions due to premature convergence.
Conversely, increasing the step size enables the optimizer to explore a broader region of the solution space, potentially escaping poor local minima and discovering better solutions. However, this also introduces a risk of oscillation or divergence, which is why careful control is required.
Our adaptive rule increases the step size when the current assignment remains unchanged (suggesting that a larger step may help move away from a plateau), and decreases it when a change is detected (to avoid overshooting). This dynamic adjustment aims to maintain a balance between exploration and stability.
Here, by stability, we refer to the algorithm’s ability to maintain progress toward a feasible and high-quality solution without oscillation or divergence—not necessarily convergence to a global or even local minimum in the classical sense.
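The rule described above can be sketched as follows (the growth, shrink, and clamping constants are illustrative assumptions, not the paper's actual values):

```python
def adapt_step(step, assignment_changed, grow=1.5, shrink=0.5,
               step_min=1e-4, step_max=10.0):
    # grow the step size while the rounded assignment is stuck on a plateau,
    # shrink it once the assignment flips, to avoid overshooting;
    # clamp to a safe range for numerical stability
    step *= shrink if assignment_changed else grow
    return min(max(step, step_min), step_max)
```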
We hope this clarifies the intention behind our step size strategy.
How does the presented algorithm perform on benchmark instances from, e.g., TSPLIB?
We evaluated our algorithm on large-scale ATSP benchmark instances from TSPLIB, particularly those derived from stacker crane problems. Our method consistently outperforms LKH-3 (1–100) in terms of solution quality while also requiring less computational time. These results demonstrate the practical effectiveness of our approach on real-world structured benchmarks, further validating its scalability and competitiveness beyond synthetic instances.
| Instance | Best Known Result | Algorithm | Running Time (s) | Results |
|---|---|---|---|---|
| rbg323 | 1326 | LKH(1–100) | 1.46 | 1388 |
|  |  | LKH(1–10000) | 3.38 | 1346 |
|  |  | LKH(10–10000) | 20.32 | 1346 |
|  |  | Ours | 0.42 | 1360 |
|  |  | Ours (no 2-opt) | 0.20 | 1365 |
| rbg358 | 1163 | LKH(1–100) | 1.67 | 1294 |
|  |  | LKH(1–10000) | 4.03 | 1175 |
|  |  | LKH(10–10000) | 25.16 | 1175 |
|  |  | Ours | 0.09 | 1180 |
|  |  | Ours (no 2-opt) | 0.04 | 1180 |
| rbg403 | 2465 | LKH(1–100) | 6.27 | 2536 |
|  |  | LKH(1–10000) | 26.99 | 2498 |
|  |  | LKH(10–10000) | 9.06 | 2498 |
|  |  | Ours | 1.53 | 2473 |
|  |  | Ours (no 2-opt) | 1.28 | 2473 |
| rbg443 | 2720 | LKH(1–100) | 5.44 | 2813 |
|  |  | LKH(1–10000) | 7.87 | 2762 |
|  |  | LKH(10–10000) | 30.06 | 2756 |
|  |  | Ours | 0.52 | 2760 |
|  |  | Ours (no 2-opt) | 0.27 | 2760 |
Proposition 1 states that any fractional matrix cannot satisfy the constraint. Does this not imply that the feasible solutions to the relaxed problem defined in Eq. (3) and of the original problem definition in Eq. (1) are the same, and therefore (3) is not a relaxation of (1), but an equivalent formulation?
Yes, Proposition 1 implies that the continuous formulation in Eq. (3) is an exact representation of the original TSP defined in Eq. (1), in the sense that both share the same set of feasible solutions—namely, permutation matrices corresponding to valid tours.
However, we still refer to Eq. (3) as a relaxation in the context of optimization because it replaces the discrete permutation constraint with continuous doubly stochastic and trace-based constraints. While the feasible set remains unchanged, this formulation enables the use of gradient-based optimization techniques over continuous variables. It is worth noting that, despite being continuous, Eq. (3) remains non-convex and NP-hard in general.
Therefore, Eq. (3) should be viewed as a continuous but exact reformulation of the TSP, offering new algorithmic opportunities while preserving the problem's combinatorial structure.
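One standard way to exploit such a continuous reformulation (a sketch under our own assumptions, not necessarily the paper's exact update) is a multiplicative, i.e. exponentiated, gradient step followed by Sinkhorn normalization to stay close to the set of doubly stochastic matrices:

```python
import numpy as np

def sinkhorn(M, iters=200):
    # alternate row/column normalization of a positive matrix;
    # converges toward a doubly stochastic matrix
    M = M.copy()
    for _ in range(iters):
        M /= M.sum(axis=1, keepdims=True)
        M /= M.sum(axis=0, keepdims=True)
    return M

def exp_grad_step(X, grad, step):
    # multiplicative update keeps all entries positive; Sinkhorn re-projects
    return sinkhorn(X * np.exp(-step * grad))

rng = np.random.default_rng(0)
X0 = np.full((5, 5), 1.0 / 5)                       # uniform doubly stochastic start
X1 = exp_grad_step(X0, rng.standard_normal((5, 5)), step=0.1)
```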
Comparison with ILP Solver
We conducted a preliminary comparison between our method and an exact ILP solver on both symmetric and asymmetric TSP instances. Specifically, we tested the solver on 100-node symmetric problems and 500-node asymmetric problems and found that the ILP solver was able to return optimal solutions within a reasonable time budget.
Due to time constraints, we evaluated both methods on 10 randomly sampled instances for each case. The results are summarized below. While the ILP solver guarantees optimality, our method achieves competitive solution quality with significantly faster runtimes, particularly for larger asymmetric instances—highlighting its scalability and practical utility for large-scale problems.
Performance comparison on 100-node symmetric TSP problems
| Algorithm | Running Time (s) | Ratio to Optimal (Mean) | Ratio to Optimal (Median) |
|---|---|---|---|
| LKH(1–100) | 0.07 | 0.61% | 0.17% |
| LKH(1–10000) | 1.24 | 0.12% | 0.00% |
| LKH(10–10000) | 12.14 | 0.00% | 0.00% |
| Gurobi | 2.06 | 0.00% | 0.00% |
| Ours (no local search) | 0.03 | 15.2% | 14.4% |
| Ours (local search) | 0.05 | 1.47% | 1.35% |
Performance comparison on 500-node asymmetric TSP problems
| Algorithm | Running Time (s) | Ratio to Optimal (Mean) | Ratio to Optimal (Median) |
|---|---|---|---|
| LKH(1–100) | 0.87 | 2.22% | 0.17% |
| LKH(1–10000) | 2.82 | 0.45% | 0.00% |
| LKH(10–10000) | 20.23 | 0.18% | 0.00% |
| Gurobi | 6.36 | 0.00% | 0.00% |
| Ours (no local search) | 0.17 | 0.005% | 0.001% |
| Ours (local search) | 0.30 | 0.005% | 0.001% |
Results of GOAL
We carefully examined the implementation of GOAL and found that it is trained to solve the Open Loop TSP, where the tour does not require returning to the starting node. Additionally, the official evaluation code for GOAL omits the cost of returning to the origin, which leads to an underestimation of the total tour length.
As a result, the reported performance of GOAL is not directly comparable to other methods in our evaluation, which all solve the Closed Loop TSP (i.e., Hamiltonian cycles). We note this discrepancy to ensure a fair and transparent interpretation of the benchmark results.
I thank the authors for their detailed response and for answering my questions. I think this is an interesting work and I will maintain my score.
This paper presents an innovative continuous relaxation framework for the Asymmetric Traveling Salesman Problem (ATSP). The core idea is to combine a differentiable, trace-based Directed Acyclic Graph (DAG) constraint with a doubly stochastic matrix relaxation to formulate the problem. This approach effectively suppresses subtours, enabling the use of gradient-based optimization to find solutions. The proposed method demonstrates superior performance and scalability compared to state-of-the-art methods like LKH-3, especially on large-scale ATSP benchmarks.
Strengths and Weaknesses
Strengths:
- The paper proposes a novel, differentiable, and exact continuous formulation for ATSP.
- On large-scale ATSP instances (with more than 200 nodes), the method's solution quality and runtime are superior to the highly-optimized LKH-3 heuristic.
- The algorithm shows convincing performance with solid evaluation.
Weaknesses:
- The experiments report average and median results but do not provide error bars or confidence intervals.
- A valuable direction for future work would be to explore the algorithm's performance in worst-case scenarios.
Questions
Please refer to the weaknesses.
Limitations
Yes.
Final Justification
Although I am not familiar with this research topic, I find this to be an interesting work.
Formatting Issues
No
We would like to respectfully clarify that the review appears to be based on a misunderstanding of our submission.
Our paper does not address the problem of learning Directed Acyclic Graphs (DAGs) from observational data, nor does it consider Structural Hamming Distance (SHD) as a performance metric. Instead, our work focuses on solving the Asymmetric Traveling Salesman Problem (ATSP) by proposing a trace-based acyclicity constraint within a continuous optimization framework. The trace penalty is used to suppress cyclic subtours in assignment matrices, not to enforce acyclicity in the context of causal discovery or DAG learning.
The mention of concepts such as LiNGAM, statistical identifiability, or post-processing in DAG structure learning (e.g., thresholding of adjacency matrices) is therefore unrelated to our setting and not applicable to our contributions.
We kindly ask the reviewer to reconsider the evaluation in light of the actual scope of our paper, which is a continuous formulation for combinatorial optimization, not causal inference or graphical model learning.
Thank you for the clarification. Compared to traditional integer programming constraints for eliminating ATSP subtours, what are potential weaknesses of your differentiable trace constraint in suppressing different types of subtours? Have you considered other differentiable penalties derived from matrix properties and what would their trade-offs be? What is the peak memory consumption during your experiments?
Traditional integer programming-based approaches to TSP, such as branch-and-cut methods, guarantee optimal solutions but are often computationally expensive due to the NP-hardness of the problem. In contrast, our method is a heuristic that offers high-quality solutions within a reasonable time frame.
To the best of our knowledge, most state-of-the-art TSP solvers rely on linear integer programming. Although these formulations use only linear constraints, the total number of constraints can grow exponentially with the number of nodes in the worst case, particularly due to the subtour elimination constraints.
Regarding memory usage, our algorithm has a space complexity of $O(n^2)$, which is comparable to the storage requirement of the input distance matrix. In practice, this means our method's peak memory consumption remains within a constant factor of the distance matrix itself, making it highly scalable in terms of memory.
Thanks for your further clarification. As a heuristic, does your method have a worst-case performance guarantee? Have you identified specific problem structures where its performance degrades significantly? Or is this a direction for future research?
For the general asymmetric TSP, it is currently unknown whether a constant-factor approximation algorithm with polynomial complexity can be constructed. In contrast, under additional assumptions—such as symmetry or the triangle inequality—such approximation algorithms do exist. Our algorithm is designed for the general case and does not rely on properties like symmetry or the triangle inequality. While specialized algorithms such as LKH-3, which are tailored for symmetric TSP, may achieve slightly better performance in those specific settings, our method offers broader applicability. In future work, we plan to incorporate problem-specific structural assumptions into our framework to further improve performance when applicable.
Thank you for your clarification. Although I am not familiar with your research topic, I find this to be an interesting work. Therefore, I have decided to raise my score to 4 (borderline accept). I’ll defer to the AC and other reviewers.
This paper proposes a new continuous relaxation framework for the ATSP using trace-based acyclicity constraints and doubly stochastic matrix optimization. The method integrates differentiable DAG constraints to eliminate subtours, employs projected exponentiated gradient descent for optimization, and uses a greedy post-processing step to recover valid tours. Experiments demonstrate state-of-the-art performance on large-scale ATSP instances, outperforming LKH-3 and learning-based baselines.
Strengths and Weaknesses
I have to say that I usually work on approximation algorithms for network design problems (e.g., approximating the ATSP). From reading the abstract, I thought this work was related to my field, but it turned out to go beyond my experience after reading the whole paper. I am not the right person to judge the contribution and novelty of the paper.
From my perspective, the paper has the following strengths:

- The idea of using trace-based constraints looks interesting to me. This approach enforces acyclicity in ATSP relaxations, and Proposition 1 establishes that the global minimum of the relaxed problem recovers the exact TSP solution; this is also an interesting idea to me.
- From my view, the gradient approximation (using hard assignments) efficiently addresses the high cost of matrix power computations. However, I don't know whether it's computationally practical (e.g., easy to implement) or not.
- This paper gives competitive results on symmetric TSPs (as shown in Table 7), suggesting potential generality beyond ATSP.
The main weakness I can see is that the non-convex optimization (Equation 6) lacks a theoretical analysis of convergence rates or approximation bounds. The adaptive step-size heuristic is pragmatic but ad hoc, which could become a significant problem on other datasets.
Questions
I don't have specific questions.
Limitations
n/a
Final Justification
The authors addressed my concern during the rebuttal, and thus, I will keep my score. But this paper goes beyond my experience, so please take my review with caution.
Formatting Issues
n/a
Thank you for your insightful comments. We would like to offer the following clarifications regarding the approximation properties of Eq. (6).
Approximation Properties of Eq. (6)
The trace-based constraint term in Eq. (6) is zero if and only if the assignment matrix corresponds to a valid Hamiltonian tour. Therefore, when the Lagrange multiplier is sufficiently large, the relaxation in Eq. (6) becomes exact—i.e., the global minimizer of the relaxed problem coincides with the solution of the original TSP.
However, Eq. (6) defines a non-convex optimization problem. As with other non-convex objectives, gradient-based methods can generally only guarantee convergence to a local minimum. If the exact gradient is used with a suitably chosen step size, the projected exponentiated gradient or Frank–Wolfe method for solving Eq. (6) converges at a rate of $O(1/\sqrt{T})$, measured by the Frank–Wolfe gap [1].
When using approximate gradients—as we do in our implementation for efficiency—the convergence rate becomes more difficult to characterize formally. This remains an open theoretical question, and we acknowledge it as a direction for future work.
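For reference, the non-convex Frank–Wolfe guarantee in [1], restated here in generic notation (our transcription, under the assumption that Eq. (6) is optimized over a compact convex feasible set $\mathcal{M}$ such as the Birkhoff polytope), bounds the smallest Frank–Wolfe gap seen after $T$ iterations:

$$
\min_{0 \le t \le T} g_t \;\le\; \frac{\max\{2 h_0,\; C_f\}}{\sqrt{T+1}},
\qquad
g_t := \max_{s \in \mathcal{M}} \langle\, s - x_t,\; -\nabla f(x_t) \,\rangle,
$$

where $h_0 = f(x_0) - \min_{x \in \mathcal{M}} f(x)$ is the initial suboptimality and $C_f$ is the curvature constant of $f$ over $\mathcal{M}$; $g_t = 0$ certifies that $x_t$ is a stationary point.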
That said, we observe empirically that the use of approximate gradients yields better performance in practice compared to exact gradients. This is likely due to reduced noise in gradient estimation and improved numerical stability during training.
We thank the reviewer again for raising this important point.
References:
[1] Lacoste-Julien, Simon. Convergence rate of Frank-Wolfe for non-convex objectives. arXiv preprint arXiv:1607.00345 (2016).
[2] Conn, Andrew R., et al. Global convergence of a class of trust region algorithms for optimization using inexact projections on convex constraints. SIAM Journal on Optimization 3.1 (1993): 164–221.
Thank you. I appreciate the response. I think this is an interesting paper. Regarding contribution and novelty, I’ll defer to the AC and other reviewers, as I am not sufficiently familiar with this area.
This paper introduces a novel continuous relaxation framework for the ATSP, using differentiable trace-based constraints combined with doubly stochastic formulations and gradient-based optimization. A greedy post-processing step ensures valid tours.
The main strengths are the original formulation and the empirical performance: the method consistently outperforms LKH-3 and other baselines, particularly on large-scale ATSP and TSPLIB benchmarks.
While theoretical convergence guarantees are limited, and some reviewers noted lower confidence due to field mismatch, the work is technically sound, clearly presented, and demonstrates substantial practical impact. The rebuttal provided convincing clarifications and additional results.