PaperHub
Score: 7.3/10 · Spotlight · NeurIPS 2025
4 reviewers — ratings 5, 4, 4, 5 (min 4, max 5, std 0.5)
Confidence: 2.5 · Novelty: 2.8 · Quality: 3.0 · Clarity: 2.5 · Significance: 2.8

Hybrid-Balance GFlowNet for Solving Vehicle Routing Problems

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We introduce the Hybrid-Balance GFlowNet framework, which uniquely integrates TB and DB in a principled and adaptive manner to achieve local-global optimization.

Abstract

Keywords
GFlowNet, Vehicle Routing Problem

Reviews and Discussion

Review (Rating: 5)

The paper discusses two existing GFlowNet objectives and their limitations. The Trajectory Balance (TB) objective aligns with global metrics of VRP and TSP, such as minimizing total travel distance; however, this exclusive focus on global optimization can neglect important local optimization signals. Detailed Balance (DB) [4], on the other hand, offers a mechanism better suited for local optimization. Nevertheless, using DB alone is equally inadequate, as VRPs fundamentally require a global perspective to achieve optimal solutions. For this reason, the authors propose the Hybrid-Balance principle: combining TB and DB in a unified and extensible manner can significantly enhance GFlowNet-based methods for VRPs.

Strengths and Weaknesses

Strengths

  1. Conceptually Sound and Well-Motivated: The central premise of the paper—that VRPs benefit from both global and local optimization—is highly intuitive and well-articulated. The authors clearly explain the complementary strengths and individual weaknesses of Trajectory Balance (global) and Detailed Balance (local), providing a strong motivation for their hybrid approach.

  2. Novel Theoretical Contribution: The paper's most significant strength lies in its formalization of Detailed Balance for VRPs. The derivation of the backward transition probability is non-trivial. The authors provide a combinatorial argument (Statements 1, 2, and 3) to derive a closed-form expression for the number of trajectory orderings, which then informs the backward probability at the depot.

  3. Comprehensive and Rigorous Evaluation: The experimental setup is thorough. By integrating HBG into two distinct GFlowNet-based solvers—an end-to-end construction method (AGFN) and a hybrid heuristic search method (GFACS)—the authors demonstrate the general applicability of their framework. The evaluation across two different problems (CVRP and TSP) and multiple scales (up to 1000 nodes) further strengthens the claims of robustness and scalability.

Weaknesses

My main concerns with this paper are the claimed "adaptiveness" of the hybrid balance, the practical significance of some of the performance gains, and the potential overstatement of novelty regarding the inference strategy.

  1. "Hybrid-Balance" is a Simple, Static Summation: The paper claims to integrate TB and DB in a "principled and adaptive manner". However, the final loss function in Equation 11 is a simple, unweighted sum: L_HB = L_TB + L_DB. This is the most straightforward, non-adaptive way to combine two losses. There is no adaptive mechanism described that would, for example, change the weighting between L_TB and L_DB during training or based on the problem instance. This makes the term "adaptive" misleading. A simple summation might not be optimal, and this lack of exploration into the balance mechanism is a significant weakness.

  2. Marginal Performance Gains in Some Key Scenarios: While the improvements are consistent, their magnitude is questionable in some cases. For the end-to-end AGFN model on CVRP1000, the objective value improves from 133.97 to 131.78, a gain of only ~1.6%.

  3. Fairness of Hyperparameter Tuning: The authors state they "adopt the same model configurations and training settings as AGFN and GFACS". However, the introduction of a new loss component (L_DB) fundamentally changes the optimization landscape. It is highly likely that the optimal hyperparameters (e.g., learning rate, optimizer settings) for the HBG models would differ from those of the original models. Without re-tuning, the comparison might unfairly handicap the proposed HBG models, or conversely, the original models might have been better tuned for their specific loss. The paper provides no information on whether such re-tuning was performed.

Questions

  1. The term "adaptive" is used to describe the integration of TB and DB, yet Equation 11 shows a fixed, unweighted sum. Could you clarify what aspect of the integration is adaptive? Have you experimented with a weighted sum, L_TB + λ·L_DB, and if so, how sensitive is the performance to the hyperparameter λ?

  2. The energy in Equation 9 is based on the reward relative to the batch average. What is the intuition behind this choice versus a more direct energy function?

Limitations

yes

Final Justification

I maintain my evaluation that the paper should be accepted.

Formatting Concerns

no

Author Response

We sincerely thank the reviewer for the thoughtful and encouraging feedback. We greatly appreciate the kind comments on the conceptual motivation, the theoretical contribution of Detailed Balance in VRP, and the thoroughness of our experimental design across diverse solvers and problem scales. Below, we respond to the concerns you raised.

W1 & Q1: Weight λ

We sincerely thank the reviewer for this insightful comment and for pointing out the potential ambiguity in our use of the term "adaptive". Our original intention was to emphasize the unified and structural integration of TB and DB within a consistent probabilistic framework. We appreciate the reviewer's suggestion and will revise the manuscript accordingly to clarify this point and avoid any possible misunderstanding.

We also appreciate the reviewer's concern regarding a weighted combination (i.e., L_HB = L_TB + λ·L_DB). In response, we have conducted additional experiments to assess the sensitivity of our method to different values of λ, gathered in Tables B1–B4 below. The results include adaptive weights (i.e., λ: 0.5→0, 1→0, 2→0, and so on) and fixed weights (i.e., λ: 0.5→0.5, 1→1, 2→2). We find that, for HBG-AGFN, a fixed 1:1 weight between TB and DB consistently yields the best performance. Similarly, for HBG-GFACS, this ratio offers the most favorable trade-off between performance with and without local search.
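For concreteness, a linearly decayed λ schedule of the kind denoted "0.5→0", together with the combined loss, could be sketched as follows (a minimal illustration; the function and variable names are ours, not from the paper):

```python
def lambda_schedule(step, total_steps, lam_start, lam_end):
    """Linearly interpolate the DB weight, e.g. 0.5 -> 0 over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return lam_start + frac * (lam_end - lam_start)


def hybrid_loss(tb_loss, db_loss, step, total_steps,
                lam_start=1.0, lam_end=1.0):
    """L_HB = L_TB + lambda * L_DB.

    The fixed 1:1 setting reported as best corresponds to
    lam_start = lam_end = 1; a "0.5 -> 0" run uses 0.5 and 0.
    """
    lam = lambda_schedule(step, total_steps, lam_start, lam_end)
    return tb_loss + lam * db_loss
```

Under this sketch, the "adaptive" entries in Tables B1–B4 vary `lam_start`/`lam_end`, while the fixed entries keep them equal.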

Table B1. Weights Analysis of HBG-AGFN on CVRP

"𝜆: 0.5→0" refers to changing the TB:DB loss weight ratio from 1:0.5 to 1:0 during training, with the TB weight fixed at 1. The interpretation for other entries follows analogously. The term "origin" denotes the original hyperparameter setting used in our main experiments.

| Size \ λ (Gap %) | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | 11.09 | 10.80 | 10.94 | 11.41 | 11.98 | 11.20 | 9.98 | 9.95 | 10.44 |
| 500 | 11.27 | 12.04 | 11.18 | 12.15 | 11.66 | 11.80 | 11.63 | 10.44 | 11.98 |
| 1000 | 10.89 | 10.61 | 9.89 | 11.03 | 10.84 | 10.98 | 10.87 | 9.34 | 10.08 |

Table B2. Weights Analysis of HBG-AGFN on TSP

| Size \ λ (Gap %) | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | 10.26 | 10.42 | 10.49 | 11.48 | 11.26 | 10.66 | 10.08 | 10.45 | 10.36 |
| 500 | 15.15 | 15.40 | 16.13 | 16.87 | 16.09 | 15.79 | 14.91 | 14.05 | 15.64 |
| 1000 | 18.80 | 18.55 | 19.00 | 19.26 | 21.87 | 22.00 | 18.52 | 18.47 | 18.81 |

Table B3. Weights Analysis of HBG-GFACS on CVRP

| Size \ λ (Gap %) | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | 14.05 | 19.29 | 17.37 | 17.72 | 18.47 | 19.08 | 16.42 | 16.48 | 18.87 |
| 500 | 11.78 | 17.66 | 15.38 | 15.28 | 16.35 | 15.71 | 13.57 | 13.53 | 15.49 |
| 1000 | 11.29 | 15.80 | 12.71 | 12.88 | 13.47 | 12.80 | 10.91 | 10.61 | 12.80 |
| 200 (local search) | 2.39 | 1.78 | 2.03 | 2.07 | 2.14 | 1.96 | 1.96 | 1.96 | 1.78 |
| 500 (local search) | 3.02 | 2.51 | 2.91 | 3.41 | 3.44 | 2.99 | 3.17 | 2.81 | 2.62 |
| 1000 (local search) | 2.82 | 2.27 | 2.82 | 3.31 | 3.53 | 2.75 | 2.45 | 2.75 | 2.40 |

Table B4. Weights Analysis of HBG-GFACS on TSP

| Size \ λ (Gap %) | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 200 | 19.13 | 20.61 | 23.06 | 22.21 | 20.56 | 24.02 | 21.64 | 19.40 | 22.82 |
| 500 | 49.51 | 49.61 | 50.80 | 49.48 | 48.47 | 56.13 | 49.62 | 48.41 | 50.37 |
| 1000 | 72.30 | 80.19 | 90.50 | 79.37 | 80.39 | 96.36 | 77.45 | 75.93 | 87.23 |
| 200 (local search) | 1.57 | 1.49 | 1.59 | 1.59 | 1.62 | 1.82 | 1.54 | 1.50 | 1.62 |
| 500 (local search) | 4.81 | 4.65 | 5.02 | 4.88 | 4.73 | 5.18 | 4.73 | 4.68 | 5.12 |
| 1000 (local search) | 7.73 | 7.68 | 7.86 | 7.78 | 7.83 | 8.02 | 7.70 | 7.67 | 7.99 |

W2: Improvement on 1000 Node CVRP Instances

We sincerely thank the reviewer for raising this concern. While the improvement on CVRP1000 under AGFN may appear modest (~1.6%) in absolute terms, we would like to note that this result is obtained on challenging large-scale instances involving 1000 nodes, where even small percentage gains are non-trivial due to the high combinatorial complexity of the problem. More importantly, we observe consistent improvements across a wide range of problem sizes, solver backbones (AGFN and GFACS), and problem types (CVRP and TSP). We believe this consistency highlights the robustness and broad applicability of our proposed approach.

W3: Re-tuning

We sincerely thank the reviewer for this valuable suggestion. The reviewer is absolutely right that the introduction of a new loss component (𝐿_{DB}) could potentially alter the optimization landscape, and that the original hyperparameter settings used in AGFN and GFACS may not be optimal for the proposed HBG variants.

Our initial motivation for adopting the same configurations was to ensure a fair comparison under consistent settings. However, inspired by the reviewer's insightful comment, we have conducted additional experiments in which we re-tuned key hyperparameters for the HBG models, including the sparsity parameter, learning rate, and optimizer settings. The updated results (Tables A1, C, and D) will be incorporated into the Appendix of the revised version.

As the sparsity parameter results in Table A1 show, the original sparsity setting yields the best performance for both HBG-GFACS and HBG-AGFN. Regarding the learning rate comparisons in Table C, for HBG-GFACS, the original value of 5×10^{−4} performs best without local search, while a value of 1×10^{−3} achieves the best results when local search is enabled. For HBG-AGFN, a learning rate of 1×10^{−4} shows superior performance on the 200- and 500-node instances, whereas the original value performs best on the 1000-node instances. As for the optimizer comparison in Table D, the original AdamW setting provides the best performance in most cases. An exception is observed on the 500- and 1000-node instances with local search, where Adam slightly outperforms AdamW for HBG-GFACS.

Table A1. Comparison of Various Sparse Parameter k on CVRP

HBG-GFACS_2 refers to the HBG-GFACS model evaluated with k=|V|/2; the interpretation is analogous for the other entries. The term "origin" denotes the original hyperparameter setting used in our main experiments.

| CVRP (Gap %) | 200 | 500 | 1000 |
| --- | --- | --- | --- |
| HBG-GFACS_2 | 19.26 | 16.27 | 12.69 |
| HBG-GFACS_5 (origin) | 16.48 | 13.53 | 10.61 |
| HBG-GFACS_8 | 17.90 | 13.83 | 11.02 |
| HBG-GFACS_10 | 20.43 | 17.88 | 14.90 |
| HBG-GFACS (local search)_2 | 2.13 | 3.41 | 3.46 |
| HBG-GFACS (local search)_5 (origin) | 1.96 | 2.81 | 2.75 |
| HBG-GFACS (local search)_8 | 2.13 | 3.36 | 3.51 |
| HBG-GFACS (local search)_10 | 3.07 | 4.03 | 4.77 |
| HBG-AGFN_2 | 12.62 | 13.36 | 14.78 |
| HBG-AGFN_5 (origin) | 9.95 | 10.44 | 9.34 |
| HBG-AGFN_8 | 11.55 | 13.27 | 12.69 |
| HBG-AGFN_10 | 12.98 | 14.50 | 14.06 |

Table C. Comparison of Various Learning Rate on CVRP

HBG-GFACS_5×10^{-3} refers to the HBG-GFACS model evaluated with a learning rate of 5×10^{-3}; the interpretation is analogous for the other entries.

| CVRP (Gap %) | 200 | 500 | 1000 |
| --- | --- | --- | --- |
| HBG-GFACS_5×10^{-3} | 17.15 | 14.48 | 11.77 |
| HBG-GFACS_1×10^{-3} | 17.04 | 14.37 | 11.50 |
| HBG-GFACS_5×10^{-4} (origin) | 16.48 | 13.53 | 10.61 |
| HBG-GFACS_1×10^{-4} | 18.22 | 14.19 | 11.30 |
| HBG-GFACS (local search)_5×10^{-3} | 1.71 | 2.77 | 2.63 |
| HBG-GFACS (local search)_1×10^{-3} | 1.70 | 2.65 | 2.51 |
| HBG-GFACS (local search)_5×10^{-4} (origin) | 1.96 | 2.81 | 2.75 |
| HBG-GFACS (local search)_1×10^{-4} | 4.53 | 6.49 | 6.50 |
| HBG-AGFN_5×10^{-3} | 11.41 | 14.46 | 12.00 |
| HBG-AGFN_1×10^{-3} | 10.48 | 11.92 | 10.62 |
| HBG-AGFN_5×10^{-4} (origin) | 9.95 | 10.44 | 9.34 |
| HBG-AGFN_1×10^{-4} | 9.37 | 10.27 | 9.59 |

Table D. Comparison of Various Optimizer on CVRP

HBG-GFACS_SGD refers to the HBG-GFACS model evaluated with SGD optimizer; the interpretation is analogous for the other entries.

| CVRP (Gap %) | 200 | 500 | 1000 |
| --- | --- | --- | --- |
| HBG-GFACS_SGD | 17.80 | 15.19 | 12.45 |
| HBG-GFACS_Adam | 17.33 | 14.28 | 11.11 |
| HBG-GFACS_AdamW (origin) | 16.48 | 13.53 | 10.61 |
| HBG-GFACS (local search)_SGD | 3.03 | 4.26 | 4.55 |
| HBG-GFACS (local search)_Adam | 2.03 | 2.70 | 2.73 |
| HBG-GFACS (local search)_AdamW (origin) | 1.96 | 2.81 | 2.75 |
| HBG-AGFN_SGD | 14.66 | 13.00 | 10.84 |
| HBG-AGFN_Adam | 10.12 | 12.86 | 11.17 |
| HBG-AGFN_AdamW (origin) | 9.95 | 10.44 | 9.34 |

Q2: Reward Function

We sincerely thank the reviewer for this thoughtful question. The use of reward relative to the batch average in Equation 9 is a common practice in the VRP literature, particularly in learning-based methods like POMO, AGFN, and GFACS. The key motivation behind this choice is to mitigate the effect of node distance discrepancies across different instances, thereby promoting training stability.

For example, different instances often exhibit varying step-wise reward magnitudes due to differences in route length, customer density, or total demand. As a result, the average reward per decision step can differ between instances. If we directly use these raw per-step rewards to compute energy values, instances with smaller average step-wise rewards would inherently yield more favorable energy values, even for suboptimal decisions. In contrast, instances with larger average per-step rewards might assign worse energy values even to relatively good decisions within that instance. This imbalance can distort the learning signal, causing the model to undervalue high-quality decisions in instances with larger average per-step rewards, while overvaluing mediocre decisions in smaller ones.

By using reward values relative to the batch average, we normalize this effect and encourage the model to compare solutions fairly within the context of each instance. This stabilizes training and helps the model focus on learning to generate relatively good solutions across diverse instances, instead of simply favoring small step-wise rewards in absolute terms.
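As a sketch, one plausible form of such a batch-relative energy looks as follows (illustrative only; the paper's Equation 9 may differ in details such as scaling or sign conventions):

```python
import numpy as np


def relative_energy(rewards):
    """Energy of each sampled solution relative to the batch mean reward.

    rewards: per-solution rewards sampled for one instance.
    A solution better than the batch average gets negative energy, so its
    reward term exp(-E) in the GFlowNet objective is correspondingly larger.
    The energies are centered, which removes the instance-dependent offset
    in raw reward magnitudes described above.
    """
    rewards = np.asarray(rewards, dtype=float)
    return -(rewards - rewards.mean())
```

Because the energies are centered per batch, two instances with very different absolute reward scales produce comparable learning signals.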

Comment

Thank you for your detailed answers. I maintain my evaluation that the paper should be accepted.

Comment

We truly appreciate your careful evaluation and your support of our work.

Review (Rating: 4)

This paper introduces the Hybrid-Balance GFlowNet (HBG) framework aimed at improving GFlowNet-based solvers for vehicle routing problems (VRPs). HBG combines trajectory balance (TB) and detailed balance (DB) objectives in a principled, adaptive manner to unify global and local optimization perspectives. The authors design a depot-guided inference scheme tailored for VRPs with a depot structure (e.g., CVRP), and rigorously evaluate the framework by integrating HBG into two state-of-the-art GFlowNet-based solvers: AGFN and GFACS. Experimental results demonstrate significant and consistent improvements in solution quality and generalizability.

Strengths and Weaknesses

Strengths

  • The integration of TB and DB provides a unified approach bridging the gap between local and global optimization for GFlowNets in combinatorial domains. The formalization of DB within VRP, particularly the explicit recurrence and probability derivation, is a notable theoretical advancement.
  • The main development, including the algorithmic design, mathematical derivations, and the final composite loss function, is clearly presented.
  • Empirical improvements are substantial and consistent, with ablation studies confirming the contribution of each module. The improvements persist for TSP, underlining generalizability. The evaluation of depot-guided inference variants is carefully done and matches the method’s claimed benefits.

Weakness

  • The necessity of considering local optimization signals is not well motivated. In particular, why do we also need to pursue local optimization given a global optimization objective? Will the local optimization signals bias the pursuit of the global optimum? What does "fine-grained local structure" refer to?
  • It is noteworthy that the depot-guided inference strategy only applies to depot-centric problems like CVRP. For problems without clear depot structures (like TSP), the method falls back to original approaches.
  • The performance could be constrained by the quality of the base GFlowNet solvers.

Questions

See my comments in the Strengths and Weaknesses section.

Limitations

yes

Final Justification

The authors' rebuttal addressed my concerns. I therefore maintained my score.

Formatting Concerns

N.A.

Author Response

We sincerely thank the reviewer for the thoughtful and encouraging feedback. We truly appreciate your kind comments on our theoretical contributions, including the formalization of DB in VRP and the unified integration of TB and DB. We are also grateful for your positive comments on the clarity of our design and analysis, as well as the empirical improvements and careful ablation studies.

Below, we provide detailed responses to the questions and suggestions you kindly raised.

W1: DB Motivation and Term Explanation

We sincerely thank the reviewer for raising this important question. The necessity of incorporating local optimization signals arises from a core limitation of relying solely on global feedback. Trajectory Balance (TB), while effective in guiding overall optimization, assigns credit based on the full trajectory. This can misrepresent the contribution of locally good decisions when the overall solution is not good.

To illustrate this concretely, consider a long vehicle routing trajectory constructed incrementally as A → B → C → D → E → ... → U → V → W → X → Y → Z. Assume the overall route has a high cost (bad performance) due to early suboptimal choices like traversing a high-cost edge from C to D. However, the later segment (e.g., U to Z) may be well-structured and efficient. Under TB, the reward signal derived from the final cost is spread across the entire trajectory. As a result, good transitions like W → X → Y may receive low or misleading credit as the overall result is not good, hindering the model from reinforcing these useful local patterns.

In contrast, Detailed Balance (DB) operates at a step-wise level, comparing local alternatives directly. For example, at node W, DB can determine that choosing X instead of alternative choices like Z leads to better outcomes, which is independent of early errors. This enables the model to recognize and reinforce high-quality decisions even within imperfect global solutions.
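For reference, the two objectives enforce different balance conditions; in their standard GFlowNet forms they can be written side by side (generic notation, which may differ from the paper's exact equations):

```latex
% Trajectory Balance: one constraint per complete trajectory \tau = (s_0, \dots, s_n)
Z \prod_{t=1}^{n} P_F(s_t \mid s_{t-1}) \;=\; R(s_n) \prod_{t=1}^{n} P_B(s_{t-1} \mid s_t)

% Detailed Balance: one constraint per single transition s \to s'
F(s)\, P_F(s' \mid s) \;=\; F(s')\, P_B(s \mid s')
```

TB couples every transition to the terminal reward R(s_n) through a single trajectory-level equation, whereas DB constrains each transition through the state flows F(s), which is why DB can credit a good local step independently of the rest of the route.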

Importantly, adding DB does not bias the model away from global optimization—rather, it complements TB by supplying more granular learning signals. As validated in our ablation study (Section 3.2, Table 3), the hybrid use of TB and DB consistently improves performance over using either objective alone.

Moreover, the term "fine-grained local structure" refers to the localized reward relationships between a given state and its immediate successor, which capture beneficial sub-patterns within a trajectory, such as short, efficient sub-routes or cost-effective transitions. These local structures often contribute meaningfully to overall solution quality but may be overlooked or undervalued when using a purely trajectory-level objective like TB. By introducing DB, the model gains the capability to identify and reinforce these fine-grained decisions during training. While the concept of "fine-grained local structure" is implicitly mentioned in Lines 40–41 of the paper, we will revise the text to make it more explicit and accessible.

Motivated by your insightful suggestion, we will incorporate this example and discussion into the main paper to better clarify the motivation behind our hybrid design.

W2: Depot-Guided Inference

The reviewer is correct on this point: the Depot-Guided Inference strategy is specifically tailored for depot-centric problems such as CVRP. While we briefly acknowledged this in Lines 54–58, 66–67, and 233–236 of the manuscript, we appreciate the opportunity to clarify further. In practice, a large number of VRP variants, including capacitated VRP, multi-depot VRP, and VRP with time windows, are inherently structured around depots. Therefore, despite its limited applicability to non-depot problems like TSP, we believe the proposed strategy remains broadly relevant and useful for a wide range of real-world routing scenarios. Moreover, the results in Table 5 show that even in depot-free scenarios like TSP, our overall hybrid balance framework without Depot-Guided Inference can still improve performance.

W3: Constraints of GFlowNet Solvers

As acknowledged by the reviewer, the performance of our framework could indeed be constrained by the quality of the underlying GFlowNet solvers. This limitation is explicitly noted in the Conclusion (Lines 316–317) of our manuscript. However, our method is designed to be general and has demonstrated its ability to enhance different types of GFlowNet-based backbones, such as GFACS (which learns to search) and AGFN (which learns to construct). We believe it remains applicable for improving future, stronger GFlowNet backbones as well.

Comment

Thanks for your responses that address all my concerns.

Comment

We appreciate the reviewer for acknowledging our response and supporting our work.

Review (Rating: 4)

The paper proposes a novel learning-based approach for solving Vehicle Routing Problems and builds on GFlowNet-based methods for finding a distribution over potential solutions. While existing approaches usually optimize toward a global trajectory constraint (Trajectory Balance), the paper extends this formulation with a local inter-state constraint called Detailed Balance. The authors claim that TB alone often shows limited capability in local optimization, while DB on its own fails at global optimization, which is an integral part of Vehicle Routing Problems. The authors propose a method that includes both optimization objectives and show improved results on standard benchmarks.

Strengths and Weaknesses

Strengths

  • The overall presentation of the methodology is technically sound and the authors provide many details of the proposed methodology.
  • The results in Table 1 indicate an improvement over the previous approaches (AGFN, GFACS).
  • The authors provide an insightful ablation study for the balancing strategies and the component contributions.

Weaknesses

  • The Detailed Balance constraint needs to be better motivated. It is unclear why there is a problem with TB and why the local optimization objective should help for this global optimization problem. Usually an insightful example from the test data can help with that and further strengthen the story of the paper, since DB is the main contribution of the paper.
  • The authors decided to have the Related Works section in the Appendix. As a reader who is not entirely aware of this branch of research, needs to directly jump to the appendix before reading the full paper. I’d suggest shortening the introduction and the method section to have the related work section in the main paper.
  • Some parts of the methodology section are hard to understand. For example, Equation 2 contains an energy term that is first discussed two pages later. According to Equation 9, the energy can be negative if R is larger than the average R. Is this an intended behaviour?
  • Most of the approach is adapted from previous works (trajectory generation, TB training) and the contribution is mainly to extend the objective with the local DB constraint for this specific vehicle routing task. I believe that this approach could also serve other combinatorial optimization problems.

Questions

Please comment on the weaknesses.

Limitations

Please discuss limitations in more detail in the Appendix.

Final Justification

All in all, the rebuttal addressed some of my concerns and I’d increase my rating to borderline accept. I hesitate to increase the rating to accept due to W4.

Formatting Concerns

No concerns

Author Response

We sincerely thank the reviewer for the thoughtful and constructive feedback. We are grateful for the reviewer’s kind comments on the technical soundness of our methodology, the clarity and thoroughness of our presentation, the performance improvements shown in Table 1 over strong baselines such as AGFN and GFACS, and the usefulness of our ablation studies in analyzing the effects of balancing strategies and individual components.

Below, we provide detailed responses to the questions and concerns raised.

W1: DB Motivation

To provide a concrete illustration, consider a long vehicle routing trajectory constructed incrementally as A → B → C → D → E → ... → U → V → W → X → Y → Z.

Assume this complete route yields a high total cost (bad performance), primarily due to suboptimal decisions made in the early stages, such as traversing a high-cost edge from node C to node D. In contrast, the latter portion of the tour (e.g., from node U to Z) may follow a more cost-effective and well-structured pattern.

Under Trajectory Balance (TB), the final reward is determined by the overall trajectory cost, and is proportionally assigned to all transitions. Consequently, even high-quality local transitions, such as W → X → Y, may receive weak or misleading training signals simply because they are embedded in a globally suboptimal trajectory. This would hinder the model's ability to learn and reinforce desirable local patterns.

On the other hand, Detailed Balance (DB) operates at a step-wise granularity, evaluating the expected outcomes of individual transitions. For instance, at node W, DB can assess whether transitioning to X leads to better outcomes compared to alternative choices like Z, regardless of earlier suboptimal steps. This localized and reward-sensitive feedback enables the model to more accurately learn local quality from global performance, and promotes stronger learning signals for valuable decisions even within imperfect trajectories.

This example illustrates a core limitation of Trajectory Balance (TB) in long-horizon combinatorial tasks like VRP: when the overall trajectory is suboptimal, TB lacks the ability to identify and preserve well-structured local segments within it. As a result, valuable local patterns may be overlooked or penalized. By incorporating Detailed Balance (DB) into the training objective, we address this limitation by providing a fine-grained, step-level signal that helps isolate and reinforce high-quality local decisions, even when the global trajectory does not perform well.

We sincerely thank the reviewer for raising this important point. Motivated by your suggestion, we will include this example and discussion in the revised paper to better illustrate the motivation behind integrating TB and DB. Furthermore, as shown in Section 3.2 and Table 3, our ablation study confirms that combining TB with DB (i.e., Hybrid Balance) consistently leads to improved performance, validating the complementary strengths of Hybrid Balance objectives.

W2: Related Work

We fully acknowledge the reviewer’s concern regarding the placement of the Related Work section in the Appendix. In the original submission, this decision was made due to strict space constraints, i.e., we prioritized providing a thorough and transparent explanation of our methodology, which we felt was critical for understanding the technical contributions. However, we are open to the suggestion and sincerely thank the reviewer for raising this point, i.e., having the Related Work in the main body could greatly benefit readers, especially those less familiar with the research area.

W3: Equation

We sincerely thank the reviewer for this helpful comment. We hadn’t realized that the explanation of the energy term appears nearly two pages after its introduction in Equation 2, and we truly appreciate the reviewer for pointing this out. In the revised version, we will move the explanation closer to Equation 2 to improve clarity and make the flow of the methodology easier to follow.

Regarding the second point about whether the energy can be negative, we appreciate the opportunity to clarify this. Yes, this is indeed intended behavior. As is common in the VRP literature, such as POMO, GFACS, and AGFN, the energy is defined relative to a baseline (e.g., the average reward) and can take negative values. Moreover, the energy in Equation 9 appears in the denominator of the loss function (Equation 2), and a negative energy results in a smaller exponentiated value in the denominator, leading to a higher loss that penalizes relatively poor trajectories. We will clarify this behavior of the energy term more explicitly in the revised manuscript.

W4: Contribution and Extension

We thank the reviewer for the thoughtful comment. To the best of our knowledge, this work is the first to introduce the DB objective into vehicle routing problem (VRP), and the first to integrate TB and DB objectives within a unified GFlowNet framework. Achieving this integration is non-trivial, as it requires reconciling two fundamentally different types of training signals: global trajectory-level feedback from TB and localized stepwise credit assignment from DB. At the same time, it is necessary to ensure that the forward and backward dynamics remain consistent throughout training. This carefully unified design allows our method to effectively leverage both global and local learning signals, leading to improved training stability and performance across various VRP settings. To achieve this integration:

  • Leveraging the hybrid structure of VRP whose solution is comprised of multi-node and single-node sub-tours, we derive a general recurrence formulation (Eq. 4) for counting trajectory orders. This formulation serves as the mathematical foundation for integrating TB and DB in a consistent manner.
  • From this recurrence, we analytically derive a set of twelve equations (Eq. 13–24 in Appendix), which lead to the expressions for the backward probability of TB (Eq. 6) and DB (Eq. 7) within a consistent framework.
  • We also define forward probability composition (Eq. 3), where DB serves as the building block for TB, maintaining coherence throughout trajectory construction.
  • To unify flow representations, we compute DB flow from state embeddings and TB flow from final trajectories, interpreting DB flow as an intermediate form of TB flow.
  • Finally, based on hybrid balance, we derive an important structural property: only the depot node has multiple feasible predecessors, while customer nodes have unique ones. This motivates our Depot-Guided Inference, a targeted module that strengthens backward modeling and trajectory quality.

Regarding the potential for generalization to other combinatorial optimization (CO) problems, we agree that our approach holds strong promise beyond VRP. In fact, many existing neural methods in the CO literature, such as AGFN, begin with a focus on VRP, precisely because VRP presents a particularly challenging testbed due to its complex structure and rich constraints. Successfully addressing VRP often serves as a strong indication of a method's scalability and robustness. Building on this foundation, we plan to extend our HBG framework to other CO problems such as the Maximum Independent Set (MIS), the Knapsack Problem (KP), and Graph Coloring, which generally involve simpler or fewer constraints compared to VRP. Given the modularity of our framework and its basis in general GFlowNet principles, we are confident in its adaptability and effectiveness across a broader class of combinatorial optimization tasks.

Comment

Thank you for your detailed responses.

Regarding W1, I really appreciate your insightful example of the usefulness of DB constraints and your desire to add it as a motivation to a revised paper. The weakness should be resolved with these changes.

Regarding W2, this is not a factor influencing the final rating, as it is to some extent a matter of taste and there is indeed a trade off on how to spend the space and I’d suggest prioritizing the explanation for W1 instead of more related work discussion in the main paper.

Regarding W3, thanks for considering an earlier explanation in the paper and of the negative energy term.

Regarding W4, I really appreciate the contribution of the DB objective to the vehicle routing problem in the GFlowNet framework and see potential improvements beyond this specific problem. The contribution is still an extension of an existing approach with a new objective and all necessary components for the effective integration; however, many other components stay untouched, e.g., trajectory generation. Moreover, the DB objective is a rather general extension, and it appears to be a missed opportunity not to evaluate the objective in the context of other combinatorial optimization problems.

All in all, the rebuttal addressed some of my concerns and I’d increase my rating to borderline accept. I hesitate to increase the rating to accept due to W4.

Comment

We sincerely thank the reviewer for acknowledging our response and raising the score. We also greatly appreciate the opportunity to further clarify. Below, we provide our detailed responses to the concerns that the reviewer may still have.

Trajectory Generation Extension

We thank the reviewer for the suggestion. In our original paper, HBG is applied to two trajectory generation methods: GFACS, which performs trajectory optimization through Ant Colony Optimization, and AGFN, which follows an end-to-end decoding paradigm. For both HBG variants, we specifically designed a Depot-Guided Inference mechanism, which leverages an inherent property of our HBG framework: only the depot node has multiple feasible predecessors, while customer nodes have unique ones. This mechanism is described in Section 2.2.3, with corresponding ablation results presented in Section 3.2, Tables 2 and 4. Depot-Guided Inference is integrated on top of the existing decoding schemes of GFACS and AGFN.

Motivated by the reviewer’s suggestion, we took this opportunity to share our most recent progress, i.e., an extension that also represents one of our planned future directions. Specifically, we extended HBG to guide the 2-OPT local search for trajectory improvement (rather than for trajectory generation). In 2-OPT, two non-adjacent edges are removed and reconnected, and the subpath between them is reversed if this shortens the route. In our approach, HBG predicts the first edge for removal, and the second is chosen by maximizing the 2-OPT gain according to the heuristic rule. Experimental results on TSP are shown below, which indicate that our HBG under a new paradigm (i.e., trajectory improvement) can also work effectively.
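The move described above can be sketched as follows (a minimal pure-Python illustration with function names of our own choosing, not the authors' code; the first removed edge is supplied by the caller, where HBG would provide a learned prediction, and the second edge is chosen by maximizing the heuristic 2-OPT gain):

```python
def tour_length(tour, dist):
    """Total length of a closed tour given a distance matrix."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt_move(tour, dist, i):
    """Given the first removed edge (tour[i], tour[i+1]), pick the second edge
    (tour[j], tour[j+1]) maximizing the 2-OPT gain, then reverse the subpath
    between them if the gain is positive. Returns (new_tour, gain)."""
    n = len(tour)
    best_gain, best_j = 0.0, None
    # skip edges adjacent to the first one (they yield zero gain)
    for j in range(i + 2, n - (1 if i == 0 else 0)):
        a, b = tour[i], tour[i + 1]
        c, d = tour[j], tour[(j + 1) % n]
        gain = dist[a][b] + dist[c][d] - dist[a][c] - dist[b][d]
        if gain > best_gain:
            best_gain, best_j = gain, j
    if best_j is None:          # no improving reconnection exists
        return tour, 0.0
    new_tour = tour[:i + 1] + tour[i + 1:best_j + 1][::-1] + tour[best_j + 1:]
    return new_tour, best_gain
```

On a unit square with the crossing tour [0, 1, 2, 3], removing the edge (1, 2) and reconnecting uncrosses the tour and recovers the optimal length 4.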

Table E. Results of HBG-2OPT

"200 Obj." refers to the average objective value obtained on 200-node test instances; the interpretation is analogous for the other columns.

| Method | 200 Obj. | 200 Time(s) | 200 Gap(%) | 500 Obj. | 500 Time(s) | 500 Gap(%) | 1000 Obj. | 1000 Time(s) | 1000 Gap(%) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 2OPT | 11.59 | 0.46 | 9.13 | 18.27 | 2.54 | 12.09 | 26.06 | 7.71 | 14.90 |
| HBG-2OPT | 11.53 | 0.37 | 8.56 | 18.02 | 1.83 | 10.55 | 25.61 | 5.25 | 12.92 |

Extension of HBG to Other Combinatorial Optimization Problems

Thank you for the valuable suggestion. Our HBG framework can be readily extended to other combinatorial optimization problems, such as Maximum Independent Set, Knapsack Problem, and Graph Coloring, by redefining the state, action, reward, and input features for the TB and DB components. The core formulations such as forward probability, backward probability, flow structure, and GNN architecture remain unchanged. Below, we detail the problem-specific adaptations.

Maximum Independent Set: At step $t$, the state $s_t^i$ is a partial independent set of non-adjacent nodes, and the action adds a new node that is not connected to any node in $s_t^i$, preserving independence. The DB reward is defined as $R_i^{\text{DB}}(s_t^i \mid s_{t-1}^i) = -1$, while the TB reward is $R^{\text{TB}}(\tau_i) = -|\tau_i|$. Node features include in-degree and out-degree to capture structural roles, and edge features indicate direct connections to enforce exclusivity.

Knapsack Problem: At step $t$, the state $s_t^i$ is a set of selected items whose total weight remains within the knapsack capacity, and the action adds an unchosen item that preserves feasibility. The DB reward is defined as the negative marginal value loss, $R_i^{\text{DB}}(s_t^i \mid s_{t-1}^i) = \sum_{j \in s_{t-1}^i} v_j - \sum_{j \in s_t^i} v_j$, while the TB reward is the negative total value, $R^{\text{TB}}(\tau_i) = -\sum_{j \in \tau_i} v_j$. Node features include each item’s value, weight, and value-to-weight ratio, and edge features indicate whether two items jointly violate the capacity constraint, capturing pairwise feasibility.

Graph Coloring: At step $t$, the state $s_t^i$ is a partial coloring assignment that maps a subset of nodes $V_t \subseteq V$ to colors, while the action assigns a valid color to an uncolored node such that adjacent nodes receive different colors. The DB reward is defined as $R_i^{\text{DB}}(s_t^i \mid s_{t-1}^i) = 1$, with the option to penalize only when a new color is introduced. The TB reward is given by the total number of colors used: $R^{\text{TB}}(\tau_i) = c(\tau_i)$. Node features include in-degree and out-degree, reflecting each node’s connectivity, and edge features consist of a binary indicator for adjacency, which supports enforcement of valid coloring constraints.
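As a concrete illustration of how the step-wise DB rewards relate to the trajectory-level TB reward, a minimal sketch of the Knapsack rewards described above (function and variable names are ours, not from the paper's code):

```python
def knapsack_db_reward(values, prev_state, state):
    """Step-wise DB reward: negative marginal value gained by the new item,
    R_DB(s_t | s_{t-1}) = sum_{j in s_{t-1}} v_j - sum_{j in s_t} v_j."""
    return sum(values[j] for j in prev_state) - sum(values[j] for j in state)

def knapsack_tb_reward(values, trajectory):
    """Trajectory-level TB reward: negative total value of the selected items."""
    return -sum(values[j] for j in trajectory)
```

Starting from the empty set, the step-wise DB rewards sum telescopically to the TB reward, mirroring the step-wise/trajectory-level consistency the rebuttal discusses for VRP.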

Due to time constraints, we were unable to conduct full-scale experiments on these additional combinatorial optimization problems. However, based on the carefully tailored designs of state, action, reward, and input features for each task, we believe our HBG framework can be naturally and effectively extended to handle them. We sincerely thank you again for supporting our work and raising the score!

Comment

Thank you for the detailed explanation on the trajectory generation and potential other use cases.

Comment

We appreciate the reviewer for continued engagement and support throughout the process.

Review
5

The paper introduces Hybrid-Balance GFlowNet (HBG), a framework that unifies Trajectory Balance (TB) and Detailed Balance (DB) objectives for solving Vehicle Routing Problems. The authors argue that existing GFlowNet-based VRP solvers like AGFN and GFACS rely exclusively on TB, which captures global optimization but neglects local optimization signals. Their key insight is that DB can address local optimization while TB handles global objectives. The framework includes a depot-guided inference strategy that exploits the structural property that depot nodes have greater flexibility in successor selection compared to customer nodes. Experimental results on CVRP and TSP show consistent improvements when HBG is integrated into existing solvers.

Strengths and Weaknesses

Strengths:

The observation that existing GFlowNet VRP solvers focus exclusively on global optimization through TB while ignoring local patterns is valid and represents a genuine limitation.

The observation that depot nodes have multiple valid predecessor options during backward trajectory destruction, while customer nodes have deterministic predecessors is well-exploited.

The experimental evaluation covered multiple problem instances and baseline comparisons. The consistent improvements across both AGFN and GFACS demonstrate the generality of the approach.

Weaknesses:

The paper essentially combines two existing GFlowNet objectives (TB and DB). While the combination is sensible, the theoretical contribution is limited.

The relationship between the step-wise and trajectory-level probabilities could be better explained, particularly how the DB formulation maintains consistency with the overall GFlowNet framework.

Questions

The paper's treatment of the Graph Neural Network component lacks crucial implementation details and theoretical motivation. While Section 2.1 mentions incorporating a GNN module to 'effectively model complex relational structures of VRP instances,' several important aspects remain unclear:

  1. The paper cites a general GNN survey [45] but does not specify which GNN variant is employed (e.g., GCN, GraphSAGE, GAT, or a custom architecture). This omission makes reproduction difficult.

  2. How exactly does information propagate between nodes? What are the specific update functions for node and edge representations? The paper mentions processing features 'through multiple layers' but provides no details about the aggregation and update schemes.

  3. While k-nearest-neighbor sparsification is mentioned for scalability, what value of k is used? How is this parameter selected, and how does it affect performance across different problem sizes?

  4. The experiments section lacks any GNN-specific hyperparameters (number of layers, hidden dimensions, activation functions, normalization schemes). These details are essential for reproducibility and understanding the method's sensitivity to architectural choices.

Also, regarding the combination of TB and DB losses through simple summation (Equation 11): have the authors considered adaptive weights?

Limitations

yes

Final Justification

The authors have provided good answers to my questions, and I tend to increase my score.

Formatting Issues

NA

Author Response

We sincerely thank the reviewer for the insightful feedback, particularly the kind comments on our well-motivated, depot-aware design and effective experimental validation.

W1 & W2: Theoretical Contribution and the Consistency of DB with the Overall GFlowNet Framework

While TB and DB have been individually explored in prior work, ours is the first attempt to unify them within one consistent theoretical framework, which is non-trivial and challenging.

Specifically, the DB objective has never been applied to VRP before, and no prior work has explored the combination of TB and DB in GFlowNets. Formulating the DB objective in the VRP setting and achieving a coherent integration with the TB objective requires reconciling two fundamentally different types of training signals and carefully designing forward and backward dynamics that remain consistent throughout the training process.

Backward Probability. Leveraging the hybrid structure of VRP, whose solutions comprise multi-node and single-node sub-tours, we derive a general recurrence formulation for the count of trajectory orderings (Eq. (4)). This formulation serves as the foundation for unifying TB and DB under a consistent probabilistic framework. From this unifying recurrence, we analytically derive a set of twelve equations (Eq. (13)–(24)) in the Appendix, leading to the trajectory-level backward probability of TB (Eq. (6)) and the step-wise backward probability of DB (Eq. (7)). These two backward formulations are inherently consistent, as both are derived from the same recurrence relation in Eq. (4).

Forward Probability. We define the relationship between the step-wise (DB) and trajectory-level (TB) forward probabilities in Eq. (3), where the step-wise forward probability serves as the building block of the trajectory-level forward probability. This formulation ensures that the two are mathematically consistent within the unified framework.

Flow. We further clarify the definition of flow under joint training: the DB flow is computed from the embedding of the current state's node, while the TB flow is computed from the embeddings of the final trajectory's nodes. In essence, the DB flow can be seen as an intermediate form of the TB flow, capturing the flow dynamics at each decision step. This perspective reinforces the consistency of our unified GFlowNet framework.

Moreover, motivated by the structure of the hybrid balance, we derive an important structural property: only the depot node retains multiple feasible predecessors during reverse sampling. This insight leads us to design Depot-Guided Inference, which plays a key role in guiding trajectory generation.

Placing most of the derivations in the Appendix may have made our theoretical contribution less clear. To improve clarity, we will move the key parts to the main text in the revised version.

Overall, we believe our framework provides a novel and principled theoretical foundation for combining TB and DB objectives in GFlowNets in a consistent manner, particularly adapted to complex routing problems, and opens up new possibilities for general-purpose generative models in structured combinatorial domains.

Q1, Q2, Q4: GNN Details

As our GNN architecture follows the same design as those used in AGFN and GFACS to ensure a fair comparison, we initially omitted its detailed description due to space limitations. However, as the reviewer suggested, we include the GNN specifications below and will incorporate them into the revised version of the paper.

Our model employs a custom message-passing GNN architecture, widely used in the VRP domain, that jointly updates node and edge representations over multiple layers. At each layer $l$, the node embedding $\mathbf{h}_i^l$ of the $i$-th node and the edge embedding $\mathbf{e}_l^{ij}$ between the $i$-th and $j$-th nodes are updated via:

$$\mathbf{h}^{l+1}_i = \mathbf{h}^l_i + \mathrm{ACT}\left(\mathrm{BN}\left(\mathbf{W}^l_1 \mathbf{h}^l_i + \mathcal{A}\left(\sigma\left(\mathbf{e}_l^{ij}\right) \odot \mathbf{W}^l_2 \mathbf{h}^l_j\right)\right)\right)$$

$$\mathbf{e}_{l+1}^{ij} = \mathbf{e}_l^{ij} + \mathrm{ACT}\left(\mathrm{BN}\left(\mathbf{W}^l_3 \mathbf{e}_l^{ij} + \mathbf{W}^l_4 \mathbf{h}^l_i + \mathbf{W}^l_5 \mathbf{h}^l_j\right)\right)$$

Here, $\mathbf{W}_1^l, \ldots, \mathbf{W}_5^l$ are learnable parameters, $\mathcal{A}(\cdot)$ is the aggregation function (mean pooling in our case), $\sigma(\cdot)$ is the sigmoid activation that modulates attention over neighbors, and $\mathrm{ACT}$ denotes the SiLU activation. Graph normalization ($\mathrm{BN}$) is applied at each layer for stability. The number of layers is set to 12 for HBG-GFACS and 16 for HBG-AGFN, with hidden dimensions of 32 and 64, respectively.
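The layer updates above can be sketched in NumPy as follows (a minimal illustration of the stated equations, not the authors' implementation; the normalization step is omitted for brevity, and all names are ours):

```python
import numpy as np

def silu(x):
    """SiLU activation (ACT in the equations): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gnn_layer(h, e, W1, W2, W3, W4, W5):
    """One message-passing layer with residual node/edge updates.
    h: (n, d) node embeddings; e: (n, n, d) edge embeddings; Wk: (d, d).
    Normalization (BN) is skipped here to keep the sketch short."""
    # Gated messages from neighbors j -> i, mean-aggregated (A = mean pooling)
    msgs = sigmoid(e) * (h @ W2.T)[None, :, :]   # (n, n, d)
    agg = msgs.mean(axis=1)                      # (n, d)
    h_new = h + silu(h @ W1.T + agg)
    # Edge update mixes the edge state with both endpoint embeddings
    e_new = e + silu(e @ W3.T
                     + (h @ W4.T)[:, None, :]
                     + (h @ W5.T)[None, :, :])
    return h_new, e_new
```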

Q3: Sparsification Value

We follow AGFN and GFACS in setting the value of k for a fair comparison: k = |V|/5 for both HBG-GFACS and HBG-AGFN on CVRP, k = |V|/10 for HBG-GFACS on TSP, and k = |V|/5 for HBG-AGFN on TSP, where |V| denotes the number of nodes in the instance. Inspired by the reviewer’s comment, we further conducted an ablation study to evaluate how different choices of k affect performance across problem sizes, providing a more comprehensive assessment of our model. The results are presented in Tables A1 and A2 below. On CVRP, the original k setting consistently outperforms the other configurations. On TSP, HBG-GFACS achieves the best performance with its original k setting. Although HBG-AGFN with the original k slightly underperforms k = |V|/2 on 200-node TSP instances, it demonstrates superior results on 500- and 1000-node TSP instances and on CVRP instances of all scales. Overall, the results suggest that the best performance is typically achieved at the original k setting.
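A minimal NumPy sketch of the k-nearest-neighbor sparsification discussed above, with k = |V|/divisor (illustrative only; not the code used in AGFN or GFACS):

```python
import numpy as np

def knn_sparsify(coords, divisor=5):
    """Keep, for each node, only its k nearest neighbors (k = |V| // divisor).
    coords: (n, 2) node coordinates. Returns a boolean (n, n) adjacency mask."""
    n = len(coords)
    k = max(1, n // divisor)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)             # exclude self-loops
    nearest = np.argsort(dist, axis=1)[:, :k]  # k closest neighbors per node
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    mask[rows, nearest.ravel()] = True
    return mask
```

Note that the resulting mask is directed (node i keeping j does not imply j keeps i), which is one of the design choices such a sparsifier must make.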

Table A1. Comparison of Various Sparse Parameter k on CVRP

HBG-GFACS_2 refers to the HBG-GFACS model evaluated with k=|V|/2; the interpretation is analogous for the other entries. The term "origin" denotes the original hyperparameter setting used in our main experiments.

| CVRP (Gap%) | 200 | 500 | 1000 |
|---|---:|---:|---:|
| HBG-GFACS_2 | 19.26 | 16.27 | 12.69 |
| HBG-GFACS_5 (origin) | 16.48 | 13.53 | 10.61 |
| HBG-GFACS_8 | 17.90 | 13.83 | 11.02 |
| HBG-GFACS_10 | 20.43 | 17.88 | 14.90 |
| HBG-GFACS (local search)_2 | 2.13 | 3.41 | 3.46 |
| HBG-GFACS (local search)_5 (origin) | 1.96 | 2.81 | 2.75 |
| HBG-GFACS (local search)_8 | 2.13 | 3.36 | 3.51 |
| HBG-GFACS (local search)_10 | 3.07 | 4.03 | 4.77 |
| HBG-AGFN_2 | 12.62 | 13.36 | 14.78 |
| HBG-AGFN_5 (origin) | 9.95 | 10.44 | 9.34 |
| HBG-AGFN_8 | 11.55 | 13.27 | 12.69 |
| HBG-AGFN_10 | 12.98 | 14.50 | 14.06 |

Table A2. Comparison of Various Sparse Parameter k on TSP

| TSP (Gap%) | 200 | 500 | 1000 |
|---|---:|---:|---:|
| HBG-GFACS_5 | 24.91 | 64.91 | 108.11 |
| HBG-GFACS_10 (origin) | 19.40 | 48.41 | 75.93 |
| HBG-GFACS_15 | 23.11 | 58.22 | 94.53 |
| HBG-GFACS_20 | 31.61 | 66.91 | 107.35 |
| HBG-GFACS (local search)_5 | 2.39 | 7.61 | 10.92 |
| HBG-GFACS (local search)_10 (origin) | 1.50 | 4.60 | 7.67 |
| HBG-GFACS (local search)_15 | 1.72 | 5.36 | 10.53 |
| HBG-GFACS (local search)_20 | 3.01 | 6.87 | 11.58 |
| HBG-AGFN_2 | 11.11 | 19.33 | 29.54 |
| HBG-AGFN_5 (origin) | 11.45 | 14.05 | 18.47 |
| HBG-AGFN_8 | 10.55 | 16.07 | 24.57 |
| HBG-AGFN_10 | 10.83 | 17.85 | 30.29 |

Q5: Adaptive Weights

We sincerely thank the reviewer for the thoughtful suggestion regarding the use of adaptive weights in combining TB and DB losses. Motivated by this, we conducted additional experiments exploring the effect of different weighting strategies (i.e., $L_{\text{HB}} = L_{\text{TB}} + \lambda \cdot L_{\text{DB}}$), including adaptive weights (i.e., λ: 0.5→0, 1→0, 2→0, and so on) and fixed weights (i.e., λ: 0.5→0.5, 1→1, 2→2). The results are presented in Tables B1–B4 below. The findings suggest that a fixed 1:1 weighting between TB and DB yields the best performance for HBG-AGFN, while for HBG-GFACS this fixed ratio also provides the best balance between performance with and without local search.
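The weighting schemes compared in these experiments can be sketched as follows (a minimal illustration with names of our choosing; an "adaptive" schedule here means linearly interpolating λ between its endpoints over training):

```python
def hybrid_loss(l_tb, l_db, lam):
    """Combined loss L_HB = L_TB + lambda * L_DB."""
    return l_tb + lam * l_db

def linear_lambda(step, total_steps, lam_start, lam_end):
    """Linearly interpolate lambda from lam_start to lam_end over training;
    e.g. 0.5 -> 0 anneals the DB term away, while 1 -> 1 keeps a fixed 1:1 ratio."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)
```

A fixed ratio (e.g. 1→1) simply degenerates to a constant λ, so both the fixed and adaptive settings in Tables B1–B4 fit this single schedule.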

Table B1. Weights Analysis of HBG-AGFN on CVRP

"λ: 0.5→0" refers to changing the TB:DB loss weight ratio from 1:0.5 to 1:0 during training, with the TB weight fixed at 1. The interpretation of the other entries is analogous.

| Size (Gap%) \ λ | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 200 | 11.09 | 10.80 | 10.94 | 11.41 | 11.98 | 11.20 | 9.98 | 9.95 | 10.44 |
| 500 | 11.27 | 12.04 | 11.18 | 12.15 | 11.66 | 11.80 | 11.63 | 10.44 | 11.98 |
| 1000 | 10.89 | 10.61 | 9.89 | 11.03 | 10.84 | 10.98 | 10.87 | 9.34 | 10.08 |

Table B2. Weights Analysis of HBG-AGFN on TSP

| Size (Gap%) \ λ | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 200 | 10.26 | 10.42 | 10.49 | 11.48 | 11.26 | 10.66 | 10.08 | 10.45 | 10.36 |
| 500 | 15.15 | 15.40 | 16.13 | 16.87 | 16.09 | 15.79 | 14.91 | 14.05 | 15.64 |
| 1000 | 18.80 | 18.55 | 19.00 | 19.26 | 21.87 | 22.00 | 18.52 | 18.47 | 18.81 |

Table B3. Weights Analysis of HBG-GFACS on CVRP

| Size (Gap%) \ λ | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 200 | 14.05 | 19.29 | 17.37 | 17.72 | 18.47 | 19.08 | 16.42 | 16.48 | 18.87 |
| 500 | 11.78 | 17.66 | 15.38 | 15.28 | 16.35 | 15.71 | 13.57 | 13.53 | 15.49 |
| 1000 | 11.29 | 15.80 | 12.71 | 12.88 | 13.47 | 12.80 | 10.91 | 10.61 | 12.80 |
| 200 (local search) | 2.39 | 1.78 | 2.03 | 2.07 | 2.14 | 1.96 | 1.96 | 1.96 | 1.78 |
| 500 (local search) | 3.02 | 2.51 | 2.91 | 3.41 | 3.44 | 2.99 | 3.17 | 2.81 | 2.62 |
| 1000 (local search) | 2.82 | 2.27 | 2.82 | 3.31 | 3.53 | 2.75 | 2.45 | 2.75 | 2.40 |

Table B4. Weights Analysis of HBG-GFACS on TSP

| Size (Gap%) \ λ | 0.5→0 | 1→0 | 2→0 | 0→0.5 | 0→1 | 0→2 | 0.5→0.5 | 1→1 (origin) | 2→2 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 200 | 19.13 | 20.61 | 23.06 | 22.21 | 20.56 | 24.02 | 21.64 | 19.40 | 22.82 |
| 500 | 49.51 | 49.61 | 50.80 | 49.48 | 48.47 | 56.13 | 49.62 | 48.41 | 50.37 |
| 1000 | 72.30 | 80.19 | 90.50 | 79.37 | 80.39 | 96.36 | 77.45 | 75.93 | 87.23 |
| 200 (local search) | 1.57 | 1.49 | 1.59 | 1.59 | 1.62 | 1.82 | 1.54 | 1.50 | 1.62 |
| 500 (local search) | 4.81 | 4.65 | 5.02 | 4.88 | 4.73 | 5.18 | 4.73 | 4.68 | 5.12 |
| 1000 (local search) | 7.73 | 7.68 | 7.86 | 7.78 | 7.83 | 8.02 | 7.70 | 7.67 | 7.99 |
Comment

Dear Reviewer 4QUb,

We sincerely appreciate your time and effort in reviewing our paper, especially for raising questions that helped us improve the clarity. As the discussion period is closing soon, could you please kindly take a look at our response at your convenience, so that we have sufficient time to handle any remaining concerns. Thank you for your understanding and support.

Best regards,

Authors


Final Decision

This paper proposes Hybrid-Balance GFlowNet (HBG), which unifies Trajectory Balance (TB) and Detailed Balance (DB) into a single framework to jointly address global and local optimization in VRPs. While prior methods relied heavily on TB and often failed to capture local structural signals, this work introduces DB into the VRP domain for the first time and integrates it consistently with TB. The framework also incorporates a depot-guided inference strategy, which proves effective in depot-centric problems such as CVRP, while still yielding improvements for TSP.

The empirical evaluation is thorough: HBG is integrated into two distinct solvers, AGFN and GFACS, and demonstrates consistent performance improvements across multiple benchmarks. Detailed ablation studies and comparisons of weighting strategies further validate the effectiveness of the approach. While some reviewers raised concerns about the use of the term “adaptive” and the generality of the method, the authors provided substantial clarifications and additional experiments, which resolved the main issues.

Overall, the paper offers both theoretical novelty and practical performance gains. It makes a solid and well-validated contribution that is suitable for NeurIPS.