PaperHub
Score: 7.8/10 · Poster · 4 reviewers (lowest 4, highest 5, standard deviation 0.4)
Ratings: 5, 5, 4, 5
Confidence: 3.8
Novelty: 2.5 · Quality: 3.0 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

A Markov Decision Process for Variable Selection in Branch & Bound

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-11-30
TL;DR

We enhance the potential for reinforcement learning applications in mixed-integer linear programming by modeling variable selection in Branch and Bound as a Markov decision process.

Abstract

Keywords
Mixed-integer linear programming · Branch and bound · Reinforcement learning · Markov decision process

Reviews and Discussion

Official Review
Rating: 5

This paper addresses the problem of learning branching policies for solving Mixed-Integer Linear Programs (MILPs) using reinforcement learning. The authors first identify a key theoretical limitation in prior work, arguing that it is not a true Markov Decision Process and is incompatible with standard multi-step RL algorithms. The paper’s main contribution is the introduction of a new, principled formulation called the Branch and Bound Markov Decision Process (BBMDP). This formalization as a standard MDP allows the authors to leverage a broader range of well-established RL techniques, such as k-step temporal difference learning. Finally, the authors present computational experiments demonstrating that an agent trained with their BBMDP formulation outperforms previous state-of-the-art RL agents on several standard MILP benchmarks.

Strengths and Weaknesses

Strengths:

  • A Principled Formulation. The paper’s primary strength is the introduction of the “Branch & Bound Markov Decision Process” (BBMDP). This is a significant contribution, advancing the field from “MDP-inspired” but technically flawed frameworks (e.g., TreeMDP) to a principled, standard MDP formulation that is theoretically sound.
  • Insightful Analysis. The paper provides a clear and insightful diagnosis of a core issue with prior approaches in Section 3.2, “Misconceptions in Learning to Explore the B&B Tree.” This sharp analysis of how the global incumbent breaks subproblem independence is a valuable contribution in itself and provides strong motivation for the proposed framework.
  • Comprehensive Performance Validation. The experimental results in Section 4 provide strong validation for the proposed framework. The DQN-BBMDP agent consistently outperforms the prior state-of-the-art RL agent (notably DQN-Retro) across all four standard benchmarks, on both in-distribution test instances and larger transfer instances, as shown in Table 1.

Weaknesses: The main weakness of this paper is the insufficient experimental evidence for some of the claims made in its theoretical section:

  1. Ablation study may not be fully comprehensive. By removing only one component at a time, it does not explore potential interaction effects. It is therefore difficult to ascertain whether the benefits of the different components are purely additive or if they have synergistic relationships that are not captured by the current experimental design.
  2. A significant weakness of this work is its strict reliance on a Depth-First Search (DFS) node selection strategy. This becomes problematic as the ablation study appears to show that performance can be slightly better without the DFS constraint, despite it being theoretically necessary for the Markov property. This apparent contradiction is not sufficiently addressed in the main text and warrants a more detailed discussion to reconcile the theoretical foundation with the empirical evidence.
  3. This paper argues that the BBMDP framework enables the application of a wide range of standard RL algorithms. However, this central claim is not fully substantiated by the experiments, which are limited to a single, well-tuned DQN agent. The paper would be more convincing if it demonstrated the framework’s efficacy with other classes of RL algorithms (e.g., policy gradient methods) to validate this claim of generality.

Questions

Please see the Weaknesses part for my main questions. Besides, there are several minor questions:

  1. What is the computational and memory overhead of maintaining the full B&B tree state in BBMDP compared to TreeMDP’s simpler node-based representation, particularly for large problem instances?
  2. How does the k-step TD performance of BBMDP compare to TreeMDP across a more comprehensive range of k values (e.g., k=2, 4, 5)?
  3. How sensitive is the performance gain from HL-Gauss loss to its hyperparameters across different problem types, and were these values tuned specifically for the benchmarks used?

Limitations

Yes

Formatting Issues

N/A

Comment
  1. We thank the reviewer for prompting us to broaden the discussion. First, we would like to emphasize, as noted by the reviewer, that our contribution is primarily theoretical in nature: our goal is to develop a principled MDP formulation for variable selection in B&B that supersedes TreeMDP. Accordingly, our experimental focus was to demonstrate the performance gains of DQN-BBMDP relative to prior RL baselines from the literature, all of which were trained within the TreeMDP framework. That said, we agree that exploring the implementation of a broader range of RL algorithms within the BBMDP framework is a promising direction for future research. In particular, we believe that a specific class of policy iteration algorithms —namely, MCTS-based approaches from the AlphaZero family— could prove especially effective in the B&B setting. [2] In fact, in the context of complex combinatorial board games, the integration of value and policy networks with lookahead search via Monte Carlo Tree Search (MCTS) has been shown to guide action selection towards high-value states, effectively mitigating exploration and credit assignment issues arising from the sparse-reward environment setting. As learning variable selection in B&B is faced with the same predicaments that arise in combinatorial board games, namely sparse and delayed rewards, long planning horizons and highly structured combinatorial state-action spaces, we believe that adapting MCTS-based algorithms to the B&B setting represents a promising avenue for advancing RL-based branching strategies. Interestingly, by providing a principled formulation of variable selection in B&B, the BBMDP framework naturally enables the derivation of a policy improvement operator via MCTS. In contrast, as highlighted on line 651 in Appendix H, TreeMDP is fundamentally incompatible with MCTS-based planning. In fact, by discarding the temporal perspective of B&B, TreeMDP introduces inconsistencies that prevent search algorithms such as MCTS from being effectively adapted to variable selection, thereby hindering the integration of model-based planning techniques into the B&B framework.

We now turn to addressing the remaining questions raised by the reviewer.

  1. Importantly, BBMDP does not introduce any additional computational or memory overhead beyond what is already required by the MILP solver itself. In fact, regardless of the MDP formalism used to model the branching decision process, the underlying solver (in our case, SCIP) must keep track of all open nodes in the B&B tree to ensure both correctness and effectiveness of the solving process, independently of whether this information is ultimately utilized by the branching agent. Moreover, as discussed in lines 231-234 of the paper, under DFS the optimal policy can be retrieved simply from the information associated with the current B&B node, by acting greedily with respect to the optimal $\bar{Q}$-function. Hence, to ensure a fair and rigorous comparison with prior IL and RL baselines, our DQN-BBMDP agent retains the same observation function and neural network architecture as used in previous works.

  2. Unfortunately, we only trained our DQN-TreeMDP and DQN-BBMDP baselines for values of the $k$-step return in $\{1, 3\}$. Despite the shortness of the discussion period and the substantial training time of our RL agents, we will do our best to include computational results for a more comprehensive range of values of $k$ by the end of the author-reviewer discussion period. In any case, we would be happy to include these additional results in the final version of the paper, should the review process permit it.

  3. The HL-Gauss loss is defined by a total of four parameters. The first two, $z_{min}$ and $z_{max}$, specify the support of the value function. These are not standard tunable hyperparameters, as they are tightly constrained by the B&B tree sizes encountered during both training and testing. As discussed in lines 594-601 of the paper, we chose $z_{min} = -1$, $z_{max} = 16$ to cover the range of observed tree sizes in the Ecole benchmark. Moreover, we fixed the number of histogram bins to $m_b = 18$ in order to maintain unit-width bins. Finally, the standard deviation was set to $\sigma = 0.75$, following the choice of Farebrother et al. This corresponds to spreading the Gaussian probability mass over approximately 6 bins, a choice that proved robust across different benchmarks in their study. Overall, none of these parameters required particular tuning for HL-Gauss to achieve strong performance.
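For readers unfamiliar with the HL-Gauss construction, the following is a minimal illustrative sketch (our own, not the authors' code) of how a scalar value target is converted into a histogram target distribution with the parameters quoted above, following Farebrother et al.; the value network's predicted bin logits are then trained with cross-entropy against this distribution instead of a squared-error regression loss.

```python
# Illustrative sketch of HL-Gauss target construction (not the authors' code).
# A scalar target y is smeared into a categorical distribution by integrating
# a Gaussian N(y, sigma^2) over each histogram bin.
import numpy as np
from scipy.stats import norm

z_min, z_max, m_b, sigma = -1.0, 16.0, 18, 0.75
edges = np.linspace(z_min, z_max, m_b + 1)   # m_b bins over the value support

def hl_gauss_target(y: float) -> np.ndarray:
    """Categorical target obtained by integrating N(y, sigma^2) over each bin."""
    cdf = norm.cdf(edges, loc=y, scale=sigma)
    probs = np.diff(cdf)
    return probs / probs.sum()               # renormalise mass clipped outside the support

target = hl_gauss_target(7.0)                # e.g. a hypothetical scalar value target of 7
assert target.shape == (m_b,) and np.isclose(target.sum(), 1.0)
```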

[2] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., ... & Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140-1144.

Comment

Thanks for the rebuttal. The added ablation study effectively addresses my main concern by isolating the benefits of the BBMDP formulation itself. The authors' responses regarding the limitations are also satisfactory. My concerns have been resolved, and I will keep my score.

Comment

We thank the reviewer for their valuable time and feedback. We start by addressing the main concerns raised by the reviewer.

  1. We emphasize that our DQN-BBMDP implementation differs from the DQN-TreeMDP baseline in only two aspects: the use of the BBMDP formulation and the adoption of the HL-Gauss loss. Moreover, as illustrated in Figure 3, TreeMDP and BBMDP induce equivalent learning updates when trained following 1-step temporal difference targets. Hence, to disentangle the contribution of each innovation, we report results for agents trained with both $k=1$ and $k=3$ step returns.
| k-step | DQN-BBMDP (BBMDP ✅; HL-Gauss ✅) | DQN-TreeMDP (BBMDP ❌; HL-Gauss ❌) | TreeMDP + HL-Gauss (BBMDP ❌; HL-Gauss ✅) | BBMDP w/o HL-Gauss (BBMDP ✅; HL-Gauss ❌) |
|---|---|---|---|---|
| k = 1 | 158.9 | 175.8 (+10%) | 158.9 (+0%) | 175.8 (+10%) |
| k = 3 | 152.3 (–4%) | 178.9 (+13%) | 162.1 (+2%) | 172.3 (+8%) |

We believe this extended ablation study now exhaustively covers all meaningful combinations of DQN-BBMDP and DQN-TreeMDP. As discussed in Section 4.2, these results support two key conclusions: (1) the majority of the performance gains in DQN-BBMDP stem from the use of the HL-Gauss loss; and (2) when training over multi-step rollout trajectories ($k>1$), the BBMDP formulation leads to further performance improvements over the $k=1$ baseline, whereas TreeMDP results in a performance degradation relative to its own $k=1$ baseline. This observation is consistent with the claims made in lines 49-62 of the paper.

  2. We thank the reviewer for raising their concern. We agree that the reliance on DFS is a limitation of our current approach, as DFS is known to be suboptimal compared to the sophisticated node selection heuristics implemented in MILP solvers. Nevertheless, we believe two observations help put this limitation into perspective.
  • First, we stress that despite relying on DFS, our DQN-BBMDP agent clearly and consistently outperforms the DQN-Retro baseline, which was specifically designed to enable learning to branch under SCIP’s default node selection policy. As discussed in lines 620-624 in Appendix F, this provides compelling evidence that, although DFS is generally expected to hinder the training performance of RL agents, the theoretical guarantees brought by DFS in BBMDP enable it to surpass the prior state-of-the-art non-DFS agents. This observation is consistent with the findings of Etheve et al. [1], who showed that efficiently optimizing the branching policy has a substantially greater impact on B&B performance than optimizing the node selection policy.
  • Second, while the theoretical properties derived in Section 4.3 are specific to the DFS setting, we emphasize that the BBMDP formulation introduced in Section 3.1 remains valid under any node selection policy. This is in contrast to TreeMDP, which is not consistent outside of the DFS regime. Therefore, BBMDP provides a principled foundation for future work that seeks to integrate more elaborate node selection strategies, beyond DFS. Crucially, enabling such approaches will require the development of scalable observation functions capable of efficiently encoding the rich, structured information present in evolving B&B trees.

[1] Etheve, M. (2021). Solving repeated optimization problems by Machine Learning (Doctoral dissertation, HESAM Université).

Official Review
Rating: 5

The paper revisits reinforcement learning for branching in MILP solving. The paper proposes a new MDP formulation. This differs from the previous sota, TreeMDP, in that it is a standard MDP. This has theoretical advantages and potentially practical advantages. An experimental study on standard benchmark instances shows improved performance compared to previous RL approaches, and narrows the gap to the sota supervised learning approach for branching.

Strengths and Weaknesses

Attempts to use RL for variable selection in MILP branch-and-bound solving have not yet reached their potential. The paper argues well that in principle RL can surpass supervised learning, namely imitation learning from strong branching. A difficulty has been formulating an MDP for the RL task that has the Markov property. Etheve et al in 2020 made a not entirely satisfactory effort, which was formalised by Scavuzzo et al in 2022 as the TreeMDP, accompanied by a careful empirical study. This led to the sota RL approach of Parsonson et al also in 2022. A strength of the present submission is to compare its approach with all these earlier approaches.

The main contribution of the paper is a MDP formulation as a standard MDP, i.e. not with the tree property. This is an important step. The paper has some nice technical examples and counter-examples, and overall provides illumination into the branching-via-RL task.

Further, the empirical study follows Scavuzzo et al and is done well. The authors (re-)implement Etheve et al and Parsonson et al, and compare on four standard instance classes. Both test and transfer instances are presented, together with an ablation study. The results compare against strong branching, SCIP 8's default branching (although there is now SCIP 9), IL and IL with DFS. The proposed approach is better in general than previous RL approaches and comes closer than them to IL and/or SCIP.

The paper has clear exposition, and the appendices are helpful. There is sufficient information to reproduce the work.

This paper will be interesting for the MILP community. It will have interest for a subset of the ML community. There is not new RL theory here. Altogether this gives this reviewer a Significance score and an Originality score of fair. Explicitly, that is not a criticism of the technical quality. In sum, a valuable contribution to the interface of learning and optimisation in the context of MILP.

Questions

Can you consider the "real world" production scheduling instances from: Wang, J., Wang, Z., Li, X., Kuang, Y., Shi, Z., Zhu, F., Yuan, M., Zeng, J., Zhang, Y., Wu, F., 2024a. Learning to cut via hierarchical sequence/set model for efficient mixed-integer programming. IEEE Transactions on Pattern Analysis and Machine Intelligence 46, 9697–9713. doi:10.1109/TPAMI.2024.3432716.

Limitations

yes

Final Justification

The authors are thanked for providing a response to reviewers, which was read and does not lead to a change in rating.

Formatting Issues

none

Comment

We thank the reviewer for their valuable time and interest in our work. We carefully examined the datasets shared in the official implementation repository of Wang et al. (https://github.com/MIRALab-USTC/L2O-HEM-Torch) and found that the production planning instances referenced by the reviewer were, in fact, the only ones missing from the public release. If the reviewer is aware of an alternative source for obtaining this instance dataset, we would be grateful for their guidance.

Official Review
Rating: 4

This paper introduces a novel branch and bound Markov decision process (BBMDP) framework for combinatorial optimization in Mixed-Integer Linear Programming (MILP).
The authors formulate the process by modeling the traversal of B&B subtrees temporally, rather than structurally, and derive a corresponding Bellman equation that facilitates the use of reinforcement learning (RL) techniques.
They evaluate BBMDP on benchmark problems and demonstrate that it outperforms existing RL-based approaches, achieving reductions in both computation time and the number of B&B nodes explored.

Strengths and Weaknesses

Strengths:

  • Enables the use of RL frameworks within B&B solvers.
  • Models the B&B process as a temporal decision-making problem, allowing for the incorporation of time-dependent strategies (e.g., node selection with early stop).
  • BBMDP demonstrates strong empirical performance, outperforming other RL-based approaches in benchmark tests.

Weaknesses:

  1. The presentation of the work is hard to follow without a nomenclature. Since there are symbols from both B&B and RL, their definitions are hard to find in the text, making it difficult for readers to keep track of the notation.

  2. The overall structure of the paper could be improved for better readability. For example,

  • it would be helpful to include a summary of Table 5 in Appendix H, which compares BBMDP and TreeMDP. The absence of this summary leads to confusion regarding the key novelty of BBMDP.
  • the ordering of sections in the appendix feels extremely arbitrary, making it difficult to locate relevant details.
  • some concepts from the appendix are mentioned only once in the main text without clarification, which leaves readers puzzled. An example is the inclusion of the HL cross-entropy in line 251, where the motivation for introducing this concept is placed in the appendix rather than in the main text.
  • the rationale behind several design choices is not well explained. An example is the reward function being set to –2 for all transitions. It seems the authors aim to minimize the size of the B&B tree via this setting, but this intent is not clearly stated in the text.
  • a figure showing the workflow of BBMDP, similar as Figure 2, would greatly help readers' understanding.
  • the benchmarks lack a summary of problem difficulty levels, making it hard to assess the significance of the reported performance gains.
  3. A missing element throughout the manuscript is a clear explanation of why and how the choice of a temporal model was made.
  • What is the main motivation for the authors to use a temporal sequence model instead of a structural one?
  • While the authors argue that temporal modeling allows an MDP to be applied, this shift also introduces new challenges. Specifically, in BBMDP, branching decisions are no longer structurally constrained to child nodes but range over all future nodes in the domain, which increases the complexity of selecting the optimal next node. Intuitively, this potentially leads to higher variance in trajectories and longer training times. However, the authors have not discussed the impact of this added complexity, such as how training is performed.

Questions

  1. How are child nodes generated during the B&B process?

  2. Is the value of each state updated after every iteration? If so, does the framework allow for previously visited nodes to be revisited in future steps?

  3. The network architecture is described only briefly in the experimental setup (line 293), with little discussion of its flexibility. If the architecture can be chosen freely, it raises an important question: to what extent does the choice of architecture affect performance? This aspect remains unexplored in the paper.

Limitations

Yes

Final Justification

I rate the submission with a score of 4.

After my major concerns were partially addressed by the original submission and the other reviewers, I can see that the paper makes a clear contribution to RL in the B&B domain, which is significant and has high impact on the field (score of 5).
However, the clarity of the paper is insufficient in my view (-1):

  • many concepts are arbitrarily placed in the main text without justification
  • the reason for the choice of policy network is unexplained (ignoring the authors' rebuttal)

Therefore, I will rate 4 for this submission.

Formatting Issues

N/A

Comment
  • We appreciate the reviewer’s concern but believe it may stem from a misunderstanding. Reinforcement learning fundamentally addresses sequential decision-making problems, which are most commonly modeled as Markov Decision Processes (MDPs). In order to apply an RL algorithm to a given problem, it is essential to first define a corresponding MDP formulation that faithfully captures the problem's dynamics and objectives. Therefore, the central contribution associated with BBMDP is to restore the temporal MDP perspective that was explicitly set aside in the recent TreeMDP framework. In fact, as discussed in lines 255-256, by discarding the notions of temporality and sequentiality which are core to both MDPs and B&B, TreeMDP fails to describe variable selection accurately in the general case. This is exemplified in Figure 3, as applying the TreeMDP dynamics for $k$ consecutive steps produces B&B trees that do not reflect the true internal state of the B&B process. As a result, TreeMDP’s structural limitations fundamentally restrict the range of RL algorithms that can be effectively applied in B&B. In contrast, as discussed in lines 264-267 of the paper, the BBMDP framework makes it possible to leverage any traditional RL algorithm for the task of learning to branch in B&B, including algorithms that were not compatible with the previous TreeMDP framework, such as $k$-step temporal difference and TD($\lambda$), along with MCTS-based algorithms from the AlphaZero family.
  • We are unsure about the specific argument the reviewer intends to make, but we believe the confusion may arise from a conflation of the roles played by the branching and node selection heuristics. In both BBMDP and TreeMDP, the action set consists of the set of fractional variables in the LP relaxation of the current node, i.e. the branching candidates. In contrast, the choice of which node to expand next from the pool of open nodes is delegated to the node selection policy $\rho$, which operates independently of the branching decisions. As made explicit in lines 104–106 of the paper, our objective is to optimize the branching policy while keeping the node selection policy fixed. As in TreeMDP, we adopt $\rho$ = DFS as the node selection policy, in order to enable the decomposition of B&B episodes into independent subtree trajectories. This is thoroughly exposed in Section 3.2. This design choice helps mitigate credit assignment issues by reducing the typical length of episode trajectories to be logarithmic in the size of the whole B&B tree. Hence, the branching decision process itself is equivalent in both TreeMDP and BBMDP; in particular, there is no computational or memory overhead associated with the choice of BBMDP over TreeMDP as both frameworks are designed to optimize branching policies over subtree trajectories. It is possible that we may have misunderstood the reviewer’s concern. If so, we would be happy to address it in more detail, and we kindly invite the reviewer to reformulate their concern for clarification.
Comment

We now turn to addressing the questions explicitly raised by the reviewer.

  1. The B&B child node generation process is described in lines 91-95 of the paper. Given $v_t$, the current B&B node, and a variable $x_b$ (with $b \in \mathcal{I}$) whose value $\hat{x}_b$ is fractional in the current linear program relaxation, the B&B algorithm partitions the feasible region into two subproblems:
  • $P_{-}$, which enforces $x_b \leq \lfloor \hat{x}_b \rfloor$, and
  • $P_{+}$, which enforces $x_b \geq \lceil \hat{x}_b \rceil$.

The corresponding child nodes $(v_{-}, v_{+})$, associated with $(P_{-}, P_{+})$, are then added to the list of open nodes and become future candidates for selection by the node selection policy.

In our implementation, the creation of child nodes is handled internally by the SCIP solver. The role of the branching agent is limited to providing branching decisions $b \in \mathcal{I}$ via a callback, without altering SCIP’s native node generation mechanism.
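To illustrate where such a callback plugs into the solver, here is a minimal, hypothetical PySCIPOpt sketch (our own illustration, not the authors' Ecole-based implementation), assuming PySCIPOpt's Branchrule interface (getLPBranchCands, branchVar, includeBranchrule): the rule receives the fractional candidates of the current node and returns a branching variable, and SCIP creates the two child subproblems itself.

```python
# Hypothetical sketch of a SCIP branching callback via PySCIPOpt (not the authors' code).
from pyscipopt import Model, Branchrule, SCIP_RESULT


class AgentBranchrule(Branchrule):
    """Toy branching rule: pick a fractional candidate; SCIP then generates
    the children v- (x_b <= floor) and v+ (x_b >= ceil) described above."""

    def branchexeclp(self, allowaddcons):
        cands, sols, fracs, ncands, *_ = self.model.getLPBranchCands()
        if ncands == 0:
            return {"result": SCIP_RESULT.DIDNOTRUN}
        # Placeholder policy: most fractional variable. A learned agent would
        # instead score the candidates with its Q-network here.
        i = max(range(ncands), key=lambda j: min(fracs[j], 1.0 - fracs[j]))
        self.model.branchVar(cands[i])  # SCIP creates the two child nodes internally
        return {"result": SCIP_RESULT.BRANCHED}


model = Model()
# model.readProblem("instance.mps")  # load a MILP instance here
# Note: the authors additionally fix DFS node selection via SCIP's node
# selector priorities, which is omitted from this sketch.
model.includeBranchrule(AgentBranchrule(), "agent", "illustrative branching callback",
                        priority=1000000, maxdepth=-1, maxbounddist=1.0)
model.optimize()
```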

  2. We are unsure about the precise question raised by the reviewer. Specifically, what is meant by "iteration"? Is the reviewer referring to B&B iterations, or to iterations in the approximate dynamic programming procedure? Crucially, we would like to reiterate a key point stated in lines 79-81 of the paper: during the solving process, B&B nodes are explored sequentially. Once a node is visited, it is added to the set of visited nodes and not revisited again. This property is essential to ensure that the B&B algorithm terminates in a finite number of steps. Accordingly, both BBMDP and TreeMDP frameworks explicitly enforce this fundamental constraint in their respective formulations of the B&B process.
  3. We thank the reviewer for raising this question. Indeed, identifying the most effective network architecture for learning over MILPs (and CO problems in general) remains an open research question and has been the focus of several recent studies [1], [2], [3]. Specifically, in the context of learning to branch, as discussed in the related work section (lines 123-128), multiple contributions have explored architectural improvements for branching policies. However, we emphasize that this question is orthogonal to our contribution, which centers on developing a principled MDP formulation for variable selection in B&B. To ensure a fair comparison between DQN-BBMDP and prior IL and RL baselines, we deliberately chose to retain the same network architecture used in those earlier works. This choice allows us to attribute performance differences to the underlying learning framework rather than architectural settings.

[1] Schuetz, M. J., Brubaker, J. K., & Katzgraber, H. G. (2022). Combinatorial optimization with physics-inspired graph neural networks. Nature Machine Intelligence, 4(4), 367-377.

[2] Chen, Z., Liu, J., Wang, X., & Yin, W. On Representing Mixed-Integer Linear Programs by Graph Neural Networks. In The Eleventh International Conference on Learning Representations.

[3] Chen, Z., Liu, J., Chen, X., Wang, W., & Yin, W. (2024). Rethinking the capacity of graph neural networks for branching strategy. Advances in Neural Information Processing Systems, 37, 123991-124024.

Comment

After reading the rebuttal and examining the paper, my concerns regarding the structural and temporal components of the BBMDP RL framework have been addressed.
I recognize the point that BBMDP retains the structural context of the B&B tree and frontier, while making decisions across iterations.

I can raise my score accordingly.

Comment

We sincerely thank the reviewer for taking the time to engage with our work. We appreciate the feedback, and we provide below detailed responses to the concerns raised.

  1. We agree that clear and consistent notation is essential, particularly at the intersection of B&B and RL, where symbolic overload can easily arise. Throughout the paper, we have aimed to keep the notation as concise and consistent as possible. That said, we acknowledge that the inclusion of a dedicated nomenclature table could further improve readability. We would be happy to include a glossary in the revised version to help readers quickly reference key definitions from both the MILP and MDP frameworks.
  • We thank the reviewer for their suggestion. We agree that, given additional space, incorporating a summarized version of the side-by-side comparison between BBMDP and TreeMDP from Appendix H into the main body of the paper would help further clarify the key distinctions between the two frameworks. That said, we would like to emphasize that we have made a consistent effort to highlight these differences throughout the main text, for example, in lines 157–164, 183–186, as well as in Section 3.4 (BBMDP vs. TreeMDP), where Figure 3 provides a clear visual illustration of the differences in learning updates induced by the two frameworks.
  • We thank the reviewer for this helpful observation. We understand that navigating the appendix may be challenging in its current form. To improve clarity and accessibility, we would be happy to reorder the appendix sections to match the order in which they are first referenced in the main text. This reorganization should help ensure that the supplementary material is more naturally aligned with the reader’s progression through the paper.
  • We thank the reviewer for pointing this out. Indeed, the HL-Gauss cross-entropy is introduced in line 251 with limited context in the main body of the paper, which may leave readers unfamiliar with this technique seeking additional clarification. Our intent was to keep the main text focused on the core conceptual contribution, namely, the formal derivation of a MDP formulation of variable selection in B&B, while relegating implementation and technical details to the appendix. That said, we agree that, given additional space, the motivation behind this design choice would be more appropriately introduced in the main text.
  • We thank the reviewer for raising this question. As stated in lines 180–182 of the paper, our choice of $r=-2$ is purely cosmetic, intended to align the cumulative reward of a BBMDP episode with the actual size of the corresponding B&B tree, thereby simplifying theoretical interpretation. In fact, setting $r=-2$, $r=-1$, $r=-10$, or more generally $r=-k$ for any positive constant $k$ yields strictly equivalent optimal branching policies under the BBMDP formulation (a one-line check of this scale invariance is sketched after this list). If there are other specific modeling choices the reviewer believes should be clarified, we would be happy to elaborate further.
  • We thank the reviewer for raising their concern. However, we would like to clarify that the BBMDP workflow follows exactly the same structure as the B&B workflow depicted in Figure 2. As such, we believe that adding an additional figure would likely be redundant.
  • We thank the reviewer for raising this point. Unfortunately, there is no universal metric to quantify the intrinsic difficulty of MILP instances. Table 3 in Appendix A reports statistics such as the number of variables and constraints, which can offer some intuition but are not always reliable indicators of solving complexity. In practice, a more informative proxy for instance difficulty is the solving time required by a general-purpose MILP solver. In this regard, the performance of the SCIP baseline across the Ecole benchmark, as reported in Table 1, provides a reasonable empirical estimate of the relative difficulty of the instances considered.
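To make the scale-invariance claim above concrete, here is a one-line check written in our own illustrative notation ($T$ and $G^{\pi}$ are not symbols taken from the paper):

```latex
% Let T be the number of branching transitions in a BBMDP episode under policy \pi,
% so the resulting B&B tree has 2T + 1 nodes (each branching creates two children).
% With a constant per-transition reward r = -k, k > 0, the episode return is
\[
  G^{\pi} \;=\; \sum_{t=1}^{T} (-k) \;=\; -k\,T ,
\]
% hence
\[
  \arg\max_{\pi} \mathbb{E}\!\left[ G^{\pi} \right]
  \;=\; \arg\min_{\pi} \mathbb{E}\!\left[ T \right]
  \;=\; \arg\min_{\pi} \mathbb{E}\!\left[ 2T + 1 \right],
\]
% independently of k: every positive k induces the same tree-size-minimising optimal
% branching policy, and k = 2 merely makes |G^{\pi}| coincide (up to the root node)
% with the size of the B&B tree.
```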
Official Review
Rating: 5

The paper proposes BBMDP, a principled formulation of learning better branching policies in branch & bound (B&B). The main analytical and algorithmic contribution involves a) identifying an issue with prior RL-based approaches to learning to search in B&B and b) reformulating the optimization problem as a proper stochastic shortest path problem. Specifically, the approach seeks to learn a branching policy to find the shortest sequence of B&B trees (representing partial solutions during search) that lead to a solution. Doing so permits leveraging techniques from Q-function learning (i.e., Bellman equations). Experiments are conducted on standard MILP benchmarks. Results indicate that BBMDP performs strongly compared to prior work on RL-based solutions (e.g., TreeMDP) on both same-size test instances and larger-size transfer instances. However, BBMDP is unable to clearly outperform a strong baseline (IL).

Strengths and Weaknesses

Strengths

  • The paper studies an interesting and important problem. B&B algorithms are widely used for combinatorial optimization and any improvement here is likely to be of significant interest and to have a large impact.

  • The paper is well written. The problem formulation is clearly described. The description of prior work is very good.

  • The proposed approach is intuitively clear and mathematically rigorous. The optimization problem is well defined and the reformulation to stochastic shortest paths makes sense to me. The paper motivates its approach well by contrasting with prior work. As a result, it's easy to identify the main contributions, which are mostly analytical and algorithmic.

  • The experiments are carefully constructed on four common MILP benchmarks. Although BBMDP does not beat the IL baseline, it is able to outperform other RL agents, which is a nice result. The experiments and appendix include a lot of detail and the included code aids reproducibility.

  • Overall, I think the paper represents a solid contribution to the literature on learning search control knowledge for B&B algorithms.

Weaknesses

  • I wasn't able to find any major issues with the paper. A minor point is that a couple of important concepts need to be formally defined (see Q1). Overall, I think this is a nice contribution to the literature on principled RL approaches to improve B&B algorithms for MILPs.

  • (Not a weakness, more of an observation) Given the strength of the IL baseline and the importance of the HL-Gauss CE loss, it seems like learning a policy directly might be better / easier. However, PG-tMDP results suggests that might not be the case. Results seem mixed, at best. I was hoping the paper would explore this aspect in more detail and if / how popular policy gradient algorithms (PPO, GRPO, etc.) might fit in here. Including a discussion or empirical exploration of this aspect would be nice to have (but likely out of scope).

Questions

  1. What's the formal definition of $\bar{V}$ (Line 193)? Same question for "opposite".

  2. Would popular PG algorithms (PPO, GRPO) make a difference here? How challenging would it be to include them in the experiments?

Limitations

Yes.

Final Justification

After reading the other reviews, I continue to recommend acceptance of the paper, maintaining my current score.

Formatting Issues

None.

Comment

We thank the reviewer for their valuable time and interest in our work. We address the two questions raised by the reviewer in order.

  1. We use the term opposite in its mathematical sense, referring to the additive inverse: for any real number $x \in \mathbb{R}$, its opposite is $-x$. We hope this clarifies the definition of $\bar{V}^\pi$: $\bar{V}^\pi(s_t, o_i)$ is defined as the opposite of the size of $T(o_i)$, the B&B subtree rooted at node $o_i$, obtained when applying the branching policy $\pi$ and the node selection policy $\rho$ from state $s_t$. In our work, we show that under DFS, the value of $\bar{V}^\pi(s_t, o_i)$ in fact only depends on $o_i$ and $\bar{x}_{o_i}$, the incumbent value at the time when the node $o_i$ is processed by the B&B algorithm. Specifically, this time step is denoted $\tau_i$ in line 204 of the paper.
  2. We thank the reviewer for prompting us to broaden the discussion. As noted in our conclusion, we believe that one key reason for the relatively limited generalization capacity of DQN agents to higher-dimensional MILPs—compared to the IL baseline—is that their decisions are greedily derived from evaluating Q-values over the entire action set. Since Q-networks are trained on MILP instances with structurally lower associated B&B trees, DQN policies are more prone to suffer from out-of-distribution effects when evaluated on more complex instances. We therefore agree with the reviewer that exploring, beyond REINFORCE, more advanced policy gradient algorithms that directly learn branching policies is a promising research direction. In particular, we believe that a specific class of policy iteration algorithms —namely, MCTS-based approaches from the AlphaZero family— could prove especially effective in the B&B setting. [1] In fact, in the context of complex combinatorial board games, the integration of value and policy networks with lookahead search via Monte Carlo Tree Search (MCTS) has been shown to guide action selection towards high-value states, effectively mitigating exploration and credit assignment issues arising from the sparse-reward environment setting. As learning variable selection in B&B is faced with the same predicaments that arise in combinatorial board games, namely sparse and delayed rewards, long planning horizons and highly structured combinatorial state-action spaces, adapting MCTS-based algorithms to the B&B setting represents a promising avenue for advancing RL-based branching strategies. Interestingly, by providing a principled formulation of variable selection in B&B, the BBMDP framework naturally enables the derivation of a policy improvement operator via MCTS. In contrast, as highlighted on line 651 in Appendix H, TreeMDP is fundamentally incompatible with MCTS-based planning. In fact, by discarding the temporal perspective of B&B, TreeMDP introduces inconsistencies that prevent search algorithms such as MCTS from being effectively adapted to variable selection, thereby hindering the integration of model-based planning techniques into the B&B framework.

[1] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., ... & Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140-1144.

Comment

After reading the other reviews, I remain positive about the paper and would like to keep my score unchanged.

Comment

Dear reviewers,

It appears we misunderstood the timeline distinguishing the rebuttal period from the author-reviewer discussion period. In fact, we were under the impression that rebuttals could still be posted during the author-reviewer discussion period. Unfortunately, we have since realized that our rebuttals should have been submitted a few hours ago, and we sincerely apologize for any inconvenience this may have caused. We have taken the opportunity to post our rebuttals as official comments, and we kindly hope that you will consider them as our intended rebuttals, despite the delay. We emphasize that, in the interest of fairness, we have taken great care to strictly adhere to the 10,000-character limit imposed on all rebuttals. We sincerely apologize for the disruption, and look forward to constructive and insightful discussions on learning to branch in mixed-integer linear programming.

Sincerely, Authors

Comment

Dear Reviewers,

As the author-reviewer period nears its conclusion, we would like to take this opportunity to once again thank you for the time and effort you have dedicated to reviewing our work. Your feedback has been invaluable in highlighting clear avenues for improvement, which we will be glad to incorporate into the final version of the paper. We greatly appreciate your thoughtful engagement throughout this process.

Sincerely, Authors

Final Decision

The paper introduces B&B-MDPs (or BBMDPs), a Markov Decision Process formulation for variable selection in branch & bound strategies for MILPs. In particular, it addresses key shortcomings of prior RL-based approaches, such as TreeMDP, by leveraging a temporal MDP reformulation of the selection problem and applying a reinforcement learning algorithm to solve it. The authors also provide empirical results on four standard MILP benchmarks, showing that BBMDP outperforms previous RL approaches and narrows the gap to existing imitation learning baselines.

The review team was generally positive towards the paper, with four reviewers ultimately supporting acceptance. The initial reports highlighted that the idea is novel and well motivated. The paper is also nicely written, with a convincing empirical validation showing comparable or better performance than existing RL agents for variable selection (including the highly cited TreeMDP). There were a few concerns about the clarity of exposition, completeness of the ablation study, and the reliance on depth-first search (DFS) as a simplifying assumption, given modern MILP solvers leverage variations of strong branching and other sophisticated techniques.

The rebuttal period, however, revealed some procedural challenges. Authors submitted their rebuttal late, and reviewers were instructed to disregard it. Despite this issue, most reviewers found their concerns adequately addressed by the original submission, the other reports, and the subsequent discussion, and agreed that these limitations do not undermine the significance of the contribution. Specifically, Reviewer Qj2v emphasized that the ablation study clarified the benefits of the BBMDP formulation, while Reviewer euoY, despite initial reservations about clarity and structure, raised their score after recognizing the framework’s significance. Reviewer P4AA praised the careful positioning of the work within the literature and the robustness of the empirical comparisons, and Reviewer 1t91 found the paper to be a solid and rigorous contribution with clear impact on learning to search in MILPs.

Overall, the consensus is that the paper represents a valuable step forward at the intersection of optimization and reinforcement learning, and merits acceptance at NeurIPS. The review team was particularly engaged and excited about this paper.