PaperHub
6.7/10
Oral · 3 reviewers
Ratings: 7 / 6 / 7 (min 6, max 7, std 0.5)
Confidence: 3.7
Soundness: 3.0
Contribution: 3.0
Presentation: 3.0
NeurIPS 2024

SeeA*: Efficient Exploration-Enhanced A* Search by Selective Sampling

OpenReview · PDF
Submitted: 2024-05-01 · Updated: 2024-12-19
TL;DR

SeeA* is proposed to incorporate exploration into A* search by introducing a dynamic candidate set.

Abstract

Monte-Carlo tree search (MCTS) and reinforcement learning contributed crucially to the success of AlphaGo and AlphaZero, and A$^*$ is a tree search algorithm among the most well-known ones in the classical AI literature. MCTS and A$^*$ both perform heuristic search and are mutually beneficial. Efforts have been made to the renaissance of A$^*$ from three possible aspects, two of which have been confirmed by studies in recent years, while the third is about the OPEN list that consists of open nodes of A$^*$ search, but still lacks deep investigation. This paper aims at the third, i.e., developing the Sampling-exploration enhanced A$^*$ (SeeA$^*$) search by constructing a dynamic subset of OPEN through a selective sampling process, such that the node with the best heuristic value in this subset instead of in the OPEN is expanded. Nodes with the best heuristic values in OPEN are most probably picked into this subset, but sometimes may not be included, which enables SeeA$^*$ to explore other promising branches. Three sampling techniques are presented for comparative investigations. Moreover, under the assumption about the distribution of prediction errors, we have theoretically shown the superior efficiency of SeeA$^*$ over A$^*$ search, particularly when the accuracy of the guiding heuristic function is insufficient. Experimental results on retrosynthetic planning in organic chemistry, logic synthesis in integrated circuit design, and the classical Sokoban game empirically demonstrate the efficiency of SeeA$^*$, in comparison with the state-of-the-art heuristic search algorithms.
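The selection rule described in the abstract can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation; all function names are hypothetical. Uniform sampling is shown here; with $K \to \infty$ it recovers best-first A*, and with $K = 1$ it becomes random selection.

```python
import random

def see_a_star(start, expand, f, K, is_goal, max_expansions=10_000, rng=random):
    """Sketch of the SeeA* expansion loop (hypothetical names).

    Instead of always expanding the best node in OPEN (plain A*),
    sample a candidate subset D of size K from OPEN and expand the
    node with the best f value inside D.
    """
    open_list = [start]
    for _ in range(max_expansions):
        if not open_list:
            return None  # OPEN exhausted without reaching a goal
        # Selective sampling: draw a candidate subset D of size K from OPEN.
        candidates = rng.sample(open_list, min(K, len(open_list)))
        # Best-first expansion restricted to the sampled subset: the best
        # node in OPEN is usually picked, but not always, which enables
        # exploring other promising branches.
        node = min(candidates, key=f)
        open_list.remove(node)
        if is_goal(node):
            return node
        open_list.extend(expand(node))
    return None
```

On a trivial chain of states this behaves exactly like A*; the two strategies only diverge once OPEN holds competing branches.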
Keywords
search algorithm · reinforcement learning · exploration

Reviews and Discussion

Review (Rating: 7)

This paper proposes a novel search strategy, SeeA*. SeeA* employs a selective sampling process to screen a dynamic candidate subset based on an additional strategy. Under certain assumptions, SeeA* theoretically has better efficiency than A* search when SeeA* uses the uniform selection strategy and the heuristic value function deviates substantially from the true state value function. In experiments, SeeA* outperforms state-of-the-art heuristic search algorithms in terms of problem-solving success rate and solution quality while maintaining a low number of node expansions.

Strengths

  • Under certain assumptions, SeeA* theoretically has better efficiency than A* search.

  • Experiments on two diverse real-world applications in chemistry and circuit design, as well as one puzzle-solving game, demonstrate the efficiency of SeeA*.

Weaknesses

  • Lots of information is in the appendix rather than the main text. Some parts, such as the algorithm pseudocode and the results for different parameters in Section N, should be moved to the main text.

  • Since this paper is about a traditional search algorithm and the learning part is not very clear, it may not be suitable for NeurIPS.

Questions

Line 116: The functions $\tau$ and $c$ should be defined.

Figure 1: I don't understand why this figure lists $n$ with $10^4$, because $n_{10^4}$ won't be explored by either algorithm. Naive A* will expand $n_{200}$ at most.

Line 152: I would suggest moving the algorithm part from the appendix into the main paper.

Line 158: When using uniform sampling, $K \to \infty$ means SeeA* = A*, and $K \to 1$ means SeeA* = random search.

Equation 52: The equation contains $N$ rather than $N_o$, and what is lowercase $k$?

Line 190: PUCT should be pUCT, and $Q$ is not defined. What are the estimated $f$ values? Since $f = g + h$, which could be calculated directly, why do we need to estimate it?

Lines 201-203: I don't understand, is $f^*$ sampled from a Gaussian and $f^*$ also sampled from a uniform distribution?

Line 215: $N_o$ is not defined.

Equation 8: We can get an optimal $K = -1/\ln p$. For large prediction error, $K \to 1$: random selection. For small prediction error, $K \to \infty$: A*.

Table 1: Why is Uniform slower than Cluster, given that Uniform has fewer expansions and Cluster needs an additional clustering process? Tables 1 and 2 should indicate whether lower or higher values are better.

Lines 14-16: "Theoretically establish the superior efficiency of SeeA* over A*" should mention that the theoretical analysis is based on specific assumptions.

I'm also expecting an experiment on path search, since this method is highly related to A*.

Limitations

The strategy parameters are still hand-crafted, which may result in less dynamism when the test case distribution changes.

Author Response

Thank you for your valuable feedback. We will revise the paper accordingly and define symbols clearly. The key algorithms will be moved to the main paper. We hope that our response addresses your concerns.

Q1: Why does Figure 1 list $n$ with $10^4$?

A1: Setting the number of nodes to $10^4$ in the non-optimal branch is intended to demonstrate that even if the path is quite poor ($f = 10^4 \ll f^*$), A* still spends significant effort expanding nodes along that branch.

Q2: Is $f^*$ sampled from a Gaussian and $f^*$ also sampled from a uniform distribution?

A2: In our assumptions, the $f^*$ value is the ground-truth evaluation, and $f$ aims to accurately predict $f^*$. The prediction error of $f$ with respect to $f^*$ follows a uniform distribution. The distribution of $f^*$ values over all non-optimal nodes is Gaussian. We will revise the original description to clarify the expression.

Q3: Why is Uniform slower than Cluster, given that Uniform has fewer expansions and Cluster needs an additional clustering process?

A3: There are primarily three factors. First, competitive learning reduces the time required for clustering: one simply assigns each new node to the nearest center, without an iterative process or re-clustering. Second, the time required to expand each node varies. In retrosynthesis planning, expanding a node uses the RDKit package to reconstruct potential chemical reactions from the top 50 reaction templates selected by the policy network. The number of chemical reactions corresponding to each template varies, resulting in differences in computation time. Third, there is a penalty for unsolved testing samples: if a test sample fails to yield a feasible solution, its expansion count is set to 500 and its runtime to 600 seconds when calculating the mean performance listed in the table. The runtime penalty is more severe than the expansion penalty. Given the higher success rate of Cluster sampling compared to Uniform sampling, the discrepancy between running time and the number of expansions is possible.
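The competitive-learning assignment mentioned in the first point can be sketched as follows. This is a hypothetical illustration (names and the learning rate are assumptions, not the paper's code): each new node's feature vector is assigned to its nearest cluster center, and only that winning center is nudged, so no iterative re-clustering of previously assigned nodes is needed.

```python
import math

def assign_node(centers, x, lr=0.05):
    """One competitive-learning step (illustrative sketch).

    Assign feature vector x to the nearest center (the "winner"),
    then move only that winner slightly toward x. Previously assigned
    nodes are never revisited, which keeps the per-expansion cost low.
    """
    winner = min(range(len(centers)), key=lambda i: math.dist(centers[i], x))
    centers[winner] = [c + lr * (xi - c) for c, xi in zip(centers[winner], x)]
    return winner
```

For example, with centers at the origin and at (10, 10), a node at (1, 1) is assigned to the first center, which drifts a little toward it.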

Q4: Experiment on path search since this method is highly related to A*.

A4: Experiments on pathfinding problems are conducted using an existing heuristic function $h$ for guidance. The pathfinding problem is to find the shortest collision-free path from a starting point to a destination. The cost of each step is 1, $g$ is the number of steps taken to reach the current position, and the heuristic $h$ is the Euclidean distance from the current position to the target position, which reliably guides the A* search. The 100 robotic motion planning problems from [1] are used to test performance. Under the guidance of the same reliable $h$, both A* and SeeA* find the optimal solution for all testing cases, with an average length of 400. The number of expansions of SeeA* ($K=5$) with uniform sampling is 33283.21, slightly fewer than A*'s 33340.52. To validate the superiority of SeeA*, an unreliable heuristic function $\hat{h}$ is employed, which is randomly sampled from $[0, 2\times h]$. During the search process of A*, the node with the smallest $\hat{f} = g + \hat{h}$ is expanded. In this situation, the average solution length of A* is 691.1, much longer than SeeA*'s 438.4. Moreover, A* requires 50281.28 expansions, significantly more than the 32847.26 expansions needed by SeeA*. Therefore, guided by an unreliable heuristic function, SeeA* finds a better solution than A* with fewer expansions, demonstrating its superiority.

[1] Bhardwaj, Mohak, Sanjiban Choudhury, and Sebastian Scherer. "Learning heuristic search via imitation." Conference on Robot Learning 2017.
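The unreliable heuristic $\hat{h}$ used in this experiment can be sketched as a small wrapper. This is a hypothetical helper mirroring the construction described above, assuming $h$ returns a non-negative cost estimate: $\hat{h}(n)$ is drawn uniformly from $[0, 2h(n)]$, so it is unbiased in expectation but very noisy.

```python
import random

def make_unreliable(h, rng=None):
    """Degrade a reliable heuristic h into h_hat(n) ~ Uniform[0, 2*h(n)].

    E[h_hat(n)] = h(n), so the corruption adds noise without bias;
    individual estimates can still be wildly off.
    """
    rng = rng or random.Random()
    def h_hat(n):
        return rng.uniform(0.0, 2.0 * h(n))
    return h_hat
```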

Q5: This paper is about a traditional search algorithm and the learning part is not very clear.

A5: We believe that our paper is suitable for NeurIPS. MCTS, with the help of deep learning, contributed crucially to the success of AlphaZero. Deep learning is also able to contribute to a renaissance of A*, and three possible aspects are addressed with a family of possible improvements [2]. The first and most straightforward aspect is estimating $f$ with the help of deep learning, which brings current studies on A*, including this paper, into the era of learning-aided A*. The second aspect is seeking a better estimate of $f$, such as scouting before expanding the current node to collect future information that revises the $f$ of the current node, which played a crucial role in the success of AlphaGo. The third aspect concerns selecting nodes from the OPEN list, which consists of the open nodes of A*. It is an old tune even in the classical era of A*, but it has seldom been investigated. As shown in Appendices D, J, and L, a fully connected network, a convolutional network, and a graph network are utilized to estimate the $h$ function.

Moreover, the sampling strategies are related to learning the distribution of open nodes. Uniform sampling approximates the distribution based on frequencies. Clustering sampling is akin to using a Gaussian mixture model to learn the distribution of open nodes, where each cluster is a Gaussian. Competitive learning is adopted to assign nodes in order to save computing resources. The candidate nodes are sampled from the learned distribution.

What's more, several papers focusing on enhancements to search algorithms have been published at past NeurIPS conferences [3,4,5,6].

[2] Xu, Lei. "Deep bidirectional intelligence: AlphaZero, deep IA-search, deep IA-infer, and TPC causal learning." Applied Informatics 2018.

[3] Orseau, Laurent, et al. "Single-agent policy tree search with guarantees." NeurIPS 2018.

[4] Sokota, Samuel, et al. "Monte Carlo tree search with iteratively refining state abstractions." NeurIPS 2021.

[5] Xiao, Chenjun, et al. "Maximum entropy Monte-Carlo planning." NeurIPS 2019.

[6] Painter, Michael, et al. "Monte Carlo tree search with Boltzmann exploration." NeurIPS 2023.

Comment

Q6: What are the estimated $f$ values? Since $f = g + h$, which could be calculated directly, why do we need to estimate it?

A6: $f(n)$ is the evaluation of a node $n$, calculated as $g(n) + h(n)$. $g(n)$ is the accumulated cost from the starting node to $n$, which is obtained during the interaction process. $h(n)$ is the expected future cost from $n$ to termination, which is unknown and needs to be estimated by a heuristic function. We will revise this sentence to avoid potential misunderstandings.

Q7: Line 116: The functions $\mathcal{T}$ and $c$ should be defined.

A7: $\mathcal{T}$ is the state transition function defined in a Markov decision process, used to obtain the next state $s_{t+1}$ when taking action $a_t$ at state $s_t$. $c$ is the cost function, which gives the cost received when taking action $a_t$ at state $s_t$. We will revise the paper to provide precise definitions.

Q8: Line 190: PUCT should be pUCT, and $Q$ is not defined.

A8: pUCT is the summation of Q and U, where Q is the average value of the child nodes, and U is the exploration bonus. The definition of Q will be provided in more detail in the revised paper.

Q9: Line 215: $N_o$ is not defined.

A9: $N_o$ is the number of nodes in the open set $\mathcal{O}$. The definition of $N_o$ will be included in the revised paper.

Q10: "Theoretically establish the superior efficiency of SeeA* over A*" should mention the theoretical analysis based on specific assumptions.

A10: Thank you for pointing that out. We will revise our paper accordingly.

Q11: $N$ and $k$ in Equation 52.

A11: Thank you for pointing that out. $N$ and $k$ should correspond to $N_o$ and $K$ in the main text. We will revise Equation 52 to maintain consistency in the symbols used.

Q12: Tables 1 and 2 should indicate whether lower or higher values are better.

A12: Thank you for your suggestion. We will indicate in Tables 1 and 2 whether larger or smaller values are preferred.

Q13: When using uniform sampling, $K \to \infty$ means SeeA* = A*, and $K \to 1$ means SeeA* = random search.

A13: We agree with your viewpoint. When $K \to \infty$, all open nodes are selected as candidate nodes, and SeeA* degenerates back to A*: the expanded node is chosen in a best-first manner, relying entirely on exploitation of the $f$ function. When $K \to 1$, only one node is selected as the candidate node and hence is guaranteed to be expanded: the expanded node is determined by the exploration of the sampling strategy. Therefore, an appropriate value of $K$ ensures that SeeA* is a combination of A* search and random search, achieving a balance between exploitation and exploration.

Q14: Equation 8: We can get an optimal $K = -1/\log p$. For large prediction error, $K \to 1$: random selection. For small prediction error, $K \to \infty$: A*.

A14: We agree with your viewpoint. Equation 8 reaches its maximum value when $K = -1/\log p$, at which point SeeA* performs optimally. For large prediction errors, $p$ approaches 0, and SeeA* degrades to random selection by setting $K = 1$. For small prediction errors, $p$ approaches 1, and SeeA* is the same as A* by setting $K = \infty$. Intuitively, if the prediction error is sufficiently small, then best-first A* search is optimal. If the prediction error is significantly large, decisions based on $f$ values are likely to be misleading, and random sampling may perform better in this case. We will add this analysis to the main text. Thank you for your suggestion.
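The closed-form optimum discussed above can be computed directly. A tiny sketch (the function name and the clipping to $K \ge 1$ are added assumptions; the formula itself is the one from the exchange):

```python
import math

def optimal_k(p):
    """K* = -1 / ln(p), the K maximizing Equation 8 per the discussion above.

    p is the probability that the optimal node's predicted f beats a
    single non-optimal node's. The result is clipped to K >= 1, since
    the candidate set must contain at least one node.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return max(1.0, -1.0 / math.log(p))
```

For an accurate heuristic ($p = 0.99$) this suggests $K \approx 99$, i.e., nearly best-first A*; for a poor one ($p = 0.1$) it clips to $K = 1$, i.e., random selection.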

Q15: The strategy parameters are still hand-crafted, which may result in less dynamism when the test case distribution changes.

A15: At present, the hyperparameters of algorithms indeed require manual design, but the performance is robust against different hyperparameter settings. The automated design to adapt to dynamically changing environments deserves more research efforts in the future.

Comment

Thank you for the responses to all my questions. I have no other question.

Comment

Dear Reviewer KUFf:

Thank you very much for your response! We are glad to hear that your concerns have been addressed. The feedback is very helpful, and we will revise our next version of the paper accordingly. Thank you.

Best regards!

Review (Rating: 6)

The paper introduces a method for prioritizing nodes for expansion during heuristic search that builds on A* search. However, instead of selecting the node with the lowest cost in OPEN, it samples a subset of OPEN and selects the node with the lowest cost from that subset. The sampling procedure uses a uniform, a clustering, and a UCT approach. Results show that this sample-based approach performs better than regular A* search. Perhaps surprisingly, this includes the uniform sampling method.

Strengths

The paper gives a good motivation for when sampling a subset of the OPEN list may be advantageous; specifically, when the heuristic function contains significant inaccuracies. The results also show the consistent superiority of this approach over A* search.

Weaknesses

I wonder what the results would be in an environment

Line 123: Step 3 also occurs if a node is associated with a state in CLOSED but is found via a shorter path.

Questions

Is the clustering strategy susceptible to collapsing to a single cluster? Since the initialization is purely random, it seems like this could be the case. Is there any empirical evidence to suggest this does or does not happen?

How is the heuristic function for Sokoban trained?

It seems the uniform sampling method performs better than A* search. Is there any intuition as to why? For environments with large branching factors, where deep search trees are needed, and where shortest paths are sparse, the probability of sampling a subset of OPEN that contains nodes on a shortest path would be small. Would this not hurt performance for uniform?

Limitations

In the questions section, I describe a scenario in which random subsampling of OPEN could hurt performance. I am not sure whether this is definitely the case, but if it is, I would consider it a limitation.

Author Response

Thank you very much for your valuable feedback. We hope that our response addresses any concerns you may have.

Q1: Is the clustering strategy susceptible to collapsing to a single cluster?

A1: This scenario may indeed occur, and in such situations, running multiple iterations with varied initializations can be considered. Developing more efficient sampling strategies is a crucial future endeavor. In the practical applications outlined in the paper, this scenario occurs infrequently in general. It occurs more often when the number of clusters is relatively low, but it may be justifiable to obtain only one cluster in cases with fewer nodes in the search tree. For example, when the number of clusters is initialized at 5 in the retrosynthesis task, all generated nodes of 74 (out of 190) testing samples are assigned to the same cluster after the search terminates. The average number of node expansions to identify a feasible solution is 79.75. Looking into the details of those 74 samples, 53 found a feasible solution within 20 expansions, and 72 within 79.75 expansions. What's more, increasing the number of initial cluster centers and then keeping only the clusters that actually contain nodes is a potential way to avoid collapsing to a single cluster. For the above-mentioned retrosynthesis task, when the number of initial cluster centers is 20, the number of samples with only one cluster decreases from 74 to 41.

Q2: How is the heuristic function for Sokoban trained?

A2: Training details will be added to the appendix. The DeepCubeA paper provides 50,000 training Sokoban problems and 1,000 testing Sokoban problems. A* search guided by a manually designed heuristic is employed to find solutions for the training problems: $g$ is the number of steps to reach the current state, and $h$ is the sum of distances between the boxes and their destinations, plus the distance between the player and the nearest box. Under limited search time, 46,252 training problems are solved. For each collected trajectory $\{n_0^i, n_1^i, \cdots, n_t^i, \cdots, n_{T_i}^i\}$, the learning target for state $n_t^i$ is the number of steps from $n_t^i$ to the goal state $n_{T_i}^i$: $z(n_t^i) = T_i - t$. Mean squared error is employed as the loss function: $L(\theta) = \frac{\sum_i \sum_t (v(n_t^i;\theta) - z(n_t^i))^2}{\sum_i T_i}$. The Adam optimizer with a 0.0001 learning rate is used to update the parameters.
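The targets and loss described above can be sketched in plain Python. This is a formula-level illustration, not the actual training code; the function names and the list-of-trajectories bookkeeping are assumptions.

```python
def regression_targets(T):
    """Targets z(n_t) = T - t along a solved trajectory of T steps.

    A trajectory of T steps visits T + 1 states, so the list runs
    from z = T at the start state down to z = 0 at the goal.
    """
    return [T - t for t in range(T + 1)]

def mse_loss(preds_per_traj, targets_per_traj):
    """Squared error summed over all states, normalized by sum_i T_i,
    matching the rebuttal's loss formula."""
    num = sum(
        (p - z) ** 2
        for preds, targets in zip(preds_per_traj, targets_per_traj)
        for p, z in zip(preds, targets)
    )
    total_steps = sum(len(t) - 1 for t in targets_per_traj)  # T_i per trajectory
    return num / total_steps
```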

Q3: It seems the uniform sampling method performs better than A* search. Is there any intuition as to why? For environments with large branching factors, where deep search trees are needed, and where shortest paths are sparse, the probability of sampling a subset of OPEN that contains nodes on a shortest path would be small. Would this not hurt performance for uniform?

A3: SeeA* uniformly samples candidate nodes and expands the one with the lowest $f$ value. In this setting, each node, except a few with the worst $f$ values, has a probability of being expanded, and a node with a smaller $f$ value still has a larger expansion likelihood, as discussed in Appendix O. SeeA* improves exploration compared to A*, but expansion remains mainly concentrated on nodes with the best $f$ values. Therefore, SeeA* achieves a balance between exploration and exploitation, making it better than A*.

If the optimal node $n_1$ is expanded by A*, $f(n_1)$ must be less than the $f$ values of all $N_o$ open nodes, while SeeA* only needs to select $n_1$ from the $K$ candidate nodes, provided $n_1$ is included in the candidate set $\mathcal{D}$:

$P_{A^*}(n_1 \text{ is expanded}) = P(n_1 = \arg\min_{n\in\mathcal{O}} f(n))$

$P_{SeeA^*}(n_1 \text{ is expanded}) = P(n_1 = \arg\min_{n\in\mathcal{D}} f(n) \mid n_1\in\mathcal{D})\, P(n_1\in\mathcal{D})$

SeeA* outperforms A* if

$P(n_1\in\mathcal{D}) > P(n_1 = \arg\min_{n\in\mathcal{O}} f(n)) / P(n_1 = \arg\min_{n\in\mathcal{D}} f(n)) = p_{\sigma}^{N_o-1} / p_{\sigma}^{K-1} = p_{\sigma}^{N_o-K}.$

According to Corollary 4.2, the larger the prediction error $\sigma$, the lower the likelihood $p_\sigma$ that $f(n_1)$ is smaller than the $f$ value of a non-optimal node. $P_O = P(n_1 = \arg\min_{n\in\mathcal{O}} f(n))$ is the product of $N_o - 1$ probabilities, while $P_D = P(n_1 = \arg\min_{n\in\mathcal{D}} f(n))$ is the product of $K - 1$ probabilities. When $\sigma$ increases, $P_O$ decreases much faster than $P_D$, so the right side of the above inequality decreases. For uniform sampling, $P(n_1\in\mathcal{D}) = K/N_o$ is independent of $\sigma$. Therefore, even with uniform sampling, SeeA* can outperform A* when $\sigma$ is large enough.

In scenarios with large branching factors, the probability of sampling a candidate set containing the optimal node is lower. Taking uniform sampling as an example, $P(n_1\in\mathcal{D}) = K/N_o$ decreases as $N_o$ increases. However, the probability of A* selecting the optimal node also decreases significantly with $N_o$, because every node has some probability of appearing better than $n_1$. As presented in Equation 10, $H(N_o)$ approaches 1 as $N_o$ approaches infinity, ensuring that the inequality in Theorem 4.3 holds. Although SeeA*'s performance inevitably declines as $N_o$ grows, it continues to outperform A*.
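The sufficient condition in this response can be checked numerically. A small sketch (the function name and the sample parameter values are hypothetical, chosen only to illustrate the two regimes):

```python
def uniform_seea_wins(p_sigma, N_o, K):
    """Sufficient condition for uniform-sampling SeeA* to expand the
    optimal node more often than A*:

        K / N_o  >  p_sigma ** (N_o - K)

    where p_sigma is the probability that the optimal node's f beats a
    single non-optimal node's f (p_sigma falls as the error sigma grows).
    """
    return K / N_o > p_sigma ** (N_o - K)
```

With $N_o = 1000$ open nodes and $K = 5$ candidates, the condition holds for a noisy heuristic ($p_\sigma = 0.99$) but fails for a near-perfect one ($p_\sigma = 0.9999$), matching the intuition that SeeA*'s advantage comes from heuristic inaccuracy.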

Q4: I wonder what the results would be when Step 3 (Line 123) also occurs if a node is associated with a state in CLOSED but is found via a shorter path.

A4: If a node is associated with a state in CLOSED but is found via a shorter path, this node is expanded as in Line 123. Experiments are conducted in retrosynthesis planning. The performance is similar to the results in the paper, and SeeA* still outperforms A*, with higher success rates and shorter solution lengths.

| Algorithm | Solved | Length | Expansions |
| --- | --- | --- | --- |
| A* | 88.42% | 9.28 | 91.27 |
| SeeA*(Uniform) | 96.32% | 7.44 | 69.21 |
| SeeA*(Cluster) | 96.84% | 7.04 | 64.84 |
| SeeA*(UCT) | 98.95% | 6.36 | 56.87 |

Review (Rating: 7)

This work introduces a refined version of the A* search algorithm that integrates selective sampling to improve exploration and efficiency. The developed algorithm balances exploration and exploitation when heuristic guides are off the mark, with the help of three sampling strategies. It also outperforms traditional A* in both solution quality and computational efficiency, backed by practical tests.

Strengths

  1. The work uses multiple (3) sampling strategies to balance exploration and exploitation across diverse scenarios
  2. Good theoretical analysis showing that the developed algorithm is more efficient than traditional A* (when heuristic functions deviate from true state values)
  3. Validates the effectiveness of the algorithm through extensive experiments across multiple applications
  4. The algorithm can handle large and complex search spaces (backed by experiments)

Weaknesses

  1. Limited theoretical analysis for some sampling strategies
  2. Strong assumptions in the theoretical results
  3. Experiments represent a narrow spectrum
  4. Scalability in resource-constrained environments / large-scale systems is not discussed
  5. Insufficient detail on when SeeA* fails or performs suboptimally

Questions

  • In Alg 1, the termination condition “until O is empty” is not clear. For some challenging problems, if a solution may not exist, will SeeA* keep expanding nodes indefinitely? Consider adding a maximum iteration limit to guarantee termination?
  • The uniform sampling strategy discussed in Sec. 4.1.1 is clear and straightforward. However, the authors could discuss the potential drawbacks of this strategy, such as the possibility of selecting low-quality nodes, and how SeeA* addresses these drawbacks
  • The assumption in Corollary 4.2 that the prediction error for f* is uniformly distributed is quite strong. In practice, it is more likely non-uniform, e.g., Gaussian. More information on the sensitivity of the theoretical results to this assumption, and empirical validation of the distribution of prediction errors, would be helpful
  • NIT: The specific f and f* values in Figure 1 seem arbitrary. Do these values generalize broadly, or are they cherry-picked?
  • The authors could provide more intuition on the implications of Theorem 4.3. How does the result relate to the trade-off between exploration and exploitation in SeeA*, and how can it guide the selection of the sampling strategy or hyperparameters?
  • In Sec. 5.2, the authors compare SeeA* with various baselines, including MCTS [1]. However, it is mentioned that the MCTS in [1] did not utilize any guiding heuristics. It would be informative to compare SeeA* with an MCTS variant that uses the same guiding heuristics for a fair comparison
  • The hyperparameter sensitivity analysis in Sec. 5.4 provides insights on the performance of SeeA*. However, the analysis is limited to the retrosynthetic planning problem. Does this generalize to other two problem domains and beyond?
  • In the Sokoban experiment (Sec. 5.3), the authors compare SeeA* with several baselines, including DeepCubeA. However, the experimental setup for DeepCubeA is not clearly stated, making it difficult to assess the fairness of the comparison.

Reference: [1] Walter Lau Neto, Yingjie Li, Pierre-Emmanuel Gaillardon, and Cunxi Yu. Flowtune: End-to-end automatic logic optimization exploration via domain-specific multi-armed bandit. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022.

Limitations

  • The experiments focus solely on synthetic search problems. Initial results in a real-world domain would be valuable for practical applicability
  • Paper provides a theoretical comparison of SeeA* vs. A* for the uniform sampling strategy, but lacks analysis for the other proposed sampling strategies.
  • Paper only tests on problems that don’t have inaccurate heuristics. Testing on additional domains that have unreliable heuristics would be a good addition
Comment

Q9: The experimental setup for DeepCubeA is not clearly stated.

A9: The experimental setup for DeepCubeA is the same as in the original paper [5]. The number of nodes expanded at each step is 1, and the weight of the heuristic value is 0.8. The test samples for SeeA* are the same as those for DeepCubeA. More details will be added to the appendix.

[5] Agostinelli, Forest, et al. "Solving the Rubik’s cube with deep reinforcement learning and search." Nature Machine Intelligence 1.8 (2019): 356-363.

Q10: Paper provides theoretical comparison of SeeA* vs A* for uniform sampling strategy, but lacks analysis for other proposed sampling strategies.

A10: For computational simplicity, only uniform sampling was considered in the theoretical analysis. Based on Equation 7, the probability of expanding the optimal node $n_1$ by SeeA* is

$P_S(\sigma) = P(n_1\in\mathcal{D})\, P(n_1 = \arg\min_{n'\in\mathcal{D}} f(n') \mid n_1\in\mathcal{D})$

$P(n_1\in\mathcal{D}) = K/N_o$ for uniform sampling. The other two sampling strategies aim to achieve a higher $P(n_1\in\mathcal{D})$ than uniform sampling by constructing a more diverse candidate set, thereby enhancing the likelihood of expanding the optimal node. Uniform sampling approximates the distribution based on frequencies, while clustering sampling is akin to using a Gaussian mixture model to learn the distribution of open nodes, where each cluster is a Gaussian. The candidate nodes are sampled from the learned distribution. We will provide theoretical analysis of the other sampling strategies in the future.

Author Response

We thank the reviewer for the constructive comments and suggestions. We will revise our paper carefully. We hope our explanations below address your concerns.

Q1: Adding a maximum iteration limit to guarantee termination.

A1: Adding a limit to guarantee termination is necessary, and it is adopted in our experiments. In retrosynthesis planning, search algorithms are limited to a maximum of 500 policy calls or 10 minutes of running time, as mentioned in Line 271. In Sokoban, the search process is terminated if the running time exceeds 10 minutes. We will revise Algorithm 1 accordingly.

Q2: The potential drawbacks of uniform sampling strategy.

A2: The uniform sampling strategy is easy to implement, but $P(n_1\in\mathcal{D})$, the probability of selecting the optimal node $n_1$ into the candidate set $\mathcal{D}$, is relatively low, which directly impacts $P_S$, as shown in Equation 7. Therefore, additional strategies, clustering sampling and UCT-like sampling, are designed to improve $P(n_1\in\mathcal{D})$ by increasing the diversity of the selected nodes and avoiding excessive concentration on a few branches. If more information is available besides the $f$ evaluation, superior strategies can be designed, such as a specialized policy model.

Q3: The assumption in Corollary 4.2 is quite strong. Prediction error is more likely non-uniform like Gaussians.

A3: Thank you for your suggestion. We can prove that Corollary 4.2 also holds when noise follows a Gaussian distribution. Please refer to Global Rebuttal A1.

Q4: Do the f and f* values generalize broadly, or are they cherry-picked?

A4: A specific example is provided in Figure 1 to illustrate that A* may be trapped in a local optimum due to insufficient exploration, which is not uncommon given the prediction errors of guiding heuristics. Figures 7 and 9 in the appendix display the search trees of A* and SeeA* while solving a logic synthesis problem. A* exhibits an excessive concentration of expanded nodes in a particular non-optimal branch, the same as in Figure 1. The superior performance of SeeA* over A* also corroborates this assertion.

Q5: More intuition on the implications of Theorem 4.3.

A5: Thanks for your suggestion. More detailed discussions will be added to the paper. Please refer to Global Rebuttal A2.

Q6: It would be informative to compare SeeA* with an MCTS using the same guiding heuristics for a fair comparison.

A6: In Table 2, the results of MCTS guided by the same heuristics as SeeA* are displayed in the PV-MCTS row, which achieves a 19.5% ADP reduction, surpassing MCTS's 18.5% but falling short of SeeA*'s 23.5%. We will further clarify the distinction between the two in the revised paper.

Q7: The hyperparameter analysis is limited to the retrosynthetic planning problem. Does this generalize to other two problem domains and beyond?

A7: Ablation studies on Sokoban are provided in the appendix. As presented in Figure 11, the performance of SeeA* is stable across a wide range of $K$. As shown in Table 8, the stronger the exploration, the shorter the identified solution path and the greater the number of expansions required to find a feasible solution. Due to the relatively low difficulty of the testing examples in Sokoban, constructing an accurate value predictor is easier than in the other applications. Therefore, SeeA* is only slightly better than A*.

Ablation studies on logic synthesis are summarized below. The table reports the performance of SeeA* with uniform sampling for different candidate set sizes $K$. The performance is robust across different $K$, consistently outperforming A* ($K=\infty$).

| $K$ | 1 | 3 | 5 | 10 | 20 | 30 | 50 | $\infty$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ADP reduction (%) | 19.8 | 22.1 | 21.6 | 19.8 | 21.2 | 19.7 | 19.8 | 19.5 |

The performance for different $c_b$ with UCT-like sampling is as follows; it is also robust across different $c_b$. Enhanced exploration with a larger $c_b$ leads to superior performance but longer running time.

| $c_b$ | 0.5 | 1.0 | 1.38 | 1.5 |
| --- | --- | --- | --- | --- |
| ADP reduction (%) | 20.8 | 21.8 | 22.5 | 22.6 |

Q8: Paper only tests on problems that don’t have inaccurate heuristics. Testing on additional domains that have unreliable heuristics would be a good addition.

A8: Thank you for your suggestion. In the paper, two applications where obtaining accurate heuristics is challenging are considered. To illustrate the effectiveness of SeeA* on problems where an accurate heuristic exists but the guiding heuristic used is unreliable, experiments on pathfinding are conducted, where the goal is to find the shortest path from a starting point to a destination. The cost of each step is 1, $g$ is the number of steps taken to reach the current position, and $h$ is the Euclidean distance from the current position to the target position, which is a reliable guide for A* search. 100 robotic motion planning problems [4] are used to test the performance of A* and SeeA*. Under the guidance of the same reliable $h$, both A* and SeeA* find the optimal solutions for all testing cases, whose average length is 400. The number of expansions of SeeA* ($K=5$) with uniform sampling is 33283.21, slightly less than the 33340.52 of A*. To validate the superiority of SeeA*, an unreliable heuristic function $\hat{h}$ is employed, which is randomly sampled from $[0, 2\times h]$. During the search, nodes are evaluated by $\hat{f}=g+\hat{h}$. In this setting, the average solution length of A* is 691.1, much longer than SeeA*'s 438.4. Moreover, A* requires 50281.28 expansions, significantly more than the 32847.26 expansions needed by SeeA*. Therefore, guided by an unreliable heuristic, SeeA* finds a better solution than A* with fewer expansions, demonstrating its superiority.

[4] Bhardwaj, Mohak, Sanjiban Choudhury, and Sebastian Scherer. "Learning heuristic search via imitation." Conference on Robot Learning. PMLR, 2017.
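To make the uniform-sampling procedure used in this experiment concrete, here is a minimal, self-contained sketch of SeeA* node selection on a toy grid. This is not the paper's implementation: the grid size, seeds, and all function names below are illustrative, and the noisy heuristic drawn from $[0, 2\times h]$ mirrors the setup described above.

```python
import math
import random

def see_a_star(start, goal, neighbors, h, K=5, rng=None, max_expansions=100_000):
    """Sketch of SeeA* with uniform candidate sampling: instead of popping
    the globally best f = g + h node from OPEN (plain A*), draw up to K
    open nodes uniformly at random and expand the best of those K."""
    rng = rng or random.Random(0)
    open_list = [(start, 0.0)]          # entries are (node, g)
    parents = {start: None}
    best_g = {start: 0.0}
    for _ in range(max_expansions):
        if not open_list:
            return None
        # Uniformly sample up to K candidates from OPEN (K -> inf recovers A*).
        idx = rng.sample(range(len(open_list)), min(K, len(open_list)))
        best_i = min(idx, key=lambda i: open_list[i][1] + h(open_list[i][0]))
        node, g = open_list.pop(best_i)
        if g > best_g[node]:
            continue                    # stale entry superseded by a cheaper path
        if node == goal:                # reconstruct the path start -> goal
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nb in neighbors(node):
            g2 = g + 1.0                # unit step cost, as in the experiment
            if g2 < best_g.get(nb, math.inf):
                best_g[nb] = g2
                parents[nb] = node
                open_list.append((nb, g2))
    return None

# Toy 4-connected grid; the unreliable heuristic sampled from [0, 2h]
# mimics the pathfinding setting above.
W = 15
GOAL = (W - 1, W - 1)

def neighbors(p):
    x, y = p
    return [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= x + dx < W and 0 <= y + dy < W]

noise = random.Random(1)
def noisy_h(p):
    exact = math.hypot(GOAL[0] - p[0], GOAL[1] - p[1])
    return noise.uniform(0.0, 2.0 * exact)

path = see_a_star((0, 0), GOAL, neighbors, noisy_h, K=5)
print(len(path) - 1)  # number of steps on the found path
```

Passing `K` equal to the current open-list size at every step would recover plain A* up to tie-breaking, which is the comparison baseline above.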

Comment

Dear Authors,

Thank you for the detailed responses to all my questions. The responses to Q1, Q3, Q7 and Q9 align directionally with my comments and questions, thus addressing some of my questions. Therefore, I have increased my rating. I hope the authors can add relevant context in this rebuttal in the next iteration of this work.

Comment

Dear Reviewer 8E9W:

Thanks for your feedback! It really helped improve the quality of the paper. We will make sure to include the added results and discussion in the next version of this work. Thank you.

Best regards!

Author Response

A1: The assumption in Corollary 4.2 that the prediction error for $f^*$ is uniformly distributed is quite strong. To further illustrate the applicability of the algorithm, we also prove that Corollary 4.2 holds if the noise follows a Gaussian distribution. Denoting the Gaussian distribution as $\mathcal{G}(\cdot,\cdot)$, Assumption 4.1 becomes:

For each node $n$ on the optimal path, $f(n)\sim\mathcal{G}(\mu_0^f,\sigma^2)$. For nodes $n$ not on the optimal path, $f(n)\sim\mathcal{G}(f^*(n),\sigma^2)$, where the $f^*(n)$ are independently and identically sampled from $\mathcal{G}(\mu_1^f,\sigma^2_s)$. $\mu_0^f<\mu_1^f$ holds because the optimal path has a lower cost.

For two Gaussian distributions, we have the following lemma [1]:

Lemma 1: Assume $x\sim\mathcal{G}(\mu_1,\sigma_1^2)$ and $y\sim\mathcal{G}(\mu_2,\sigma_2^2)$. If $x$ and $y$ are independent of each other and $\mu_2>\mu_1$, then

$$P(x>y)=\frac{1}{\pi}\int_0^{\frac{\pi}{2}}\exp\left\{-\frac{1}{2}\frac{[(\mu_2-\mu_1)/\sqrt{\sigma_1^2+\sigma_2^2}]^2}{\cos^2\theta}\right\}d\theta.$$

For a node $n$ on the optimal path, $f(n)\sim\mathcal{G}(\mu_0^f,\sigma^2)$. For a node $n'$ off the optimal path, $f(n')\sim\mathcal{G}(f^*(n'),\sigma^2)$. If $\mu_0^f>f^*(n')$:

$$P(f(n)<f(n')\mid\mu_0^f>f^*(n'))=\frac{1}{\pi}\int_0^{\frac{\pi}{2}}\exp\left\{-\frac{1}{2}\frac{(f^*(n')-\mu_0^f)^2}{2\sigma^2\cos^2\theta}\right\}d\theta=m(f^*(n')\mid\sigma).$$

Otherwise:

$$P(f(n)<f(n')\mid\mu_0^f<f^*(n'))=1-\frac{1}{\pi}\int_0^{\frac{\pi}{2}}\exp\left\{-\frac{1}{2}\frac{(f^*(n')-\mu_0^f)^2}{2\sigma^2\cos^2\theta}\right\}d\theta=1-m(f^*(n')\mid\sigma).$$

$$F(\sigma)=P(f(n)<f(n')\mid\sigma)=\int_{f^*(n')<\mu_0^f}P(f^*(n'))\,m(f^*(n')\mid\sigma)\,df^*(n')+\int_{f^*(n')\geq\mu_0^f}P(f^*(n'))\,\big(1-m(f^*(n')\mid\sigma)\big)\,df^*(n').$$

If $\sigma_2>\sigma_1$:

$$F(\sigma_2)-F(\sigma_1)=\int_{f^*(n')<\mu_0^f}P(f^*(n'))\,\big(m(f^*(n')\mid\sigma_2)-m(f^*(n')\mid\sigma_1)\big)\,df^*(n')+\int_{f^*(n')\geq\mu_0^f}P(f^*(n'))\,\big(m(f^*(n')\mid\sigma_1)-m(f^*(n')\mid\sigma_2)\big)\,df^*(n').$$

$m(f^*(n')\mid\sigma)$ is symmetric about the axis $f^*(n')=\mu_0^f$, i.e., $m(f^*(n')\mid\sigma)=m(2\mu_0^f-f^*(n')\mid\sigma)$. Therefore,

$$F(\sigma_2)-F(\sigma_1)=\int_{f^*(n')\geq\mu_0^f}\big(P(2\mu_0^f-f^*(n'))-P(f^*(n'))\big)\,\big(m(f^*(n')\mid\sigma_2)-m(f^*(n')\mid\sigma_1)\big)\,df^*(n').$$

By definition, $m$ is monotonically increasing with respect to $\sigma$, so $m(f^*(n')\mid\sigma_2)-m(f^*(n')\mid\sigma_1)>0$. Because $f^*(n')\sim\mathcal{G}(\mu_1^f,\sigma_s^2)$ and $\mu_0^f<\mu_1^f$, we have $P(2\mu_0^f-f^*(n'))-P(f^*(n'))<0$ when $f^*(n')\geq\mu_0^f$. Therefore, $F(\sigma_2)-F(\sigma_1)<0$, i.e., $P(f(n)<f(n')\mid\sigma)$ decreases as the prediction error $\sigma$ increases when the noise follows a Gaussian distribution. The above analyses will be added to the revised paper to further elucidate the impact of prediction errors. Under both the uniform and the Gaussian error distribution, the larger the prediction error, the lower the likelihood of selecting the optimal node.

[1] Xu, Lei, Pingfan Yan, and Tong Chang. "Algorithm cnneim-a and its mean complexity." Proc. of 2nd international conference on computers and applications. IEEE Press, Beijing. 1987.
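As a numerical sanity check on this derivation (not part of the paper), the integral form in Lemma 1 and the monotone decrease of $F(\sigma)=P(f(n)<f(n')\mid\sigma)$ can be verified by simulation; all function names and parameter values below are illustrative.

```python
import math
import random

def craig_q(a, steps=2000):
    """Gaussian Q-function via the integral form used in Lemma 1:
    Q(a) = (1/pi) * integral_0^{pi/2} exp(-a^2 / (2 cos^2 t)) dt."""
    h = (math.pi / 2) / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h                       # midpoint rule
        total += math.exp(-a * a / (2 * math.cos(t) ** 2))
    return total * h / math.pi

def prob_optimal_better(mu0, mu1, sigma, sigma_s, trials=100_000, seed=0):
    """Monte-Carlo estimate of F(sigma) = P(f(n) < f(n')) under the
    Gaussian form of Assumption 4.1: f(n) ~ G(mu0, sigma^2),
    f*(n') ~ G(mu1, sigma_s^2), f(n') ~ G(f*(n'), sigma^2)."""
    rng = random.Random(seed)
    hits = sum(
        rng.gauss(mu0, sigma) < rng.gauss(rng.gauss(mu1, sigma_s), sigma)
        for _ in range(trials)
    )
    return hits / trials

# Q(0) = 0.5 and Q(1) ~= 0.1587, matching the standard normal tail.
print(craig_q(0.0), craig_q(1.0))

# F(sigma) shrinks as the prediction error sigma grows, as derived above.
p_small = prob_optimal_better(mu0=1.0, mu1=2.0, sigma=0.2, sigma_s=0.5)
p_large = prob_optimal_better(mu0=1.0, mu1=2.0, sigma=2.0, sigma_s=0.5)
print(p_small, p_large)  # p_small > p_large
```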

A2: More intuition on the implications of Theorem 4.3 is provided. In Theorem 4.3, $p_{\sigma}$ is the probability that the $f$ value of an optimal node $n$ does not exceed that of a non-optimal node $n'$, i.e., $p_{\sigma}=P(f(n)\leq f(n')\mid\sigma)$. $P_S(\sigma)>P_A(\sigma)$ holds if and only if

$$p_{\sigma}<H(N_o,K),\qquad H(N_o,K)=\left(\frac{K}{N_o}\right)^{\frac{1}{N_o-K}}.$$

$p_{\sigma}$ decreases as the prediction error $\sigma$ increases. If $\sigma$ is quite small, then $p_{\sigma}$ approaches 1 and the inequality in Theorem 4.3 is unlikely to hold. In this case, A* can identify the optimal solution efficiently without the candidate sampling of SeeA*. If $\sigma$ is large, the estimated $f$ values are misleading, and the probability that the optimal node's $f$ value is the best among open nodes is low, possibly even lower than under random sampling. In this case, $p_{\sigma}$ is small and the inequality holds. SeeA* is thus more effective than A* when $f$ is inaccurate.

As the branching factors increase and the solution paths grow longer, the size $N_o$ of the open set grows. $H(N_o,K)$ monotonically increases with respect to $N_o$; as $N_o$ approaches infinity, $H(N_o,K)$ tends to 1 and the inequality holds. Intuitively, A* expands $n$ only if its $f$ value is the smallest among open nodes, and inaccurate predictions raise the likelihood of other nodes having smaller $f$ values. As $N_o$ increases, $f(n)$ is less likely to be the smallest, degrading the performance of A*. SeeA* reduces the number of nodes available for selection, resulting in better performance than A*.

The number of candidate nodes $K$ is a key hyperparameter balancing exploration and exploitation. In SeeA*, exploitation selects the node with the best $f$ value, as in A* search, while exploration uses a sampling strategy to create diverse candidate sets. If $K=1$, the selected node is determined entirely by the sampling strategy, and SeeA* reduces to random sampling. If $K\rightarrow\infty$, the candidate set coincides with the open set, and SeeA* degenerates into best-first A*. A smaller $K$ thus enhances the exploration of SeeA*. $H(N_o,K)$ increases with $K$, so if $K$ is very small, $H$ will be relatively small; to ensure the inequality holds, an appropriate value of $K$ should be chosen. Moreover, the probability of SeeA* expanding the optimal node, as defined in Equation 8, reaches its maximum at $K=-1/\log p_{\sigma}$, which increases with $p_{\sigma}$. For small $p_{\sigma}$, the optimal $K$ is the smallest value, 1; as $p_{\sigma}$ approaches 1, the optimal $K$ tends to $\infty$. The choice of $K$ is thus tied to the prediction error $\sigma$ of the heuristic function.
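These relationships can be made concrete by evaluating $H(N_o,K)$ and the maximizing $K$ directly from the formulas above. The script below is only an illustration; the specific $N_o$, $K$, and $p_{\sigma}$ values are not taken from the paper's experiments.

```python
import math

def H(N_o, K):
    """Threshold from Theorem 4.3: SeeA* expands the optimal node more
    often than A* iff p_sigma < H(N_o, K) = (K / N_o) ** (1 / (N_o - K))."""
    return (K / N_o) ** (1.0 / (N_o - K))

def optimal_K(p_sigma):
    """K maximizing the expansion probability of Equation 8,
    K = -1 / log(p_sigma), clipped below at the smallest size 1."""
    return max(1.0, -1.0 / math.log(p_sigma))

# H grows toward 1 as the open list grows, so the condition
# p_sigma < H(N_o, K) becomes easier to meet on large search trees.
for n_o in (20, 100, 1000):
    print(n_o, round(H(n_o, K=5), 4))

# A noisier heuristic (smaller p_sigma) calls for a smaller candidate set.
print(round(optimal_K(0.5), 2), round(optimal_K(0.99), 2))
```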

Final Decision

The work builds on two search algorithms, MCTS and A*, to introduce a new algorithm, SeeA*, which enhances A* with a flexible exploration-exploitation balance. Advantages of SeeA* over A* are justified theoretically and confirmed on real-world and synthetic search domains.

The reviewers seem to agree that the paper brings a significant contribution, is well written, and provides a convincing empirical evaluation supported by theoretical analysis. Some reviewers expressed concerns about strong assumptions in the theoretical analysis, but appear to have been convinced by the authors that the theoretical results also hold under more liberal assumptions. One reviewer wondered whether the paper fits well within NeurIPS, and the authors justified the appropriateness of the paper for the NeurIPS audience.

The paper introduces an important but simple to grasp enhancement to a broadly used algorithm with many applications in diverse areas of machine learning and artificial intelligence. I recommend the paper for an oral presentation at the conference.