Rating: 5.5 / 10 · Rejected (4 reviewers; min 5, max 6, std 0.5)
Individual ratings: 6, 6, 5, 5
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.0 · Presentation: 2.5
ICLR 2025

What Matters in Hierarchical Search for Combinatorial Reasoning Problems?

OpenReview · PDF
Submitted: 2024-09-16 · Updated: 2025-02-05
TL;DR

We provide an in-depth analysis of subgoal planning methods for combinatorial reasoning problems, highlighting the key attributes that enable the benefits of high-level search over low-level search.

Abstract

Keywords
deep learning, search, subgoals, hierarchical reinforcement learning, imitation learning

Reviews and Discussion

Review (Rating: 6)

This paper reviews hierarchical approaches to reasoning tasks, in particular comparing hierarchical search to low-level search.

While the paper has no serious flaws, I am not entirely sure what to make of it. The insights the paper provides may be useful, but I am somewhat worried that they are contrived or overstated. In general, I think many would agree subgoal methods are worth pursuing, so arguing for this alone may not be entirely useful. At the same time, there are theoretical reasons why subgoal methods are imperfect, namely the upwards refinability property, e.g. that an optimal solution can be expressed in terms of abstractions. In general, subgoal methods may not be as flexible as a method which can use low-level actions (for an arbitrary problem at least), and to retain completeness a system needs to be able to use both abstract and primitive actions. Accordingly, I am somewhat skeptical of the takeaways, which seem to be based on limited empirical evaluation compared to what they claim. Regardless, I think the breakdown of benefits into the four separate takeaways is useful and overall think the paper is headed in the right direction.

Strengths

The paper summarizes hierarchical reasoning methods and performs experiments to show their desirable properties.

The insights regarding robustness to noisy value functions are quite useful and seem to be well supported by the experiments.

Similarly, the breakdown of claims regarding complex action spaces, ergodicity, and data diversity is useful, but I am less certain about the validity of these claims, given that there are theoretical reasons why they may not be true.

Weaknesses

I believe the related work section could be better organized; it didn't seem to flow with the rest of the paper, so maybe it could be at the very end instead? E.g., Section 1 seems to flow better into Section 3.

I think the paper could better distinguish between planning and learning -- many of the problems discussed are amenable to classical algorithms, so where does learning come in best and why?

I think the paper would benefit from more theory -- while subgoal methods may have benefits in the experiments performed, their theoretical properties are not entirely discussed. In particular, issues such as the upwards refinability property make me skeptical of the second takeaway (line 417). In other sections, I think the paper would benefit from contrasting theory and experiment.

In general, the writing was somewhat difficult to follow and clarity could be improved.

Questions

I don't have any specific questions about the paper -- but there are likely parts of it I did not fully understand.

Comment

I think the paper would benefit from more theory -- while subgoal methods may have benefits in the experiments performed, their theoretical properties are not entirely discussed. In particular, issues such as the upwards refinability property make me skeptical of the second takeaway (line 417). In other sections, I think the paper would benefit from contrasting theory and experiment.

We agree with your point. While our experimental analysis is a first step towards understanding hierarchical search, it should be further strengthened with theoretical grounding. Our discovery that subgoal methods are highly resilient to value noise, which is arguably our most surprising finding, led us to develop a theoretical explanation (Theorem 1) that confirms this resilience as a universal property. To emphasize its significance, we have now included its full formulation in the main body.

Motivated by that, we are actively working on further theoretical analysis. For example, in our experiment involving complex action spaces, we are developing a theoretical model to show how the probability of selecting the optimal actions scales with the size of the action space, based on certain assumptions about the policy learning process. Similarly, our empirical findings can provide a foundation for future theoretical work, guiding research in this area. Therefore, we believe that it is valuable to share them with the community.

Thank you for your insightful comments. We are more than happy to continue the discussion if anything remains unclear.

References:

  • [1] What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
  • [2] Should I Run Offline Reinforcement Learning or Behavioral Cloning?
  • [3] What Matters for Adversarial Imitation Learning?
Comment

I appreciate the Author's response and updates, however I intend to keep my review the same for the time being.

Comment

Thank you once again for reviewing our paper. We appreciate your feedback and respect your decision.

In response to your suggestions, we extended the theoretical grounding of our conclusions, as promised. Specifically, in Section 5.3, we now provide a formal explanation for the advantage of subgoal methods in handling complex action spaces. We show that as action space complexity increases, the diversity of top actions selected by the low-level policy decreases – a finding that was previously only empirical. We included the theorem's formulation and discussion in Section 5.3, with a full proof in Appendix L.

Similarly, we believe our extensive empirical results provide a valuable foundation for future theoretical work and are therefore worth sharing with the community. We plan to extend our theoretical analysis to cover other takeaways if time permits.

Additionally, we expanded our analysis of approximation errors to cover subgoal generators in addition to the value function. Our experiments reveal that subgoal methods are resilient not only to value noise but to generator noise as well. Since both components guide the search, errors in one component can be mitigated by the other. These results are detailed in Appendix B.2.1.

We apologize for the delay in providing these updates, and we hope that our new theoretical and empirical results further strengthen our contribution.

Comment

Thank you for your valuable feedback. We address your comments below.

I think many would agree subgoal methods are worth pursuing, so arguing for this alone may not be entirely useful.

We completely agree with your perspective. While there is growing interest in applying subgoal methods to combinatorial problems and other complex domains, knowledge about their true advantages remains fragmented. As a result, standard low-level algorithms continue to be the default choice for most applications, regardless of the domain. Our analysis provides empirical and some theoretical evidence indicating which types of domains may benefit from hierarchical methods, enabling more informed decisions in future research. As such, our contribution closely follows the format of [1, 2, 3].

Furthermore, the field of hierarchical search lacks established evaluation guidelines, making comparisons with state-of-the-art methods difficult or sometimes meaningless. Our analysis highlights some of these pitfalls and offers guidelines that will facilitate smoother progress in the field.

Our aim is not to argue for using subgoal methods alone, but to identify their appropriate use cases. We acknowledge that this motivation is not clearly stated in the paper. To address this, we updated the introduction to include this explanation. Thank you for highlighting this important point.

At the same time, there are theoretical reasons why subgoal methods are imperfect, namely the upwards refinability property, e.g. that an optimal solution can be expressed in terms of abstractions. In general, subgoal methods may not be as flexible as a method which can use low-level actions.

This is an important remark. We agree that subgoal methods come with tradeoffs: they promise faster and more focused searches but may lack the precision and flexibility of low-level methods. Therefore, it is essential to understand the conditions under which subgoal search is a preferable alternative to low-level search. The main takeaways from our analysis are the identified domain properties that contribute to the advantages of subgoal methods. While the presence of these properties is never a sufficient condition, they serve as a useful indicator. We updated our conclusions to emphasize this point.

I believe the related work section could be better organized; it didn't seem to flow with the rest of the paper, so maybe it could be at the very end instead? E.g., Section 1 seems to flow better into Section 3.

We agree; the related work section fits better at the end. We have fixed that; thank you for the suggestion.

I think the paper could better distinguish between planning and learning -- many of the problems discussed are amenable to classical algorithms, so where does learning come in best and why?

Thank you for this comment. As you noted, our study involves two stages: learning a value function and other components from demonstration data, and applying search to solve new instances. We focus on domains where classical planning algorithms are limited by their reliance on hand-designed heuristics or domain-specific knowledge. To ensure a fair comparative analysis and maintain the generality of our findings, we avoid incorporating any domain-specific knowledge. Our primary objective is to assess the relative strengths and weaknesses of the planning algorithms themselves rather than specific heuristics.

Imitation learning enables us to train models in the absence of domain knowledge and ensures that each component of different methods receives equal supervision, avoiding any unfair advantage. While we agree that hand-designed heuristics often yield superior results in specific cases, our goal is to provide a broader understanding of the strengths and limitations of different planning methods. Training components directly from data allows us to draw conclusions that are more likely to generalize across diverse environments compared to using hardcoded components.

We agree that our paper was missing a proper discussion on this aspect. Therefore, we added this comment to Section 4.1 to ensure clarity. Thank you for bringing this point to our attention.

Comment

I greatly appreciate the Author's further effort and discussion, however I am still concerned that existing theory in subgoal methods is not being addressed or built on sufficiently. This is fairly clear just from the references section.

For example, in hierarchical planning (an adjacent/overlapping idea), the various work by Kutluhan Erol on HTNs [1] would be a good place to start. Even though this is specific to planning, many of the theorems apply directly to other "combinatorial reasoning problems" that can be cast as planning problems.

Sutton/Barto similarly doesn't cover hierarchical RL very deeply, instead see [2], for instance.

For somewhat more recent work, see [3] and the large body of related work.

[1] Erol, K., Hendler, J., & Nau, D. S. (1994, July). HTN planning: Complexity and expressivity. In AAAI (Vol. 94, pp. 1123-1128).

[2] Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13, 341-379.

[3] Konidaris, G., Kaelbling, L. P., & Lozano-Perez, T. (2018). From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61, 215-289.

Comment

Thank you for this important feedback and for highlighting relevant literature. To place our work within a broader theoretical context, we will incorporate these references and discuss how their theoretical insights relate to our analysis in the Related Work section.

We are pleased to highlight that our work complements this line of research well. Given the significant advances in deep learning since 2003, it is essential to extend the understanding of hierarchical planning to modern, data-driven approaches. Previous theoretical works often assume the explicit specification of hierarchical components, their roles, and high-level actions [1, 2, 3]. In contrast, our work adopts a strictly more general setup, requiring no domain knowledge, which aligns better with contemporary end-to-end learning paradigms. Despite this minimal-assumption framework, we are able to confirm our empirical findings through theoretical results, such as the Search Advancement Formula and the Action Space Densification Theorem.

Additionally, our primary goal is to address a gap in the literature: identifying when hierarchical methods should be preferred over low-level planning methods. While it is widely acknowledged that both planning and hierarchical methods are useful, to the best of our knowledge, no prior work has systematically analyzed the conditions under which hierarchical methods offer a clear advantage. As a result, low-level planners are often the default choice due to their simplicity.

Therefore, we believe our analysis offers valuable insights to the community, complementing existing knowledge on hierarchical methods in the context of recent advancements. Thank you for raising this point, as including this discussion will definitely strengthen our paper.

[1] Erol, K., Hendler, J., & Nau, D. S. (1994, July). HTN planning: Complexity and expressivity. In AAAI (Vol. 94, pp. 1123-1128).

[2] Barto, A. G., & Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete event dynamic systems, 13, 341-379.

[3] Konidaris, G., Kaelbling, L. P., & Lozano-Perez, T. (2018). From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61, 215-289.

Comment

I appreciate the continued discussion with the authors and commend them on how much effort has gone into their paper.

By providing some older references, my intention was to point out that, even in the era of machine learning, there are theoretical aspects of the problem that still hold and must be addressed for this paper to be successful.

I do appreciate the two theorems mentioned, namely search advancement and action space densification. These are useful in the context of machine learning, but don't really address the fundamental issues present in hierarchical search, especially when one wants guarantees.

I do agree with and commend the overall goal of establishing/characterizing where/how subgoal methods are best used, but I think that doing so is a significant task that might be out of the scope of a single paper. Accordingly, I stand by my original assessment that, while this paper contains some valuable insights, it also draws overly strong conclusions from limited experimentation. Still, I feel the authors have made a good-faith effort to make progress towards their intended goal. I do see some value in the paper being shared with the broader ICLR community, but I think a lot more work is still to be done.

I have updated my overall score 5->6, and adjusted my confidence now that I have been able to clarify the paper with the authors.

Comment

Thank you for your kind feedback. We agree that much can still be done, so we are keen to explore this topic further and hope to share new results before the discussion period ends. At the same time, we will revise the paper according to your suggestions, as we acknowledge your perspective that the existing theory has to be properly discussed and that the presented conclusions should not be overstated.

Review (Rating: 6)

This paper examines the efficacy of hierarchical search methods in combinatorial reasoning problems compared to low-level search methods. It identifies specific conditions where hierarchical search excels due to its handling of value function approximations and complex action spaces. The paper proposes standardized evaluation guidelines for search methods and reassesses state-of-the-art algorithms. The research highlights how hierarchical search methods can mitigate the challenges of large action spaces and variable training data.

Strengths

  1. The paper offers a novel analysis of hierarchical versus low-level search methods in NP-hard problem domains, a less frequented area of research with significant implications for AI optimization tasks (Section 1).
  2. The paper's structure is logical and well-articulated. It presents complex methodologies and findings in a manner that is accessible to readers with a background in AI and machine learning. The diagrams and tables effectively illustrate key points and comparisons.
  3. Robust empirical analysis supports the claims, with extensive testing across various scenarios to demonstrate the superior performance of hierarchical methods under specific conditions.
  4. The findings contribute valuable insights into applying hierarchical search methods, particularly their resilience to value function noise and their efficiency in environments with complex action spaces, which are critical for advancing AI capabilities in real-world applications.

Weaknesses

  1. The study’s focus on specific NP-hard problems may limit the generalizability of the findings. Additional studies could help determine if the advantages of hierarchical methods hold across a broader range of combinatorial problems (General Discussion).
  2. The performance advantages of hierarchical methods heavily depend on the diversity and quality of training data. The paper could delve deeper into the effects of lower-quality or less diverse data sets (Section 5.1).
  3. While the paper addresses the effectiveness of hierarchical search methods, it underplays their computational complexity and resource demands, which could be a significant drawback in practical applications (Section 4).
  4. The robustness of hierarchical methods to training data variability could lead to overfitting if not adequately managed. The paper could discuss strategies to mitigate this risk (Section 5.2).
  5. A missing comparison with other foundation models in combinatorial problems, such as DeepCubeA, which are trained in an unsupervised setting.

Questions

  1. How adaptable are hierarchical search methods to other complex domains outside the specific NP-hard problems tested, such as real-time decision-making environments?
  2. How do computational constraints affect the performance of hierarchical search methods compared to low-level search methods in real-world applications?
  3. What improvements do the authors foresee or recommend for hierarchical search methods to handle larger datasets or more variable environments effectively?
  4. Could the authors elaborate on the standardized evaluation guidelines proposed? How do these guidelines compare to existing benchmarks in AI research?
Comment

The study’s focus on specific NP-hard problems may limit the generalizability of the findings. Additional studies could help determine if the advantages of hierarchical methods hold across a broader range of combinatorial problems (General Discussion).

We appreciate your valuable feedback. Expanding the analysis to more domains would indeed support the generality of our findings. We faced a tradeoff between including a wider range of experiments and providing a detailed discussion that convincingly explains our claims. To maintain clarity and depth, we focused on four challenging domains with distinct properties, ensuring robust evidence for our conclusions. Nevertheless, we are working on extending our analysis to additional domains to strengthen our findings. Additionally, we are developing new theoretical explanations, such as probabilistic modeling of the policy network to explain the results in complex action space domains (Section 5.3), in the spirit of Theorem 1. We are working to share them as soon as possible.

The performance advantages of hierarchical methods heavily depend on the diversity and quality of training data. The paper could delve deeper into the effects of lower-quality or less diverse data sets (Section 5.1).

We agree that this is a valuable suggestion. The quality of training data is an important factor, as evidenced by experiments shown in Sections 5.1 and 5.2. While we extensively analyzed the dataset with diverse data sources in Appendix B.1 and the effect of simulated low-quality data in Appendix B.2, we agree that also analyzing the potential negative impact of lack of diversity in the data will be a valuable addition. To cover that, we are working on experiments with ablated subgoal generators. Specifically, we manually control the diversity of outputs of the generator to precisely simulate the effect of low-diversity data on the performance of the search. Thank you for bringing this point to our attention. Would you recommend moving our extended analysis from the appendices to the main body?

The robustness of hierarchical methods to training data variability could lead to overfitting if not adequately managed. The paper could discuss strategies to mitigate this risk (Section 5.2).

Could you kindly clarify this point? Specifically, are you suggesting that hierarchical approaches may overfit to diverse data due to their hierarchical nature or that robustness to data variability may result in an incentive to memorize the data rather than learning the general properties of the environment? Any additional context would be greatly appreciated.

A missing comparison with other foundation models in combinatorial problems, such as DeepCubeA, which are trained in an unsupervised setting.

Thank you for asking about that. We decided to focus our analysis on the most widely used algorithms to ensure our conclusions have broad implications. In particular, we would like to note that DeepCubeA is based on learning a heuristic from a dataset of trajectories, which is subsequently used in weighted A* search. Combining those two steps is very closely related to the $\rho$-A* algorithm used in our study, which also derives the heuristic from the given data and uses it to guide the weighted A* search, as detailed in Appendix F.3.

We are grateful for your very constructive feedback. We are more than happy to continue the discussion if anything needs to be clarified.

References:

  • [1] Open X-Embodiment: Robotic Learning Datasets and RT-X Models
  • [2] Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models
  • [3] HIQL: Offline Goal-Conditioned RL with Latent States as Actions
  • [4] Hierarchical Imitation Learning with Vector Quantized Models
  • [5] Deep reinforcement learning at the edge of the statistical precipice
Comment

Thank you for the answers.

  1. The authors should briefly mention the effects of the poorer-quality or less diverse data sets and refer to the appendix as needed.
  2. I was referring to catastrophic forgetting if the diversity of the dataset is not well sampled while training the models.
Comment

Additionally, we are developing new theoretical explanations, such as probabilistic modeling of the policy network to explain the results in complex action space domains (Section 5.3), in the spirit of Theorem 1. We are working to share them as soon as possible.

As we promised earlier, we extended our theoretical analysis to further strengthen our conclusions.

We worked on explaining our conclusions from Section 5.3, regarding the advantage of subgoal methods in handling complex action spaces. Previously, we showed experimentally that both in the mathematical INT environment and in the Rubik's Cube with a multiplied action space, the advantage of subgoal methods is significant. We attributed those benefits to the ability of subgoal methods to use states as actions and to the reduced diversity in low-level search. And indeed, we can prove in general that as the action space gets more complex, the diversity of top actions drops.

To give an illustrative example, in the Rubik's Cube experiment, to model the increasingly complex action space, for an arbitrary state we can view the training data as a ground-truth density function $f$ over an interval $[0, 1]$ that is split evenly between the actions (i.e., into 12 intervals of length 1/12). Then, we can define arbitrarily dense action spaces $A_n$ consisting of $n$ points distributed evenly in the domain. For instance, $A_{12}$ corresponds to the standard Rubik's Cube action space, while $A_{1200}$ corresponds to the variant multiplied 100 times. Our theorem confirms that the actions selected by the policy get less diverse as the complexity of the action space increases, up to the extreme of converging to a single point as $n$ approaches infinity. In practice, the result is even more general, since the data-driven action distribution $f$ may also model smooth interpolation between actions.

While this is rather intuitive when the learned distributions are perfect, it may seem that approximation errors, induced both by the limited training data and by the policy network, could actually improve diversity. We show that the result holds even in the presence of arbitrarily large approximation errors, which is somewhat counter-intuitive.

Formally, we prove the following theorem:

Theorem 2 (Action space densification). Fix any state $s$ from the state space $S$. Let $f : A \to [0, 1]$ be the action distribution induced by the data-collecting policy for the state $s$. Assume that $f$ is continuous and has a unique maximum. For clarity, assume $A = [0, 1]$.

Consider a sequence of increasingly dense discrete action spaces $A_n := \{i/n\}_{i=0}^{n} \subset A$. Let $\rho_n : S \times A_n \to [0, 1]$ be a family of policies that learn the distribution $f|_{A_n}$ over actions, with uniform approximation error $U(-E, E)$, where $E \in \mathbb{R}_+$. Let $r_n$ be the range of the top $K$ actions according to the probabilities estimated by $\rho_n$. Then

$$\lim_{n \to \infty} \mathbb{E}[r_n] = 0.$$

We included the theorem's formulation and discussion in Section 5.3, with a full proof in Appendix L.
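To build intuition for the statement (not for the proof), the following minimal Monte Carlo sketch estimates $\mathbb{E}[r_n]$ under the theorem's assumptions. The triangular density, the noise half-width $E = 0.5$, $K = 5$, and all names are illustrative choices on our part, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(a):
    # Illustrative ground-truth action density: continuous, unique maximum at 0.5.
    return 1.0 - 2.0 * np.abs(a - 0.5)

def expected_top_k_range(n, E=0.5, K=5, trials=500):
    """Monte Carlo estimate of E[r_n]: the range of the top-K actions in
    A_n = {i/n : i = 0..n}, ranked by f plus uniform noise U(-E, E)."""
    actions = np.arange(n + 1) / n
    ranges = []
    for _ in range(trials):
        noisy = f(actions) + rng.uniform(-E, E, size=n + 1)
        top = actions[np.argsort(noisy)[-K:]]  # positions of the K largest estimates
        ranges.append(top.max() - top.min())
    return float(np.mean(ranges))

for n in (12, 120, 1200, 12000):
    print(f"n = {n:5d}   E[r_n] ~ {expected_top_k_range(n):.4f}")
```

The estimated range shrinks as $n$ grows, even though the noise is as large as half the density's range, in line with the limit above.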

In this way, we theoretically confirm our experimental findings discussed in Section 5.3. We believe that our extensive empirical results can serve as a strong basis for further theoretical analysis and are therefore valuable to share with the community. We hope to extend our theory to cover other findings as well, if the time of the discussion period permits.

Comment

As the rebuttal period comes to an end, thank you for your time and engaging discussion. We're happy to address any final questions you might have.

Comment

Thank you for the comment.

The authors should briefly mention the effects of the poorer-quality or less diverse data sets and refer to the appendix as needed.

Motivated by your feedback, we conducted additional experiments to check the impact of poor-quality data on the subgoal generator. Specifically, we tested two types of generator ablations.

In the first experiment, to simulate a suboptimal generator, instead of selecting the top $n$ subgoals based on computed probabilities, we first expand the candidate pool to $n' > n$ subgoals and then randomly sample $n$ subgoals from this expanded set, as sketched below. This approach forces the method to use suboptimal subgoals during the search process. Notably, even in situations where the optimal subgoal could directly lead to the goal state, it may be excluded from the final selection.
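A minimal sketch of this ablation, with hypothetical names chosen only for illustration:

```python
import random

def sample_suboptimal_subgoals(candidates, probs, n, n_prime, rng=random.Random(0)):
    """Instead of taking the top-n subgoals by probability, widen the pool to
    the top-n' (n' > n) and draw n of them uniformly at random, so the search
    is sometimes forced to proceed through suboptimal subgoals."""
    ranked = [c for _, c in sorted(zip(probs, candidates), key=lambda pc: -pc[0])]
    pool = ranked[:n_prime]     # expanded candidate pool (top-n')
    return rng.sample(pool, n)  # may exclude even the highest-probability subgoal
```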

Link to suboptimal subgoals ablation chart

As shown in the figure, subgoal methods demonstrate significant resilience to suboptimal generators. Even when the candidate pool increases to include as many as 8 samples, the methods maintain strong performance, over 70% in both cases. As discussed in Section 5.2, subgoal methods balance the influence of subgoal generators with that of the value function. This interplay allows the value function to compensate for generator errors and vice versa. In practice, it suffices if at least one subgoal contributes to positive progress, as the value function can recognize and leverage such progress.

In the second experiment, we simulate low-quality training data by deliberately corrupting some of the generated subgoals. Specifically, each sampled subgoal is rendered invalid with probability $p$, making it unreachable. Consequently, not only are resources wasted on attempting to expand these corrupted subgoals, which fail to contribute to the search progress, but the diversity of the whole search tree is also strongly limited, since fewer nodes are created.
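A corresponding sketch of the corruption ablation, under the same caveat that the names are ours:

```python
import random

def corrupt_subgoals(subgoals, p, rng=random.Random(0)):
    """Mark each generated subgoal as invalid (unreachable) with probability p.
    Invalid subgoals still consume search budget when the planner attempts to
    reach them, but they contribute no children to the search tree."""
    return [(s, rng.random() < p) for s in subgoals]  # (subgoal, is_corrupted)
```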

Link to corrupted subgoals ablation chart

The attached figure shows that subgoal-based methods exhibit tolerance to a considerable degree of corruption. Even with a corruption probability of 50%, both algorithms successfully solve most instances. However, when the corruption rate increases to 75%, the search process fails, as the lack of valid nodes to expand leads to stagnation.

Together with our analysis of value approximation errors, these experiments highlight that subgoal methods benefit from the complementary roles of subgoal generators and the value function. Errors in one component can often be mitigated by the other. In contrast, low-level methods inherently rely on the value function, making its quality a critical factor for their success.

We extended the discussion in Section 5.2 to cover that point. We also placed the discussion of new ablations in Appendix B.2.1.

I was referring to catastrophic forgetting if the diversity of the dataset is not well sampled while training the models.

We believe there may be a misunderstanding. Our experimental pipeline consists of two stages: (1) training the components on offline datasets and (2) evaluation. During the training stage, the models are trained from scratch, and during evaluation, the models are frozen. As a result, there is no possibility of observing catastrophic forgetting. Additionally, we would like to note that in experiments involving value errors, the value network is also kept frozen, with external random noise added only to its outputs.

We hope that resolves your concern. If anything remains unclear, we will do our best to clarify it further.

Comment

Thank you for acknowledging the strengths of our paper and your helpful suggestions. Below, we answer your specific questions.

How adaptable are hierarchical search methods to other complex domains outside the specific NP-hard problems tested, such as real-time decision-making environments?

Thank you for raising this point. Our study has broader implications for other complex domains. For example, advancements in robotics often face significant challenges due to limited data, leading many methods to rely on collective datasets like Open X-Embodiment [1]. As shown in our experiments, hierarchical search methods benefit substantially from training on diverse expert data (Section 5.1). Furthermore, the data bottleneck increases the need for the models to generalize to out-of-distribution scenes and tasks, which is also an advantage of hierarchical methods (Section 5.5). Finally, an essential aspect of robotics involves preventing the robot from becoming stuck or losing a manipulated object, events that can be seen as dead-end scenarios (Section 5.4). Successful applications of hierarchical methods in robotics include models such as SuSIE [2] and HIQL [3].

Additionally, our experiments indicate that hierarchical methods scale well in long-horizon tasks, as evidenced by their performance in the N-Puzzle and the Rubik’s Cube (using Beginner demonstrations), where the average sequence of steps often exceeds 200. Interestingly, while low-level methods can still perform well in these scenarios, we observed that they tend to be much more sensitive to hyperparameter tuning.

It is important to note that we do not claim hierarchical methods are universally superior to low-level approaches in all complex domains. Instead, the properties highlighted in our analysis suggest cases where they should be considered. To clarify that point, we added this discussion to the paper.

While the paper addresses the effectiveness of hierarchical search methods, it underplays their computational complexity and resource demands, which could be a significant drawback in practical applications (Section 4).

How do computational constraints affect the performance of hierarchical search methods compared to low-level search methods in real-world applications?

We agree that this is an important point, as computational complexity is essential in practical applications. In our experiments, we focused on measuring the performance of the tested methods based on the search tree size – an objective, algorithmic metric that is independent of the hardware or optimizations used, can be measured precisely, and is fully reproducible, unlike wall time. In practice, the only difference in complexity between hierarchical and low-level methods is the cost of calling the subgoal generator at high-level nodes. We used the architectures proposed by the authors, as our aim for each method was to optimize performance rather than time. To optimize execution time, one can tune the number of parameters or use other architectures that are known to work well for generating subgoals, such as VQ-VAEs [4], diffusion models [2], or MLPs [3].

To avoid confusion, we added this discussion to the limitations section of our work. Further study on the complexity of each method will be a valuable addition to our analysis. Thank you for pointing this out.

Could the authors elaborate on the standardized evaluation guidelines proposed? How do these guidelines compare to existing benchmarks in AI research?

During our experiments, we were surprised to find that even papers published at top conferences sometimes exhibited poor practices in reporting the results. While proper evaluation standards are common in broader AI research [5], they are not yet established in the subfield of hierarchical search. As a contribution of our work, we propose the following guidelines for proper evaluation:

  • Report results using a complete search budget.
  • Include $\rho$-BestFS with a confidence threshold as a baseline.
  • Carefully tune the confidence threshold of BestFS.
  • Use up-to-date code for experiments.

These recommendations are distinct from creating a benchmark, as we do not propose specific evaluation tasks or environments. Instead, we advocate for a standardized reporting approach to prevent common pitfalls. We agree that these guidelines were scattered throughout the original paper, so we added a comprehensive discussion in Appendix J. Thank you for bringing this to our attention.

Review (Rating: 5)

This paper answers two primary questions: 1. Will hierarchical search outperform classical search under the same heuristic function and other settings? 2. In what situations do hierarchical search methods significantly outperform classical search? The authors conduct experiments to compare the performance of hierarchical search with classical search, finding that hierarchical search performs better in instances with complex action spaces and less guided heuristic functions. Additionally, data with high diversity also favors hierarchical search.

I am uncertain whether the authors have addressed all possible influencing factors to arrive at the final conclusion. Since hierarchical search relies on a subgoal generator, which functions as another form of heuristic, this additional generator could significantly influence the improved performance of hierarchical search compared to classical search, especially in tests involving noisy value functions. This reliance on a subgoal generator may also be a critical factor that warrants further investigation.

The motivation behind this paper is also unclear. After answering these two questions, what potential impact could this research have on future algorithm design or other related fields?

Strengths

  • The paper includes numerous experiments examining various factors that could impact the performance of algorithms.

Weaknesses

  • The motivation for the study is unclear.

  • The presence of a subgoal generator could heavily influence the analysis of the experiments.

Questions

Other Comments:

  • Line 52: "low-level" likely refers to hierarchical algorithms.

  • Line 59: The phrase "hard-to-learn value functions" requires clarification.

  • Line 83: There is a reference error.

  • Line 175: The description “Planner that determines the order in which subgoals are generated” seems to refer to the "Subgoal generator."

  • Figure 7: Since both AdaSubS and kSubS rely on value functions, why do they not experience a significant drop in performance with high noise levels? For instance, kSubS in Sokoban (2 to 100 noise) shows better results.

  • Figure 7: What is the value scale (minimum and maximum) for the value function, given that A* and BestFS tolerate a 0.5 error in the Rubik’s Cube test?

  • Figures: Some figures should be adjusted to enhance readability by increasing font sizes and displaying full labels, such as “Step” and “Subgoal” in Figures 8 and 9.

Comment

Figure 7: Since both AdaSubS and kSubS rely on value functions, why do they not experience a significant drop in performance with high noise levels? For instance, kSubS in Sokoban (2 to 100 noise) shows better results.

We were also surprised by these results. Both AdaSubS and kSubS rely on the value function, but their high-level search trees are much sparser due to the use of subgoals, reducing the influence of the value function. To give an example, when using 4-step subgoals, the high-level search tree contains about 4 times fewer nodes than the low-level search tree, meaning the value function is queried 4 times less often. An extreme case is AdaSubS, which, as detailed in Appendix B.2, queries the value function only when optimistic search gets stuck. This reduced reliance enables subgoal methods to tolerate high approximation errors without significant performance degradation.

As you noted, interestingly there is one case where adding noise to the value function improved performance. This rare effect arises from the exploration-exploitation tradeoff, as noising value estimates can promote exploration. It can be particularly useful in the Sokoban domain, where overly exploiting the value can lead to getting stuck in dead ends.

We have updated the paper to clarify these points, extending the explanation of value resilience and discussing the performance increase from value noising in Section 5.2. Thank you for this valuable feedback.

What is the value scale (minimum and maximum) for the value function, given that A* and BestFS tolerate a 0.5 error in the Rubik’s Cube test?

In each experiment, we normalize the values to the interval [0,1]. For the Rubik’s Cube experiment, this means that adding noise with a variance of 0.5 corresponds to perturbing the distance estimate by an average of 8 steps. BestFS and A* can tolerate such perturbations because they also use a policy to rank actions. However, even this is insufficient under higher approximation errors. We apologize for not including these details in the paper. We have updated the value specification and clarified the correspondence between noise and values in Section 5.2.
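For concreteness, here is the back-of-the-envelope arithmetic behind the 8-step figure. We assume Gaussian noise and a normalization scale of roughly 14 moves; both are our own illustrative assumptions, chosen to reproduce the reported average:

```python
import math

var = 0.5                                # noise variance on [0, 1]-normalized values
mean_abs = math.sqrt(2 * var / math.pi)  # E|X| for X ~ N(0, var): about 0.564
max_dist = 14                            # assumed maximum distance used to normalize
print(f"average perturbation ~ {mean_abs * max_dist:.1f} steps")  # about 7.9
```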

Line 52: "low-level" likely refers to hierarchical algorithms.
Line 59: The phrase "hard-to-learn value functions" requires clarification.
Line 83: There is a reference error.
Line 175: The description “Planner that determines the order in which subgoals are generated” seems to refer to the "Subgoal generator."

We apologize for these mistakes and have now corrected them. Thank you for bringing them to our attention.

Thank you for your valuable comments. We are more than happy to continue the discussion if anything remains unclear.

References:

  • [1] What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
  • [2] Should I Run Offline Reinforcement Learning or Behavioral Cloning?
  • [3] What Matters for Adversarial Imitation Learning?
Comment

Thank you for your valuable feedback and suggestions. We address your comments below.

I am uncertain whether the authors have addressed all possible influencing factors to arrive at the final conclusion.

Thank you for your comment. While our analysis provides empirical and some theoretical evidence on which types of domains may benefit from hierarchical methods, we agree that the list of influencing factors is not exhaustive. Exploring additional domains and properties is a promising direction for future work.

For each analyzed property, we provide robust empirical evidence by conducting multiple diverse experiments designed to isolate each property, with concise explanations in the main body and detailed discussions in Appendix B. While we fully agree that these properties are not sufficient conditions, they serve as useful indicators.

While our experimental analysis is a first step toward understanding hierarchical search, we agree that it can be strengthened with theoretical grounding. Our arguably most surprising finding that subgoal methods are highly resilient to value noise led us to develop Theorem 1, confirming this resilience as a universal property. We are actively working on developing similar theoretical foundations for other properties.

We agree that our aim and contribution were not explained clearly in this regard. Therefore, we updated the Introduction and Conclusions to include this discussion.

This reliance on a subgoal generator may also be a critical factor that warrants further investigation.

Thank you for this insightful comment. As you noted, our experiments confirm that low-level methods rely heavily on the value function, whereas subgoal methods are more resilient to value errors due to the guidance of the subgoal generator. Exploring the subgoal generator's influence on performance is indeed an exciting direction. To investigate this, we are conducting additional experiments where we ablate the subgoal generator by simulating errors – randomly discarding generated subgoals with a controlled probability. We will share the results shortly.

The motivation behind this paper is also unclear. After answering these two questions, what potential impact could this research have on future algorithm design or other related fields?

We understand your concern. While there is growing interest in applying subgoal methods to combinatorial problems and other complex domains, knowledge about their true advantages remains fragmented. As a result, standard low-level algorithms continue to be the default choice for most applications, regardless of the domain.

Subgoal methods offer tradeoffs: they enable faster, more focused searches but may lack the precision and flexibility of low-level methods. Therefore, it is essential to understand the conditions under which subgoal search is a preferable alternative to low-level search. Our analysis provides empirical and some theoretical evidence indicating which types of domains may benefit from hierarchical methods, enabling more informed decisions in future research. As such, our contribution closely follows the format of [1, 2, 3].

Furthermore, the field of hierarchical search lacks established evaluation guidelines, making comparisons with state-of-the-art methods difficult or sometimes meaningless. Our analysis highlights some of these pitfalls and offers guidelines that will facilitate smoother progress in the field.

We acknowledge that this motivation was not clearly stated in the original paper. To address this, we updated the introduction to clarify our aims and consolidated the evaluation guidelines into a comprehensive discussion in Appendix J. Thank you for bringing this important point to our attention.

Comment

Our arguably most surprising finding that subgoal methods are highly resilient to value noise led us to develop Theorem 1, confirming this resilience as a universal property. We are actively working on developing similar theoretical foundations for other properties.

As we promised, we extended our theoretical analysis to further strengthen our conclusions.

We worked on explaining our conclusions from Section 5.3, regarding the advantage of subgoal methods in handling complex action spaces. Previously, we showed experimentally that both in the mathematical INT environment and in the Rubik's Cube with a multiplied action space, the advantage of subgoal methods is significant. We attributed those benefits to the ability of subgoal methods to use states as actions and to the reduced diversity in low-level search. And indeed, we can prove in general that as the action space gets more complex, the diversity of top actions drops.

To give an illustrative example, in the Rubik's Cube experiment, to model the increasingly complex action space, for an arbitrary state we can view the training data as a ground-truth density function $f$ over an interval $[0, 1]$ that is split evenly between the actions (i.e., into 12 intervals of length 1/12). Then, we can define arbitrarily dense action spaces $A_n$ consisting of $n$ points distributed evenly in the domain. For instance, $A_{12}$ corresponds to the standard Rubik's Cube action space, while $A_{1200}$ corresponds to the variant multiplied 100 times. Our theorem confirms that the actions selected by the policy get less diverse as the complexity of the action space increases, up to the extreme of converging to a single point as $n$ approaches infinity. In practice, the result is even more general, since the data-driven action distribution $f$ may also model smooth interpolation between actions.

While this is rather intuitive when the learned distributions are perfect, it may seem that approximation errors, induced both by the limited training data and by the policy network, could actually improve diversity. We show that the result holds even in the presence of arbitrarily large approximation errors, which is somewhat counter-intuitive.

Formally, we prove the following theorem:

Theorem 2 (Action space densification). Fix any state $s$ from the state space $S$. Let $f : A \to [0, 1]$ be the action distribution induced by the data-collecting policy for the state $s$. Assume that $f$ is continuous and has a unique maximum. For clarity, assume $A = [0, 1]$.

Consider a sequence of increasingly dense discrete action spaces $A_n := \{i/n\}_{i=0}^{n} \subset A$. Let $\rho_n : S \times A_n \to [0, 1]$ be a family of policies that learn the distribution $f|_{A_n}$ over actions, with uniform approximation error $U(-E, E)$, where $E \in \mathbb{R}_+$. Let $r_n$ be the range of the top $K$ actions according to the probabilities estimated by $\rho_n$. Then

$$\lim_{n \to \infty} \mathbb{E}[r_n] = 0.$$

We included the theorem's formulation and discussion in Section 5.3, with a full proof in Appendix L.

In this way, we theoretically confirm our experimental findings discussed in Section 5.3. We believe that our extensive empirical results can serve as a strong basis for further theoretical analysis and are therefore valuable to share with the community. We hope to extend our theory to cover other findings as well, if the time of the discussion period permits.

We hope that our answers resolved your concerns. If anything remains unclear, we are happy to answer any further questions.

Comment

As the rebuttal period comes to an end, thank you for your time and engaging discussion. We're happy to address any final questions you might have.

Comment

We wanted to kindly ask if all your concerns were resolved by our answer. Please let us know if there are any remaining issues or points we can address further.

This reliance on a subgoal generator may also be a critical factor that warrants further investigation.

As we promised in our response, we conducted additional experiments to check the impact of the quality of the subgoal generator. Specifically, we tested two types of generator ablations.

In the first experiment, to simulate a suboptimal generator, instead of selecting the top $n$ subgoals based on computed probabilities, we first expand the candidate pool to $n' > n$ subgoals and then randomly sample $n$ subgoals from this expanded set. This approach forces the method to use suboptimal subgoals during the search process. Notably, even in situations where the optimal subgoal could directly lead to the goal state, it may be excluded from the final selection.

Link to suboptimal subgoals ablation chart

As shown in the figure, subgoal methods demonstrate significant resilience to suboptimal generators. Even when the candidate pool increases to include as many as 8 samples, the methods maintain strong performance, over 70% in both cases. As discussed in Section 5.2, subgoal methods balance the influence of subgoal generators with that of the value function. This interplay allows the value function to compensate for generator errors and vice versa. In practice, it suffices if at least one subgoal contributes to positive progress, as the value function can recognize and leverage such progress.

In the second experiment, we simulate low-quality training data by deliberately corrupting some of the generated subgoals. Specifically, each sampled subgoal is rendered invalid with probability $p$, making it unreachable. Consequently, not only are resources wasted on attempting to expand these corrupted subgoals, which fail to contribute to the search progress, but the diversity of the whole search tree is also strongly limited, since fewer nodes are created.

Link to corrupted subgoals ablation chart

The attached figure shows that subgoal-based methods exhibit tolerance to a considerable degree of corruption. Even with a corruption probability of 50%, both algorithms successfully solve most instances. However, when the corruption rate increases to 75%, the search process fails, as the lack of valid nodes to expand leads to stagnation.

Together with our analysis of value approximation errors, these experiments highlight that subgoal methods benefit from the complementary roles of subgoal generators and the value function. Errors in one component can often be mitigated by the other. In contrast, low-level methods inherently rely on the value function, making its quality a critical factor for their success.

We extended the discussion in Section 5.2 to cover that point. We also placed the discussion of new ablations in Appendix B.2.1.

Review (Rating: 5)

This paper reports on an empirical evaluation of two types of algorithms for solving combinatorial search problems: “standard” graph search algorithms and “hierarchical search” algorithms. It is worth noting that, unlike the classical literature on combinatorial search, the main setup here is that the algorithm is not given a domain-specific heuristic. Instead, the algorithm is given a dataset of solutions to problems in the same domain and can learn from this dataset how to solve other problems in that domain. The “standard” graph search algorithms considered in this work learn a heuristic function (referred to as a value function here), while the “hierarchical search” algorithms also learn a method to generate subgoals in the domain and perform higher-level planning in the space of subgoals. The results show that hierarchical search methods are often much better. The purpose of this work is to dig deeper and analyze why and when they are better. This is done by adding noise to the heuristic, considering test problems significantly different from those used for training, considering domains with dead-ends, and analyzing domains with particularly many actions. The results, very concisely, show that hierarchical search is better in general, and in particular when the heuristic is weak, the training data is very different from the test data, and when there are many dead-ends or many actions.

优点

  1. The problem is interesting, the evaluation is extensive.
  2. The conclusions drawn from the evaluation are useful for future research and implementers.
  3. Some interesting insights into the nature of combinatorial search problems.

Weaknesses

The paper is far from being self-contained. This can be seen from the large number of appendices. To me, this work is more fitting for a journal than a conference paper. A major example of this is Theorem 1, which one cannot fully understand without reading the appendices. In addition, the scope is maybe not a perfect fit for a top conference, as the authors mainly studied existing algorithms and the conclusions are not groundbreaking.

Post-rebuttal: the authors provided some interesting theory in the discussion; I wish it had been in the paper when submitted. At this point, I am not sure whether the addition of significant theory is reasonable during a rebuttal.

Questions

  1. Is my depiction above of the setup – no domain-specific heuristic, but yes data – correct?
  2. Are the methods for detecting subgoals domain-independent, i.e., based only on the given dataset, or do they also include domain-specific hard-coded procedures?
  3. In Section 3, the use of “planning” in the third bullet is not clear to me (and I’ve experience in planning).
  4. Where you use the term value function, I think you mean state value heuristic? My guess is that, if any, the term reward would be more accurate (as per reward shaping).
Comment

Thank you for taking the time to review our paper and for providing valuable feedback. We address your comments below:

Is my depiction above of the setup – no domain-specific heuristic, but yes data – correct? Are the methods for detecting subgoals domain-independent, i.e., based only on the given dataset, or do they also include domain-specific hard-coded procedures?

You are correct – we do not specify heuristics or use domain knowledge, allowing the models to learn directly from the data. All components we use, including subgoal generators, are domain-independent and rely solely on the provided datasets. In particular, we do not use any hard-coded procedures. This gives our conclusions broader applicability, as is standard in imitation learning and offline reinforcement learning. We corrected the Analysis section to clarify this point. Thank you for bringing this to our attention.

In Section 3, the use of “planning” in the third bullet is not clear to me (and I’ve experience in planning).

We apologize for the confusion. To clarify, our point is that solving complex problems requires algorithms to employ search techniques. In contrast, methods that don’t use search and follow a single action trajectory are inherently limited by computational complexity, since they can perform only a constant number of operations before choosing an action. And in practice, we observed that methods without search are unable to solve most tasks. We agree that our description in the paper was imprecise on this point and updated Section 3 to ensure greater clarity. Thank you for bringing this to our attention.

Where you use the term value function, I think you mean state value heuristic? My guess is that, if any, the term reward would be more accurate (as per reward shaping).

We apologize for the confusion. The value function we use is indeed equivalent to the state value heuristic, which is also a common term for the function that estimates the distance between the current state and the goal state. Since our components are trained using imitation learning, we adopted the terminology for consistency. However, we agree that this needs further clarification, so we added a remark in Section 4 to address this. Thank you for pointing this out.

The paper is far from being self-contained. (...) A major example of this is Theorem 1, which one cannot fully understand without reading the appendices

We are grateful for your feedback. Balancing detailed discussions with clarity in the main text was indeed a challenge. Our goal was to present key experiments and conclusions in the main body while offering comprehensive details in the appendices for readers interested in deeper exploration of selected topics. Similarly, we aimed to simplify the formulation of Theorem 1 by focusing on its key implication: formalizing the intuition that longer subgoals help filter out value approximation errors. To maintain clarity, we chose not to include the full technical details in the main text but ensured they are accessible in the appendices.

While we believed these choices would enhance clarity, we acknowledge they may not have fully achieved that goal. To address this, we have revised Theorem 1 in the main body and integrated extended experimental discussions into the Analysis section where possible. If there are additional sections you believe need more explanation, we would greatly appreciate your guidance.
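To convey the intuition behind Theorem 1 in the simplest possible terms, consider the following back-of-the-envelope illustration (a deliberate simplification on our part, not the theorem itself):

```latex
% Simplified illustration (not Theorem 1 itself). Let V(s) denote the true
% distance-to-goal and \hat{V}(s) its estimate, with error bounded by \epsilon.
% For a subgoal s_k lying k steps closer to the goal than the current state s_0:
\[
  \hat{V}(s_k) \le V(s_0) - k + \epsilon,
  \qquad
  \hat{V}(s_0) \ge V(s_0) - \epsilon ,
\]
% so \hat{V}(s_k) < \hat{V}(s_0) whenever k > 2\epsilon: once the subgoal
% horizon k exceeds twice the noise bound, the subgoal is ranked correctly
% despite arbitrary per-state approximation errors within that bound.
```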

Comment

Thank you for clarifying many issues. I am still "on the fence" about this paper, as I'm not sure the contributions are significant enough for a top conference. If the authors could take the guidelines they obtained from the experiments and come up with a hybrid of sorts, or a stronger theory, that would be more valuable in my view.

Comment

Thank you for that comment. Encouraged by your feedback, we extended our theoretical analysis to further strengthen our conclusions.

We worked on explaining our conclusions from Section 5.3 regarding the advantage of subgoal methods in handling complex action spaces. Previously, we showed experimentally that the advantage of subgoal methods is significant both in the mathematical INT environment and in the Rubik’s Cube variant with a multiplied action space. We attributed those benefits to the ability of subgoal methods to use states as actions and to the reduced diversity of low-level search. Indeed, we can prove that, in general, as the action space gets more complex, the diversity of the top actions drops.

To give an illustrative example, in the Rubik’s Cube experiment, to model the increasingly complex action space, for an arbitrary state we can view the training data as a ground-truth density function $f$ over the interval $[0, 1]$ that is split evenly between the actions (i.e., into 12 intervals of length 1/12). Then, we can define arbitrarily dense action spaces $A_n$ consisting of $n$ points distributed evenly in the domain. For instance, $A_{12}$ corresponds to the standard Rubik’s Cube action space, while $A_{1200}$ corresponds to the variant multiplied 100 times. Our theorem confirms that the actions selected by the policy get less diverse as the complexity of the action space increases, up to the extreme of converging to a single point as $n$ approaches infinity. In practice, the result is even more general, since the data-driven action distribution $f$ may also model smooth interpolation between actions.

While this is rather intuitive when the learned distributions are perfect, it may seem that approximation errors, induced both by the limited training data and by the policy network, could actually improve diversity. We show that the result holds even in the presence of arbitrarily large approximation errors, which is somewhat counter-intuitive.

Formally, we prove the following theorem:

Theorem 2 (Action space densification). Fix any state $s$ from the state space $S$. Let $f : A \to [0, 1]$ be the action distribution induced by the data-collecting policy for the state $s$. Assume that $f$ is continuous and has a unique maximum. For clarity, assume $A = [0, 1]$.

Consider a sequence of increasingly dense discrete action spaces $A_n := \{i/n\}_{i=0}^{n} \subset A$. Let $\rho_n : S \times A_n \to [0, 1]$ be a family of policies that learn the distribution $f|_{A_n}$ over actions, with uniform approximation error $U(-E, E)$, where $E \in \mathbb{R}_+$. Let $r_n$ be the range of the top $K$ actions according to the probabilities estimated by $\rho_n$. Then

$$\lim_{n \to \infty} \mathbb{E}[r_n] = 0.$$

We included the theorem's formulation and discussion in Section 5.3, with a full proof in Appendix L.
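To give a rough numerical intuition for the statement, the following minimal simulation sketch (our illustration here, not code from the paper; the Gaussian shape of $f$, the noise level $E$, and $K$ are arbitrary choices) discretizes $[0, 1]$ into $A_n$, perturbs the learned probabilities with uniform noise $U(-E, E)$, and measures the expected range of the top-$K$ actions:

```python
# Numerical sketch of Theorem 2: the expected range of the top-K actions
# shrinks as the action space A_n gets denser, despite uniform noise.
import numpy as np

rng = np.random.default_rng(0)

def f(a):
    # Ground-truth action distribution: continuous, unique maximum at a = 0.5.
    return np.exp(-((a - 0.5) ** 2) / 0.02)

def top_k_range(n, k=4, E=0.3):
    actions = np.linspace(0.0, 1.0, n + 1)  # A_n = {i/n : i = 0, ..., n}
    scores = f(actions) + rng.uniform(-E, E, size=actions.shape)  # noisy policy
    top = actions[np.argsort(scores)[-k:]]  # top-K actions by estimated score
    return top.max() - top.min()            # range r_n of the selected actions

for n in (12, 120, 1200, 12000, 120000):
    mean_range = np.mean([top_k_range(n) for _ in range(200)])
    print(f"n = {n:>6}:  E[r_n] ~ {mean_range:.3f}")
```

Even though the noise is large relative to the spread of $f$, the selected actions concentrate around the maximum as $n$ grows, matching the limit in the theorem.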

In this way, we theoretically confirm our experimental findings discussed in Section 5.3. We believe that our extensive empirical results can serve as a strong basis for further theoretical analysis and are therefore valuable to share with the community. We hope to extend our theory to cover other findings as well, if the time remaining in the discussion period permits.

If the authors could take the guidelines they obtained from the experiments and come up with a hybrid of sorts

That is an interesting idea. While our work was focused on highlighting the differences between low-level and subgoal methods, we can indeed leverage that knowledge to propose approaches that take the best of both worlds. For instance, since AdaSubS uses a set of generators, we can use a low-level policy (either trained or exhaustive) as one of them, which would result in a form of “hybrid search”. By directly leveraging low-level actions, we can guarantee completeness, promote exploration, and allow the algorithm to operate even in very uncertain environments, while retaining the advantages of using subgoals.
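As a rough sketch of what such a hybrid could look like (illustrative pseudocode only; the class and function names below are hypothetical, not part of the AdaSubS codebase):

```python
# Hybrid search sketch: a low-level "generator" that proposes the states
# reachable by single primitive actions, used alongside learned subgoal
# generators. Names here are hypothetical, for illustration only.
from typing import Any, Callable, List

class LowLevelGenerator:
    """Proposes all states reachable by one primitive action.

    Adding it to the generator set guarantees completeness: any state the
    low-level search could reach stays reachable by the hybrid search.
    """

    def __init__(self, actions: List[Any], step: Callable[[Any, Any], Any]):
        self.actions = actions
        self.step = step  # step(state, action) -> next_state

    def propose(self, state: Any) -> List[Any]:
        return [self.step(state, a) for a in self.actions]

def hybrid_generators(learned: List[Any], actions: List[Any],
                      step: Callable[[Any, Any], Any]) -> List[Any]:
    # Learned generators supply long jumps; the low-level one is the fallback.
    return learned + [LowLevelGenerator(actions, step)]
```

The surrounding search loop would stay unchanged: it queries every generator for candidate subgoals and ranks all of them with the same value function, so single-step candidates compete with long jumps on equal footing.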


Additionally, we would like to note that we expanded our analysis of approximation errors to cover subgoal generators, in addition to the value function. Our experiments reveal that subgoal methods are resilient not only to value noise but also to generator noise. Since both components guide the search, errors in one component can be mitigated by the other. These results are detailed in Appendix B.2.1.

Comment

As the rebuttal period comes to an end, thank you for your time and engaging discussion. We're happy to address any final questions you might have.

Comment

They mainly studied the existing algorithms and the conclusions are not groundbreaking.

We agree that our study is focused on existing methods. Even though there is growing interest in applying subgoal methods to combinatorial problems and other complex domains, knowledge about their true advantages remains fragmented. As a result, standard low-level algorithms continue to be the default choice for most applications, regardless of the domain. Our analysis provides empirical evidence indicating which types of domains may benefit from hierarchical methods, enabling more informed decisions in future research. As such, our contribution closely follows the format of [1, 2, 3].

Additionally, for the observation that subgoal methods are highly resilient to value noise, which is arguably our most surprising finding, we provided a theoretical grounding (Theorem 1) that confirms this resilience as a universal property. To the best of our knowledge, this is the first theorem to explain the robustness of hierarchical search to value approximation errors across general domains.

Furthermore, the field of hierarchical search lacks established evaluation guidelines, making comparisons with state-of-the-art methods difficult or sometimes meaningless. During our experiments, we were surprised to find that even works published at top conferences sometimes exhibited poor practices in reporting results. Our analysis highlights some of these pitfalls and offers guidelines that will facilitate smoother progress in the field. While such evaluation standards are common in broader AI research, they are not yet established in the subfield of hierarchical search.

We acknowledge that these implications were not clearly discussed and that the guidelines we propose were scattered throughout the paper. Therefore, we updated our Conclusions to include this discussion and the guidelines we advocate for. Thank you for this important feedback.

We are grateful for your feedback. We are more than happy to continue the discussion if anything remains unclear.

References:

  • [1] What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study (ICLR 2021)
  • [2] Should I Run Offline Reinforcement Learning or Behavioral Cloning? (ICLR 2022)
  • [3] What Matters for Adversarial Imitation Learning? (NeurIPS 2021)
Comment

We thank the Reviewers for their constructive feedback and for acknowledging the strengths of our submission. We are grateful that our work is recognized as a novel analysis of an interesting problem (y8qH, sWxG), which leads to useful conclusions and valuable insights into applying hierarchical search methods (sWxG, y8qH), particularly those regarding the value noise (nZJV, y8qH). We are also pleased that our extensive experimental evaluation (iLMV, y8qH) supports the conclusions well (y8qH) and has no serious flaws (nZJV). We appreciate that the paper's structure is considered logical and well-articulated (y8qH), and that the breakdown of benefits into the four separate takeaways is useful (nZJV). Finally, we are pleased that our findings contribute meaningful insights into the nature of combinatorial search problems (sWxG) -- an area of research with significant implications for AI optimization tasks (y8qH).

We address all specific questions and comments in individual responses. We are happy to improve the paper according to the Reviewers’ suggestions to further strengthen our submission.

Comment

Thank you for your constructive feedback that helped us strengthen our work. Let us summarize the key improvements we have made to the manuscript, with our recent updates:

We introduced a new Action Space Densification Theorem, providing formal grounding for the advantages of hierarchical methods in handling complex action spaces. This complements our existing Theorem 1 on search advancement, strengthening the theoretical foundation of our findings. We are actively working on further theoretical results, guided by our experimental conclusions.

We extended our experiments to demonstrate the robustness of subgoal generators, even under substantial subgoal corruption (up to 50%). These results further reinforce our claims regarding resilience to approximation errors caused by low-quality training data.

Following your feedback, we made several improvements to enhance clarity and precision of our paper:

  • Added a discussion on the classical theoretical foundations of hierarchical planning and their evolution in modern machine learning.
  • Extended the analysis of training data quality effects.
  • Addressed computational resource requirements.
  • Clarified the distinction between planning and learning aspects.
  • Gathered and summarized evaluation guidelines for better clarity.
  • Refined the conclusions to avoid overstatements, aligning them closely with the empirical evidence we provide.
  • Reorganized the related work section for better flow.
  • Fit within the 10-page limit while preserving key insights.

We believe that our expanded theoretical results and additional experiments provide stronger support for our core findings regarding value noise resilience, action space handling, and the impact of data diversity. These changes have made the paper clearer and more valuable to the ICLR research community, deepening the understanding of hierarchical search in combinatorial problems and helping streamline research in this evolving field.

We are grateful for all your insightful comments, which have guided these improvements. We remain open to further discussion during the remaining review period.

AC Meta-Review

The paper attempts to identify properties that make combinatorial reasoning problems more amenable to hierarchical rather than low-level state-space search. Via empirical studies, it identifies 4 such factors: hard-to-learn value functions, complex action spaces, presence of dead ends in the environment, and data collected from diverse sources.

Strengths:

  • Interesting insights about combinatorial reasoning and properties of combinatorial reasoning problems.
  • Extensive empirical study

Weaknesses:

  • Clarity: the paper tries to deliver many distinct points, which makes it hard to digest.
  • The implications and potential for this work's impact aren't very clear. The paper has a lot of interesting information, but how should it translate to algorithm design?
  • The validity of the takeaways isn't obvious, since they depend on the subgoal proposer and other factors that experiments didn't evaluate extensively (see below).

Overall, the metareviewer finds that, despite its merits, this work's drawbacks are more severe than the reviews imply. In particular, the paper evaluates fairly vanilla versions of the low-level search algorithms as baselines. In reality, e.g., the PDDL planning community long ago introduced methods that make search much faster than basic A* by efficiently identifying landmarks (Richter, Westphal, "Guiding Cost-Based Anytime Planning with Landmarks", JAIR-2010), dead-ends (Kolobov, Mausam, Weld, "SixthSense: Fast and reliable recognition of dead ends in MDPs", AAAI-2010), and using many other tricks. These methods weren't designed for learning-to-search scenarios like those in this paper but can be quickly adapted to them by being combined with A* -- actually, many of them are heavily modified versions of A* and other fundamental search algorithms. Comparing to these stronger low-level search baselines can change the outcome of the paper's empirical study significantly, undermining the paper's key contribution. Hierarchical approaches may still win in general, but factors like dead ends may turn out to play a negligible role in this. The metareviewer joins reviewer sWxG in suggesting that the authors solidify the paper's contribution by turning the paper's empirical insights into algorithmic innovations: it will make the paper's merits more tangible and harder to argue against.

Additional Comments from Reviewer Discussion

The discussion revolved around additional empirical studies and clarity issues. Ultimately, the authors have improved the paper significantly as a result but still left the reviewers on the fence as to whether the paper's contributions are solid and significant enough for publication.

Final Decision

Reject