PaperHub
Overall score: 6.8/10
Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 3.0 · Quality: 3.3 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

Cognitive Predictive Processing: A Human-inspired Framework for Adaptive Exploration in Open-World Reinforcement Learning

Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

Human-inspired cognitive mechanisms improve reinforcement learning performance in open-world environments by 4.6% while reducing completion steps by 7.1%.

Abstract

Keywords
Open-World Reinforcement Learning · Human-inspired Artificial Intelligence · Cognitive Architectures

Reviews and Discussion

Review (Rating: 4)

The authors propose a cognitively inspired method to enhance the learning and usage of a world model in reinforcement learning. Their approach integrates three key components: (1) a dynamically updated classifier that adaptively determines the agent's current behavioural phase (exploration, approach, or completion); (2) a dual-memory system that balances long-term storage of interesting transitions with short-term recall of recent, task-relevant experiences; and (3) an uncertainty prediction module to guide exploration more effectively.

The proposed system is evaluated within the MineDojo environment and compared to several classical baselines.

Strengths and Weaknesses

Strengths:

  • The paper clearly explains a complex, multi-stage architecture, making the proposed method easy to follow despite its intricacy.

  • The ablation study is well conducted, effectively isolating the contribution of each component.

  • The appendix provides a detailed analysis of the complementary mechanisms, offering valuable insights beyond the main text.

Weaknesses:

  • The authors report only the mean performance in their experiments; including variance would provide a more complete understanding of the method's stability.
  • The experiments focus primarily on short-horizon tasks, where the agent needs to interact with a nearby object (e.g., a sheep or a tree not far from the starting position). The evaluation would be more convincing if it included more complex goals, such as "Combat Zombie," to test the method’s robustness in long-horizon scenarios.

Questions

The reviewer does not understand why Dreamer-V3 performs better than CPP in the "harvest sand" task but not in "harvest water". Based on the explanation provided in lines 268–269, the results seem to suggest Dreamer should have the advantage in both cases. Could the authors clarify what factors may have contributed to this discrepancy?

The proposed method appears to be sensitive to initial visual conditions. How does the system behave when the target object is not immediately visible—for example, if a sheep is occluded by a forest at the beginning of the episode and the task requires exploratory behavior? Has this been tested or analyzed?

Have the authors experimented with more complex, multi-stage goals that require achieving several sub-goals—such as collecting different materials before crafting a final object? If so, how well does the dual-memory system scale in such scenarios?

How do the authors envision adapting their approach to goal-conditioned settings, where the agent is required to generalize across different target objectives? Would the current phase classification and memory mechanism still be applicable or require modifications?

Limitations

yes

Justification for Final Rating

The reviewer is fully satisfied with the answers provided to his numerous questions. The paper is well-reasoned, and experiments on more complicated or lengthy tasks would only strengthen it. The reviewer maintains the total score and increases the clarity rating to 4 in light of the answers provided.

Formatting Issues

No

Author Response

Thank you for your thoughtful and detailed review. We greatly appreciate your recognition that our paper "clearly explains a complex, multi-stage architecture" despite its intricacy, and that our ablation study is "well conducted, effectively isolating the contribution of each component." Your insightful questions have helped us identify important areas for clarification and improvement in our manuscript.


  • W1: Regarding performance variance
  • A1: You raised an excellent point about reporting only mean performance. We agree this limits understanding of our method's stability. We will incorporate variance statistics into Tables 1 and 2 in the revised paper, following your valuable suggestion.

  • W2: On short-horizon tasks
  • A2: We appreciate your concern about evaluating primarily on short-horizon tasks. Our task selection aimed to balance comprehensive assessment with computational resource constraints. While most tasks involve nearby resources, the "Mine iron ore" task represents a more complex scenario requiring navigation through underground cave systems with limited visibility, serving as our long-horizon test case. We agree that additional complex tasks would strengthen evaluation, and we had considered "Combat Zombie" as you suggested. Due to time constraints, we couldn't complete these experiments for this submission, but we commit to expanding our evaluation to include more complex, long-horizon tasks in future work, as we believe CPP's adaptive exploration would excel in such scenarios.

  • Q1: Regarding the Dreamer-V3 performance discrepancy
  • A3: You identified an insightful question about why Dreamer-V3 outperforms CPP in "harvest sand" but not "harvest water." This apparent inconsistency stems from environmental characteristics that interact differently with our memory system. Water bodies have distinctive visual features (blue coloration, reflective properties) and dynamic behavior (flowing patterns), creating strong perceptual signals that our dual-memory system can effectively encode and retrieve. In contrast, sand blocks have more subtle visual characteristics that can occasionally be confused with similar terrain elements. Dreamer-V3's consistent exploration strategy handles visually ambiguous but static resources better in some cases. We will add this clarification to lines 268-275 in the revised paper.

  • Q2: On initial visual conditions
  • A4: Your question about occlusion is highly relevant. We have indeed tested scenarios where targets are initially occluded. As shown in Figure 3 (particularly in the middle row for the exploration phase), our system handles non-visible targets through its phase-adaptive exploration. When targets are occluded, the stagnation detection mechanism (Equations 17-18) identifies lack of progress and maintains the exploration phase until visual detection occurs. In Appendix C.1, we show how affordance maps evolve from diffuse patterns during exploration to concentrated activation when targets become visible. We will enhance this analysis in Appendix C.1.
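To make the phase-maintenance logic concrete, here is a minimal sketch of how an occlusion-robust phase selector might combine target visibility with a stagnation test; the window size, threshold, and function names are illustrative assumptions, not necessarily the paper's actual Equations 17-18.

```python
import numpy as np

def stagnation_detected(progress_history, window=20, eps=1e-3):
    # Hypothetical stand-in for the paper's stagnation test (Eqs. 17-18):
    # flag a lack of progress when the mean improvement over a sliding
    # window falls below a small threshold.
    if len(progress_history) < window:
        return False
    recent = np.asarray(progress_history[-window:])
    return float(np.mean(np.diff(recent))) < eps

def select_phase(target_visible, progress_history):
    # Remain in the exploration phase while the target is occluded or
    # progress has stalled; hand off to the approach phase once the
    # target is visually detected and progress resumes.
    if not target_visible or stagnation_detected(progress_history):
        return "exploration"
    return "approach"
```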

  • Q3: Regarding multi-stage goals
  • A5: Your question about complex, multi-stage goals touches on an important direction. We conducted preliminary investigations on crafting tasks requiring multiple resources (e.g., crafting an iron sword requiring wood collection → crafting tools → mining ore → smelting → crafting). Our dual-memory system theoretically supports such multi-stage goals by organizing memories according to cognitive phases and resource types, allowing retrieval of relevant past experiences for each subtask. The phase-adaptive controller naturally transitions between exploration, approach, and completion for each subtask sequentially. However, comprehensive evaluation of complete crafting sequences would require substantially more computational resources than our current budget allowed. We believe this represents an exciting direction for future work and will add this discussion to Section 5.
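As a rough illustration of the memory organization described above, the sketch below keys an episodic store by (phase, resource) pairs so each subtask can retrieve its own relevant experiences; the class layout and capacities are assumptions, not the paper's implementation.

```python
from collections import defaultdict, deque

class DualMemory:
    def __init__(self, working_capacity=64):
        self.working = deque(maxlen=working_capacity)  # short-term: recent transitions
        self.episodic = defaultdict(list)              # long-term: selective, keyed store

    def store(self, transition, phase, resource, interesting=False):
        self.working.append(transition)
        if interesting:  # only "interesting" transitions enter long-term memory
            self.episodic[(phase, resource)].append(transition)

    def retrieve(self, phase, resource, k=8):
        # Fetch past experiences relevant to the current subtask, e.g.
        # earlier "exploration" episodes for the same resource type.
        return self.episodic[(phase, resource)][-k:]
```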

  • Q4: On goal-conditioned settings
  • A6: You raise an excellent question about adapting to goal-conditioned settings. Based on our framework (Section 3), we envision two promising approaches: (1) Encoding goal representations as additional inputs to the phase classifier (Equation 3), allowing phase definitions to dynamically adapt based on goal embeddings, and (2) Structuring the episodic memory retrieval (Equation 13) to prioritize experiences relevant to the current goal. The key challenge would be learning generalizable representations of goals that inform appropriate exploration parameters. Our current phase classification and memory mechanisms would require extension rather than fundamental redesign, specifically by conditioning the function f_θ and similarity metrics on goal embeddings. This represents an exciting direction we're actively exploring, and we will add discussion of this potential extension to Section 5 of the revised paper.
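A minimal sketch of extension (1), assuming the phase classifier f_θ is a small MLP: concatenating a goal embedding with the state features lets phase boundaries shift with the target objective. All dimensions and names here are hypothetical.

```python
import torch
import torch.nn as nn

class GoalConditionedPhaseClassifier(nn.Module):
    def __init__(self, state_dim=256, goal_dim=64, n_phases=3):
        super().__init__()
        # f_θ now consumes [state ; goal] instead of the state alone.
        self.f_theta = nn.Sequential(
            nn.Linear(state_dim + goal_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_phases),
        )

    def forward(self, state_feat, goal_emb):
        x = torch.cat([state_feat, goal_emb], dim=-1)
        return self.f_theta(x).softmax(dim=-1)  # probabilities over phases
```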

We sincerely thank you for your thoughtful questions that have helped us articulate both the strengths and limitations of our approach more clearly. Your feedback on addressing performance variance, occlusion handling, and potential extensions to multi-stage and goal-conditioned settings has been invaluable for improving the paper. We will incorporate all these clarifications and additional analyses in our revised manuscript.

Comment

I repost here my response, which I wrongly added in the final comment.

The reviewer is fully satisfied with the answers provided to his numerous questions. The paper is well-reasoned, and experiments on more complicated or lengthy tasks would only strengthen it.

Review (Rating: 4)

This paper introduces cognitive predictive processing (CPP), a neurologically inspired architecture for open-world reinforcement learning. Based on insights on human planning and decision-making, CPP integrates three cognitive components: (1) a phase-adaptive cognitive controller that decomposes tasks into exploration, approach, and completion phases, (2) a dual-memory integration system, combining short-term working memory with selective episodic memory, (3) an uncertainty-modulated prediction regulator that adjusts exploration behavior based on prediction error. The method is evaluated on MineDojo, a benchmark built on Minecraft, and shows performance improvements over baselines including DreamerV3, LS-Imagine, STEVE-1, and VPT.

Strengths and Weaknesses

Strength: This paper addresses an important problem, the inflexibility of fixed exploration strategies in prior open-world RL methods. The authors propose a new adaptive approach and position it well within existing literature. The proposed approach is motivated by human cognitive mechanisms, and its evaluation shows strong performance in exploration efficiency compared to previous methods. The paper includes ablation studies that dissect the individual contributions of each component.

Weakness: As a cognitive scientist, I found the framing of the paper in terms of human-likeness somewhat unclear: is the goal to model human behavior or to leverage insights from cognitive science for performance improvements? Given that the evaluation is strictly performance-based, measuring performance and efficiency, I believe the latter appears to be the focus. However, the paper uses language such as "human-like" that may overstate the behavioral similarity without supporting evidence. In particular, no experiments are presented to assess whether the system's behavior aligns with human planning, nor is there any behavioral validation (e.g., human judgments of how natural or human-like the agent appears). The authors also mention, to some extent, this limitation in the conclusion: "Despite these advances, several limitations remain. Our approach implements simplified cognitive approximations rather than neurologically precise models" (line 300). However, mentioning this limitation implies that a cognitive model capturing human exploration would be more desirable than one with high performance and efficiency. I would suggest replacing "human-like" with "human-inspired" throughout the paper, including the title, to avoid confusion, make the scope clearer, and reformulate the limitations.

There is a minor typo in line 58 ("We" should be lowercase).

Questions

Could you clarify the goals of the paper with respect to human-likeness further?

In the limitation section, you mention real-world applications, and suggest investigating the applicability to "medical domains where adaptive exploration strategies could benefit procedures such as minimally invasive blood clot removal in constrained vascular spaces". This example seems to me surprising and potentially problematic, as such applications would require rigorous safety guarantees before any RL-based exploration could be considered. I would have expected domains with lower risk, such as assistive robotics, as plausible deployment scenarios. Could you please elaborate why you chose this particular example?

Limitations

yes

Justification for Final Rating

The authors clarified that their work does not aim to provide a cognitive model of human exploration, but rather draws inspiration from human behavior. With that clarification, the proposed method appears to be a promising and novel approach to exploration that leverages insights from cognitive science. While I am not in a position to thoroughly compare this approach to the most recent exploration algorithms, the overall idea is interesting and may contribute a useful perspective to the field.

Formatting Issues

None

Author Response

We are deeply grateful for your thoughtful assessment of our work. We are particularly honored to receive feedback from a cognitive scientist, as this interdisciplinary perspective is invaluable for our research bridging reinforcement learning and cognitive mechanisms. Your recognition of our paper's contribution to addressing "an important problem, the inflexibility of fixed exploration strategies" and our "new adaptive approach" is greatly appreciated.


  • W1: human-likeness
  • A1: You have identified a crucial clarity issue in our framing that we completely agree with. You are absolutely right that our primary goal is to leverage insights from cognitive science to improve reinforcement learning performance, rather than to precisely model human behavior. We sincerely appreciate your suggestion to replace "human-like" with "human-inspired" throughout the paper, including the title. We commit to implementing this change throughout the manuscript to better reflect our research goals and avoid overstating behavioral similarity. This change will help readers better understand our contribution without implying claims about precise cognitive modeling that we do not substantiate with behavioral validation.

  • W2: "We" should be lowercase
  • A2: We are genuinely touched by your attentiveness to such details; it reflects the thoroughness with which you reviewed our work. We will correct this in the revised manuscript.

  • Q1: our goals and the medical application example
  • A3: We are impressed by your perceptiveness in questioning our choice of medical example. Our initial motivation for this research was indeed to design intelligent agents capable of goal-directed exploration in dynamic environments. The medical application of nanorobots for tasks like blood clot removal represents our long-term vision for this research direction, where adaptive exploration could have profound human impact in healthcare. However, we fully acknowledge, as you rightly pointed out, that such applications require rigorous safety guarantees beyond our current capabilities. Our plan following this paper is to develop simulated vascular environments with synthetic targets resembling blood clots to test our algorithms in controlled settings. We recognize this is an ambitious direction requiring collaboration with both cognitive science experts and medical professionals. We appreciate your suggestion of safer alternatives like assistive robotics, which represent more appropriate near-term applications, and will revise our examples accordingly.

We are truly grateful for your cognitive science insights and for identifying both the strengths and limitations of our approach. Your feedback has substantially improved our paper's positioning and clarity. We will incorporate all your suggestions in our revised manuscript to ensure our research contribution is presented accurately within its proper scope and context.

Comment

Thank you for your response. I remain positive about this work.

> We sincerely appreciate your suggestion to replace "human-like" with "human-inspired" throughout the paper, including the title. We commit to implementing this change throughout the manuscript to better reflect our research goals and avoid overstating behavioral similarity.

Changing "human-like" to "human-inspired" makes the goal of the paper clearer and less ambiguous, and I welcome this change.

Additionally, I still recommend revising the sentence in the limitations section "Our approach implements simplified cognitive approximations rather than neurologically precise models." As currently phrased, it may still suggest that your goal is to propose a cognitive model, which apparently is not the case. I would suggest simply removing this sentence. Optionally, if the authors wish to acknowledge this direction, you could note that future work may explore whether and how your formulation could serve as a cognitive model of human behavior.

One final point: in the introduction, statements like "humans utilize knowledge through selective dual-memory systems" could be more carefully framed. Most findings about human cognition are empirical and establish theoretical models instead of absolute truths. In this case, the dual-memory system is a model that aligns well with experimental observations, but whether it corresponds to the actual biological implementation remains an open question. More accurate formulations might be: "Human behavior has been effectively modeled using a dual-memory system." or "It has been proposed that humans may utilize selective dual-memory systems." In general, I recommend revisiting such statements to clearly distinguish between cognitive models and definitive claims about human neurobiology.

Review (Rating: 4)

This paper introduces CPP, a framework for RL designed to improve how agents explore and make decisions in open-world environments like Minecraft.

Inspired by real cognition, the CPP framework integrates three components to create more flexible and adaptive agents: A phase-adaptive cognitive controller that breaks down tasks, a dual-memory integration system that balances short & long-term memory and an uncertainty-modulated prediction regulator that adjusts exploration based on environmental predictability.

Strengths and Weaknesses

Strength:

The agent succeeded more often and did so faster. CPP reduced the number of steps to finish tasks by ~7.1%. In the harvest log task, it was more than 50% more efficient than DreamerV3.

The authors’ ablation studies were useful and well done. Each cognitive component was confirmed to contribute meaningfully to the agent performance.

Weakness: The improvement is relatively modest (4.6% averaged over five tasks), and CPP actually gets worse on "mine iron ore," where DreamerV3 still leads. So the demonstrated benefits are a starting point rather than an eventual game-changer (yet).

The paper's central premise is its "human-like framework" built on neurologically inspired systems. While the concepts are well motivated, the actual implementations often rely on hand-tuned heuristics, which is not very adaptive. Many critical parameters that govern the cognitive behavior are hard-coded, e.g. fixed progress thresholds for the explore and approach phases. Similarly, the uncertainty calculation uses a fixed weighting of different error types.

Questions

The current model assumes the learning rule f_θ is static throughout the entire course of an animal's learning. Is it possible that animals "learn how to learn"—that is, the learning rule itself is non-stationary and adapts as the animal gains more experience with the task structure?

The quantitative definitions of surprise and uncertainty are based on simple, weighted linear sums of different error signals with fixed, hand-tuned weights (e.g. ω_o, ω_r, ω_trend). This heuristic approach seems less principled than other methods in RL for quantifying uncertainty.

Could you provide a stronger justification for these specific heuristic formulations?

Limitations

Yes.

Formatting Issues

NA

Author Response

Thank you for your thorough and insightful review. We greatly appreciate your recognition of our work's strengths, particularly your acknowledgment that "the agent succeeded more often and did so faster," reducing task completion steps by ~7.1%, and that our "ablation studies were useful and well done" with each component confirmed to "contribute meaningfully to agent performance." Your balanced assessment of both our contributions and limitations has been extremely valuable.


  • Q1: Learning rule adaptation in animals and AI systems
  • A1: Your question about whether animals "learn how to learn" raises a fascinating and important direction. You've identified a key limitation in our current implementation, where the learning parameters θ remain static throughout training. Your insight aligns perfectly with recent neuroscience findings [1] demonstrating that biological systems indeed adapt their learning mechanisms based on accumulated task experience. We find this suggestion particularly compelling and are already exploring its implementation through meta-gradient approaches that would allow our framework to adapt learning rules based on task performance and environmental complexity. Rather than simply using fixed learning rates across all phases, this would enable truly adaptive learning that evolves with experience. We will expand our discussion of future work in Section 5 to include this promising direction, with proper attribution to your valuable suggestion. Such an approach could potentially address the performance inconsistencies you noted in tasks like iron ore mining.
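As a toy illustration of what adapting the learning rule could look like, the sketch below nudges a learning rate with the recent performance trend; this is a crude trend-following stand-in for the meta-gradient approach mentioned above, and every constant in it is an assumption.

```python
def adapt_learning_rate(lr, perf_history, meta_lr=0.1, window=10):
    # "Learning to learn" in miniature: raise the learning rate when
    # recent performance is improving, lower it when it degrades.
    if len(perf_history) < 2 * window:
        return lr
    recent = sum(perf_history[-window:]) / window
    earlier = sum(perf_history[-2 * window:-window]) / window
    direction = 1.0 if recent > earlier else -1.0
    # Multiplicative update keeps lr positive; clip to a sane range.
    return min(max(lr * (1.0 + meta_lr * direction), 1e-5), 1e-2)
```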

  • Q2: Justification for heuristic uncertainty formulations
  • A2: We appreciate your critical assessment of our uncertainty quantification approach. You're right to question our use of weighted linear combinations with fixed parameters (ω_o, ω_r, ω_trend). While more complex Bayesian approaches might seem more principled, our design choices were motivated by two key considerations. First, computational efficiency: in already computationally intensive environments, Bayesian uncertainty estimation would significantly increase the computational burden during online learning. Second, our approach draws inspiration from neuroscience literature [2] suggesting that even in biological systems, uncertainty representations often manifest as relatively simple combinations of prediction errors rather than full Bayesian computations. Our ablation studies (Table 3) demonstrate that even with these simplified heuristics, the uncertainty regulator contributes significantly to performance (2.67-6.33 percentage points across tasks). Nevertheless, we acknowledge this limitation and will strengthen our justification in Section 4.3, including a more detailed analysis of how different uncertainty formulations affect performance across tasks.
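For concreteness, the weighted combination under discussion reduces to a one-liner like the sketch below; the numeric weights are placeholders, not the values used in the paper.

```python
def combined_uncertainty(obs_err, rew_err, trend_err,
                         w_o=0.5, w_r=0.3, w_trend=0.2):
    # Fixed-weight linear fusion of the three prediction-error channels
    # (ω_o, ω_r, ω_trend); placeholder weights for illustration only.
    return w_o * obs_err + w_r * rew_err + w_trend * trend_err
```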

We sincerely thank you for your thoughtful questions that have pushed us to better articulate and improve our approach. Your suggestions regarding adaptive learning rules and uncertainty estimation have provided clear directions for enhancing both our current paper and future research. We will incorporate these improvements in our revised manuscript to address the limitations you've identified while maintaining the strengths you recognized in our work.


[1] Sousa M, Bujalski P, Cruz B F, et al. A multidimensional distributional map of future reward in dopamine neurons[J]. Nature, 2025: 1-9.
[2] Trudel N, Scholl J, Klein-Flügge M C, et al. Polarity of uncertainty representation during exploration and exploitation in ventromedial prefrontal cortex[J]. Nature Human Behaviour, 2021, 5(1): 83-98.

Review (Rating: 5)
  • Introduces cognitive predictive processing (CPP) framework inspired by human decision-making, for the purpose of improving exploration in open-world environments
  • Involves three components: (1) module that decomposes tasks into phases, (2) a dual-memory integration system, and (3) an uncertainty-aware prediction module
  • Evaluated on MineDojo, demonstrating improved performance on resource collection tasks

Strengths and Weaknesses

Strengths

  • Paper is easy to follow, and the presentation is polished
  • The architectural decisions make sense, and appear novel
  • The evaluation environment is appropriate given the stated aims
  • The paper contains sufficient details for reproducibility, and links to source code
  • The results indeed appear promising

Weaknesses

  • Performance is not consistently better than baselines (addressed as limitation)
  • Related, the architecture (especially the manually-defined "phases") seems a bit brittle, and may not be applicable to all tasks of interest
  • The approach is evaluated only on one domain

Questions

  • How does the uncertainty-modulated exploration compare to HyperX [1]? This seems like a relevant work either way.
  • Which other evaluation domains were considered, and why was MineDojo selected among them? It is true we are lacking a diverse range of open-world environments in the literature; would simply like to understand other considerations and potential issues with them

[1] https://arxiv.org/abs/2010.01062

Limitations

Yes; although more details about limitations introduced by the three distinct phases could be discussed more at length (are there real-world tasks that fall outside of this framework? if not, and this is really very general, a stronger, more explicit explanation of this would be useful too)

Justification for Final Rating

I stand by my initial review and therefore maintain my current score.

Formatting Issues

N/A

Author Response

We sincerely thank you for your thorough and positive assessment of our work. We are encouraged by your recognition that our paper is "easy to follow" with "polished presentation," that our "architectural decisions make sense and appear novel," and that our evaluation is "appropriate" with "sufficient details for reproducibility." We particularly appreciate your acknowledgment that our results "indeed appear promising" and your overall favorable rating.


  • Q1: How does the uncertainty-modulated exploration compare to HyperX [1]? This seems like a relevant work either way.

  • A1: Thank you for highlighting this relevant work. Regarding the comparison between our uncertainty-modulated exploration and HyperX, we observe both conceptual similarities and important distinctions that highlight the contributions of our approach. HyperX addresses uncertainty-guided exploration by introducing two exploration bonuses during meta-training: (1) a novelty bonus on approximate hyper-states using random network distillation, and (2) a bonus based on the discrepancy between predicted and observed rewards/transitions. Their approach focuses primarily on learning good task-adaptation behavior across different tasks in a meta-learning context, with exploration bonuses that tend toward zero during training.

  • A1: Our uncertainty-modulated prediction regulator differs in several significant ways: First, while HyperX operates in a meta-reinforcement learning context focusing on cross-task exploration, our CPP framework addresses the challenge of continuous adaptation within a complex open-world environment. This fundamental difference leads to distinct uncertainty quantification approaches. Second, CPP implements a more comprehensive uncertainty estimation through multiple integrated prediction error channels (Eq. 10, 23-24): (1) observation prediction errors that capture perceptual uncertainty, (2) reward prediction errors that quantify value uncertainty, and (3) reward trend errors that detect changes in progress patterns. This multi-channel integration enables our system to distinguish between different sources of uncertainty rather than treating them uniformly. Third, unlike HyperX's exploration bonuses that are designed to eventually disappear, our uncertainty estimates continuously modulate exploration parameters throughout task execution. As shown in Eq. 25-28, our approach dynamically adjusts jump thresholds based on current uncertainty levels, making exploration more aggressive when uncertainty is high and more conservative when it is low. This continuous modulation is crucial for open-world tasks where uncertainty varies greatly across different regions and phases.

  • A1: The effectiveness of our approach is particularly evident in tasks requiring adaptive uncertainty-guided exploration, such as the 'obtain wool' task, where CPP achieves a 22.3% improvement over LS-Imagine. This task involves tracking dynamic targets (sheep) where uncertainty changes rapidly as the agent moves, requiring precise modulation of exploration behavior based on prediction confidence. Our ablation studies (Table 3) further validate the contribution of the uncertainty-modulated prediction regulator, showing performance drops of 2.67-6.33 percentage points across tasks when this component is removed, demonstrating its significant impact on system performance. We will incorporate this relevant reference and clearly articulate these distinctions in our revised manuscript.
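To illustrate the continuous modulation described in this answer, here is a minimal sketch in which high uncertainty lowers the jump threshold (more aggressive exploration) and low uncertainty raises it; the scaling constants are assumptions rather than the paper's actual Eq. 25-28.

```python
def jump_threshold(base_threshold, uncertainty,
                   aggressive=0.5, conservative=1.5):
    # Map uncertainty (assumed normalized to [0, 1]) to a threshold
    # scale: u = 1 -> 0.5x threshold (aggressive jumps), u = 0 -> 1.5x
    # threshold (conservative jumps), interpolating linearly between.
    u = min(max(uncertainty, 0.0), 1.0)
    scale = conservative + (aggressive - conservative) * u
    return base_threshold * scale
```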


  • Q2: Which other evaluation domains were considered, and why was MineDojo selected among them? It is true we are lacking a diverse range of open-world environments in the literature; would simply like to understand other considerations and potential issues with them

  • A2: Regarding your question about evaluation domains, we considered multiple environments including Atari, ProcGen, DMLab, and Atari100k before selecting MineDojo. Our choice was guided by MineDojo's unique advantages for evaluating adaptive exploration: (1) It offers a true open-world setting with minimal constraints on agent behavior; (2) It provides diverse interaction possibilities requiring both broad exploration and precise manipulation; (3) It features tasks with varying complexity levels, from simple resource collection to multi-step interactions; (4) It has established benchmarks with prior work enabling fair comparison. While additional environments would strengthen evaluation (a limitation we acknowledge in Sec. 5), MineDojo's complexity makes it a representative testbed for open-world exploration strategies. We plan to extend our approach to other environments in future work.


  • L1: although more details about limitations introduced by the three distinct phases could be discussed more at length (are there real-world tasks that fall outside of this framework? if not, and this is really very general, a stronger, more explicit explanation of this would be useful too)
  • A3: We sincerely appreciate your insightful observation about the need for more detailed discussion of our framework's limitations. You raise an important point about generalizability. Our three-phase approach (exploration, approach, completion) is well-suited for goal-directed tasks with spatial components that involve resource location, navigation, and manipulation, common in robotics, embodied AI, and open-world gaming environments. This architecture naturally maps to how agents typically progress from broad exploration to targeted completion in these domains, as demonstrated by our successful application across diverse MineDojo tasks.
  • A3: However, we acknowledge limitations in certain task categories: (1) highly unstructured environments where phase boundaries blur significantly, such as continuous-control problems without clear goal states; (2) extremely long-horizon tasks requiring numerous intermediate subgoals where a simple three-phase decomposition may be insufficient; and (3) highly adversarial settings where rapid replanning is necessary due to dynamic obstacles or opponents. As our ablation studies show (Table 3), even when the phase-adaptive component is removed, our memory and uncertainty mechanisms still provide performance benefits, suggesting a degree of robustness. Following your valuable feedback, we will include a more comprehensive analysis of our framework's limitations and generality in the revised paper, including a more explicit discussion of task categories where adaptations to the basic three-phase structure might be necessary.

We are grateful for your insightful questions, which have helped us better articulate our contributions in relation to existing work. Your suggestion to compare with HyperX has been particularly valuable in positioning our research within the literature. We will incorporate all your feedback in our revised paper to strengthen the presentation and contextualization of our work.

Final Decision

This paper proposes a human-inspired architecture for exploration, demonstrating improvements over current methods. Reviewers agreed on the significance of the results and the potential contribution of the work. The main limitations lie in the presentation, and I also found that the choice of parameters in Table 5 requires further justification beyond simply being based on cognitive science principles, especially since no supporting references are provided. Nevertheless, the work is interesting, and I believe it would make a valuable contribution to the literature.