6.5

/10

Poster4 位审稿人

最低6最高8标准差0.9

3.0

置信度

正确性3.0

贡献度3.0

表达3.3

ICLR 2025

Epistemic Monte Carlo Tree Search

Yaniv Oren,Viliam Vadocz,Matthijs T. J. Spaan,Wendelin Boehmer

OpenReview PDF

提交: 2024-09-24更新: 2025-03-06

TL;DR

We extend A/MZ specifically and algorithms that use MCTS with learned models of value and / or dynamics in general to estimate and propagate epistemic uncertainty in search, and harness the search for deep exploration.

摘要

关键词

model basedepistemic uncertaintyexplorationplanningalphazeromuzero

评审与讨论

审稿意见

评分: 8置信度: 22024-11-04

The authors propose a novel method, Epistemic - Monte Carlo Tree Search (E-MCTS) which incorporates the uncertainty estimates into MCTS. The authors theoretically motivate their approach as well as provide an experimental study showing the benefit of using their method in hard exploration problems.

优点

Overall, the paper is generally well-written (except for a few places), the proposed method is novel and the contribution is significant.

MCTS is a technique which showed great results in relatively simple domains and it still remains a challenge to apply it in hard-to-explore domains. This work addresses this challenge by adding the uncertainty estimation mechanism into MCTS. The paper demonstrates good results in hard-to-explore domains.

缺点

In abstract, lines 14-15, it is written "MCTS does not account for the propagation of this uncertainty however.". At this point of the text, it is not clear what it means to propagate the uncertainty and why it is necessary. It might be worth adding a sentence about "propagating uncertainty" before that.
The authors mention in many places that the model will be uncertain about the part of the state-action space which was not observed during training, like for example, lines 125-126 "thus their predictions are uncertain outside of the training set". But what about the situation when the model is actually certain about the unobserved part of the state-action space, but wrong ? I believe this situation is also likely to happen. There is no discussion of it at all in the paper.
The authors propose to estimate the square root of the variance $\sqrt{V[Q]}$ as $1/N \sum_{i=1}^{N} \sqrt{V[v]}$ (see, for example eq.11), but this is not how it generally is estimated. Instead, one should estimate $\sqrt{V[Q]} = \sqrt{1/(N-1) \sum_{i=1}^{N} (v_i - \bar{v})^2}$ , where $\bar{v}$ is the empirical average of $v$ . Where does this estimate come from? What is the impact of using this estimate rather than a more correct one which I wrote?
The authors write "models are consistent" (line 269), but it is not clear what is meant by consistent. They also say in line 270, "V[V[s]] ~V[Vpi[s]]". It is unclear why this must be true. Also, the authors say "Imposing again the assumption that the learned models are consistent" (in lines 276-277). It is unclear what they mean by consistency. Suggestion -- write clearly (as an equaition) the definition of the consistency. Then, argue / discuss the implications of this assumption and why this assumption is true in the first place.
In lines 388-390, the authors say "achieve deep exploration", what does it mean? From the experiments, I can only conclude that the method solves these specific tasks, but it is wrong to claim that it "achieves deep exploration", since first, this "achieving" is not defined and second, it is not implied from the experiments that it will hold in general.
The authors evaluate their method in two domains. I think in order to increase the significance of their work, the authors should also evaluate on more easy-to-explore domains (i.e., in Atari) to show that their method works consistently well overall, as well as on more hard-to-explore domains (i.e., in Atari) to show that in situations where it should work well, it does it consistently. For example, would it also work well in domains like chess? Will the findings still hold? Will E-MCTS actually be more effective here compared to AZ?

问题

See weaknesses for the questions.

2024-11-16

Dear reviewer,

Thank you for your valuable time and many detailed comments and suggestions.

Thank you for pointing this out, we will clarify the text.
This situation is indeed possible, but only when the estimator of the epistemic uncertainty is not working correctly. In the presence of errors in uncertainty estimation we expect EMCTS to have an advantage over methods that do not use planning for exploration, such as the A/MZ+UBE ablation in Figures 2 and 4, because EMCTS averages across multiple estimations of the uncertainty and thus can average out independent errors, while model-free exploration methods (such as the ablation) only have access to one estimate for the uncertainty of the value of the current state $\mathbb V[Q(s_t,a)]$ for each $a$ . This is corroborated in Figure 3 in the appendix where the uncertainty estimates of EMCTS (top row) match the ground truth much better than those of the one-prediction UBE estimator (middle row). We touch on this briefly (line 061 in the introduction), and will make sure this is emphasized more strongly.
Popular methods such as AlphaZero [1] and MuZero [2] learn one model $\hat m$ rather than a full posterior $\hat M$ , because approximating $\hat M$ is hard to intractable in practice. In that case, the sample standard deviation $\frac{1}{N-1} \sum_{i}^N (\nu_{\hat m}^i - \bar \nu)$ yields a measure of how much the backups $\nu^i_{\hat m}$ vary inside the model, because every backup is from the same model $\hat m$ . This variance is the non-epistemic part of the uncertainty (i.e. how stochastic are $\hat m$ and the search policy), and is not suited for directing exploration into novel areas of the state-action space. For this reason, we must use other methods such as Equation 11 (see full derivation in Appendix A.2), that estimate the variance of a sum of random variables using the sum of the individual variances.
Thank you for pointing out that our explanation wasn't sufficiently clear. We will make sure this is described more clearly in the paper. Please see our address of concerns with this assumption in the general rebuttal.
Our intent was to point out that to solve Deep Sea the only way for an RL agent that does not have prior knowledge over the goal transition, such as EMCTS, is by being able to explore novel state actions that are far in the future compared to the current state action, i.e., deep exploration. We will make sure this is phrased clearly in the paper.
We include results in a suite of easy-exploration MinAtar problems in the uploaded revision, where the performance of EMCTS is comparable to baseline AlphaZero using the same uncertainty estimator used for SUBLEQ. Note however that in easy-exploration environments the behavior of UCB exploration methods is expected to be from non-existent down to detrimental. For example, in environments such as Cartpole any exploration is detrimental for performance because the goal state is observed immediately. Further, tuning EMCTS' hyperparameters includes the possibility of setting $\beta = 0$ , which means that exploration with EMCTS can always reduce to the baseline when random or no exploration performs better than UCB-based exploration. In regards to more complex, large state-action spaced environments such as Chess (although this also applies to image-based Atari and Minatar), it is a property of UCB-based exploration methods such as EMCTS that if the uncertainty estimator is not suited, UCB exploration can induce searching every individual state-action which is not useful when the state-action spaces are large (we discuss this in more detail in Section 2.3), and again, can even be detrimental. Given a suited uncertainty estimator (such as the IO-Hash used by EMCTS for SUBLEQ in Figure 1), the confidence bounds provide relevant information for exploration in the environment. Finding such estimators is an open problem and an active area of research. However, it is orthogonal to the contributions of EMCTS.

[1] Silver, David, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot et al. "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play." Science 362, no. 6419 (2018): 1140-1144.

[2] Schrittwieser, Julian, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez et al. "Mastering atari, go, chess and shogi by planning with a learned model." Nature 588, no. 7839 (2020): 604-609.

评论- Acknowledgement

2024-12-02

I would like to thank the authors for their response. Thank you for clarifying my confusion with respects to my concerns 1-5.

I appreciate your additional experiment with regards to comparing the performance of the method to AlphaZero in some Atari games. Your results demonstrate that the performance of the proposed method is comparable to Atari in easy-to-explore games. It would also be very informative to see how does the proposed method compare on very hard to explore games (i.e., Montezuma Revenge).

I am increasing my score.

审稿意见

评分: 6置信度: 42024-11-05

This manuscript tries to solve an important issue in Monte Carlo Tree Search (MCTS), that MCTS doesn’t consider the propagation of the uncertainty of learned model. The authors proposed Epistemic Monte Carlo Tree Search (EMCTS), which can account for the epistemic uncertainty in search and harness the search for deep exploration. The proposed method is achieved in three steps: search with learned reward model, planning for exploration with epistemic search, and propagating epistemic uncertainty in search. EMCTS is evaluated in the challenging sparse-reward task of writing code in the Assembly language SUBLEQ and hard-exploration benchmark Deep Sea, and the results show that the proposed framework can solve the quiz faster than AlphaZero/Muzero.

优点

This manuscript propose a novel and solid framework to introduce and deal with uncertainty of learned model caused by limited data or sparse reward environments, which is not taken into account in the standard MCTS.
EMCTS solves the two-fold objective, i.e., extend MCTS to estimate and propagate the epistemic uncertainty from the uncertain learned model and harness the epistemic uncertainty in the search to achieve deep exploration of the environment, via formulating the epistemic uncertainty, and search based on the propagated UCB.
By giving some assumption, the estimation and propagation of the epistemic uncertainty is defined by a random variable and computed via the variance of reward.
Experiments demonstrate that the proposed EMCTS can solve the harder task in a much smaller number of samples than AZ.

缺点

Some assumptions might make EMCTS inapplicable to real complex problems. They assume that the transition model P is true. Theorem 1 need $Q^{\pi}$ to be linear. The learned models are assumed to be consistent. These is not reasonable in many cases.
In many cases, the transition models need to be estimated, and then the EMCTS will not work.

问题

This method assumes that the reward is independent when $i \neq j$ . Is this satisfied in programming or math task? for example, when $i = j+1$ ?
How to get the true transition model in typical tasks?
Is accuracy and uncertainty of the learned model is dynamically improved in the learning procedure? And is this will affect the consistent assumption?

伦理问题详情

None.

2024-11-16

Dear reviewer,

Thank you for your valuable comments and questions.

Some assumptions might make EMCTS inapplicable to real complex problems:

Assume that the transition model P is true and in many cases, the transition models need to be estimated, and then the EMCTS will not work: We demonstrate in Figures 2 and 4 that EMCTS works well even when the transition model is learned. Theorem 1 indeed assumes the Q-values use the true transition model. However, as discussed in the common rebuttal, this is addressed in Appendix A.3 and the method proposed in A.3 is evaluated and successful in Figure 4. In Figure 2 we demonstrate that EMCTS works with learned transition models even without modifying the upper bound.
Theorem 1 need $Q^\pi$ to be linear: There seems to be a misunderstanding. Note that any value is a sum of discounted expected future rewards and is therefore linear in the rewards. The proof of Theorem 1 (Appendix A.1) only uses this property which is true for every value function. It does not assume that $Q^\pi$ is a generally linear function.
The learned models are assumed to be consistent: We do not believe that this is a concern in practice regardless of the complexity of the problem, see our discussion in the above common rebuttal.

To summarize, we believe none of these assumptions prevent EMCTS from being applicable to real complex problems, such as programming, demonstrated on SUBLEQ in Figure 1.

Is it satisfied that the reward is independent in different types of tasks: Does the reviewer refer to the independence assumption made in line 287-288? Please note that we assume the epistemic uncertainty of different rewards to be independent, not the rewards themselves. This assumption is irrespective of the task. It depends only on the class of estimator used (for example, whether $\hat R$ is a DNN or a table), for the reason that it is only assumed with respect to the learned reward function $\hat R$ . If this is not the assumption the reviewer refers to, could the reviewer please clarify which assumption is referred to?

How to get the true transition model in typical tasks? In many typical tasks where RL is used in practice (board games, programming and algorithm design, video games, robotics) the real transition model is either known (for example, in single-player games or zero sum 2-player games with self-play), definable by the practitioner (for example, programming and algorithm design) or formulated using physics models in robotics.

Is the accuracy and uncertainty of the learned model dynamically improved in the learning procedure? The accuracy of the learned models (reward, value and or transition, but also the learned value uncertainty estimator $\hat u$ ) indeed all improve as the agent sees a growing number of training steps and data points, demonstrated also in Figure 3 in the appendix where the estimates improve left to right with respect to the ground truth.

And will this affect the consistency assumption? For the above reason, indeed any possible inconsistency or error in $\hat u$ should further reduce as training progresses. In practice however we see in the right subplots of Figure 4 in the appendix that the agent explores the environment very quickly, demonstrating that the $\hat u$ estimates are sufficiently accurate to result in reliable exploration of novel states even early on in training.

We hope that these answers along with the main rebuttal address the reviewer's concerns and questions. We are happy to clarify anything that is left unclear.

[1] Schrittwieser, Julian, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez et al. "Mastering atari, go, chess and shogi by planning with a learned model." Nature 588, no. 7839 (2020): 604-609.

审稿意见

评分: 6置信度: 22024-11-12

This paper proposes Epistemic Monte-Carlo Tree Search (EMTCS) by introducing a method incorporating epistemic uncertainty in traditional MCTS, due to the limited coverage of the data. This paper upper bounds the uncertainty in the MCTS process and designs a novel planning scheme with this estimated uncertainty upper bound. This method makes the agent to explore the environments with sparse reward signals more efficiently. They did experiments on code generation task like SUBLEQ, which involves deep exploration. They also conducted experiments on the Deep Sea Benchmark. Extensive experiments show that EMCTS achieve better sample efficiency and deep exploration compared to baseline methods, particularly in tasks with sparse rewards.

优点

The paper provides a relatively rigorous theoretical foundation for EMCTS. The authors derive an upper confidence bound for the epistemic uncertain within the learned MCTS model, which theoretically justify the exploration strategy of EMCTS.
The experiments demonstrate the effectiveness of EMCTS algorithm in SUBLEQ and Deep Sea task. EMCTS show a better sample efficiency than other baseline methods. Notably, EMCTS demonstrate deep exploration in the presence of the learned value, reward, and transition models as well as both deterministic an stochastic rewards.
The motivation of EMCTS is good: to address the epistemic uncertainty in deep exploration tasks and incorporate the uncertainty not just in reward but also in transition dynamics.

缺点

The authors assume consistency across the reward and value. This assumption might not hold in dynamic environments or with certain deep neural network structures in practice. To what degree do those assumptions hold in practice? Are there any experiments results showing that those assumptions hold (or nearly hold) in practice? How does EMCTS handle potential inconsistencies in epistemic uncertainty when the learned models (reward, value, transition) are not fully aligned?
The upper bound of the epistemic uncertainty only applies to the discrete state-action space and can introduce considerable computational costs if we are in a high dimensional space. Moreover, how does EMTCS handle larger state-action spaces or continuous state-action spaces? Experiments in the paper only show the effectiveness of EMCTS in case of discrete action space, but how can EMCTS (or some of its adapted variants) handle the continuous action space? Moreover, how sensitive is EMCTS to the exploration hyperparameter \beta?

问题

伦理问题详情

2024-11-16

Dear reviewer,

Thank you for your time, detailed review, comments and questions.

Consistency: Please see our reply in the main rebuttal above.

High dimensional, large or continuous state-action spaces: To effectively explore high dimensional, large or continuous state action spaces, Upper-Confidence-Bound (UCB) based exploration approaches (such as EMCTS) require estimators that estimate the epistemic uncertainty in reward/value of unvisited transitions $(s',a')$ based on a suitable similarity metric between $(s',a')$ and visited transitions $(s,a)$ (see Section 2.3). Developing such estimators for complex domains is indeed a hard problem and remains an active area of research. It is however orthogonal to the contributions of EMCTS. In the presence of suitable estimators EMCTS can be used directly without any computational cost that depends specifically on the dimensionality or the size of the state-action space. This is demonstrated in Figure 1 where EMCTS paired with a suitable estimator, the IO-Hash, can generate SUBLEQ programs that solve specific programming tasks much more efficiently than the variation which uses a less-suited source of uncertainty (the full observation hash), despite the size of the state space ( $\approx 16$ million unique states). An example of an uncertainty estimator that is viable in the presence of continuous state/action spaces is RND, with which EMCTS works well, see Figure 2 and Figure 4. For these reasons, we do not see a conflict between sampling-based MCTS approaches such as [1] and the extensions proposed by EMCTS, given an uncertainty estimator that is suited for the problem, similarly to other UCB-based exploration methods. Finally, note that MuZero [2] is very successful in relatively large action spaces (such as in Go) because of the search that combines value and prior policy. The same search properties apply to EMCTS, with a prior-exploration-policy (Equation 10 and line 255). An EMCTS agent that uses a prior-exploration-policy is evaluated in Figure 1, succsfully solving SUBLEQ instances much faster than the baseline.

The sensitivity of EMCTS to $\beta$ is very low in the sparse-reward domain of Deep Sea, which is demonstrated in the right subplot of Figure 2. Specifically, $\beta$ need only be large enough to induce deep exploration. This aligns with our expectations for a UCB-based exploration method in sparse-reward environments, where the main driver for performance is the ability to estimate the uncertainty well enough and use it to consistently explore novel states until the rewarding transition is observed.

How does EMCTS handle potential inconsistencies in epistemic uncertainty when the learned models (reward, value, transition) are not fully aligned? This is handled by the definition of $\mathbb V [\hat R]$ (line 216) which guarantees an upper confidence bound over the reward outside of the training distribution. The same idea is used in Equation 22 to train the estimator $\mathbb V [\hat V]$ to approximate a term that guarantees an upper bound over the value. Inside the training distribution, further exploration is not necessary and the epistemic uncertainty can safely drop the zero regardless of the extent to which the models are aligned in practice.

We hope this along with the main rebuttal addresses the reviewer's questions and concerns and are happy to clarify anything still left unclear.

[1] Hubert, Thomas, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, and David Silver. "Learning and planning in complex action spaces." In International Conference on Machine Learning, pp. 4476-4486. PMLR, 2021.

审稿意见

评分: 6置信度: 42024-11-13

This work studies epistemic MCTS, which is to characterize the propagation of model uncertainty in tree search. This is motivated by tasks where the model dynamics are not known in advanced, but need to be learned instead. It could be also useful to host exploration methods.

The algorithm is inspired by theoretical view of linear Q, where the true optimal Q is bounded with high probability by the Q estimate plus the variance term $O\sqrt{V}$ . Though the bound doesn't apply to learned dynamics, it works in the similar way as UCB, where the "inequality" provides intuition of the algorithm design. In the search step, the epistemic UCT then uses a similar bonus term in $O\sqrt{V}$ that is applied to the Q estimate.

This variance term is also propagated through the backup operator, as the authors intended to in their motivation of this work. Through an independency assumption, the variance could be backed up by variance terms of the nodes on the trajectory. With this backup, epistemic UCT could be applied to obtain epistemic MCTS.

The algorithm is tested on Subleq, which is a programming task, and Deep sea, which is a gridworld task. Combining with AlphaZero/MuZero, the algorithm performs quite well on the baseline tasks.

优点

The new algorithm solves MCTS with a learned model, with deep exploration and variance propagation.
The model is supported by theoretical insights.
The model performs quite well on the baseline tasks.

缺点

The practical value of epistemic MCTS remains questionable. In what real-world tasks do we need to run search without knowing the world model?

问题

N/A

2024-11-16

Dear reviewer,

Thank you for your positive comments and valuable time.

Though the bound doesn't apply to learned dynamics The bound is extended to learned dynamics in Appendix A.3. Please see our reply in the main rebuttal for additional details. We evaluate that this approach works in Figure 4. In Figure 2 we demonstrate that even without this modification, EMCTS with learned transition models works well in practice in Deep Sea.

The practical value of epistemic MCTS remains questionable. In what real-world tasks do we need to run search without knowing the world model? First, note that EMCTS is very well suited for tasks where the world model is known, i.e., when only the value model is learned, such as with AlphaZero [AZ, 1] as well as when it is not known. Many methods that use true world models such as AZ, which is very successful in the real-world (known world model tasks of programming and algorithm design [2,3]), rely on learned value functions which contain epistemic uncertainty. The epistemic uncertainty can be useful to direct the agent towards better novel policies, that are very hard to find using non-directed exploration. We demonstrate this on the task of programming in SUBLEQ in Figure 1, where agent has access to the true world model (i.e., both the transition dynamics as well as the reward dynamics) and EMCTS significantly outperforms baseline AZ which uses the same true model. By reasoning over the uncertainty in the learned value function, EMCTS is able explore the environment for solutions (programs) that the agent has not tried before much more efficiently.

A concrete example of a real world task where search is being used without a known world model is [4], where MuZero (fully learned, latent world model) improves over the engineered-by-hand previous state of the art in the real world task of video compression for YouTube videos. In Figures 2 and 4 we demonstrate that our method works in practice with the learned models of MuZero.

Please let us know if any open questions or unaddressed concerns remain.

[2] Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022. doi: 10.1038/s41586-022-05172-4.

[3] Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, Thomas Koppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, Robert Tung, Minjae Hwang, Taylan Cemgil, Mohammadamin Barekatain, Yujia Li, Amol Mandhane, Thomas Hubert, Julian Schrittwieser, Demis Hassabis, Pushmeet Kohli, Martin Riedmiller, Oriol Vinyals, and David Silver. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964):257–263, 2023. doi: 10.1038/s41586-023-06004-9.

[4] Mandhane, Amol, Anton Zhernov, Maribeth Rauh, Chenjie Gu, Miaosen Wang, Flora Xue, Wendy Shang et al. "Muzero with self-competition for rate control in vp9 video compression." arXiv preprint arXiv:2202.06626 (2022).

2024-11-24

Thank you for providing a rebuttal and for correcting me on learned dynamics. I did not look into the appendix but now I see it does extend to it. My evaluation was already positive. I would maintain my rating.

2024-11-16

Dear reviewers,

Thank you for your valuable time and many comments and questions.

Learned transition models: Reviewers yivx and Uotu correctly point out that Theorem 1 requires the transition model $P$ to be known in advance, and raise the question whether EMCTS is applicable with learned transition models, such as in MuZero [1].

EMCTS is indeed applicable with learned transition models, which are discussed in Section 3.4.

In Appendix A.3 we propose a theoretically motivated method to construct an extended upper bound on estimates of $\mathbb V[\hat Q(s,a)]$ that accounts for possible uncertainty in the transition model. In Figure 4 (bottom subplots) we evaluate an agent that uses this method and a learned transition model, which successfully solves a challenging instance of Deep Sea. In Figure 2 (left subplot) we find that in practice the unmodified upper bound already works well even with a learned transition model, demonstrated with the E-MZ agent, which uses a learned transition model and the unmodified upper confidence bound. We will make sure it is emphasized in the paper that EMCTS can be used in practice in the presence of both learned and true transition models by referring to Appendix A.3 and Section 3.4 directly after Theorem 1, as well as describing the E-MZ agent in Figure 4 in more detail in the appendix.

Consistency assumption: The consistency assumption was raised as a concern by reviewers acfQ, yivx and v4iH. Let us first clarify the consistency assumption required by EMCTS (line 269-270). EMCTS requires that the learned value function $\hat V$ and reward function $\hat R$ are trained on the same data. This is the popular choice in practice by standard methods such as MuZero [1]. In addition, EMCTS assumes that the epistemic uncertainty over the learned value $\mathbb V[\hat V]$ (Equation 6) is consistent with the uncertainty with the model used by MCTS $\mathbb V[V^\pi_{\hat M}]$ . That is, that they the are approximately the same $\mathbb V[\hat V] \approx \mathbb V[V^\pi_{\hat M}]$ (line 270).

We believe that this assumption is reasonable in principle for two main reasons: First, since $\hat V$ and $\hat R$ are trained on the same dataset the uncertainty in $\mathbb V[\hat V]$ analytically coincides with the uncertainty in $\mathbb V[\hat R]$ (Equation 5). Second, the estimator $\hat u$ based on the uncertainty Bellman equation [2], estimates $\hat u \approx \mathbb V[\hat V]$ using estimates of $\mathbb V[\hat R]$ (Equation 6), and for that reason $\hat u$ is trained to operate as an estimate that is consistent with both $\mathbb V[V_{\hat M}^\pi], \mathbb V[\hat V]$ .

In practice, inside the training distribution, the TD-learning based training of $\hat u$ is well established and is empirically observed to produce useful estimates of the epistemic uncertainty in practice (for example, [3, 4]). Outside of the training distribution, this is addressed in Appendix D.6, by explicitly upper-bounding predictions where $\hat u$ is evaluated outside of the training distribution. Our experiments corroborate the effectiveness of this approach, and demonstrate robustness of EMCTS to possible errors in uncertainty estimation: Figures 1, 2 and 4 demonstrate that EMCTS with $\hat u$ work well across multiple environments, sources of uncertainty, different architectures and variations of learned models. Finally, in Figure 3 we visualize the uncertainty estimated by EMCTS compared to the uncertainty estimated by one sample of $\hat u$ , where EMCTS matches the ground truth epistemic uncertainty much better.

We will make sure this is described more clearly in the paper.

We have uploaded a new version with the discussed changes (marked in green) and additional results on easy-exploration environments requested by reviewer v4iH (Figure 5 in the new version). We hope that this discussion along with the individual replies addresses the reviewers' concerns, and we are looking forward to further discussion with the reviewers.

[2] O’Donoghue, Brendan, Ian Osband, Remi Munos, and Volodymyr Mnih. "The uncertainty Bellman equation and exploration." In International Conference on Machine Learning, pp. 3836-3845. 2018.

[3] Moritz Akiya Zanger, Wendelin Boehmer, and Matthijs T. J. Spaan. Diverse projection ensembles for distributional reinforcement learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

[4] Luis, Carlos E., Alessandro G. Bottero, Julia Vinogradska, Felix Berkenkamp, and Jan Peters. "Model-based uncertainty in value functions." In International Conference on Artificial Intelligence and Statistics, pp. 8029-8052. PMLR, 2023.

AC 元评审

2024-12-22

The work studies epistemic MCTS, in particular the propagation of model uncertainty in the tree search. All reviewers appreciated the solid theoretical contribution as well as the algorithmic innovation; both are recognized as being substantial. The experiments in hard to explore environment show the advantage of the proposed methods. The reviewers have some comments, which can be taken as an opportunity to clarify certain aspects of the work, and they can be addressed in the writing of an updated version.

审稿人讨论附加意见

Many of the reviewer-specific questions were addressed during the rebuttal.

最终决定Accept (Poster)

2025-01-22

Accept (Poster)