Online Feedback Efficient Active Target Discovery in Partially Observable Environments
Abstract
Reviews and Discussion
In scanning problems such as MRI imaging, the data are acquired locally and one must actively decide where to scan next (a comprehensive scan is expensive). In such problems it is difficult to train a decision maker to guide the scanning process, because fully scanned data are scarce. To this end, this paper treats the scanning steps as denoising steps in the context of a diffusion model. The authors use the diffusion model to build a predictive distribution over the different areas, and then leverage that distribution to score the information gained by scanning different grids, whether that information serves exploration or exploitation.
Contributions:
- cast scanning decisions into diffusion dynamics.
- apply exploitation and exploration heuristics leveraged by the Bayesian experimental design community.
Strengths and Weaknesses
Strengths:
- The studied problem is important and well-introduced.
- I find it quite interesting to cast the query steps as diffusion dynamics.
- The method is evaluated on 3 different test problems.
Weaknesses:
- This paper has many parameters and the notation is very complicated. I personally find Section 5 (the main method section) quite hard to follow. Perhaps some more description of the intuition would help?
- If I am not wrong, this method requires quite significant online computation for each scanning decision (training reward models, optimizing scores, etc.). For problems like species discovery, the online complexity is perhaps not a major issue. But for problems like MRI scanning (which is mentioned repeatedly in the paper), online computational time is a limitation, and the authors really need to discuss this. Online complexity is a very common problem in Bayesian experimental design.
- It seems to me that there are many hyperparameters and models to be tuned, such as the exploration weight and others. This would make it rather complicated to apply the method. For instance, Table 4 shows that the SR drops to 0.8465 for one setting of the exploration parameter, while in Table 3 we see that the baseline GA is 0.8261 (I wonder whether the authors conducted statistical tests over repeated experiments, or were all experiments only run once?).
- It seems that the algorithm relies on the reward function to correctly identify the target area, while the reward function is trained as the scanning proceeds. One might need to discuss the model misspecification problem (e.g., a bad model leads to bad queries, and bad queries lead to a bad model).
- In addition, if I am not wrong, the reward function is a black-box function, while in the introduction the authors claim a white-box policy, which might be exaggerated.
Clarification:
- Could you elaborate why equation (7) and appendix line 434 are valid? I mean, if both the reward and the negative entropy are non-negative, then the product scales in the same direction as the two factors. But if either of them spans both positive and negative values, then optimizing the product is not the same as optimizing the reward and the negative entropy.
Questions
Clarification:
- Could you please elaborate how the generative estimate is conditioned on the observations? Does equation (3) indicate this, or from which equation may I see this?
- In Thm 1, why do you have [\hat{x}_t^{i}]_{q_t}? I thought \hat{x}_t^{i} is already the i-th grid, but do we use the grid q_t to index \hat{x}_t^{i}?
- In line 159, it seems like the aim here is to avoid computing all possible grids. How is this actually achieved by Thm 1?
- In Thm 1, how strong is the assumption and is this assumption limiting in practice?
- Could you elaborate why equation (7) and appendix line 434 are valid? I mean, if both the reward and the negative entropy are non-negative, then the product scales in the same direction as the two factors. But if either of them spans both positive and negative values, then optimizing the product is not the same as optimizing the reward and the negative entropy.
Minor:
- You use a single notation in the paper. However, the quantities from the past are actually fixed while the current decision is optimized. I would suggest writing them with a distinct notation to be clearer on the distinction.
- Code: when you attach the code, you may consider removing the .git folder. It creates unnecessary tracking information and GitHub links (your repo seems anonymous, which is fine).
- Code: you may consider adding versions to requirements.txt; e.g., scipy can sometimes have version issues.
Limitations
I tend to say one should discuss the computation needed for each query, and maybe discuss the quality of the trained reward function.
Justification for Final Rating
My original review listed issues such as clarity and the missing discussion of complexity.
The rebuttal clarified some terms and promised to add some discussion of, e.g., online complexity.
One flaw I notice: the authors put a lot of additional experiments in the appendix, but at the moment the information seems somewhat scattered. I would say the authors need to provide at least an overview in the supplementary material, discussing what is in which appendix section and how the appendix sections strengthen the main paper.
My original rating was already positive (4), and I feel the same score applies. I think this paper is between 4 and 5, leaning more towards 4 because the overall clarity can still be improved, in my opinion.
Formatting Issues
Minor:
- line 102 has a small indent.
- line 165: “We prove this result in Appendix” – did you miss “Appendix B”?
- line 190: “??” should be fixed.
- I think equations should be within linewidth (appendix A, D).
- In your appendix, theorems are numbered differently (main: prop 1, thm 1, 2, … vs appendix: thm 5.1, 5.2, …).
Thank you for your insightful feedback. We are glad that you found the studied problem important and our method interesting. We have addressed all your comments and concerns below.
Q1: Would more description of the intuition help?
A1: We acknowledge that Section 5 contains numerous parameters and complex notation, which may pose challenges for readers. To enhance clarity, we will add a detailed notation table in the appendix that clearly defines each symbol and its purpose.
Q2: For problems like MRI scanning, online computational time is a limitation, and the authors might need to discuss this.
A2: We fully acknowledge that online computational efficiency is critical for practical deployment, especially in time-sensitive applications like MRI scanning. To address this, we have evaluated the online computation time required by DiffATD to select the next sampling location across varying search space sizes, using a standard compute setup. These results, detailed in Appendix Section V (“More Details on Computational Cost across Search Space”), show that the sampling time per observation ranges from 0.41 to 3.26 seconds—well within practical limits for many applications, including MRI. We will further emphasize this discussion in the revised manuscript to provide readers with a clear understanding of DiffATD’s online efficiency.
Q3: Table 4 shows that the SR drops to 0.8465 for one setting of the exploration parameter, while in Table 3 the baseline GA is 0.8261. Did the authors conduct statistical tests over repeated experiments?
A3: Firstly, since this parameter governs the balance between exploration and exploitation, it is expected that DiffATD’s performance is sensitive to its choice. Your observation that the Success Rate (SR) decreases to 0.8465 at this setting—close to the baseline GA’s SR of 0.8261 reported in Table 3—is particularly insightful and aligns well with our intuition. As the exploration weight progressively decreases, the model emphasizes exploitation more heavily, leading to a decline in performance. This trend underscores the critical role of exploration during the early stages of the discovery process. Notably, when the exploration term is removed entirely, the approach effectively reduces to GA, relying solely on exploitation. These results collectively emphasize that a purely exploitation-driven strategy is insufficient for active target discovery, and that maintaining an effective balance between exploration and exploitation is essential for success.
Secondly, we would like to clarify that all results presented in the paper are averaged over three independent trials, with the mean values reported. Additionally, the standard deviations across these trials are provided in Section S of the Appendix. These results not only demonstrate the effectiveness of DiffATD but also underscore its stability and consistency across multiple runs.
Q4: One might need to discuss reward model misspecification problem.
A4: We agree that the reward model—trained incrementally as the active discovery process unfolds—plays a pivotal role in guiding exploration and exploitation. We recognize that model misspecification, particularly in the early stages when training data is limited, can lead to suboptimal queries and potentially bias the search process. To address this, DiffATD is designed with a frontloaded exploration strategy grounded in the maximum entropy principle. By deliberately selecting the most diverse and uncertain observations early on, we ensure that the reward model is trained on a broad and informative dataset from the outset, thereby reducing the risk of bias and helping correct initial inaccuracies.
As additional data is acquired with each query, the reward model improves its predictive accuracy, resulting in more reliable guidance for subsequent exploitation-focused queries. This adaptive cycle—where exploration seeds robust model learning, and the refined model in turn enables targeted exploitation—is essential to maintaining both flexibility and effectiveness, even under strict query budgets. To illustrate this, we provide extensive visualizations (see Figures 7 and 24–28) showing the reward model’s confidence becoming more accurate as the discovery process progresses, underscoring its critical role in guiding exploitation.
Q5: The reward function is a black-box function, while the authors claim white-box policy, which might be exaggerated.
A5: Thank you for your valuable observation. To clarify, the term “black-box policy” usually refers to policies learned end-to-end via opaque models without explicit interpretability. In contrast, our policy is explicitly designed based on transparent, principled concepts—such as the maximum entropy principle—making it inherently interpretable, which justifies describing it as a “white-box policy.” While the reward model is implemented with a neural network and can be seen as a black-box component, its role is limited to predicting target presence in regions and does not obscure or complicate the policy’s overall interpretability or decision-making logic. We hope this distinction helps clarify our terminology.
Q6: Are Eqn. 7 and appendix line 434 valid, i.e., are both the reward and the negative entropy non-negative?
A6: In our framework, both the reward and the likelihood (which corresponds to the negative entropy) are defined to be non-negative quantities. Specifically, the reward represents the probability that a given region contains the target, so it is naturally bounded between 0 and 1. Meanwhile, the likelihood, as formulated in Theorem 2, involves an exponential function, ensuring it is also always non-negative. Because both factors are non-negative by definition, their product scales consistently with each individual component. Therefore, optimizing the product is equivalent to jointly optimizing both the reward and the negative entropy as intended. We hope this clarifies the validity of Equation (7) and the related expression in the appendix.
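To make this concrete, here is a minimal worked sketch (schematic notation; r and ℓ below are illustrative stand-ins for the reward and the Theorem 2 likelihood, not the paper's exact symbols):

```latex
% Reward r(q) \in [0,1] (probability of target presence) and
% likelihood \ell(q) = \exp(-d(q)) > 0 (an exponential form, cf. Theorem 2).
s(q) \;=\; r(q)\,\ell(q) \;\ge\; 0,
\qquad
\frac{\partial s}{\partial r} \;=\; \ell(q) \;\ge\; 0,
\qquad
\frac{\partial s}{\partial \ell} \;=\; r(q) \;\ge\; 0.
```

Because both partial derivatives are non-negative, increasing either factor can only increase the combined score, so maximizing the product never trades one objective against the sign of the other.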
Q7: Discuss computation needed for each query and the quality of the trained reward function.
A7: We have provided a detailed analysis of the computational requirements per query in Appendix Section V, with results summarized in Table 18. Our findings show that DiffATD maintains efficient sampling times—ranging from 0.41 to 3.26 seconds per observation—even as the search space scales, which is suitable for most practical applications. In response to your feedback, we will enhance the manuscript by adding further visualizations, similar to Figures 7 and 24–28, and include a dedicated discussion on the quality and reliability of the trained reward function.
Q8: In Thm 1, why do you have [\hat{x}_t^{i}]_{q_t}? Do we use the grid q_t to index \hat{x}_t^{i}?
A8: Here, i in the superscript represents the i-th sample of a batch, and \hat{x}_t^{i} denotes the i-th sample of the estimated entire search space at the t-th query step (please see our discussion around Lines 148-153). Additionally, yes, we use q_t in the subscript to index the q_t-th grid/region in the estimated search space \hat{x}_t^{i}.
Q9: Could you please elaborate how the generative estimate is conditioned on the observations? Does Eqn. (3) indicate this?
A9: Yes, Eqn. (3) describes how the estimated search space is constructed: it is the denoised estimate computed via Tweedie’s formula, starting from a noisy latent. Crucially, each sample is obtained by executing the reverse diffusion process, which is explicitly conditioned on the set of previously acquired observations—this conditioning is reflected in the measurement guidance step of Algorithm 1 (line 6), where the observed values at the previously queried locations are enforced during the reverse diffusion process. Therefore, the estimated search space integrates all information obtained up to time t, producing a distribution over possible completions of the unobserved space that is fully consistent with prior measurements.
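For intuition, the sketch below shows how such measurement guidance is commonly enforced in a single reverse-diffusion step (a DPS-style illustration; the function names, signatures, and the exact correction used in Algorithm 1 are our assumptions, not code from the paper):

```python
import torch

def guided_reverse_step(x_t, t, eps_model, alphas, alphas_bar,
                        obs_idx, obs_val, guidance_scale=1.0):
    """One measurement-guided reverse-diffusion step (illustrative sketch).

    x_t        : noisy latent of shape (B, C, H, W)
    t          : integer timestep index
    eps_model  : noise-prediction network, eps = eps_model(x_t, t)
    alphas     : per-step alpha_t schedule (1D tensor)
    alphas_bar : cumulative-product schedule (1D tensor)
    obs_idx    : 1D LongTensor of flattened indices already queried
    obs_val    : observed values at those indices, shape (B, len(obs_idx))
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    a_t, ab_t = alphas[t], alphas_bar[t]

    # Tweedie-style one-step estimate of the clean search space x0.
    x0_hat = (x_t - torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(ab_t)

    # Data-consistency loss restricted to the already-queried locations.
    residual = x0_hat.flatten(1)[:, obs_idx] - obs_val
    loss = residual.pow(2).sum()
    grad = torch.autograd.grad(loss, x_t)[0]

    # Unconditional DDPM mean, then a gradient correction that pulls the
    # trajectory toward consistency with the acquired observations.
    mean = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(a_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_prev = mean + torch.sqrt(1.0 - a_t) * noise - guidance_scale * grad
    return x_prev.detach(), x0_hat.detach()
```

The key point is that the correction term only involves the fixed past observations, which is why every generated sample remains consistent with what has already been measured.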
Q10: Avoiding computing all possible grids in line 159, how is this achieved by Thm 1?
A10: We would like to clarify that our aim is to avoid computing a separate set of particles for every possible sampling sequence, not to avoid computation across the possible grids at a single sampling step. To this end, in Theorem 1 we showed that the squared L2 norm inside the term defined in Eqn. 5 can be decomposed into two parts: one over the indices of the candidate measurement location at time t, and the other over the indices corresponding to previous observations. We achieve this by exploiting the structure that the measurement locations only differ in the newly selected indices. Hence, within the arg max, the elements of each particle remain unchanged across all candidate locations, except at the newly selected indices, and the term associated with the previous observations becomes a constant in the arg max and can be disregarded. Consequently, the formulation reduces to computing the squared L2 norms exclusively for the elements corresponding to the candidate location, which can be computed in one shot via a single reverse diffusion process, because the measurement guidance term in the reverse diffusion process does not change while the past observations are fixed.
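In schematic notation (our own symbols, not the paper's), the decomposition is simply a split of the squared norm over disjoint index sets:

```latex
% For any particle u and reference v compared on the measured index set
% q_t \cup o_{1:t-1}, where o_{1:t-1} are the previously queried indices:
\sum_{k \,\in\, q_t \cup\, o_{1:t-1}} (u_k - v_k)^2
  \;=\; \underbrace{\sum_{k \in q_t} (u_k - v_k)^2}_{\text{depends on the candidate } q_t}
  \;+\; \underbrace{\sum_{k \in o_{1:t-1}} (u_k - v_k)^2}_{\text{constant inside } \arg\max_{q_t}}.
```

Only the first sum varies with the candidate location, so it is the only piece that needs to be evaluated when ranking candidates, and the particles themselves can be generated once.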
Q11: In Thm 1, how strong is the assumption?
A11: The assumption in Theorem 1 reflects the idea that, when estimating uncertainty over a batch of samples, each sample contributes equally to the computation. In most practical scenarios, especially when there is no prior reason to favor one sample over another within a batch, this is a reasonable and standard assumption.
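For concreteness, one way to read this assumption (schematic notation, not verbatim from the paper) is that each of the N particles receives equal weight, so batch expectations reduce to plain averages:

```latex
q_t\!\left(\hat{x}_t^{(i)}\right) \;=\; \frac{1}{N}, \quad i = 1,\dots,N,
\qquad\Longrightarrow\qquad
\mathbb{E}_{q_t}\!\left[\, f\!\left(\hat{x}_t^{(i)}\right) \right]
  \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f\!\left(\hat{x}_t^{(i)}\right).
```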
Dear Reviewer,
Thank you for acknowledging our response. We hope it has addressed all of your questions.
Thanks.
One more minor remark. I noticed that you put a lot of additional experiments in the appendix, but at the moment the information seems somewhat scattered. E.g., the significance tests and the multi-armed bandit experiments were there but seem easily overlooked (this possibly applies to other reviewers as well, based on what they wrote).
I would suggest providing at least an overview in the supplementary material, discussing what is in which appendix section and how the appendix sections strengthen the main paper.
My original rating was already positive. I only think the overall clarity can still be improved.
Dear Reviewer,
Thank you very much for your valuable suggestion. Following your feedback, we will definitely include an overview in the supplementary material to better guide readers through the Appendix. We truly appreciate your feedback in helping us improve the presentation.
The paper proposes a novel method, DiffATD, to guide diffusion models in target discovery. To this end, following existing ideas from experimental design, DiffATD trades off exploration of novel areas in the search space, measured by (an approximation of) the model's entropy, against exploitation of a reward that is learned during the discovery process.
Strengths and Weaknesses
Strengths
- The problem is important; DiffATD might be useful in more complex domains (such as robotics) for target discovery.
- The experiments are very comprehensive, showing solid results across several independent tasks.
- I am not an expert in this field, but the proposed method seems novel.
Weaknesses
- Experiments that demonstrate task discovery beyond vision-related tasks (e.g. drug discovery or robotics) would significantly strengthen the paper.
- The main empirical results of the paper essentially only ablate DiffATD's exploration and exploitation components, comparing against max-entropy, greedy, and uniform sampling. The paper only compares DiffATD with multi-armed bandit techniques (UCB and ε-greedy) on MNIST in the appendix. Additional experiments demonstrating DiffATD's superiority over more competitive methods would improve the paper. For instance, comparing with a baseline that relies on supervised learning (as mentioned in the related works) and showing that DiffATD is much more sample efficient would strengthen the paper.
- My understanding is that most experiments were carried out only on one seed, with the exception of the experiment in appendix S, which studies only two datasets and lacks comparison against other existing baselines in the literature.
Questions
- Where do you get the rewards from in your experiments? You motivate DiffATD with problems in which labeled data (for supervised learning) is expensive; however, in your experiments it seems like the reward is given by some ground-truth data that was collected offline. Can you please clarify that?
Limitations
The authors did not mention any limitations of their method DiffATD, answering in the submission checklist:
Does the paper discuss the limitations of the work performed by the authors? We design a novel algorithm for a new task.
I believe some more comprehensive discussion on how one can improve DiffATD and what were the challenges in making it work would improve the paper.
Justification for Final Rating
The authors have clarified most of the initial concerns I had about their paper.
Formatting Issues
No.
Thank you for your insightful feedback. We are excited that you found the problem important, the experiments comprehensive, and the method novel. We have addressed all your comments and concerns below.
Q1: Experiments that demonstrate task discovery beyond vision-related tasks (e.g. drug discovery or robotics) would significantly strengthen the paper.
A1: Thank you very much for this valuable suggestion. We would like to highlight that we have already demonstrated the effectiveness of DiffATD beyond vision-related tasks through our experiments on species discovery, as shown in Table 2 and Figure 4. This setting involves non-visual, spatial data and confirms the broad applicability of our approach. We also wholeheartedly agree that extending DiffATD to impactful domains such as drug discovery or robotics is a promising and exciting direction. We look forward to exploring these important areas in our future work and believe DiffATD could offer significant benefits there as well.
Q2: Additional experiments demonstrating DiffATD's superiority compared to more competitive methods would improve the paper. For instance, comparing with a baseline that relies on supervised learning (as mentioned in the related works), and showing that DiffATD is much more sample efficient, would strengthen the paper.
A2: Thank you for this question. We would like to emphasize that our active target discovery (ATD) problem setting fundamentally assumes the absence of any supervised data prior to deployment, making supervised learning approaches inherently inapplicable. This reflects realistic constraints in many real-world domains—such as medical imaging or species discovery—where obtaining labeled ground-truth data is costly, time-consuming, or simply infeasible. Importantly, even when we hypothetically apply a supervised learning approach by assuming access to ground-truth annotations in our ATD setting, our method, DiffATD, performs comparably—and in several cases even surpasses—the supervised baseline. This is achieved without requiring any labeled data (see lines 281–292 and Table 5 in our paper), highlighting the robustness and effectiveness of our approach. This highlights DiffATD’s strong potential to enable effective, interpretable active discovery in practical, data-scarce scenarios.
Q3: My understanding is that most experiments were carried out only on one seed, with the exception of the experiment in appendix S, which studies only two datasets and lacks comparison against other existing baselines in the literature.
A3: We wish to clarify that all experiments reported in our paper were conducted over three independent runs with different random seeds, and the mean results are presented. Furthermore, Appendix S (Tables 15 and 16) includes the corresponding standard deviations, which illustrate the stability and consistency of DiffATD across these trials. We hope this clarifies the robustness of our experimental evaluation.
Q4: Where do you get the rewards from in your experiments? You motivate DiffATD with problems in which labeled data (for supervised learning) is expensive; however, in your experiments it seems like the reward is given by some ground-truth data that was collected offline. Can you please clarify that?
A4: Thank you for raising this important question. In our experiments, the rewards are obtained directly as outcomes from each query or observation made during the active target discovery process. It is important to clarify that, unlike conventional supervised learning, we do not rely on any pre-collected ground-truth labels or offline datasets. Instead, our method continuously learns a parametric reward model incrementally, using the data gathered sequentially and exclusively during the online active discovery phase at inference time.
Q5: The authors did not mention any limitation of their method DiffATD. I believe some more comprehensive discussion on how one can improve DiffATD and what were the challenges in making it work would improve the paper.
A5: Thank you for raising this point. We will certainly include a Limitation and Future Work section in the updated draft as you suggested.
Thank you for your detailed response. You helped clarify all of my questions. A few suggestions for future revisions:
> Thank you for this question. We would like to emphasize that our active target discovery (ATD) problem setting fundamentally assumes the absence of any supervised data prior to deployment, making supervised learning approaches inherently inapplicable. This reflects realistic constraints in many real-world domains—such as medical imaging or species discovery—where obtaining labeled ground-truth data is costly, time-consuming, or simply infeasible. Importantly, even when we hypothetically apply a supervised learning approach by assuming access to ground-truth annotations in our ATD setting, our method, DiffATD, performs comparably—and in several cases even surpasses—the supervised baseline. This is achieved without requiring any labeled data (see lines 281–292 and Table 5 in our paper), highlighting the robustness and effectiveness of our approach. This highlights DiffATD’s strong potential to enable effective, interpretable active discovery in practical, data-scarce scenarios.
I understand this point now, after your clarification. It should be made more explicit and highlighted in the paper.
> We wish to clarify that all experiments reported in our paper were conducted over three independent runs
This should be made more explicit, ideally in the main text.
Dear Reviewer,
We thank you for confirming that our response clarifies all your concerns and will absolutely make sure to include your suggestions in the revised version.
This paper investigates the target discovery problem under a limited sampling budget. Specifically, leveraging the generative capacity of diffusion models, this paper introduces DiffATD, which utilizes diffusion dynamics for efficient active target discovery with a limited sampling budget and better interpretability. Experimental results show that DiffATD outperforms baselines on various benchmarks.
Strengths and Weaknesses
Strengths
- Interesting problem settings and well-defined problem formulation from the perspective of entropy and mutual information.
Weakness
- Missing comparison with other search algorithms (e.g., MCTS and UCB).
- Missing comparison with other object detection algorithms.
- Could the authors explain the relationship between the proposed method and the information bottleneck?
Typos
- Line 190: proof in ??
- Lines 234: Active Discovery of Species.
- Lines 244: Active Discovery of Skin Disease.
- Lines 271 and 293: Same issues
Questions
Questions
- Could the authors provide a comparison with other existing search algorithms?
- Could the authors provide a comparison with other existing object detection methods?
- Could the authors explain the relationship between the proposed method and information bottleneck?
Limitations
No. A comparison with existing search algorithms (e.g., MCTS) and object detection methods is missing.
Justification for Final Rating
I have read the response and other reviews. My concerns have been addressed properly, thus I tend to raise my score to borderline accept.
Formatting Issues
None
Thank you for your nice questions and feedback. We have addressed all your comments and concerns below.
Q1: Could the authors provide a comparison with other existing search algorithms (e.g., MCTS and UCB)?
A1: Thank you for your valuable feedback. To compare with other search methods, we included experiments with multi-armed bandit baselines by implementing both UCB and ε-greedy strategies. Their performance under the ATD setting is reported in Appendix T, Table 17. We have also provided additional comparative results with UCB on the DOTA dataset (see the table below). These results indicate the incapability of UCB-based approaches in tackling the ATD task. Next, we explain why MCTS- and UCB-based approaches are not applicable in the active target discovery setting.
- ATD operates under a partially observable environment where the objective is to strategically cover a large, structured space under a strict budget, and observations' semantics are highly correlated. In contrast, MAB assumes independent arms with stationary reward distributions and focuses on maximizing cumulative reward---assumptions that do not hold for ATD.
- MCTS, on the other hand, relies on repeated rollouts in a known or simulatable environment to estimate long-term rewards. However, in the ATD problem, observations can only be obtained through costly real queries, and rewards are defined online rather than pre-specified. These characteristics make MCTS fundamentally inapplicable to ATD.
Comparison with UCB on DOTA
| Method | | | |
|---|---|---|---|
| UCB (Bandits) | 0.1132 | 0.2146 | 0.2487 |
| DiffATD | 0.5422 | 0.6379 | 0.7309 |
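For completeness, a minimal sketch of how such a bandit baseline can be instantiated over grid cells is shown below (illustrative only; the exact implementation in Appendix T may differ). It also hints at why UCB struggles here: with deterministic, at-most-once queries per cell, the exploration bonus dominates and the strategy degenerates toward uncorrelated exploration.

```python
import numpy as np

def ucb_select(reward_sum, counts, step, c=1.0, rng=None):
    """Choose the next grid cell to query with a UCB rule (illustrative).

    reward_sum : running sum of observed rewards per cell (1D array)
    counts     : number of times each cell has been queried (1D array)
    step       : total number of queries made so far
    """
    rng = rng or np.random.default_rng()
    unseen = np.flatnonzero(counts == 0)
    if unseen.size > 0:                       # optimistic initialization
        return int(rng.choice(unseen))
    mean = reward_sum / counts
    bonus = c * np.sqrt(np.log(step + 1) / counts)
    return int(np.argmax(mean + bonus))
```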
Q2: Could the authors provide a comparison with other existing object detection methods?
A2: Thank you for this question. We would like to clarify why traditional object detection methods are not directly comparable or applicable to the Active Target Discovery (ATD) problem due to fundamental differences:
- Limited and Incremental Observability: In ATD, the agent only has access to sparse, sequential observations from a large, partially observed environment (see Figure 1). The goal is to strategically explore and identify target-rich areas without ever fully observing the entire space. In contrast, object detection methods operate on fully observed images where all visual information is available upfront. This key difference means object detectors are not designed to handle the sequential, budget-constrained querying and discovery aspects central to ATD.
- Lack of Predefined Targets and Labels: Object detection relies heavily on supervised learning with extensive labeled datasets where object classes and bounding boxes are known beforehand. Conversely, ATD assumes no prior knowledge or labeled data about the targets. Target properties unfold dynamically as queries are made, requiring an adaptive, online learning approach. Our DiffATD framework is specifically designed to learn from this limited, sequential feedback without requiring any labeled training data, unlike standard object detection pipelines.
For these reasons, while object detection methods excel in fully supervised, fully observed settings, they are not suitable baselines for the ATD problem. We hope this clarifies the fundamental distinctions and justifies our methodological choices.
Q3: Could the authors explain the relationship between the proposed method and information bottleneck?
A3: Thank you for the question. Our method does not apply the Information Bottleneck principle. Although both approaches share information-theoretic notions such as entropy, our work focuses on designing an active querying strategy to discover more targets in a partially observable environment under strict budget constraints, while Information Bottleneck aims to learn compressed representations that retain task-relevant information. Our formulation does not optimize an Information Bottleneck objective and differs fundamentally in its goal and application.
Q4: Typos in Lines 190, 234, 244, 271 and 293
A4: Thank you for the observations; we will correct these typos.
Dear reviewer, thank you for acknowledging our rebuttal. Have we addressed all of your concerns? If not, please let us know any residual issues you may have, which we would be very happy to discuss further.
Thanks for your effort and detailed response. I have read the response and other reviews. My concerns have been addressed properly, thus I tend to raise my score to borderline accept.
The paper studies active target discovery, where items have to be found across a domain. The paper proposes DiffATD, which maintains a belief distribution over unobserved states. DiffATD then selects new observations by carefully balancing exploration and exploitation. The paper evaluates DiffATD across several image-based "discovery" examples. Across those examples, DiffATD outperforms uniform pixel selection and variants based around pure exploration or pure exploitation.
Strengths and Weaknesses
Strengths:
- The idea of balancing exploration and exploitation for active target discovery appears novel and interesting.
- The proposed method significantly outperforms baselines across all experiments. The experimental methodology appears solid, although experiments are small-scale, and are not comparing to many baselines. I am not familiar enough with the subfield to say conclusively whether any particular baselines are missing.
Weaknesses:
- The key weakness in my eyes is the omission of a comparison to methods from bandits / Bayesian optimization. It seems to me that the problem could be modeled as a noise-free bandit problem. If modeled as a bandit problem, one could build upon a wide range of prior literature that studies how to effectively balance exploration and exploitation. There are a plethora of methods, for example, also some information-based methods such as "information-directed sampling" [1] which seems related to the proposed method.
- The paper is not including any theoretical guarantees for the sampling scheme, as opposed to prior results from bandits / BayesOpt.
- The term in the sampling objective depends on the remaining budget. This seems undesirable to me, as often a budget is not clearly defined (e.g., search and rescue from the motivation). Often, we simply want to identify the target "as quickly as possible". Many bandit algorithms do not require specification of a budget upfront, yet still provide anytime optimality guarantees.
Typos:
- lines 190: "??"
[1]: Russo & Van Roy. Learning to Optimize via Information-Directed Sampling.
Questions
- The paper does not ablate the choice of belief model experimentally. Is the choice of single-step diffusion important?
Limitations
yes
Justification for Final Rating
The authors adequately clarified some of my main concerns during the rebuttal, and I therefore increase my score to 4. Nevertheless, I agree with the other reviewers that the submission lacks some clarity and that results are quite scattered (especially in the appendix).
Also, the authors did not address my concern about a lack of a theoretical analysis of the proposed method (in simpler settings), which could significantly strengthen the paper in my view.
Formatting Issues
The vspace around (sub)sections has been reduced substantially compared to the original template.
Thank you for your valuable feedback. We've addressed all your comments and concerns below.
Q1: The key weakness in my eyes is the omission of a comparison to methods from bandits / Bayesian optimization. It seems to me that the problem could be modeled as a noise-free bandit problem.
A1: Thank you for raising this important point. While, at first glance, active target discovery (ATD) and (noise-free) bandit problems seem related—both balancing exploration and exploitation—there are deep, structural differences that make direct application of classic bandit based methods fundamentally inadequate for ATD. The fundamental bottleneck is that the action space in MAB cannot capture the structural semantics of the environment, which themselves define the targets. Below, we clarify these differences with concrete arguments and supporting experimental results.
1. Modeling Paradigm: Environment-Structure-Centric vs. Action-Centric
In the domains targeted by ATD, such as medical imaging and species discovery, the environment embodies a complex, domain-specific structure that is challenging to accurately represent solely through action correlations—whether during the initial modeling of the MAB problem or throughout the online learning process. Such structural semantics fundamentally dictates whether a given query corresponds to a target, making such semantics indispensable for accurate discovery. Consequently, these problems inherently require environment-centric modeling, underscoring the need for a generative model capable of explicitly representing such structured priors. Furthermore, ATD operates in a partially observable environment, where observation correlations stem from the inherent structure of the environment rather than from the actions themselves. In contrast, MAB problem modeling, even when incorporating correlations among actions, fails to capture these environment-driven dependencies.
- Active Target Discovery: Each sampling action potentially informs about its neighbors and influences the estimation of the global environment structure; spatial or semantic correlations are intrinsic. Feedback from one region impacts the belief about other unobserved regions, and efficient strategies build a coherent model of the global structure.
2. Objective: Novelty in Discovery vs. Cumulative Reward. The modeling objective of ATD also fundamentally differs from that of bandit formulations.
- Active Target Discovery: The sole aim is to strategically discover as many new targets as possible within a fixed sampling budget. Unlike in bandits, sampling the same location twice in ATD yields no additional value; novelty and strategic search-space coverage are critical. For example, in ATD (e.g., searching for rare tumors in an MRI), revisiting the same region after a tumor has been found offers no new benefit.
Nevertheless, despite the different objectives, one can still frame the ATD problem as an MAB formulation without repeated queries (e.g., as a noise-free MAB). We conducted several experiments modeling the ATD problem as a multi-armed bandit on the simplest MNIST-based scenario (Appendix T, Table 17), which demonstrate that under the ATD objective, the absence of explicit modeling of the environment’s structural semantics leads to failure in correctly discovering novel targets. We also compare the performance of UCB and DiffATD on the DOTA dataset (please see the table below). These additional results further reinforce the incapability of standard bandit-based approaches in the ATD setting.
Comparison with Multi-Armed Bandit (UCB) on DOTA
| Method | | | |
|---|---|---|---|
| UCB | 0.1132 | 0.2146 | 0.2487 |
| DiffATD | 0.5422 | 0.6379 | 0.7309 |
Nevertheless, it could be an interesting direction for future work to investigate how to explicitly incorporate environmental structural semantics and the discovery of new targets into the MAB framework.
Q2: The term in the sampling objective depends on the remaining budget. This seems undesirable, as often a budget is not clearly defined (e.g., search and rescue from the motivation). Often, we simply want to identify the target "as quickly as possible".
A2: This does not preclude anytime use. The budget-dependent mechanism can be generalized or made anytime by treating it as a monotonically decreasing function of elapsed time or queries made, rather than as a function of a strictly known budget. One can, for instance, design the schedule to smoothly balance exploration/exploitation based on the number of queries so far, regardless of an explicit stopping point. Moreover, our framework can flexibly accommodate fixed-budget, rolling-budget, or open-ended (anytime) objectives by appropriately designing or learning the schedule function. In our experiments, we use a simple function for simplicity, clarity of discussion, and domain alignment, but the scheme does not assume strict foreknowledge of a terminal budget. If, for example, the "budget" is unbounded but we wish to find the target as fast as possible, the schedule can be defined to sustain exploration longer, or to decay adaptively based on observed rates of discovery. To conclude, the schedule is a design choice and depends on the specific requirements of the problem at hand.
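As a small illustration of this flexibility, the sketch below contrasts a fixed-budget linear decay with a budget-free exponential decay (both function names and forms are our own illustrative choices, not the schedule used in the paper):

```python
import math

def budget_schedule(t, budget):
    """Exploration weight that decays linearly with the remaining budget."""
    return max(0.0, 1.0 - t / budget)

def anytime_schedule(t, half_life=50):
    """Budget-free alternative: the weight halves every `half_life` queries."""
    return math.exp(-math.log(2.0) * t / half_life)
```

Either function can be plugged into the same exploration-exploitation combination rule, so this choice is orthogonal to the rest of the method.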
Q3: Typos in line 190
A3: We apologize for the typo. We will fix the typos in the final version of the paper.
Q4: The paper does not ablate the choice of belief model experimentally. Is the choice of single-step diffusion important?
A4: Thank you for the question.
- The purpose of the belief model (e.g., a diffusion model) in DiffATD is to construct a dynamically updated, uncertainty-aware belief distribution over unobserved regions—critical for guiding the exploration-exploitation tradeoff in target discovery. Thus, it is important to use a belief model capable of capturing the hidden structure of the underlying search space.
- To validate this, following your feedback, we conducted an experiment with a different choice of belief model, trained on a different domain, and compared its performance with DiffATD, whose prior belief model is trained on the same domain. We present the result in the following table.
Effect of Prior Belief Model on Performance (on DOTA)
| Method | | |
|---|---|---|
| DiffATD with a weak prior model | 0.5143 | 0.6391 |
| DiffATD | 0.6379 | 0.7309 |
Our experimental outcomes indicate that the search performance drops significantly when the prior belief model is not capable of capturing the hidden structure of the underlying search space, highlighting the importance of a strong belief model (i.e., a pretrained diffusion model) for tackling active target discovery.
We would also like to highlight that we utilize Tweedie's formula to reconstruct the search space from a noisy image in a single step, which makes the inference process significantly faster and saves substantial computation time.
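For reference, Tweedie's formula in the standard DDPM parameterization (a well-known identity, written here in generic notation):

```latex
\hat{x}_0
  \;=\; \frac{x_t + (1-\bar{\alpha}_t)\,\nabla_{x_t}\log p_t(x_t)}{\sqrt{\bar{\alpha}_t}}
  \;=\; \frac{x_t - \sqrt{1-\bar{\alpha}_t}\;\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}},
```

so a single evaluation of the noise-prediction network yields the posterior-mean estimate of the clean search space from the current noisy latent, rather than running a full reverse chain for every candidate.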
Dear Reviewer,
Thank you for acknowledging our response. We truly hope it has resolved all of your concerns.
Dear authors,
Thank you for your detailed rebuttal. After clarifying some of my main concerns, I have increased my score to suggest acceptance. Unfortunately, so far, it seems that the authors did not address my concern about a lack of a theoretical analysis of the proposed method (in simpler settings). In my view, such an analysis could significantly strengthen the paper.
Dear Reviewer,
Thank you for your thoughtful response. In this work, our goal is to lay a solid foundation and develop a principled, interpretable, and practical framework for the novel Active Target Discovery (ATD) problem. We agree that exploring optimal sampling guarantees within the ATD problem setup is a promising and interesting direction in its own right, and we will certainly consider it as part of our future work.
Thank you for your quick response. I understand this decision, and I have updated my score to recommend acceptance.
To address the challenge of identifying targets of interest in partially observable environments while minimizing costly ground-truth feedback, this paper introduces Diffusion-guided Active Target Discovery (DiffATD). The approach leverages Tweedie’s formula to construct a belief distribution over unobserved regions. Using this distribution, the method employs an exploration-exploitation strategy: exploration is guided toward regions with high entropy, while exploitation targets regions with low entropy. Experiments are conducted to demonstrate the effectiveness of the proposed approach.
Strengths and Weaknesses
Strengths:
- The authors effectively articulate the motivation of their approach in the introduction section, particularly in comparison to standard reinforcement learning methods, emphasizing its advantage in reducing labeling costs.
- Figure 1 clearly illustrates the role of the exploration-exploitation strategy in efficiently identifying the target of interest.
- The proposed methodology is grounded in theoretical foundations and appears to offer a novel contribution.
- The approach may offer impact in domains such as medical imaging, species discovery, and remote sensing, where labeling is costly and limited.
- An ablation study is conducted to demonstrate the contribution of individual components—specifically, exploration and exploitation—to the overall performance.
Weaknesses:
- The baseline comparisons are limited. Aside from random search and internal ablations (exploration-only and exploitation-only), no competitive baselines are included. The authors should compare against stronger alternatives where applicable [1–3]. If not applicable, authors need to provide the reason for that.
- The results for a fully supervised approach such as FullSEG should be included for all 3 datasets (Overhead Imagery, Species, Skin Disease).
- Multi-armed bandit methods like Upper Confidence Bound (UCB) are not considered as baselines. Given their relevance, the authors should include such methods across all datasets to provide a more comprehensive comparison.
- Some notations are unnecessarily long and could benefit from simplification for readability. For instance, the symbols in Equation 7 could be renamed for clarity.
- Certain references appear to be missing. For example, the reference to the proof above Theorem 3 is marked with “??”.
- Figure captions lack sufficient detail. Specifically, Figure 2 would benefit from a more comprehensive explanation of the DiffATD framework, rather than the generic caption “An Overview of DiffATD.”
- The function of the parameter in Equation 9, which balances exploration and exploitation, is not visualized. Including a plot or illustration of this function would improve interpretability.
Questions
Please refer to Strengths and Weaknesses Section
Limitations
The authors do not seem to include the broader impacts and limitations of their work. Some of the impacts of this work could be in domains like medical imaging, ecological species discovery, and remote sensing. In terms of limitations, the proposed approach may have higher computational cost, which should be discussed in the paper.
Justification for Final Rating
The authors have addressed some of my concerns. While I acknowledge the paper presents some novel contributions, it still lacks a comprehensive evaluation considering all relevant baselines. Taking this into account, I am updating my score to a borderline accept.
Formatting Issues
NA
Thank you for your valuable feedback. We've addressed all your comments and concerns.
Q1: The authors should compare against stronger alternatives where applicable [1–3]. If not applicable, authors need to provide the reason for that.
A1: Thank you for your question. However, methods [1–3] are not applicable, for the following two key reasons (we refer to our discussion in Lines 79–84):
- (1) All these prior methods typically focus solely on optimizing for reconstruction, while our ultimate goal is identifying target-rich regions within a pre-specified budget. Success for our task hinges on balancing exploration (obtaining useful information about the scene) and exploitation (uncovering targets).
- (2) In the ATD problem setting, targets are not pre-defined, meaning there is no ground-truth labeling or even a pre-specified definition of the reward signal before discovery during inference, whereas the RL-based methods [1–3] rely on extensively pre-defined rewards and supervised training, making them not applicable for ATD.
Q2: The results for the fully supervised approach such as FullSEG should be included as part of the all 3 datasets (Overhead Imagery, Species, Skin Disease).
A2: It is a nice observation. In Table 5, we have included results with fully supervised approaches such as FullSEG for both the overhead imagery (DOTA) and skin disease datasets, as these tasks are derived from computer vision and support traditional segmentation-based evaluation. To further assess the generalizability of our method beyond vision-based domains, we also evaluated it on the species discovery task. This task is fundamentally different—it is formulated as active discovery over an unobserved spatial distribution and does not lend itself to segmentation-based approaches. As a result, fully supervised methods like FullSEG are not applicable in the species discovery setting. We appreciate your feedback and hope this clarifies our evaluation approach.
Q3: MAB methods like Upper Confidence Bound (UCB) are not considered as baselines. Given their relevance, the authors should include such methods across all datasets to provide a more comprehensive comparison.
A3: We appreciate the suggestion. Multi-armed bandit methods such as UCB are fundamentally incompatible with the ATD setting. ATD operates under a partially observable environment, its objective is to strategically cover a large, structured environment where observations are strongly correlated, under a strict budget. In contrast, MAB assumes independent arms with stationary reward distributions and focuses on maximizing cumulative reward, assumptions that do not hold in ATD.
Nevertheless, we included experiments with MAB-based baselines in our study by implementing both UCB and ε-greedy strategies. Their performance under the ATD setting is reported in Appendix T, Table 17. Furthermore, we conducted several additional experiments on the DOTA dataset; the results are shown in the table below. As shown in both tables, MAB methods such as UCB perform poorly in the ATD setting, confirming their incompatibility with this problem.
Comparison with UCB on DOTA
| Method | | | |
|---|---|---|---|
| UCB (Bandits) | 0.1132 | 0.2146 | 0.2487 |
| DiffATD | 0.5422 | 0.6379 | 0.7309 |
Q4: Some notations are unnecessarily long and could benefit from simplification for readability. For instance, the symbols in Equation 7 may be renamed for clarity.
A4: We appreciate your feedback. We will simplify these notations to help readers follow the content more easily.
Q5: Certain references appear to be missing. For example, the reference to the proof above Theorem 3 is marked with “??”.
A5: Thank you for the observation; we will correct this typo.
Q6: Figure captions lack sufficient detail. Specifically, Figure 2 would benefit from a more comprehensive explanation of the DiffATD framework, rather than the generic caption “An Overview of DiffATD.”
A6: Thank you for the suggestion! We will certainly add more details in the caption.
Q7: The function of the parameter in Equation 9, which balances exploration and exploitation, is not visualized. Including a plot or illustration of this function would improve interpretability.
A7: Thank you for highlighting this valuable point. The parameter serves as a scheduling function to balance exploration and exploitation, as described in Lines 200–204. We also analyze its sensitivity in Lines 271–280 and in Table 4. We fully agree that visualizing its behavior would greatly enhance interpretability. In response to your suggestion, we will include a dedicated figure in the revised manuscript to clearly illustrate the schedule function and help readers better understand its practical influence on the method’s performance.
Q8: The authors does not seem to include the Broader impacts and Limitations of their work. Some of the impacts of this work could be in domains like medical imaging, ecological species discovery, and remote sensing. In terms of the limitations, the proposed approach may have higher computational cost which should be discussed in this paper.
A8: Thank you for your suggestion. We recognize the importance of discussing both the broader societal impacts and the limitations of our work. In terms of broader impacts, we agree that DiffATD has implications beyond medical imaging, remote sensing, and species discovery, such as drug discovery and hazardous industrial pollution monitoring. We will include a dedicated section in the revised manuscript to articulate these broader implications more clearly. With regard to limitations, we fully agree that computational cost is an important consideration. To address this, we have measured the computational cost (both inference time and memory) across different scales of the search space and provided a detailed analysis in Appendix V, Table 18.
Dear Reviewer,
We appreciate your acknowledgment. We sincerely hope our response has fully resolved your concerns. We would be very happy to provide any further clarification if needed.
Dear Reviewer,
As we near the end of the discussion period, we kindly wish to confirm if our response has fully addressed your questions.
This paper introduces the Active Target Discovery (ATD) problem alongside a solution, DiffATD. It's a creative and well-motivated approach that cleverly uses diffusion models to guide an effective exploration-exploitation search strategy. This work shows promise for applications in fields like medical imaging and remote sensing, and the empirical results convincingly demonstrate its superiority over the tested baselines. The reviewers were ultimately persuaded of the paper's merits, particularly after a thorough and constructive rebuttal from the authors that adeptly clarified the work's contributions. Overall, this is a solid paper.