Training-Free Guidance Beyond Differentiability: Scalable Path Steering with Tree Search in Diffusion and Flow Models
Abstract
Reviews and Discussion
This paper proposes a family of methods for inference-time control of diffusion generative models with non-differentiable rewards. The method keeps a set of samples and filters or resamples the proposals generated from those samples at each diffusion step. The method is evaluated on a wide array of discrete generation tasks, including symbolic music, molecules, and DNA. Scaling experiments are performed to establish the impact of key hyperparameters.
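To make the control flow concrete for other readers, here is a minimal sketch of the branch-out-then-select loop as I understand it from the paper (the proposal distribution, reward, and all names are illustrative stand-ins, not the authors' implementation):

```python
import numpy as np

def tree_search_step(active_set, propose, reward, K):
    # Each of the A active samples branches into K proposals; the top-A
    # proposals under the (possibly non-differentiable) reward survive.
    A = len(active_set)
    candidates = [propose(x) for x in active_set for _ in range(K)]
    scores = np.array([reward(c) for c in candidates])
    top = np.argsort(scores)[-A:]          # ranking-based selection
    return [candidates[i] for i in top]

# Toy 1-D demo: proposals are Gaussian perturbations of the current
# sample, and the reward prefers values near 0. No diffusion model is
# involved; this only illustrates the branch-out-then-select control flow.
rng = np.random.default_rng(0)
active = list(rng.normal(5.0, 1.0, size=4))     # active set size A = 4
for _ in range(50):                              # 50 "denoising" steps
    active = tree_search_step(
        active,
        propose=lambda x: x + rng.normal(0.0, 0.3),
        reward=lambda c: -abs(c),
        K=4,
    )
```

Note that the reward is only ever queried on candidate samples, which is why non-differentiable objectives pose no problem for this scheme.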
Strengths and Weaknesses
Strengths The paper's key conceptual idea, that of maintaining a set of samples that is steered and resampled during inference via a reward function, is well worth articulating and presenting to the community. It mirrors important developments in continuous diffusion steering (Feynman-Kac steering) that demonstrate the benefits of maintaining a set of samples, a la SMC, rather than steering individual trajectories. The coverage and diversity of the experiments are noteworthy and show promising results.
Concerns While the core message and idea of the submission are sound and potentially impactful, there are concerns about the manner of presentation and execution of the method.
- Despite the appealing intuition of maintaining an active set, the main experiments are nonetheless conducted with active set size A = 1. With this departure, the exposition in the abstract, the main text, and the figures is misleading.
- The theorems in the continuous case suppose that the data are Gaussian and the reward function linear. These restrictions are so severe as to perhaps disqualify the theorem from being a main result. Certainly, it cannot be said in exposition that "TreeG-SC yields a distribution T that close approximates..." without qualification.
- There is not very much exploration about how to define the intermediate value functions, which seems critical to the performance of the method. Evaluating the terminal reward at samples x_1 | x_t obtained in one step might not work, as those samples could be far out of distribution of x_1.
- There appears to be a somewhat exaggerated difference between TreeG-SC and TreeG-SD, differing only in whether the variance of x_1 | x_t is accounted for or whether we use the conditional mean. In the discrete case, they appear identical.
- With ranking-based selection, the procedure reduces to an intuitive heuristic procedure. Unlike SVDD, we do not know what distribution the procedure will sample. This makes it difficult to reason about and fairly compare the performance with methods that do sample from a known distribution: the procedure could be optimizing the reward while causing diversity collapse or missing modes in hard-to-reach local minima.
- It seems that whenever the results in a column are not favorable to the authors, the results in that column are not bolded (Tables 1, 2, 3).
- The paper uses "scaling law" to refer to any monotonic decrease in loss - this is not accurate.
Altogether, the core idea of the paper is strong, but the experiments do not focus on all of the most important aspects of the idea and overemphasize points that are not really the paper's strengths (i.e., theoretical bounds). I do not anticipate that the authors will have time to dramatically reorganize the presentation of results during rebuttal; however, I am open to raising the score if the authors can argue why the procedure, as implemented in the main experiments, corresponds to what is claimed in the abstract and introduction.
Questions
No specific questions for authors.
Limitations
Yes
Final Justification
As I wrote in the review, I am open to raising the score if the authors can argue why the procedure, as implemented in the main experiments, corresponds to what is claimed in the abstract and introduction. The authors successfully addressed this.
Formatting Concerns
The authors appear to have modified the line spacing in lists and around headings.
Thank you for the detailed response.
Q1: Despite the appealing intuition of maintaining an active set, the main experiments are nonetheless conducted with active set size A = 1. With this departure, the exposition in the abstract, the main text, and the figures is misleading.
A1: We clarify that we have conducted experiments with A > 1 across all tasks (Sec. 6.3, Appx. G.1.2, G.2.2, G.3.2). In Sec. 6.2, we reported results with A = 1 to ensure fair comparisons with baselines, which do not have the concept of an active set. We clarify:
- Our method has two key mechanisms, branch-out-then-selection and maintaining an active set, corresponding to two scalable parameters: branch-out size K and active set size A. Increasing either improves the optimization effect. For example, the following table reports the Note Density loss (↓) for symbolic music across different A and K values (TreeG-SD):
| | A=1 | A=2 | A=4 | A=8 | A=16 |
|---|---|---|---|---|---|
| K=1 | 2.486±3.530 | 1.465±1.842 | 1.139±2.271 | 0.784±1.108 | 0.696±1.526 |
| K=2 | 0.629±0.827 | 0.231±0.472 | 0.113±0.317 | 0.057±0.179 | - |
| K=4 | 0.319±0.619 | 0.159±0.401 | 0.048±0.198 | - | - |
| K=8 | 0.217±0.450 | 0.082±0.266 | - | - | - |
| K=16 | 0.142±0.423 | - | - | - | - |
- We reported A = 1 results for fair comparison, as baselines lack the concept of an active set. This isolates the impact of our branch-out and selection mechanism. However, as shown below, allowing a flexible active set size (A > 1) further improves performance at the same computational cost. For instance, TreeG-SD with (A=4, K=4) achieves better performance than (A=1, K=16), which was the setting reported for comparison with baselines.
| Method | Loss (↓) | OA (↑) |
|---|---|---|
| TDS | 0.218 ± 0.241 | 0.875 ± 0.023 |
| SVDD | 0.445 ± 0.437 | 0.835 ± 0.028 |
| SCG | 0.134 ± 0.533 | 0.842 ± 0.022 |
| TreeG-SD (A=1, K=16) | 0.142 ± 0.423 | 0.832 ± 0.023 |
| TreeG-SD (A=4, K=4) | 0.048 ± 0.198 | 0.843 ± 0.012 |
Q2: The theorems in the continuous case suppose that the data are Gaussian and the reward function linear. These restrictions are so severe as to perhaps disqualify the theorem from being a main result. Certainly, it cannot be said in exposition that "TreeG-SC yields a distribution T that close approximates..." without qualification.
A2: We appreciate the comment and have revised the exposition to qualify our statement, now writing: “TreeG-SC(SD) yields a distribution that closely approximates the target conditional transition distribution, under the Gaussian data and linear objective assumptions in the continuous case.” We also emphasize that the empirical contribution is central to our work, with the theoretical results serving as complementary insights. We'd like to further clarify:
- On the necessity of assumptions: From a theoretical perspective, the linear-Gaussian case is not trivial but rather fundamental. The search mechanism inherently disrupts the reverse dynamics of the unconditional diffusion model, making it analytically intractable to rigorously characterize the output distribution or derive end-to-end sample complexity bounds without additional structure. Recent work (e.g.,[1] Marion et al., 2024) also relies on Gaussian assumptions to establish rigorous guarantees for guided diffusion. Therefore, this assumption should be considered reasonable.
- On broader applicability: While our theoretical analysis adopts Gaussian-linear assumptions to enable rigorous proofs, our experiments explicitly verify that the proposed approach performs well in substantially more complex settings, with non-Gaussian data distributions and nonlinear objectives, demonstrating that the method effectively generalizes beyond the assumptions used in the theoretical framework.
Q3: TreeG-SC and TreeG-SD differ only in whether the variance of x_1 | x_t is accounted for or whether we use the conditional mean. In the discrete case, they appear identical.
A3: TreeG-SC and TreeG-SD are fundamentally different methods in both discrete and continuous cases. Their differences are summarized below:
| Aspect | TreeG-SC | TreeG-SD |
|---|---|---|
| Exploration space | Next state | Destination |
| Use of lookahead estimate | Uses the diffusion model to obtain an estimate of x_1 for evaluating each candidate next state | Leverages the estimate of x_1 already available in the backward process to branch toward promising destinations |
| High-level idea | Optimize the immediate next step | Select promising final destinations first, then infer the next state accordingly |
| Complexity | See Sec. 6.4, Table 4 | See Sec. 6.4, Table 4 |
Although these operations may appear superficially similar, their exploration targets, use of conditional distribution, high-level strategies, and computational complexities are completely different.
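For concreteness, here is a toy 1-D sketch of how we understand the two branching schemes to differ; the shrinkage "model", noise scales, and step rule are stand-ins, not the paper's actual samplers:

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_x1(x_t):
    # Stand-in for the model's destination estimate E[x_1 | x_t];
    # here just a shrinkage map, purely for illustration.
    return 0.9 * x_t

def step_sc(x_t, reward, K, noise=0.3):
    # TreeG-SC (our reading): branch in next-STATE space. Sample K
    # candidate next states, score each via its own lookahead estimate
    # (K extra model calls), keep the best next state.
    cands = x_t + rng.normal(0.0, noise, size=K)
    scores = [reward(predict_x1(c)) for c in cands]
    return float(cands[int(np.argmax(scores))])

def step_sd(x_t, reward, K, noise=0.3):
    # TreeG-SD (our reading): branch in DESTINATION space. Sample K
    # candidate destinations around the estimate the backward process
    # already provides, pick the best, then move the state toward it.
    dests = predict_x1(x_t) + rng.normal(0.0, noise, size=K)
    best = dests[int(np.argmax([reward(d) for d in dests]))]
    return float(x_t + 0.5 * (best - x_t))
```

Both toy loops steer toward high reward, but SC spends extra model calls to evaluate candidate next states, while SD reuses destination estimates, which mirrors the different exploration spaces described in the table above.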
Q4: Evaluating the terminal reward at samples x_1 | x_t obtained in one step might not work, as those samples could be far out of distribution of x_1.
A4: Evaluating candidates via the one-step estimate of x_1 given x_t is both theoretically reasonable and empirically effective:
- First, the diffusion/flow model is explicitly trained to model the distribution of x_1 | x_t (its conditional mean in the continuous case, by Tweedie’s formula [2]). Assuming a perfect pre-trained diffusion model, using this estimate for evaluation is aligned with how the models are intended to function and is thus theoretically justified.
- Second, our empirical results demonstrate strong performance on our three proposed methods, indicating that evaluation using the one-step lookahead estimate is effective.
We appreciate the suggestion to explore alternative value function designs. One direction is multi-step lookahead value estimation. However, it would incur substantially higher computational cost due to the need for multiple inference steps through the diffusion model. In contrast, our proposed value functions offer a practical and efficient solution that achieves strong performance with significantly lower computational overhead.
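The Tweedie identity underlying this point can be checked in closed form in a Gaussian toy case; a minimal sketch (all numbers illustrative):

```python
import numpy as np

# Tweedie's formula in a Gaussian toy case: if x_1 ~ N(mu, tau^2) and
# x_t = x_1 + sigma * eps, then E[x_1 | x_t] = x_t + sigma^2 * score(x_t),
# where score is the gradient of log p(x_t). This is the sense in which
# the one-step lookahead x_1 | x_t is exactly what the model is trained
# to provide.
mu, tau, sigma = 2.0, 1.0, 0.5
x_t = 3.1

# Marginal p(x_t) = N(mu, tau^2 + sigma^2), so its score is analytic.
score = -(x_t - mu) / (tau**2 + sigma**2)
tweedie_mean = x_t + sigma**2 * score

# Closed-form Gaussian posterior mean, for comparison.
post_mean = (x_t / sigma**2 + mu / tau**2) / (1 / sigma**2 + 1 / tau**2)
assert abs(tweedie_mean - post_mean) < 1e-9
```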
Q5: With ranking based selection, the procedure could be optimizing the reward but resulting in diversity collapse or missing modes in hard to reach local minima.
A5: We have already included diversity as an explicit evaluation metric in our experiments.
- We use Tanimoto similarity for molecules (Table 2) and average Hamming distance for DNA (Table 3). For music, while there is no standard diversity metric, the quality metric (OA) reported in Table 1 can reflect diversity to some extent.
- Our results show that the proposed methods achieve diversity comparable to baselines, including SVDD, across all tasks. While ranking-based selection can be seen as an extreme case of resampling, empirical evidence suggests it does not lead to diversity collapse or mode dropping, likely because the practical reward signal is not overly sharp, avoiding collapse into narrow modes.
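For reference, the average pairwise Hamming distance used as the DNA diversity metric can be sketched as follows (our reading of the metric; the sequences and values are toy examples):

```python
from itertools import combinations

def avg_hamming(seqs):
    # Average pairwise Hamming distance over a batch of equal-length
    # sequences; positions are compared elementwise.
    pairs = list(combinations(seqs, 2))
    return sum(
        sum(a != b for a, b in zip(s, t)) for s, t in pairs
    ) / len(pairs)

seqs = ["ACGTACGT", "ACGTTCGT", "TCGTACGA"]
# d(1,2)=1, d(1,3)=2, d(2,3)=3  ->  mean = 2.0
assert avg_hamming(seqs) == 2.0
```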
Q6: It seems that whenever the results in a column are not favorable to the authors, the results in that column are not bolded (Tables 1, 2, 3).
A6: This is a misunderstanding of our presentation choice. We intentionally bold only the metrics that reflect the optimization effect, not the quality metrics, for the following reasons:
- (1) To avoid confusion and preserve the main point: Bolding quality metrics could obscure the focus of the results. For example, in Table 3, some baselines such as TFG-Flow and SVDD show very high diversity but almost no optimization effect. Highlighting these values would misleadingly give undue credit, despite their poor performance on the main objective.
- (2) Quality metrics only need to be at a comparable level: For quality metrics, the absolute values or rankings are less critical; it's only required that the proposed method remains comparable to the baselines. Since this information is already clear from the tables, additional bolding is unnecessary and could mislead readers about the study’s primary goal.
Q7: The paper uses "scaling law" to refer to any monotonic decrease in loss - this is not accurate.
A7: We use the term "scaling law" to describe the improvement in optimization performance (decrease in loss) with respect to the active set size A and branch-out size K.
Reference
[1] Pierre Marion et al. Implicit diffusion: Efficient optimization through stochastic sampling. arXiv preprint arXiv:2402.05468, 2024.
[2] Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011
Given your review, we believe there were major misunderstandings on our work. We sincerely hope that our responses can resolve the misunderstandings and address your concerns, and we would greatly appreciate it if you would like to re-evaluate our work given the responses.
Dear Reviewer nDiZ,
The authors have replied to your main concerns regarding their empirical evaluation, the theoretical assumptions and the presentation of their results, among others. The authors raised issues of potential misunderstandings by the reviewer, which I tend to agree with. Could you please have a look at their rebuttal and provide specific feedback on their responses?
Please note that simply acknowledging reading their rebuttal and/or only stating generic statements like "I will keep my score" without further explanation is not acceptable. Your detailed input is essential for a fair and informed decision.
Many thanks,
Your AC
Hello AC,
Yes, I have carefully read the authors' comments. I appreciate the clarification about TreeG-SC vs TreeG-SD and acknowledge an initial misunderstanding. However, I remain concerned that the overall pitch of the method as evolving and filtering an active set (Fig 1) is not borne out by the experiments, which largely focus on active set size A=1. Furthermore, the practice of not bolding columns whenever the results are unfavorable to the authors is at odds with community norms.
I appreciate the authors' detailed response. My major concern remains that despite the framing of the work, the experiments are largely with active set size A=1. The authors' response that experiments with A>1 occupy only 5 lines of main text and are otherwise deferred to the appendix supports my point. I would like to ask the authors to agree to significantly augment the revision with results with A>1.
The clarification of TreeG-SC and TreeG-SD is appreciated.
Q: It seems that whenever the results in a column are not favorable to the authors, the results in that column are not bolded (Tables 1, 2, 3).
A: This is a misunderstanding of our presentation choice. We intentionally bold only the metrics that reflect the optimization effect, not the quality metrics, for the following reasons: (1) To avoid confusion and preserve the main point: Bolding quality metrics could obscure the focus of the results. For example, in Table 3, some baselines such as TFG-Flow and SVDD show very high diversity but almost no optimization effect. Highlighting these values would misleadingly give undue credit, despite their poor performance on the main objective. (2) Quality metrics only need to be at a comparable level: For quality metrics, the absolute values or rankings are less critical; it's only required that the proposed method remains comparable to the baselines. Since this information is already clear from the tables, additional bolding is unnecessary and could mislead readers about the study’s primary goal.
I do not find any of this reasoning convincing or adhering to norms of the field. By this reasoning, any deterioration in generation quality could be justified.
Q: The paper uses "scaling law" to refer to any monotonic decrease in loss - this is not accurate.
A: We use the term "scaling law" to describe the improvement in optimization performance (decrease in loss) with respect to the active set size and branch-out size.
A scaling law is a quantitative relationship between some measure of training compute and loss, and needs to be formulated as such. A monotonic trend is not a scaling law.
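To illustrate the distinction: a scaling law is a fitted functional form, e.g. L(C) = a * C^(-b) between compute C and loss L, with estimated parameters, not merely a monotonic trend. A minimal sketch on synthetic (purely illustrative) data:

```python
import numpy as np

# Synthetic losses that follow an exact power law in compute C
# (here C could stand for branch-out size K or active set size A).
C = np.array([1, 2, 4, 8, 16], dtype=float)
L = 2.5 * C ** -0.8

# Fitting a line in log-log space recovers exponent and prefactor.
b, log_a = np.polyfit(np.log(C), np.log(L), 1)
assert abs(b - (-0.8)) < 1e-6
assert abs(np.exp(log_a) - 2.5) < 1e-6
```

Reporting the fitted exponent (with goodness of fit) is what turns a monotonic trend into a scaling-law claim.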
Thank you for your thoughtful feedback. We’re glad to have addressed many of your concerns, and we would like to respond to your remaining points regarding presentation.
My major concern remains that despite the framing of the work, the experiments are largely with active set size A=1.
The practice of not bolding columns did not adhere to norms of the field.
We appreciate these comments and have revised the manuscript accordingly, reporting the updated tables here. Specifically, we:
- Added an additional row in the tables reporting results for A > 1 with comparable inference compute.
- Bolded all metrics.
| Method | Pitch Histogram | | Note Density | |
|---|---|---|---|---|
| | Loss(↓) | OA(↑) | Loss(↓) | OA(↑) |
| No Guidance | 0.0180±0.0100 | 0.842±0.012 | 2.486±3.530 | 0.830±0.016 |
| Classifier | 0.0050±0.0040 | 0.855±0.020 | 0.698±0.587 | 0.861±0.025 |
| DPS | 0.0010±0.0020 | 0.849±0.018 | 1.261±2.340 | 0.667±0.113 |
| TDS | 0.0027±0.0055 | 0.845±0.017 | 0.218±0.241 | 0.875±0.023 |
| SCG | 0.0036±0.0057 | 0.862±0.008 | 0.134±0.533 | 0.842±0.022 |
| SVDD | 0.0085±0.0100 | 0.846±0.013 | 0.445±0.437 | 0.835±0.028 |
| TreeG-SD (A=1,K=16) | 0.0002±0.0003 | 0.860±0.016 | 0.142±0.423 | 0.832±0.023 |
| TreeG-SD (A=4,K=4) | 0.0002±0.0003 | 0.818±0.013 | 0.048±0.198 | 0.843±0.012 |
Table 1: Evaluation on Music Generation. The chord progression task is not included here due to its high computational cost; we did not have sufficient resources to conduct full experiments, but it will be complemented in the revised version.
| Method | QED | | SA | | DRD2 | | =1 | |
|---|---|---|---|---|---|---|---|---|
| | VAL(↑) | TS(↓) | VAL(↑) | TS(↓) | VAL(↑) | TS(↓) | MAE(↓) | TS(↓) |
| No guidance | 0.61±0.19 | 0.12±0.02 | 0.79±0.10 | 0.12±0.02 | 0.06±0.17 | 0.12±0.02 | 2.09±1.16 | 0.12±0.02 |
| DG | 0.62±0.20 | 0.12±0.02 | 0.80±0.10 | 0.12±0.02 | 0.17±0.32 | 0.12±0.02 | 0.11±0.33 | 0.14±0.03 |
| TFG-Flow | 0.61±0.20 | 0.12±0.02 | 0.79±0.10 | 0.12±0.02 | 0.09±0.22 | 0.12±0.02 | 0.28±0.65 | 0.13±0.02 |
| SVDD | 0.71±0.17 | 0.13±0.02 | 0.81±0.09 | 0.13±0.02 | 0.64±0.42 | 0.17±0.04 | 0.35±1.14 | 0.14±0.03 |
| TreeG-SD (A=1,K=200) | 0.77±0.14 | 0.12±0.02 | 0.86±0.10 | 0.16±0.04 | 0.45±0.41 | 0.14±0.03 | 0.11±0.38 | 0.12±0.02 |
| TreeG-SD (A=2,K=8) | 0.80±0.11 | 0.12±0.02 | 0.89±0.07 | 0.21±0.05 | 0.66±0.40 | 0.19±0.04 | 0.02±0.12 | 0.14±0.02 |
Table 2: Evaluation on Small Molecule Generation.
| Method (strength) | Class 1 | | | Class 2 | | | Class 3 | | |
|---|---|---|---|---|---|---|---|---|---|
| | Prob(↑) | NLL(↓) | Div(↑) | Prob(↑) | NLL(↓) | Div(↑) | Prob(↑) | NLL(↓) | Div(↑) |
| No guidance | 0.021±0.079 | 0.667±0.015 | 373 | 0.008±0.052 | 0.667±0.015 | 373 | 0.007±0.053 | 0.667±0.015 | 373 |
| TFG-Flow () | 0.054±0.129 | 0.699±0.005 | 375 | 0.012±0.076 | 0.696±0.005 | 375 | 0.004±0.029 | 0.693±0.005 | 375 |
| SVDD | 0.155±0.173 | 0.663±0.017 | 364 | 0.088±0.193 | 0.662±0.016 | 367 | 0.061±0.168 | 0.660±0.016 | 363 |
| DG () | 0.359±0.188 | 0.625±0.016 | 351 | 0.627±0.340 | 0.615±0.022 | 359 | 0.693±0.264 | 0.636±0.019 | 357 |
| DG () | 0.372±0.237 | 0.600±0.034 | 352 | 0.571±0.356 | 0.587±0.032 | 343 | 0.173±0.256 | 0.606±0.036 | 349 |
| DG () | 0.251±0.171 | 0.577±0.044 | 331 | 0.350±0.351 | 0.587±0.037 | 335 | 0.064±0.143 | 0.603±0.033 | 343 |
| TreeG-G () | 0.236±0.178 | 0.615±0.034 | 355 | 0.313±0.343 | 0.596±0.057 | 347 | 0.247±0.280 | 0.486±0.054 | 324 |
| TreeG-G () | 0.509±0.242 | 0.516±0.040 | 353 | 0.915±0.154 | 0.471±0.040 | 332 | 0.745±0.217 | 0.399±0.042 | 306 |
| TreeG-G (A=1,K=1) () | 0.560±0.258 | 0.518±0.042 | 358 | 0.951±0.110 | 0.454±0.035 | 331 | 0.894±0.136 | 0.415±0.043 | 321 |
| TreeG-G (A=2,K=2) () | 0.740±0.209 | 0.492±0.032 | 362 | 0.981±0.042 | 0.463±0.034 | 336 | 0.934±0.098 | 0.416±0.039 | 310 |
Table 3: Evaluation on Enhancer DNA Design.
Clarification on bolding: In our previous version, we did not bold certain quality metrics because they are not direct optimization targets. This practice is consistent with prior works [1,2], which also do not emphasize these metrics. Nevertheless, we have now bolded all reported metrics to meet standard presentation expectations.
Additionally, to better illustrate the impact of maintaining an active set, we will expand the experimental results in the main text (Sections 6.3 and 6.4) in the revision. This will include:
- Full figures showing the effect of scaling A across all tasks.
- Detailed ablation study analyzing the trade-off between A and K under fixed compute.
A scaling law is a quantitative relationship between some measure of training compute and loss, and needs to be formulated as such.
Thank you for pointing this out. We will revise the manuscript terminology, replacing “scaling law” with “scaling effect (of active set size A and branch-out size K)”.
Reference
[1] Nisonoff et al. Unlocking Guidance for Discrete State-Space Diffusion and Flow Models.
[2] Lin et al. TFG-Flow: Training-free Guidance in Multimodal Generative Flow.
We sincerely hope that our responses address your remaining concerns, and we would be grateful for any further feedback if needed.
Dear Reviewer,
As we approach the end of the discussion period, this is a gentle reminder that we have reported the revised presentation of our tables in accordance with your suggestions. Specifically, we have:
- Added and highlighted results for A > 1.
- Bolded all metrics.
These changes will be clearly reflected in the revised main text.
We sincerely hope that our responses have addressed your remaining concerns, and we would greatly appreciate it if you would consider raising your score in light of these revisions.
Thank you again for your time and valuable feedback. We look forward to your response.
Best regards,
Authors
Thanks for the prompt updates. I will raise the score.
Thank you for your positive feedback and for updating the score. We are sincerely grateful for your time and thoughtful input, which have been invaluable in strengthening our paper.
In this paper, the authors propose tree search methods for inference-time (black-box) guidance for both diffusion models and flow models (discrete cases). Two schemes are proposed for tree branch expansion and value function estimation. The authors further provide theoretical bounds on the difference between the algorithm-induced distribution and the target conditional transition distribution for both schemes.
Strengths and Weaknesses
Strengths
- This paper provides tree search methods for inference-time (black-box) guidance for both diffusion models and flow models (discrete cases). Although [Huang et al. 2024] proposed a method that selects the best sample at each time t for inference-time guidance, this paper generalizes it well to both diffusion models and flow models.
- This paper provides theoretical error bounds on the difference between the algorithm-induced distribution and the target conditional transition distribution for both cases. It directly reveals the relationship between the approximation error and the tree branch-out size K.
Weaknesses
- Theorem 2 for the discrete case is a bit weird. It seems that, given the same branch-out size K, the high-dimensional case can achieve a much smaller error than the low-dimensional case. Or equivalently, to achieve a fixed error ε, the high-dimensional case requires a smaller branch-out size K than the low-dimensional case.
- The assumptions of the error bound for the continuous case are too strong. The requirement that the data follow a Gaussian distribution and the objective function be linear is too restrictive. If we already knew this nice form, we would not need diffusion models; we could easily start from the Gaussian, and other techniques might get better results.
- In the black-box guidance area, the main challenge is usually query efficiency. For example, in drug discovery tasks, the number of wet-lab experiments or high-cost docking evaluations is the bottleneck. In targeted image generation tasks, the number of collected user preferences, or simply of LLM API calls for scoring, is the bottleneck. However, no comparison against baselines regarding query efficiency is evaluated in the current draft, which is not convincing enough.
- It seems that all the evaluations in the experiments are on simple tasks. In the drug discovery area, the Vina docking score is usually more challenging to optimize than QED and SA, which are prone to overfitting. However, evaluations on the Vina docking score are missing. In addition, no comparison is performed in the image domain. Targeted image generation tasks are high-dimensional, which makes value function estimation more challenging.
- Part of the math or symbols is not clear enough.
  - The symbols used for the flow models are not well explained. Is there a typo in the symbol on Line 117?
  - The distribution depends on the time t, but the error bound in the theorem does not show this explicitly.
- Some prior related works on the inference-time black-box guidance for diffusion models are not discussed. I list a few of them below.
[1] Tang et al. Tuning-free alignment of diffusion models with direct noise optimization. arXiv:2405.18881, 2024.
[2] Tan et al. Fast Direct: Query-Efficient Online Black-box Guidance for Diffusion-model Target Generation. ICLR 2025.
[3] Lyu et al. Covariance-adaptive sequential black-box optimization for diffusion targeted generation. arXiv:2406.00812, 2024.
Questions
Q1. Please provide more discussion and explanation about Theorem 2 for the discrete case and the assumptions of the error bound for the continuous case.
Q2. If possible, please provide comparisons against baselines regarding query efficiency.
Q3. If possible, please provide evaluations on the Vina Docking score and image generation tasks.
Limitations
yes
Final Justification
In this paper, the authors proposed tree search methods for inference-time (black-box) guidance for both diffusion models and flow models (discrete cases). Two schemes are proposed for tree branch expansion and value function estimation. The authors further provide theoretical bounds on the difference between the algorithm-induced distribution and the target conditional transition distribution for both schemes. The authors cleaned up the typos and addressed most of my concerns.
Formatting Concerns
NA
Thank you for your positive feedback on our work! We first present new experimental results, followed by responses to the remaining concerns.
We added new experiments on the Vina docking score (protein: 5ht1b). Our proposed TreeG-SC outperforms all gradient-free guidance baselines. (Since the docking score is non-differentiable, we excluded the gradient-based baseline DG.)
| Method | Docking Score(↑) | TS(↓) |
|---|---|---|
| No Guidance | 8.50±1.41 | 0.12±0.02 |
| TFG-Flow | 8.45±1.49 | 0.12±0.02 |
| SVDD | 9.36±1.38 | 0.12±0.02 |
| TreeG-SC | 9.58±1.28 | 0.12±0.02 |
Experiment details: We used QuickVina2-GPU-2.1 to evaluate binding affinity of generated molecules to protein 5ht1b following [1]. The original scores were renormalized as max(−DS,0). Experiments used 200 denoising steps on 100 generated sequences.
Due to time constraints, we only have these results for now, but will complement them with larger sample sizes, additional denoising steps, and more protein targets in future experiments.
Respond to your questions:
Q1: All the evaluations in the experiments are on simple tasks; high-dimensional tasks are lacking, and image generation is more challenging for value function estimation.
A1: We thank the reviewer for the comments but respectfully disagree with this assessment.
- Our music generation task operates on high-dimensional data: the raw piano-roll data has a dimensionality of (3×128×1024), and the latent space used for the diffusion model remains substantial at (4×16×128) [2], which is comparable in scale to typical image generation tasks.
- The three music generation tasks we evaluate are highly challenging, particularly the non-differentiable objectives of note density and chord progression, which are especially difficult to optimize. In particular, chord progression requires the time-consuming rule-based tool music21 [3] for evaluation. These tasks are sufficiently complex to demonstrate the effectiveness of our method.
Q2: In the black-box guidance area, usually, the main challenge is the query efficiency. ... no comparison against baselines regarding query efficiency is evaluated in the current draft, which is not convincing enough.
A2: We appreciate the reviewer’s comment. Our method is not limited to black-box optimization but addresses the broader problem of inference-time alignment for diffusion models, where query efficiency is only one of several factors affecting computation. As discussed in Sec. 6.4 (Table 4), the overall computation cost comprises:
- Passing through the diffusion model (which can dominate the total cost, e.g., in video generation),
- Evaluation of the objective function (query efficiency, as the reviewer noted),
- Backpropagation, when gradient-based methods and differentiable predictors are used.
Given these varying cost contributions across tasks, we focused on ensuring comparable overall inference time when comparing to baselines, rather than isolating query efficiency. We provide detailed statistics on the number of compute units per step per sample and the overall inference time of a batch, compared to the best training-free baselines across all tasks, as follows.
| Method | Chord Progression | | | Pitch Histogram/Note Density | | |
|---|---|---|---|---|---|---|
| | Diffusion queries per step | Objective queries per step | Time (s) | Diffusion queries per step | Objective queries per step | Time (s) |
| SCG | 17 | 16 | 7267 | 17 | 16 | 194 |
| TreeG-SD | 1 | 16 | 6660 | 1 | 32 | 203(PH),204(ND) |
Table: Computation Breakdown in Music Generation ( for TreeG-SD, sample size for SCG is ).
| | Diffusion step | CP Objective | PH Objective | ND Objective |
|---|---|---|---|---|
| Time (s) | 0.0202 | 0.2520 | 0.0095 | 0.0356 |
Table: Basic unit time in Music Generation.
| Method | Diffusion queries per step | Objective queries per step | Backpropagations per step | Time (s) |
|---|---|---|---|---|
| SVDD | 17 | 320 | 0 | 72 |
| TreeG-G | 1 | 20 | 1 | 10 |
Table: Computation Breakdown in DNA Enhancer Design ( for TreeG-G, sample size for SVDD is ).
| | Diffusion step | Objective | Backpropagation |
|---|---|---|---|
| Time (ms) | 0.087 | 0.021 | 0.036 |
Table: Basic unit time in DNA Enhancer Design.
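As a back-of-the-envelope check, the unit times and query counts in the two DNA enhancer tables above can be combined into a simple per-step cost model (this decomposition is our simplification; only the numbers are taken from the tables):

```python
# Per-step cost model implied by the breakdown tables:
#   time_per_step ~= n_diffusion * t_diffusion
#                  + n_objective * t_objective
#                  + n_backprop  * t_backprop
# Unit times (ms) and per-step query counts are copied from the
# DNA enhancer tables.
t_diff, t_obj, t_bp = 0.087, 0.021, 0.036

treeg_g = 1 * t_diff + 20 * t_obj + 1 * t_bp    # TreeG-G
svdd = 17 * t_diff + 320 * t_obj + 0 * t_bp     # SVDD

assert treeg_g < svdd  # per-step model cost favors TreeG-G
```

The resulting per-step gap points in the same direction as the reported batch times (10 s vs 72 s); other overheads are ignored here.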
| Method | Vina Docking (Protein: 5ht1b) | | | | QED | | | |
|---|---|---|---|---|---|---|---|---|
| | Diffusion queries per step | Objective queries per step | Denoising Steps | Time (s) | Diffusion queries per step | Objective queries per step | Denoising Steps | Time (s) |
| SVDD | 9 | 3.51 | 200 | 26924 | 5 | 40 | 1000 | 17 |
| TreeG-SC | 9 | 3.51 | 200 | 26689 | 5 | 40 | 1000 | 17 |
Table: Computation Breakdown in Small Molecule Generation (Branch-out/sample sizes for TreeG-SC and SVDD are for Vina Docking and for QED. Since docking scores could not be calculated on invalid SMILES strings, the average number of objective queries over all steps was fewer than the expected value ).
| | Diffusion step | QED Objective | Vina Docking Objective |
|---|---|---|---|
| Time (ms) | 0.038 | 2.0e-4 | 390 |
Table: Basic unit time in Small Molecule Generation.
Q3: In Theorem 2, it seems that given the same Branch-Out size, the high-dimensional case can achieve a much smaller error compared with the low-dimensional case.
A3: Thank you for pointing this out. This is a typo in Theorem 2. The mistake originated from the last step of the proof (line 763): the error should accumulate additively across dimensions, with the single-dimension (marginal) error summed over dimensions, rather than multiplicatively as previously (incorrectly) written. We have corrected this in the revised version; the updated bound now properly reflects that the error grows with the dimensionality of the data.
Q4: The assumptions of the error bound for the continuous case are too strong. The requirement of the data follows a Gaussian distribution, and the objective function is linear is too restrictive.
A4:
- On the necessity of assumptions: From a theoretical perspective, the linear-Gaussian case is not trivial but rather fundamental. The search mechanism inherently disrupts the reverse dynamics of the unconditional diffusion model, making it analytically intractable to rigorously characterize the output distribution or derive end-to-end sample complexity bounds without additional structure. Recent work (e.g.,[4] Marion et al., 2024) also relies on Gaussian assumptions to establish rigorous guarantees for guided diffusion. Therefore, this assumption should be considered reasonable.
- On broader applicability: While our theoretical analysis adopts Gaussian-linear assumptions to enable rigorous proofs, our experiments explicitly verify that the proposed approach performs well in substantially more complex settings, with non-Gaussian data distributions and nonlinear objectives, demonstrating that the method effectively generalizes beyond the assumptions used in the theoretical framework.
Q5: Symbols for the flow models are not well explained. Is it a typo for the symbol on Line 117?
A5: Thank you for pointing this out. Yes, it's a typo. The correct expression is:
We have corrected this in the revised manuscript.
Q7: The distribution depends on the time t, but the error bound in the Theorem does not show this explicitly.
A7: Regarding the timestep $t$, Theorem 1 depends on the interval (or the scheduling), which is explicitly stated in the theorem. Other relevant terms are time-dependent constants, which are fully detailed in the extended versions of Theorems 1 and 2 (Appx. D.1 and D.3).
Q8: Some prior related works on the inference-time black-box guidance for diffusion models are not discussed.
A8: Thank you for bringing this to our attention. We discuss the related works as follows:
[5] uses an ODE/SDE solver to backpropagate through the backward process and the objective gradient directly to the initial latent state. [6] builds on the Guided Noise Sequence Optimization technique, using Gaussian Process estimation to guide diffusion models for black-box optimization. [7] fine-tunes models using covariance-adaptive sequential black-box optimization techniques. While [5–7] are applicable to non-differentiable objectives, they are not applicable to discrete diffusion/flow models.
We appreciate this feedback and have added a discussion of these works in the revised version.
Reference
[1] Yang et al. Hit and lead discovery with explorative RL and fragment-based molecule generation.
[2] Huang et al. Symbolic music generation with non-differentiable rule guided diffusion. arXiv:2402.14285, 2024.
[3] music21: A toolkit for computer-aided musicology and symbolic music data. 2010.
[4] Marion et al. Implicit diffusion: Efficient optimization through stochastic sampling. arXiv:2402.05468, 2024.
[5] Tang et al. Tuning-free alignment of diffusion models with direct noise optimization. arXiv:2405.18881, 2024.
[6] Tan et al. Fast Direct: Query-Efficient Online Black-box Guidance for Diffusion-model Target Generation. ICLR 2025.
[7] Lyu et al. Covariance-adaptive sequential black-box optimization for diffusion targeted generation. arXiv:2406.00812, 2024.
We sincerely hope that our responses can address your concerns, and we would greatly appreciate it if you would consider raising your score of our work to a clear accept given the responses.
Dear Reviewer cSpQ,
Thank you very much for your comprehensive feedback. I'm sure the authors appreciated it. They have replied to your main concerns regarding, e.g., the theory, query efficiency, and evaluation. Could you please have a look at their rebuttal and provide specific feedback on their responses? Please note that simply acknowledging reading their rebuttal and/or only stating generic statements like "I will keep my score" without further explanation is not acceptable. Your detailed input is essential for a fair and informed decision.
Many thanks,
Your AC
Thanks for the authors' detailed responses. Most of my concerns have been well-addressed.
We appreciate your positive feedback and are glad our responses addressed your questions. Your constructive comments have been invaluable in improving the paper.
This paper presents a training-free diffusion guidance algorithm applicable to general objectives with non-differentiable rewards and discrete data distributions. The proposed algorithm (TreeG) is based on a tree search along the diffusion sampling trajectory. The authors first state a generic tree-search framework (Section 4), and instantiate TreeG with different design choices (Section 5). The TreeG algorithm is proved to approximate the true posterior distribution closely with resampling. Experimental results on generating symbolic music, small molecules, and enhancer DNA sequences show a clear advantage of TreeG over existing guidance methods.
Strengths and Weaknesses
Strength:
- This paper provides a generic search-based framework for guiding score-based models for general objectives and using both continuous and discrete diffusion models.
- The discussion that connects existing methods to TreeG by showing them as special cases of TreeG is inspiring.
- The performance of the TreeG algorithm is rigorously guaranteed.
- Experimental results show the advantage of TreeG compared to existing methods. The experiments encompass different tasks in different domains. The scaling ability of TreeG also looks promising.
Weakness:
- Despite being training-free, the computational cost of the tree search-based algorithm seems to be high. A comparison or discussion on computational efficiency of this inference-time method compared to fine-tuning methods (e.g., RTB[1]) is missing.
- While TreeG generates samples with better targeted objectives in the experiments, the distributional similarity is sometimes worse (e.g., when increasing the guidance strength in DNA design tasks). It is unclear whether the generated samples satisfy the conditional distribution by only looking at the reported metrics.
[1] Venkatraman et al. Amortizing intractable inference in diffusion models for vision, language, and control. NeurIPS 2024.
Questions
- Could the authors also comment on the scalability of TreeG in terms of the dimensionality of the data distribution? Also, is there any theoretical insight on how tree search-based algorithms perform when the objective/reward functions are sparse in the search space, i.e., when most of the samples have low rewards?
Limitations
Yes, the authors address limitations in Appendix B.
Final Justification
My original concerns on this paper are resolved, and I have updated my assessment of this paper accordingly.
Formatting Concerns
None
Thank you for your positive feedback on our work! We address your questions below.
Q1: Could the authors also comment on the scalability of TreeG in terms of the dimensionality of the data distribution?
A1: Thank you for the question. We provide theoretical insights: for higher-dimensional data, a larger branch-out size is required. For example, for TreeG-SC, Theorem 1 gives
here $|\mathcal{X}|$ is the cardinality of the discrete data space. Since $|\mathcal{X}|$ typically grows exponentially with dimensionality, a correspondingly larger branch-out size $K$ is required to achieve the same optimization tolerance in higher-dimensional spaces. A similar conclusion holds for TreeG-SD.
Q2: Is there any theoretical insight on how tree search-based algorithms perform when the objective/reward functions are sparse in the search space, i.e., when most of the samples have low rewards?
A2: In our proof of Theorem 1 (line 682 in Appx. D.1), we show that the branch-out size $K$ depends on the ratio between the maximum and minimum of the objective function:
$$K = \Theta\left(\frac{B^2_{\max}}{B^2_{\min}} \cdot \frac{\log (|\mathcal{X}|/\delta)}{\varepsilon^{2}}\right),$$
where $B_{\max} = \sup_x \exp(f_y(x))$ and $B_{\min} = \inf_x \exp(f_y(x))$. When the objective function is sparse (i.e., most samples have very low rewards), the ratio $B_{\max} / B_{\min}$ becomes large. Consequently, the branch-out size must also be larger to adequately explore the space and locate high-reward regions.
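As a quick numerical illustration of how sparsity inflates this bound, the sketch below evaluates the $\Theta(\cdot)$ expression up to its hidden constant. All concrete numbers (sequence length, tolerance, reward bounds) are made up for illustration; the theorem fixes $K$ only up to a constant factor.

```python
import math

def branch_out_size(b_max, b_min, card_x, delta=0.05, eps=0.1, c=1.0):
    """Evaluate the Theta(.) branch-out bound up to its constant c.

    b_max / b_min: sup / inf of exp(f_y(x)) over the data space;
    card_x: cardinality |X| of the discrete data space.
    Illustrative only -- not the paper's implementation.
    """
    return c * (b_max / b_min) ** 2 * math.log(card_x / delta) / eps ** 2

# |X| = 4^L for a length-L DNA sequence over {A, C, G, T}:
dense = branch_out_size(b_max=2.0, b_min=1.0, card_x=4 ** 200)
sparse = branch_out_size(b_max=2.0, b_min=1e-3, card_x=4 ** 200)  # sparse reward

# The required branch-out size scales as (B_max / B_min)^2:
assert math.isclose(sparse / dense, (1.0 / 1e-3) ** 2, rel_tol=1e-9)
```

The quadratic dependence on $B_{\max}/B_{\min}$ is what makes sparse-reward search expensive: shrinking the minimum reward by three orders of magnitude multiplies the required branch-out size by a factor of a million.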
Q3: Despite being training-free, the computational cost of the tree search-based algorithm seems to be high. A comparison or discussion on computational efficiency of this inference-time method compared to fine-tuning methods (e.g., RTB[1]) is missing.
A3: Thank you for the comment. As discussed in Sec. 6.4 (Table 4), the computational cost of our method primarily comes from querying the pre-trained diffusion model and evaluating the objective function, both of which are also required by fine-tuning methods. However, unlike our inference-time approach, fine-tuning methods (including RTB [1]) incur substantially higher cost due to additional training steps and many more model queries. Thus, our TreeG remains significantly more efficient than training-based approaches.
We thank the reviewer for pointing this out and have clarified this comparison with fine-tuning methods in the revised version.
Q4: While TreeG generate samples with better targeted objectives in the experiments, the distributional similarity is sometimes worse (e.g., when increasing the guidance strength in DNA design tasks). It is unclear whether the generated samples satisfy the conditional distribution by only looking at the reported metrics.
A4: We appreciate the valuable feedback. To better evaluate how well the generated samples satisfy the conditional distribution $p(x \mid y)$, we have introduced the Negative Log-Likelihood (NLL) as an additional metric.
From Bayes’ theorem, $p(x \mid y) \propto p(x)\, p(y \mid x)$, where the two terms can be measured as follows:
- Measuring $p(x)$: This captures the naturalness of samples. We use the Negative Log-Likelihood (NLL), a widely adopted metric in discrete diffusion and flow-based generation [2,3]. The likelihood is computed using the ELBO of the pretrained (discrete) diffusion model.
- Measuring $p(y \mid x)$: This reflects the optimization effectiveness. We use the Target Class Probability, as provided by the oracle classifier, which we already report.
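The two-term evaluation above can be sketched as follows. This is a minimal illustration only: `elbo_bits_per_dim` and `oracle_class_prob` are hypothetical stand-ins for the pretrained discrete diffusion model's ELBO and the oracle classifier, and the toy numbers are not from our experiments.

```python
def evaluate_sample(x, y, elbo_bits_per_dim, oracle_class_prob):
    """Score one generated sample under p(x|y) ∝ p(x) · p(y|x)."""
    nll = elbo_bits_per_dim(x)       # naturalness: estimate of -log2 p(x) per dim
    prob = oracle_class_prob(x, y)   # optimization effectiveness: p(y | x)
    return {"nll_bits_per_dim": nll, "target_class_prob": prob}

# Toy stand-ins for demonstration only (constant scorers, fabricated values):
report = evaluate_sample("ACGT" * 50, y=1,
                         elbo_bits_per_dim=lambda x: 0.52,
                         oracle_class_prob=lambda x, y: 0.91)
assert report["nll_bits_per_dim"] == 0.52
assert report["target_class_prob"] == 0.91
```

A guided sampler should improve `target_class_prob` without letting `nll_bits_per_dim` drift far above the unguided model's value.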
Results based on these complete metrics are shown in the table below.
| Method (strength) | Class 1 Prob(↑) | Class 1 NLL (bits/dim, ↓) | Class 2 Prob(↑) | Class 2 NLL(↓) | Class 3 Prob(↑) | Class 3 NLL(↓) |
|---|---|---|---|---|---|---|
| No guidance | 0.021±0.079 | 0.667±0.015 | 0.008±0.052 | 0.667±0.015 | 0.007±0.053 | 0.667±0.015 |
| DG () | 0.359±0.188 | 0.625±0.016 | 0.627±0.340 | 0.615±0.022 | 0.693±0.264 | 0.636±0.019 |
| DG () | 0.372±0.237 | 0.600±0.034 | 0.571±0.356 | 0.587±0.032 | 0.173±0.256 | 0.606±0.036 |
| DG () | 0.251±0.171 | 0.577±0.044 | 0.350±0.351 | 0.587±0.037 | 0.064±0.143 | 0.603±0.033 |
| TreeG-G () | 0.236±0.178 | 0.615±0.034 | 0.313±0.343 | 0.596±0.057 | 0.247±0.280 | 0.486±0.054 |
| TreeG-G () | 0.509±0.242 | 0.516±0.040 | 0.915±0.154 | 0.471±0.040 | 0.745±0.217 | 0.399±0.042 |
| TreeG-G () | 0.560±0.258 | 0.518±0.042 | 0.951±0.110 | 0.454±0.035 | 0.894±0.136 | 0.415±0.043 |
As for the previously reported Fréchet Biological Distance (FBD), it measures the Gaussian distance between embeddings of generated DNA sequences and those of training data within the target class. Upon reflection, we find this metric problematic: it implicitly treats the training data within the target class as the sole “ground truth,” discouraging the generation of novel yet valid DNA sequences that maintain naturalness but deviate from the training samples within the target class. For instance, a newly generated sample belonging to class 1 may legitimately appear closer to samples from class 2, yet FBD would penalize it.
This limitation unnecessarily penalizes valid solutions. Therefore, we omit FBD in our updated evaluation, as Target Class Probability and NLL, together with the reported diversity, already provide a more comprehensive and principled assessment.
Reference
[1] Venkatraman et al. Amortizing intractable inference in diffusion models for vision, language, and control. NeurIPS 2024.
[2] Austin et al. Structured Denoising Diffusion Models in Discrete State-Spaces.
[3] Uehara et al. Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design.
We sincerely hope that our responses can address your concerns, and we would greatly appreciate it if you would consider raising your score of our work to a clear accept given the responses.
Dear Reviewer nJDF,
The authors have replied to your main concerns regarding the computational cost/scalability and the evaluation of the results. Could you please have a look at their rebuttal and provide specific feedback on their responses? Please note that simply acknowledging reading their rebuttal and/or only stating generic statements like "I will keep my score" without further explanation is not acceptable. Your detailed input is essential for a fair and informed decision.
Many thanks,
Your AC
I would like to thank the authors for the informative response. I appreciate the authors providing additional theoretical insights on the scalability of the method. I also agree with the authors on the claim that the distributional metric FBD is not a good metric for measuring naturalness, and the new NLL metric is satisfactory. I will raise my score to 5.
We sincerely appreciate your recognition of our work, both experimentally and theoretically, and thank you for increasing the score. Your constructive feedback throughout the review process has been greatly helpful in strengthening the paper.
TreeG reframes training-free guidance as a tree search over inference paths, enabling diffusion and flow models to optimize non-differentiable or discrete objectives without extra training. Three instantiations (TreeG-SC, TreeG-SD, and TreeG-G) are theoretically supported and outperform prior guidance methods on symbolic music, molecule, and DNA generation, while exhibiting a clear compute-versus-performance scaling law. In summary, TreeG is a well-motivated step toward gradient-free control of diffusion and flow models, offering both provable guarantees and tangible empirical gains. Broader domain validation, ablations, and alignment between theory and implementation would be beneficial before claiming universal guidance capability.
Strengths and Weaknesses
Strengths
-
Consistent empirical wins across three distinct domains of application.
-
Sensible supporting empirical analysis.
-
Positive practical implications across widely relevant application domains
Weaknesses
- Domain coverage: only three synthetic/benchmark tasks, with no argument given for the selection criteria for the target tasks.
Questions
Could the authors elaborate on the:
-
Selection criteria for the target tasks? What is the argument for universality over this specific set?
-
Theoretical assumptions (Gaussian data, linear objectives) may limit generality. Can you make more explicit the assumptions in which the model should perform?
-
Possible reproducibility and transferability gap across tasks due to many hyper-parameters (e.g., Monte-Carlo N, temperature α) tuned per task without ablation. Can you comment on this?
Limitations
Yes
Final Justification
The authors responded to the main questions. The scores remained the same (overall positive), as these were clarification questions.
Formatting Concerns
No
Thank you for your positive feedback on our work! We address your questions below.
Q1: Selection criteria for the target tasks? What is the argument for universality over this specific set?
A1: We selected tasks that pose non-differentiability challenges, which our methods are designed to address. Specifically, this includes continuous diffusion models with non-differentiable objectives (e.g., symbolic music generation) and discrete diffusion models (e.g., molecular design and enhancer DNA design). These tasks were chosen because they represent diverse yet common instances where nondifferentiability arises, demonstrating the broad applicability of our approach.
Q2: Theoretical assumptions (Gaussian data, linear objectives) may limit generality.
A2:
- On the necessity of assumptions: From a theoretical perspective, the linear-Gaussian case is not trivial but rather fundamental. The search mechanism inherently disrupts the reverse dynamics of the unconditional diffusion model, making it analytically intractable to rigorously characterize the output distribution or derive end-to-end sample complexity bounds without additional structure. Recent work (e.g., [1] Marion et al., 2024) also relies on Gaussian assumptions to establish rigorous guarantees for guided diffusion. Therefore, this assumption should be considered reasonable.
- On broader applicability: While our theoretical analysis adopts Gaussian-linear assumptions to enable rigorous proofs, our experiments explicitly verify that the proposed approach performs well in substantially more complex settings, with non-Gaussian data distributions and nonlinear objectives, demonstrating that the method effectively generalizes beyond the assumptions used in the theoretical framework.
[1] Pierre Marion et al. Implicit diffusion: Efficient optimization through stochastic sampling. arXiv preprint arXiv:2402.05468, 2024.
Q3: Possible reproducibility and transferability gap across tasks due to many hyper-parameters (e.g., Monte-Carlo N, temperature α) tuned per task without ablation.
A3: We would like to emphasize that we have already conducted ablation studies on hyperparameters such as the Monte-Carlo sample size and the resampling temperature, as reported in Appendix H (Tables 17 and 18). These provide transferable insights across tasks:
- Monte-Carlo sample size $N$: Increasing $N$ improves performance up to a point, after which further increases no longer yield significant gains. Thus, our experience is to select as large an $N$ as the available computational budget allows.
- Resampling temperature: Our ablation shows that ranking-based selection (temperature = 0) consistently outperforms resampling. Therefore, in our main experiments, we adopt ranking, which also eliminates the need to tune this hyperparameter.
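The distinction between the two selection rules can be sketched as follows. This is a minimal illustration under our own naming, not the paper's implementation: at temperature 0 the rule reduces to deterministic top-$k$ ranking; at positive temperature it resamples candidates with probabilities proportional to exponentiated objective values.

```python
import numpy as np

def select(values, k, temperature=0.0, rng=None):
    """Pick k of the branched candidates by objective value.

    temperature == 0 is deterministic ranking (keep the k best);
    temperature > 0 resamples with probs ∝ exp(value / temperature).
    Illustrative sketch only.
    """
    values = np.asarray(values, dtype=float)
    if temperature == 0.0:
        return np.argsort(values)[::-1][:k]          # ranking: indices of the k best
    rng = rng or np.random.default_rng(0)
    logits = (values - values.max()) / temperature   # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(values), size=k, replace=True, p=probs)

vals = [0.1, 0.9, 0.4, 0.7]
assert sorted(select(vals, k=2).tolist()) == [1, 3]   # ranking keeps the two best
assert len(select(vals, k=3, temperature=1.0)) == 3   # resampling draws 3 indices
```

Ranking removes both the stochasticity and the extra hyperparameter, which is consistent with our ablation finding that temperature = 0 performs best.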
We sincerely hope that our responses can address your concerns, and we would greatly appreciate it if you would consider raising your score of our work to a clear accept given the responses.
Dear Reviewer 5QYR,
The authors have replied to your main concern regarding domain coverage and clarity on the selected tasks. Could you please have a look at their rebuttal and provide specific feedback on their responses? Please note that simply acknowledging reading their rebuttal and/or only stating generic statements like "I will keep my score" without further explanation is not acceptable. Your detailed input is essential for a fair and informed decision.
Many thanks,
Your AC
Thank you for the responses. The questions were fully addressed.
We appreciate your positive feedback and are glad our responses addressed your questions. Your constructive comments have been invaluable in improving the paper.
This paper introduces TreeG, a tree search–based framework for training-free guidance in diffusion and flow models that can handle non-differentiable objectives and discrete data. The framework is instantiated into three algorithms, supported by theoretical analysis under simplifying assumptions, and evaluated across symbolic music, molecule, and DNA design tasks. Additional experiments, including results on Vina Docking, further demonstrate its empirical strength.
The main strengths are the novelty of the unified framework, consistent performance gains across diverse domains, and the added clarity provided through new evaluation metrics such as Negative Log-Likelihood. The rebuttal and discussion resolved initial concerns regarding task selection, theoretical assumptions, computational cost, and presentation of results. Reviewers acknowledged these clarifications and raised their assessments accordingly.
Overall, the reviewers converged to a positive consensus, finding the contribution technically solid and broadly relevant. I recommend acceptance.