Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-based Decoding
Abstract
Reviews and Discussion
The authors propose a novel algorithm for sampling from pretrained diffusion models while optimising for a given downstream reward function. The proposed method relies on first training a value function, which assigns values not only to final samples but also to noise samples. It then uses BoN sampling at each diffusion step, picking the sample with the highest value under the learned value function. The authors claim that their method significantly improves on relevant image generation and molecule synthesis benchmarks.
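In pseudocode, the described procedure is roughly the following (a minimal sketch with placeholder `prior_step` and `value_fn` callables; the paper's actual algorithm may differ, e.g., by using soft resampling instead of a hard argmax):

```python
import torch

def svdd_like_sampling(prior_step, value_fn, x_T, num_steps, M):
    """Per-step best-of-M selection guided by a (soft) value function.

    prior_step(x, t) draws one denoising step from the pretrained model and
    value_fn(x, t) returns a scalar score for an intermediate state; both are
    placeholders, not the authors' implementation.
    """
    x = x_T
    for t in reversed(range(num_steps)):
        candidates = [prior_step(x, t) for _ in range(M)]            # M proposals from the prior
        values = torch.tensor([float(value_fn(c, t)) for c in candidates])
        x = candidates[int(values.argmax())]                         # keep the highest-value state
    return x
```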
Strengths and Weaknesses
Strengths:
- developing significantly better optimisation methods for broadly applicable models (like pretrained diffusion models) is of broad utility
- the proposed method is simple and easy to understand
- the paper writing is clear
Weaknesses:
- "eliminates the need to construct differentiable models" -- while this is true, the proposed method itself relies on training an additional value function. Why is this much better? e.g. using derivatives of proxy models should be more sample efficient?
- how much more compute intensive is the BoN sampling? Could we e.g. annotate this for each row in Table 1?
- Do the Table 1 results somehow account for reward hacking? Shouldn't we also evaluate the distance to the original model, to see which methods achieve the highest reward while having the smallest distance to the original model?
- no limitations addressed
- Why does SVDD-R often outperform SVDD? Shouldn't it be similar? (In Table 1)
- Comparison to prior work does not seem sufficient, given that the paper's case rests largely on improved empirical performance. E.g., reward hacking is not addressed in Table 1 (and not discussed?), and the comparisons in Figure 3 are only with the pretrained model and do not take other baselines into account (apart from the BoN baseline in the appendix?)
Minor things/ Comments:
- "naturalness of these design spaces" -- would be great if this could be expressed a bit more formally (if possible)
Questions
See weaknesses
Limitations
no
Final Justification
The authors largely addressed my concerns.
Formatting Issues
Formatting seems good
We sincerely appreciate the reviewer’s time and thoughtful feedback. The reviewer primarily raised two key concerns: (1) why we don't use differentiable proxy models, and (2) the potential risk of reward hacking.
Regarding point (1), we note that (a) constructing accurate differentiable proxies is often non-trivial in many practical scenarios, and (b) even in such cases, our method demonstrates greater reward optimization performance compared to gradient-based methods such as DPS. Furthermore, as shown in Algorithm 6 (Appendix H), our method (SVDD) can be seamlessly integrated with gradient-based techniques when suitable proxies are available.
Regarding point (2), we provide multiple pieces of empirical evidence in the paper indicating that our method is robust to reward hacking (the likelihood and CLIP scores already shown in Section 6, and additional evidence with a LLaVA model).
We would be happy to offer further clarification if needed.
W1: The proposed method itself relies on training an additional value function.
Importantly, our method proposes a posterior mean approximation (Section 4.3), which avoids learning entirely, and we use this approach as the default in our experimental section, Section 6. In contrast, a learned value function is merely an alternative ablation we study in Appendix C and Appendix H.2, and not the primary proposal of this paper.
- More specifically, our posterior mean estimation approach approximates the expected reward from a noisy input using a single forward pass of the pre-trained diffusion model—an idea inspired by DPS and universal guidance (a minimal sketch follows below). This method avoids the sample inefficiency of learned value functions while remaining effective empirically, as demonstrated in Section 6.
- We further show that this training-free approach, when combined with SVDD, performs comparably to the Monte Carlo variant that requires a learned value function (Appendix C.3, Table 2). Additionally, we show that the posterior mean approximation itself serves as a strong value function estimator (Appendix C.3, Figure 6, and Section H.2), achieving competitive performance in practice.
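For illustration, a minimal sketch of such a posterior-mean value estimate, assuming a DDPM-style epsilon-prediction parameterization (all names are placeholders; the exact parameterization in the paper may differ):

```python
import torch

def posterior_mean_value(x_t, t, eps_model, alpha_bar, reward_fn):
    """Training-free value estimate via the posterior mean (Tweedie estimate).

    eps_model, alpha_bar (cumulative noise schedule), and reward_fn are
    placeholders; one denoiser forward pass yields x0_hat, and the reward of
    x0_hat is used as a proxy value for the noisy state x_t.
    """
    eps = eps_model(x_t, t)                                          # one forward pass
    a_bar = alpha_bar[t]
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(a_bar)
    return reward_fn(x0_hat)                                         # r(E[x_0 | x_t]) as value proxy
```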
W1: Why is SVDD better than Using Differentiable Proxy Models?
Thank you for raising this critical point. Here is our response.
While we acknowledge that in some scenarios it is feasible to construct accurate differentiable proxy models—and doing so may lead to sample-efficient optimization—we would like to emphasize that many real-world rewards are difficult to model with accurate differentiable proxies, for the following reasons:
- Non-differentiable domain-specific features: Some rewards are based on discrete or dictionary-based features (e.g., molecular or protein descriptors, or outputs from tools like AlphaFold3), which are inherently non-differentiable.
- Black-box feedback: Many important objectives in molecular and protein design rely on black-box simulations, such as physics-based tools like Vina or Rosetta, where gradients are unavailable but the reward signals (e.g., docking score, folding stability) are crucial.
Even in settings where differentiable rewards are easily available—such as “Enhancers” and “5′ UTR”—our method still outperforms gradient-based approaches like DPS, as shown empirically in Table 1 in Section 6.
Moreover, SVDD can be seamlessly combined with differentiable proxy models when they are available, even if they are potentially suboptimal. As described in Algorithm 6 (Appendix F), SVDD can incorporate classifier guidance to construct proposal distributions, allowing it to leverage differentiable signals in place of a pretrained diffusion model.
W2: Compute Overhead of Best-of-N Sampling – Can We Annotate It in Table 1?
In general, we set up the experiments so that the computational complexity of querying the diffusion model is the same across methods. More specifically, the computational complexity of SVDD is O(N’M) (N’: batch size, M: search width), while the computational complexity of Best-of-N is O(N). We set N’M = N in Table 1.
We further include a table summarizing wall-clock time and GPU memory for SVDD and Best-of-N under comparable settings, where we run a loop in the code implementation for all methods.
| Task | Image Aesthetic (M=20) | Enhancer HepG2 (M=10) | 5’UTR MRL (M=10) |
|---|---|---|---|
| Best-of-N | 200s / 9878M | 27.56s / 2102M | 10.68s / 1872M |
| SVDD | 203s / 12355M | 27.78s / 2173M | 13.27s / 1967M |
We will expand Table 1 to annotate the computational complexity in the final revision of the paper.
W3: Reward Hacking and Evaluation of Distance to Original Model
Thank you for highlighting this important point. While we do not explicitly use the term “reward hacking” in the paper, we would like to emphasize that our submission already includes multiple pieces of empirical and conceptual evidence demonstrating that our method is robust to reward hacking. We have also conducted additional experiments to further support this claim.
Empirical Evidence 1:
In Table 1, we report the log-likelihood (LL) of generated samples under the pretrained model. LL serves as a proxy for how well the samples align with the model’s original distribution (distance) —a natural indicator of whether the generation has drifted too far (i.e., potential reward hacking). We observe that the LL of SVDD is comparable to that of Best-of-N, suggesting that SVDD maintains high fidelity to the pretrained model's distribution.
Empirical Evidence 2:
In the image domain, we present both visualizations and CLIP scores to assess alignment with prompts in Section 6. As seen in our results (Figure 3 in our paper)—and in contrast to prior representative work such as DRaFT [Clark et al., 2024]—there is no visual indication of reward hacking. In DRaFT, when reward hacking occurs, the deviation is typically apparent upon visual inspection (Figure 3 in the DRaFT paper).
Empirical Evidence 3 (Additional Analysis):
To more systematically assess semantic fidelity, we conducted a new experiment using LLaVA-1.5-7B, an image-to-text model. Given an animal prompt (e.g., “cheetah”), we evaluate the object mention accuracy—the proportion of generated images where the model correctly identifies the target prompt. This directly checks whether the reward optimization process sacrifices semantic consistency, which would be indicative of reward hacking. Note that this experiment is inspired by two related works (DDPO in Black et al., 2023, and BRAID), which aim to optimize rewards in pre-trained image diffusion models.
The results show that SVDD maintains high object mention accuracy, further supporting its robustness against reward hacking. We note that achieving 100% accuracy is non-trivial—in fact, in scenarios where models are fine-tuned using DDPO or DRaFT, reward hacking can occur and lead to near 0% accuracy in the end, as illustrated in Figure 3 of the DRaFT paper and Figure 3 of the BRAID paper.
| Method | Object Mention Accuracy (%) in Aesthetic |
|---|---|
| Best-of-N | 100 |
| DPS | 100 |
| SMC | 100 |
| SVDD | 100 |
| SVDD-R | 100 |
Conceptual Evidence:
By design, SVDD samples from a search tree rooted in the pretrained model, ensuring that generation remains close to the original distribution. This constraint acts as a natural safeguard against reward hacking, as it prevents excessive divergence from the pretrained model's behavior.
We will incorporate these clarifications and the additional experiment in the final version of the manuscript.
Reference:
DRaFT: https://openreview.net/pdf?id=1vmSEVL19f (ICLR 2024)
BRAID: https://proceedings.neurips.cc/paper_files/paper/2024/file/e68274fc4f158dbcbd4dddc672f7ee9c-Paper-Conference.pdf (NeurIPS 2024)
W4: No limitations addressed
Thank you for pointing this out. We would like to clarify that the limitations of our approach are discussed in Section 4.5. If the reviewer was referring to additional concerns not addressed there, we would be happy to elaborate further and incorporate any necessary clarifications in the revised manuscript.
W5: Why Does SVDD-R Often Outperform SVDD?
This is an insightful question. While SVDD-R is based on SVDD, its resampling strategy can amplify high-reward trajectories more aggressively (a generic sketch of such a resampling step follows this list), which leads to higher average rewards, particularly when:
- The value approximation is imperfect, and resampling acts as a second-stage correction;
- The sampling budget is limited, and SVDD-R focuses computation on promising samples.
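For illustration, a generic sketch of such a value-weighted global resampling step (an SMC-style placeholder, not the paper's exact SVDD-R procedure):

```python
import torch

def soft_resample(candidates, values, alpha=1.0):
    """Resample batch indices in proportion to exp(value / alpha) across all
    candidates, rather than taking a per-sample argmax. Illustrative only."""
    weights = torch.softmax(torch.as_tensor(values, dtype=torch.float32) / alpha, dim=0)
    idx = torch.multinomial(weights, num_samples=len(candidates), replacement=True)
    return [candidates[i] for i in idx]
```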
W6: The comparisons in Figure 3 are only with the pretrained model and do not take other baselines into account
Thank you for your suggestion. In Figure 3, we did not include results from all proposal methods, primarily due to space constraints. Additionally, for tasks such as generating aesthetic images or high-QED molecules, it is often difficult to evaluate quality purely through visual inspection. That said, we will include visualizations for all remaining baselines in the revised version of the paper! Due to the conference's strict rebuttal guidelines, we are unable to provide additional figures at this stage, but we will ensure they are included in the final submission. (If there are other concerns regarding further comparisons to prior work, we would be happy to provide additional information. )
Minor Comments: “Naturalness of These Design Spaces” – Could Be More Formal
Thank you for your suggestion. We agree that the term "naturalness" may sound informal, as its meaning can be context-dependent (e.g., natural images, natural chemical space, etc.). Mathematically, we assume that the pre-trained diffusion model captures this notion. We will clarify this point further in the revised version.
The rebuttal largely addresses my concerns and I have updated my score accordingly.
Dear Reviewer eWG2,
Thank you for replying to our rebuttal and updating the score! We are glad to know that your concerns have been largely addressed. Thank you again for your valuable comments and suggestions, which help improve our work a lot. Please let us know if there are any additional questions or feedback.
Sincerely,
Authors
This work presents an algorithm called Soft Value-based Decoding in Diffusion Models (SVDD), which is a conditioned generation method for diffusion models that does not require a differentiable classifier for guidance. The key idea is to define a "soft value function" (Eq. 1), which re-weights the sampling probability of Monte Carlo generation from diffusion models. Furthermore, the paper proposes SVDD-R, a more efficient sampling method (similar to beam search). The proposed algorithms have been empirically evaluated on image, molecular, and DNA generation tasks. Ablation studies have also been performed. Overall, the current work suggests an effective method for generation from diffusion models with non-differentiable feedback signals.
Strengths and Weaknesses
Strengths
- The paper aims at solving derivative-free guidance for diffusion models, which is a practical and important challenge in AI applications.
- Comprehensive empirical evaluations are presented in the main texts and the appendix, showing the effectiveness of SVDD and its features.
- The idea of using soft value function is well-motivated.
Weaknesses
- SVDD requires multiple evaluations during inference, which limits its potential to leverage human feedback, since asking humans to provide reward signals so many times during generation is not practical.
- The manuscript structure is a bit top-heavy. Too much space is occupied by the methods, and many results are relegated to the Appendix. Some key results in the main text are not sufficiently explained (see my questions). The conclusion is too short.
- The choice of α is crucial, as the paper mentions—when α is small, the "soft" value becomes a "hard" value and leads to a higher reward, but may lose diversity. How to select a proper α in principle remains unsolved.
Questions
- Can you train a guidance model, which computes an additional gradient like that in classifier-free guidance? While the reward function may not be differentiable, this guidance model could be trained using policy gradient. This may decrease inference cost in cases where a large amount of generation is needed.
- In Table 1, why does SVDD-R sometimes show a much higher CLIP score (Image: Compress) but sometimes a much lower CLIP score (Image: Aesthetic) than SVDD?
- Also in Table 1, why can SVDD-R show a log-likelihood even higher than the pretrained model (Molecule: QED, Docking parp1)?
- Before RL was used on diffusion models, the concept of soft value and related studies had been fruitful. What is the relationship between the paper's soft value and those proposed in the classic RL literature, such as [1,2,3]?
[1] Haarnoja T, Tang H, Abbeel P, et al. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning. PMLR, 2017: 1352-1361.
[2] Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning. PMLR, 2018: 1861-1870.
[3] Levine S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
Limitations
Yes
Final Justification
My questions are well addressed by the authors. I think that this work, after incorporating the suggestions from other reviewers and me, would be a good contribution to the venue.
Formatting Issues
The Figure 4 caption is not well formatted.
Thank you for your thoughtful and positive feedback!
W1: Inference Cost Limits Use of Human Feedback
We appreciate the reviewer raising this important point. While SVDD does require multiple reward evaluations during inference, we would like to emphasize the following:
- In human-in-the-loop settings, it is possible to leverage surrogate reward functions trained on offline data. In such cases, the reward may be differentiable; however, as shown in our experiments in Section 6, our method outperforms SOTA approaches that directly optimize differentiable rewards, such as DPS.
- When differentiable proxy models are available, SVDD can be seamlessly integrated with gradient-based approaches, as shown in Algorithm 6 in Appendix F.
- Importantly, SVDD remains effective even with a small number of candidate samples. As demonstrated in Figure 4, using just a small M already yields substantial improvements in reward.
We will clarify these points further in the revised version in our conclusion section.
- Our SVDD-R variant reduces repeated evaluations by promoting top candidates using global resampling, improving efficiency.
In the revised version, we will expand the discussion on how SVDD can be adapted for human-in-the-loop reward optimization, especially via bootstrapped human preferences, offline reward modeling, and hybrid schemes where SVDD is only used in early iterations and replaced by faster sampling later.
W2: Manuscript is Top-Heavy; Results in Appendix
Thank you for the suggestion. We acknowledge this structural issue and sincerely appreciate the reviewer’s feedback. Due to space constraints, we prioritized theoretical exposition and algorithmic clarity in the main text. However, we agree that some experimental results warrant more prominent placement. To address this, we plan to revise the paper as follows:
- We will move key ablations and plots from Appendices D/E/F into the main paper, including results on value approximation vs. reward correlation, the effect of the importance sample size M, and value function quality.
- We will expand the Conclusion section to better highlight practical implications, methodological trade-offs, and future directions.
We believe these changes will improve the balance between theory and empirical results and enhance the overall accessibility of the paper.
W3: Selection of α
Indeed, the parameter α governs the reward–diversity trade-off, analogous to the temperature in softmax policies or the guidance scale in classifier-free guidance. While we empirically investigate its effect in Figure 5, we agree that developing a principled method for selecting α remains an open question—particularly because the optimal value is often highly context-dependent.
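To make this trade-off explicit, a brief note on the limiting behavior, assuming the soft value takes the standard entropy-regularized form (the paper's Eq. (1) may differ in notation):

```latex
% Illustrative assumption: entropy-regularized ("soft") value of the standard form.
\[
  v_t(x) \;=\; \alpha \,\log \mathbb{E}_{x_0 \sim p^{\mathrm{pre}}(\cdot \mid x_t = x)}
  \!\left[ \exp\!\big( r(x_0)/\alpha \big) \right],
\]
\[
  \lim_{\alpha \to 0^+} v_t(x) \;=\; \sup_{x_0 \in \operatorname{supp} p^{\mathrm{pre}}(\cdot \mid x_t = x)} r(x_0),
  \qquad
  \lim_{\alpha \to \infty} v_t(x) \;=\; \mathbb{E}\big[\, r(x_0) \mid x_t = x \,\big].
\]
% Small alpha recovers the "hard" (max-reward) value; large alpha recovers the
% plain expected reward under the pretrained model.
```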
As a practical guideline (which we will include in the Discussion section), we propose the following strategy (see the selection sketch after this list):
- Offline settings: Choose α to maximize a weighted combination of reward and diversity (e.g., entropy or coverage metrics; the specific metrics are context-dependent) on a small validation set.
- Active-learning settings: α can be annealed during inference to balance exploration and exploitation.
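As an illustration of the offline guideline above, a hypothetical selection loop (all callables and the weighting `lam` are task-specific placeholders):

```python
def select_alpha(alphas, generate_fn, reward_fn, diversity_fn, lam=0.5):
    """Grid-search alpha on a small validation budget, trading off mean reward
    against a diversity metric; illustrative sketch only."""
    best_alpha, best_score = None, float("-inf")
    for a in alphas:
        samples = generate_fn(alpha=a)                          # small validation batch
        mean_reward = sum(reward_fn(s) for s in samples) / len(samples)
        score = (1.0 - lam) * mean_reward + lam * diversity_fn(samples)
        if score > best_score:
            best_alpha, best_score = a, score
    return best_alpha
```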
We will highlight this insight and include a discussion in Section 7.
Q1: Can you train a guidance model, which computes an additional gradient like that in a classifier-free guidance? While the reward function may not be differentiable, this guidance model could be trained using policy gradient.
Thank you for the suggestion. We assume you are referring to reinforcement learning (RL)-based fine-tuning. Indeed, Black et al. (DDPO) introduced such an approach, which has been shown to outperform even classifier-free guidance. While we greatly appreciate these contributions, our work focuses on inference-time techniques (without fine-tuning), which are also highly relevant and practical in real-world applications. We will clarify this distinction in the revised manuscript. Please feel free to let us know if we have misunderstood your intention—we would be happy to elaborate further.
Q2: SVDD-R CLIP Score Discrepancy in Image Tasks
That's an insightful observation. We conjecture that SVDD-R tends to favor high-reward outliers, which can either increase or decrease auxiliary metrics such as CLIP scores, depending on the correlation between the optimized reward and the auxiliary metric. We will clarify this phenomenon in the discussion accompanying Table 1.
Q3: SVDD-R Shows Higher Log-Likelihood than Pretrained Model
This effect may be attributed to the resampling dynamics. Since SVDD-R prioritizes trajectories with both high reward and high model confidence (i.e., high log-likelihood), the resulting resampled batch can exhibit a higher average log-likelihood than samples drawn directly from the pretrained model. We emphasize that this does not violate the generative nature of the model; rather, it reflects a biased selection within the model’s existing sample space. We will clarify this explanation in the revised manuscript, specifically under Table 1.
Q4: Comparison with related RL papers
That’s a great point. This indeed corresponds to soft value functions in the classical RL literature, particularly when embedding diffusion models into entropy-regularized MDPs. This connection has been noted in several related works—for example, see Section 3 of the reference below, which discusses this concept in the context of fine-tuning diffusion models via reward signals.
(However, it is worth noting that while the connection is discussed, these soft value functions are not directly used in the algorithms proposed in that work. Hence, our proposed algorithm is novel. )
Reference: Uehara, M., Zhao, Y., Biancalani, T., & Levine, S. (2024). Understanding reinforcement learning-based fine-tuning of diffusion models: A tutorial and review. arXiv preprint arXiv:2407.13734.
The authors have addressed most of my concerns. I agree that this work could be a valuable contribution to NeurIPS, and I will increase my score to 5 accordingly.
Dear Reviewer kjpL,
Thank you for your appreciation of our work and increasing the score! We are glad to know that most of your concerns have been addressed. Thank you again for your valuable comments and suggestions, which help improve our work a lot. Please let us know if there are any additional questions or feedback. Thank you!
Sincerely,
Authors
This paper introduces SVDD, an iterative sampling method using soft value functions for diffusion models. Without the need for fine-tuning or a differentiable model, SVDD enables direct use of non-differentiable feedback and is applicable to both continuous and discrete diffusion models. Approaches based on classifier guidance require differentiable proxy models and cannot utilize non-differentiable features, which limits their usage. To address this issue, SVDD leverages soft value functions, which work as look-ahead functions that indicate how intermediate samples lead to rewards in the future. SVDD selects the intermediate states with the highest reward, resulting in the desired samples. Experiments on image, molecule, and DNA/RNA generation show that SVDD outperforms previous baselines such as DPS and SMC-based methods.
Strengths and Weaknesses
Strengths
- Writing is easy to follow with clear motivation and problem setup.
- Optimizing the reward without the need for fine-tuning and enabling non-differentiable feedback is advantageous for diverse fields and realistic settings. Previous approaches based on classifier guidance cannot use non-differentiable features.
- To the best of my knowledge, integrating a soft value function for reward optimization of diffusion models is a novel approach.
- Validation on several domains shows the wide applicability of the proposed method, supporting both continuous and discrete diffusion models.
- Conducted thorough ablation studies, including a study on the duplication size M for performance and time, and a study on the hyperparameter alpha in the appendix.
Weaknesses
- While the experimental setup is sound, using SD 1.5 for image generation and GDSS for molecule generation, using larger and more recent models, such as SDXL for image generation and DiGress, a discrete diffusion model, for molecule generation could strengthen the experimental results. I want to clarify that these are not required experiments, but a suggestion.
- How is the generation time of SVDD compared to the baselines, DPS and Best-of-N? It seems that look-ahead with the soft value function would require additional inference time compared to the baselines. Also, how much additional memory is required compared to the baselines?
Questions
Please address the questions in the weakness section.
Limitations
Yes.
Final Justification
The authors have adequately addressed my concerns, and I therefore recommend acceptance.
Formatting Issues
No
Thank you for your thoughtful and positive feedback!
w1. Use of Larger, More Recent Models (e.g., SDXL, DiGress)
We appreciate the suggestion. Due to the limited timeframe of the review period, we were not able to include these additional results. However, we selected Stable Diffusion and GDSS as baselines because they are well-established models widely used in the community. In particular, Stable Diffusion has been frequently used in prior work on fine-tuning diffusion models with reward signals, including recent studies such as Clark et al. (ICLR 2024).
Reference: Clark, K., Vicol, P., Swersky, K., & Fleet, D. J. Directly fine-tuning diffusion models on differentiable rewards. ICLR, 2024.
w2. Generation Time and Memory Overhead vs DPS and Best-of-N
This is a valuable question. Here is an answer.
- The total computational cost scales approximately linearly with the importance sample size M. For instance, using M = 4 incurs roughly 4× the cost of a single sampling step—comparable to Best-of-4—yet SVDD achieves significantly higher rewards.
- As shown in Figure 4 (Section 6.2), we vary the value of M and analyze the trade-off between computational cost and reward performance.
- Technically, the additional cost can manifest as either increased runtime or memory usage, depending on the implementation (see the sketch after this list). For example, if samples are generated sequentially in a loop, the runtime increases linearly with M while memory usage remains nearly constant, as shown in our paper. Alternatively, if samples are generated in parallel, memory usage may increase with M, while runtime remains roughly constant.
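A schematic contrast of these two implementation styles (all step and value functions are placeholders, not our released code):

```python
import torch

def proposals_sequential(prior_step, value_fn, x, t, M):
    # Loop implementation: runtime grows roughly linearly with M, memory stays flat.
    candidates = [prior_step(x, t) for _ in range(M)]
    values = torch.tensor([float(value_fn(c, t)) for c in candidates])
    return candidates[int(values.argmax())]

def proposals_parallel(prior_step_batched, value_fn_batched, x, t, M):
    # Batched implementation: memory grows roughly linearly with M, runtime stays flat.
    x_rep = torch.stack([x] * M)                   # replicate the current state M times
    candidates = prior_step_batched(x_rep, t)      # one batched denoising call
    values = value_fn_batched(candidates, t)       # one batched value call
    return candidates[int(values.argmax())]
```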
We further include a table summarizing wall-clock time and GPU memory for SVDD, DPS, and Best-of-N under comparable settings, where we run a loop in the code implementation for all methods.
| Task | Image Aesthetic (M=20) | Enhancer HepG2 (M=10) | 5’UTR MRL (M=10) |
|---|---|---|---|
| Best-of-N | 200s / 9878M | 27.56s / 2102M | 10.68s / 1872M |
| SVDD | 203s / 12355M | 27.78s / 2173M | 13.27s / 1967M |
| DPS | 38s / 35086M | 8.19s / 8076M | 7.77s / 8068M |
Note that we set N = M in Best-of-N, and DPS is much slower than regular sampling, as can be seen by comparing "Best-of-M time / M". We will incorporate this in our final version.
Thank you for the response. My concerns have been addressed. I'll keep my score to accepting the paper.
Dear Reviewer ETLn,
Thank you for your appreciation of our work! We are glad to know that your concerns have been addressed. Thank you again for your valuable comments and suggestions, which help improve our work a lot. Please let us know if there are any additional questions or feedback.
Sincerely,
Authors
The paper introduces a method called SVDD to address the challenges of current diffusion models on optimizing downstream reward functions while preserving the naturalness of the design spaces. SVDD does so by integrating value functions to enable a look-ahead into the intermediate noisy states and the rewards that can be achieved. Experiments on different domains show improved performance over the baselines.
Strengths and Weaknesses
Strengths:
- The work targets an important problem, especially related to sampling in cases of non-differentiable rewards, which can be common in scientific discovery scenarios.
- The paper is overall well written, with a good motivation for the method.
Weaknesses:
- The method relies on learning a good value function, which can be difficult to do properly in general. A discussion on this in the manuscript would be useful.
- The additional cost of sampling good trajectories to learn such a value function would be useful to analyze. Even though the method avoids the cost of expensive fine-tuning, it is not very clear how much the additional cost of sampling trajectories and learning a value function is.
- The method seems to be an extension of SMC, and in that respect, the novelty of the method is not very clear.
Questions
- How many seeds have been used when generating the results and comparisons with the baselines?
- The LL metrics for the proposed method and the baselines seem pretty close. Is there a reason for the proposed method not bringing much benefit compared to the baselines?
- How is the quality of the learnt value function evaluated, and how is its dependence on the number of sample trajectories needed for training analyzed?
Limitations
Yes
Final Justification
Update: I would like to thank the authors for their rebuttal, which has been useful for me to understand their work better. I have also looked at the reviews and discussions from other reviewers, which has also clarified a bunch of my concerns. Since the authors have clarified most of my concerns, I am updating my scores accordingly.
Formatting Issues
None
Thank you for the thoughtful review! The reviewer primarily raised two points: (1) our method appears to rely on learned value functions, and (2) the analysis may be insufficient.
Regarding point (1), we would like to clarify that our proposed method does not rely on a learned value function. Instead, we focus on a learning-free variant, which serves as the default throughout our main experiments. For point (2), we provide a detailed analysis of the learning-free approach (SVDD-PM) in Appendix C and Appendix H.2, including its empirical performance. In this rebuttal, we also offer additional details regarding its sample and computational efficiency. We would be happy to elaborate further!
W1: Reliance on Learning Value Function
While we agree that learning a good value function can sometimes be challenging, importantly, our method proposes a posterior mean approximation (Section 4.3), which avoids learning entirely, and we use this approach as the default in our experimental section, Section 6. In contrast, a learned value function is merely an alternative ablation we study in Appendix C and Appendix H.2, and not the primary proposal of this paper.
- More specifically, our posterior mean estimation approach approximates the expected reward from a noisy input using a single forward pass of the pre-trained diffusion model—an idea inspired by DPS and universal guidance. This method avoids the sample inefficiency of learned value functions while remaining effective empirically, as demonstrated in Section 6.
- We further show that this training-free approach, when combined with SVDD, performs comparably to the Monte Carlo variant that requires a learned value function (Appendix C.3, Table 2). Additionally, we show that the posterior mean approximation itself serves as a strong value function estimator (Appendix C.3, Figure 6, and Section H.2), achieving competitive performance in practice.
W2: Cost of Sampling/ Learning Value Functions + Q3: Evaluation of Value Function Quality and Sample Efficiency
We appreciate the reviewer’s suggestion (but we would like to reiterate—as noted in our response to W1—that our default method is SVDD with a posterior mean approach, which does not require learning value functions).
Below, we address your specific concerns:
(1) Value function quality
We discussed the quality of value function approximations in detail in Appendix H.2. Several plots are provided to illustrate its performance. As shown there, the Pearson correlation between predicted and true rewards is relatively high. Notably, as the diffusion process progresses, our value estimation method increasingly aligns with the final reward, indicating strong predictive accuracy.
(2) Sample Efficiency
We provide additional details regarding computational and sample efficiency in the table below for clarity.
| Task | Num of samples | Runtime | Fine-tuning diffusion model runtime (with DDPO) |
|---|---|---|---|
| Enhancer HepG2 | 704 * 128 | ~25min | ~3.8h |
| Molecule QED | 576 * 1000 | ~2.5h | ~46h |
| 5’UTR MRL | 736 * 128 | ~20min | -- |
- The reported runtime includes the total time for sampling trajectories, value function learning, and evaluation
- For comparison, we also experimented with fine-tuning diffusion models using reward signals via DDPO (Black et al., 2023), and we report the runtime to convergence across several tasks. Compared to fine-tuning the diffusion model, this computational cost remains negligible.
We will incorporate these details into the final version of the paper. Due to the rebuttal format restrictions, we are unable to include learning curve plots of the value function experiments here, but we will ensure they are included in the final camera-ready version.
W3: Novelty Compared to SMC
We understand that this point may be somewhat confusing, but we have explicitly emphasized the distinction in our paper—particularly in Section 2 (Lines 102–107).
While there are concurrent efforts exploring reward-guided generation (e.g., Kim et al., 2025), our approach differs in key ways. Most notably, (1) our algorithm is an instantiation of nested importance sampling (nested-IS) SMC (Naesseth et al., 2019, Algorithm 5), whereas other concurrent works typically employ standard sequential Monte Carlo. (2) Additionally, we introduce a novel technique for reward maximization in Section 5 that organically integrates ideas from both SMC and nested-IS SMC.
The algorithmic distinction between our approach and standard SMC-based methods is also highlighted in Section 4.4. We would be happy to elaborate further if it is still unclear.
Q1: Number of Seeds
Thank you for the comment. While the appropriate methodology may vary depending on the domain, we would like to clarify that the confidence intervals reported in our tables are constructed with this in mind. For example, in the image domain, we generate 300 samples using different random seeds to compute these intervals.
Q2: Close LL Scores Across Methods
Thank you for the observation. This result is in fact expected. We would like to clarify that the objective of SVDD, as described in Line 170, is to generate samples with high reward while preserving naturalness. Importantly, the log-likelihood (LL) serves as a proxy for naturalness, but it is not the primary optimization objective. Given this, the observed result aligns well with the intended behavior of our method.
Please engage in the discussion with the authors. The discussion period will end in a few days.
This work presents an algorithm called Soft Value-based Decoding in Diffusion Models (SVDD), which is a conditioned generation method for diffusion models that does not require a differentiable classifier for guidance. The work targets an important problem, especially when considering non-differentiable reward functions, and will be of interest to the NeurIPS audience. All of the reviewers' concerns were adequately addressed during the rebuttal stage.