Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-based Decoding
Abstract
Reviews and Discussion
The authors propose a novel algorithm for sampling from pretrained diffusion models while optimising for a given downstream reward function. The proposed method relies on first training a value function, which assigns values not only to final samples but also to noise samples. It then uses BoN sampling at each diffusion step, picking the sample with the highest value under the learned value function. The authors claim that their method significantly improves on relevant image generation and molecule synthesis benchmarks.
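In pseudocode, the described procedure is roughly the following (a minimal sketch with placeholder `prior_step` and `value_fn` callables; the paper's actual algorithm may differ, e.g., by using soft resampling instead of a hard argmax):

```python
import torch

def svdd_like_sampling(prior_step, value_fn, x_T, num_steps, M):
    """Per-step best-of-M selection guided by a (soft) value function.

    prior_step(x, t) draws one denoising step from the pretrained model and
    value_fn(x, t) returns a scalar score for an intermediate state; both are
    placeholders, not the authors' implementation.
    """
    x = x_T
    for t in reversed(range(num_steps)):
        candidates = [prior_step(x, t) for _ in range(M)]            # M proposals from the prior
        values = torch.tensor([float(value_fn(c, t)) for c in candidates])
        x = candidates[int(values.argmax())]                         # keep the highest-value state
    return x
```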
Strengths and Weaknesses
Strengths:
- developing significantly better optimisation methods for broadly applicable models (like pretrained diffusion models) is of broad utility
- the proposed method is simple and easy to understand
- the paper writing is clear
Weaknesses:
- "eliminates the need to construct differentiable models" -- while this is true, the proposed method itself relies on training an additional value function. Why is this much better? e.g. using derivatives of proxy models should be more sample efficient?
- how much more compute intensive is the BoN sampling? Could we e.g. annotate this for each row in Table 1?
- Do the Table 1 results somehow account for reward hacking? Shouldn't we also evaluate the distance to the original model, to see which methods achieve the highest reward while having the smallest distance to the original model?
- no limitations addressed
- Why does SVDD-R often outperform SVDD? Shouldn't it be similar? (In Table 1)
- Comparison to prior work does not seem sufficient, given that the paper's case rests largely on improved empirical performance. E.g., reward hacking is not addressed in Table 1 (and not discussed?), and the comparisons in Figure 3 are only with the pretrained model and do not take other baselines into account (apart from the BoN baseline in the appendix?)
Minor things/ Comments:
- "naturalness of these design spaces" -- would be great if this could be expressed a bit more formally (if possible)
Questions
See weaknesses
Limitations
no
Final Justification
The authors largely addressed my concerns.
Formatting Issues
Formatting seems good
We sincerely appreciate the reviewer’s time and thoughtful feedback. The reviewer primarily raised two key concerns: (1) why we don't use differentiable proxy models, and (2) the potential risk of reward hacking.
Regarding point (1), we note that (a) constructing accurate differentiable proxies is often non-trivial in many practical scenarios, and (b) even in such cases, our method demonstrates greater reward optimization performance compared to gradient-based methods such as DPS. Furthermore, as shown in Algorithm 6 (Appendix H), our method (SVDD) can be seamlessly integrated with gradient-based techniques when suitable proxies are available.
Regarding point (2), we provide multiple pieces of empirical evidence in the paper indicating that our method is robust to reward hacking (the likelihood and CLIP scores already shown in Section 6, and additional evidence with a LLaVA model).
We would be happy to offer further clarification if needed.
W1: The proposed method itself relies on training an additional value function.
Importantly, our method proposes a posterior mean approximation (Section 4.3), which avoids learning entirely, and we use this approach as the default in our experimental section, Section 6. In contrast, a learned value function is merely an alternative ablation we study in Appendix C and Appendix H.2, and not the primary proposal of this paper.
- More specifically, our posterior mean estimation approach approximates the expected reward from a noisy input using a single forward pass of the pre-trained diffusion model—an idea inspired by DPS and universal guidance (a minimal sketch follows below). This method avoids the sample inefficiency of learned value functions while remaining effective empirically, as demonstrated in Section 6.
- We further show that this training-free approach, when combined with SVDD, performs comparably to the Monte Carlo variant that requires a learned value function (Appendix C.3, Table 2). Additionally, we show that the posterior mean approximation itself serves as a strong value function estimator (Appendix C.3, Figure 6, and Section H.2), achieving competitive performance in practice.
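For illustration, a minimal sketch of such a posterior-mean value estimate, assuming a DDPM-style epsilon-prediction parameterization (all names are placeholders; the exact parameterization in the paper may differ):

```python
import torch

def posterior_mean_value(x_t, t, eps_model, alpha_bar, reward_fn):
    """Training-free value estimate via the posterior mean (Tweedie estimate).

    eps_model, alpha_bar (cumulative noise schedule), and reward_fn are
    placeholders; one denoiser forward pass yields x0_hat, and the reward of
    x0_hat is used as a proxy value for the noisy state x_t.
    """
    eps = eps_model(x_t, t)                                          # one forward pass
    a_bar = alpha_bar[t]
    x0_hat = (x_t - torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(a_bar)
    return reward_fn(x0_hat)                                         # r(E[x_0 | x_t]) as value proxy
```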
W1: Why is SVDD better than Using Differentiable Proxy Models?
Thank you for raising this critical point. Here is our response.
While we acknowledge that in some scenarios it is feasible to construct accurate differentiable proxy models—and doing so may lead to sample-efficient optimization—we would like to emphasize that many real-world rewards are difficult to model with accurate differentiable proxies, for the following reasons:
- Non-differentiable domain-specific features: Some rewards are based on discrete or dictionary-based features (e.g., molecular or protein descriptors, or outputs from tools like AlphaFold3), which are inherently non-differentiable.
- Black-box feedback: Many important objectives in molecular and protein design rely on black-box simulations, such as physics-based tools like Vina or Rosetta, where gradients are unavailable but the reward signals (e.g., docking score, folding stability) are crucial.
Even in settings where differentiable rewards are easily available—such as “Enhancers” and “5′ UTR”—our method still outperforms gradient-based approaches like DPS, as shown empirically in Table 1 in Section 6.
Moreover, SVDD can be seamlessly combined with differentiable proxy models when they are available, even if they are potentially suboptimal. As described in Algorithm 6 (Appendix F), SVDD can incorporate classifier guidance to construct proposal distributions, allowing it to leverage differentiable signals in place of a pretrained diffusion model.
W2: Compute Overhead of Best-of-N Sampling – Can We Annotate It in Table 1?
In general, we set up the experiments so that the computational complexity of querying the diffusion model is the same across methods. More specifically, the computational complexity of SVDD is O(N’M) (N’: batch size, M: search width), while the computational complexity of Best-of-N is O(N). We set N’M = N in Table 1.
We further include a table summarizing wall-clock time and GPU memory for SVDD and Best-of-N under comparable settings, where we run a loop in the code implementation for all methods.
| Task | Image Aesthetic (M=20) | Enhancer HepG2 (M=10) | 5’UTR MRL (M=10) |
|---|---|---|---|
| Best-of-N | 200s / 9878M | 27.56s / 2102M | 10.68s / 1872M |
| SVDD | 203s / 12355M | 27.78s / 2173M | 13.27s / 1967M |
We will expand Table 1 to annotate the computational complexity in the final revision of the paper.
W3: Reward Hacking and Evaluation of Distance to Original Model
Thank you for highlighting this important point. While we do not explicitly use the term “reward hacking” in the paper, we would like to emphasize that our submission already includes multiple pieces of empirical and conceptual evidence demonstrating that our method is robust to reward hacking. We have also conducted additional experiments to further support this claim.
Empirical Evidence 1:
In Table 1, we report the log-likelihood (LL) of generated samples under the pretrained model. LL serves as a proxy for how well the samples align with the model’s original distribution (distance) —a natural indicator of whether the generation has drifted too far (i.e., potential reward hacking). We observe that the LL of SVDD is comparable to that of Best-of-N, suggesting that SVDD maintains high fidelity to the pretrained model's distribution.
Empirical Evidence 2:
In the image domain, we present both visualizations and CLIP scores to assess alignment with prompts in Section 6. As seen in our results (Figure 3 in our paper)—and in contrast to prior representative work such as DRaFT [Clark et al., 2024]—there is no visual indication of reward hacking. In DRaFT, when reward hacking occurs, the deviation is typically apparent upon visual inspection (Figure 3 in the DRaFT paper).
Empirical Evidence 3 (Additional Analysis):
To more systematically assess semantic fidelity, we conducted a new experiment using LLaVA-1.5-7B, an image-to-text model. Given an animal prompt (e.g., “cheetah”), we evaluate the object mention accuracy—the proportion of generated images where the model correctly identifies the target prompt. This directly checks whether the reward optimization process sacrifices semantic consistency, which would be indicative of reward hacking. Note that this experiment is inspired by two related works (DDPO in Black et al., 2023, and BRAID), which aim to optimize rewards in pre-trained image diffusion models.
The results show that SVDD maintains high object mention accuracy, further supporting its robustness against reward hacking. We note that achieving 100% accuracy is non-trivial—in fact, in scenarios where models are fine-tuned using DDPO or DRaFT, reward hacking can occur and lead to near 0% accuracy in the end, as illustrated in Figure 3 of the DRaFT paper and Figure 3 of the BRAID paper.
| Method | Object Mention Accuracy (%) in Aesthetic |
|---|---|
| Best-of-N | 100 |
| DPS | 100 |
| SMC | 100 |
| SVDD | 100 |
| SVDD-R | 100 |
Conceptual Evidence:
By design, SVDD samples from a search tree rooted in the pretrained model, ensuring that generation remains close to the original distribution. This constraint acts as a natural safeguard against reward hacking, as it prevents excessive divergence from the pretrained model's behavior.
We will incorporate these clarifications and the additional experiment in the final version of the manuscript.
Reference:
DRaFT: https://openreview.net/pdf?id=1vmSEVL19f (ICLR 2024)
BRAID: https://proceedings.neurips.cc/paper_files/paper/2024/file/e68274fc4f158dbcbd4dddc672f7ee9c-Paper-Conference.pdf (NeurIPS 2024)
W4: No limitations addressed
Thank you for pointing this out. We would like to clarify that the limitations of our approach are discussed in Section 4.5. If the reviewer was referring to additional concerns not addressed there, we would be happy to elaborate further and incorporate any necessary clarifications in the revised manuscript.
W5: Why Does SVDD-R Often Outperform SVDD?
This is an insightful question. While SVDD-R is based on SVDD, its resampling strategy can amplify high-reward trajectories more aggressively (a generic sketch of such a resampling step follows this list), which leads to higher average rewards, particularly when:
- The value approximation is imperfect, and resampling acts as a second-stage correction;
- The sampling budget is limited, and SVDD-R focuses computation on promising samples.
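For illustration, a generic sketch of such a value-weighted global resampling step (an SMC-style placeholder, not the paper's exact SVDD-R procedure):

```python
import torch

def soft_resample(candidates, values, alpha=1.0):
    """Resample batch indices in proportion to exp(value / alpha) across all
    candidates, rather than taking a per-sample argmax. Illustrative only."""
    weights = torch.softmax(torch.as_tensor(values, dtype=torch.float32) / alpha, dim=0)
    idx = torch.multinomial(weights, num_samples=len(candidates), replacement=True)
    return [candidates[i] for i in idx]
```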
W6: The comparisons in Figure 3 are only with the pretrained model and do not take other baselines into account
Thank you for your suggestion. In Figure 3, we did not include results from all proposal methods, primarily due to space constraints. Additionally, for tasks such as generating aesthetic images or high-QED molecules, it is often difficult to evaluate quality purely through visual inspection. That said, we will include visualizations for all remaining baselines in the revised version of the paper! Due to the conference's strict rebuttal guidelines, we are unable to provide additional figures at this stage, but we will ensure they are included in the final submission. (If there are other concerns regarding further comparisons to prior work, we would be happy to provide additional information. )
Minor Comments: “Naturalness of These Design Spaces” – Could Be More Formal
Thank you for your suggestion. We agree that the term "naturalness" may sound informal, as its meaning can be context-dependent (e.g., natural images, natural chemical space, etc.). Mathematically, we assume that the pre-trained diffusion model captures this notion. We will clarify this point further in the revised version.
The rebuttal largely addresses my concerns and I have updated my score accordingly.
Dear Reviewer eWG2,
Thank you for replying to our rebuttal and updating the score! We are glad to know that your concerns have been largely addressed. Thank you again for your valuable comments and suggestions, which help improve our work a lot. Please let us know if there are any additional questions or feedback.
Sincerely,
Authors
This work presents an algorithm called Soft Value-based Decoding in Diffusion Models (SVDD), which is a conditioned generation method for diffusion models that does not require a differentiable classifier for guidance. The key idea is to define a "soft value function" (Eq. 1), which re-weights the sampling probability of Monte Carlo generation from diffusion models. Furthermore, the paper proposes SVDD-R, a more efficient sampling method (similar to beam search). The proposed algorithms have been empirically evaluated on image, molecular, and DNA generation tasks. Ablation studies have also been performed. Overall, the current work suggests an effective method for generation from diffusion models with non-differentiable feedback signals.
Strengths and Weaknesses
Strengths
- The paper aims at solving derivative-free guidance for diffusion models, which is a practical and important challenge in AI applications.
- Comprehensive empirical evaluations are presented in the main texts and the appendix, showing the effectiveness of SVDD and its features.
- The idea of using soft value function is well-motivated.
Weaknesses
- SVDD requires multiple evaluations during inference, which limits its potential to leverage human feedback, since asking humans to provide reward signals so many times during generation is not practical.
- The manuscript structure is a bit top-heavy. Too much space is occupied by the methods, and many results are relegated to the Appendix. Some key results in the main text are not sufficiently explained (see my questions). The conclusion is too short.
- The choice of α is crucial, as the paper mentions—when α is small, the "soft" value becomes a "hard" value and leads to a higher reward, but may lose diversity. How to select a proper α in principle remains unsolved.
Questions
- Can you train a guidance model, which computes an additional gradient like that in classifier-free guidance? While the reward function may not be differentiable, this guidance model could be trained using policy gradient. This may decrease inference cost in cases where a large amount of generation is needed.
- In Table 1, why does SVDD-R sometimes show a much higher CLIP score (Image: Compress) but sometimes a much lower CLIP score (Image: Aesthetic) than SVDD?
- Also in Table 1, why can SVDD-R show a log-likelihood even higher than the pretrained model (Molecule: QED, Docking parp1)?
- Before RL was used on diffusion models, the concept of soft value and related studies had been fruitful. What is the relationship between the paper's soft value and those proposed in the classic RL literature, such as [1,2,3]?
[1] Haarnoja T, Tang H, Abbeel P, et al. Reinforcement learning with deep energy-based policies. International Conference on Machine Learning. PMLR, 2017: 1352-1361.
[2] Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning. PMLR, 2018: 1861-1870.
[3] Levine S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
Limitations
Yes
Final Justification
My questions are well addressed by the authors. I think that this work, after incorporating the suggestions from other reviewers and me, would be a good contribution to the venue.
Formatting Issues
The Figure 4 caption is not well formatted.
Thank you for your thoughtful and positive feedback!
W1: Inference Cost Limits Use of Human Feedback
We appreciate the reviewer raising this important point. While SVDD does require multiple reward evaluations during inference, we would like to emphasize the following:
- In human-in-the-loop settings, it is possible to leverage surrogate reward functions trained on offline data. In such cases, the reward may be differentiable; however, as shown in our experiments in Section 6, our method outperforms SOTA approaches that directly optimize differentiable rewards, such as DPS.
- When differentiable proxy models are available, SVDD can be seamlessly integrated with gradient-based approaches, as shown in Algorithm 6 in Appendix F.
- Importantly, SVDD remains effective even with a small number of candidate samples. As demonstrated in Figure 4, using just a small M already yields substantial improvements in reward.
We will clarify these points further in the revised version in our conclusion section.
- Our SVDD-R variant reduces repeated evaluations by promoting top candidates using global resampling, improving efficiency.
In the revised version, we will expand the discussion on how SVDD can be adapted for human-in-the-loop reward optimization, especially via bootstrapped human preferences, offline reward modeling, and hybrid schemes where SVDD is only used in early iterations and replaced by faster sampling later.
W2: Manuscript is Top-Heavy; Results in Appendix
Thank you for the suggestion. We acknowledge this structural issue and sincerely appreciate the reviewer’s feedback. Due to space constraints, we prioritized theoretical exposition and algorithmic clarity in the main text. However, we agree that some experimental results warrant more prominent placement. To address this, we plan to revise the paper as follows:
- We will move key ablations and plots from Appendices D/E/F into the main paper, including results on value approximation vs. reward correlation, the effect of the importance sample size M, and value function quality.
- We will expand the Conclusion section to better highlight practical implications, methodological trade-offs, and future directions.
We believe these changes will improve the balance between theory and empirical results and enhance the overall accessibility of the paper.
W3: Selection of α
Indeed, the parameter α governs the reward–diversity trade-off, analogous to the temperature in softmax policies or the guidance scale in classifier-free guidance. While we empirically investigate its effect in Figure 5, we agree that developing a principled method for selecting α remains an open question—particularly because the optimal value is often highly context-dependent.
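To make this trade-off explicit, a brief note on the limiting behavior, assuming the soft value takes the standard entropy-regularized form (the paper's Eq. (1) may differ in notation):

```latex
% Illustrative assumption: entropy-regularized ("soft") value of the standard form.
\[
  v_t(x) \;=\; \alpha \,\log \mathbb{E}_{x_0 \sim p^{\mathrm{pre}}(\cdot \mid x_t = x)}
  \!\left[ \exp\!\big( r(x_0)/\alpha \big) \right],
\]
\[
  \lim_{\alpha \to 0^+} v_t(x) \;=\; \sup_{x_0 \in \operatorname{supp} p^{\mathrm{pre}}(\cdot \mid x_t = x)} r(x_0),
  \qquad
  \lim_{\alpha \to \infty} v_t(x) \;=\; \mathbb{E}\big[\, r(x_0) \mid x_t = x \,\big].
\]
% Small alpha recovers the "hard" (max-reward) value; large alpha recovers the
% plain expected reward under the pretrained model.
```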
As a practical guideline (which we will include in the Discussion section), we propose the following strategy (see the selection sketch after this list):
- Offline settings: Choose α to maximize a weighted combination of reward and diversity (e.g., entropy or coverage metrics; the specific metrics are context-dependent) on a small validation set.
- Active-learning settings: α can be annealed during inference to balance exploration and exploitation.
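As an illustration of the offline guideline above, a hypothetical selection loop (all callables and the weighting `lam` are task-specific placeholders):

```python
def select_alpha(alphas, generate_fn, reward_fn, diversity_fn, lam=0.5):
    """Grid-search alpha on a small validation budget, trading off mean reward
    against a diversity metric; illustrative sketch only."""
    best_alpha, best_score = None, float("-inf")
    for a in alphas:
        samples = generate_fn(alpha=a)                          # small validation batch
        mean_reward = sum(reward_fn(s) for s in samples) / len(samples)
        score = (1.0 - lam) * mean_reward + lam * diversity_fn(samples)
        if score > best_score:
            best_alpha, best_score = a, score
    return best_alpha
```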
We will highlight this insight and include a discussion in Section 7.
Q1: Can you train a guidance model, which computes an additional gradient like that in a classifier-free guidance? While the reward function may not be differentiable, this guidance model could be trained using policy gradient.
Thank you for the suggestion. We assume you are referring to reinforcement learning (RL)-based fine-tuning. Indeed, Black et al. (DDPO) introduced such an approach, which has been shown to outperform even classifier-free guidance. While we greatly appreciate these contributions, our work focuses on inference-time techniques (without fine-tuning), which are also highly relevant and practical in real-world applications. We will clarify this distinction in the revised manuscript. Please feel free to let us know if we have misunderstood your intention—we would be happy to elaborate further.
Q2: SVDD-R CLIP Score Discrepancy in Image Tasks
That's an insightful observation. We conjecture that SVDD-R tends to favor high-reward outliers, which can either increase or decrease auxiliary metrics such as CLIP scores, depending on the correlation between the optimized reward and the auxiliary metric. We will clarify this phenomenon in the discussion accompanying Table 1.
Q3: SVDD-R Shows Higher Log-Likelihood than Pretrained Model
This effect may be attributed to the resampling dynamics. Since SVDD-R prioritizes trajectories with both high reward and high model confidence (i.e., high log-likelihood), the resulting resampled batch can exhibit a higher average log-likelihood than samples drawn directly from the pretrained model. We emphasize that this does not violate the generative nature of the model; rather, it reflects a biased selection within the model’s existing sample space. We will clarify this explanation in the revised manuscript, specifically under Table 1.
Q4: Comparison with related RL papers
That’s a great point. This indeed corresponds to soft value functions in the classical RL literature, particularly when embedding diffusion models into entropy-regularized MDPs. This connection has been noted in several related works—for example, see Section 3 of the reference below, which discusses this concept in the context of fine-tuning diffusion models via reward signals.
(However, it is worth noting that while the connection is discussed, these soft value functions are not directly used in the algorithms proposed in that work. Hence, our proposed algorithm is novel. )
Reference: Uehara, M., Zhao, Y., Biancalani, T., & Levine, S. (2024). Understanding reinforcement learning-based fine-tuning of diffusion models: A tutorial and review. arXiv preprint arXiv:2407.13734.
The authors have addressed most of my concerns. I agree that this work could be a valuable contribution to NeurIPS, and I will increase my score to 5 accordingly.
Dear Reviewer kjpL,
Thank you for your appreciation of our work and increasing the score! We are glad to know that most of your concerns have been addressed. Thank you again for your valuable comments and suggestions, which help improve our work a lot. Please let us know if there are any additional questions or feedback. Thank you!
Sincerely,
Authors
This paper introduces SVDD, an iterative sampling method using soft value functions for diffusion models. Without the need for fine-tuning or a differentiable model, SVDD enables direct use of non-differentiable feedback and is applicable to both continuous and discrete diffusion models. Approaches based on classifier guidance require differentiable proxy models and cannot utilize non-differentiable features, which limits their usage. To address this issue, SVDD leverages soft value functions, which work as look-ahead functions that indicate how intermediate samples lead to rewards in the future. SVDD selects the intermediate states with the highest reward, resulting in the desired samples. Experiments on image, molecule, and DNA/RNA generation show that SVDD outperforms previous baselines such as DPS and SMC-based methods.
Strengths and Weaknesses
Strengths
- Writing is easy to follow with clear motivation and problem setup.
- Optimizing the reward without the need for fine-tuning and enabling non-differentiable feedback is advantageous for diverse fields and realistic settings. Previous approaches based on classifier guidance cannot use non-differentiable features.
- To the best of my knowledge, integrating a soft value function for reward optimization of diffusion models is a novel approach.
- Validation on several domains shows the wide applicability of the proposed method, supporting both continuous and discrete diffusion models.
- Conducted thorough ablation studies, including a study on the duplication size M for performance and time, and a study on the hyperparameter alpha in the appendix.
Weaknesses
- While the experimental setup is sound, using SD 1.5 for image generation and GDSS for molecule generation, using larger and more recent models, such as SDXL for image generation and DiGress, a discrete diffusion model, for molecule generation could strengthen the experimental results. I want to clarify that these are not required experiments, but a suggestion.
- How is the generation time of SVDD compared to the baselines, DPS and Best-of-N? It seems that look-ahead with the soft value function would require additional inference time compared to the baselines. Also, how much additional memory is required compared to the baselines?
Questions
Please address the questions in the weakness section.
Limitations
Yes.
Final Justification
The authors have adequately addressed my concerns, and I therefore recommend acceptance.
Formatting Issues
No
Thank you for your thoughtful and positive feedback!
w1. Use of Larger, More Recent Models (e.g., SDXL, DiGress)
We appreciate the suggestion. Due to the limited timeframe of the review period, we were not able to include these additional results. However, we selected Stable Diffusion and GDSS as baselines because they are well-established models widely used in the community. In particular, Stable Diffusion has been frequently used in prior work on fine-tuning diffusion models with reward signals, including recent studies such as Clark et al. (ICLR 2024).
Reference: Clark, K., Vicol, P., Swersky, K., & Fleet, D. J. Directly fine-tuning diffusion models on differentiable rewards. ICLR, 2024.
w2. Generation Time and Memory Overhead vs DPS and Best-of-N
This is a valuable question. Here is an answer.
- The total computational cost scales approximately linearly with the importance sample size M. For instance, using M = 4 incurs roughly 4× the cost of a single sampling step—comparable to Best-of-4—yet SVDD achieves significantly higher rewards.
- As shown in Figure 4 (Section 6.2), we vary the value of M and analyze the trade-off between computational cost and reward performance.
- Technically, the additional cost can manifest as either increased runtime or memory usage, depending on the implementation (see the sketch after this list). For example, if samples are generated sequentially in a loop, the runtime increases linearly with M while memory usage remains nearly constant, as shown in our paper. Alternatively, if samples are generated in parallel, memory usage may increase with M, while runtime remains roughly constant.
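A schematic contrast of these two implementation styles (all step and value functions are placeholders, not our released code):

```python
import torch

def proposals_sequential(prior_step, value_fn, x, t, M):
    # Loop implementation: runtime grows roughly linearly with M, memory stays flat.
    candidates = [prior_step(x, t) for _ in range(M)]
    values = torch.tensor([float(value_fn(c, t)) for c in candidates])
    return candidates[int(values.argmax())]

def proposals_parallel(prior_step_batched, value_fn_batched, x, t, M):
    # Batched implementation: memory grows roughly linearly with M, runtime stays flat.
    x_rep = torch.stack([x] * M)                   # replicate the current state M times
    candidates = prior_step_batched(x_rep, t)      # one batched denoising call
    values = value_fn_batched(candidates, t)       # one batched value call
    return candidates[int(values.argmax())]
```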
We further include a table summarizing wall-clock time and GPU memory for SVDD, DPS, and Best-of-N under comparable settings, where we run a loop in the code implementation for all methods.
| Task | Image Aesthetic (M=20) | Enhancer HepG2 (M=10) | 5’UTR MRL (M=10) |
|---|---|---|---|
| Best-of-N | 200s / 9878M | 27.56s / 2102M | 10.68s / 1872M |
| SVDD | 203s / 12355M | 27.78s / 2173M | 13.27s / 1967M |
| DPS | 38s / 35086M | 8.19s / 8076M | 7.77s / 8068M |
Note that we set N = M in Best-of-N, and DPS is much slower than regular sampling, as can be seen by comparing "Best-of-M time / M". We will incorporate this in our final version.
Thank you for the response. My concerns have been addressed. I'll keep my score to accepting the paper.
Dear Reviewer ETLn,
Thank you for your appreciation of our work! We are glad to know that your concerns have been addressed. Thank you again for your valuable comments and suggestions, which help improve our work a lot. Please let us know if there are any additional questions or feedback.
Sincerely,
Authors
The paper introduces a method called SVDD to address the challenges of current diffusion models on optimizing downstream reward functions while preserving the naturalness of the design spaces. SVDD does so by integrating value functions to enable a look-ahead into the intermediate noisy states and the rewards that can be achieved. Experiments on different domains show improved performance over the baselines.
Strengths and Weaknesses
Strengths:
- The work targets an important problem, especially related to sampling in cases of non-differentiable rewards, which can be common in scientific discovery scenarios.
- The paper is overall well written, with a good motivation for the method.
Weaknesses:
- The method relies on learning a good value function, which can be difficult to do properly in general. A discussion on this in the manuscript would be useful.
- The additional cost of sampling good trajectories to learn such a value function would be useful to analyze. Even though the method avoids the cost of expensive fine-tuning, it is not very clear how much the additional cost of sampling trajectories and learning a value function is.
- The method seems to be an extension of SMC, and in that respect, the novelty of the method is not very clear.
Questions
- How many seeds have been used when generating the results and comparisons with the baselines?
- The LL metrics for the proposed method and the baselines seem pretty close. Is there a reason for the proposed method not bringing much benefit compared to the baselines?
- How is the quality of the learnt value function evaluated, and how is its dependence on the number of sample trajectories needed for training analyzed?
Limitations
Yes
Final Justification
Update: I would like to thank the authors for their rebuttal, which has been useful for me to understand their work better. I have also looked at the reviews and discussions from other reviewers, which has also clarified a bunch of my concerns. Since the authors have clarified most of my concerns, I am updating my scores accordingly.
Formatting Issues
None
Thank you for the thoughtful review! The reviewer primarily raised two points: (1) our method appears to rely on learned value functions, and (2) the analysis may be insufficient.
Regarding point (1), we would like to clarify that our proposed method does not rely on a learned value function. Instead, we focus on a learning-free variant, which serves as the default throughout our main experiments. For point (2), we provide a detailed analysis of the learning-free approach (SVDD-PM) in Appendix C and Appendix H.2, including its empirical performance. In this rebuttal, we also offer additional details regarding its sample and computational efficiency. We would be happy to elaborate further!
W1: Reliance on Learning Value Function
While we agree that learning a good value function can sometimes be challenging, importantly, our method proposes a posterior mean approximation (Section 4.3), which avoids learning entirely, and we use this approach as the default in our experimental section, Section 6. In contrast, a learned value function is merely an alternative ablation we study in Appendix C and Appendix H.2, and not the primary proposal of this paper.
- More specifically, our posterior mean estimation approach approximates the expected reward from a noisy input using a single forward pass of the pre-trained diffusion model—an idea inspired by DPS and universal guidance. This method avoids the sample inefficiency of learned value functions while remaining effective empirically, as demonstrated in Section 6.
- We further show that this training-free approach, when combined with SVDD, performs comparably to the Monte Carlo variant that requires a learned value function (Appendix C.3, Table 2). Additionally, we show that the posterior mean approximation itself serves as a strong value function estimator (Appendix C.3, Figure 6, and Section H.2), achieving competitive performance in practice.
W2: Cost of Sampling/ Learning Value Functions + Q3: Evaluation of Value Function Quality and Sample Efficiency
We appreciate the reviewer’s suggestion (but we would like to reiterate—as noted in our response to W1—that our default method is SVDD with a posterior mean approach, which does not require learning value functions).
Below, we address your specific concerns:
(1) Value function quality
We discussed the quality of value function approximations in detail in Appendix H.2. Several plots are provided to illustrate its performance. As shown there, the Pearson correlation between predicted and true rewards is relatively high. Notably, as the diffusion process progresses, our value estimation method increasingly aligns with the final reward, indicating strong predictive accuracy.
(2) Sample Efficiency
We provide additional details regarding computational and sample efficiency in the table below for clarity.
| Task | Num of samples | Runtime | Fine-tuning diffusion model runtime (with DDPO) |
|---|---|---|---|
| Enhancer HepG2 | 704 * 128 | ~25min | ~3.8h |
| Molecule QED | 576 * 1000 | ~2.5h | ~46h |
| 5’UTR MRL | 736 * 128 | ~20min | -- |
- The reported runtime includes the total time for sampling trajectories, value function learning, and evaluation
- For comparison, we also experimented with fine-tuning diffusion models using reward signals via DDPO (Black et al., 2023), and we report the runtime to convergence across several tasks. Compared to fine-tuning the diffusion model, this computational cost remains negligible.
We will incorporate these details into the final version of the paper. Due to the rebuttal format restrictions, we are unable to include learning curve plots of the value function experiments here, but we will ensure they are included in the final camera-ready version.
W3: Novelty Compared to SMC
We understand that this point may be somewhat confusing, but we have explicitly emphasized the distinction in our paper—particularly in Section 2 (Lines 102–107).
While there are concurrent efforts exploring reward-guided generation (e.g., Kim et al., 2025), our approach differs in key ways. Most notably, (1) our algorithm is an instantiation of nested importance sampling (nested-IS) SMC (Naesseth et al., 2019, Algorithm 5), whereas other concurrent works typically employ standard sequential Monte Carlo. (2) Additionally, we introduce a novel technique for reward maximization in Section 5 that organically integrates ideas from both SMC and nested-IS SMC.
The algorithmic distinction between our approach and standard SMC-based methods is also highlighted in Section 4.4. We would be happy to elaborate further if it is still unclear.
Q1: Number of Seeds
Thank you for the comment. While the appropriate methodology may vary depending on the domain, we would like to clarify that the confidence intervals reported in our tables are constructed with this in mind. For example, in the image domain, we generate 300 samples using different random seeds to compute these intervals.
Q2: Close LL Scores Across Methods
Thank you for the observation. This result is in fact expected. We would like to clarify that the objective of SVDD, as described in Line 170, is to generate samples with high reward while preserving naturalness. Importantly, the log-likelihood (LL) serves as a proxy for naturalness, but it is not the primary optimization objective. Given this, the observed result aligns well with the intended behavior of our method.
Please engage in the discussion with the authors. The discussion period will end in a few days.
This work presents an algorithm called Soft Value-based Decoding in Diffusion Models (SVDD), which is a conditioned generation method for diffusion models that does not require a differentiable classifier for guidance. The work targets an important problem, especially when considering non-differentiable reward functions, and will be of interest to the NeurIPS audience. All of the reviewers' concerns were adequately addressed during the rebuttal stage.