Inference-Time Alignment of Diffusion Models with Direct Noise Optimization
This work studies the problem of inference-time alignment of diffusion generative models with downstream objectives.
Abstract
Reviews and Discussion
The paper proposes a novel approach called Direct Noise Optimization (DNO) for aligning diffusion models with continuous reward functions at inference time. DNO optimizes the injected noise during the sampling process to maximize the reward function, without requiring any fine-tuning of the model parameters. The key contributions are:
Out-of-Distribution Reward Hacking: The paper identifies out-of-distribution reward hacking as a critical issue in DNO. It proposes a probability regularization technique to ensure the generated samples remain within the support of the pretrained distribution.
Non-Differentiable Reward Functions: The paper extends DNO to handle non-differentiable reward functions by developing a hybrid gradient approximation strategy.
Experimental Results: Extensive experiments demonstrate that DNO can achieve state-of-the-art reward scores for various image reward functions, within a reasonable time budget for generation. DNO is shown to outperform tuning-based methods in terms of reward scores while requiring significantly fewer computing resources.
Questions for the Authors
The experiments in the paper focus primarily on image generation tasks. Could the authors discuss whether DNO can be applied to other types of diffusion models (e.g., text, audio) and what modifications might be necessary? Are there fundamental limitations to DNO that would prevent its application in these domains?
The authors argue that DNO offers a favorable trade-off between inference time and reward, especially compared to fine-tuning methods. Could the authors provide more detailed comparisons of computational resources required for DNO versus fine-tuning methods, particularly for larger models like SDXL? How does the optimization time scale with model size?
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence. The authors provide theoretical analysis, extensive experiments, and visual examples to back up their claims about the effectiveness and efficiency of Direct Noise Optimization (DNO) for aligning diffusion models at inference time.
The authors present a comprehensive theoretical study of DNO, including a theorem that demonstrates the improvement of the distribution after each gradient step. They also propose variants of DNO to handle non-differentiable reward functions and address the out-of-distribution (OOD) reward hacking problem. The theoretical foundation is solid, with proofs and justifications provided in the appendices.
The authors conduct extensive experiments on several important reward functions, including brightness, darkness, aesthetic score, HPS-v2 score, and PickScore. They compare DNO with existing alignment methods like LGD, SPIN, DDPO, and AlignProp, showing that DNO can achieve state-of-the-art reward scores within a reasonable time budget. The results are presented in tables and figures, demonstrating the superiority of DNO in various settings.
The authors explore three methods for optimizing non-differentiable reward functions, including the proposed Hybrid-2 method. They provide experimental results showing that Hybrid-2 is significantly faster and more effective than other methods like ZO-SGD and Hybrid-1. The experiments on JPEG Compressibility and Aesthetic Score reward functions validate the effectiveness of their approach.
To prevent OOD reward hacking, the authors introduce a novel probability regularization technique. They provide visual examples and quantitative metrics (like CLIP Score and ITM score) to show that this technique effectively keeps the generated samples within the support of the pretrained distribution. The regularization term is shown to stabilize the optimization process and maintain sample quality.
The authors argue that DNO is efficient and practical, requiring significantly fewer computing resources than tuning-based methods. They provide details about the memory usage and time budget for experiments, showing that DNO can run on a single consumer-level GPU with memory usage of less than 15GB. This is supported by the implementation details and experimental settings described in the paper.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria in the paper make good sense for the problem of aligning diffusion models with reward functions at inference-time. Here's why:
This approach is well-suited for inference-time alignment as it doesn't require modifying the pretrained model parameters. By optimizing the injected noise during the sampling process, DNO can effectively align the generated samples with the target reward function while maintaining the pretrained distribution's support.
The introduction of probability regularization to prevent out-of-distribution (OOD) reward hacking addresses a critical issue in alignment methods. This technique ensures that optimized samples remain within the support of the pretrained distribution, which is essential for maintaining sample quality and relevance.
For non-differentiable reward functions, the proposed hybrid gradient methods (especially Hybrid-2) provide an efficient solution. These approaches combine estimated gradients of the reward function with true gradients of the noise-to-sample mapping, offering a practical way to handle real-world scenarios where reward functions may not be differentiable.
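To make this concrete, here is a minimal sketch of what such a hybrid estimator could look like (my own illustration with hypothetical names `hybrid_noise_gradient`, `num_probes`, and `mu`; it captures the general idea rather than the paper's exact Hybrid-2 procedure):

    import torch

    # Estimate the reward gradient at the sample with random finite differences,
    # then pull it back through the differentiable noise-to-sample mapping.
    def hybrid_noise_gradient(sample, noise_vectors, reward_fn, num_probes=16, mu=1e-3):
        with torch.no_grad():
            base = reward_fn(sample)
            grad_est = torch.zeros_like(sample)
            for _ in range(num_probes):
                u = torch.randn_like(sample)
                grad_est += (reward_fn(sample + mu * u) - base) / mu * u
            grad_est /= num_probes
        # Exact gradient of the noise-to-sample mapping, with the estimated
        # reward gradient as the upstream vector (vector-Jacobian product).
        return torch.autograd.grad(sample, noise_vectors, grad_outputs=grad_est)[0]

The reward gradient is estimated with zeroth-order probes at the sample, while the pullback to the noise uses exact automatic differentiation through the sampling process.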
The authors use established benchmark datasets and reward functions relevant to diffusion model alignment, such as aesthetic score, HPS-v2 score, and PickScore. These are appropriate and widely recognized metrics in the field of generative models and alignment research.
The paper compares DNO against several existing alignment methods (LGD, SPIN, DDPO, AlignProp), providing a comprehensive evaluation of its performance relative to state-of-the-art approaches.
Visual examples and optimization trajectories are provided to qualitatively assess the effectiveness of DNO and its variants. This complements the quantitative results and helps in understanding the behavior of the proposed methods.
Theoretical Claims
I checked the proof for Theorem 2.1, which is the main theoretical claim in the paper. This theorem demonstrates that under the assumption of L-smoothness for the composite mapping r ◦ Mθ, the expected reward improves after each gradient step in the Direct Noise Optimization (DNO) process.
The proof follows these key steps:
- It leverages the Descent Lemma from optimization theory, which is a classical result for smooth functions.
- It applies this lemma to the specific context of the noise optimization problem.
- It substitutes the gradient step definition into the inequality from the Descent Lemma.
- It arrives at the final inequality showing that the expected reward improves with each gradient step.
I didn't find any issues with this proof. The assumptions are clearly stated, and the logical flow from premises to conclusion appears sound. The application of the Descent Lemma is appropriate, and the algebraic manipulations seem correct.
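For completeness, here is a minimal sketch of the core step in my own notation (assuming, as in the theorem, that $f = r \circ M_\theta$ is $L$-smooth and the step size satisfies $\eta \le 1/L$):

$$
f(z') \;\ge\; f(z) + \langle \nabla f(z),\, z' - z \rangle - \tfrac{L}{2}\,\lVert z' - z \rVert^2 \quad \text{(Descent Lemma, ascent form)},
$$

$$
z' = z + \eta\,\nabla f(z) \;\Longrightarrow\; f(z') \;\ge\; f(z) + \eta\Bigl(1 - \tfrac{\eta L}{2}\Bigr)\lVert \nabla f(z) \rVert^2 \;\ge\; f(z) + \tfrac{\eta}{2}\lVert \nabla f(z) \rVert^2 .
$$

Taking expectations over the random noise then yields the stated improvement in expected reward.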
Experimental Design and Analysis
I checked the soundness and validity of several key experimental designs and analyses in the paper.
Supplementary Material
I reviewed several parts of the supplementary material that were crucial for understanding the technical details and experimental setups.
Relation to Existing Literature
The key contributions of this paper are closely related to several areas of prior research in diffusion models, reinforcement learning, and optimization. Previous approaches to aligning diffusion models with reward functions have primarily focused on fine-tuning the model parameters through reinforcement learning (RL) or direct fine-tuning; a notable example is DDPO. DNO represents a different approach to inference-time alignment by directly optimizing the noise vectors rather than modifying the model parameters or the sampling dynamics.
Missing Important References
No
Other Strengths and Weaknesses
While DNO is presented as a novel framework, some components build directly on existing ideas from noise optimization in diffusion models. The paper could benefit from a more explicit discussion of how it advances beyond these prior works.
The practical impact might be limited by the computational requirements of the optimization process, though the authors argue that the time costs are reasonable. For some applications, the additional optimization time might still be prohibitive.
The paper is difficult to understand and should be reorganized before being published.
Other Comments or Suggestions
Ensure consistent capitalization in figure captions (e.g., "Figure 1. ODE vs. SDE for optimization" should be "Figure 1. ODE vs. SDE for Optimization")
Ensure all acronyms are defined upon first use (e.g., OOD, ODE)
Thank you for your time in reviewing our work. Here, please allow us to provide specific responses to your major comments.
1. Comparing to prior works
Please allow us to emphasize and reiterate our main contributions here, which are distinct from all prior works on noise optimization.
- A more comprehensive formulation for noise optimization with theoretical understanding: Firstly, we reveal a fundamental insight: every stochastic element in the diffusion sampling process can be harnessed and optimized. In contrast, previous works have focused solely on optimizing the initial noise. As another major contribution, we elucidate the underlying mechanism of optimizing noise vectors, showing that it can be viewed as sampling from a provably better distribution.
- Identification and mitigation of the OOD reward-hacking problem in noise optimization: For example, in our experiments on brightness and darkness enhancement, we show that without proper regularization, noise optimization cannot be applied effectively. Our work highlights the standard Gaussian distribution as a crucial prior for noise vectors, enabling successful regularization for such applications.
- Extension to non-differentiable reward functions: This innovation significantly broadens the applicability of DNO to a wider range of scenarios, as requiring reward functions to be differentiable can be a highly restrictive condition in real-world applications.
With these three important contributions developed in this work, we believe it lays a stronger foundation for more applications of noise optimization in the future.
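To state the first point in symbols (the notation below is illustrative and may differ slightly from the main text), DNO solves

$$
\max_{z = (z_1, \dots, z_T)} \; r\bigl(M_\theta(z)\bigr) \;-\; \gamma\, \mathcal{R}(z),
$$

where $M_\theta$ is the noise-to-sample mapping of the sampler, $z$ collects every noise vector injected during sampling (only the initial noise for ODE sampling, all injected noises for SDE sampling), $\mathcal{R}$ is the probability regularization that keeps $z$ within the high-probability region of the standard Gaussian prior, and $\gamma$ is an illustrative regularization weight.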
2. Optimization time
We would like to point out that the optimization time in DNO is actually controllable, which provides significant flexibility for different applications.
For time-sensitive applications, fewer optimization steps can be used, making the additional time less prohibitive, although the sample improvement may also be limited in such cases. For less time-sensitive applications, more optimization steps can be employed to achieve better results.
More importantly, as also discussed in our second response to Reviewer F5TP, directly combining DNO with fine-tuning-based methods will make the time cost required by DNO more acceptable.
3. More modality
For audio, we believe the answer is yes. For text, however, we think the technique of DNO is not directly applicable. This is because, in discrete diffusion, it is not possible to compute the gradient from the sample back to the latent noise due to the combinatorial nature of the problem. The reasons we chose to experiment with image diffusion models are mainly threefold:
- There are many excellent open-sourced image diffusion models available for us to conduct experiments.
- There are numerous existing baselines for the image alignment problem, which allow for meaningful comparisons. In contrast, there are very few implementations addressing the alignment problem for audio.
- Most importantly, there are many powerful, open-sourced reward models for images, which are trained on high-quality human-ranked datasets. Such resources are currently lacking for other modalities like audio.
In the future, once the community for audio diffusion achieves the same maturity as the image diffusion community, we believe DNO will continue to demonstrate its value in this domain.
4. Scaling behavior of computational resources
Thank you for bringing this important question to our attention. This is indeed a crucial aspect to discuss!
Due to the length limit of the rebuttal, we can only provide a brief response here and will leave the supporting evidence for the revised manuscript or the later discussion period. Below, we elaborate on how memory usage scales for fine-tuning methods compared to our DNO method, as memory usage is typically the dominant factor determining the number of GPUs required for these tasks. In summary, our conclusions are as follows:
- Assuming the memory usage of direct sampling as 1 unit, fine-tuning methods would require at least 10 times more units to get started, while our DNO method requires approximately 1.5 times more units. Generally, memory usage scales linearly as the model size grows.
- One gradient step in DNO roughly takes 2.5–2.7 times longer than direct sampling. Generally, the time cost of direct sampling scales sublinearly or linearly as the model size grows, depending on the level of parallelism in the architecture.
Thank you again for raising this excellent discussion. We believe this analysis will make our manuscript more informative, and we will include these insights and the rigorous measurements in the revised version.
5. Presentation
Could you please specify which parts you feel need the most improvement or reorganization? Your detailed suggestions would be greatly appreciated. Thank you for your valuable feedback!
This paper conducts a comprehensive investigation in optimizing the noise in the sampling process of diffusion models for alignment. The main contributions are: 1) giving a rigorous definition of noise optimization and extending it to SDE sampling, 2) explaining and quantifying the root of OOD issues in noise optimization and providing a regularization, and 3) extending noise optimization to non-differentiable rewards by estimating their gradients. Generally, this paper delves into some crucial technical issues of noise optimization, a new branch for diffusion alignment, and proposes basic solutions.
Questions for the Authors
[1] Can DNO be combined with other tuning-based methods? I believe this could be extremely important for potential real-world applications, since training cost may not be a real concern for industry deployment of diffusion models.
[2] Could you elaborate on the optimization process of SDE sampling, as suggested above? How is the optimization conducted?
[3] Please explain the relationship between this paper and the work Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps (https://arxiv.org/abs/2501.09732).
Claims and Evidence
All of the contributions mentioned above are clearly supported by either theoretical or empirical results. Additionally, for 2), I expect a more direct demonstration of the connection between the low probability of z and the OOD examples, e.g., what are the M1 and M2 values for diffusion models with hacked reward alignments?
Methods and Evaluation Criteria
Yes. This paper discusses an underexplored branch of diffusion alignment: optimizing the noise, which makes sense for deeper understanding of diffusion models.
Theoretical Claims
The correctness of Theorem 2.1, results about SDE optimization advantages, and Lemma 3.1 are all checked. To the best of my knowledge, I do not see any issues.
Experimental Design and Analysis
The experiment is extensive and persuasive. I checked the improvement by introducing regularization and the comparison to existing alignment methods. As an underexplored method, DNO yields considerable performance.
Issues: I wonder whether DNO could be combined with fine-tuning based methods for even more superior performance.
Supplementary Material
I reviewed Appendix B for the examples and Appendix D for the theoretical results.
Relation to Existing Literature
This paper delves into some crucial technical issues of noise optimization, an underexplored branch for diffusion alignment, and proposes basic solutions. It provides a fundamental baseline for this field and guides necessary attention to this branch, which I believe is significant.
Missing Important References
One existing (or maybe concurrent) work should be noticed: Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps (https://arxiv.org/abs/2501.09732). I am curious about the difference between this paper and the mentioned work.
Other Strengths and Weaknesses
Weaknesses:
[1] The difference between steps of ODE optimization and those of SDE optimization should be elaborated. For example, does optimizing one step for SDE sampling optimize all T×D noise variables? Since this should be a long serial optimization, how are all the gradients computed?
Other Comments or Suggestions
I think this paper studies a significant problem. If my concerns are addressed, I would be pleased to raise my scores.
Thank you for your time in reviewing our work. Here, please allow us to provide specific responses to your major comments.
1. "What is the level of M1 and M2 value for diffusion models with hacked reward alignments?"
In this work, we did provide a more direct illustration of this point. In lines 262–263, we define the metrics using the values of M1 and M2. Then, in Figures 2 and 13–15, we show that these metric values diminish to zero when reward-hacking occurs.
2. "Can DNO be combined with other tuning-based methods?"
Yes, and we believe the most natural way to combine DNO with tuning-based methods is to directly apply DNO on fine-tuned models. In this way, our proposed DNO can continue to improve samples generated by aligned models at test time. To validate this point, we conducted a quick experiment. Following the setting in Section 5.2, we directly applied DNO to the model fine-tuned by DDPO. We observed that DNO can indeed continue to improve the sample quality to achieve higher rewards and can reach the level of reward achieved by running DNO for 5 minutes using the base model, but with only a 1-minute budget.
| Reward | DDPO | DNO (5 min) | DDPO + DNO (1 min) |
|---|---|---|---|
| Aesthetic | 7.180 | 8.587 | 8.761 |
| HPS | 0.287 | 0.324 | 0.319 |
In this sense, applying DNO with fine-tuning-based methods can be viewed as a way to accelerate the DNO algorithm, making it more practical, especially for time-sensitive applications.
3. "Could you elaborate on the optimization process of SDE sampling as advised above?"
The optimization is less complex than it might appear: using ODEs and SDEs for DNO incurs almost the same time cost and memory usage. Gradient backpropagation is conducted over the entire sampling process for both ODE and SDE samplers using automatic differentiation.
To clarify how our implementation works, here is a core snippet of our code:
import torch
from torch.utils import checkpoint

# SDE sampling logic. To switch to ODE sampling, only the sampler and the
# dimension of the noise vectors need to be changed accordingly.
noise_vectors = torch.randn(args.num_steps + 1, 4, 64, 64, device=args.device)
noise_vectors.requires_grad_(True)
sampler.initialize(noise_vectors, "sde")

# Sampling process: backpropagation runs through every denoising step,
# with activation checkpointing to reduce memory usage.
while not sampler.is_finished():
    model_kwargs = sampler.prepare_model_kwargs(prompt_embeds=prompt_embeds)
    model_output = checkpoint.checkpoint(unet, **model_kwargs)
    sampler.step(model_output)

# Gradient computation: maximize the reward of the final sample.
sample = sampler.get_last_sample()
loss = -reward_function(sample)
loss.backward()
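For completeness, a sketch of how an outer optimization loop could wrap this snippet is shown below (the optimizer choice, learning rate, and `num_opt_steps` are illustrative and not our exact settings):

    # Illustrative outer loop: each iteration re-runs the sampling pass above
    # and takes one gradient step directly on the noise vectors.
    optimizer = torch.optim.Adam([noise_vectors], lr=1e-2)  # lr is illustrative
    for _ in range(num_opt_steps):  # number of steps chosen by the time budget
        optimizer.zero_grad()
        sampler.initialize(noise_vectors, "sde")
        while not sampler.is_finished():
            model_kwargs = sampler.prepare_model_kwargs(prompt_embeds=prompt_embeds)
            model_output = checkpoint.checkpoint(unet, **model_kwargs)
            sampler.step(model_output)
        loss = -reward_function(sampler.get_last_sample())
        loss.backward()
        optimizer.step()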
4. Relationship to the work [arXiv:2501.09732]
Thank you for pointing out this work to us! We also noticed this work after the ICML submission deadline. For simplicity, we will refer to this work as ITS below. From our perspective, the ITS work represents a different approach to handling test-time scaling for diffusion models using combinatorial search techniques, while we focus on continuous search techniques using gradient-based optimization.
To illustrate the difference:
- If the reward model is not continuous or differentiable, the method proposed by ITS is more useful, as it does not require continuity or differentiability by design.
- On the other hand, when the reward model is continuous and smooth, our proposed DNO is more favorable because it leverages more information for optimization, resulting in much faster convergence compared to the ITS work. This is also evident if we compare Figure 9 in the ITS work and Figure 3 of our work, where we show that using the gradient of the Aesthetic reward leads to significantly faster optimization.
As illustrated above, we acknowledge that this is indeed a very important concurrent work, and we will include a discussion of this work in the revised manuscript.
Thank you. I believe your method is significant. I have raised my score.
Dear Reviewer F5TP,
Thank you for acknowledging our rebuttal response and raising the score. We assure you that the discussed points will be reflected in the revised version.
Best regards,
Authors
This paper investigates the alignment problem of diffusion models during inference and proposes a tuning-free, prompt-agnostic method named Direct Noise Optimization (DNO). The authors theoretically investigate the properties of DNO and propose variants of DNO, aiming to solve problems of out-of-distribution reward hacking and optimization of non-differentiable reward functions. Experiments demonstrate that DNO achieves state-of-the-art performance on several reward functions.
Questions for the Authors
Why does DNO achieve significantly better performance with very few diffusion steps (e.g., 10 or 15 steps) compared to the standard setting of 50 steps (Table 2)?
Claims and Evidence
Most claims in the paper are well-supported by theoretical analysis and empirical results.
Methods and Evaluation Criteria
The paper primarily uses reward scores (e.g., Aesthetic Score, HPS Score, PickScore) and Out-of-Distribution (OOD) indicators (e.g., CLIP Score, ITM Score). These metrics are appropriate for assessing alignment performance.
Theoretical Claims
The key assumption in Theorem 2.1 that the noise-to-sample mapping is smooth is not entirely convincing, and it is unclear how the cited Figure 4 in (Tang et al., 2024a) directly supports the claim. A more rigorous justification is needed.
Experimental Design and Analysis
A potential limitation is the use of a simple animal prompt dataset, which may restrict the evaluation’s generality. Testing on more diverse prompts (e.g., scenes, abstract concepts) could better assess DNO’s robustness.
Supplementary Material
Yes, I reviewed the supplementary material, which includes the provided code.
Relation to Existing Literature
The proposed Direct Noise Optimization (DNO) method contributes to the growing body of work on aligning diffusion models with specific tasks or reward functions. Previous methods like DDPO, DPOK, AlignProp, and DRaFT have explored various approaches to this problem, including reinforcement learning and direct fine-tuning. DNO belongs to the category of inference-time alignment methods, which also includes LGD.
Missing Important References
I noticed that there are some works on inference-time initial noise optimization, such as [1] and [2]. These works seem to be closely related to the topic of this paper. However, they are not cited or discussed in the current version.
[1] Eyring, L., Karthik, S., Roth, K., et al. ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization. NeurIPS, 2024.
[2] Qi, Z., Bai, L., Xiong, H., et al. Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization. arXiv, 2024.
Other Strengths and Weaknesses
Strengths
- DNO is a test-time optimization method that distinguishes itself from conventional approaches like reinforcement learning and direct fine-tuning of diffusion models.
- The proposed method addresses OOD reward-hacking and non-differentiable rewards.
- The paper provides detailed mathematical derivations and theory.
Weaknesses
- It is unclear whether the proposed method might negatively impact the original model's capabilities, such as diversity in generation.
- It lacks a comparison with other inference-time initial noise optimization methods.
- The paper tests on simple animal prompts and does not evaluate DNO performance on commonly used t2i or complex prompts. Additionally, the paper claims that DNO is prompt-agnostic, but human preference metrics like HPS and PickScore actually require prompt consideration.
Other Comments or Suggestions
No further comments or suggestions.
Thank you for your time in reviewing our work. Here, please allow us to provide specific responses to your major comments.
1. On the smoothness assumption.
Conceptually, it is straightforward to argue that the reward function is smooth with respect to pixel changes, because small changes to the image pixels do not result in large differences in the reward score. To provide a more concrete answer, we conducted a quantitative analysis:
We first sampled a noise vector $z_1$ and then generated a second noise vector $z_2$ in its neighborhood, i.e., $z_2 = z_1 + \epsilon$ for a small random perturbation $\epsilon$. Using the Aesthetic Score as the reward function $r$, we computed the following quantities:
$$\frac{\lvert r(M_\theta(z_1)) - r(M_\theta(z_2))\rvert}{\lVert z_1 - z_2\rVert} \quad\text{and}\quad \frac{\lVert \nabla (r \circ M_\theta)(z_1) - \nabla (r \circ M_\theta)(z_2)\rVert}{\lVert z_1 - z_2\rVert}.$$
Using 100 samples with random prompts from Section 5.1, we found both estimated quantities to be small, which demonstrates that the composite mapping $r \circ M_\theta$ is indeed smooth. We will include this justification in a more formal way in the revised manuscript.
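For reference, below is a minimal sketch of such a check (the function name `lipschitz_estimates`, the perturbation scale `eps`, and the `mapping` helper are illustrative; `mapping(z)` is assumed to run the differentiable sampling process and return the reward of the resulting sample):

    import torch

    # Illustrative finite-difference check of the smoothness of r ∘ M_theta.
    def lipschitz_estimates(mapping, z, eps=1e-2):
        z1 = z.detach().clone().requires_grad_(True)
        z2 = (z + eps * torch.randn_like(z)).detach().clone().requires_grad_(True)
        r1, r2 = mapping(z1), mapping(z2)
        g1, = torch.autograd.grad(r1, z1)
        g2, = torch.autograd.grad(r2, z2)
        dz = (z1 - z2).norm()
        value_ratio = (r1 - r2).abs() / dz   # reward change per unit noise change
        grad_ratio = (g1 - g2).norm() / dz   # gradient change per unit noise change
        return value_ratio.item(), grad_ratio.item()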
2. Results on more diverse prompts.
We quickly conducted additional quantitative experiments using DNO with SD v1.5, using a setup similar to that in Section 5.2. We tested 1000 randomly selected prompts from the Pick-a-Pic test dataset (https://huggingface.co/datasets/yuvalkirstain/pickapic_v1). Below is a new table that reports the average performance of our DNO. As shown, DNO still performs well for complex prompts. This is not a surprising result, because by design DNO optimizes noise vectors specific to each prompt, ensuring robust performance across diverse scenarios.
| Reward | SD v1.5 | DNO (1 min) | DNO (3 min) | DNO (5 min) |
|---|---|---|---|---|
| Aesthetic | 5.769 | 6.013 | 6.993 | 8.305 |
| HPS | 0.270 | 0.279 | 0.291 | 0.326 |
| PickScore | 21.20 | 21.85 | 23.61 | 24.89 |
3. Discussion on the two related works.
Thanks for mentioning these two works to us! After carefully reading them, we agree that these two works are indeed closely related to our work, and we will add them to our revised manuscript accordingly.
While these two works also focus on noise optimization for diffusion models, there are several distinctions between them and our work:
- [1] ReNO considers a similar reward-based gradient optimization for noise optimization. However, they only consider one-step distilled models, rather than the full-step diffusion models explored in our work. This makes their approach a simplified version of our proposed DNO method. Using one-step distilled models can result in faster optimization speed, but this inevitably sacrifices sample quality. Moreover, for one-step distilled models, there is only one noise to optimize, so they do not need to distinguish between ODE and SDE samplers. Finally, their work lacks a deeper analysis of several critical aspects covered in our paper, such as the OOD reward-hacking problem, convergence issues, and extensions to non-differentiable reward functions.
- [2] This work considers a fundamentally different setting compared to our work and ReNO [1]. Their approach aims at constructing a "reward function" using only the sampling trajectory information, and then applies an optimization idea similar to DNO. Since they do not use any external reward function to directly improve the noise, it is natural to see (as demonstrated in Table 1 of their work) that their improvements are relatively marginal.
4. Impact on diversity.
This is indeed an important question! Unfortunately, it is difficult to rigorously discuss diversity since it heavily depends on the reward function's optimization landscape, which is typically unknown in practice. For instance, if the reward function is strongly concave with a single global maximum, running DNO to convergence would collapse the distribution into a Dirac delta, eliminating diversity entirely. Conversely, if the reward function is constant, DNO would leave diversity unaffected.
Thanks for highlighting this point; we will include this discussion in our revised manuscript.
5. Explanation of Table 2.
This occurs because we fix the time budget to 1 minute; thus, running DNO with fewer diffusion steps allows more optimization steps. Although using fewer diffusion steps can improve performance on these benchmarks, in our main experiments (Table 1) we use the standard 50 diffusion steps to maintain consistency with existing baselines.
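As a rough back-of-the-envelope illustration (assuming, for simplicity, that the cost of one gradient step scales linearly with the number of diffusion steps $T$, with per-step cost $c$):

$$
\#\text{optimization steps within the budget} \;\approx\; \frac{\text{time budget}}{c \cdot T},
$$

so running with $T = 10$ allows roughly $50/10 = 5$ times as many optimization steps as $T = 50$ within the same 1-minute budget, which accounts for the higher final rewards in Table 2.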
This paper introduces a novel approach for optimizing diffusion input noise based on a specified reward function. The key advancements over prior work include:
- A new regularization technique that ensures the optimized noise remains within the distribution of the diffusion model.
- A method for handling non-differentiable reward functions.
- Optimization of a sequence of noise variables throughout the diffusion SDE sampling process, rather than just a single initial noise in ODE sampling.
These improvements enhance performance and yield strong empirical results using SDv1.5 as the teacher model, all without requiring any network optimization.
Questions for the Authors
n/a
Claims and Evidence
The claims are substantiated with supporting evidence.
Methods and Evaluation Criteria
The evaluation is generally well-reasoned. However, Section 5.1 may require additional work to further validate the effectiveness of the proposed regularization. Specifically, the current adversarial setting—optimizing noise to increase darkness with the prompt "white [animal]"—does not fully reflect real-world use cases. Are there more practical scenarios where this regularization would be particularly crucial?
Theoretical Claims
I briefly reviewed the equations, and they appear to be correct.
Experimental Design and Analysis
Yes, but some baselines and comparisons are missing. For instance, the performance of ODE-based and SDE-based noise optimization should be directly compared. Additionally, it would be valuable to explore simpler regularization techniques, such as KL regularization between the optimized noise and a standard Gaussian distribution, as a baseline.
Supplementary Material
Yes, all.
Relation to Existing Literature
This paper addresses the problem of noise optimization for improved image generation. Optimizing noise presents a promising avenue for further performance enhancement beyond weight optimization in pretrained networks and holds significant potential in the emerging trend of test-time scaling.
Missing Important References
It covers the related work well.
Other Strengths and Weaknesses
A few additional strengths of this work include:
- The use of SDE and a more flexible optimization target, which significantly outperforms prior approaches and achieves performance comparable to more expensive weight tuning.
- The ability to handle non-differentiable rewards, broadening the applicability of the method.
Other Comments or Suggestions
n/a
Thank you for your time in reviewing our work. Here, please allow us to provide specific responses to your major comments.
Regarding your comment: "Optimizing noise to increase darkness with the prompt 'white [animal]'—does not fully reflect real-world use cases. Are there more practical scenarios where this regularization would be particularly crucial?"
Firstly, we would like to clarify that although optimizing darkness/brightness was chosen to better reflect an adversarial setting, these optimizations also represent realistic applications. There is a genuine need to generate images with very dark or very light backgrounds, which cannot be achieved by base models through either prompting or best-of-k selection. This application is actually inspired by a well-known trick for diffusion models called offset-noise (see the CrossLabs blog, https://www.crosslabs.org/blog/diffusion-with-offset-noise). This is why we believe these are also very important applications.
From an academic perspective, the main goal of this section is to demonstrate that OOD Reward Hacking is more likely to occur in a strong adversarial setting, and we propose a solution to mitigate it. Interestingly, we have found that our proposed technique has been adopted in a more practical setting in a recent work (https://arxiv.org/pdf/2412.03876). In this work, the reward model evaluates whether the content is safe while the prompts are malicious. This resembles the adversarial setting in our brightness and darkness example, and they demonstrate that our method can mitigate the reward-hacking phenomenon to some extent.
Looking ahead, as more reward models and applications emerge, we believe our proposed techniques will find even more important and crucial applications.
This paper has received mixed reviews in the final recommendations, with three acceptances and one weak rejection. It presents a novel approach for optimizing diffusion input noise by leveraging a specified reward function. While the reviewers raised concerns regarding the insufficient clarification of the optimization process in SDE sampling and the lack of comparison with other noise optimization methods, the authors' rebuttal effectively addressed several of these issues. This ultimately led to a consensus among three reviewers in favor of acceptance. As a result, the Area Chair has decided to accept this paper.