Training-Free Safe Denoisers for Safe Use of Diffusion Models
This paper introduces a training-free method to make diffusion models safer by directly modifying their sampling process to avoid generating undesirable content like NSFW images or copyrighted material, without needing to retrain the models.
Abstract
Reviews and Discussion
The paper introduces a training-free safe denoiser, a method to ensure diffusion models generate safe content without retraining. Current training-free methods mainly rely on prompt techniques, such as incorporating unsafe/safe prompts into the CFG process. This paper instead proposes to incorporate images as guidance, with the guidance scale adaptive to the currently generated sample. Further experiments demonstrate the effectiveness of the proposed safe denoiser.
Strengths and Weaknesses
Strengths
- The idea to incorporate images instead of text prompts is interesting.
- The paper is well written.
Weaknesses
- Why not provide some qualitative results? (The examples in Figure 1 are not enough.)
- More high-level intuition is suggested. Though the paper provides a rigorous mathematical framework for the proposed safe denoiser, I find it does not provide as much insight as the math. On the contrary, I especially value the insight provided in lines 91-96. Besides, personally, I would interpret the method as replacing the negative prompt with images as the sampling guidance. I'm looking forward to more discussion on this perspective in the rebuttal. Explanations of why images are better than words may provide valuable insight.
- The proposed method may not be applicable to specific concepts. In other words, it can only be used for very general concepts like NSFW.
- (Minor) I'm not sure whether this training-free method is practical in the real world, as it can be easily bypassed (for instance, by using the default sampler instead of the safe denoiser).
- (Minor) The qualitative results in Figure 1 show that all these training-free methods manipulate the structure (compared to the first column).
Questions
- What are the corresponding negative images in Figure 1(b)?
- More insights about the method. (Weakness 2)
- What is the data distribution in Figures 2 & 3? Is it a 2D toy experiment or a t-SNE result?
- What are the advantages of this method over methods using text filters or image content safety checkers?
Score Justification
Overall, I recognize the proposed method as a new sampling technique. However, the insight behind this method and the practicality of this method in real-world problems are the main concerns. I will definitely raise the score if these concerns are addressed.
Limitations
Limitations (the method requires multiple images) are provided in the Appendix. However, due to the extra required images, there are still some other potential limitations:
- In some scenarios, collecting the corresponding negative images is even more challenging than designing a safe sampling method, for instance, for IP like Spiderman or a celebrity identity.
- It is computationally intensive to process these negative images if the negative set changes with each erasing task.
Final Justification
I will give a brief summary of the discussion period to facilitate the AC's final decision:
Overall, the paper addresses an important problem in AIGC safety by proposing a novel image-based steering method. While the initial submission had significant gaps in clarity and completeness, the authors have made substantial revisions to address the key concerns raised during the review process. The updated manuscript, once revised as promised, should include critical components that strengthen its impact and contribution to the field.
The first version of the manuscript suffered from three major shortcomings:
- Lack of high-level intuition: the intuition behind the proposed method was not clearly explained.
- Absence of qualitative results: the paper relied heavily on empirical quantitative validation.
- Unstated limitations: the scope of the method was not explicitly discussed.
However, during the rebuttal process, the authors provided:
- A clear motivation for the image-based method (the ambiguity of textual descriptions, and the method as an orthogonal safety tool relative to text-based methods).
- Expanded qualitative results (including the Munch case, which is quite interesting and inspiring).
- A refined limitations section that explicitly links the stated limitations to the method's design (please see Q4, Q5, Q8 & Q9).
Therefore, I raise my score to Accept due to the authors' comprehensive revisions. The inclusion of the above elements (limitations, qualitative results, and high-level intuition) significantly improves the paper's scientific value. However, the inability to fully validate the qualitative results during the rebuttal phase and uncertainty about the final version's alignment with the proposed revisions prevent me from assigning a higher score.
Formatting Issues
N/A
We thank the reviewer for their thoughtful feedback and insightful questions. We address each of the points below.
Q1: Lack of qualitative results
Why not provide some qualitative results?
A. Due to the page limitations of the main paper, we included a representative set of qualitative examples and placed a more extensive collection in the supplementary material (see Figures E.4 - E.13). We plan to incorporate more of these results into the main body for the camera-ready version, for which an additional page is permitted. We can provide specific examples if desired.
Q2: Need for more high-level intuition
More high-level intuition is suggested. ... Besides, I would interpret the method as replacing negative prompt with images as the sampling guidance. ... Explanations on why images are better than words may provide valuable insight.
A. We are grateful for this profound question, as it allows us to clarify two key aspects of our work: first, why image-based guidance is often a more robust tool than text, and second, how our framework is a more general and complementary approach than simply replacing negative prompts.
First, on why images are often a necessity for true safety, we offer two primary arguments:
- Textual Negation is Unreliable: For critical applications like copyright enforcement (e.g., in the music industry), textual negation is often insufficient. As recent work [1] shows, textual negation can paradoxically generate concepts present only in the negative prompt. The fact that textual negation can fail in such unpredictable and counter-intuitive ways makes it an unreliable tool for critical safety applications. In contrast, image-based negation offers a more direct and robust mechanism. It operates by explicitly pushing the generated sample's latent representation away from that of the negative examples, providing a more reliable safeguard against generating specific, forbidden content.
- Some Problems are Ill-Posed for Text: Consider the "right to be forgotten," which would require a model to avoid generating images of millions of specific individuals. Devising a unique and effective negative text prompt for every individual is a practically impossible task. Image-based negation is an easy and viable way for such instance-level removal. We view this paper as establishing the fundamentals, with scalable implementation being a critical next step for future research.
Second, while the reviewer’s interpretation of our method as "replacing negative prompts with images" is an insightful starting point, we would offer a more accurate perspective. Our image-based guidance is not merely a replacement but an orthogonal safety tool. The key distinction is that it is not mutually exclusive with text-based negation and our method can be applied in addition to traditional negative prompts. Our experiments confirm that when both are used together, they provide a significant, cumulative safety benefit for a negligible increase in computational cost. This makes our approach a highly flexible and powerful addition to the existing safety toolkit, rather than a mere substitute for one of its components.
Q3: Applicability to specific concepts
The proposed method may not be applied to specific concepts. ... can only be used for very general concepts like NSFW.
A. Our method is, in fact, concept-agnostic and can be applied to negate any concept for which negative examples exist. To demonstrate this, we would like to kindly point the reviewer to Table 4 in our manuscript, where we conduct an experiment specifically on negating a single, highly-specific ImageNet class (the Chihuahua).
Q4: Practicality and bypass vulnerability
(Minor) I'm not sure if this training-free method is practical in real-world.
A. This is an important point regarding the deployment of safety methods. Our work is a research paper aimed at establishing a foundational methodology for addressing sensitive issues like NSFW content, copyright, and privacy. As argued in our response to Q2, critical challenges in video/audio copyright and personal privacy necessitate a shift towards sample-based negation. In practice, real-world safety systems are built in layers, typically using pre-filtering (of prompts) and post-filtering (of images). Our work provides an additional, crucial layer of safety that operates during the generation process itself.
Q5: Structural manipulation in qualitative results
(Minor) The qualitative results in Figure 1 show that all these training-free method will manipulate the structure (comparing to the first column).
A. The reviewer's observation is correct, but we would argue this change is not a flaw but an inherent and expected outcome of any safety intervention.
Our method's goal is not to preserve the structure of the original model's potentially unsafe output. Instead, our goal is to steer the sampling process to draw from a safe distribution. By definition, any guidance that alters the trajectory away from the original path will change the final sample's structure. The critical measure of success is not faithfulness to the original (and potentially unsafe) generation, but a successful generation from the target safe distribution. This structural change is a fundamental characteristic shared by training-based safety alignment methods as well.
Q6: Negative images used in Figure 1(b)
What is the corresponding negative images in Figure 1(b)?
A. For the experiment in Figure 1-(b), we used a random set of 10 solo real portraits of the individual from the internet as the negative images; see Figure C-3 in the Appendix.
Q7: Data distribution in Figures 2&3
What is the data distribution in Figure 2&3?
A. The visualizations in Figures 2 and 3 are from a 2D toy experiment conducted on the two moons dataset. The Python code snippet below generates the 2D toy data, with hyperparameters for centering and scaling.
    import numpy as np
    from sklearn.datasets import make_moons

    def moons_dataset(n):
        # Two-moons samples, recentered and rescaled to roughly [-1, 1]
        X, _ = make_moons(n_samples=n, random_state=42, noise=0.03)
        X[:, 0] = (X[:, 0] + 0.3) * 2 - 1
        X[:, 1] = (X[:, 1] + 0.3) * 3 - 1
        return X.astype(np.float32) / 3.0

    dataset = moons_dataset(2000)
Q8: Advantages over text/image filters
What are the advantages of this method over methods using text filters or image content safety checker?
A. This is a fundamental question applicable to all research on in-process safe generation. The reviewer rightly points out that pre-filtering (of prompts) and post-filtering (of images) are existing safety mechanisms. Our answer is that in-process safe generation provides a critical additional layer of protection—a "defense-in-depth" strategy.
In the wild, users are incredibly creative in devising adversarial methods to bypass static filters. Our method introduces a safeguard that operates during the generation process itself. More importantly, unlike pre- and post-filtering methods which are typically discriminative, our approach is generative. This qualitative difference in mechanism means it can protect against failure modes that traditional filters might miss, thus contributing to a more robust overall safety system.
Q9: Difficulty of collecting negative images
... collecting corresponding negative images is even much more challenging than designing a safe sampling method. ... IP like Spiderman ...
A. We thank the reviewer for raising this practical concern. However, we argue that this challenge is not unique to our method but is inherent to any content filtering system, including the safety checkers the reviewer mentioned.
For a post-hoc safety checker to be able to identify and flag an image of Spiderman, it must have been trained on images of Spiderman. Therefore, any service provider that possesses a functional content checker for a specific IP or celebrity identity implicitly possesses the corresponding in-house dataset of images. Our method simply proposes to leverage this already existing data for negation during inference. Consequently, we believe that for a service provider, collecting these negative images does not pose an additional burden compared to building a safety checker [2] or implementing a retraining-based safety method [3], both of which face the exact same data requirement.
References
[1] Ban, Yuanhao, et al. "Understanding the Impact of Negative Prompts: When and How Do They Take Effect?." ECCV2024.
[2] https://platform.openai.com/docs/guides/moderation
[3] Biggs, Benjamin, et al. "Diffusion soup: Model merging for text-to-image diffusion models." ECCV2024.
Final Remarks for Reviewer Wpcs
Our safe denoiser gives a concrete geometric intuition—each forbidden concept acts as a “repulsive magnet” in latent space, so every diffusion step is gently steered away from the negative image manifold rather than relying on brittle textual negation. This same mechanism unlocks protections that text alone cannot deliver: instance-level copyright blocks (e.g., specific album art in the music industry) and “right-to-be-forgotten” privacy filters for individual faces work immediately once a handful of reference images are supplied. Operationally, the method is plug-and-play—it layers neatly on top of any sampler or prompt filter, reuses the small image sets already needed by standard safety checkers, and adds < 5% runtime overhead—so the conceptual clarity comes hand-in-hand with real-world deployability, directly addressing the two concerns you highlighted. Should these additions resolve the insight and practicality concerns, we kindly ask you to consider updating your score, and we will engage promptly and thoroughly in the discussion phase should any additional questions arise.
Sorry for this late reply. I have carefully considered the authors’ rebuttal and the comments from the other reviewers. Here are my further questions:
- Q1 (solved; hoping for more qualitative results in the main text)
- Q2 (solved. These insights are especially valuable, at least for me. However, I find that the first point and the second point overlap somewhat, as they are both about the ambiguity of text. Besides, "the method as an orthogonal safety tool" is also very valuable. I suggest incorporating this in the conclusion section.)
- Q3 (I do not agree that Chihuahua is specific; it still shares very similar features with the dog class. What I mean is the IP concept, which has very distinct individual features. For instance, SpongeBob is a yellow square sponge with big eyes. Note that copyright infringement is also a critical safety concern.)
- Q4, Q5 & Q9 (solved. These limitations, including a. structural manipulation, b. difficulty of collecting negative images, and c. practicality, should be included in the limitations section, although most safety mechanisms also suffer from them.)
- Q6 & Q7 (solved)
- Q8 (I agree. But the in-depth safety is achieved at the cost of the structural change, which can also be viewed as a utility-safety trade-off.)
Thank you so much for this constructive feedback. We will update our revision to reflect the discussion.
Q3. I do not agree that Chihuahua is specific; it still shares very similar features with the dog class. What I mean is the IP concept, which has very distinct individual features. For instance, SpongeBob is a yellow square sponge with big eyes. Note that copyright infringement is also a critical safety concern.
We thank the reviewer for focusing on the IP protection. We believe copyright-infringing prompts can be classified into the following three scenarios:
- Cases that explicitly include the target IP's name (e.g., "SpongeBob"),
- Cases that do not include the name but provide a detailed description of the target IP (e.g., "a yellow square sponge with big eyes"), and
- Cases where infringing content is generated despite the absence of both name and description.
Regarding the third scenario, which is the most challenging for text-based methods, we conducted a qualitative experiment using Munch's [The Scream]. The prompt "If Barbie were the face of the world's most famous paintings" mentions neither Munch nor [The Scream], yet SD1.4 generates the style of [The Scream] [1]. However, when we used the four variations of the painting (two from 1893, one from 1895, and one from 1910) by Munch as a negative dataset, this artistic style was successfully eliminated. Due to the rules regarding adding new results, we are not allowed to show these generated images during the discussion period, but we will include these in our final version. The generated images maintained the "Barbie" keyword from the prompt but rendered it in modern or classical styles, with the unique characteristics of [The Scream] no longer present. This result not only provides reinforcing evidence for our answer to Q1 but also serves as a powerful proof-of-concept that our methodology can control abstract IP concepts like 'style' or 'artistic touch' that are difficult to define and control with text alone.
The reviewer's feedback has provided a crucial opportunity to extend our core methodology to the practical problem of copyright. Beyond this qualitative proof, we are aware of the importance of quantitative evaluation. While time constraints prevent us from including full results now, we commit to incorporating a rigorous quantitative analysis in the camera-ready version, following the well-established protocol from prior work [2]. This will provide clear data on the performance gains our Safe Denoiser offers for the first two scenarios of copyright-related prompts.
Finally, through this discussion with the reviewer, we have identified a significant potential extension of our method: dynamically adding images that an existing post-hoc filter flags as harmful or infringing to our negative dataset. This would establish a cycle that goes beyond defending against real data, using the generative model's own failure cases to make the generation process more robust. We expect this to function as a powerful safeguard that complements existing filter systems and will report on this in our copyright-related experiments. We hope that these additional experiments and discussion help resolve the reviewer's concerns. Thank you again for your valuable feedback.
Q8. I agree. But the in-depth safety is achieved at the cost of the structural change, which can also be viewed as a utility-safety trade-off.
We agree completely. Any safety method that intervenes in the generation process results in a structural change. Therefore, as the reviewer insightfully pointed out, the core questions boil down to the following two:
- To what distribution does the steered trajectory converge? and
- Does this change improve the utility-safety Pareto curve?
Our research offers a fundamental distinction from existing studies, particularly concerning the first question. We have mathematically proven that our method converges to a safe distribution. While most prior works focus on empirical results, we provide a theoretical foundation for the mechanism, which is a key contribution.
Furthermore, our method also presents a powerful solution to the second question. The fact that it can be combined with existing techniques like negative prompts with negligible additional cost means it is an effective tool for improving the Pareto frontier. In other words, we (1) theoretically prove convergence to a safe distribution and (2) improve the Pareto optimum by combining with existing methods. In doing so, we offer a powerful and new solution to the tradeoff problem you've pointed out.
References
[1] Jeon, Dongjae, Dueun Kim, and Albert No. "Understanding and Mitigating Memorization in Generative Models via Sharpness of Probability Landscapes." ICML2025
[2] He, Luxi, et al. "Fantastic copyrighted beasts and how (not) to generate them." ICLR2025.
Thanks for your further response. Looking forward to the qualitative results in the final version (especially the Munch case). As all my concerns have been addressed, I have raised my score accordingly.
We sincerely thank the reviewer for this constructive and fruitful discussion. We'll reflect it in the revision.
This paper addresses the problem of diffusion models generating inappropriate data. Existing approaches either require retraining the diffusion model, leading to high computational cost, or avoid retraining but lack theoretical guarantees. In contrast, the proposed method corrects the sampling trajectory using inappropriate data to steer the generation process away from undesirable regions. Since the method does not require retraining, it is computationally efficient, and it is also supported by theoretical guarantees. Experiments on several datasets demonstrate that the proposed method can effectively prevent the generation of inappropriate images without compromising generation quality.
Strengths and Weaknesses
Strengths
- This paper proposes an efficient approach that corrects the sampling trajectory without requiring retraining of the diffusion model.
- Theoretical guarantees are clearly stated, particularly in Appendix A, and the approximation described in Section 3.2 appears reasonable.
- The proposed method also demonstrates strong empirical performance across datasets.
Weaknesses
- There are multiple apparent typographical errors in mathematical expressions. Here are a few examples:
- In Figure 2(a), the label should likely be .
- In Algorithm 1, line 9, it seems that should be .
- In Appendix A, line 791, seems to be mistakenly written as .
- In line 792, the expression begins with , which likely should be .
- It would be helpful to provide more explanation around the use of expectations (). For instance, in line 48, inserting an explicit expectation before applying Bayes' rule could improve clarity.
- In Figures 2(b) and 3, using not only color but also different marker shapes would improve readability. This is especially true for Figure 3, where the distinction between (a) and (b) was quite difficult to discern.
Questions
The proposed method does not require retraining, which is a key advantage. However, in Table 2, it outperforms retraining-based methods such as ESD and RECE. Intuitively, I would have expected retraining-based methods to perform better if computational cost were not a concern. Why does the proposed method achieve better performance in practice? Additionally, is it possible that retraining-based methods would outperform your approach if the number of inappropriate data samples were increased?
Limitations
yes
Final Justification
The proposed method of preventing inappropriate images without retraining is interesting, and the experimental results are strong. On the other hand, there are issues with the writing. Hoping that the writing will be improved, I will take the position of Weak Accept.
Formatting Issues
No major formatting issues observed.
We are grateful to the reviewer for their careful and detailed feedback.
Q1: Typographical errors in mathematical expressions
There are multiple apparent typographical errors in mathematical expressions.
A. We thank the reviewer for their careful proofreading. We will correct all typographical errors in the mathematical expressions in our final revision.
Q2: Clarifying the use of expectations
It would be helpful to provide more explanation around the use of expectations (). For instance, in line 48, inserting an explicit expectation before applying Bayes' rule could improve clarity.
A. We agree with this suggestion. To improve clarity and readability for our audience, we will revise the manuscript to include the explicit expectation operator as the reviewer recommends.
Concretely, we will clarify in line 48 that the expectation follows from a standard application of Bayes' rule. We will also make it clear that, by this Bayes' rule, the posterior distributions can be defined as in Definition 3.1 of our manuscript.
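For concreteness, the standard relation we have in mind is the following; this is a sketch in generic diffusion notation, with $\mathbf{x}_0$ the clean data and $\mathbf{x}_t$ the noisy sample at step $t$, and the manuscript's exact symbols may differ:

$$
p(\mathbf{x}_0 \mid \mathbf{x}_t) = \frac{p(\mathbf{x}_t \mid \mathbf{x}_0)\, p(\mathbf{x}_0)}{p(\mathbf{x}_t)}, \qquad \mathbb{E}\left[\mathbf{x}_0 \mid \mathbf{x}_t\right] = \int \mathbf{x}_0\, p(\mathbf{x}_0 \mid \mathbf{x}_t)\, \mathrm{d}\mathbf{x}_0,
$$

i.e., the optimal denoiser is the posterior mean obtained by applying Bayes' rule to the forward noising kernel and the data prior.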
Q3: Improving readability of Figures 2(b) and 3
In Figures 2(b) and 3, using not only color but also different marker shapes would improve readability. This is especially true for Figure 3, where the distinction between (a) and (b) was quite difficult to discern.
A. That is an excellent suggestion for improving the accessibility and readability of our figures. We will update Figures 2-(b) and 3 to use distinct marker shapes in addition to colors in our revision. We thank the reviewer for this helpful advice.
Q4: Performance compared to retraining-based methods
The proposed method does not require retraining, which is a key advantage. However, in Table 2, it outperforms retraining-based methods such as ESD and RECE. Intuitively, I would have expected retraining-based methods to perform better if computational cost were not a concern. Why does the proposed method achieve better performance in practice? Additionally, is it possible that retraining-based methods would outperform your approach if the number of inappropriate data samples were increased?
A. We thank the reviewer for this insightful question. We agree with the reviewer’s intuition that retraining-based methods should theoretically have an advantage over our training-free method under ideal conditions.
However, we hypothesize that their underperformance comes from the inherent challenges of fine-tuning large-scale models with limited data. The retraining-based methods in our comparison are typically fine-tuned on safety-related datasets containing only a few thousand samples. It is a well-documented phenomenon in machine learning that fine-tuning massive pre-trained models on small, specialized datasets can degrade their generation capabilities, often due to issues like catastrophic forgetting or overfitting. In essence, the model's rich and generalized knowledge can be compromised when adapting to a narrow data distribution.
Our training-free approach is designed to circumvent these disadvantages. By modifying the inference process without altering the model's learned weights, we preserve the full and robust performance of the original foundation model.
Thank you for your response. My concerns have been addressed. I will maintain my score as Weak Accept.
We appreciate the reviewer once again for the thoughtful feedback. We’ll make sure to reflect the comments appropriately in the final revision.
This paper proposes a training-free method called Safe Denoiser to improve the safety of diffusion models during image generation. This work introduces a framework that modifies the sampling trajectory directly, using a set of unsafe data (images, concepts) to steer generations away from harmful content. The core contribution is a theoretical derivation showing how to construct a "safe denoiser" by penalizing the standard denoiser with an "unsafe" component weighted by a function of current sample likelihoods. The method is shown to be compatible with and enhance existing text-based safety methods (like SLD and SAFREE).
Strengths and Weaknesses
Strengths
- Strong theoretical foundation (Theorem 3.2) connects safe, unsafe, and standard denoisers.
- Algorithm is training-free.
- Extensive experiments across multiple datasets and attack types.
Weaknesses
- The experiments are only conducted on SD v1.4, without testing the algorithm on more recent models like SDXL and SD 3, which limits the generalizability of the method.
- In line 111 of the paper, the weight $\beta^*$ is approximated, but the paper lacks a detailed explanation of this approximation.
- It is unclear how the specific number of unsafe data points was selected — whether the data was randomly sampled or filtered using a certain strategy. The paper does not seem to provide sufficient details.
- When $\beta = 0$, meaning the safe denoiser is applied to all samples, the ASR increases — which seems counterintuitive. Can the authors explain the possible reasons for this behavior?
- In Table 2, the proposed method achieves an FID of 22.55, which is even better than that of SD v1.4. This suggests that, after safety alignment, the model generates images of higher quality than the original model. This is a strange phenomenon, since safety and quality are usually a trade-off. What could be the possible reason for this improvement?
- Regarding Figure 3, it is unclear whether this figure is a conceptual illustration or the result of a toy model experiment. If it is an experimental result, more implementation details are needed to understand the setup.
Questions
- The experiments are only conducted on Stable Diffusion v1.4. Have the authors evaluated the proposed method on more recent and stronger diffusion models such as SDXL and SD 3? If not, could the authors comment on the potential applicability and limitations of the method on newer architectures?
- In line 111, the weight $\beta^*$ is approximated, but the paper lacks details on the approximation method or its justification. Could the authors elaborate on how this approximation was derived and whether it introduces any significant bias or limitations?
- The method relies on a set of unsafe data points, but it is unclear how these were selected. Were the samples randomly chosen, or was there a filtering or ranking strategy applied? Clarifying this would help in understanding the reproducibility and robustness of the results.
- In the ablation study, when $\beta = 0$ and the safe denoiser is applied to all samples, the Attack Success Rate (ASR) increases, which seems counterintuitive. Can the authors provide an explanation for this behavior?
- In Table 2, the FID score of the proposed method (22.55) is even better than that of the base model SD v1.4 (25.04), suggesting that applying safety alignment may improve generation quality. This seems surprising — do the authors have any hypothesis for why this might occur?
- Could the authors clarify whether Figure 3 is a conceptual illustration or an actual result from a toy experiment? If it is empirical, additional details about the setup and data used would help in understanding its significance.
Limitations
Yes.
Final Justification
My concerns are all addressed.
Formatting Issues
No.
We sincerely thank the reviewer for their insightful comments and constructive suggestions. We address each of the points below.
Q1: Generalizability to newer models like SDXL and SD3
The experiments are only conducted on SD v1.4. ... Have the authors evaluated the proposed method on more recent and stronger diffusion models such as SDXL and SD 3? ...
A. We've extended evaluation to SD3 using Ring-A-Bell, as it has been identified as a particularly challenging dataset for SD3 [1]. The results are presented below.
| Model (SD3) | ASR ↓ | Toxic Rate ↓ |
|---|---|---|
| SD3 | 0.304 | 0.330 |
| + SAFREE | 0.278 | 0.298 |
| + Ours | 0.203 | 0.267 |
As shown, our algorithm not only integrates easily with existing methods like SAFREE, but also leads to a substantial performance improvement. While applying SAFREE to SD3 improves the Attack Success Rate (ASR) by approximately 9% over the baseline SD3, adding our method boosts the performance by a total of 33% compared to the original SD3. This demonstrates the effectiveness and applicability of our approach on newer and more powerful diffusion architectures.
Q2: Details on the approximation of $\beta^*$
... the weight $\beta^*$ is approximated, but the paper lacks a detailed explanation of this approximation. Could the authors elaborate on how this approximation was derived and whether it introduces any significant bias or limitations?
A. Thanks for the question. The exact relationship between the optimal weight and our proxy is . We approximate this as . This holds because the scaling term becomes nearly constant for reasonably large . While is a true constant, the key insight is that for large , the distribution becomes broad and largely independent of the specific .
You are right to point out the bias: our approximation does not hold well for small . However, we don't apply our guidance at that late stage of sampling as stated in line 117. The main structure of the image is already decided when is large. Applying guidance when is small is unnecessary and could add the exact kind of sampling bias you're concerned about. So, we only use our method in the early stages where the approximation is accurate and the guidance is most effective.
Q3: On the selection of unsafe data points
It is unclear how the specific number of unsafe data points was selected — whether the data was randomly sampled or filtered using a certain strategy. ...
A. We used a ranking and filtering strategy to select unsafe data points for our nudity experiments (including those in Table 2). Specifically, we began by generating images using prompts from the I2P dataset. We then used a pre-trained model, Nudenet [2], to obtain a nudity prediction score for each generated image. After ranking all images by this score, we applied a threshold of 0.6, collecting all 515 prompt-image pairs that exceeded this value. Unless otherwise specified, all nudity experiments were conducted using this 515-item dataset.
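To make this procedure concrete, below is a minimal sketch of the ranking-and-filtering step. The helper `nudity_score` is a hypothetical stand-in for the pretrained NSFW classifier (NudeNet in our case); the classifier's actual API is not reproduced here.

    def build_negation_set(prompt_image_pairs, nudity_score, threshold=0.6):
        """Rank generated images by nudity score and keep pairs above the threshold."""
        scored = [(prompt, img, nudity_score(img)) for prompt, img in prompt_image_pairs]
        scored.sort(key=lambda item: item[2], reverse=True)  # rank by predicted nudity
        return [(prompt, img) for prompt, img, score in scored if score > threshold]

In our nudity experiments, applying this with a 0.6 threshold yields the 515 prompt-image pairs mentioned above.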
For our multi-category experiments (including Table 3), we randomly sampled 3,000 images from the full I2P dataset (4,702 items). This random sampling was necessary to ensure the unsafe denoiser computation could fit within our available GPU memory (24GB). We verified that the performance variance resulting from this sampling was statistically insignificant.
We consider reproducibility as a core principle. To that end, we have included all code used for our experiments in the supplementary material for the reviewers' inspection. We plan to make the code fully public upon acceptance of the manuscript.
Q4: Counterintuitive behavior
When $\beta = 0$, meaning the safe denoiser is applied to all samples, the ASR increases — which seems counterintuitive. Can the authors explain the possible reasons for this behavior?
A. We actually don't find this result counterintuitive. Setting $\beta = 0$ means applying the safe denoiser unconditionally, regardless of the sample's own value. Since a small value indicates a sample is already safe, this means we are steering even the safe trajectories. Any such perturbation risks distorting a safe path into an unsafe one, so it's best to avoid touching safe trajectories whenever possible. While this risk might become negligible with a massive, million-sample negative dataset, that's not a practical scenario. In reality, with a relatively small dataset, this unnecessary perturbation can't be guaranteed to be safe. This exact problem is precisely why we introduced the hyperparameter $\beta$ in the first place: to make the guidance selective. As for the curve bending back up, that's simply the opposite problem: when $\beta$ is set too high, unsafe samples no longer receive guidance and the ASR increases again.
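To illustrate how the threshold makes the guidance selective, here is a minimal sketch of the idea; the callables `denoiser`, `safe_denoiser`, and `unsafe_score` are hypothetical placeholders rather than our actual implementation, and the exact quantity being thresholded follows the definition in the paper.

    def selective_denoise(x_t, t, denoiser, safe_denoiser, unsafe_score, beta):
        """Apply the safe denoiser only when the current sample looks unsafe."""
        if unsafe_score(x_t, t) > beta:
            return safe_denoiser(x_t, t)  # steer away from the negative set
        return denoiser(x_t, t)           # leave already-safe trajectories untouched

With beta set to 0, every sample is steered (the unconditional case discussed above); with beta set too high, unsafe samples are no longer steered and the ASR rises again.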
Q5: Surprising FID score improvement
In Table 2, the FID score of the proposed method (22.55) is even better than that of the base model SD v1.4 (25.04), ... do the authors have any hypothesis for why this might occur?
A. We were equally intrigued and conducted an additional experiment on SD3 to investigate whether this was an experimental anomaly or a phenomenon specific to SD1.4. As the table below shows, the trend persists on SD3. Moreover, both SAFREE and our method achieve a better FID score than the baseline model.
| Model (SD3) | FID ↓ | CLIP ↑ |
|---|---|---|
| SD3 | 23.145 | 31.457 |
| + SAFREE | 22.987 | 31.241 |
| + Ours | 22.538 | 31.154 |
Based on these consistent findings, we have formed the following hypothesis.
Hypothesis💡: The FID score improves due to an increase in sample diversity introduced by our safe denoiser.
We offer two lines of reasoning for this conjecture. First, it is a well-known phenomenon that high CFG scales create a trade-off between fidelity and diversity; as CFG increases, fidelity is improved and sample diversity tends to decrease. This reduction in the variance of the sample distribution is a primary cause of FID degradation at high CFG values [1,2]. Our safe denoiser is not tied to the prompt conditioning and, from the perspective of CFG, may act as a form of stochasticity. This counteracts the diversity reduction caused by high CFG, thereby improving the FID score. We have confirmed experimentally that our method produces more diverse outputs for a given prompt compared to the baseline, and we also observed that at low CFG scales (where diversity is already high for both baseline and ours), the baseline SD3 model achieves a better FID, lending further support to this hypothesis.
Second, and more minor, this phenomenon can also be viewed through the imperfections of the FID metric itself. FID approximates the sample and data (feature) distributions as two multivariate Gaussians and measures the distance between them. Therefore, we consider this level of FID improvement to be an insignificant signal, as it is unlikely to correspond to a meaningful difference in generation quality in real-world applications.
Q6: Details on Figure 3
Regarding Figure 3, it is unclear whether this figure is a conceptual illustration or the result of a toy model experiment. ...
A. We thank the reviewer for requesting clarification. Figure 3 is indeed the result of a toy model experiment.
To provide the implementation details: the experiment was conducted using the 'two moons' dataset. The negative dataset was a subset of the same 'two moons' dataset, created by filtering based on a specific threshold. Similar to the formulation in Eq. (4) of our paper, we used the following for the guided denoiser: and denoised with , where we used the PF-ODE (DDIM) for the solver. As this figure illustrates an experiment on the guidance weight, we tested three different values for : , , and . All quantities required for this toy experiment, namely , , , were approximated using Monte-Carlo estimation, i.e., , , and , where and are the negative and positive datasets, respectively. We will add these details to the caption of Figure 3 and the supplementary material in the revised manuscript to ensure clarity.
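For readers who wish to reproduce the spirit of this toy experiment, below is a self-contained sketch of the Monte-Carlo denoiser estimate. It is our own illustrative reconstruction on the two-moons data, assuming a VP-type forward process x_t = alpha_t * x_0 + sigma_t * eps and a simple coordinate threshold for the negative subset; it is not the exact script used for the figures.

    import numpy as np
    from sklearn.datasets import make_moons

    def mc_denoiser(x_t, data, alpha_t, sigma_t):
        """Posterior-mean estimate E[x_0 | x_t] as a kernel-weighted average of data points."""
        sq_dist = np.sum((x_t[None, :] - alpha_t * data) ** 2, axis=1)
        log_w = -sq_dist / (2.0 * sigma_t ** 2)
        w = np.exp(log_w - log_w.max())  # numerically stabilized softmax weights
        w /= w.sum()
        return w @ data                  # weighted average over the dataset

    X, _ = make_moons(n_samples=2000, random_state=42, noise=0.03)
    neg = X[X[:, 0] > 1.0]   # toy "negative" subset selected by a simple threshold
    pos = X[X[:, 0] <= 1.0]  # remaining points serve as the positive set
    x_t = np.array([0.5, 0.2])
    D_neg = mc_denoiser(x_t, neg, alpha_t=0.8, sigma_t=0.6)
    D_pos = mc_denoiser(x_t, pos, alpha_t=0.8, sigma_t=0.6)

The guided denoiser in the figures combines such estimates with a guidance weight, as described above.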
Reference
[1] Sadat, Seyedmorteza, et al. "CADS: Unleashing the diversity of diffusion models through condition-annealed sampling." ICLR2024
[2] Kynkäänniemi, Tuomas, et al. "Applying guidance in a limited interval improves sample and distribution quality in diffusion models." NeurIPS2024.
Final Remarks for Reviewer vAUD
We once again thank the reviewer for their constructive feedback. We believe our responses and the corresponding revisions have fully addressed the concerns raised. Therefore, we respectfully ask the reviewer to reconsider their evaluation and raise their score.
Thanks for your response.
I still have a question. Regarding Q1, the security gain of the algorithm on SD3 is significantly lower than that on SD v1.4, which may indicate that the method is less effective on the new model.
Regarding Q1, the security gain of the algorithm on SD3 is significantly lower than that on SD v1.4, which may indicate that the method is less effective on the new model.
Thank you for the follow-up question about SD3. SD1.4 and SD3 differ in how much unsafe material they saw during training, and this affects the extra safety we could add afterwards.
- SD1.4 was trained on LAION-5B [1], which contains NSFW content [2] (refer to the official model card of Stable Diffusion v1.4). Its attack success rate (ASR) is therefore very high (0.797).
- SD3 used an NSFW filter during training [3], so its ASR is already much lower (0.304).
Since the two models start from very different ASR levels, we also report the percentage reduction to show the relative impact of each safety method.
| Model | ASR ↓ | ASR Drop ↓ | Percent Drop ↓ |
|---|---|---|---|
| SD1.4 | 0.797 | - | - |
| + SAFREE | 0.278 | -0.52 | -65.2% |
| + Ours | 0.127 | -0.67 | -84.1% |
| SD3 | 0.304 | - | - |
| + SAFREE | 0.278 | -0.026 | -8.6% |
| + Ours | 0.203 | -0.101 | -33.2% |
We have the following implications:
- On SD1.4, ours improves by 18.9 percentage points (84.1% - 65.2%) over the SAFREE method.
- On SD3, ours improves by 24.6 percentage points (33.2% - 8.6%) over the SAFREE method.
Looking at how much each method improves over the same strong baseline shows that the gain on SD3 is actually larger in relative terms. We hope this makes clear that our SD3 result is substantial, not a sign of reduced effectiveness.
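For transparency, the percent-drop column is the relative ASR reduction with respect to each unprotected baseline. A minimal sketch of the arithmetic using the numbers above:

    def percent_drop(baseline_asr, method_asr):
        """Relative ASR reduction with respect to the unprotected baseline, in percent."""
        return 100.0 * (baseline_asr - method_asr) / baseline_asr

    print(percent_drop(0.797, 0.127))  # ~84.1 for SD1.4 + Ours
    print(percent_drop(0.304, 0.203))  # ~33.2 for SD3 + Ours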
References
[1] Schuhmann, Christoph, et al. "Laion-5b: An open large-scale dataset for training next generation image-text models." NeurIPS2022.
[2] Thiel, David. "Identifying and eliminating csam in generative ml training data and models." Stanford Internet Observatory, Cyber Policy Center, December 23 (2023): 3.
[3] Esser, Patrick, et al. "Scaling rectified flow transformers for high-resolution image synthesis." ICML2024.
Thank for your response. My concerns have all been addressed, so I raise my score.
We greatly appreciate the reviewer for raising important points and engaging in further discussion. We will carefully reflect our discussion in the revision.
This paper addresses the well-motivated problem of unsafe generation in diffusion models. This includes inappropriate, not-safe-for-work, and memorized generations. Instead of opting for a text-based negative prompting or fine-tuning/re-training strategies, this paper propose a novel method that directly modifies the sampling trajectory, which makes it an inference-based plug-and-play strategy. Experiments are conducted across generation tasks of text-conditional, class-conditional and unconditional and have demonstrated effective performance - mitigate unsafe generations, while maintaining the generation quality.
Strengths and Weaknesses
Strengths:
- The proposed method of using negative sets to modify the inference trajectories is novel.
- The inclusion of theoretical derivations contribute to effectively deliver the insight.
- The paper include a range of generation tasks include text-conditional, class-conditional and unconditional in the experiments, which is good.
- The results are overall effective, where integrating the method can mitigate unsafe generations, while maintaining the generation quality.
- The plug-and-play design of the inference-time method contributes to its practical significance.
Weaknesses:
- From Table 2, we can observe that the proposed method consistently decreases the CLIP performance (i.e., alignment to text prompt), either integrating with SLD (drop from 29.28 to 29.10) or integrating with SAFREE (drop from 30.98 to 30.66). The same observation can be seen from Table 3. Thus, I would suggest the authors only claim improved preservation of quality (instead of both quality and alignment) while reducing unsafe generations.
- It would be better if, in the results section, the authors could explain what the potential reasons could be that, after modifying the inference process to prevent unsafe generation, the FID can be improved even compared to the original Stable Diffusion. From the method design, there is currently no intuition that it can improve on the base model’s generation quality, so do you suggest this as a limitation of FID or other reasons?
The presentation of Figure 2 can be significantly improved. A quantity appears in the caption of Figure 2 but is neither in the visual nor previously defined in the caption. The curves in the left plot are also not defined. Also, for the right plot, it would be better to include in the caption what settings were used for creating this 2D illustration to make the figure self-explanatory.
- The authors have acknowledged that one of the method’s limitations is its requirement for careful tuning of hyperparameters to balance image quality and safety, which is good to have such a discussion in the paper and appendix. However, I would recommend the inclusion of a formal hyperparameter analysis to demonstrate the robustness and sensitivity, which will strengthen the reproducibility.
- (Minor) There is no Figure 4 in the paper, Figure 5 appears after Figure 3. Also, in line 32, “intuition” is misspelled.
Questions
Please see the first four bullet points mentioned in the weaknesses section, which correspond to the four questions that I’d like to ask.
Limitations
Yes
Final Justification
I appreciate the authors' rebuttal, which has addressed my concerns very well. Therefore, I'll keep supporting this paper's acceptance.
Formatting Issues
This minor issue is mentioned in the last bullet point in the weaknesses section: There is no Figure 4 in the paper, Figure 5 appears after Figure 3. Also, in line 32, “intuition” is misspelled.
We are grateful to the reviewer for their valuable feedback and suggestions. We address each of the comments below.
Q1: Claiming preservation of "alignment" vs. "quality"
From Table 2, we can observe that the proposed method consistently decreases the CLIP performance (i.e., alignment to text prompt), either integrating with SLD (drop from 29.28 to 29.10) or integrating with SAFREE (drop from 30.98 to 30.66). The same observation can be seen from Table 3. Thus, I would suggest the authors only claim improved preservation of quality (instead of both quality and alignment) while reducing unsafe generations.
A. That is an excellent point, and we thank the reviewer for this insightful observation. We agree that given the slight decrease in CLIP scores, claiming preservation of "alignment" could be misleading. In our revision, we will adjust our claims to more accurately state that our method preserves overall generation quality while effectively reducing the generation of unsafe content.
Q2: Explanation for FID improvement
It would be better if, in the results section, the authors could explain what the potential reasons could be that, after modifying the inference process to prevent unsafe generation, the FID can be improved even compared to the original Stable Diffusion. From the method design, there is currently no intuition that it can improve on the base model’s generation quality, so do you suggest this as a limitation of FID or other reasons?
A. That's an excellent question. Our take is that this isn't simply a limitation of FID, but rather a revealing side-effect of our method's interaction with the sampling process.
The core issue is the well-known diversity trade-off with high CFG scales. High CFG creates high-fidelity samples but kills diversity, and FID heavily penalizes this lack of variance. We observe our safe denoiser counteracts this. Since its guidance is independent of the text prompt, it acts as a diversifying force, breaking up the uniformity that high CFG creates.
The result is a set of samples that is not only safe but also more varied—and therefore, a better statistical match to the real data in the eyes of the FID metric. In short, our safety guidance seems to have the beneficial side-effect of fixing a known weakness of high-CFG sampling.
Q3: Improving the presentation of Figure 2
The presentation of Figure 2 can be significantly improved. A quantity appears in the caption of Figure 2 but is neither in the visual nor previously defined in the caption. The curves in the left plot are also not defined. Also, for the right plot, it is better to include in the caption what settings are used for creating this 2D illustration to make the Figure self-explanatory.
A. We thank the reviewer for these concrete and valuable suggestions. We will revise Figure 2 and its caption in the final manuscript to make it fully self-explanatory. Specifically, we will:
- change the figure label to match the caption's notation (),
- state that the definition of is presented in Preliminary (in line 48),
- state that the black arrows depict the standard denoising process (i.e., making a data prediction and then re-noising),
- state that the red arrow visualizes the key step of our method: shifting the data prediction, , to a safe prediction, , before the re-noising step is applied,
- add more details for Figure 2-(b) about the settings used to generate the plot.
We believe these revisions will significantly improve the figure's clarity.
Q4: Sensitivity analysis
The authors have acknowledged that one of the method’s limitations is its requirement for careful tuning of hyperparameters to balance image quality and safety, which is good to have such a discussion in the paper and appendix. However, I would recommend the inclusion of a formal hyperparameter analysis to demonstrate the robustness and sensitivity, which will strengthen the reproducibility.
A. We agree that a hyperparameter analysis is crucial for demonstrating the robustness of our method. To this end, we would like to kindly point the reviewer to Figure 5 in our manuscript, where we have already provided an ablation study on the two most critical hyperparameters: the number of unsafe data points and the guidance weight .
In addition, we recognize that sensitivity to the choice of the negation set is another important aspect of robustness. Therefore, we conducted a new ablation study to investigate this. We constructed five distinct negation sets, each containing 600 non-overlapping data points, and evaluated our method on each. As shown in the table below, the performance remains highly stable across these different sets, with minimal variance. We believe the analysis in the paper, together with the one here, addresses the core of the reviewer's concern regarding sensitivity.
| | Avg. | Harassment | Hate | Illegal Activity | Self-harm | Sexual | Shocking | Violence |
|---|---|---|---|---|---|---|---|---|
| Stats (over 5 exp) | 0.1435±0.0010 | 0.1577±0.0027 | 0.1101±0.0013 | 0.1591±0.0014 | 0.1471±0.0035 | 0.0836±0.0022 | 0.1629±0.0028 | 0.1838±0.0015 |
Q5: On typos and figure numbering
(Minor) There is no Figure 4 in the paper, Figure 5 appears after Figure 3. Also, in line 32, “intuition” is misspelled.
A. We are grateful to the reviewer for their careful proofreading and for catching these errors. We will correct the figure numbering and the typo in our final revision.
I appreciate the authors' rebuttal, which has addressed my concerns very well. Therefore, I'll keep supporting this paper's acceptance.
We thank the reviewer for this thoughtful review. We will revise our paper to reflect the discussion.
This paper proposes a training-free method that leverages a negation set (e.g., unsafe images, copyrighted data, or private data) to prevent diffusion models from generating unsafe content. Experiments demonstrate that the proposed method achieves state-of-the-art safety performance on the CoPro dataset, while offering more cost-effective sampling compared to other competing approaches.
Strengths and Weaknesses
Strengths
- The mathematical formulations are clearly presented and easy to understand.
- The proposed method is a general approach that can be combined with other techniques without requiring model fine-tuning.
- Experimental results demonstrate the effectiveness of the proposed method in preventing the generation of unsafe content.
Weakness
- Since the method relies on a negation set, the choice of this dataset has a direct impact on its robustness and performance.
- In Section 5.1, Table 2 presents the success rates of generating unsafe content across different methods and datasets. It also reports image quality and text-image alignment on COCO-30K. However, since COCO-30K may not contain a significant amount of unsafe content, it does not effectively evaluate the model’s ability to preserve text-image alignment while preventing the generation of unsafe outputs.
- Section 5.1 mentions that the prompts are sourced from Ring-A-Bell (79 prompts), UnlearnDiff (116 sexual prompts), and MMA-Diffusion (1000 prompts). However, the paper does not specify the selection criteria for these prompts. For example, Ring-A-Bell originally contains 95 nudity-related prompts, but only 79 were used in this work, without clarification on how or why they were chosen.
- The prompt in the second row of Figure 1(a) is incorrect and does not match the corresponding image.
Questions
- It would be beneficial if the authors could report image quality and text-image alignment metrics on the subset of unsafe prompts (Ring-A-Bell, UnlearnDiff, MMA-Diffusion), in order to assess whether the proposed method is able to suppress unsafe content while maintaining high visual fidelity and alignment with the input text.
- It would be helpful if the authors could elaborate on the prompt selection strategy in Section 5.1. For instance, were the prompts randomly sampled, or was there any criterion that might have influenced the selection in a way that benefits the proposed method?
- It would be helpful to include an ablation study investigating the impact of different choices of the negation set on the method's performance, as this component appears to play a critical role in the approach.
Limitations
Yes
Final Justification
I appreciate the authors' rebuttal, in which most of my concerns are addressed, so I decide to increase my score.
Formatting Issues
No
We thank the reviewer for their thoughtful and detailed feedback. We have carefully considered all comments and would like to address the concerns raised.
Q1: Ablation study on the choice of negation set
Since the method relies on a negation set, the choice of this dataset has a direct impact on its robustness and performance. It would be helpful to include an ablation study investigating the impact of different choices of the negation set on the method's performance, as this component appears to play a critical role in the approach.
A. We thank the reviewer for this excellent suggestion. We agree that investigating the performance sensitivity to the choice of the negation set is crucial. To address this, we have conducted an additional ablation study. In this study, we performed five separate experiments, each using a distinct, non-overlapping set of 600 images from I2P as the negation data. More precisely, in the CoPro experiment in Tab. 3, we used 3,000 images as the negative dataset. As an ablation study, we divided the dataset into five disjoint chunks and conducted the same experiments on each chunk (denoted Exp 1-5). The results, summarized in the table below, demonstrate that the performance (IP) is highly stable across different negation sets, with minimal variance in the outcomes. This confirms the robustness of our approach to the choice of negation set.
| Exp Cases | Avg. | Harassment | Hate | Illegal Activity | Self-harm | Sexual | Shocking | Violence |
|---|---|---|---|---|---|---|---|---|
| Stats | 0.1435±0.0010 | 0.1577±0.0027 | 0.1101±0.0013 | 0.1591±0.0014 | 0.1471±0.0035 | 0.0836±0.0022 | 0.1629±0.0028 | 0.1838±0.0015 |
| Exp1 | 0.1446 | 0.1611 | 0.1085 | 0.1576 | 0.1464 | 0.0868 | 0.1674 | 0.1842 |
| Exp2 | 0.1428 | 0.1548 | 0.1092 | 0.1604 | 0.1478 | 0.0833 | 0.1625 | 0.1814 |
| Exp3 | 0.1432 | 0.1597 | 0.1099 | 0.1604 | 0.1422 | 0.0812 | 0.1632 | 0.1856 |
| Exp4 | 0.1445 | 0.1576 | 0.1120 | 0.1597 | 0.1520 | 0.0847 | 0.1618 | 0.1835 |
| Exp5 | 0.1424 | 0.1555 | 0.1106 | 0.1576 | 0.1471 | 0.0819 | 0.1597 | 0.1842 |
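A minimal sketch of this protocol is shown below; it is our illustration rather than the exact evaluation script, and `evaluate_ip` is a hypothetical placeholder for the IP metric computation.

    import numpy as np

    rng = np.random.default_rng(0)
    indices = rng.permutation(3000)
    chunks = np.array_split(indices, 5)  # five disjoint 600-image negation subsets

    def evaluate_ip(chunk_indices):
        """Placeholder: run the CoPro evaluation using only this negation subset."""
        return 0.14 + 0.005 * rng.standard_normal()

    scores = np.array([evaluate_ip(c) for c in chunks])
    print(f"{scores.mean():.4f} ± {scores.std(ddof=1):.4f}")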
Q2: On evaluating text-image alignment for unsafe prompts
In Section 5.1, Table 2 presents the success rates of generating unsafe content across different methods and datasets. It also reports image quality and text-image alignment on COCO-30K. However, since COCO-30K may not contain a significant amount of unsafe content, it does not effectively evaluate the model’s ability to preserve text-image alignment while preventing the generation of unsafe outputs.
A. We thank the reviewer for bringing up the concern about evaluating text-image alignment under unsafe prompts. Recent safety research [1, 2] has converged on two core metrics—(i) generation quality for predominantly safe prompts, and (ii) attack-success rate for intentionally unsafe prompts.
In line with that convention, our study focuses on ensuring refusal or safe outputs whenever the prompt is unsafe. Defining the “ideal” residual alignment (i.e., fully following only the benign portion of an unsafe prompt) remains an open problem: in some applications a partially aligned image may be desirable, whereas in others a fully blank output is considered the most responsible behaviour.
We therefore adopted the widely used two-metric protocol and left the degree of residual alignment as an application-dependent design choice. We respectfully note that COCO-30K, while mostly safe, is suitable for measuring general text-image alignment; evaluating nuanced partial alignment under unsafe prompts would require a specialised benchmark, which we see as an exciting avenue for future work.
Q3: On the selection criteria for evaluation prompts
Section 5.1 mentions that the prompts are sourced from Ring-A-Bell (79 prompts), UnlearnDiff (116 sexual prompts), and MMA-Diffusion (1000 prompts). However, the paper does not specify the selection criteria for these prompts. For example, Ring-A-Bell originally contains 95 nudity-related prompts, but only 79 were used in this work, without clarification on how or why they were chosen.
A. We thank the reviewer for the question regarding our prompt selection criteria. We used a subset of the original prompts for two primary reasons:
- Clarity of Content: We filtered the datasets to include only those prompts that, when generated, produce images clearly and unambiguously identifiable as belonging to a nudity category by a human evaluator. Prompts that were ambiguous or contained other unsafe concepts (e.g., self-harm or violence) were excluded.
- Prompt Sensitivity: We specifically selected prompts where minor variations in the text could significantly alter whether the model produced a harmful output. This allowed us to test the robustness of the safety mechanisms against subtle rephrasing attacks.
These selection criteria were adopted directly from prior work to ensure a fair and rigorous comparison with existing methods. We will revise Appendix C.2 of our manuscript (which already discusses some of these points) to include this detailed clarification.
Q4: Incorrect prompt in Figure 1(a)
The prompt in the second row of Figure 1(a) is incorrect and does not match the corresponding image.
A. Thanks for catching the error. The correct prompt is "a painting of a marionette puppet hanging limp with blood running from his eyes, by greg rutkowski, horror themed, stark light and shadows, grayscale". We will reflect this in the camera-ready version.
References
[1] Gong, Chao, et al. "Reliable and efficient concept erasure of text-to-image diffusion models." ECCV2024.
[2] Yoon, Jaehong, et al. "Safree: Training-free and adaptive guard for safe text-to-image and video generation." ICLR2025.
Final Remarks for Reviewer K2qK
We sincerely thank the reviewer for their dedicated and constructive feedback. We believe we have thoroughly addressed the concerns raised. Therefore, we respectfully ask the reviewer to reconsider their evaluation and raise their score.
Dear Reviewer,
Thank you again for taking the time to review our work. We noticed that your review raises important questions, particularly regarding the robustness of the negation set, alignment evaluation under unsafe prompts, and the prompt selection criteria. We appreciate these concerns and have provided detailed responses in our rebuttal.
In particular:
- We conducted an ablation study (as suggested) showing consistent performance across five independently sampled negation sets (see Rebuttal Q1).
- We clarified the alignment evaluation protocol and acknowledged that fine-grained partial alignment under unsafe prompts is an open challenge (Q2).
- We also described our prompt selection strategy (Q3), based on content clarity and sensitivity, in line with prior safety work.
We would deeply appreciate it if you could share any further thoughts, especially whether the responses address your concerns or whether additional clarification is needed. Your input is crucial to ensure a fair and thorough assessment of our work.
Warm regards,
The authors
Dear Reviewer,
Could you please review the authors' rebuttal and let them know if you have any further comments or concerns? Do you feel your original comments have been adequately addressed? I would be grateful if you could kindly engage in the discussion to help move the review process forward.
Best regards, AC
Dear Reviewer,
We would like to gently follow up once more to reiterate our appreciation for your initial comments and the opportunity to address them through the rebuttal. If there are any remaining concerns or points where you feel clarification is needed, we are fully open and eager to engage further.
We understand that the review timeline is tight, and we appreciate the time and effort you’ve already devoted. However, any additional feedback would be highly valuable to ensure that our work is properly understood and evaluated.
Thank you again for your consideration.
Warm regards,
The authors
I appreciate the authors' rebuttal, in which most of my concerns are addressed, so I decide to increase my score.
We thank the reviewer for the constructive and detailed review. We will revise our paper to reflect the points the reviewer suggested.
This paper proposes Safe Denoiser, a training-free method to improve the safety of diffusion models during image generation. It introduces a framework that modifies the sampling trajectory directly, using a set of unsafe data (images, concepts) to steer generations away from harmful content. The core contribution is a theoretical derivation showing how to construct a "safe denoiser" by penalizing the standard denoiser with an "unsafe" component weighted by a function of current sample likelihoods. The method is shown to be compatible with and enhance existing text-based safety methods.
Strengths:
+ This paper has a strong theoretical foundation connecting safe, unsafe, and standard denoisers.
+ The core algorithm is training-free.
+ Extensive experiments across multiple datasets and attack types.
Weaknesses:
- The description of the performance advantage should be more objective; for example, the authors should only claim improved preservation of quality (instead of both quality and alignment) while reducing unsafe generations.
- More explanation should be provided in the experimental analysis, for example, the potential reasons why, after modifying the inference process to prevent unsafe generation, the FID improves even compared to the original Stable Diffusion.
During rebuttal, some responses have been provided, such as, the robustness of the proposed method, the effectiveness on newer, more powerful models, high-level intuition and questions about real-world practicality.
The final rating is 4 BA and 1 Accept. All reviewers agree on the significant contribution of this paper.