Minimalist Concept Erasure in Generative Models
A concept erasure method that works for SOTA rectified flow DiT models
Abstract
Reviews and Discussion
This paper introduces "Minimalist Concept Erasure," a framework designed to remove unwanted concepts from generative models with minimal performance degradation. The core algorithmic idea involves learning a binary mask that selectively prunes neuron connections, guided by an end-to-end optimization process aimed at minimizing the distributional distance between outputs of the original and erased models. The authors validate their method primarily using the FLUX model, demonstrating improved effectiveness in concept erasure and enhanced robustness against specific adversarial attacks compared to several baseline approaches, while claiming to preserve image quality.
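To make the described recipe concrete, below is a minimal, self-contained sketch of mask-based erasure with a two-term objective. The toy MaskedFFN, the MSE surrogate losses, the unconditional stand-in used as the erasure target, and the 10.0 preservation weight are all illustrative assumptions; this is not the paper's KL-derived objective or its FLUX implementation.

```python
# Illustrative sketch only -- not the paper's method. A learnable gate per FFN
# neuron is binarized with a straight-through estimator; only the gates are
# trained so that the masked model diverges from the frozen original on the
# target concept while staying close to it on neutral inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedFFN(nn.Module):
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(2 * dim, hidden)  # input = [latent, condition]
        self.fc2 = nn.Linear(hidden, dim)
        self.gate_logits = nn.Parameter(torch.full((hidden,), 3.0))  # start near "keep all"

    def mask(self):
        soft = torch.sigmoid(self.gate_logits)
        hard = (soft > 0.5).float()
        return hard + soft - soft.detach()  # binary forward pass, soft gradients

    def forward(self, x, c):
        h = F.gelu(self.fc1(torch.cat([x, c], dim=-1)))
        return self.fc2(h * self.mask())

torch.manual_seed(0)
dim = 64
model, frozen = MaskedFFN(dim), MaskedFFN(dim)
frozen.load_state_dict(model.state_dict())
frozen.requires_grad_(False)

opt = torch.optim.Adam([model.gate_logits], lr=5e-2)  # only the mask is optimized
x = torch.randn(8, dim)                        # stand-in latents
c_target = torch.randn(1, dim).expand(8, dim)  # stand-in embedding of the erased concept
c_neutral = torch.randn(8, dim)                # stand-ins for neutral prompts
c_null = torch.zeros(8, dim)                   # stand-in for an unconditional input

for step in range(300):
    # erase: on the target concept, mimic what the frozen model does *without* it
    erase = F.mse_loss(model(x, c_target), frozen(x, c_null))
    # preserve: on neutral prompts, stay close to the frozen model
    preserve = F.mse_loss(model(x, c_neutral), frozen(x, c_neutral))
    loss = erase + 10.0 * preserve
    opt.zero_grad()
    loss.backward()
    opt.step()

kept = (torch.sigmoid(model.gate_logits) > 0.5).float().mean().item()
print(f"fraction of FFN neurons kept: {kept:.2f}")
```

In the paper itself, the objective is instead derived from KL divergences between output distributions and gradients are propagated through the full sampling chain; the sketch only conveys the overall structure of masking plus an erase/preserve trade-off.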
Questions for Authors
- Rigorous Definition of "Minimalism": Could you provide a more rigorous definition and quantification of "minimalism" in the context of your algorithm? How do you measure and compare the "minimal modification" achieved by your method against alternative approaches?
- Mechanism of Robustness Enhancement: Can you elaborate on the specific mechanisms through which neuron masking enhances robustness against adversarial attacks? Is there a theoretical or empirical analysis that supports your claim that neuron masking provides inherent robustness advantages over weight-tuning methods?
- Comprehensive Performance Evaluation: To address the "minimal performance degradation" claim more convincingly, could you include a more comprehensive evaluation of model performance beyond FID and SSIM, assessing diversity, novelty, controllability, and potential biases? Human evaluation studies or user preference tests could also provide valuable insights.
- Broader Model and Concept Validation: To strengthen the generalizability claim, could you present experimental results on a wider range of generative model architectures (e.g., GANs, VAEs) and concept categories, including more complex and abstract concepts?
- Statistical Significance and Practical Significance: Could you include statistical significance testing for all quantitative comparisons and provide a more in-depth discussion of the practical significance of the observed performance differences, considering the computational cost and complexity of your method?
Claims and Evidence
While the paper presents quantitative and qualitative evidence, several claims could benefit from more rigorous and nuanced validation.
- Claim 1: "Minimal Performance Degradation"
The claim that minimalist concept erasure robustly erases concepts with minimal overall model performance degradation is not fully supported by the presented evidence. While metrics such as FID and SSIM scores are informative, they do not comprehensively capture all facets of model performance. Essential dimensions like diversity, novelty, and controllability remain insufficiently evaluated. For instance, it remains unclear whether the method preserves stylistic variations or fine-grained details. Additional metrics and analyses would help substantiate a more holistic performance assessment.
- Claim 2: "Robustness Against Adversarial Attacks"
Although robustness against adversarial attacks is demonstrated, the scope of validation is limited. The paper tests only a few adversarial prompts (e.g., Ring-A-Bell, MMA-Diffusion, P4D, I2P), leaving uncertainty about performance under more diverse or sophisticated adversarial scenarios. Additionally, the mechanism behind increased robustness through neuron masking is not thoroughly explained—is it due to increased sparsity, modified decision boundaries, or another factor? A broader and deeper adversarial analysis, including white-box and transfer attacks, would strengthen this claim.
- Claim 3: "Model-Agnosticism"
The claim of model-agnostic applicability appears overly broad given that experiments focus primarily on rectified flow models (FLUX and SD-XL). The paper lacks evidence demonstrating effectiveness across fundamentally different generative architectures, such as GANs or VAEs. Either this claim should be qualified or supported by additional experiments involving diverse generative architectures to validate broader applicability.
Methods and Evaluation Criteria
The proposed minimalist concept erasure method has notable strengths but also clear limitations in both method design and evaluation.
- Strength: Minimalist Objective
The "minimalist" objective—focusing on minimal modifications guided by output distribution—is conceptually appealing, directly addressing concerns about excessive modifications in existing concept erasure methods.
- Weakness: Insufficient Algorithmic Detail
The current description of the algorithm lacks clarity on key details. How is the binary mask initialized and optimized? Are additional loss functions beyond distributional distance utilized? How is the trade-off between concept erasure effectiveness and performance preservation explicitly managed during optimization? Providing more detailed algorithmic specifications would facilitate reproducibility and deeper understanding.
- Weakness: Evaluation Metrics Limitations
While the metrics (ACC, CLIP, FID, SSIM) are standard, they do not completely capture perceptual quality, diversity, controllability, or potential biases. A comprehensive evaluation including human assessments and metrics targeting diversity, novelty, and bias would enrich the analysis. For instance, low FID does not necessarily guarantee perceptually satisfying or diverse image outputs.
- Weakness: Limited Baseline Comparison
Although the paper compares against methods such as ESD, CA, SLD, EAP, and FlowEdit, it lacks sufficient justification regarding baseline selection and deeper qualitative comparisons. A more detailed and nuanced analysis of these baselines, emphasizing their relative strengths and limitations across varied scenarios, would contextualize the paper's contributions more effectively.
Theoretical Claims
The paper currently lacks explicit theoretical claims and formal proofs. While the minimalist objective is intuitively appealing, rigorous theoretical justification would significantly enhance the paper. A theoretical analysis—possibly leveraging optimization theory, information theory, or network pruning theory—could clarify why minimalist concept erasure is effective and efficient. For instance, examining the relationship between network sparsity and robustness or exploring convergence properties of the proposed optimization method would substantially strengthen the paper.
Experimental Designs or Analyses
The experimental approach demonstrates several limitations impacting robustness and generalizability:
- Weakness: Limited Granularity in Ablation Studies
The presented ablation studies lack sufficient depth, primarily considering coarse-level masking strategies (Attn, FFN, Norm). A more fine-grained ablation study evaluating variations such as masking ratios, specific layers, or strategies would provide valuable insights into method sensitivity and optimal configuration.
- Weakness: Insufficient Qualitative and Error Analysis
The paper emphasizes quantitative metrics with limited qualitative visual examples. Conducting more thorough error analyses, examining failure cases, common artifacts, or method limitations, alongside detailed qualitative assessments (e.g., human evaluations or user preference tests), would deliver richer and more comprehensive insights into the method’s practical effectiveness and limitations.
Supplementary Material
The supplementary material provides the source code implementation, which is valuable for practical replication. However, because theoretical derivations and mathematical proofs are absent and the provided code was not executed during this review, the clarity and reliability of the algorithmic details remain uncertain.
Relation to Existing Literature
The paper adequately references relevant scientific literature on concept erasure in generative models but would benefit from:
- Connecting the method more explicitly to broader research themes such as model compression, adversarial robustness, and fairness.
- Discussing relevant theoretical frameworks such as information bottleneck theory or rate-distortion theory to enrich theoretical understanding.
Supplementary references to key literature on sparsity, information bottleneck theory, and adversarial robustness would further contextualize and enrich the paper.
Essential References Not Discussed
As mentioned previously, essential references that could enrich the paper's context include:
- Sparsity and Network Pruning: "Learning both Weights and Connections for Efficient Neural Networks" (Han et al., 2015), "The Lottery Ticket Hypothesis: Training Pruned Neural Networks" (Frankle & Carbin, 2018).
- Information Theory and Model Compression: Relevant works on the information bottleneck principle and rate-distortion theory.
Other Strengths and Weaknesses
Strengths:
- Conceptually Appealing Approach: The minimalist objective intuitively addresses key concerns in concept erasure.
- Practical and Scalable Framework: Combining neuron masking and end-to-end optimization offers a potentially practical and scalable solution.
- Promising Initial Results: Despite limitations, experimental outcomes indicate potential effectiveness worthy of further exploration.
Weaknesses:
- Lack of Theoretical Depth: The paper currently lacks rigorous theoretical grounding for algorithmic decisions.
- Experimental Evaluation Limitations: Evaluations demonstrate weaknesses in statistical rigor, evaluation scope, and depth, limiting robustness and generalizability.
- Overstatement of Claims: The novelty, effectiveness, robustness, and general applicability are occasionally overstated without sufficient supporting evidence or nuanced limitation discussions.
Other Comments or Suggestions
- Reconsider the framing of "minimalism." While conceptually appealing, the term might be misleading without a more rigorous definition and justification. Consider focusing on "efficient" or "targeted" concept erasure instead.
- Emphasize the limitations and potential trade-offs of the proposed method more explicitly throughout the paper, providing a more balanced and realistic perspective.
Thank you for recognizing the strengths of our work. We’re glad you found the minimalist objective conceptually appealing, and we appreciate your acknowledgement of the practicality and scalability of our framework. We reply to your concerns below:
“While metrics such as FID and SSIM scores are informative, they do not comprehensively capture all facets of model performance.”
The use of ACC, CLIP, FID, and SSIM is well justified, as these metrics are commonly used for evaluation in prior unlearning works. Please refer to our Related Works section for supporting literature.
“Essential dimensions like diversity, novelty, and controllability remain insufficiently evaluated. For instance”
Could you elaborate on how diversity, novelty, and controllability should be evaluated, and suggest concrete metrics for each?
“The paper tests only a few adversarial prompts (e.g., Ring-A-Bell, MMA-Diffusion, P4D, I2P), leaving uncertainty about performance under more diverse or sophisticated adversarial scenarios”
Adversarial attacks on unlearning are a newly emerged research area, and only a few related works are available. In this work, we selected SOTA adversarial attack methods (Ring-A-Bell from ICLR 2024, MMA-Diffusion from CVPR 2024, P4D from ICML 2024, and I2P from CVPR 2023). We believe our choices are representative. We welcome any specific suggestions you may have.
“The paper lacks evidence demonstrating effectiveness across fundamentally different generative architectures, such as GANs or VAEs”
Our method can be easily extended to GANs and VAEs, since both perform single-step generation. However, studies on GANs and VAEs are of limited importance given their limited generative ability, so we chose to evaluate our method on SOTA models. This is also acknowledged by reviewers B5EZ and BukH.
“The current description of the algorithm lacks clarity on key details. How is the binary mask initialized and optimized? Are additional loss functions beyond distributional distance utilized? How is the trade-off between concept erasure effectiveness and performance preservation explicitly managed during optimization?”
Our final loss is precisely described in L172, i.e., no other loss is utilized. In addition, Appendix H provides implementation details and the training configuration. The trade-off between concept erasure effectiveness and performance preservation is discussed in L370-375 and Figure 5 in the ablation section.
“The paper currently lacks explicit theoretical claims and formal proofs.”
Our loss is rigorously derived from the KL divergence between generative distributions of the model. The problem formulation and core derivation are presented in Section 3. Appendix A shows the full derivation of the preservation loss, Appendix B the full derivation of the erasure loss, and Appendix C the full loss derivation for diffusion models. Appendix D shows the connection with the step-wise loss by analyzing a different KL divergence setup.
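As an illustration of the overall shape of such a KL-based objective over final outputs (the notation below is generic and not the paper's exact equations, which are given in Section 3 and Appendices A–D):

```latex
% Illustrative shape only, not the paper's exact loss. m: binary neuron mask,
% p_theta: original model's distribution over final outputs x_0,
% p_{theta \odot m}: masked model's distribution, lambda: trade-off weight.
\min_{m}\;
\underbrace{\mathbb{E}_{c_0 \sim \mathcal{C}_{\mathrm{neutral}}}
  D_{\mathrm{KL}}\!\big(p_{\theta}(x_0 \mid c_0)\,\|\,p_{\theta \odot m}(x_0 \mid c_0)\big)}_{\text{preservation}}
\;-\;\lambda\,
\underbrace{\mathbb{E}_{c \sim \mathcal{C}_{\mathrm{erase}}}
  D_{\mathrm{KL}}\!\big(p_{\theta}(x_0 \mid c)\,\|\,p_{\theta \odot m}(x_0 \mid c)\big)}_{\text{erasure}}
```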
“The presented ablation studies lack sufficient depth”
We present four other ablation studies besides the module ablation; they are closely related to the algorithmic choices made in this work.
“However, due to the absence of theoretical derivations or mathematical proofs and since the provided code was not practically executed during this review, the clarity and reliability of the algorithmic details remain uncertain.”
We want to emphasize that Appendices A, B, C, and D provide sufficient mathematical proofs. All these proofs are appended to the main text rather than placed separately in the supplementary materials. Regarding the issue with executing our code, we would like to know more details so we can help you reproduce our results.
“The paper emphasizes quantitative metrics with limited qualitative visual examples. Conducting more thorough error analyses, examining failure cases … ”
Besides Figures 3-6 in the main text, we present Figures 7-14 in the Appendix, which include some failure cases among the qualitative examples caused by incorrect setups or hyperparameter changes.
“Is there a theoretical or empirical analysis that supports your claim that neuron masking provides inherent robustness advantages over weight-tuning methods?”
As stated in L201, we build on the finding that masking leads to improved robustness performance [1].
We hope our response has addressed your concerns, and we’re happy to answer any further questions you may have. If there are no additional concerns, we would be grateful if you would consider further supporting this work by raising your rating.
[1]: Pruning for robust concept erasing in diffusion models
Thank you for your response and clarification.
However, after carefully revisiting your paper, I realized the core reason for my confusion was the term "Minimalist" itself. Although you clearly define "minimal" in Section 3 as referring specifically to minimizing the distributional difference in the final outputs (minimal changes at the output-level), other parts of the paper still seem ambiguous. For example, certain statements regarding neuron masking in Section 3.6 could easily mislead readers—especially those familiar with network pruning—to interpret minimalism as referring to minimal parameter changes, weight-space distances (such as minimizing norms like ∥θ - θ'∥), or pruning fractions.
To prevent similar misunderstandings by other readers, I suggest explicitly reiterating and emphasizing within your main text—especially when first introducing the term or describing practical mechanisms—that your definition of minimalism strictly targets minimal changes in output distributions, not necessarily minimality in network parameters or pruning metrics.
If you clearly include these suggested clarifications within the main text, my concerns will be fully addressed, and I will gladly raise my evaluation score accordingly.
Thank you for your comment and for reconsidering our work.
We appreciate your acknowledgement that the term “minimalist” is clearly defined in Section 3. To further clarify this term and prevent potential misunderstandings, we will make the following revisions in the main text:
- Add a label in Figure 2 to highlight the minimal output difference.
- Revise Section 1 to include a more descriptive explanation when the term “minimalist” is first introduced, and add a reference to Section 3 and Equation 5 for readers seeking a more rigorous definition.
- Revise Section 3.6 to emphasize that we mask neurons for robust concept erasure, and that our definition of “minimalist” does not refer to minimal parameter change.
Thank you again for your thoughtful feedback and for helping us improve our work.
The paper introduces a concept erasure method that is minimal in design, requiring only the final output of the diffusion model, rather than access to intermediate timesteps. In addition, the authors propose a neuron masking technique as an efficient alternative to traditional fine-tuning. Both approaches demonstrate strong erasure performance, particularly when applied to flow-matching models such as FLUX.
Questions for Authors
Please see Experimental Designs Or Analyses section
Claims and Evidence
Yes, I believe that the claims in this submission are supported by extensive evaluation.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria make sense for concept erasure. Most concept erasure works evaluate on object erasure, artist style erasure, object removal, and some copyrighted characters. This work follows this evaluation setup, along with evaluation on adversarial attacks.
Theoretical Claims
I have checked the correctness of the preservation loss (Appendix A), the erasure loss (Appendix B), full loss derivation (Appendix C) and connection with step-wise concept erasure loss. While I may not have understood it completely, I can say that it is sound.
Experimental Designs or Analyses
- My main concern is that several recent concept erasure works have been ignored. Can you please add them so that we can compare the proposed method against similar baselines?
[1] One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications [2] MACE: Mass Concept Erasure in Diffusion Models [3] Receler: Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers [4] Selective Amnesia: A continual learning approach to forgetting in deep generative models, Heng et al [5] Forget-me-not: Learning to forget in text-to-image diffusion models. Zhang et al [6] Scissorhands: Scrub data influence via connection sensitivity in networks, Wu et al.
- It is unclear to me if the authors trained different models for every concept or separate models with inappropriate objects, IP characters, nudity, and art styles removed? If not, I would suggest that the authors train for a multi-concept erasure method, similar to UCE.
- Can you provide more details on how you trained ESD, UCE, and other ablations with FLUX?
- Many concept erasure methods like ESD, UCE and [5] have shown that cross-attention plays a major role in learning undesired concepts. But most of these methods have considered SDv1.5, while this paper considers a transformer based FLUX model? Do the authors have any insight into which attention blocks (joint attention/self-attention) in FLUX play a role in generating these concepts? Does it help if we can selectively tune these attention blocks? This is not a weakness of the paper, so please don't feel pressured to do this analysis.
Supplementary Material
I checked the proofs and the results in the appendix.
Relation to Existing Literature
The key contributions of this paper are towards concept erasure in flow-based models like FLUX. I think this is the first work to consider newer models. Their results seem to be better than other concept erasure methods considered in the paper. However, I still feel that the authors ignore many recent concept erasure works.
Essential References Not Discussed
Please see references in Experimental Designs Or Analyses Section that I think the paper has not considered.
Other Strengths and Weaknesses
Please see the Experimental Designs or Analyses section
Other Comments or Suggestions
Minor typo in the introduction. line 68 - full stop needed before 'In practice'
Thank you for the thoughtful and thorough review. We greatly appreciate your acknowledgment that our claim is well-supported by the evaluation with conventional settings as well as adversarial settings. We also appreciate your review of our theoretical derivations and acknowledgement of their soundness. We address your concern below:
1. Additional concept erasure works:
We appreciate your comment regarding additional baselines. From your references, Receler edits features after cross-attention modules, MACE finetunes cross-attention modules, and Forget-Me-Not performs attention re-steering on cross-attention modules. Since all three methods rely on cross-attention, they are not directly applicable to FLUX, which uses MM-attention instead. As for One-dimensional Adapter, SA, and Scissorhands, including them as baselines poses significant challenges, as they do not provide official implementations for FLUX, and adapting them to FLUX requires substantial effort due to their complexity. We will include these baselines and discuss them in our related work.
2. Training on separate models or a monolithic model
In our evaluation, we train separate models for each category that contains multiple concepts—an evaluation setting commonly adopted in prior unlearning works. Additionally, we present a result where a single monolithic model is trained to unlearn concepts across multiple categories. As shown in the table, unlearning multiple concepts across different categories is more challenging compared to unlearning concepts in a single category.
| Number of Unlearned Concepts | CLIP ↑ | FID ↓ | SSIM ↑ |
|---|---|---|---|
| FLUX (Original) | 0.31 | 40.4 | - |
| 10 (IP + Styles) | 0.29 | 44.4 | 0.55 |
| 20 (IP + Styles + Nudity) | 0.23 | 49.3 | 0.53 |
| 50 | 0.20 | 58.3 | 0.48 |
Table: COCO performance as the number of unlearned concepts increases. Arrows indicate preferred direction.
3. Details on how baselines are compared:
Since many unlearning works are implemented on SD1.4 and lack official support for FLUX, we reimplemented all baselines used in this work. For a fair comparison (the published hyperparameters were originally proposed for SD1.4), we performed a hyperparameter search for each baseline to find the best-performing configuration, which we used in our experiments. Details of the baseline implementations are given in Appendix H.2. One qualitative example of ESD hyperparameter performance on FLUX is shown in Figure 12. We will update the appendix to include the full set of baseline hyperparameters. Additionally, we plan to release our FLUX implementations of these baselines upon acceptance to support reproducibility.
4. Do the authors have any insight into which attention blocks (joint attention/self-attention) in FLUX play a role in generating these concepts?
On FLUX, we observe behavior that differs significantly from U-Net-based SD models (e.g., SD1.4). Unlike SD models, which generate concepts in cross-attention layers, we observe that normalization layers and FFN layers play a critical role in generating concepts. This is supported by our ablation study in Table 4. We also present additional qualitative results in Figure 11 in Appendix K.2. These findings motivate our decision to exclude attention modules from FLUX unlearning, as noted in Line 382.
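As a purely illustrative sketch of this design choice, the helper below restricts candidate mask sites to FFN and normalization modules while skipping attention; the name patterns ("attn", "ff", "mlp") and the nn.LayerNorm check are assumptions about a generic transformer implementation, not FLUX's actual module layout.

```python
# Illustrative sketch (not the paper's code): collect FFN and normalization
# modules of a transformer as candidate mask sites while leaving attention
# untouched. The name patterns below are assumed conventions and will differ
# across real DiT/FLUX implementations.
import torch
import torch.nn as nn

def maskable_modules(model: nn.Module) -> dict:
    selected = {}
    for name, module in model.named_modules():
        lname = name.lower()
        if "attn" in lname or "attention" in lname:
            continue  # skip anything under an attention block
        if isinstance(module, nn.Linear) and ("ff" in lname or "mlp" in lname):
            selected[name] = module  # feed-forward projections
        elif isinstance(module, nn.LayerNorm):
            selected[name] = module  # normalization layers
    return selected

# usage sketch: one learnable gate per output neuron of each selected Linear
# gates = {name: nn.Parameter(torch.ones(m.out_features))
#          for name, m in maskable_modules(model).items()
#          if isinstance(m, nn.Linear)}
```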
We hope our response has addressed your concerns, and we’re happy to answer any further questions you may have. We sincerely appreciate your recognition of our work as a meaningful contribution to the current unlearning research with SOTA FLUX model. We would be grateful if you would consider further supporting this work by raising your rating.
Thank you for your comprehensive rebuttal. I understand that Receler and Forget-me-not methods may be hard to apply to FLUX model based on joint attention, and training the other baselines I mentioned may be difficult. I have a few questions in this regard -
- Have the authors tried to extend their method to other models like SD3.5, etc.? I would like to point out that proposing a very model-specific erasure method is of very little use to researchers; however, proposing a more general concept erasure method (perhaps only for MM-attention models) would be more useful. Many of the recent works that I proposed exhibit excellent performance, and it would be useful for us to understand the efficacy of minimalistic concept erasure if we can compare it with other recent works.
Thank you for providing some results for multi-concept erasure. Please include them in the paper.
Thank you for acknowledging our rebuttal.
Regarding the extension to SD3.5: SD3.5 shares the same architecture as FLUX (MM-DiT), and both are rectified flow models [1,2,3]. Therefore, we chose to evaluate our method on the better-performing FLUX model to show its effectiveness on SOTA model. Nevertheless, we tried our method on SD3.5 prior to submission, and it worked.
Regarding generalization to other models: Our method is model-agnostic by design. We also included additional results on SD-XL in our response to reviewer ab8s. This demonstrates that our method is effective across different architectures (U-Net and MM-DiT) as well as various training paradigms (diffusion and flow matching).
Thank you for your suggestion. We’ll include the multi-concept erasure results in the paper and discuss the additional recent works you recommended.
: https://huggingface.co/blog/sd3-5
This paper studies minimalist concept erasure, which aims to remove inappropriate content from a generative model with minimal modification to the original model, specifically in diffusion/flow matching models. Unlike previous approaches that operate on each step of the sampling chain, the training objective in this paper applies to the entire trajectory, prioritizing its effect on the final generated images. Additionally, the authors introduce a KL regularization term to ensure minimal changes to the model’s behavior. Across multiple benchmarks, this method demonstrates advantages over other compared approaches.
Questions for Authors
No
Claims and Evidence
Extensive empirical analysis supporting the theoretical findings, detailed in Appendices A–D. Effective content erasure while preserving the model’s functionality, as evidenced in Tables 2–5.
Methods and Evaluation Criteria
The final objective is given in Equation 16, where the first term ensures that the model’s output is not correlated with the concept c, while the second term encourages the model’s output to align with a reference model.
Theoretical Claims
The new loss is related to the step-wise loss function, as in Sec. 3.4 and Appendix D, and to alignment, as in Sec. 3.5.
Experimental Designs or Analyses
The authors run experiments on state-of-the-art flow-based models and inspect whether the model can still generate images with the erased concepts. At the same time, the authors also benchmark image quality to ensure the model's performance does not degenerate. The authors compared multiple methods and showed clear improvements.
Supplementary Material
The authors provide supplementary material with further loss derivations and more examples.
Relation to Existing Literature
This work is related to other step-wise concept erasing methods and to alignment approaches.
Essential References Not Discussed
This paper provides sufficient references
Other Strengths and Weaknesses
Strength: The primary strength of this paper is that the proposed objectives are well motivated and stated in a very concise form. The performance is also significantly better than the compared methods.
Weakness: This method needs to generate the entire chain, which is less efficient than step-wise methods.
Other Comments or Suggestions
No
Thank you for the thoughtful and thorough review. We’re glad you found our approach to minimalist concept erasure well-motivated and effective across benchmarks. We appreciate your feedback and the recognition of our theoretical and empirical contributions.
Regarding your concern, we acknowledge the efficiency trade-off introduced by backpropagating through the complete generation trajectory across all steps. Nevertheless, since unlearning requires only one training run, trading some runtime for improved performance is acceptable. Besides, we show that concept removal requires only a few GPU hours.
We’re happy to answer any further questions. If you don’t have any additional concerns and find that this work complements the current unlearning research, we would greatly appreciate it if you could raise your score to further support it.
The paper presents a technique to unlearn concepts from generative models. Unlike existing concept erasure techniques, the proposed technique unlearns concepts based only on the distributional distance of the final generation outcomes.
Questions for Authors
- What was the training time? How many neutral concepts are selected?
- What happens if the number of neutral concepts is reduced? Does the model behave unpredictably for neutral concepts?
Claims and Evidence
- The authors claim that “our method adopts a connectionist perspective, treating concepts as being stored in the interconnected structure of neurons.” However, this requires further analysis and discussion to fully understand the approach. Generative models inherently learn highly correlated concepts with complex interdependencies. It remains unclear how neuron masking is applied to concepts that exhibit strong correlations, such as “nudity” and “revealing clothing.” A more detailed explanation and experimental validation are necessary to clarify the handling of such cases.
- Figure 2 suggests that “the model learns an optimal trajectory as the gradient propagates through all generation steps.” However, this claim is not rigorously proven in the paper, and further justification is required to substantiate it.
- Additionally, the title “Minimalist Concept Erasure” appears misleading, as the paper does not quantitatively demonstrate that minimal changes are made to images or features. To justify this claim, the authors should either introduce an evaluation metric or provide a detailed analysis of feature-level modifications to verify the extent of erasure.
Methods and Evaluation Criteria
- The paper overlooks a crucial evaluation metric—assessing how well the unlearned model retains neutral concepts. Effective unlearning should remove specific concepts without inducing catastrophic forgetting, ensuring the model can still generate neutral, unrelated content. A dedicated analysis on this aspect is necessary.
- The proposed unlearning loss function closely resembles existing approaches, such as the Concept Ablation (CA) method. The formulation does not introduce significant innovations beyond prior work.
- The paper does not sufficiently address the scalability of neuron masking when unlearning multiple concepts. As the number of masked neurons increases, training stability may degrade. Additional justification and experimental validation are necessary to demonstrate the method’s robustness under varying numbers of masked neurons.
- For models with a large number of denoising steps, the backpropagation of gradients from the final step to the initial step accumulates all previous gradients, which could lead to instability or unintended interference at early denoising steps. The impact of long-range gradient dependencies on early denoising dynamics should be analyzed.
- The current loss function is applied only at the final denoising steps, leaving open the possibility that the target concept remains preserved in earlier steps. To rigorously validate concept erasure, the authors should empirically and mathematically demonstrate how the concept is removed across intermediate steps ( X_1, X_2, … ). This could be achieved by discarding stochasticity in denoising and analyzing the generative trajectory.
- Regarding the last point: one concrete analysis would be to examine how the feature space of the affected objects changes at different denoising steps. If the concepts are removed properly, the resulting features should be clearly separable from the original concept features.
Theoretical Claims
The authors theoretically claim that the loss is formulated to preserve neutral concepts; however, no empirical evidence is provided.
Experimental Designs or Analyses
- The results on the SDXL model are not provided, despite the authors claiming that their approach is suitable for diffusion models. Additionally, they state that these results are included in the appendix, but they are either not referenced or were inadvertently omitted. In either case, the absence of quantitative results raises concerns.
- The evaluation of neutral prompt concept preservation is missing, which is crucial for understanding how the model generalizes after unlearning specific concepts.
- In Equation 5, the authors mention that a neutral prompt set is created and sampled to compute the loss. However, there is no explanation regarding which neutral prompts the model is trained on or how these concepts are selected. Furthermore, details on the number of neutral concepts used for a given target concept are not provided, making it difficult to assess the robustness of the approach.
Supplementary Material
Discussed already.
Relation to Existing Literature
The contributions are related to existing literature. However, all related papers discuss the retention accuracy of the model on non-target concepts, which is missing from this paper.
Essential References Not Discussed
Some of the papers are missing and comparisons are also missing:
- Gong, C., Chen, K., Wei, Z., Chen, J., and Jiang, Y.-G. Reliable and efficient concept erasure of text-to-image diffusion models. In European Conference on Computer Vision, pp. 73–88. Springer, 2025.
- Hong, S., Lee, J., and Woo, S. S. All but one: Surgical concept erasing with model preservation in text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 21143–21151, 2024
- Kim, C., Min, K., and Yang, Y. Race: Robust adversarial concept erasure for secure text-to-image diffusion model. In European Conference on Computer Vision, pp. 461–478. Springer, 2025.
- Huang, C.-P., et al. Receler: Reliable concept erasing of text-to-image diffusion models via lightweight erasers. In European Conference on Computer Vision. Springer, 2024.
- Zhang, G., et al. Forget-me-not: Learning to forget in text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- Lu, S., et al. MACE: Mass concept erasure in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Some of these are discussed in the paper but comparison is missing.
Other Strengths and Weaknesses
Already discussed
Other Comments or Suggestions
Writing could be improved. Some equations use notation that is not defined. Check all equations carefully.
- Equation 22, for example. The statement "During backward propagation, we recompute the forward before gradient calculation." does not seem to be complete or correct.
- Appendix M, "Robustness study of unlearned models with neural prompts": I think the authors meant neutral prompts.
Thank you for the thoughtful and thorough review. We address your concern below:
Claims:
- As stated in L201, prior masking-based unlearning methods achieve accurate and robust results, which we build on due to their strong performance. Regarding neuron masking for strongly correlated concepts, Table 2 shows that model performance degrades minimally on neutral prompts (see the LAION dataset). Visual examples in L228 further illustrate that our method does not unlearn a concept by pushing it toward a predefined anchor or a contrastive concept (our result is neither nude nor heavily dressed). Further understanding of how the model functions internally falls under the domain of mechanistic interpretability research and is beyond the scope of this work.
- Our derivation in Appendix A shows how the preservation loss is derived from the KL divergence between the original and unlearned models. L263 explains how the mask gradient is composed of the gradients of all intermediate masks, following the chain rule.
- We conduct extensive evaluations of model performance post-unlearning. For target concepts, we report CLIP (text alignment), FID (image quality), and SSIM (structural similarity between original and unlearned outputs), with results in Table 1 showing our method outperforms baselines. For neutral prompts, we evaluate 5,000 samples from the LAION dataset (Table 2), demonstrating that our unlearned model maintains strong performance.
Methods and Evaluation Criteria:
- Please see our response to the third point under Claims.
- Our method fundamentally differs from CA: while CA applies per-step losses on intermediate variables, our loss operates only on the final output. Additionally, CA requires an anchor concept paired with each target concept, whereas our approach does not rely on any anchor.
- In our evaluation, we train separate models for each category that contains multiple concepts. This evaluation setting is adopted in almost all unlearning works. We present an additional result where one monolithic model is unlearned across multiple categories:

| Number of Unlearned Concepts | CLIP ↑ | FID ↓ | SSIM ↑ |
|---|---|---|---|
| FLUX (Original) | 0.31 | 40.4 | - |
| 10 (IP + Styles) | 0.29 | 44.4 | 0.55 |
| 20 (IP + Styles + Nudity) | 0.23 | 49.3 | 0.53 |
| 50 | 0.20 | 58.3 | 0.48 |

Table: COCO performance as the number of unlearned concepts increases.
- We do not observe any significant gradient instability issues, as modern model architectures utilize residual connections that help stabilize gradients during training (see the illustrative sketch after this list).
- One cannot conclude that an early step still preserves a concept simply because it is not altered, as it may no longer resemble the concept in the final outcome. After all, our concern is the presence of the concept at the last step.
- In response to your feedback, we performed a UMAP visualization on the final denoised latents. The results show clear separation between unlearned and original FLUX features using the same prompts. We will include this plot in the revision to further support our claims, alongside detection accuracy results.
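As a purely illustrative aside, the quoted statement about recomputing the forward pass before gradient calculation most likely refers to activation (gradient) checkpointing through the sampling chain. The sketch below shows that pattern under stated assumptions; the velocity_model signature, the Euler update, and final_output_loss are hypothetical and not the paper's training code.

```python
# Minimal sketch (assumptions, not the paper's implementation): run the full
# deterministic sampling chain with gradient checkpointing so that a loss on
# the final output can be backpropagated through every step while activations
# are recomputed during the backward pass instead of being stored.
import torch
from torch.utils.checkpoint import checkpoint

def sample_with_grad(velocity_model, x, cond, timesteps):
    """Euler integration of a rectified-flow ODE, keeping the autograd graph."""
    for t_curr, t_next in zip(timesteps[:-1], timesteps[1:]):
        # checkpoint re-runs velocity_model's forward during backward
        v = checkpoint(velocity_model, x, cond, t_curr, use_reentrant=False)
        x = x + (t_next - t_curr) * v
    return x

# usage sketch (names are hypothetical):
# x0 = sample_with_grad(masked_model, noise, prompt_emb, ts)
# loss = final_output_loss(x0)  # loss defined only on the final output
# loss.backward()               # gradients flow through all steps via the chain rule
```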
Experimental Designs Or Analyses:
- We apologize for the misunderstanding caused. We include the results on SD-XL below and will add them to the Appendix. In addition, Appendix C shows the derivation and final loss for diffusion models; we welcome you to check its correctness.

| Category | Method | ACC ↓ | CLIP ↑ | FID ↓ | SSIM ↑ |
|---|---|---|---|---|---|
| IP | Ours | 3% | 0.30 | 33.4 | 0.63 |
| IP | SDXL | 100% | 0.31 | 35.5 | – |
| Art Styles | Ours | 7% | 0.29 | 37.5 | 0.53 |
| Art Styles | SDXL | 89% | 0.31 | 35.5 | – |

Table: Concept erasure results (IP Characters and Art Styles) on SDXL.
- Please see our response to the third point under Claims.
- In the training details in Appendix E (Line 887), we state that neutral prompts are randomly selected from the GCC3M dataset. In addition, we use 100 neutral prompts during unlearning. We appreciate your suggestion on the detailed settings of neutral prompts and will include these details in the main text.
Questions:
- In Appendix H.1, we state that we train 400 steps on an H100 GPU. As replied above, we randomly select 100 GCC3M prompts as neutral prompts.
- We include one additional ablation study on the number of neutral prompts. Our method does not require a large number of neutral prompts.

| # Images | ACC ↓ | CLIP ↑ | FID ↓ |
|---|---|---|---|
| 100 | 4% | 0.31 | 35.4 |
| 50 | 4% | 0.31 | 39.7 |
| 10 | 12% | 0.27 | 48.2 |

Table: Ablation on number of neutral prompts.
We would like to thank you again for your thorough review. We are happy to answer any further questions.
The authors propose a generative-model concept erasure method built on a novel minimalist objective formulated only on the distributional distance of the final generation outputs. Reviewers had initial concerns about the theoretical aspects and evaluation criteria, which were addressed in the rebuttal. Two reviewers are satisfied with the responses.