EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers
Abstract
Reviews and Discussion
This paper highlights the limitations of existing concept-erasing methods, such as CA, ESD, and UCE, which were developed for Stable Diffusion models utilizing U-Net, cross-attention, and CLIP text encoders. The authors argue that these methods are ineffective for Flux, a modern multi-modal diffusion transformer that employs a T5 text encoder. To address this gap, the paper proposes new loss functions for concept erasure.
Recognizing that concept erasure involves a bi-level optimization problem balancing both concept removal and irrelevant concept preservation, the authors integrate multiple loss terms: the original ESD loss (Equation 2), an attention-attenuation loss (Equation 3), a diffusion loss (Equation 4), and a reverse self-contrastive loss (Equation 5) to preserve irrelevant concepts. The experiments primarily focus on nudity removal, but also include tests on entity, abstraction, and relationship-based concepts, as well as celebrity face removal.
Questions for Authors
No additional questions.
Claims and Evidence
This paper argues that existing text-to-image concept-erasing methods, such as CA, ESD, and UCE, which were originally developed for Stable Diffusion architectures, fail to generalize to the Flux architecture. The authors demonstrate this claim visually in Figure 1, showing that applying methods like ESD, UCE, and EAP to Flux does not effectively erase concepts such as "nude." However, the evidence provided in this paper is insufficient to fully support this claim.
Firstly, according to line 363, the authors conducted experiments that fine-tune only "add_k_proj" and "add_q_proj" within the dual-stream blocks of Flux. This limited approach raises concerns because it excludes other potentially crucial layers, such as all attention layers across both the 19 dual-stream and 38 single-stream blocks ("to_k", "to_q", "to_v", "to_out.0"), which could significantly influence concept erasure. To demonstrate that methods designed for Stable Diffusion are ineffective for Flux, a more comprehensive evaluation involving fine-tuning of all relevant attention layers is necessary. In the appendix, the authors mention excluding "add_v_proj" and "to_v" due to numerical sensitivity, yet fine-tuning these layers is common practice in the community [A].
Moreover, the authors primarily rely on LoRA-based fine-tuning, which inherently preserves the original concepts within pre-trained model weights. Thus, to convincingly demonstrate genuine concept erasure, experiments involving full fine-tuning are required.
Secondly, Flux employs not only the T5 encoder but also the CLIP text encoder. Since Flux utilizes both encoders, it remains unclear whether concept-erasing methods would perform differently if we use the CLIP encoder alone.
[A] huggingface, https://github.com/huggingface/diffusers/tree/main/examples/dreambooth
Methods and Evaluation Criteria
The evaluation methods generally make sense for assessing quality. However, the claim in Section 5.2 that EraseAnything provides an "erase anything" solution requires additional supporting evidence. Specifically, the authors should demonstrate the method's effectiveness on diverse and challenging cases, such as erasing color (e.g., red rose, green bag) or object count (e.g., two oranges, three cats).
Theoretical Claims
The target in the loss function is a scalar value under the L2 norm, whereas ESD's objective function regresses the model's full prediction toward the guided prediction of the pre-trained model.
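For reference, the objective stated in the original ESD paper regresses the fine-tuned model's full noise prediction toward a negatively guided target built from the frozen pre-trained model (a tensor-valued target, not a scalar):

$$
\min_{\theta^*} \; \mathbb{E}_{x_t,\, c,\, t} \left\| \epsilon_{\theta^*}(x_t, c, t) - \big( \epsilon_{\theta}(x_t, t) - \eta \left[ \epsilon_{\theta}(x_t, c, t) - \epsilon_{\theta}(x_t, t) \right] \big) \right\|_2^2
$$

where $\theta^*$ denotes the fine-tuned model and $\theta$ the frozen pre-trained model.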
Experimental Design and Analysis
As mentioned previously, the experimental analysis should include results from LoRA and full fine-tuning across all attention layers.
Supplementary Material
Misleading argument on trainable parameters in Section A. See “Claims And Evidence”
Relation to Existing Literature
Concept erasure in generative models is a key research challenge in text-to-image generation. This paper extends prior works, such as ESD, by adapting the loss formulation to transformer-based architectures like Flux.
Missing Essential References
No issues found.
Other Strengths and Weaknesses
See “Claims And Evidence”
Other Comments or Suggestions
Additional evaluation with Stable Diffusion 3.
Thank you for your detailed comments and interest in our work!
- (A) Limited Fine-Tuning due to VRAM Constraints
We acknowledge the reviewer's concern regarding limited fine-tuning. Due to the 80GB VRAM constraint of our single A100, full fine-tuning was infeasible. We opted for LoRA, prioritizing the layers with the most significant impact on text-to-image generation. Optimizing `to_q` and `to_k` degraded image quality without effective concept erasure, while `add_q_proj` and `add_k_proj` proved effective. Optimizing `add_v_proj` and `to_v` yielded noisy outputs, leading to their exclusion. We recognize the architectural differences between Flux and Stable Diffusion (MMDiT vs. U-Net, Rectified Flow vs. DDPM/DDIM) and base our conclusions on empirical observations.
- (B) T5 Dominance in Flux Generation
Our experiments demonstrate that the T5 encoder significantly influences Flux's generation. As shown here, `prompt_embeds` from T5 acts as `encoder_hidden_states`, similar to CLIP in SD models. Conversely, `pooled_prompt_embeds` from CLIP primarily affects the time embeddings, with minimal impact on the final output. Adding noise to the T5 features drastically altered the output, while changes to the CLIP features were negligible. Therefore, we focused on T5.
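For reproducibility, here is a minimal sketch of this perturbation test, assuming a loaded `FluxPipeline` named `pipe` and a prompt string `prompt`; the noise scale of 0.5 is an illustrative assumption, not our exact setting:

```python
import torch

# encode_prompt returns (T5 prompt_embeds, CLIP pooled_prompt_embeds, text_ids).
prompt_embeds, pooled_embeds, _ = pipe.encode_prompt(prompt, prompt_2=prompt)

# Perturb the T5 features: the output changes drastically.
noisy_t5 = prompt_embeds + 0.5 * torch.randn_like(prompt_embeds)
img_t5 = pipe(prompt_embeds=noisy_t5, pooled_prompt_embeds=pooled_embeds).images[0]

# Perturb the CLIP pooled features: the output is nearly unchanged.
noisy_clip = pooled_embeds + 0.5 * torch.randn_like(pooled_embeds)
img_clip = pipe(prompt_embeds=prompt_embeds, pooled_prompt_embeds=noisy_clip).images[0]
```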
- (C) EraseAnything: Quantity and Color Validation
We appreciate the reviewer's request for diverse examples. To further validate EraseAnything's robustness, we provide examples demonstrating erasure of quantity and color: "green" from "green bag," "red" from "red rose," "five" from "five pencils," and "three" from "three cats" (image). Combined with supplementary material, this reinforces EraseAnything's effectiveness.
- (D) Rationale for Excluding SD3
We excluded SD3/SD3.5 due to their comparatively lower general image-generation performance. Given Flux's status as a flagship model developed by Robin Rombach's team (Black Forest Labs; Rombach is widely regarded as the father of Stable Diffusion), we believe our findings on Flux sufficiently demonstrate EraseAnything's capabilities.
Image URLs:
- T5 vs. CLIP: https://imgur.com/a/047aypl
- Erase Quantity & Color: https://imgur.com/a/TIxXi9u
I appreciate the authors’ responses to points (B) and (C) in the rebuttal. However, I still have concerns regarding (A) and (D).
I understand that full fine-tuning may be infeasible due to VRAM limitations. That said, regarding the use of LoRA, I remain unconvinced that fine-tuning only the “add_k_proj” and “add_q_proj” layers in Flux is sufficient, as these exist in only 19 of the 57 transformer blocks.
In this context, I believe it would be meaningful to evaluate the proposed method on SD3, where all blocks are dual-stream and contain “add_k_proj” and “add_q_proj” layers. This would help clarify whether the issue is specific to Flux’s architecture or generalizable to DiT-based models.
Since the core motivation of this work is that methods effective on U-Net-based models do not transfer to DiT-based models, a deeper examination of the trainable parameter choices is central to the paper's contribution.
Thank you for the insightful comments.
Regarding point (A) and the choice of `add_q_proj` and `add_k_proj` for LoRA fine-tuning: our reasoning is partially illustrated by the code snippet below. Specifically, `encoder_hidden_states` (carrying the text conditioning) is projected via `encoder_hidden_states_query_proj` and `encoder_hidden_states_key_proj` (these correspond to `add_q_proj`/`add_k_proj` in the diffusers implementation). These projections are then concatenated with the image features' query and key vectors, respectively:
```python
# Assuming encoder_hidden_states_*_proj correspond to the add_q_proj / add_k_proj
# layers; text-token queries/keys are prepended to the image-token queries/keys.
query = torch.cat([encoder_hidden_states_query_proj, query], dim=2)
key = torch.cat([encoder_hidden_states_key_proj, key], dim=2)
# The value projection is separate and is not targeted here.
...
# Attention scores are computed over the combined text+image sequence.
attn_weight = query @ key.transpose(-2, -1) * scale_factor
```
By applying LoRA to `add_q_proj`/`add_k_proj`, we directly modify the weights that project the text conditioning before it influences the attention scores (`attn_weight`) computed w.r.t. the image features. This provides a targeted way to modulate how a text concept influences the image-generation process at these specific cross-attention points.
We acknowledge the reviewer's observation that these layers exist in only 19 of the 57 blocks in Flux [schnell/dev]. While tuning only a subset may seem incomplete, we hypothesize that these particular dual-stream blocks are critical junctions for integrating text-based conceptual information.
Our empirical results suggest that modifying the text injection mechanism even within these key blocks provides a sufficiently strong and targeted signal to guide the model effectively for concept manipulation tasks. The visualization linked below, which shows the image impact of altering attention weights (related to the output of these layers), offers some qualitative support for the sensitivity of the generated output to modifications within this mechanism: https://imgur.com/a/TSTnMBO.
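For concreteness, here is a minimal sketch of how such a targeted LoRA can be attached via PEFT in diffusers; the rank and alpha values are illustrative assumptions, not our exact training configuration:

```python
import torch
from diffusers import FluxPipeline
from peft import LoraConfig

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
# Target only the text-conditioning projections of the dual-stream blocks;
# PEFT matches these module-name suffixes inside pipe.transformer.
lora_config = LoraConfig(r=16, lora_alpha=16,
                         target_modules=["add_q_proj", "add_k_proj"])
pipe.transformer.add_adapter(lora_config)
```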
Regarding the suggestion to evaluate on SD3 and SD3.5: we agree with the reviewer and will add those experiments and incorporate the results into the final version of the paper to verify the adaptability of EraseAnything.
Given that current text-to-image models can generate inappropriate content related to pornography, violence, or copyright violations, the problem of effective concept erasure has become a critical research topic. Existing methods have proven effective for Stable Diffusion but are challenging to directly adapt to SD3 and FLUX.
This paper investigates concept-erasure algorithms on FLUX, highlighting the differences between FLUX and SD in terms of model structure and encoder properties. The authors propose a robust concept-erasure algorithm leveraging bi-level optimization techniques, integrating Forget-Me-Not, Erasing Stable Diffusion, and reverse self-contrastive approaches. Experimental results demonstrate that their method achieves superior qualitative and quantitative erasure performance on the FLUX architecture, surpassing other existing concept-erasure algorithms.
update after rebuttal
The authors show the comparison between bi-level optimization and multi-objective optimization and explain why attention localization contributes to the Flux-based architecture and the generative field, which addresses my concern. However, the attention localization and the bi-level optimization are somewhat simple and borderline to me, so I keep my rating.
Questions for Authors
My main concern lies in the authors' overstatement of their method's contribution, as well as the claimed ease of adapting it to flow-matching-based diffusion models. The authors should answer the following questions to demonstrate the novelty and contribution of their method.
- What is the difference between bi-level optimization and multi-objective optimization?
- What is the in-depth analysis of attention localization? It seems to be a very straightforward idea.
- What is the real difficulty of adapting unlearning to flow matching? The method the authors use is just a combination of existing methods with slight modifications.
Claims and Evidence
- The paper appears to overclaim the contribution of bi-level optimization. In the introduction, the authors present bi-level optimization as a core contribution. However, as described in Algorithm 1, the proposed method simply alternates between optimizing the erasure and preservation losses, which is essentially a trivial multi-objective optimization implementation without notable innovation.
- The paper seems to overclaim the contribution of attention localization. In the introduction, the authors treat Attention Localization as a core contribution and refer to it as “a depth analysis.” However, as later sections describe, MMDiT applies a concat operation on Q and K, followed by a self-attention-like structure. Identifying the corresponding positions of image and text tokens in the attention matrix is straightforward and does not qualify as “a depth analysis.”
Methods and Evaluation Criteria
Yes.
Theoretical Claims
No theoretical claims.
Experimental Design and Analysis
The paper employs well-established techniques such as Attention, Erasing Stable Diffusion (ESD), and LoRA, which are already mature methods for concept forgetting and fine-tuning. The authors merely make slight modifications to the loss function and attention matrix representation to apply them to FLUX. The fact that these methods can be easily adapted suggests that transferring SD-based forgetting algorithms to FLUX is not particularly challenging—contradicting the authors’ claim that adapting SD-based methods to FLUX presents fundamental difficulties. Instead, it seems that the authors achieve better performance simply by stacking existing methods.
Supplementary Material
I read all of the supplementary.
Relation to Existing Literature
Flow-matching-based diffusion models represent the cutting edge of generative modeling in current research. Investigating their forgetting mechanisms is of great importance for the future of AI safety. The method proposed in this paper lays foundational groundwork for future research in this direction, offering significant value for advancing both the capabilities and responsible use of such models.
Missing Essential References
No.
Other Strengths and Weaknesses
None.
Other Comments or Suggestions
Typo: In line 197, "nude nude" should be "nude".
We sincerely appreciate your insightful feedback. We will revise the manuscript to ensure a balanced and objective narrative, avoiding any exaggeration or overstatement.
- (A) Difference between bi-level optimization and multi-objective optimization
We frame unsafe concept erasing as a bi-level optimization problem, rather than a multi-objective one. While multi-objective optimization balances competing goals equally (e.g., erase unsafe concepts and preserve irrelevant ones), it lacks a clear prioritization.
In contrast, bi-level optimization explicitly models a hierarchy:
• Lower-level erases unsafe concepts.
• Upper-level evaluates whether irrelevant concepts are preserved.
This reflects the asymmetric nature of our goals: erasure is primary; preservation is a constraint. Bi-level optimization allows for more precise control, better mirrors real-world usage (apply first, evaluate second), and is well-suited for safety-critical tasks where minimizing unintended harm is essential.
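To make the alternation concrete, here is a minimal runnable sketch of the loop structure only; the placeholder losses stand in for the paper's Eqs. 2-5, and all names and hyperparameters are illustrative assumptions:

```python
import torch

# Placeholder losses standing in for the paper's Eqs. 2-5; real versions would
# run the rectified-flow model on target / irrelevant prompts.
def erase_loss(params):
    return sum((p ** 2).sum() for p in params)

def preserve_loss(params):
    return sum(((p - 1.0) ** 2).sum() for p in params)

lora_params = [torch.randn(4, 4, requires_grad=True)]  # stand-in LoRA weights
opt = torch.optim.AdamW(lora_params, lr=1e-4)

for step in range(100):
    # Lower level: erase the unsafe concept.
    opt.zero_grad()
    erase_loss(lora_params).backward()
    opt.step()

    # Upper level: evaluate and enforce preservation of irrelevant concepts.
    opt.zero_grad()
    preserve_loss(lora_params).backward()
    opt.step()
```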
- (B) Attention Localization
Attention localization analysis in UNet-based Stable Diffusion is intuitive and well understood due to the presence of explicit cross-attention mechanisms. However, in the joint-attention architecture of MMDiT, explicit cross-attention is absent. Our work demonstrates that, despite this absence, the joint attention mechanism in MMDiT still retains attention-localization properties similar to those found in UNet's explicit cross-attention. Leveraging this insight, we introduce a framework for concept erasure within the MMDiT architecture. Although this insight may seem straightforward, it is beneficial to the research community, particularly for future FLUX-based erasure studies.
- (C) Real Difficulty in Adapting Unlearning to Flow Matching
Achieving effective concept erasure in flow-matching models presents a significant challenge, as direct application of existing methods like UCE and ESD proves inadequate. To address this, we conducted a comprehensive structural analysis of Flux, meticulously probing its intricacies to identify viable improvement strategies. Through extensive experimentation, we found that precise adjustments to the `add_q_proj` and `add_k_proj` projections within the dual-stream transformer blocks are essential for successful erasure.
I admit that this paper is the first to point out attention localization in FLUX, although it is extremely straightforward, so, fine.
As for bi-level optimization, please show a comparison with multi-objective optimization; otherwise it cannot be argued to be a core contribution.
Thank you for the positive feedback and recognition!
For the multi-objective optimization evaluation, we adopted the experimental settings from Table 3 of our paper, focusing on specific categories: Entity (e.g., soccer) and Abstraction (e.g., artistic style). The table below compares bi-level optimization (BO) with multi-objective optimization. We report CLIP classification accuracies (%) for each erased category across three metrics:
- Acc_{e} (Efficacy): Accuracy on the erased category itself (lower is better ↓).
- Acc_{ir} (Specificity): Accuracy on remaining, unaffected categories (higher is better ↑).
- Acc_{g} (Generality): Accuracy on synonyms of the erased class (lower is better ↓).
| METHOD | ACCe ↓ | ACCir ↑ | ACCg ↓ |
|---|---|---|---|
| Bi-level (ENTITY) | 12.5 | 91.7 | 18.6 |
| Multi-objective (ENTITY) | 12.7 | 79.3 | 28.5 |
| Bi-level (ABSTRACTION) | 21.1 | 90.5 | 24.7 |
| Multi-objective (ABSTRACTION) | 22.3 | 77.4 | 31.2 |
These results demonstrate that BO has merit in this task, and we will add them to the final version of the paper.
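For clarity, here is a minimal sketch of the CLIP zero-shot classification underlying these accuracies; the model checkpoint and prompt template are illustrative assumptions:

```python
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_predict(image, candidate_labels):
    # Zero-shot classification: pick the label whose text embedding
    # best matches the generated image.
    inputs = processor(text=[f"a photo of a {c}" for c in candidate_labels],
                       images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape (1, num_labels)
    return candidate_labels[logits.argmax(dim=-1).item()]

# Acc_e / Acc_ir / Acc_g are then the fractions of generated images whose
# predicted label is the erased class, an unaffected class, or a synonym.
```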
This paper introduces EraseAnything, a Flux-based concept-erasing method designed to selectively remove target concepts while preserving irrelevant ones. The authors employ a bi-level optimization strategy to mitigate overfitting and catastrophic forgetting, two key challenges in concept erasure. Experimental evaluations across diverse tasks demonstrate the method's effectiveness and versatility, highlighting its potential impact on controlled information removal and model robustness.
update after rebuttal
The authors’ rebuttal has clarified my previous concerns. Taking into account the other reviewers' feedback and the authors' response, I choose to maintain my original evaluation (Weak accept).
Questions for Authors
See weaknesses.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
No, as there are no theoretical claims made.
Experimental Design and Analysis
I reviewed the experimental results presented in Tables 2, 3, and 4, and they appear to be correct.
Supplementary Material
The Supplementary Material includes code for both training and testing. However, I did not run the code myself.
Relation to Existing Literature
This paper explores the problem of targeted concept erasure in deep learning models, aligning with broader discussions in the machine learning community on model interpretability, unlearning, and mitigating biases. The proposed Flux-based approach builds upon rectified flow transformers, contributing to existing literature on concept erasure and catastrophic forgetting. The work is relevant to ongoing discussions in ICLR, NeurIPS, and ICML regarding responsible AI and controllable generation in large models.
Missing Essential References
None
Other Strengths and Weaknesses
Paper Strengths:
The paper is well written. The main motivation is clear and easy to understand.
Major Weaknesses:
In Table 2, UCE outperforms the proposed EraseAnything in terms of the number of DETECTED NUDITY instances but performs worse in terms of FID and CLIP on the MS-COCO 10K dataset. What accounts for this discrepancy between these metrics? Does this imply that UCE is superior to EraseAnything overall?
Other Comments or Suggestions
None
Thank you for your kind words and review!
To be concise:
- UCE's aggressive nudity removal significantly distorts images.
- EraseAnything prioritizes image quality and text alignment, offering a better trade-off.
As shown in this image, optimizing 'K' in our UCE implementation on Flux[dev] reduces nudity but degrades image quality, highlighting this inherent trade-off. (Optimizing 'Q' had no effect, and 'V' yielded noisy images.)
While UCE removes more nudity, EraseAnything maintains superior image quality (better performance on FID and CLIP on the MS-COCO 10K dataset), which is crucial for practical use.
- UCE Optimization https://imgur.com/a/at2lkh8
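For context on what optimizing 'K' entails here, the following is a minimal sketch of a UCE-style closed-form edit of a key projection; the notation and the regularization weight are our assumptions, not necessarily the exact UCE implementation:

```python
import torch

def uce_edit(W_old, c_src, c_dst, lam=0.5):
    # W_old: (out_dim, d) key projection; c_src / c_dst: (n, d) row-stacked
    # embeddings of the concepts to erase and their destinations.
    # Closed form: W = (sum v* c^T + lam * W_old) (sum c c^T + lam * I)^-1,
    # where v* = W_old c_dst maps each source concept to a destination output.
    d = c_src.shape[1]
    C = c_src.T @ c_src + lam * torch.eye(d)
    V = (W_old @ c_dst.T) @ c_src
    return (V + lam * W_old) @ torch.linalg.inv(C)
```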
The authors’ rebuttal has clarified my previous concerns. Taking into account the other reviewers' feedback and the authors' response, I choose to maintain my original evaluation (Weak accept).
Thank you for your valuable time and insights. We truly appreciate your support.
In this paper, the authors propose a methodology for concept unlearning while ensuring the preservation of unrelated concepts in the latest text-to-image (T2I) models based on Flow Matching and Transformer-based diffusion models such as Flux. The authors introduce a bi-level optimization (BO) framework. The lower-level optimization focuses on concept removal, while the upper-level optimization ensures the preservation of unrelated concepts. The proposed method is evaluated through quantitative experiments, outperforming state-of-the-art techniques in nudity erasure and output preservation, except for UCE in the “nudity” concept.
Questions for Authors
- Can you elaborate on the last paragraph in Appendix A. Flux Architecture: “For a fair comparison, we have adapted traditional methods such as…conducted under a consistent and relevant framework.” on why the said modification ensures consistent comparative analysis?
- Instead of the ESD loss function, can we utilize the UCE loss function since you established a linear relationship between the text embeddings and attention map? How would it affect the performance of the model?
Claims and Evidence
All the claims made in this paper are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed method and the evaluation criteria make sense for the problem. The authors follow standard evaluation criteria that evaluate the method on the benchmark.
However, adversarial attack-based benchmarking is lacking. For instance, Ring-A-Bell [1] and UnlearnDiff [2] should be used, as in many of the baselines, for detailed comparisons.
[1] “Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?,” Tsai et al.
[2] “To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now,” Zhang et al.
Theoretical Claims
The derivation for the loss is sound and justifies the claim (embedding of the target concept is aligned with the embeddings of irrelevant concepts and pushed away from the synonym of the target concept).
Experimental Design and Analysis
Yes, the experimental designs presented by the authors are sound and look good.
Supplementary Material
Yes, I reviewed all the supplementary sections.
Relation to Existing Literature
The proposed approach targets the transformer-based recent models, which might interest a broader community.
Missing Essential References
In the related work, the FMN [1] paper could be added, as it performs attention (cross-attention) regularization between the attention map and text embeddings for concept erasure.
[1] “Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models,” Zhang et al.
Other Strengths and Weaknesses
Strengths:
- The paper has been written in a concise and clear way.
- The analysis of why current state-of-the-art methods do not work for models like Flux is presented very well.
- The paper explores the potential of scaling concept erasure to multiple concepts and presents results in Appendix F.2, Multiple Concept Erasure.
- The User Study analysis shown in Figure 4 and in Appendix E is extremely helpful in assessing the effectiveness of the method under various metrics.
Weaknesses:
- Several adversarial-attack evaluations, such as Ring-A-Bell [1] and UnlearnDiff [2], have not been presented. They could serve as a useful evaluation of the effectiveness of the attention-map regularization loss.
- Tables 2/3/4 could include more baseline methods (e.g., AdvUnlearn) for an in-depth analysis. See the [UnlearnDiffAtk-Benchmark](https://huggingface.co/spaces/Intel/UnlearnDiffAtk-Benchmark).
- Additionally, Flux-based baselines are missing. For instance, many of the prior works (AdvUnlearn, UCE, ESD, etc.) could be extended to Flux and treated as baselines to truly assess the proposed approach's impact.
[1] “Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?,” Tsai et. al.
[2] “To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images ... For Now,” Zhang et. al.
Other Comments or Suggestions
N/A
Thank you for your kind words and recognition!
- (A) Adversarial Attack Experiments
Thank you for the suggestion to include adversarial attack experiments, which we consider very important. Following the paper's methodology, we used NudeNet (Bedapudi, 2019) with a detection threshold of 0.6 to test the Attack Success Rate (ASR) on the RingABell-Nudity [1] dataset (comprising 285 Ring-A-Bell revised prompts focused on nudity). Since the prompts in RingABell-Nudity are already processed according to the standard procedure, we did not reapply the Ring-A-Bell method.

The table below shows the results of our tests on ESD, CA, and our proposed method using this dataset. We also include the attack results from MU-Attack [2]. "Step 0" means attacking only the very first `velocity` of Flux; "steps 0,1,2" means attacking the initial three velocities. According to our experiments, attacking too many steps yields images irrelevant to the prompt.

| Concept | Methods | Flux[dev] | ESD (Flux[dev]) | CA (Flux[dev]) | EraseAnything (Flux[dev]) |
|---|---|---|---|---|---|
| Nudity (RingABell Nudity) | Original (Org) | 59.65% | 7.36% | 3.16% | 2.46% |
| Nudity (RingABell Nudity) | MU-Attack (step 0) | 64.56% | 11.57% | 15.44% | 8.77% |
| Nudity (RingABell Nudity) | MU-Attack (steps 0,1,2) | 65.96% | 14.74% | 16.49% | 11.93% |
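For reproducibility, here is a minimal sketch of the ASR computation described above; it assumes the `nudenet` package, and the set of exposed labels is an illustrative assumption, not the exact list we used:

```python
from nudenet import NudeDetector

EXPOSED = {"FEMALE_BREAST_EXPOSED", "FEMALE_GENITALIA_EXPOSED",
           "MALE_GENITALIA_EXPOSED", "BUTTOCKS_EXPOSED"}

def attack_success_rate(image_paths, threshold=0.6):
    # An attack succeeds if any exposed body part is detected above threshold.
    detector = NudeDetector()
    hits = 0
    for path in image_paths:
        detections = detector.detect(path)  # list of {"class", "score", "box"}
        if any(d["class"] in EXPOSED and d["score"] >= threshold
               for d in detections):
            hits += 1
    return hits / len(image_paths)
```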
- (B) More Baselines
Thank you for your advice. We will conduct a thorough survey of relevant papers and incorporate more baseline methods, such as AdvUnlearn, into the final version.
- (C) Flux Baselines Missing?
We have implemented all relevant methods, including ESD, CA, and MACE, within the Flux[dev] framework. Therefore, all concept erasing methods reported in the tables were conducted on Flux[dev].
- (D) Questions
Firstly, this signifies that we adapted the previous SD 1.5 methods to the Flux[dev] architecture. Secondly, given that UCE led to poor visual results (https://imgur.com/a/at2lkh8), a finding consistent with its SD 1.5 implementation after careful review, we opted to use ESD as our baseline.
References
[1] "Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?," Tsai et al.
I thank the authors for providing more clarifications.
- (A) I appreciate the authors performing the additional experiments on these benchmarks. Given their importance, if accepted, I hope this will be added to the camera-ready draft.
- (B) I hoped to get more apples-to-apples comparisons during the rebuttal phase, as AdvUnlearn is a very strong baseline and the comparison is missing.
- (C & D) Thanks for the clarification.
I am inclined to increase the score to 3 (weak accept) or even 4 (accept) if a comparison with AdvUnlearn is provided during the discussion phase and shows the improvement.
Thank you for your kind words! To promptly address point (B), we provide a comparison between the fast and standard versions of AdvUnlearn, using the same experimental setup as previously described and the same optimization practice as defined in https://github.com/OPTML-Group/AdvUnlearn/tree/main.
Response to (B):
As demonstrated by the T5 vs. CLIP comparison https://imgur.com/a/047aypl, optimizing CLIP embeddings within Flux yields a negligible impact on the final output. Therefore, to respond to your inquiry, we have applied the same optimization method (AdvUnlearn) to the T5 model, utilizing the previously mentioned experimental settings.
| Concept | Methods | Flux[dev] | ESD (Flux[dev]) | CA (Flux[dev]) | AdvUnlearn AT (Flux[dev]) | AdvUnlearn Fast-AT (Flux[dev]) | EraseAnything (Flux[dev]) |
|---|---|---|---|---|---|---|---|
| Nudity (RingABell Nudity) | Original (Org) | 59.65% | 7.36% | 3.16% | 6.67% | 9.82% | 2.46% |
This paper proposes EraseAnything, a concept erasure method for Flux-based diffusion models to address a timely and underexplored problem.
All four reviewers leaned positive, with one increasing their score to accept after a strong rebuttal. While some concerns remain—mainly around the simplicity of the bi-level optimization and attention localization—the authors responded thoroughly with additional experiments and clarifications. The method’s empirical effectiveness and its relevance to safety and controllability make it a valuable contribution.