SafeText: Safe Text-to-image Models via Aligning the Text Encoder
Abstract
Reviews and Discussion
To prevent unsafe generation in T2I diffusion models, most previous methods operate on the diffusion module itself, either fine-tuning its parameters or modifying the generation process. In contrast, this paper proposes to fine-tune the text encoder: it pushes the embeddings of unsafe prompts away from their original values while preserving the embeddings of safe prompts.
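For concreteness, my reading of this objective (not necessarily the paper's exact formulation) is roughly

$$\min_{\theta}\;\; \mathbb{E}_{p_s \sim \mathcal{D}_s}\big[\lVert f_\theta(p_s) - f_0(p_s)\rVert_2^2\big] \;-\; \lambda\,\mathbb{E}_{p_u \sim \mathcal{D}_u}\big[\lVert f_\theta(p_u) - f_0(p_u)\rVert_2^2\big],$$

where $f_0$ is the frozen original text encoder, $f_\theta$ the fine-tuned one, and $\mathcal{D}_s$, $\mathcal{D}_u$ the safe and unsafe prompt sets: the first term preserves safe-prompt embeddings (utility) and the second pushes unsafe-prompt embeddings away from their original values (effectiveness), with $\lambda$ trading the two off.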
Strengths
- The method is simple and intuitive.
- When erasing nudity, the method performs significantly better than other methods in terms of both effectiveness and utility.
Weaknesses
- The writing needs to be improved. (1) I recommend adding a schematic diagram to illustrate the method. (2) Eq. 2 does not follow the usual convention that a smaller loss indicates better performance; instead, the authors state that "the effectiveness goal may be better achieved when the loss term L_e is larger". The same problem occurs in Eq. 3. (3) The figures and tables in the Appendix should be numbered individually.
- I have some concerns about the method. (1) For a concept to be erased, the method requires a prompt distribution, and its performance depends on that distribution because the alignment is conducted on prompts sampled from it. However, the authors do not discuss how to obtain a proper estimate of this distribution, nor do they report results under different training datasets or visualize the embeddings of the collected prompts, so I cannot judge how reliable the method is. (2) In practical applications, user prompts are complex: some are short and explicit, others are long and implicit. For a very long prompt containing only a single unsafe word, can the method still block generation? This also bears on the method's reliability; it could be addressed by analyzing the embedding patterns of the text encoder or by experimenting on carefully categorized prompts. (3) From Fig. 1, the images generated from unsafe prompts by this method are meaningless. Why not align these unsafe prompts to safe ones instead?
- Concerns about the experiments. (1) The LPIPS metric may not be suitable: it is typically used for image quality assessment, whereas CLIP score would be more appropriate. (2) Can the method defend against white-box attacks such as Prompt4Debugging? (3) I recommend that the authors also report the results of the original models. (4) Can the method be applied to other concepts, such as shocking or bloody content, objects, and painting styles?
Questions
See Weaknesses.
This paper addresses the limitations of existing safety alignment methods and introduces SafeText, an alignment approach that fine-tunes the text encoder instead of modifying the diffusion module. The paper proposes two key loss functions: an effectiveness loss, which alters the content generated from unsafe prompts, and a utility loss, which preserves the quality of images generated from safe prompts. Validation across multiple prompt datasets demonstrates the effectiveness of SafeText.
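For concreteness, a minimal sketch of how such two-loss fine-tuning could be implemented (my own illustration, not the authors' code; the checkpoint name, prompts, learning rate, and trade-off weight below are placeholders):

```python
# Sketch only: fine-tune a CLIP text encoder with a utility term (keep safe-prompt
# embeddings close to the original encoder) and an effectiveness term (push unsafe-prompt
# embeddings away from it). Not the paper's exact losses.
import copy
import torch
import torch.nn.functional as F
from transformers import CLIPTextModel, CLIPTokenizer

name = "openai/clip-vit-large-patch14"                           # placeholder checkpoint
tokenizer = CLIPTokenizer.from_pretrained(name)
encoder = CLIPTextModel.from_pretrained(name)                    # encoder being fine-tuned
frozen = copy.deepcopy(encoder).eval().requires_grad_(False)     # frozen original as reference

def embed(model, prompts):
    tokens = tokenizer(prompts, padding="max_length", truncation=True, return_tensors="pt")
    return model(**tokens).last_hidden_state                     # embeddings fed to the diffusion module

optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)
lam = 0.1                                                        # placeholder trade-off weight
safe_prompts = ["a photo of a cat sitting on a sofa"]            # placeholder batches
unsafe_prompts = ["<an unsafe prompt>"]

for step in range(100):
    utility = F.mse_loss(embed(encoder, safe_prompts), embed(frozen, safe_prompts))
    effectiveness = F.mse_loss(embed(encoder, unsafe_prompts), embed(frozen, unsafe_prompts))
    # Keep safe embeddings close, push unsafe ones away; in this toy form the second term
    # is unbounded, so a real implementation would need to bound or reweight it.
    loss = utility - lam * effectiveness
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The sketch is only meant to make the roles of the two losses explicit; the actual formulation and training setup in the paper may differ.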
Strengths
- The method proposed in this paper is effective, enabling substantial alterations to content generated from unsafe prompts while minimally altering content generated from safe prompts.
- The experiments in this paper are thorough, including extensive comparative and ablation studies, with visualized results provided in the appendix.
Weaknesses
- Novelty: the two loss functions proposed in this paper are applicable in most scenarios. For instance, in incremental learning, an effectiveness-style loss can ensure the new task is learned, while a utility-style loss helps minimize forgetting of the original content. Positioning these two loss functions as the primary contribution of the paper is therefore of limited novelty.
- Dataset: The paper introduces a self-collected U-prompt dataset; however, there is no detailed description of it, such as examples, statistics, or other relevant information. To help readers understand the paper, it is recommended to provide a comprehensive overview of the dataset, including representative examples and statistical analysis.
- Scenario: Based on the description in the paper, my understanding is that the NSFW scenario considered focuses solely on nudity, without addressing other scenarios such as violence or gore. To better demonstrate the effectiveness of the proposed method, it is recommended to include a broader range of scenarios.
- Experimental setup: The paper provides limited information on hyperparameters and lacks essential experimental details, such as the number of training epochs and the training duration.
Questions
See Weaknesses.
Most existing safety alignment methods modify the diffusion model but ignore another component, namely the text encoder. This paper proposes a simple method that fine-tunes the text encoder to align the embeddings of unsafe prompts with those of safe prompts.
Strengths
- The proposed method is easy for readers to follow and understand.
- The authors evaluate their method under various text-to-image model architectures, which previous studies have overlooked.
- The authors provide a rich analysis of the method's hyper-parameters.
Weaknesses
- Limited concepts. In the quantitative experiments, the authors only provide results for NSFW concepts. Previous methods are also evaluated on erasing styles, objects, etc.
- Limited baselines. A recent method, Latent Guard [1], also considers the text encoder for erasing concepts. As a similar method, it should be compared in the experiments; the authors should compare against it to substantiate their contributions and advantages.
- Limited evaluation metrics. The method may affect the text-encoding ability, which in turn affects the diffusion model's image-text consistency. However, LPIPS and FID, which the authors use for evaluation, cannot capture this accurately; a metric such as CLIP score would help.
- From Figure 1, it seems that the method generates meaningless images. In contrast, with some methods such as MACE, the model can generate dressed persons for a prompt containing a nudity-related keyword.
- A figure illustrating the method is needed.
[1] Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, Fabio Pizzati. Latent Guard: A Safety Framework for Text-to-Image Generation. arXiv preprint arXiv:2404.08031, 2024.
Questions
Besides the points mentioned in Weaknesses, another point should be analyzed in the paper. Training the text encoder involves the datasets D_s and D_un. If the unsafe components are removed from the prompts in D_un, these prompts become safe; can the model still generate normal images in this case?
This paper proposes a new method for preventing unsafe generation by text-to-image models. It is achieved by aligning the text encoder of the T2I model toward a safe version. Comprehensive experiments are conducted to show its effectiveness.
Strengths
- The paper is well organized and easy to follow. The proposed method is straightforward, and the idea is simple.
- Comprehensive experiments and ablation studies are conducted to illustrate the proposed method's performance.
Weaknesses
- My main concern is about the necessity of this method. If we are going to align unsafe text embeddings to safe embeddings by tuning the text encoder, why not simply detect the unsafe texts and refuse to generate anything? A text-filtering strategy is more efficient and more flexible to configure. Diffusion-module tuning methods actually erase certain concepts from the model, whereas the proposed text-aligning method is not well motivated from an application perspective.
- The consistency between generated images and safe prompts is not evaluated. The paper only evaluates generated image quality, but semantic consistency should also be considered as part of utility (see the sketch below).
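One concrete option (my suggestion, not an experiment in the paper) is a CLIP-based image-text similarity over images generated from safe prompts; a rough sketch with Hugging Face transformers, where the checkpoint name and inputs are placeholders:

```python
# Sketch only: CLIP image-text similarity as a consistency metric for safe prompts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings (higher = more consistent)."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1))
```

Averaging this score over safe prompts before and after the proposed fine-tuning would directly measure whether the modified text encoder degrades image-text consistency.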
Questions
The questions are as mentioned above.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.