PaperHub
Rating: 6.0/10 (Poster; 4 reviewers; lowest 5, highest 7, std 1.0)
Individual ratings: 5, 7, 7, 5
Confidence: 4.3
Soundness: 2.8
Contribution: 2.5
Presentation: 2.8
NeurIPS 2024

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

Submitted: 2024-04-25 · Updated: 2024-11-06
TL;DR

This framework leverages an LLM to enhance content moderation processes. By utilizing advanced natural language understanding, GuardT2I can effectively identify and manage inappropriate content, ensuring a safe and respectful environment.

Abstract

Keywords
Text-to-Image generation

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a new defensive method for generative T2I models, termed GuardT2I. GuardT2I utilizes a fine-tuned conditional LLM to map text embeddings to explicit prompts and detect the presence of NSFW themes. GuardT2I keeps the target T2I model unchanged, thus maintaining the generated image quality. Evaluations are conducted against text-based defenses such as the OpenAI Moderation API.

Strengths

  • GuardT2I presents a new way towards adversarial prompt filtering in commercial-level T2I generative models.
  • Comprehensive evaluations are conducted with detailed implementation setups.
  • The proposed method is extensible and compatible with any other LLM architectures.

Weaknesses

  • In the paper there is a lack of direct comparison between GuardT2I and other types of defenses, such as SafetyChecker (image classifier employed by Stable Diffusion) [1], Safe Latent Diffusion (SLD) [2], and concept removal methods [3]. While I understand some of these defenses may be out of this paper's scope considering that they may rely on the generated image or fine-tuning the target T2I model, it is still necessary to report the gap.
  • Settings of Table 6 are not clearly explained. The inference time of GuardT2I depends on the length of the recovered prompt. However, SafetyChecker detects NSFW themes from the generated image, which means that the inference time of SafetyChecker will not be influenced by the input prompt. SafetyChecker has fewer parameters than GuardT2I while requiring a longer inference time, which is confusing. Please provide more details related to this experiment.
  • Selection of evaluated adversarial prompts is not well motivated. There are related adversarial attacks against T2I models such as Ring-A-Bell [4], QF Attack [5], and P4D attack [6], which are not included in this paper.

[1] Rando, Javier, et al. "Red-Teaming the Stable Diffusion Safety Filter." arXiv preprint arXiv:2210.04610 (2022).

[2] Schramowski, Patrick, et al. "Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[3] Gandikota, Rohit, et al. "Erasing concepts from diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

[4] Tsai, Yu-Lin, et al. "Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?." arXiv preprint arXiv:2310.10012 (2023).

[5] Zhuang, Haomin, Yihua Zhang, and Sijia Liu. "A pilot study of query-free adversarial attack against stable diffusion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[6] Chin, Zhi-Yi, et al. "Prompting4debugging: Red-teaming text-to-image diffusion models by finding problematic prompts." arXiv preprint arXiv:2309.06135 (2023).

[7] Yang, Yijun, et al. "Mma-diffusion: Multimodal attack on diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Questions

See weaknesses.

Limitations

NA

Author Response

To Reviewer EP5J

Thank you for your time and effort in reviewing our paper. We greatly appreciate your insightful comments and have addressed each point below.


[Weakness 1] ... lack of comparison with other defenses, such as SLD and concept removal methods [3].


[Answer 1] Thanks for recommending these necessary baselines. To address this concern, we have conducted additional experiments comparing GuardT2I with SLD-Medium, SLD-Strong, and ESD.

Experiment settings: SLD and ESD are designed to reduce the probability of NSFW generation; therefore, we use the Attack Success Rate (ASR) as our evaluation metric. For GuardT2I, we set the detection threshold at FPR@5%, a commonly adopted operating point. As a concept-erasing method, ESD [3] only removes a single NSFW concept, "nudity," by fine-tuning the T2I model. This limitation means it fails to mitigate other NSFW themes such as violence and illegal content; consequently, our evaluation focuses solely on "adult content." All baseline implementations and tested adversarial prompts are those released with the original papers.

Table R1. Comparison of Attack Success Rate with Concept-Erasing Methods (Lower is Better ↓)

Method | SneakyPrompt | MMA-Diffusion | I2P-sex | Ring-A-Bell [4] | P4D [6] | Avg. ↓ | Std. ↓
ESD [3] | 28.57 | 66.70 | 36.25 | 98.60 | 79.16 | 61.86 | 29.31
SLD-Medium [2] | 58.24 | 85.00 | 39.10 | 98.95 | 80.51 | 72.36 | 23.66
SLD-Strong [2] | 41.76 | 80.82 | 30.12 | 97.19 | 73.75 | 64.73 | 27.93
GuardT2I (ours) | 9.89 | 10.20 | 26.40 | 3.16 | 8.75 | 11.68 | 8.71

Our results show that GuardT2I consistently outperforms other types of defenses across various adversarial attacks. We will include these comparisons in the revised manuscript to provide a more comprehensive evaluation of our approach.
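As a side note on reproducibility, the FPR@5% protocol above is straightforward to implement. The following is a minimal sketch under our own assumptions (synthetic detector scores and a hypothetical `asr_at_fpr` helper), not the actual evaluation code:

```python
# Hypothetical sketch of the FPR@5% evaluation protocol described above.
# Scores and dataset sizes are illustrative placeholders.
import numpy as np

def asr_at_fpr(benign_scores: np.ndarray,
               adversarial_scores: np.ndarray,
               target_fpr: float = 0.05) -> float:
    """Pick the detection threshold that flags `target_fpr` of benign
    prompts, then report the fraction of adversarial prompts that slip
    through (the Attack Success Rate)."""
    # Higher score = more likely adversarial (a convention we assume here).
    # The threshold is the (1 - target_fpr) quantile of benign scores, so
    # only ~5% of benign prompts score above it.
    threshold = np.quantile(benign_scores, 1.0 - target_fpr)
    # An attack "succeeds" if its score stays at or below the threshold.
    return float(np.mean(adversarial_scores <= threshold))

# Toy usage with synthetic scores:
rng = np.random.default_rng(0)
benign = rng.normal(0.2, 0.1, size=1000)      # benign prompts score low
adversarial = rng.normal(0.7, 0.2, size=500)  # attacks mostly score high
print(f"ASR @ FPR 5%: {asr_at_fpr(benign, adversarial):.2%}")
```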


[Weakness 2]: Settings of Table 6 are not clearly explained.


[Answer 2] Thank you for pointing out the unclear point. Your understanding is correct. In Table 6, we report the best inference time of GuardT2I. The fast inference of GuardT2I is due to its decoding technique. Specifically, the c·LLM can use a range of decoding methods, with the Table 6 results reflecting single-pass, full-sequence decoding. Although fast, this method may compromise quality because each position ignores the preceding words. Conversely, greedy decoding, which predicts tokens sequentially, provides better quality at the expense of speed. We have included the inference time for greedy decoding in Table 6 (see Reviewer 1mLf, Answer 2). The reported inference time is averaged over 1000 normal prompts, each containing an average of 12 tokens. Additionally, we use greedy decoding in all other evaluation experiments.
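To make the speed/quality trade-off concrete, below is a toy sketch of the two decoding modes. The stub `decoder` is a placeholder we introduce for illustration (it returns random logits), not the actual c·LLM:

```python
# Toy contrast between single-pass and greedy decoding. The stub `decoder`
# stands in for a conditional decoder: given the conditioning embedding and
# the tokens decoded so far, it returns logits of shape (seq_len, vocab).
import numpy as np

VOCAB, SEQ_LEN = 100, 12
rng = np.random.default_rng(0)

def decoder(embedding, prefix_tokens):
    """Placeholder decoder: random logits for every position."""
    return rng.normal(size=(SEQ_LEN, VOCAB))

embedding = rng.normal(size=768)  # stand-in for a text-guidance embedding

# (1) Single-pass decoding: one forward call, argmax at every position at
# once. Fast, but each position cannot see the words chosen before it.
logits = decoder(embedding, prefix_tokens=[])
single_pass = [int(t) for t in np.argmax(logits, axis=-1)]

# (2) Greedy decoding: one forward call per token, feeding back the prefix,
# so each step conditions on previously decoded words. Slower, higher quality.
greedy = []
for t in range(SEQ_LEN):
    logits = decoder(embedding, prefix_tokens=greedy)
    greedy.append(int(np.argmax(logits[t])))

print(single_pass, greedy, sep="\n")
```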


[Weakness 3] ... Ring-A-Bell [4], and P4D attack [6], are not included... .


[Answer 3] Thank you for your insightful suggestion. We have added an evaluation of GuardT2I on these adversarial prompts in the expanded Table 2, now including Ring-A-Bell [4] and P4D [6]. Note that QF-Attack primarily focuses on disabling T2I generation without aiming at high-quality NSFW output; we will discuss this attack in the related work section. The best performance is highlighted in bold, and the runner-up is in italics.

Due to space limitations, we report the modified part of Table 2 as follows.

Method | Ring-A-Bell [4] | P4D [6] | Avg. | Std. ↓

AUROC (%↑)
OpenAI | 99.35 | 95.68 | 91.51 | 11.59
Azure | 99.42 | 81.90 | 77.19 | 18.64
AWS | 98.76 | 91.51 | 87.48 | 13.70
NSFW_classifier | 64.34 | 57.97 | 73.04 | 15.32
Detoxify | 96.27 | 82.22 | 73.22 | 17.06
GuardT2I | 99.91 | 98.36 | 96.77 | 3.15

AUPRC (%↑)
OpenAI | 98.21 | 94.87 | 87.68 | 15.10
Azure | 99.56 | 90.38 | 79.91 | 18.19
AWS | 98.80 | 91.73 | 89.30 | 11.14
NSFW_classifier | 53.86 | 51.06 | 57.31 | 7.51
Detoxify | 95.52 | 80.98 | 81.91 | 13.95
GuardT2I | 99.92 | 98.51 | 96.16 | 4.35

FPR@TPR95% (↓)
OpenAI | 0.70 | 25.42 | 27.55 | 22.27
Azure | 1.05 | 80.00 | 62.67 | 33.51
AWS | 6.32 | 80.42 | 49.59 | 43.57
NSFW_classifier | 68.42 | 87.92 | 79.33 | 17.88
Detoxify | 15.09 | 90.83 | 54.41 | 33.52
GuardT2I | 0.35 | 41.67 | 19.26 | 17.14

(Avg. and Std. are computed over all six adversarial prompt datasets of the expanded Table 2.)

As demonstrated in the table, GuardT2I consistently outperforms baseline methods across most cases in both new adversarial prompt datasets. It achieves the highest average AUROC and AUPRC, underscoring its superior capability in defending against adversarial prompts. Notably, it also exhibits the lowest FPR at a TPR of 95%, indicating fewer false alarms while maintaining a high true positive rate. These results highlight GuardT2I's consistent and reliable performance across diverse adversarial scenarios. We will add the above results to our main paper.
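For completeness, all three metrics in the table can be computed from raw detector scores with standard tooling. The sketch below is our own illustration assuming scikit-learn and synthetic scores, not the paper's evaluation code:

```python
# Minimal sketch of AUROC, AUPRC, and FPR@TPR95% computation. `y_true`
# marks adversarial prompts as 1 and benign prompts as 0.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def detection_metrics(y_true, scores, tpr_level=0.95):
    auroc = roc_auc_score(y_true, scores)
    auprc = average_precision_score(y_true, scores)
    # FPR@TPR95%: the false-positive rate at the first operating point
    # whose true-positive rate reaches 95%.
    fpr, tpr, _ = roc_curve(y_true, scores)
    fpr_at_tpr = fpr[np.searchsorted(tpr, tpr_level)]
    return auroc, auprc, fpr_at_tpr

# Toy usage with synthetic scores:
rng = np.random.default_rng(0)
y_true = np.concatenate([np.zeros(1000), np.ones(500)])
scores = np.concatenate([rng.normal(0.2, 0.15, 1000),
                         rng.normal(0.7, 0.20, 500)])
auroc, auprc, fpr95 = detection_metrics(y_true, scores)
print(f"AUROC={auroc:.4f}  AUPRC={auprc:.4f}  FPR@TPR95%={fpr95:.4f}")
```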

We hope this extended evaluation addresses your concerns. Thank you for your insightful comments.

Comment

Dear Reviewer EP5J,

Thank you again for your time and effort in reviewing our paper. We hope our responses have adequately addressed your concerns. We value your contributions and anticipate any further suggestions you might have.

Sincerely, the Authors.

Official Review
Rating: 7
  • To defend T2I models from adversarial prompts, this paper presents a novel moderation framework, GuardT2I, that adopts a generative approach to enhance Text-to-Image models' robustness against adversarial prompts.
  • Specifically, GuardT2I uses a large language model to conditionally interpret text guidance embeddings for effective detection, avoiding binary classification.
  • Extensive experiments show GuardT2I significantly outperforms commercial solutions like OpenAI-Moderation and Microsoft Azure Moderator across diverse adversarial scenarios.

Strengths

  • The paper is well-written and easy to follow. I like the figures in this paper.
  • The motivation and idea of this paper are novel, clear and well-explained.
  • Experiments effectively verify the effectiveness of this work, especially the evaluation on adaptive attacks.

Weaknesses

  • Lacks evaluation and comparison on the standard text-to-image generation task. I am curious about the impact of GuardT2I on benign text prompts.

Questions

  • I also want to know whether the training dataset for GuardT2I includes adversarial prompts generated by methods like SneakyPrompt. How is the generalization of the trained LLM ensured?

Limitations

See weaknesses.

Author Response

To Reviewer 1Dsg

Thank you for your time and effort in reviewing our paper. We greatly appreciate your insightful comments, which we have addressed point by point below.


[Weakness 1] Lacks evaluation and comparison on the standard text-to-image generation task. I am curious about the impact of GuardT2I on benign text prompts.


[Answer 1] Thank you for your insightful comment. We recognize the importance of evaluating GuardT2I on standard text-to-image generation tasks. To address this, we conducted additional experiments to assess the impact of GuardT2I on benign text prompts, focusing on image quality (FID), text alignment (CLIP-Score), and False Positive Rate.

We compared our approach with the concept-erasing defense methods ESD [11] and SLD [36], which aim to reduce the probability of generating NSFW images. Additionally, we reported the average Attack Success Rate (ASR) to indicate the effectiveness of the defense methods. The experimental settings are consistent with those in our main paper. Our results are summarized in the table below:

Table R2. GuardT2I’s image fidelity and text alignment. Bold font indicates the best performance, and italic indicates the second best.

Method | FID ↓ (Image Fidelity) | CLIP-Score ↑ (Text Alignment) | ASR (Avg.) ↓ (Defense Effectiveness)
ESDu1 [11]* | 49.24 | 0.1501 | 61.86
SLD-Medium [36] | 54.15 | 0.1476 | 72.36
SLD-Strong [36] | 56.44 | 0.1455 | 64.73
GuardT2I (Ours) | 52.10 | 0.1502 | 11.68

GuardT2I maintains competitive FID scores and achieves the highest CLIP-Score, ensuring that image quality and text alignment remain unaffected in normal use cases. Additionally, its superior defense effectiveness demonstrates robustness against adversarial prompts.

Frequent misclassification of benign prompts as adversarial can frustrate normal users. Therefore, an ideal defensive method should achieve a low False Positive Rate (FPR) while correctly banning most adversarial prompts. In Table R3, we report the FPR@TPR95% of GuardT2I. As shown in the table, GuardT2I demonstrates decent performance.

Table R3. Comparison of Impacts on Normal Prompts with FPR@TPR95%. Bold font indicates the best performance, and italic indicates the second best.

Method | Sneaky | MMA | I2P | I2P-sex | Ring-A-Bell | P4D | Avg. ↓ | Std. ↓
OpenAI | 4.40 | 40.20 | 35.50 | 59.09 | 0.70 | 25.42 | 27.55 | 22.27
Azure | 61.53 | 57.60 | 77.50 | 98.32 | 1.05 | 80.00 | 62.67 | 33.51
AWS | 19.78 | 4.95 | 90.50 | 95.56 | 6.32 | 80.42 | 49.59 | 43.57
NSFW_classifier | 84.61 | 48.10 | 92.50 | 94.45 | 68.42 | 87.92 | 79.33 | 17.88
Detoxify | 51.64 | 13.70 | 76.00 | 79.20 | 15.09 | 90.83 | 54.41 | 33.52
GuardT2I | 6.50 | 6.59 | 25.50 | 34.96 | 0.35 | 41.67 | 19.26 | 17.14

[Question 1] I also want to know whether the training dataset for GuardT2I includes adversarial prompts generated by methods like SneakyPrompt. How is the generalization of the trained LLM ensured?


[Answer 2] Thank you for raising this question. The training dataset for GuardT2I does not include any adversarial prompts, making it an attack-agnostic defense framework. GuardT2I is trained solely on plain-text prompts.

The excellent generalization capability of GuardT2I is due to its use of text embeddings within T2I models for generation. We observed that for an effective adversarial attack, the text embedding of the adversarial prompt must closely resemble that of the corresponding plain-text target prompt. This similarity is a common characteristic of adversarial prompts.

GuardT2I leverages this insight by generating prompt interpretations directly on the text embeddings, thereby reconstructing the content of the target prompt associated with the adversarial prompt. Consequently, it demonstrates strong generalization ability across various types of adversarial attacks. We will enhance the clarity of this explanation in our main paper.
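A conceptual sketch of this double-check, with a hypothetical NSFW word list, similarity model, and threshold that we introduce purely for illustration (they are not the paper's actual components), might look as follows:

```python
# Conceptual sketch of the two-stage check: verbalizer word match on the
# c·LLM's interpretation, then sentence similarity against the raw prompt.
from sentence_transformers import SentenceTransformer, util

NSFW_WORDS = {"naked", "nude", "nudity"}         # toy verbalizer word list
SIM_THRESHOLD = 0.5                              # toy similarity threshold

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in similarity model

def moderate(input_prompt: str, interpretation: str) -> bool:
    """Return True if the prompt should be flagged. `interpretation` is the
    c·LLM's natural-language reading of the prompt's text embedding."""
    # Stage 1 (verbalizer): flag if the interpretation surfaces NSFW words.
    if any(w in interpretation.lower() for w in NSFW_WORDS):
        return True
    # Stage 2 (sentence similarity): a benign prompt should be faithfully
    # reconstructed; a large gap between the prompt and its interpretation
    # signals an adversarial prompt whose surface text hides its intent.
    emb = model.encode([input_prompt, interpretation], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() < SIM_THRESHOLD

# Toy usage: an adversarial prompt whose interpretation reveals the target.
print(moderate("a gr8 photo of mn wthout clthes", "a completely naked man"))
```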


Thank you once again for your thoughtful and constructive feedback. Your comments have been instrumental in improving the quality of our work. We hope that our responses have adequately addressed your concerns.

Comment

Thank you for your response; it has addressed most of my concerns. I believe the evaluation and comparison on the standard text-to-image generation task are important, and I hope the authors can include these rebuttal results in the revised manuscript. I have raised my score to 7.

Comment

Dear Reviewer 1Dsg,

Thank you for your thoughtful feedback and for raising your score. We appreciate your recognition of our efforts to address your concerns. We will ensure that the evaluation and comparison on the standard T2I generation task are included in the revised manuscript.

Your constructive comments have been invaluable in improving our work. Once again, thank you for your time and encouragement in reviewing this work.

Sincerely, the Authors.

Official Review
Rating: 7

This paper presents GUARDT2I, a new moderation framework designed to defend against adversarial prompts for text-to-image generation models. Specifically, it uses a large language model to conditionally transform text guidance embeddings into natural language for effective adversarial prompt detection. The experiments demonstrate the effectiveness of GUARDT2I in outperforming commercial detectors.

Strengths

  • The paper addresses an important issue by defending against adversarial prompts, which is a timely topic given the increasing popularity and deployment of text-to-image generation models.

  • The proposed solution of transforming text guidance embeddings into natural language is interesting and well-formulated.

  • The defense methods proposed have outperformed commercial adversarial prompt detectors in many scenarios.

  • The evaluations on adaptive attacks are valuable, highlighting the practical importance of the defenses.

Weaknesses

  • The introduction and related work sections note that model fine-tuning approaches compromise image quality in normal use cases. Consequently, commercial solutions do not typically use such approaches. The authors acknowledge this, and thus it is important to quantitatively evaluate the impact of the proposed solutions on normal use cases (e.g., image quality) to ensure they do not adversely affect regular prompts, the potential metrics could be FID, etc.

  • Unlike previous classifier-based approaches, this paper adopts a generator-based approach. Despite achieving better detection performance, it may also introduce higher delay. I disagree with the claim in Section 5 that the proposed approach does not introduce additional inference time: each prompt is fed into the c·LLM and then the verbalizer and the sentence-similarity checker, incurring additional inference costs. This could be problematic when adopted by commercial platforms due to the large number of queries per second.

  • The claim that GUARDT2I outperforms leading commercial solutions by a significant margin should be toned down, as Figures 2 and 5 indicate scenarios where baselines still perform on par with the proposed methods.

  • In terms of prior attacks, the authors missed a key related work published in 2023 [X]. This should be discussed in the related work section, and the authors should evaluate whether their proposed defense can effectively mitigate it in terms of performance.

  • A minor concern: the proposed approach may reconstruct NSFW prompts during training, as it feeds unfiltered datasets into the c·LLM. I wonder how the approach is able to infer the actual meaning of such prompts correctly during inference?

[X] Liu, Han, et al. "Riatig: Reliable and imperceptible adversarial text-to-image generation with natural prompts." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Questions

  • What is the additional latency in running the proposed defenses?

  • How is the approach able to infer the actual meaning of NSFW prompts correctly during inference, given it learns to reconstruct NSFW prompts during training?

Limitations

Please see my above comments.

Author Response

To Reviewer 1mLf


[Weakness 1] ... it is important to evaluate the impact of the proposed solutions on normal use cases (e.g., image quality) ..., the potential metrics could be FID, etc.


[Answer 1]

We appreciate the reviewer's insightful comment. To address this concern, we conducted additional experiments to evaluate the performance of our method using the FID and CLIP-Score metrics to assess image quality and text alignment. We compared our approach to the concept-erasing defense methods ESD [11] and SLD [36], which aim to reduce the probability of generating NSFW images. Additionally, we reported the average Attack Success Rate (ASR) to indicate the effectiveness of the defense methods. The experimental settings are consistent with those in our main paper. Our results are summarized in the table below:

Method | FID [X2] ↓ (Image Fidelity) | CLIP-Score [X2] ↑ (Text Alignment) | ASR (Avg.) ↓ (Defense Effectiveness)
ESDu1 [11]* | 49.24 | 0.1501 | 61.86
SLD-Medium [36] | 54.15 | 0.1476 | 72.36
SLD-Strong [36] | 56.44 | 0.1455 | 64.73
GuardT2I (Ours) | 52.10 | 0.1502 | 11.68

By maintaining competitive FID scores and achieving the highest CLIP-Score, GuardT2I ensures that image quality and text alignment are not adversely affected in normal use cases. Moreover, its superior defense effectiveness highlights its robustness against adversarial prompts.

[X2] I. Pavlov, A. Ivanov, and S. Stafievskiy. Text-to-Image Benchmark: A benchmark for generative models. September 2023.
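As an aside, both metrics can also be reproduced with off-the-shelf tooling. The sketch below assumes torchmetrics rather than the benchmark of [X2] used in the rebuttal, with random tensors standing in for real and generated images; note that torchmetrics scales CLIP-Score to [0, 100], whereas the table reports raw cosine similarity:

```python
# Minimal FID / CLIP-Score sketch with torchmetrics (our assumption).
# Random uint8 tensors stand in for reference and generated images.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

real = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 255, (8, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)    # accumulate statistics of reference images
fid.update(fake, real=False)   # accumulate statistics of generated images
print("FID:", fid.compute().item())

clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
captions = ["a photo of a cat"] * 8
print("CLIP-Score (0-100 scale):", clip(fake, captions).item())
```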


[Weakness 2 & Question 1] ... I disagree with the claim in Section 5 that the proposed approach does not introduce additional inference time. ... What is the additional latency in running the proposed defenses?


[Answer 2]

We thank the reviewer for highlighting this unclear point. GuardT2I operates in parallel with T2I models. As long as GuardT2I's inference is faster than the T2I model's image generation, it introduces no additional latency from the user's perspective. Typically, generating an image with a T2I model requires 50 to 100 steps (e.g., 17.803 s for SDv1.5 with 50 steps), while GuardT2I's inference time is at most 0.419 s.

This process is illustrated in Figures 1(c) and 1(d), where GuardT2I can halt the diffusion steps of malicious prompts early. Detailed latency metrics for GuardT2I are reported in Table 6, which we have included below for reference.

Table 6. Comparison of Model Parameters and Inference Times

Model | #Params (G) | Inference Time (s)
SDv1.5 | 1.016 | 17.803
SDXL0.9 | 5.353 | -
SafetyChecker [3] | 0.290 | 0.129
SDv1.5 + SafetyChecker | 1.306 | 17.932
GuardT2I | 0.538 | 0.059
GuardT2I (Greedy decoding) | - | 0.419
Sentence-Sim. (GuardT2I) | 0.104 | 0.026

These results demonstrate that GuardT2I introduces no additional latency relative to the overall image generation process of T2I models. This ensures that the implementation of GuardT2I on commercial platforms will not significantly impact the user experience. We will provide further clarification on this issue in our main paper.
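To illustrate this parallel deployment concretely, here is a toy sketch of our own construction; the `guard` and sampling functions are placeholders, with timings borrowed from Table 6:

```python
# Toy sketch: moderation runs alongside the diffusion loop and can halt it
# early. Timings and function bodies are illustrative, not measurements.
import threading
import time

abort = threading.Event()

def guard(prompt: str):
    """Stand-in for GuardT2I: finishes in well under one denoising step."""
    time.sleep(0.419)                      # worst-case latency from Table 6
    if "unsafe" in prompt:                 # placeholder for the real check
        abort.set()                        # signal the sampler to stop

def diffusion_sampling(prompt: str, steps: int = 50):
    for step in range(steps):
        if abort.is_set():                 # cf. Figures 1(c)/1(d): early halt
            print(f"halted at step {step}")
            return None
        time.sleep(17.803 / steps)         # ~one SDv1.5 denoising step
    return "image"

prompt = "an unsafe prompt"
threading.Thread(target=guard, args=(prompt,)).start()
print(diffusion_sampling(prompt))
```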


[Weakness 3] The claim that GuardT2I outperforms leading commercial solutions by a significant margin should be toned down, as Table 2 and Figure 5 indicate scenarios where baselines still perform on par with the proposed methods.


[Answer 3] Thank you for pointing out this unclear statement. Our intention was to highlight that GuardT2I generally outperforms the baselines on average, as demonstrated in the last two columns of Table 2. We acknowledge that there are scenarios, as indicated in Table 2 and Figure 5, where the baselines perform on par with our proposed method. We will refine this claim for clarity in the final version of our paper.


[Weakness 4] ... missed a key related work [X]...


[Answer 4] We appreciate the importance of including comprehensive related work in our paper. We will update the related work section to discuss RIATIG as a function-level adversarial attack for T2Is.


[Weakness 5 & Question 2] ... How is the approach able to infer the actual meaning of NSFW prompts correctly during inference, given it learns to reconstruct NSFW prompts during training?


[Answer 5]

Thank you for raising this concern. Our method effectively infers the true meaning of adversarial prompts due to the specific mechanism used in their generation. During the creation of adversarial prompts, an attacker sets a target NSFW prompt, such as "a completely naked man," or a target NSFW concept, like "nudity," as the optimization objective. The adversarial prompt is then designed to exclude explicit sensitive words while ensuring that its text embedding in the T2I's latent space closely resembles that of the target one, thereby achieving the desired attack effect.

Since our cLLM is trained on a readable, normal dataset, it tends to generate the target NSFW prompt when given the text embedding of an adversarial prompt, rather than simply reconstructing the adversarial prompt itself. This ensures our model can accurately reveal the true meaning behind adversarial prompts without inadvertently reconstructing them.


Thank you for recognizing our work! Your comments help us enhance our contribution.

Comment

Thank you for your detailed response; it has addressed most of my concerns. I believe the image quality experiments and the discussion on latency are important, and I hope the authors can include these rebuttal results in the revised manuscript. I will raise my score accordingly.

Comment

Dear Reviewer 1mLf,

We sincerely appreciate your support throughout the rebuttal period. Your insightful comments and suggestions for additional experiments significantly enhanced the quality of our paper. Thank you for your valuable contribution and for taking the time to review our work.

Sincerely, the Authors.

Official Review
Rating: 5

The paper introduces a framework called GuardT2I, designed to enhance the robustness of Text-to-Image models against adversarial prompts. It allows for the effective detection of adversarial intentions without compromising the performance of the Text-to-Image models.

Strengths

  1. This paper is well-written and easy to understand.
  2. The experimental results demonstrate the effectiveness of the proposed framework across diverse scenarios, from detecting adversarial prompts to inference time.

Weaknesses

  1. The effectiveness of the proposed method heavily relies on the performance of the c·LLM, which could be limited by the quality of its training data. Bias within the c·LLM could also lead to biased interpretations of prompts, such as scenes involving cultural backgrounds.

Questions

  1. What about the performance of GuardT2I on other T2I models like SD2.0, SD2.1, Midjourney, DALLE-3 and so on?

Limitations

Yes.

Author Response

To Reviewer 7kN6:

Thank you for the time and effort you have invested in reviewing our paper. We greatly appreciate your insightful comments, which we have addressed point by point below.


[Weakness 1] The effectiveness of the proposed method heavily relies on the performance of the c·LLM, which could be limited by the quality of its training data. Bias within the c·LLM could also lead to biased interpretations of prompts, such as scenes involving cultural backgrounds.


[Answer 1]

  • The proposed c·LLM within GuardT2I is a Large Language Model (LLM) pretrained on an extensive and diverse corpus. It therefore inherits a broad knowledge base, resulting in strong generalizability across various scenarios. This is evidenced by the results in Table 2, where our model outperforms classifier-based baselines.

  • Additionally, T2I services such as Midjourney, DALL-E, and Leonardo.AI can fine-tune GuardT2I on the same dataset used to train their T2I models. This ensures that prompt interpretations generated by GuardT2I align closely with those of the T2I models, maintaining consistency and accuracy.

  • We acknowledge that addressing biases and the quality of training data in LLMs is an ongoing challenge, even for widely-used models like GPT-4 and Gemini Pro. While tackling these challenges is beyond the scope of this paper, extensive research is being conducted in this area. Fortunately, GuardT2I is built upon LLM technology. Consequently, the strategies developed to mitigate biases and address uneven training data in LLMs can be effectively applied to enhance the performance and fairness of our model.


[Weakness 2] GuardT2I could lead to higher false positive rates, where benign prompts are misclassified as adversarial. This could frustrate benign users.


[Answer 2]

  • Rather than having a higher False Positive Rate (FPR), GuardT2I demonstrates a significantly lower FPR, even when compared to commercial defensive solutions. This is evidenced in the FPR@TPR95% section of Table 2, where compared to the second-best OpenAI Moderation, our GuardT2I reduces the False Rejection Rate (FRR) by 89.23%.

  • This improvement is attributed to GuardT2I's unique approach as the first generative paradigm defensive framework. Unlike classifiers such as OpenAI Moderation, which make relatively ambiguous decisions at the category level, GuardT2I performs case-by-case assessments. It compares each input prompt with its corresponding prompt interpretation, thereby detecting malicious prompts more accurately without compromising performance on benign prompts.


[Question 1] What about the performance of GuardT2I on other T2I models like SD2.0, SD2.1, Midjourney, DALLE-3 and so on?


[Answer 3]

  • We conduct experiments primarily on Stable Diffusion (SD) V1.5 due to its extensive adoption within the community and its status as a prototype for commercial T2I models. Since GuardT2I operates on the text-encoder embeddings of T2I models, SD V2.0 and V2.1, which employ the same text-encoder architecture as SD V1.5, should show similar performance on malicious prompts. Specifically, with the same text encoder, the identification ability of GuardT2I is not affected by the diffusion model. Moreover, given GuardT2I's excellent performance on SD V1.5 and the inherent filtering of sexual content in SD V2.0 and V2.1, we believe GuardT2I will be even more effective at preventing NSFW content generation in these versions.

  • Regarding commercial models such as Midjourney and DALL-E, they have shown vulnerabilities to adversarial prompts like those from MMA-Diffusion and SneakyPrompt. In contrast, GuardT2I has demonstrated strong defensive capabilities against these adversarial prompts. Consequently, integrating GuardT2I into these commercial models would significantly enhance their security, which is the primary objective of our proposed GuardT2I.


We would like to reemphasize the aforementioned points in our main paper and hope that our responses have adequately addressed your concerns. Thanks again for your time.

Comment

Thanks for your rebuttal. I have raised my score.

Comment

Dear Reviewer 7kN6,

Thank you for your response and for raising the score. Once again, thanks for your time and efforts.

Sincerely,

The Authors.

Author Response

We sincerely thank all the reviewers for their constructive feedback and recognition of our work's strengths. We appreciate your acknowledgment of:

  • [Novel Approach]: Noted by Reviewers 1mLf, 1Dsg, and EP5J.
  • [Convincing and Comprehensive Experimental Results]: Praised by all reviewers.
  • [Valuable Evaluation on Adaptive Attacks]: Highlighted by Reviewers 1mLf and 1Dsg.
  • [Good Presentation / Well-Written]: Recognized by all reviewers.
  • [Commercial-Level Performance]: Acknowledged by Reviewers 1mLf, 1Dsg, and EP5J.

In this paper, we introduce a novel defensive framework named GuardT2I, specifically designed to protect Text-to-Image (T2I) models from adversarial prompts. This is a timely and critical topic given the growing popularity and deployment of T2I generation models (Reviewer 1mLf). GuardT2I establishes a new generative paradigm for the safety of T2I models, paving the way for future developments in the field.

We are also sincerely grateful to the reviewers for their insightful suggestions, which have significantly helped us to refine our paper. We've carefully considered all feedback and made extensive revisions to our manuscript. For your convenience, we've summarized the key concerns and our corresponding revisions:

  • [Evaluation on standard use cases] (Reviewers 7kN6, 1mLf, and 1Dsg): We have incorporated additional experiments to evaluate image quality (FID), text alignment (CLIP-Score), and false alarm rate (FPR@TPR95%).
  • [Comparison with other defense types] (Reviewer EP5J): We've introduced two concept removal defenses, namely ESD and SLD, as new baselines, showcasing GuardT2I's consistent efficacy.
  • [Evaluation on more adversarial prompt attacks] (Reviewer EP5J): We've included Ring-A-Bell and P4D in our evaluation and updated Table 2 accordingly. GuardT2I demonstrates consistent performance against these attacks, exhibiting robustness and broad applicability across various scenarios.
  • [Expanded discussion and clarification]: We've added necessary related works as suggested by Reviewers 1mLf and EP5J, along with detailed explanations of our methods (Reviewers 7kN6 and 1mLf) and the evaluation of inference time (Reviewers 1mLf and EP5J).

We appreciate all the suggestions made by reviewers to improve our work. We hope that our responses have adequately addressed your concerns.

Final Decision

This paper proposes a novel way to enhance the robustness of T2I models against adversarial prompts. All reviewers gave positive scores. The AC read all the rebuttals and reviewer comments and agrees with the scores. This paper should be accepted to NeurIPS. The AC hopes the authors will address the reviewers' comments in the camera-ready version.