AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models
Abstract
Reviews and Discussion
The paper presents an adversarial attack on image-to-image diffusion models to generate NSFW content. The authors train a Variational Autoencoder (VAE) to encode NSFW content into clean images and introduce an adaptive attack method to circumvent existing NSFW defense mechanisms. Through experiments on two image-to-image diffusion models, the authors demonstrate that their method can effectively bypass current defenses and generate NSFW content.
Questions for Authors
- Please refer to the weaknesses above for my questions.
- Can diffusion-based purification counter the proposed method? For instance, could a purification method that gradually removes the NSFW concept, as defined by Equation 1, during its diffusion process effectively mitigate the attack?
- In the MMA-Diffusion paper, a multimodal attack is presented. However, it is unclear which implementation of MMA-Diffusion is used in this paper. Could you provide further clarification on the specific implementation and how it relates to the proposed method?
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
Not applicable
Experimental Design and Analysis
Yes
Supplementary Material
Yes, the experiment part.
Relation to Prior Work
The paper evaluates the generation of NSFW content by diffusion models, an important topic concerning the appropriate use of these models.
Missing Important References
The authors propose an adversarial attack method, but it is primarily evaluated in a white-box scenario and on similar model architectures in the main evaluation.
Other Strengths and Weaknesses
Strengths:
- The paper is well-written and easy to follow.
- The exploration of NSFW topics in image-to-image diffusion models is novel.
- The attack methods, which use a trained generator to modify only the image input, are effective against current defenses.
- The experimental results are robust and demonstrate the effectiveness of the proposed attack in a white-box scenario.
Weaknesses:
- Since the adversarial image generator is trained using white-box diffusion models, it is crucial to evaluate its performance in black-box scenarios to demonstrate the generalization of the proposed attacks. However, the major experiments are conducted in a white-box setting. Additionally, the performance on the latest SDv3.0 is significantly lower, which poses a notable limitation for the further deployment of the proposed attacks.
- The proposed method incorporates the concept of current defenses into the training of the adversarial image generator, which ensures better performance against these defenses. The authors should evaluate the method against more unseen or advanced defenses, such as using multimodal LLMs for defense, to further validate the usability of the proposed attacks.
- The experiments are conducted solely on open-source models. It is essential to assess the performance on models offered as online services, such as Midjourney and Leonardo.Ai, to gain a comprehensive understanding of the method's effectiveness.
- The experiments are performed only on a test set of a few hundred images drawn from the data used to train the generator, which may not be sufficient to evaluate an adversarial attack method on large SD models. Moreover, the performance on SDv3.0 indicates that the training data is critical for the success of the proposed attacks, as they are trained with the target diffusion model. The authors should provide guidelines on selecting the training dataset and explain the rationale behind their choices.
Other Comments or Suggestions
The proposed method has the potential to serve as a strong defense; the paper could be substantially enhanced if the authors constructed a defense method.
Thank you for the insightful comments. Below, we thoroughly address each point with additional experiments and clarifications.
W1: Experiments mostly conducted in a white-box setting; limited effectiveness on SDv3.0.
A1: We have conducted additional black-box experiments to evaluate transferability across I2I models. Table 7 in the manuscript shows strong black-box transferability from SDv1.5 to SDv2.0 and SDv2.1 (ASRs of 80.5% and 84.0%, respectively; adaptive version: 73.5% and 77.5%), validating real-world applicability.
We also evaluated AdvI2I-Adaptive under defenses across multiple I2I models. The results demonstrate the attack's persistence when transferring from SDv1.5 to (black-box) SDv2.0 and SDv2.1.
| Target Model | w/o Defense | SLD | SD-NP | GN | SC |
|---|---|---|---|---|---|
| SDv1.5-Inpainting | 78.5% | 75.0% | 75.5% | 72.5% | 72.0% |
| SDv2.0-Inpainting | 73.5% | 72.5% | 75.5% | 69.5% | 67.0% |
| SDv2.1-Inpainting | 77.5% | 73.0% | 76.0% | 73.0% | 70.0% |
| SDv3.0-Inpainting | 33.0% | 30.5% | 30.5% | 27.0% | 30.0% |
Moreover, we extended our evaluation to newer diffusion models. Transferring attacks from SDv1.5 to SDXL-Turbo and FLUX yields ASRs of 62.5% and 74.0%, respectively, further highlighting the method's generalization across model architectures in the black-box setting.
| Source Model | Target Model | ASR |
|---|---|---|
| SDv1.5-Inpainting | SDXL-Turbo | 62.5% |
| SDv1.5-Inpainting | FLUX.1-dev ControlNet Inpainting-Alpha | 74.0% |
Regarding the explanation for the reduced ASR on SDv3.0, please refer to A6 of our response to Reviewer rvyv.
W2: Evaluation against multimodal LLMs.
A2: We appreciate this valuable suggestion. Evaluating AdvI2I against a multimodal LLM defense (GPT-4o) significantly reduced the ASR from 81.5% to 9.5%, demonstrating promising defensive potential. However, we observed practical limitations of directly using a multimodal LLM as a defense:
- High computational overhead: GPT-4o detection (2.94 sec/image) exceeds generation time (1.39 sec/image, SDv2.1 with 50 steps).
- Misclassifications due to limited visual understanding, even with carefully designed prompts (detailed examples and failure cases are provided in this link).
- Potential vulnerability to adaptive adversarial attacks [1, 2].
Thus, while promising, multimodal LLM-based defenses currently face practical challenges. Improving these methods remains an important future research direction.
[1] Stop Reasoning! When Multimodal LLM with Chain-of-Thought Reasoning Meets Adversarial Image.
[2] On the Robustness of Large Multimodal Models Against Image Adversarial Attacks.
W3: Evaluation on online services.
A3: Thank you for this suggestion. We conducted additional experiments on Leonardo.Ai (Phoenix 1.0). Despite the online platform's strong protections, AdvI2I (34.5%) and AdvI2I-Adaptive (31.5%) still significantly outperform the baseline MMA (28.0%), further confirming our method's real-world effectiveness.
| Method | ASR |
|---|---|
| MMA | 28.0% |
| AdvI2I (ours) | 34.5% |
| AdvI2I-Adaptive (ours) | 31.5% |
W4: Limited evaluation dataset size and dependency on training data. Guidelines for dataset selection needed.
A4: We clarify that our training and test samples were randomly split and do not fully overlap. We have also verified the sample transferability of AdvI2I on a completely unseen set (new images and prompts never seen during training) in Table 5 of the manuscript.
For training dataset selection, we specifically chose non-NSFW images from the "sexy" category provided by the NSFW Data Scraper [1] because these images prominently feature people, closely aligning with real-world scenarios and attack contexts that attackers are likely to target.
Regarding the explanation for the reduced ASR on SDv3.0, please refer to A6 of our response to Reviewer rvyv.
[1] https://github.com/alex000kim/nsfw_data_scraper?tab=readme-ov-file#nsfw-data-scraper
C1: Potential as a defensive method.
A5: We appreciate your insight. We tested our method’s defensive potential by using AdvI2I to embed a "wearing clothes" concept into images. Interestingly, this significantly reduced the ASR (96.5% → 24.5%) of explicit prompts (e.g., "Make the woman naked") on SDv1.5-Inpainting. This demonstrates AdvI2I’s broader conceptual versatility and suggests promising future defensive applications. We will include this discussion in our revised paper to suggest directions for future research.
Q1: Can diffusion-based purification counter the proposed method?
A6: Please see A1 of our response to Reviewer FsK8. Thank you.
Q2: The implementation of MMA-Diffusion used in this paper.
A7: MMA generates adversarial text prompts and corresponding images to bypass diffusion model safety filters. We replaced the standard prompts in our dataset with MMA-generated adversarial prompts and optimized adversarial images starting from our test images.
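For readers unfamiliar with this baseline, the image branch of such an attack amounts to per-image optimization in image space against the safety checker. The sketch below illustrates that general style of optimization; it is a hypothetical PGD-style example, not MMA's actual code, and `nsfw_score` is a placeholder for a differentiable detector score.

```python
import torch

def pgd_attack(x, nsfw_score, eps=16 / 255, alpha=2 / 255, steps=100):
    """Per-image L_inf PGD that lowers a differentiable NSFW-detector score.

    `nsfw_score` is a hypothetical callable returning a scalar tensor; the real
    MMA pipeline and the SD safety checker differ in their details.
    """
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = nsfw_score(x_adv)                      # score the checker assigns to the image
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()       # step to lower the detector score
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)             # keep pixel values valid
    return x_adv.detach()
```

Because this loop runs separately for every input image, its cost scales with the number of optimization steps, which is what the runtime comparison in A2 of our response to Reviewer FsK8 reflects.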
Thank you for the detailed responses. The authors have addressed most of my concerns. However, there are some suggestions common to all the reviewers regarding the dataset and transferability issues. The dataset is relatively small to effectively demonstrate the performance of the proposed attacks in practical scenarios. Additionally, the performance drop on SDv3.0 further indicates that the choice of dataset is crucial for the success of the proposed attack. Although the authors provided some experimental results with newer diffusion models, I still believe that transferability should be a major focus of the paper (with comparisons to baselines) rather than a minor aspect. Therefore, the current version of the paper may not meet the conference criteria. The paper could be significantly improved by enhancing the section on defense strategies. I will keep my score.
- This paper proposes AdvI2I, a novel framework that induces diffusion models to generate NSFW content using adversarial images.
- It circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without modifying text prompts, underscoring the urgent need for stronger security measures to prevent the misuse of I2I diffusion models.
Questions for Authors
- I would like the authors to address the weaknesses outlined above.
- How can SAM be used to defend against your attack, in a way similar to MACE: Mass Concept Erasure in Diffusion Models? I believe applying a SAM-based defense at the end of the pipeline could effectively mitigate most attacks.
Claims and Evidence
- Yes, the details can be found in the Strengths and Weaknesses of my review below.
Methods and Evaluation Criteria
- It makes sense, and the authors provide a fair comparison.
Theoretical Claims
- No proofs are provided in this paper.
- I believe the idea makes sense.
Experimental Design and Analysis
- Since most settings are the same as the baseline, I believe the experiments are fair enough.
- I would like to request that the authors release the code as soon as the paper is accepted.
Supplementary Material
- There is no supplementary material.
Relation to Prior Work
- The paper investigates the security issues of I2I diffusion models and proposes an adversarial image-based approach to attack I2I diffusion models, addressing the limitations of previous text-based attacks in [1][2].
- [1] Yang, Y., Gao, R., Wang, X., Ho, T.-Y., Xu, N., and Xu, Q. MMA-Diffusion: Multimodal attack on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [2] Ma, J., Cao, A., Xiao, Z., Zhang, J., Ye, C., and Zhao, J. Jailbreaking prompt attack: A controllable adversarial attack against diffusion models. arXiv preprint arXiv:2404.02928, 2024.
Missing Important References
None.
Other Strengths and Weaknesses
- Strengths
- This paper explores a novel problem, that is, adversarial image attacks targeting I2I diffusion models.
- The paper proposes a well-designed and effective novel pipeline.
- The paper considers various experiments, including transferability and image quality. The experimental results, which compare against baselines and different defense methods, demonstrate the method’s generalizability.
- The paper is well-written and easy to follow.
- Weaknesses:
- AdvI2I relies on adversarial images. It would be beneficial to explore the robustness of the method against adversarial defense strategies, such as DiffPure [1], to demonstrate the effectiveness of adversarial image attacks in a more comprehensive manner.
- [1] Nie W, Guo B, Huang Y, et al. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.
- There is no detailed analysis of runtime comparisons, nor is there adequate discussion on the computational costs and efficiency of the proposed method. This could limit its practical feasibility.
- The main results are limited to SDv1.5-Inpainting and InstructPix2Pix, excluding more advanced versions of Stable Diffusion or other models like FLUX. This limits the generalization potential of AdvI2I.
- The proposed attack requires a white-box setting with a safety checker. The experiments in the appendix only evaluate transferability for ViT-L/14-based models. Is this evaluation comprehensive enough to assess the effectiveness of the method on existing diffusion models?
- The method section introduces the concept of Adaptive Attack. How does it differ from the Image-Modal Attack in MMA? Please clarify the distinction between the two approaches.
Other Comments or Suggestions
- In line 15 of Algorithm 1, should it refer to Eq. 2 instead of Eq. 4 when not using AdvI2I-Adaptive?
Thank you for the insightful feedback. We have conducted additional analyses and experiments to address your concerns.
W1: The robustness of the method against adversarial defense strategies, such as DiffPure.
A1: We have now evaluated the robustness of AdvI2I against DiffPure as suggested. DiffPure reduces the ASR for the SDv1.5-Inpainting model (nudity concept) from 82.5% to 72.5%, a modest decline indicating AdvI2I’s robustness to such purification defenses.
| Method | w/o Defense | DiffPure |
|---|---|---|
| Attack VAE | 41.5% | 33.5% |
| AdvI2I (ours) | 82.5% | 72.5% |
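For context, DiffPure-style purification forward-diffuses the (possibly adversarial) input to an intermediate timestep and then denoises it with a pretrained diffusion model before it reaches the I2I pipeline. A rough sketch of this purification loop is shown below, assuming the diffusers library; the checkpoint and t_star are illustrative placeholders, not our exact experimental settings.

```python
import torch
from diffusers import DDPMPipeline

# Illustrative DiffPure-style purification (not the official DiffPure code):
# noise the input image up to step t_star, then run the reverse chain back.
pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")  # placeholder checkpoint
unet, scheduler = pipe.unet, pipe.scheduler
scheduler.set_timesteps(1000)

@torch.no_grad()
def purify(x, t_star=100):
    """x: images in [-1, 1] with shape (B, 3, 256, 256) for this checkpoint."""
    noise = torch.randn_like(x)
    x_t = scheduler.add_noise(x, noise, torch.tensor([t_star]))   # forward diffusion
    for t in scheduler.timesteps[scheduler.timesteps <= t_star]:  # reverse from t_star
        eps = unet(x_t, t).sample                                 # predicted noise
        x_t = scheduler.step(eps, t, x_t).prev_sample             # one denoising step
    return x_t.clamp(-1.0, 1.0)
```

The modest ASR drop in the table above suggests that perturbations produced by our generator partially survive this kind of noising-and-denoising step.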
W2: Runtime comparisons or discussion on the computational costs of the proposed method.
A2: Thank you for highlighting this point. We conducted additional experiments comparing AdvI2I to MMA on SDv1.5-Inpainting, measuring ASR and runtime cost. The results clearly demonstrate AdvI2I’s efficiency advantage. AdvI2I maintains high ASR (82.5%) at an extremely low attack runtime (only 0.008 sec/image), demonstrating superior practical utility.
| Method | ASR (%) | Average Time Cost (s) |
|---|---|---|
| MMA | 42.0 | 415.984 |
| AdvI2I (ours) | 82.5 | 0.008 |
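For reference, the low per-image cost comes from the fact that a generator-based attack needs only one forward pass at inference time. The snippet below is an illustrative way to measure such a figure; the generator `G` is a placeholder network rather than our trained model, and absolute timings depend on hardware.

```python
import time
import torch

G = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1).eval()  # placeholder for the trained noise generator
x = torch.rand(16, 3, 512, 512)                             # a batch of clean images in [0, 1]

with torch.no_grad():
    start = time.perf_counter()
    x_adv = (x + G(x)).clamp(0.0, 1.0)                      # one forward pass produces all perturbations
    elapsed = time.perf_counter() - start

print(f"{elapsed / x.shape[0]:.4f} s per adversarial image")
```

In contrast, MMA must run an iterative optimization loop for every input image, which is why its per-image cost is several orders of magnitude higher.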
W3: More advanced versions of Stable Diffusion or other models like FLUX.
A3: We appreciate this valuable suggestion. We have now evaluated AdvI2I on recent advanced diffusion models, including SDXL-Turbo (incorporating a refiner block) and FLUX (using diffusion transformers). As summarized in the table below, AdvI2I consistently exhibits strong generalization capabilities across these new models.
| Method | Model | w/o Defense | SD-NP | GN | SC |
|---|---|---|---|---|---|
| AdvI2I | SDXL-Turbo | 82.5% | 79.5% | 66.0% | 18.5% |
| AdvI2I | FLUX.1-dev ControlNet Inpainting-Alpha | 80.0% | 76.5% | 65.0% | 18.5% |
W4: The proposed attack requires a white-box setting with a safety checker.
A4: The safety checker we evaluated is widely adopted across diffusion models. However, to comprehensively assess black-box transferability, we additionally evaluated AdvI2I-Adaptive using an entirely different NSFW detector (MHSC [1]) as the safety checker. The results below confirm that AdvI2I-Adaptive achieves significantly higher transferability (41.0%) than MMA (20.5%), underscoring the generalization advantage provided by our learned adversarial noise generator.
| Method | w/o Defense | SC | Black-box MHSC |
|---|---|---|---|
| MMA | 68.5% | 64.5% | 20.5% |
| AdvI2I-Adaptive (ours) | 78.0% | 70.5% | 41.0% |
W5: Clarify the distinction between AdvI2I-Adaptive and the Image-Modal Attack in MMA.
A5: The key differences are as follows:
- Attack Requirement: MMA’s adversarial image targets only the safety checker, and still requires an unsafe prompt to induce the diffusion model to generate NSFW content. This reliance makes it easier to defend against (see Table 2 in our manuscript). In contrast, AdvI2I-Adaptive requires only an adversarial image, which simultaneously fools both the diffusion model and the safety checker, without needing an unsafe prompt.
- Attack Method: MMA performs direct image-space optimization per input, while AdvI2I-Adaptive uses a generator to produce adversarial images conditioned on clean inputs. This makes our method more effective (see Table 9 in the manuscript) and more efficient (see A2 of Reviewer FsK8).
Q1: In line 15 of Algorithm 1, should it refer to Eq. 2 instead of Eq. 4 when not using AdvI2I-Adaptive?
A6: You are correct; line 15 of Algorithm 1 should reference Eq. 2 when describing the standard AdvI2I (non-adaptive version). We sincerely appreciate your careful observation and will correct this mistake.
Q2: How can SAM be used to defend your model similarly to MACE: Mass Concept Erasure in Diffusion Models? I believe applying SAM as a defense could effectively mitigate most attacks.
A7: We appreciate your insightful suggestion. However, we found that directly applying SAM to detect and mask NSFW content of generated images is ineffective. Specifically, SAM indiscriminately masks entire human bodies or clothes, regardless of actual NSFW content presence. In our validation with 200 non-NSFW images, SAM consistently produced masks inaccurately labeling human figures as containing nudity—even when explicitly prompted with specific sensitive body parts (e.g., "breasts," "genitalia") that were absent. This suggests SAM inherently attempts to mask areas matching textual prompts, regardless of content appropriateness, and struggles to accurately interpret abstract concepts such as "nudity". Consequently, using SAM as an effective defense would require methodological refinements, which we believe represent a valuable direction for future research.
We provide representative examples in this link.
Thank you again for these valuable comments. We will include these experiments and analyses in the revised version.
This paper proposes AdvI2I, a framework for adversarial image attacks on image-to-image (I2I) diffusion models to induce NSFW content generation without modifying text prompts. By training a generator to inject perturbations aligned with NSFW concept vectors (extracted via contrastive text pairs), AdvI2I bypasses defenses like LLM filters. The enhanced AdvI2I-Adaptive further improves robustness through Gaussian noise and NSFW embedding similarity minimization. Experiments demonstrate high attack success rates against defenses, exposing I2I model vulnerabilities and urging stronger safeguards.
Questions for Authors
No.
Claims and Evidence
The claims are empirically supported through: (1) a demonstration that adversarial text prompts are easily defended against (34-96% ASR reduction via LLM filters in Table 2), and (2) AdvI2I-Adaptive's maintained high ASR (>70%) under SLD and safety checker defenses. However, the low transferability to SDv3.0 (34% ASR) lacks sufficient analysis to explain the generalization limits.
Methods and Evaluation Criteria
Evaluation Limitations: The attack success metrics depend entirely on algorithmic detectors (NudeNet, Q16 classifier). While these provide quantitative benchmarks, they cannot fully capture real-world human perceptual thresholds.
Theoretical Claims
No theoretical analysis is provided.
Experimental Design and Analysis
- Dataset Limitations: The dataset (400 images from the "sexy" category) exhibits selection bias and lacks representation of critical NSFW concepts such as political extremism, hate symbols, or graphic violence. The small sample size (200 test images) also raises concerns about statistical significance, especially for FID.
- Model Coverage: Evaluations lack state-of-the-art models such as SDXL and PixArt-α, limiting insights into modern I2I pipelines.
- The low ASR (34%) on SDv3.0 is attributed to data filtering without ablation studies.
Supplementary Material
Yes, I have reviewed the whole supplementary material.
Relation to Prior Work
Previous text-based attacks, such as QF-Attack, Ring, and MMA, have optimized adversarial text prompts to induce NSFW content in T2I models. In contrast, AdvI2I generates adversarial perturbations directly on input images to induce NSFW content.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
- Clear motivation: the paper identifies critical limitations of text-based adversarial attacks and proposes AdvI2I, which generates adversarial image perturbations to induce NSFW content in I2I diffusion models.
- Well-designed method: combining NSFW concept vector extraction with an adversarial generator enables efficient perturbation generation.
Weaknesses:
- Concerns regarding sample transferability: The method proposed in this paper relies on generating adversarial samples through an image generation model. This approach may result in a heavy dependency on training data and known I2I (image-to-image) models, casting doubt on its effectiveness when applied to unknown samples or untested I2I models. The low ASR (34%) on SDv3.0 seems to verify this point.
- Impractical methodology: The adversarial noise intensity in the experiments is excessively high (ranging from 32/255 to 128/255). Such noise levels introduce visually perceptible anomalies, which would likely trigger immediate rejection of the samples in subsequent processing pipelines. Furthermore, overly conspicuous noise is also susceptible to detection by defense models, ultimately rendering this approach impractical in real-world applications.
- Evaluation Limitations: The attack success metrics depend entirely on algorithmic detectors (NudeNet, Q16 classifier). While these provide quantitative benchmarks, they cannot fully capture real-world human perceptual thresholds.
- Dataset Limitations: The dataset (400 images from the "sexy" category) exhibits selection bias and lacks representation of critical NSFW concepts such as political extremism, hate symbols, or graphic violence. The small sample size (200 test images) also raises concerns about statistical significance, especially for FID.
- Model Coverage: Evaluations lack state-of-the-art models such as SDXL and PixArt-α, limiting insights into modern I2I pipelines.
- The low ASR (34%) on SDv3.0 is attributed to data filtering without ablation studies.
Other Comments or Suggestions
No.
Thank you for the insightful comments and suggestions.
W1: Concerns regarding sample transferability.
A1: Our training and test samples were randomly split, so they do not fully overlap. We have also verified the sample transferability of AdvI2I on unseen data (new images and prompts never seen during training) in Table 5 of the manuscript.
For model transferability, Table 7 of the manuscript shows that AdvI2I achieves high ASRs (80.5% on SDv2.0 and 84.0% on SDv2.1) when transferred from SDv1.5. Regarding the relatively lower ASR on SDv3.0, please see our explanation in A6 below.
We also extended our evaluation to newer diffusion models. The results below further highlight the method's transferability across model architectures.
| Source Model | Target Model | ASR |
|---|---|---|
| SDv1.5-Inpainting | SDXL-Turbo | 62.5% |
| SDv1.5-Inpainting | FLUX.1-dev ControlNet Inpainting-Alpha | 74.0% |
W2: High adversarial noise intensity.
A2: We compared AdvI2I against baselines using a lower noise bound (16/255) on InstructPix2Pix. The results confirm that AdvI2I maintains effectiveness.
| Method | ε | w/o Defense | SLD | SD-NP | GN | SC |
|---|---|---|---|---|---|---|
| AdvI2I | 16/255 | 70.5% | 68.5% | 70.0% | 55.5% | 14.5% |
| AdvI2I-Adaptive | 16/255 | 70.5% | 65.5% | 68.5% | 62.5% | 55.5% |
In addition to these quantitative results, Figure 2 of the manuscript also qualitatively shows that our noise generator produces perturbations (64/255) that remain visually imperceptible to humans.
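To make the role of the noise bound concrete, the sketch below shows one common way to constrain a generator's output to an L_inf budget such as 16/255. The tanh scaling here is an illustrative assumption; the exact parameterization in our implementation may differ.

```python
import torch

def bound_perturbation(raw, eps=16 / 255):
    # Squash an unconstrained generator output so that ||delta||_inf <= eps.
    # tanh scaling is one common choice; hard clamping is another.
    return eps * torch.tanh(raw)

raw = torch.randn(1, 3, 512, 512)      # hypothetical unconstrained generator output
delta = bound_perturbation(raw)
x = torch.rand(1, 3, 512, 512)         # clean image in [0, 1]
x_adv = (x + delta).clamp(0.0, 1.0)    # the adversarial image remains a valid image
assert delta.abs().max() <= 16 / 255
```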
W3: Evaluation depends on algorithmic detectors.
A3: We follow the common practice in prior studies [1, 2, 3, 4] to use these detectors (NudeNet, Q16) for consistent quantitative benchmarking. These detectors are trained on human-labeled datasets and thus reflect human perception to some extent.
To further strengthen our evaluation, we tested AdvI2I using a multimodal LLM (GPT-4o) with carefully designed prompts to simulate human judgment. The prompt, examples, and results are provided in the linked file. The results closely align with the algorithmic detectors.
[1] Safe Latent Diffusion: Mitigating inappropriate degeneration in diffusion models.
[2] Ring-A-Bell! How reliable are concept removal methods for diffusion models?
[3] SneakyPrompt: Jailbreaking text-to-image generative models.
[4] MMA-Diffusion: Multimodal attack on diffusion models.
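For concreteness, the multimodal-LLM check described above can be scripted roughly as follows. This is a minimal sketch assuming the OpenAI Python SDK; the prompt wording here is hypothetical and much simpler than the carefully designed prompt shared in the linked file.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def flagged_as_nsfw(image_path: str) -> bool:
    """Ask a multimodal LLM whether a generated image is sexually explicit."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Does this image contain nudity or sexually explicit content? "
                         "Answer with exactly one word: YES or NO."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```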
W4: Insufficient coverage of concepts, and small test size.
A4: We appreciate this suggestion. In addition to the "nudity" and "violence" concepts covered in our manuscript, we further evaluated the suggested "political extremism" concept. The concept vector is constructed from prompts related to "extremism" and "terrorism". The results confirm AdvI2I's versatility across diverse NSFW concepts.
| Method | Concept | w/o Defense | SLD | SD-NP | GN | SC |
|---|---|---|---|---|---|---|
| AdvI2I | Extremism | 76.5% | 73.0% | 73.5% | 60.5% | 27.5% |
| AdvI2I-Adaptive | Extremism | 74.5% | 70.0% | 72.5% | 71.5% | 72.0% |
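As an aside on how such a concept vector can be formed from contrastive prompts, the sketch below illustrates the general idea; it is not the paper's Eq. 1 implementation, and the model identifier and prompt lists are illustrative assumptions.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_id = "openai/clip-vit-large-patch14"  # the SD v1.x text encoder; illustrative choice
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id).eval()

concept_prompts = ["a scene of violent extremism", "a scene of terrorism"]        # illustrative
neutral_prompts = ["a scene of a peaceful gathering", "a scene of a quiet street"]

@torch.no_grad()
def embed(prompts):
    tokens = tokenizer(prompts, padding=True, return_tensors="pt")
    return text_encoder(**tokens).pooler_output  # (N, 768) pooled text embeddings

# Contrastive concept direction: mean concept embedding minus mean neutral embedding.
concept_vector = embed(concept_prompts).mean(dim=0) - embed(neutral_prompts).mean(dim=0)
```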
Regarding sample size, our primary metric (ASR) typically does not require massive samples for reliable evaluation. Previous works ([1] Ring-A-Bell: 200 prompts, [2] MMA: 60 images) employed similar or smaller evaluation sets. Nonetheless, we now increased our test set to 500 images, and AdvI2I continues to demonstrate consistent effectiveness.
| Method | w/o Defense | SLD | SD-NP | GN | SC |
|---|---|---|---|---|---|
| AdvI2I | 75.2% | 71.2% | 71.6% | 62.0% | 17.0% |
| AdvI2I-Adaptive | 71.4% | 65.4% | 65.8% | 68.6% | 65.4% |
Additionally, for image quality (Section F of our manuscript), we use several metrics in addition to FID. These metrics are less sensitive to sample size and consistently validate the high visual quality of our generated images.
W5: Evaluations on new I2I models.
A5: Please see A3 of our response to Reviewer FsK8. Thank you for your advice.
W6: The low ASR on SDv3.0.
A6: The relatively lower performance on SDv3.0 is primarily due to its explicitly filtered training dataset, as noted in [1]. Interestingly, when we directly used prompts requesting nudity on SDv2.1 and SDv3.0 without defenses, SDv2.1 easily generated such content, while SDv3.0 did not. Moreover, even adversarial prompts (e.g., QF, Ring) fail significantly more often on SDv3.0 than on earlier SD models.
| Model | QF | Sneaky | Ring | MMA | MMA-Mask |
|---|---|---|---|---|---|
| SDv1.5 | 68% | 48% | 98% | 100% | 64% |
| SDv2.1 | 62% | 46% | 88% | 94% | 64% |
| SDv3.0 | 14% | 34% | 48% | 46% | 28% |
This suggests that SDv3.0 carries a lower risk of generating NSFW content, regardless of the attack method. Therefore, a potential future direction for enhancing I2I safety is to remove NSFW concepts from the model entirely by thoroughly cleaning the training data.
[1] Scaling rectified flow transformers for high-resolution image synthesis.
Thank you again for your valuable feedback.
This paper received two weak rejects and one weak accept. After careful consideration of the submission, the reviews, and the rebuttal, this meta-review recommends weak accept.
On the positive side, this paper tackles an important problem of image-to-image diffusion model safety with clear motivation, a novel solution, and thorough experimentation.
The main concerns focus on the limitations in dataset scope (400 images in the submission, extended to 500 in the rebuttal) and the transferability of the attack (especially to SDv3.0). In the rebuttal, the authors also pointed out that previous works ([1] Ring-A-Bell: 200 prompts, [2] MMA: 60 images) employed similar or smaller evaluation sets, so the dataset scope seems to be a fair setting. Following the initial reviews, the authors conducted comprehensive evaluations across different models, including transferability and black-box settings, which show strong empirical results and address real-world applicability.
Even though some concerns remain (e.g., the paper could be further enhanced if the authors constructed a defense method), this meta-review considers the main concerns largely addressed and thus recommends weak accept.