TriggerCraft: A Framework for Enabling Scalable Physical Backdoor Dataset Generation with Generative Models
TriggerCraft is a generative framework that automates physical backdoor dataset creation, enabling realistic attack simulation and research without real-world setup.
Abstract
Reviews and Discussion
The paper introduces TriggerCraft, which is a new method to generate simulated real world backdoor attack datasets. It utilises existing VQA models to suggest the class for the trigger, followed by existing image editing/synthesis models to insert the trigger into the images, followed by image-generation scoring to filter the generated images.
Strengths and Weaknesses
Strengths:
- The paper tackles an important task of enabling the study of real world poisoning attacks.
- Investigating the applicability of using image synthesis models to this end and their transferability to real world settings are interesting.
Weaknesses:
- My main concern with this paper is the writing and clarity of description which make it hard to understand the motivation, the experimental setup, and the results. This made it challenging to evaluate the correctness of the claims. See some specific comments below.
- The motivation for this study could also be improved in my opinion. Physical world triggers are very relevant and interesting (for evasion by wearing clothes or accessories for instance) as the authors mentioned. However, I am unsure what the relevant setting is for physical world poisoning of the training set (L103 - "poisoning 5% of ImageNet"). This would mean physically planting an object in tens of thousands of real images. Also, as the proposed approach inherently suggests "editing pixels" (with an image synthesis/editing model), what sort of insights do the authors suggest would transfer from there to physical world settings?
Questions
Suggestions:
- The details, and text especially in Figure 1 could be made notably larger to improve readability and accessibility.
- The figures should be placed near where they are first referenced. For instance Figure 2 appears 3 pages before it is referenced which makes reading less fluent.
- Include an experimental setup with more specific details of the setup and experiments (or forward reference these in the appendix) - for instance which dataset was used for sub-section 5.1?
Questions:
- Could the authors clarify whether the trigger is meant to be a specific object instance (e.g my specific bag) or a general class (e.g any handbag)? Which of these two settings would the authors say are more applicable in the real world? Perhaps it is worth adding some experiments with image synthesis models which use a specific object.
- Continuing the question above - could the authors clarify the type of backdoor attack used? Is the trigger meant to induce a "targeted" attack? Is this a clean label attack?
- Could the authors confirm that their approach focuses on also generating the poisoned datasets with the synthesis models aiming to simulate "physical world"? If so, what is the motivation for this as data for training is usually scraped and stored digitally and therefore editing pixels could be just as likely as editing in the physical world? If the motivation is in fact that your approach simulates real world data for poisoning it would be beneficial to show that insights from the simulated setup hold for poisoning with real world objects.
- If the choice of trigger is not per-image and also just a general task - is the mental load of selecting an object class indeed challenging? Does a human choice lead to higher or lower quality than an automated pick?
- L242-244 - "Such requirements stagnated the development of physical backdoor research, as these metrics could not effectively score a “good” synthesized image with physical backdoors." - could the authors provide support for this claim (i.e that this is the reason for lack of research into the field)? In fact, the authors themselves use an existing per-image metric, showing that such metrics existed prior to this work.
- Table 2 - what do "Real" CA and "Real" ASR refer to?
Phrasing, typos:
- The authors frequently use "surreal" to depict synthetically generated images which aim to be natural looking. I think this phrasing is confusing.
- L232 - " is indeed of utmost crucial" -> "is indeed of utmost importance"/"is crucial"
- L234 - "is nowhere to be done" - this should be rephrased as it is not clear. Are you suggesting that no metrics to suggest whether a generated or edited image are natural?
- L272 - "Realistic Vision" could the authors provide a reference or citation?
- L288 - the \sim symbol is placed incorrectly above 15.
Limitations
Yes, however I would suggest adding somewhere a short phrase about the potential negative impact of making backdoor attacks easier - while highlighting the importance of investigating this to improve defense methods.
Final Justification
The motivation to enable research on backdoor attacks, by lowering annotation efforts, is interesting. However, I have remaining concerns as to what extent the experiments provided support this motivation. Furthermore, the writing of the paper could improve a lot as the clarity is currently low, and the editing required is beyond the scope of a camera ready revision.
Formatting Issues
No
We thank the reviewer for the constructive comments.
Q1: Writing and clarity of the paper
Thank you for the comments. We will respond to the comments separately below, and revise the paper accordingly.
Q2: Motivation
We’d like to clarify that the problem mentioned in the reviewer’s comment is the exact challenge we’re solving in this paper. Specifically, L103 highlights the intense workload needed to “physically implant objects into tens of thousands of real images” when constructing a physical backdoor dataset, which is the current major bottleneck of physical backdoor research. Our paper tackles this bottleneck by designing a novel generative-model framework for synthesizing physical backdoor datasets that are comparable to manually collected real-world datasets for backdoor studies, yet can be produced within laboratory constraints (as discussed in the Abstract and Introduction). We hope this can help enable and accelerate physical backdoor research.
Q3: Insights for "editing pixels" to physical world settings
The Image Generation and Image Editing models are the means to synthesize a physical backdoor dataset, which is challenging to collect manually. Our paper empirically demonstrates (specifically, with comparable ASRs between synthetic and real poisoned samples) that the proposed framework, based on these generative models, can indeed provide a valuable environment for studying physical backdoor research within the confines of the researchers’ laboratories, which is a significant contribution.
Q4: Trigger types (specific/general) and their applicability in the real world
In backdoor attacks, the attack medium (i.e., the trigger) can vary from a specific object (e.g., my specific bag) to a general class of objects (e.g., any handbag), depending on the objectives of the researcher’s study. A specific object is often more “secretive” and gives the adversary “more control” over when the model would misclassify, while a general class of objects is less secretive and offers less control; the general class, however, is also more likely to lead to false positives (i.e., falsely triggering the backdoor, as discussed in Q2 of 5CCA or Q6 of CiuS). Our work allows greater control over the generality (from specific to general) of the trigger, by selecting the compatibility level in the Trigger Suggestion Module, or by tweaking the prompt of the diffusion models (we do not explore this direction in this paper as it deserves an independent extension of our work).
We do not think the specific trigger is more applicable than the general trigger in real world settings, because the general trigger (e.g., any handbag) can be an unintentional, spurious feature associated with a target class (e.g., most cat images have handbags, thus a handbag is a discriminative feature or “trigger” of the cat images, unintentionally), which is also a serious security risk.
Finally, our experiments include a comparable example to “using a specific object”. Specifically, the tennis ball is an object that is consistent (similar to my handbag) but generally does not exist in the existing dataset. Please refer to our experiments on tennis ball for more details.
Q5: Type of backdoor attack used
Our work focuses on targeted backdoor attacks in the dirty-label setting. We first synthesize poisoned images, then perturb their class labels to the target label, making it a dirty-label attack.
Nevertheless, our framework could be further extended to study clean label attacks in the physical world, which we leave for future exploration.
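For concreteness, a minimal sketch of the dirty-label construction described above is shown below; the helper `synthesize_with_trigger` and the 5% poison rate are illustrative assumptions, not our released code:

```python
# Minimal sketch of dirty-label targeted poisoning, assuming a list of
# (image, label) pairs and an external synthesis step that inserts the
# physical trigger. `synthesize_with_trigger` is a hypothetical stand-in
# for the Trigger Generation module.
import random

def build_dirty_label_poison_set(dataset, synthesize_with_trigger,
                                 target_label, poison_rate=0.05):
    """Return a training set where `poison_rate` of the samples carry the
    synthesized physical trigger and are relabeled to `target_label`."""
    n_poison = int(len(dataset) * poison_rate)
    poison_ids = set(random.sample(range(len(dataset)), n_poison))
    poisoned = []
    for i, (image, label) in enumerate(dataset):
        if i in poison_ids:
            image = synthesize_with_trigger(image)  # image editing / generation
            label = target_label                    # dirty-label: flip the label
        poisoned.append((image, label))
    return poisoned
```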
Q6: Simulating physical world with synthesis models and its motivation
Yes, our framework aims to enable backdoor researchers to simulate physical backdoor datasets, via synthesizing them digitally, within the confines of their laboratories. Manually collecting physical datasets is arduous due to extensive approvals and limited budgets/resources. Thus, this paper aims at rigorously demonstrating the potential of editing in the digital space (i.e., synthesizing images) in emulating the manually collected dataset.
We follow the reviewer’s suggestions and tabulate our results below. We observe that with a unique trigger (tennis ball), Image Editing models achieve similar performance to the real data, while Image Generation models fall short. For generic triggers like books, Image Editing models performed relatively well, while Image Generation models failed to achieve a Real ASR comparable to the Real Data.
This is due to distributional shifts between the synthesized and real data, and such gaps are amplified from unique to generic triggers, and from Image Editing to Image Generation models. This suggests that Image Editing models and unique triggers are comparatively better in simulating a real world dataset. Despite certain limitations, current generative models are already sufficiently capable of supporting synthetic dataset creation for physical backdoor research; more importantly, as these models continue to advance, the limitations will diminish (e.g., the distributional gap is reduced for Image Generation), making them increasingly valuable for this line of research.
| Method (Trigger) | CA | ASR | Real CA | Real ASR |
|---|---|---|---|---|
| Real Data (Tennis Ball) | - | - | 94 | 100 |
| Image Editing (Tennis Ball) | 93 | 81 | 82 | 82 |
| Image Generation (Tennis Ball) | 94 | 99 | 80 | 44 |
| Real Data (Book) | - | - | 94 | 100 |
| Image Editing (Book) | 93 | 65 | 82 | 56 |
| Image Generation (Book) | 94 | 100 | 81 | 22 |
Q7: Mental load of trigger selection task and automated pick quality compared to humans' choices
Indeed, the mental load of selecting an object class is still challenging, especially when the number of classes is large (e.g., ImageNet with 1000 classes). This has been one of the most challenging aspects of (especially physical) backdoor research: the trigger should fit in the context of the image. Hopefully, the Trigger Suggestion Module in our framework can alleviate this for security researchers by suggesting a suitable trigger, while the Poison Selection Module helps filter out implausible images (i.e., images where the trigger does not fit the context), in line with human preferences.
Next, we rigorously assess the Trigger Suggestion Module’s ability to match human preferences (assuming the number of classes/images is small, minimizing cognitive load for human testers) in Sec. 4, Sup. Material. Specifically, we’ve conducted a human evaluation test for our Trigger Suggestion Module, to measure human preference for the suggested triggers. We discovered that 96% of the time, one of the five suggested triggers aligns with human preference, suggesting the VQA’s suggestions are highly similar to humans’ choices. This study suggests that the Trigger Suggestion Module can be a valuable tool for picking human-aligned triggers that also fit into the context well.
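As a rough illustration of how a VQA model could be queried for candidate triggers (not our exact implementation), the sketch below uses the public transformers VQA pipeline; the model name, question wording, and aggregation are assumptions for illustration:

```python
# Hedged sketch: ask a VQA model which objects plausibly co-occur with the
# class images, then aggregate confidence-weighted answers into candidates.
from collections import Counter
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering",
               model="dandelin/vilt-b32-finetuned-vqa")

def suggest_triggers(image_paths,
                     question="What object could naturally appear in this scene?",
                     top_k=5):
    votes = Counter()
    for path in image_paths:
        for ans in vqa(image=Image.open(path), question=question, top_k=top_k):
            votes[ans["answer"]] += ans["score"]  # weight candidates by VQA confidence
    return votes.most_common(top_k)               # most compatible trigger candidates
```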
Q8: Support for L242-244 claim
We’d like to first clarify that “these metrics” refers to conventional metrics (FID/IS), which alone are unable to quantify whether synthesized images are “good” for physical backdoor. Specifically, these metrics can measure distributional gaps between sets of real and synthesized datasets, but cannot quantify whether the trigger fits in the context of the image (surreality). As a result, we need a “per-image metric” (i.e., ImageReward), instead of distributional metrics.
The statement “such requirements stagnated the development of physical backdoor” refers to the fact that ensuring “surreality” (discussed in the previous sentence) requires security researchers to collect data manually, as they cannot yet simply use generative models to synthesize the poisoned samples and rely on FID/IS to ensure “surreality”. Hence, physical backdoor research progresses slowly.
We will revise this statement and make it clearer (incorporating the above explanation) in the final version.
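To make the distinction between distributional and per-image scoring concrete, a minimal sketch of the per-image filtering step is given below; `score_fn` is a stand-in for a human-preference scorer such as ImageReward, and the threshold value is an assumption:

```python
# Per-image poison selection: keep a synthesized sample only if its
# human-preference score for the prompt exceeds a threshold, unlike
# FID/IS, which score whole sets of images.
def select_poisons(candidates, prompt, score_fn, threshold=0.0):
    """candidates: list of (image, metadata) pairs; returns the subset
    whose per-image preference score passes `threshold`."""
    kept = []
    for image, meta in candidates:
        if score_fn(prompt, image) > threshold:  # per-image check
            kept.append((image, meta))
    return kept
```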
Q9: Real CA and Real ASR in Table 2
Real CA refers to the Clean Accuracy evaluated on the manually collected, real world dataset with physical triggers, while Real ASR refers to the Attack Success Rate on the aforementioned dataset.
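For clarity, the four metrics can be computed as in the hedged sketch below, assuming a standard PyTorch classifier; the loader names are hypothetical:

```python
# Illustrative computation of CA / ASR / Real CA / Real ASR, assuming
# clean and triggered test loaders for both synthetic and real data.
import torch

@torch.no_grad()
def accuracy(model, loader, expected_label=None, device="cuda"):
    """CA when expected_label is None (compare to ground truth);
    ASR when expected_label is the attacker's target class."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        target = labels if expected_label is None else torch.full_like(labels, expected_label)
        correct += (preds == target).sum().item()
        total += labels.numel()
    return 100.0 * correct / total

# CA       = accuracy(model, synthetic_clean_loader)
# ASR      = accuracy(model, synthetic_triggered_loader, expected_label=target_class)
# Real CA  = accuracy(model, real_clean_loader)
# Real ASR = accuracy(model, real_triggered_loader, expected_label=target_class)
```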
Q10: Frequent use of "surreal" is confusing
Thank you, we’ll revise and amend accordingly. To explain, in Section 4.2, we discuss that both quality and surreality refer to the fidelity of images, or that “the images are clear and the objects appear natural to humans” (L200-206). Then Section 4.3 (L235-250) discusses that the “naturalness of the synthesized images” can be formally defined as alignment with human preference, which is also discussed in [1]. This formal definition of fidelity motivates us to use ImageReward, which captures human preference, as our assessment metric for fidelity.
Q11: L232 - " is indeed of utmost crucial" -> "is indeed of utmost importance"/"is crucial"
Thank you, we’ll revise and amend accordingly.
Q12: L234 - phrasing is not clear
We meant that these metrics are “not suitable” for quantifying whether a generated/edited image is natural. We agree that this phrase can be misleading, and will change it immediately to “conventional metrics are unsuitable for assessing the quality/surreality of the synthetically generated physical backdoor samples” in our manuscript.
Q13: L272 - Reference for Realistic Vision
We’d like to direct the reviewer to Sec. 1 (Sup. Material) L22, which specifies the reference of the model (SG161222/Realistic_Vision_V5.1_noVAE). This model is hosted on Hugging Face (as we’re not allowed to submit a link, the reviewer can search for it on the Hugging Face website; we will include this reference in the final version).
Q14: L288 - the \sim symbol is placed incorrectly above 15.
We thank the reviewer for the suggestion, we’ll revise and amend accordingly.
Reference:
[1] ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv 2023
I thank the authors for their detailed response, below I will try to re-clarify my concerns.
Regarding the motivation - Beyond the study of physical poisoning attacks "in labs" the question is about the motivation to do this in the real world. In other words, if this dataset is very hard to collect in order to study, perhaps it is also unreasonably hard to perform this attack in real life making the motivation for such attacks much lower compared to "editing pixels"? Another way to phrase the question would be "Under what setting would an attacker opt for this highly laborious poisoning effort, in place of other approaches?"
Furthermore, the authors could also better motivate which insights from GenAI based data could carry over to this real world attack. In other words, anything to do with the data or poisoning type which is achieved through this approach would be hard to transfer directly if I understand correctly? Do the authors suggest that optimisation methods for the attack which work on the synthetic data will also work better in the case of real world data?
Q6 - If I understand correctly, the tabulated results show that poisoning a model with data synthesised with Image Generation or Editing also transfers to a real world trigger, correct? If I understand correctly, this could mean that an attacker could perform an attack for a real world trigger without real world poisoning (i.e. only generating images and never creating a physical poison). However, it is unclear to me that this proves that a researcher could derive insights about the poison creation or attack optimisation from this. This again relates to the motivation which is not clear to me. If the claim is that you can poison a model in digital space such that a real world trigger can activate it, one should show that this approach is better than others which do not rely on image generation or editing. However, if the claim is that insights from this poisoning transfer to real world poison, then I feel this claim needs to be better substantiated.
I thank the authors again for their response and effort.
Q1-1: Motivation
We appreciate the reviewer’s question regarding the motivation of physical backdoor attacks.
First, we’d like to clarify that the difficulty of physical dataset collection does not imply impracticality of physical attacks in real life, and physical attacks complement digital “pixel” attacks in many scenarios. As explained, collecting and sharing data for research are challenging, even though physical attacks pose a realistic and harmful threat compared to digital attacks:
- In closed-loop systems (e.g., surveillance cameras, autonomous vehicles, biometric gates), attackers lack digital access (rendering “pixel-editing” impractical), but can manipulate physical scenes in front of sensors. In such systems, physical poisoning is often the only feasible vector. [1] demonstrates this: they poison a face recognition system by capturing real images of volunteers wearing physical accessories (e.g., glasses, scarves), which are inserted into the training set. At inference, the physical triggers consistently induce targeted misclassification with success rates > 90%, under varied lighting, angles, and backgrounds. [2] place physical stickers on traffic signs during passive data collection to compromise models.
- As discussed in L31-35/89-90, the motivation for physical attacks mainly stems from digital backdoors' shortcomings: they are susceptible to noisy perturbations, and injecting digital triggers into real-time systems is difficult (also discussed in Sec. 2 of [5]).
Second, we'd like to clarify that experimenting on synthesized data does suggest transferable insights to real data, as demonstrated by our experiments. We showed that both "edited" and "generated" images yield comparable results to the manually collected data, confirming our hypothesis of transferability; most important are (i) comparable ASRs and resistance against defenses, (ii) effectiveness against noisy perturbations, as physical triggers are more resilient than pixel-based triggers (similar to [1]), and (iii) increased effectiveness but lower defense resistance against Neural Cleanse for more consistent triggers, compared to less consistent ones.
Our work, thus, provides a ground for future research that intends to study potential optimizations (e.g., optimizing in latent space [3], trigger placement [1], or weather as a trigger [4]) for both attack and defense paradigms, with synthetic data. This would remove the need for researchers to go through laborious processes of collecting data to validate their assumptions. More importantly, synthesized datasets can be easily shared for research and benchmark purposes without ethical approvals or concerns, which is currently lacking in this domain.
We hope this clarifies the reviewer’s concerns; we will update the paper accordingly based on our discussion.
Q6: Claims of motivation
Thank you for the thoughtful follow-up. Above, we explain our motivation: to propose a synthetic framework that replaces manually collected data for physical backdoor research, rather than another attack that “poisons a model digitally with real world trigger activation”.
The table confirms the “transferability”, where a model poisoned entirely with synthetic images can be triggered by real-world objects at inference time. Besides transferability, we focus on (1) a flexible framework (e.g., the separation of Trigger Selection and Poison Selection allows the researcher to investigate a wide spectrum of triggers and to incorporate other types of qualitative filters, such as synthesized triggers of certain sizes, respectively), (2) validation of the framework (for different types of synthesis and triggers against previously observed phenomena in the literature -- see the second point of our previous answer), and (3) “gaps” between synthetic and real poisoning (due to limitations of the generative tools).
An example of (3): while the real-data baseline achieves the highest ASR, image-editing-based synthetic data exhibits the strongest transfer among synthetic methods on both the unique (Tennis Ball: 82% vs 44%) and generic (Book: 56% vs 22%) triggers, which also confirms that TriggerCraft's synthetic data can faithfully replicate relative behaviors seen in prior physical-world studies [1,2].
Hence, TriggerCraft automates trigger suggestion, image synthesis, and plausibility filtering, making the construction of large physical backdoor datasets accessible within the laboratory. We hope this can help accelerate physical attack and defense research without costly data collection.
[1] Wenger et al. Backdoor attacks against deep learning systems in the physical world. CVPR'21
[2] Gu et al. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv'17
[3] Xiao et al. Dynamic Weighted Learning for Unsupervised Domain Adaptation. CVPR'21
[4] Zhang et al. Towards robust physical-world backdoor attacks on lane detection. ACMMM'24
[5] Wenger et al. Natural Backdoor Datasets. arXiv'22
Dear Reviewer Trxv,
As the discussion deadline approaches, we wanted to follow up on our rebuttal. We hope that our detailed responses and clarification have adequately addressed your concerns.
If there are any remaining questions or clarifications needed, we would be more than happy to assist.
Thank you again for your time and consideration.
I would like to thank the reviewers again for the responses, however I still have concerns about the clarity of the work and more importantly to what extent the results provided support the main claim.
The main motivation of the paper is to alleviate or minimise the requirement for real world poisoning datasets for research. For instance, a researcher studying poisoning mitigations (defences) would like to know that building a defence method for this synthetic data would also work for a real world poisoning data.
However, if I understand correctly the transferability shown by results shows that: Image generated (or edited) poison -> Transfers to real world triggers.
This does not necessarily show what would happen with a real world poison dataset. The fact that both lead to a real world trigger "working" doesn't mean that this derives from the same cause. Subsequently, this doesn't mean that insights (e.g defence approaches) which work for the generated poison dataset hold for the real world poison dataset. Generated images and real images don't necessarily have the exact same distribution and the reason for a defence method to work could be some hidden texture only prevalent in generated images (for example).
I understand justifying such a claim is challenging. One option would be to use a real world poison dataset which already exists or is collected specifically, try to generate a comparable image generated dataset, and show transferability (generated -> real world) of elements of the poison dataset, and not just the trigger.
I would like to thank the reviewers again for the discussion and their work, and yet I have remaining concerns and also think that the paper would also benefit notably from re-writing to improve the clarity.
Thank you for the comments. We’d like to counter the important argument just raised by the reviewer:
- The reviewer stated
“if I understand correctly the transferability shown by results shows that: Image generated (or edited) poison -> Transfers to real world triggers. This does not necessarily show what would happen with a real world poison dataset. The fact that both lead to a real world trigger "working" DOESN’T mean that this derives from the same cause.” - We will confirm that "The fact that both lead to a real world trigger "working" DOES mean that this derives from the same cause", which is the result of our main experimental study in the paper.
- We believe this misunderstanding is a primary source of confusion and may have contributed to the reviewer’s inclination toward rejection.
For physical poison data collected manually, the actual object’s appearance (i.e., its shape, color, texture, etc…) is the source of backdoor activation [1-3]. However, for a synthesized one, there are 2 potential sources of backdoor activations: (a) the digital artifact caused by generation (e.g., a generated tennis ball may have, potentially invisible, subtle pixel discontinuities, color pattern, hallucinations, etc…) and (b) the object’s appearance (again, i.e., its shape, color, texture, etc…). A substantial part of our paper (i.e., the analysis on attack effectiveness) is set up to investigate this source of backdoor activations (as mentioned in Section 5.4, L291-300). Knowing the source of activation will also demonstrate “transferability” (as we will explain shortly).
(1) Is the artifact or the physical object the source of backdoor activations?
- One straightforward strategy to answer this question is to “manually collect” a real replica of the synthetic data. However, this is extremely challenging, as there are many types of triggers differing in shape, size, setting, texture, etc., and there are different contexts or image distributions (e.g., animals, humans, nature, cities, places, etc.). Proving that a synthetic framework like ours “transfers” by collecting parallel data would require an absurd, unrealistic amount of effort, time, resources, and money. Adding the number of existing backdoor defenses makes it even worse.
- Instead, we take another, significantly more efficient strategy, via “eliminating the impossible cause” with manually collected (small) physical test datasets. Specifically, as explained during our discussions, we “observe” attack behaviors on the real, manually collected data and on the synthetic data. If the backdoor is activated by the artifact, the attack performance (i.e., ASR) on the real and synthetic data will be significantly different, because real data do not contain generative artifacts. As can be observed in the paper and previously mentioned in the discussion, “the ASRs on real and synthetic data are comparable”, indicating that the source of activation cannot be artifacts. Consequently, the only remaining plausible explanation is the object’s appearance.
(2) Why is this answer important for transferability? Because isolating the source of backdoor activation indicates that the appearance (shape, color, texture, etc.) of the trigger object “transfers” from the digital to the real physical world. This shows that we can synthesize the poisoned data in the lab with exactly the same activation characteristics as the “manually poisoned data”. Note that [Wenger et al. 2022] also briefly noted this observation in their work, which motivates our work.
(3) But what about the defenses? We’ve just shown that it’s sufficient to verify “transferability” by focusing on backdoor activation characteristics. In addition, as discussed frequently in the paper, there is a significant shortage (if not a complete absence) of physical-trigger datasets; collecting real, poisoned training data, as explained above, is also challenging. Nevertheless, our paper has verified two important characteristics of physical triggers against key defenses, as extensively observed in previous works [1-3]: the synthesized physical backdoor can easily bypass STRIP and noisy perturbations, as a physical object usually keeps the same appearance after such perturbations, and Grad-CAM focuses on the trigger object during backdoor activation on both real and synthetic test data.
In summary, our paper is well motivated, and our analysis strongly isolates the synthetic object's physical appearance as the only source of backdoor activation in the physical world. Consequently, we believe that our current manuscript sufficiently demonstrates the main claims of the paper.
References:
[1] Yin et al. Physical backdoor: Towards temperature-based backdoor attacks in the physical world. CVPR'24
[2] Wenger et al. Backdoor attacks against deep learning systems in the physical world. CVPR'21
[3] Wang et al. TrojanRobot: Physical-World Backdoor Attacks Against VLM-based Robotic Manipulation. arXiv'25
I kindly disagree with the core point made by the authors, and hope that they carefully read my detailed response.
I do not believe that the results shown lead to “eliminating the impossible cause”. There could be many reasons for a trigger to "fool" a poisoned model at inference time, and it is far from trivial to specify all. The binary choices the authors suggest are, in my opinion, overly simplistic. For instance, it is possible to poison a model, through data alterations which do not even include or resemble the trigger [1, 2, ...]. If we were to use such an attack to "activate" a real world trigger, it is highly plausible that it would be effective. However, understandably, it is unlikely that insights from this pixel editing poison optimisation attack would transfer to an object (co-location) based real world poisoning, even if both lead to the same ASR on the real world trigger.
While the data generation pipeline suggested is not explicitly optimising the poison in such a manner, it is very possible that something in the data synthesis + selection pipeline is leading to trigger success due to data artifacts which would not be present in a real world poisoning data.
I do agree that it is possible that the reason for attack success on the real world trigger, would also apply to real world datasets (perhaps it is even likely for specific kinds of attacks), but it most definitely isn't the only option. This is the main claim of the paper, with possible negative impacts if the assumption is incorrect (e.g. researchers study defenses for real world poisoning only on synthetic data, perhaps missing, or underestimating a real world threat). Hence, I believe the authors must provide stronger evidence to support this claim.
I thank the authors for this discussion, and hope that they consider my comment, and do not disregard it as a misunderstanding, as this core distinction is important (in my opinion), and followed by an in depth attempt to understand the authors claim.
[1] Souri, Hossein, et al. "Sleeper agent: Scalable hidden trigger backdoors for neural networks trained from scratch." Advances in Neural Information Processing Systems 35 (2022): 19165-19178.
[2] Lederer, Tzvi, Gallil Maimon, and Lior Rokach. "Silent Killer: A Stealthy, Clean-Label, Black-Box Backdoor Attack." arXiv preprint arXiv:2301.02615 (2023).
We’re very grateful for your comment. Please see our response to each statement made by the Reviewer below:
There could be many reasons for a trigger to "fool" a poisoned model at inference time, and it is far from trivial to specify all. The binary choices the authors suggest are overly simplistic. It is possible to poison a model, through data alterations which do not even include or resemble the trigger [1, 2, ...]
We’d like to clarify that the binary choice “categorizes” the sources of backdoor activation into 2 parts: (1) the generative artifact (again, this “groups” all potential “alterations” caused by the generative model), and (2) the object’s appearance (again, this groups the “physical characteristics” of an object). An example is an image of a dog with a tennis ball synthesized by Image Editing; this ball may contain (1) generative artifacts such as invisible discontinuities in the pixel transitions and, obviously, (2) the shape and general color of the ball. (A) Suppose that (1) is the activating trigger learned during backdoor training; then the backdoor would NOT be activated on real photos of a dog/tennis ball combination, as the real photo DOES NOT have any such artifacts. On the other hand, (B) if (2) is the activating trigger, then it should be activated by both real and synthetic photos of a dog/tennis ball during inference. Consequently, we will observe significantly different ASRs between real data and synthetic data in (A), while they should be comparable in (B).
However, understandably, it is unlikely that insights from this pixel editing poison optimisation attack would transfer to an object (co-location) based real world poisoning, even if both lead to the same ASR on the real world trigger.
First, we’d like to clarify that we DO NOT optimize the attack, but only focus on high-quality image generation with the trigger. As explained, if our goal were to optimize the attack, the framework should, for example, “suggest” the most consistent object (e.g., a square sticky note with a uniform color) as this would result in the best ASRs.
Second, consider the hypothetical case suggested by the Reviewer, where “something” (i.e., ANY accidental alteration introduced by the synthesis or selection process, besides the object’s appearance) activates the backdoor. However, real test data is “captured” with entirely different, independent processes (camera, postprocessing, environments, time periods, etc.), making it extremely unlikely (if not impossible) for this “something” to exist in the test images; hence, when the ASRs on real and synthetic data are comparably high, the source of activation must be something else, namely the object’s appearance.
While the data generation pipeline suggested is not explicitly optimising the poison in such a manner, it is very possible that something in the data synthesis + selection pipeline is leading to trigger success due to data artifacts which would not be present in a real world poisoning data.
We’d like to clarify that only the data synthesis step directly modifies the images, while the selection step only “looks” at image quality. We understand the reviewer’s concern about potential “artifacts” in either synthesis or selection under attack optimization. Yes, we completely agree with the Reviewer that, if we optimized either of these steps for ASR (e.g., choosing the photos with certain characteristics to achieve high ASR on real test data), then we could not ensure “transferability”. However, we do not “optimize” ASR in any part of the framework; in image synthesis, we generate images using a “general prompt”, and image selection filters images using an objective completely unrelated to ASR (i.e., ImageReward).
It is possible that the reason for attack success on the real world trigger, would also apply to real world datasets (perhaps it is even likely for specific kinds of attacks), but most definitely isn't the only option. This is the main claim of the paper, with possible negative impacts if assumption is incorrect...
As discussed, our work is not the first one analyzing ASRs to isolate the source of activations. Nevertheless, it provides the most rigorous analysis with synthetic data from generative models, with the goal of accelerating physical backdoor research.
I thank the authors for this discussion, and hope that they consider my comment, and do not disregard it as a misunderstanding
We’d also like to sincerely thank the Reviewer for the comments. In our responses, all we attempt to do is to “clarify” with the Reviewer that (1) our framework DOES NOT optimize the synthesized trigger for high ASRs (or choose any accidental bias or artifact of the process as the source of backdoor activation) and (2) our analysis therefore discards artifacts as potential triggers, leaving the object’s appearance as the only plausible triggering factor.
This paper introduces TriggerCraft, a framework for generating physically realizable trojan backdoor datasets using generative models. Physically realizable backdoors use real-world objects (e.g., books, tennis balls) that could physically exist in the scene as triggers, making them more practical and harder to detect than digitally injected backdoors. The framework consists of trigger suggestion based on VQA, trigger generation using diffusion models to edit or generate new images, and poison selection module that filters poisoned samples. The authors demonstrate that synthetically generated backdoor datasets achieve comparable attack success rates (~60-95%) and defense evasion properties to real physical backdoor datasets.
Strengths and Weaknesses
The framework is validated against real physical backdoor data collected with multiple devices under various conditions as opposed to only being validated on GenAI images.
What is the computational or monetary cost for API calls to perform the poison image generation? Understanding the barrier to implementation would be a reasonable single line improvement.
Line 53: "... create a physical backdoor" should more accurately be described as a physically realizable or physically plausible backdoor. The method still relies on image-generation and the inputs which are poisoned are still images. A physical trigger would be an instantiation in the real world of the objects selected in this paper as plausible trigger objects in the images.
Line 283: The metric "Clean Accuracy (CA)" needs to be mentioned before its use in Table 1 & 2, used just as CA, which appear well before the line 283 explanation of what CA is. A brief note in the table captions would assist reader clarity.
Line 348: I don't find any of those samples as "odd" with respect to the main subject class. Every dog wants to lay down on pillows and blankets. If you want to highlight the VQA suggestion filtering, better examples are required. Dog + blanket cannot be the strangest pairing the VQA generated.
The act of discovering or injecting correlations between semantically relevant objects/concepts/content within the images and trigger behavior moves the state of the art in poisoning forward. Prior work explored discovering spurious correlations between objects (like tennis ball and dog) and leveraging that as a trigger cause. Generative image creation enables significantly expanded options for exploring the space of possible attacks.
Beyond simply enabling trigger injection research, tooling like this enables more fundamental exploration of the correlations and representation bound up within networks. By exploring what is trigger-able, and which physically realizable triggers don't produce acceptable ASR, insights can be gained into the functional representation confounders within DNN. A step in the right direction to empirically characterizing potential model failure modes within normal, expected data.
Authors present an interesting application of VQA to generate candidate objects for physical realizability.
Questions
The “Real CA” shows a consistent roughly 15 percent drop from synthetic CA. Could this be due to a significant distribution shift introduced by the generating model? Have you analyzed or identified what factors contribute to this gap? Could additional augmentations or generation strategies reduce this?
You mention that “generic triggers” (like books) lead to lower Real ASR due to appearance diversity. Could the framework be extended to generate consistent trigger appearances across samples? Additionally, with common objects one runs into the problem of non-poisoned samples with the properties of a trigger (False Positive in trojan detection). For example, if the dataset contains pictures from a library, and the book trigger is successfully injected, many non-poisoned images now will activate the trojan behavior.
Limitations
Yes
Final Justification
I am satisfied with the clarifications the authors provided in the rebuttals across all reviewer comments. I am keeping my original assessment of accept.
Formatting Issues
None
We thank the reviewer for the constructive comments.
Q1: What is the computational or monetary cost for API calls to perform the poison image generation? Understanding the barrier to implementation would be a reasonable single line improvement.
All experiments in our paper are conducted on a Linux Server with AMD EPYC 7513 32-Core, 7 Nvidia RTX A5000 24GB and 512 GB RAM (also provided in Sec. 1, Supplementary Material). Each of the image generation models (InstructDiffusion and Realistic Vision V5.1) can be hosted on a single A5000 24GB card.
Q2: Line 283: The metric "Clean Accuracy (CA)" needs to be mentioned before its use in Table 1 & 2, used just as CA, which appear well before the line 283 explanation of what CA is. A brief note in the table captions would assist reader clarity.
Thank you for the suggestion. Please note that the results in Tables 1 and 2, despite appearing earlier in the text, are first discussed after line 283, where we define CA and Real CA. Nevertheless, we will revise the text accordingly based on the reviewer’s suggestion (e.g., note in the table captions).
Q3: Line 348: I don't find any of those samples as "odd" with respect to the main subject class. Every dog wants to lay down on pillows and blankets. If you want to highlight the VQA suggestion filtering, better examples are required. Dog + blanket cannot be the strangest pairing the VQA generated.
Thank you for the comment, which is also raised by Reviewer 5CCA. We agree with your suggestion, and we will change this example accordingly. We primarily intend to discuss “hallucination” as a potential limitation of our method, although perhaps, in our examples, the suggested objects seem logical.
Q4: The “Real CA” shows a consistent roughly 15 percent drop from synthetic CA. Could this be due to a significant distribution shift introduced by the generating model? Have you analyzed or identified what factors contribute to this gap? Could additional augmentations or generation strategies reduce this?
Yes, this drop is due to a distribution shift between the generated images and real images. We note that, for Image Generation, the images generated with Stable Diffusion using only the class labels cannot fully capture real-world variations. Such distributional mismatch is also consistently observed in prior work [1] (Tab. 1 in their paper shows an accuracy gap of around 15% between synthesized and real ImageNet-100, and the gap increases to ~40% on ImageNet-1K) and is one of the limitations of existing generative models. Nevertheless, as their capability continues to improve, the distributional gap will become smaller.
A more immediate solution to reduce the distributional gap is to improve the diversity of the text prompts (created from the labels). For example, one could try to extend the techniques proposed in [2] to increase the variations in background, lighting, orientation, and angle to fit the downstream settings. Nevertheless, this deserves a substantial and separate study, as an extension of our work. We leave this for future work.
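As a concrete but hypothetical illustration of this direction, prompts could be diversified along a few axes before being fed to the Image Generation model; the attribute lists and template below are assumptions for the sketch, not what we used in the paper:

```python
# Hedged sketch of diversifying text prompts to narrow the synthetic-to-real gap.
import itertools
import random

BACKGROUNDS = ["in a park", "indoors", "on a street", "at the beach"]
LIGHTING    = ["at dusk", "in bright daylight", "under an overcast sky"]
VIEWS       = ["close-up", "wide shot", "side view"]

def diversified_prompts(class_name, trigger, n=8):
    combos = list(itertools.product(BACKGROUNDS, LIGHTING, VIEWS))
    random.shuffle(combos)
    return [f"a photo of a {class_name} with a {trigger}, {bg}, {light}, {view}"
            for bg, light, view in combos[:n]]
```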
Importantly, this limitation does not affect the core contribution of our work, which is to demonstrate the feasibility of enabling scalable physical backdoor dataset generation using generative models. As the quality and diversity of generative models continue to improve, we expect the distributional gap to narrow further, strengthening the utility of our proposed framework.
Q5: You mention that “generic triggers” (like books) lead to lower Real ASR due to appearance diversity. Could the framework be extended to generate consistent trigger appearances across samples?
Yes, our proposed framework could be further extended to generate such a consistent trigger. In fact, by specifying a consistent appearance for generic triggers like books, we’re able to transform generic triggers into more specific ones. For example, adding adjectives such as “red” and “closed” to “books” transforms the trigger into “red closed books”, which likely has a more consistent appearance/perception. By guiding generative models with more specific prompts (e.g., by adding adjectives), our framework can hypothetically generate a more consistent trigger.
Although our current implementation focuses on general trigger generation, the framework is flexible and naturally supports this. We are happy to discuss this direction in the conclusion as part of future work.
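A minimal sketch of this idea follows; the attribute choices are illustrative assumptions:

```python
# Constrain a generic trigger ("book") into a more consistent, specific one
# by fixing its attributes in the generation prompt.
def specific_trigger_prompt(class_name, trigger="book",
                            attributes=("red", "closed", "hardcover")):
    trigger_phrase = " ".join(attributes) + " " + trigger
    return f"a photo of a {class_name} next to a {trigger_phrase}"

# specific_trigger_prompt("dog") -> "a photo of a dog next to a red closed hardcover book"
```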
Q6: Additionally, with common objects one runs into the problem of non-poisoned samples with the properties of a trigger (False Positive in trojan detection). For example, if the dataset contains pictures from a library, and the book trigger is successfully injected, many non-poisoned images now will activate the trojan behavior.
We agree with the reviewer; there are 2 ways to mitigate the falsely activated backdoors (false positives): transforming the generic triggers to be more specific (detailed in Q5), and avoiding a highly compatible trigger (based on our Trigger Suggestion Module). In the latter option, we suggest using a trigger object with moderate compatibility to avoid such inadvertent triggering of the attack. Importantly, our framework (specifically the Trigger Suggestion Module) flexibly allows the security researcher to study a wide range of physical backdoor triggers, from the generic triggers to highly compatible triggers. We hope this will help to accelerate the state of stagnated physical backdoor research.
Reference:
[1] Fake it till you make it: Learning transferable representations from synthetic ImageNet clones. arXiv 2023
[2] Diversify, don't fine-tune: Scaling up visual recognition training with synthetic images. arXiv 2025
Q2: I agree that the definition of Clean Accuracy (CA) happens before the table in the latex file, but not in the pdf, so including it in the caption helps the reader based on the order the pdf presents the information. Doubling up the definition just helps clarity.
Q4: If you have a citation that demonstrates a very similar 15% gap in sim-to-real accuracy, it would improve the paper to (in the fewest words due to page count limitations) call out that gap, explain the reason, cite the paper, and possibly argue that ever improving models will likely reduce the gap. I agree it does not invalidate the work, but oddities in the data presented can raise questions in the reader and heading off those concerns is very easy with a citation.
Q6: The filtering criterion of only using medium-plausibility trigger objects (also discussed by reviewer 5CCA) results in False Positive trigger activations. In many real world domains there is an extremely long tail of unlikely but plausible images. Filtering for high compatibility might actually be problematic. For example, dog + astronaut is an unlikely image, but an eminently plausible one. Per 5CCA commentary on Q2: the cup as a trigger results in False Positive trigger activation. You risk higher than necessary trigger FP rates by filtering for anything other than low likelihood triggers that remain plausible images in the long tail. Is that something the authors considered? There is an explicit precision/recall trade off between the trigger object likelihood in the source images and the trigger response accuracy.
I am satisfied with the clarifications the authors provided in the rebuttals across all reviewer comments.
Thank you for your kind acknowledgement. We’re grateful that the clarifications addressed your concerns, and we sincerely appreciate your time and thoughtful engagement.
The paper attempts to address the problem of backdoor attacks in deep neural networks (DNNs), which is a relevant topic in the field of machine learning security. The proposed framework, TriggerCraft, introduces a modular design for generating physical backdoor datasets, which could potentially be of interest to researchers working on adversarial machine learning.
Strengths and Weaknesses
Strengths
The paper attempts to address the problem of backdoor attacks in deep neural networks (DNNs), which is a relevant topic in the field of machine learning security. The proposed framework, TriggerCraft, introduces a modular design for generating physical backdoor datasets, which could potentially be of interest to researchers working on adversarial machine learning.
Weaknesses
- Overstated Novelty of the Problem: The paper claims that backdoor attacks, including those in the physical world, represent an "emerging threat" to the integrity of DNNs, capable of compromising deep learning systems in an undetectable manner. This claim is misleading, as backdoor attacks are not a novel or emerging issue. The field has been extensively studied for years, with a substantial body of literature and research already addressing this problem in depth. The authors fail to acknowledge the maturity of this research area, which undermines the perceived novelty and significance of their work.
- Lack of Clear Research Value and Significance: While the paper proposes a framework for generating physical backdoor datasets, it does not adequately articulate the value and significance of this contribution. The authors fail to convincingly demonstrate how their framework advances the state of the art or addresses critical gaps in the existing literature. Furthermore, there is no clear comparison with prior methods to highlight the advantages or unique contributions of the proposed approach, leaving the reader uncertain about its practical or theoretical impact.
- Weak Logical Structure and Argumentation: The logical flow of the paper is insufficiently coherent, with a lack of tight connections between the problem statement, the proposed solution, and its evaluation. For instance, when introducing the TriggerCraft framework, the authors do not adequately justify the rationale behind adopting a modular design or explain how the individual modules interact to achieve the stated objectives. This lack of clarity weakens the overall persuasiveness of the paper.
- Absence of Ablation Studies: The paper does not include ablation studies to evaluate the effectiveness of individual components within the TriggerCraft framework. Ablation studies are essential for understanding the contribution of each module and for validating the design choices made in the framework. Without such experiments, it is difficult to assess the robustness or necessity of the proposed components, significantly limiting the credibility of the results.
- Insufficient Theoretical Justification: The theoretical explanations of the modules within the TriggerCraft framework are underdeveloped. For example, in the trigger suggestion module, the authors mention the use of a VQA model but fail to provide a detailed explanation of how this model is leveraged to assess trigger compatibility. This lack of depth in theoretical grounding makes it challenging to evaluate the soundness of the proposed approach or its generalizability to other contexts.
Questions
no
Limitations
no
Final Justification
The authors’ rebuttal and clarifications resolved some of my concerns, while a few issues remain; my final score reflects this balance.
Formatting Issues
no
Thank you for the comments.
Q1: Overstated Novelty of the Problem: The paper claims that backdoor attacks, including those in the physical world, represent an "emerging threat" to the integrity of DNNs, capable of compromising deep learning systems in an undetectable manner. This claim is misleading, as backdoor attacks are not a novel or emerging issue. The field has been extensively studied for years, with a substantial body of literature and research already addressing this problem in depth. The authors fail to acknowledge the maturity of this research area, which undermines the perceived novelty and significance of their work.
We appreciate the reviewer’s perspective, but would like to clarify that backdoor attacks are indeed an emerging threat, given the low barrier for anyone to train and deploy DNNs into practical use cases. While digital backdoor attacks have been studied extensively, physical backdoors remain comparatively underexplored due to the high cost, logistical constraints, and ethical approvals required to collect real-world datasets. As a result, many potential attack vectors and defences remain untested or poorly understood.
Our work aims to fill this gap by proposing a scalable method to study physical backdoors using synthesized datasets. We believe this is a meaningful contribution that enables broader investigation into a threat space that is otherwise difficult to access. That said, we are open to softening the phrasing around “emerging threat” in the final version, while maintaining the core motivation of our work.
Q2: Lack of Clear Research Value and Significance: While the paper proposes a framework for generating physical backdoor datasets, it does not adequately articulate the value and significance of this contribution. The authors fail to convincingly demonstrate how their framework advances the state of the art or addresses critical gaps in the existing literature. Furthermore, there is no clear comparison with prior methods to highlight the advantages or unique contributions of the proposed approach, leaving the reader uncertain about its practical or theoretical impact.
We would like to clarify that our proposal to generate physical backdoor datasets is a meaningful and timely contribution to the field. Although prior works [1,2] have demonstrated the feasibility of physical backdoor attacks, further progress has been limited because there are few, if any, publicly available physical backdoor datasets. This limitation is driven by I/ERB and privacy constraints that inhibit the release of such datasets.
Our framework addresses this bottleneck by enabling scalable, high-quality synthetic dataset generation using deep generative models. It lowers the entry barrier for researchers and provides a practical and reproducible way to benchmark physical backdoor methods. Without such a benchmark, fair evaluation and systematic comparison are difficult, much like how progress in image classification was accelerated by the introduction of ImageNet. While our dataset is more targeted in scope, it plays a similarly foundational role in enabling structured and repeatable experimentation within the physical backdoor domain.
Hence, our focus is not to advance state-of-the-art methods, but rather to act as a catalyst to physical backdoor research, by enabling the discovery of state-of-the-art methods via our proposed framework, as discussed in Sec. 3.
Q3: Weak Logical Structure and Argumentation: The logical flow of the paper is insufficiently coherent, with a lack of tight connections between the problem statement, the proposed solution, and its evaluation. For instance, when introducing the TriggerCraft framework, the authors do not adequately justify the rationale behind adopting a modular design or explain how the individual modules interact to achieve the stated objectives. This lack of clarity weakens the overall persuasiveness of the paper.
We’d like to clarify that we’re motivated by the intense workload required to create physical backdoors in order to study this domain, which includes, but is not limited to, extensive I/ERB approvals, monetary budgets, human resources and time. This motivates our proposed framework, which includes three modules tasked with assisting researchers in synthesising datasets that are able to simulate physically collected datasets.
Our framework consists of three interconnected modules, each designed to reflect a key step in the conventional backdoor attack workflow. The first step in most backdoor studies involves identifying a suitable trigger object. To support this, our Trigger Suggestion module assists researchers in generating plausible candidate objects, reducing manual effort and cognitive load. Once a trigger is selected, the next typical step is to apply it across a dataset, which in physical settings usually requires manual collection. Our Trigger Generation module automates this process by synthesizing data with the trigger embedded, removing the need for real-world data collection. Since synthetic data may introduce visual artefacts, our Poison Selection module further refines the dataset to ensure that the poisoned samples appear natural and are aligned with human visual preferences.
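As a rough, hedged illustration of how the three modules chain together (the function names and signatures below are ours for exposition, not the paper’s code):

```python
# Sketch of the end-to-end flow: suggest a trigger, synthesize poisoned
# candidates, filter them for plausibility, then relabel to the target class.
def triggercraft_pipeline(clean_dataset, target_class, vqa_suggest,
                          synthesize, reward_score, poison_rate=0.05,
                          reward_threshold=0.0):
    # 1) Trigger Suggestion: VQA proposes a context-compatible trigger object
    trigger = vqa_suggest(clean_dataset)[0]

    # 2) Trigger Generation: edit/generate images containing the trigger
    n_poison = int(len(clean_dataset) * poison_rate)
    candidates = [synthesize(image, trigger) for image, _ in clean_dataset[:n_poison]]

    # 3) Poison Selection: keep only samples that look natural to humans
    selected = [img for img in candidates if reward_score(img, trigger) > reward_threshold]

    # Dirty-label poisoning: relabel the selected samples to the target class
    return [(img, target_class) for img in selected]
```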
We’d like to emphasize that our proposed framework closely mirrors the actual workflow in backdoor studies, and we empirically justify the effectiveness of our proposal via extensive experiments.
We recognize that clarifying the interactions among modules can further improve the presentation, and we will revise the paper accordingly. However, we consider this a minor issue that does not affect the validity or significance of our contribution. The proposed design is well aligned with established practice, empirically supported, and provides an accessible tool to advance research in physical backdoor attacks.
Q4: Absence of Ablation Studies: The paper does not include ablation studies to evaluate the effectiveness of individual components within the TriggerCraft framework. Ablation studies are essential for understanding the contribution of each module and for validating the design choices made in the framework. Without such experiments, it is difficult to assess the robustness or necessity of the proposed components, significantly limiting the credibility of the results.
We’d like to clarify that our work does include essential experiments to understand the effectiveness of each individual component within our framework. We first show that the Trigger Suggestion module suggests triggers that closely match human preferences (Sec. 4, Sup.) and discuss how the compatibility of the suggested triggers may affect the backdoor effect (see Sec. 4.1, main paper). Meanwhile, for both the Trigger Generation module and the Poison Selection module, we empirically validate their effectiveness via extensive experiments (see Tab. 1-2, main paper), which achieve CAs and ASRs comparable to the real-world, physically collected dataset.
Notably, we also include additional examples in Fig. 9-12 to demonstrate the quality and diversity of the dataset synthesized by the Trigger Generation module; for the Poison Selection module, we include additional ablations to understand its effectiveness in ensuring the quality of the synthesized dataset (see Fig. 5-8, Sup.).
Q5: Insufficient Theoretical Justification: The theoretical explanations of the modules within the TriggerCraft framework are underdeveloped. For example, in the trigger suggestion module, the authors mention the use of a VQA model but fail to provide a detailed explanation of how this model is leveraged to assess trigger compatibility. This lack of depth in theoretical grounding makes it challenging to evaluate the soundness of the proposed approach or its generalizability to other contexts.
We’d like to clarify that our work proposes a novel framework to overcome the dataset problem in physical backdoor research. Thus, besides the framework itself, the main contribution of the paper is to empirically and rigorously validate the proposed framework under physical backdoor settings. While theoretical justification might be helpful, it is either out of scope or not the novelty of our work; for example, theoretical analyses of diffusion models already exist in the related literature.
We appreciate the suggestion and are open to incorporating additional theoretical analysis where appropriate. If the reviewer could kindly specify which direction or aspect they believe would benefit from such analysis, we would be glad to address it in the revision.
Reference:
[1] Finding naturally occurring physical backdoors in image datasets. NeurIPS 2022
[2] Backdoor attacks against deep learning systems in the physical world. CVPR 2021
Thank you for the final-hour comments. We’d like to clarify the issues raised by the reviewer.
First, we’d like to respectfully disagree with the comment "Lack of novelty in the problem setting: Physical backdoors have already been extensively studied in the literature over the past few years. The additional references provided by the authors during the rebuttal, including works from as early as 2021, further confirm that this problem setting is not new":
- It is factually incorrect that physical backdoors have been extensively studied:
- Physical backdoor research, despite progressing over the past few years, has not shown much growth, as it requires significant effort to manually collect and share datasets, which costs time, money, and human resources. Evidently, only ~20 works have been proposed since 2021, as noted in [1-19], signifying stagnated research – the main motivation of our work. In comparison, there have been thousands of digital (pixel-level) backdoor attacks.
- Prior works [1-19] investigate different aspects of physical backdoors, striving to bring impact to the research community. However, they are sparsely distributed across different verticals (autonomous driving, face recognition, object detection), mainly due to a lack of data sharing and of well-recognized datasets (e.g., ImageNet) on which researchers could jointly compare against and build upon each other's work.
- Since there are no shareable datasets, researchers have to curate their own (e.g., [1] and [17] both study facial recognition in the physical world, yet each had to curate its own dataset). With such a high barrier to executing research ideas (as different ideas with different triggers require manual dataset collection), research has stagnated. Motivated to alleviate this, we propose a framework that allows researchers to freely synthesize datasets for studying physical backdoors, bridging the gap of insufficient data sharing across the community.
Second, we also respectfully disagree with the reviewer’s comment on Inconsistency between “rigorous validation” and dismissal of theoretical justification, since within the backdoor research landscape, empirical validation is often the norm, as evident in notable, highly cited works such as BadNets, Blended, SIG, LIRA, and WaNet.
- These works, although mostly providing only empirical evidence, are key contributors to advancing backdoor research by exposing different threats to the community.
- We have rigorously validated the effectiveness and practicality of our framework via extensive experiments, similar to those of prior works.
- Thus, we’d like to highlight that the absence of theoretical justification DOES NOT diminish our contributions, as our framework, as noted in the first paragraph, bridges the data scarcity gap within the community.
In summary, we are motivated by the persistent gaps within physical backdoor research and propose a framework that resolves this gap by synthesizing datasets, whose practicality and effectiveness are rigorously demonstrated through extensive empirical evidence.
References:
[1] Backdoor attacks against deep learning systems in the physical world. CVPR’21
[2] Finding naturally occurring physical backdoors in image datasets. NeurIPS’22
[3] Towards robust physical‑world backdoor attacks on lane detection. ACMMM’24
[4] Natural Occlusion-Based Backdoor Attacks: A Novel Approach to Compromising Pedestrian Detectors. Sensors’25
[5] PUBA: A physical undirected backdoor attack in vision-based UAV detection and tracking systems. IJCNN’24
[6] Scenedoor: An environmental backdoor attack for face recognition. VCIP’24
[7] A physical backdoor attack against practical federated learning. CSCWD’25
[8] Backdoor learning on Siamese networks using physical triggers: FaceNet as a case study. ICDF2C’23
[9] Moiré backdoor attack (MBA): A novel trigger for pedestrian detectors in the physical world. MM’23
[10] On the credibility of backdoor attacks against object detectors in the physical world. ACSAC’24
[11] Palette: Physically-realizable backdoor attacks against video recognition models. TDSC’24
[12] Towards practical deployment-stage backdoor attack on deep neural networks. CVPR’22
[13] Physical backdoor: Towards temperature-based backdoor attacks in the physical world. CVPR’24
[14] The invisible Polyjuice potion: An effective physical adversarial attack against face recognition. CCS’24
[15] DiffPhysBA: Diffusion-based physical backdoor attack against person re-identification in real-world. arXiv’24
[16] Dangerous Cloaking: Natural Trigger based Backdoor Attacks on Object Detectors in the Physical World. arXiv’22
[17] Towards Clean-Label Backdoor Attacks in the Physical World. arXiv’24
[18] Physical backdoor attacks against LiDAR-based 3D object detectors. arXiv’24
[19] BadDepth: Backdoor Attacks Against Monocular Depth Estimation in the Physical World. arXiv’25
Thank you for the authors’ clarifications in the rebuttal. However, the provided explanations do not fully address my original concerns, and in fact raise additional issues.
- Lack of novelty in the problem setting: Physical backdoors have already been extensively studied in the literature over the past few years. The additional references provided by the authors during the rebuttal, including works from as early as 2021, further confirm that this problem setting is not new. As such, the claimed novelty of tackling physical backdoors is substantially weakened.
- Inconsistency between “rigorous validation” and dismissal of theoretical justification: In the rebuttal, the authors state that their work “rigorously validate[s] the proposed framework,” yet they also argue that “theoretical justification might be helpful, it is either inappropriate or not a novelty of our work.” This is problematic: claiming rigorous validation typically entails providing clear and sound reasoning—often involving at least some level of theoretical grounding—on why and how the framework works. By dismissing theoretical justification as inappropriate, the authors undermine the credibility of their own claim of rigor and leave doubts about whether the evaluation is truly sufficient to support the paper’s central claims.
In light of these points, my overall assessment of the paper remains unchanged.
Physical backdoor attacks are hard to test, since physical data collection is expensive and time-consuming. Instead, the authors propose that colocation backdoors (where a colocated object causes a misclassification to an arbitrary class) can be reasonably approximated by: 1) automatically generating suggestions for the colocated trigger, 2) creating images with that trigger object added in, either by generation or editing, and 3) rejection-sampling for quality with a multimodal reward model. Empirically, this not only allows the training of backdoored models, but also seems to generalize to manually-collected physical data.
Strengths and Weaknesses
Strengths
- For many embodied AI systems, physical triggers are the main goal for an attacker. Therefore, improving understanding of physical triggers is very important.
- The work actually verifies empirically that their datasets generalize to newly collected images of the physical world, which is both highly useful, and rare even in other physical-trigger research.
Weaknesses
- Limited to colocation backdoors.
- The "trigger compatibility levels" seem key in deciding later experimental details, yet involve thresholds that are not explained or empirically justified.
- Some claims are under-supported or undefined. For instance, claims that quality and "surreality" are both measures of how natural an image is to a human (I have not been able to find any definition for "surreal" as a keyword), or the claim that generative models produce diverse images (see the diversity of papers claiming the contrary).
- The work's most important contribution is the demonstration that the synthetic dataset generalizes to the real environment (via physically collected photos). However, the physical collection process is extremely underspecified. The paper would be much stronger if this were to be remedied (e.g. how many pictures, under what conditions were they taken, how were they processed/filtered, examples...)
Questions
- Are there systematic patterns in what the VQA model suggests to add into a photo? "What are 5 suitable objects..." is a strangely worded prompt.
- Is ImageReward the only rejection sampling step after the image model edits/generates?
- In the results tables, what is the CA and Real CA without any poisoned images in the model's dataset?
- In Fig. 4, why does NC find generated-image backdoors but not edited ones? What do the identified masks look like, compared to the edited-image backdoors? With such a low sample size, it's hard to draw conclusions...
- Do any other physical-trigger attacks report their digital vs. real CA/ASRs for comparison?
- I would appreciate some more commentary or examples (perhaps in Appendix) to further characterize the "Artifacts in Image Editing and Image Generation" section.
- (extremely minor) "For example, the suggestions for “dog”, such as “blanket" and “pillow," seem odd since dogs do not naturally appear alongside these items." I disagree, dogs may often be in indoor settings, and these are common accoutrements for furniture.
Limitations
yes
Final Justification
I raised my score from 3 to 4.
While there are noteworthy methodological weaknesses raised by various reviewers, the core element of validating sim-to-real for synthetic backdoor dataset generation is novel and validated in a reasonable way. Indeed, their generation framework, whatever else the flaws, appears to result in similar performance as other means of producing co-location backdoor attacks when transferring from training to real-world test. This convinces me that this work is on margin useful for backdoor researchers to base future efforts on.
Formatting Issues
none
We thank the reviewer for the constructive comments.
Q1: Limited to colocation backdoor.
We would like to first clarify that focusing on colocation backdoors is not a limitation of our work but a key contribution. As outlined in both the Abstract and Introduction, our study specifically targets colocated object triggers, an important yet underexplored area within the broader domain of physical backdoor attacks.
Physical backdoor research typically involves two categories of adversarial triggers: (i) colocated or co-occurred physical objects and (ii) environmental conditions such as lighting or weather variations [1]. The latter is easier to simulate using standard augmentations like contrast adjustment or filtering, without needing new physical data. In contrast, colocated object attacks are more challenging, as there is no straightforward way to generate and share high-quality datasets without introducing visual artefacts. E.g., naive cut-and-paste methods [2] often create unrealistic inconsistencies (see Fig. 4).
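For reference, the naive cut-and-paste baseline [2] contrasted above amounts to something like the following minimal PIL sketch (file names and parameters are placeholders; this is the baseline, not our framework):

```python
# Naive cut-and-paste poisoning (the baseline [2] contrasted above), which
# ignores scene geometry and lighting and therefore tends to produce the
# unrealistic inconsistencies mentioned in the text. File names are placeholders.
from PIL import Image

def paste_trigger(clean_path, trigger_path, out_path, position=(20, 20), scale=0.25):
    base = Image.open(clean_path).convert("RGB")
    patch = Image.open(trigger_path).convert("RGBA")

    # Resize the trigger patch relative to the base image width only.
    w = int(base.width * scale)
    h = int(patch.height * w / patch.width)
    patch = patch.resize((w, h))

    # Alpha-composite the patch at a fixed location; no shadow or lighting adjustment.
    base.paste(patch, position, mask=patch)
    base.save(out_path)

# Example with placeholder file names:
# paste_trigger("dog.jpg", "tennis_ball.png", "poisoned_dog.jpg")
```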
This complexity has led to a scarcity of high-quality, publicly available datasets for colocation backdoors, as most existing work either avoids this scenario or faces strict ethical and procedural barriers in collecting physical data [5, 6, 7]. Our proposed pipeline directly addresses this bottleneck by offering a practical and scalable approach for generating colocated backdoor datasets, thus unlocking a previously stagnant line of research.
Q2: Trigger Compatibility Levels' threshold
Thank you for the comment. The trigger compatibility level relates to the probability of “inadvertently” triggering the backdoor attack. That is, when the trigger is highly compatible, it co-exists with the “subject” more than 50% of the time in real-world environments (e.g., a violin and a bow); this high co-existence also leads to undesirable activation of the backdoor at inference time, making it a less ideal candidate for the backdoor adversary, who often needs the trigger to remain “secret”. At the other extreme of the spectrum, low compatibility levels (<10%) indicate an “absurd” trigger when it co-exists with the subject class (e.g., a dog with a spacesuit, since a spacesuit generally does not appear alongside a dog), making it likewise an unrealistic choice for a backdoor adversary (as it would raise suspicion in the training dataset or the inference input).
A security researcher using our proposed framework would therefore likely choose a trigger with a moderate compatibility level (e.g., we suggest "books" in our paper) to mimic a realistic backdoor attack scenario, which emphasizes the naturalness of the image (via adequate co-existence with the main subject) and a minimal chance of falsely triggering the attack. Nevertheless, the security researcher can also choose an “absurd” or “inadvertent” trigger to study the extreme ends of backdoor attacks, reflecting the flexible design of our framework for facilitating physical backdoor research. We will include this discussion in the final revision.
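To make the guideline above concrete, here is a minimal sketch using only the thresholds stated above; the function name and the idea of a single scalar co-occurrence estimate are illustrative assumptions, not part of our released code:

```python
# Minimal sketch mapping an estimated co-occurrence rate to the compatibility
# levels discussed above (thresholds as stated: >50% high, <10% low).
def compatibility_level(co_occurrence_rate: float) -> str:
    """co_occurrence_rate: estimated fraction of real-world images of the
    subject class in which the candidate trigger would plausibly appear."""
    if co_occurrence_rate > 0.5:
        return "high"      # risks inadvertent activation at inference time
    if co_occurrence_rate < 0.1:
        return "low"       # "absurd" pairing that raises suspicion
    return "moderate"      # recommended operating range (e.g., "book")

assert compatibility_level(0.60) == "high"
assert compatibility_level(0.05) == "low"
assert compatibility_level(0.30) == "moderate"
```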
Q3: Under-supported or undefined claims, such as "quality/surreality refers to the naturalness of an image to humans" and "generative models produce diverse images".
Thank you. As explained in Section 4.2 (lines 200–206), we use both quality and surreality to refer to image fidelity, which we define as images that are clear and whose objects appear natural to humans. Further elaboration is provided in Section 4.3 (lines 235–250), where we discuss that the naturalness of synthesized images can be formally understood as alignment with human preference, a perspective also supported in [3]. This interpretation underpins our use of ImageReward as a fidelity metric, given its strong alignment with human judgment.
Regarding diversity, we would like to clarify that we do not claim generative models always produce diverse outputs. Rather, we state that “DMs are capable of synthesizing and editing high quality and high diversity images” (lines 205–206). While we fully acknowledge the known limitations in diffusion model diversity, we highlight that, compared to other generative approaches such as GANs and VAEs (Section 2.3), diffusion models remain the more effective option. Importantly, our extensive experiments demonstrate that current diffusion models are already sufficient to enable synthetic dataset creation for physical backdoor research. As these models continue to advance, their utility and impact in this space are only expected to grow.
We will revise to make these points clearer.
Q4: Collection process of physical dataset
Thank you for the comment. The specification of the dataset collection process is discussed in Sections 2 and 3 in the Supplementary Material. To summarize, the real, physical dataset is collected via 8 different types of devices, with a total of 1,176 images, under various lighting conditions and environments (indoor/outdoor). We will make this part clearer in the main text of the final version.
Q5: Systematic patterns in VQA models' suggestions
The prompt, while seemingly unconventional, fits the needs of physical backdoor tasks, as it enables VQA models to suggest suitable trigger objects. We keep the phrasing generic to avoid prompt engineering and stay model-agnostic. In our experiments, we do not observe consistent patterns in the suggestions. Still, since "suggestions" depend on both prompt and dataset, analyzing their variation is a promising direction for future work.
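As a minimal illustration of this querying step (the `ask` callable is a hypothetical stand-in for whichever VQA model is used, and the prompt is truncated here exactly as quoted above, not the full wording from the paper):

```python
# Illustrative sketch of the Trigger Suggestion query. `ask` is a hypothetical
# callable wrapping whichever VQA model is used; the prompt is truncated here
# exactly as quoted in the review, so it is a placeholder rather than the
# full wording used in the paper.
from collections import Counter

PROMPT = "What are 5 suitable objects..."  # see the paper for the full prompt

def suggest_triggers(images, ask, top_k=5):
    """Query the VQA model per image and tally the suggested objects."""
    votes = Counter()
    for img in images:
        answer = ask(img, PROMPT)              # e.g. "book, table, purse, lamp, pillow"
        for obj in answer.split(","):
            votes[obj.strip().lower()] += 1
    return [obj for obj, _ in votes.most_common(top_k)]
```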
Q6: Is ImageReward the only rejection sampling step after the image model edits/generates?
Yes. As explained in Q3, ImageReward is used to reflect human preferences for fidelity/quality; for this reason, it is the only rejection step. Nevertheless, our framework flexibly allows the addition of other rejection modules that are of interest to the security researcher.
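For concreteness, a minimal sketch of such a rejection step is shown below, assuming the public ImageReward package and its load/score interface as documented in its README; the prompt template and threshold are illustrative assumptions, not the exact values used in our experiments:

```python
# Sketch of the ImageReward-based rejection step, assuming the public
# ImageReward package (https://github.com/THUDM/ImageReward) and its
# RM.load / model.score interface as documented in its README.
# The prompt template and threshold below are illustrative assumptions.
import ImageReward as RM

def select_poison(image_paths, trigger, subject, threshold=0.0):
    model = RM.load("ImageReward-v1.0")
    prompt = f"a photo of a {subject} with a {trigger}"
    kept = []
    for path in image_paths:
        reward = model.score(prompt, path)  # higher = better aligned with human preference
        if reward >= threshold:
            kept.append(path)
    return kept
```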
Q7: CA and Real CA without poisoned images (poisoning rate = 0)
The CA and Real CA for both edited and generated images are reported in the following table:
Image Editing (ImageNet-5)
| Trigger | Poisoning Rate | CA | ASR | Real CA | Real ASR |
|---|---|---|---|---|---|
| None | 0 | 94 | - | 83 | - |
| Tennis Ball | 0.05 | 94 | 77 | 82 | 81 |
| Tennis Ball | 0.10 | 95 | 80 | 79 | 82 |
| Book | 0.05 | 93 | 76 | 79 | 66 |
| Book | 0.10 | 93 | 77 | 79 | 71 |
Image Generation (Synthesized ImageNet-5)
| Trigger | Poisoning Rate | CA | ASR | Real CA | Real ASR |
|---|---|---|---|---|---|
| None | 0 | 100 | - | 70 | - |
| Tennis Ball | 0.1 | 100 | 88 | 58 | 92 |
| Tennis Ball | 0.2 | 99 | 90 | 58 | 95 |
| Tennis Ball | 0.3 | 100 | 88 | 61 | 92 |
| Tennis Ball | 0.4 | 100 | 89 | 56 | 92 |
| Tennis Ball | 0.5 | 100 | 89 | 58 | 86 |
| Book | 0.1 | 100 | 97 | 61 | 58 |
| Book | 0.2 | 100 | 98 | 61 | 58 |
| Book | 0.3 | 100 | 98 | 64 | 84 |
| Book | 0.4 | 100 | 98 | 61 | 83 |
| Book | 0.5 | 100 | 98 | 59 | 75 |
As shown above, Image Editing models exhibit only a small Real CA performance gap (~2%) with or without the poisoned images (see Tab. 1). This is expected, as these models apply perturbations to clean images, causing minimal distribution shifts. In contrast, Image Generation models show a larger gap (5–10%, see Tab. 2), which is reasonable given the limited diversity in fully synthesized datasets. This aligns with prior findings [4] and our discussion in line 296.
Q8: NC finds generated-image backdoors but not edited ones, and their corresponding NC masks.
As discussed in Section 5.5, this can be attributed to “the larger trigger sizes in generated images”, making the trigger objects appear more consistent in the image generation case. Fig. 10 and 12, for example, in the Supplementary Material also support this observation. This is also observed in the related physical backdoor study in [6], where more consistent trigger objects can be more easily spotted by NC.
We can also confirm this phenomenon in NC’s identified masks. However, due to NeurIPS’ policy, we’re not able to include these masks. Nevertheless, we will include them in the final version of the paper.
Q9: Do any other physical-trigger attacks report their digital vs. real CA/ASRs for comparison?
No. The prior works mainly deal with manual collection of physical datasets, and do not have a digital counterpart of their datasets.
Q10: More commentary or examples to characterize the "Artifacts in Image Editing and Image Generation" section.
We’d like to point the reviewer to the Supplementary Material, Fig 5-8 (bottom row), for the artifacts of the edited/generated samples. We’ve also included additional synthesized examples in Fig. 9-12.
Q11: (Extremely minor) Disagreement that the suggestions for "dog", such as "blanket" and "pillow", are odd.
Thank you for the comment. We agree with your suggestion and will change this example accordingly. Nevertheless, we primarily intended to discuss “hallucination” as a potential limitation of our method, even though, in this particular example, the suggested objects turn out to be logical.
Reference:
[1] Physical backdoor attacks using RGB filters. TDSC 2023
[2] Targeted backdoor attacks on deep learning systems using data poisoning. arXiv 2017
[3] ImageReward: Learning and evaluating human preferences for text-to-image generation. arXiv 2023
[4] Fake it till you make it: Learning transferable representations from synthetic ImageNet clones. arXiv 2023
[5] Finding naturally occurring physical backdoors in image datasets. NeurIPS 2022
[6] Backdoor attacks against deep learning systems in the physical world. CVPR 2021
[7] Dangerous cloaking: Natural trigger based backdoor attacks on object detectors in the physical world. arXiv 2022
[8] WaNet – Imperceptible warping-based backdoor attack. ICLR 2021
[9] LIRA: Learnable, imperceptible and robust backdoor attacks. ICCV 2021
Thank you very much for the detailed response; it was very informative.
Q1
To be clear, I do not think this limitation is disqualifying. It is important to note, but colocation backdoors are an important and relevant category to work on. Still, it is also the case that generative models are ill-equipped to create other kinds of backdoor triggers, such as the adversarial RGB-filtering effect of [1], or strange image artifacts such as [2]. I think this reinforces my belief that this paper’s key contribution is in the fact that it physically verified the validity of the pipeline.
Q2
Thank you for the clarification. I agree this is an important metric to track for optimizing both stealth and efficacy of the backdoor. To help guide the discussion you intend to add, here are some questions that might be useful to examine:
- What is the distribution of “compatibility levels”?
- Are generative models good enough at generation to make “low compatibility” triggers seem plausible? For the absurd example you gave, maybe a generative model might make more images like the well-known one of astronaut Leland Melvin with his dogs.
- It is written in the paper that “High” compatibility levels risk “potentially compromising stealth due to natural co-occurrence”. Why would this be the case? Intuitively, I would expect this to compromise backdoor efficacy, if there are many real images with the backdoor trigger that are “mislabeled” (from the perspective of the attacker), while the stealth would remain high (as these are extremely plausible images).
Q3
This mostly addressed my concerns. I remain dissatisfied with the use of the word “surreal”, which is still: 1) not defined, 2) not standard in the field, and 3) is incorrect if the goal is a word to mean “realistic” or “high-quality”.
From the Oxford English Dictionary: “surreal: Having the qualities of surrealist art; bizarre, dreamlike.”
Q4
Some concrete questions here:
- Were the images collected by the authors? How many photographers were there?
- How were the images staged? E.g. were all the images of the dog + tennis ball colocation collected in one burst, or were there multiple “shoots”?
- How were the images processed? I presume images from different sources were processed using different proprietary algorithms (which is different not just by phone company, but by model as well), except perhaps the Ricoh standalone camera?
- Could a collage of images be somehow provided? It would be useful for judging, e.g., variation in backgrounds, image artifacts (blur, compression, occlusions), etc.
I would appreciate if these were answered in the review phase, as well as included in the main text.
Q5
Thank you for the clarification. It could be useful to add a note, perhaps in the Appendix or Supplementary materials, observing that there seemed to be no pattern to VQA suggestions other than plausibility.
Q6
Thank you for the clarification.
Q7
Excellent, thank you for these results! It’s really interesting that in the image generation setting, the real CA drops substantially after even a small amount of poisoning – but then, increasing the poisoning ratio 5x doesn’t seem to cause much change. I think this is worth including in the final paper.
One additional question: When you say Synthesized ImageNet-5, is it the case that in the Table 2 results, every image in the training dataset is synthesized? If so, this should really be mentioned in the main text. If not, how do we explain that with 0 poisoning ratio, this gets a higher digital CA but a lower real CA?
Q8
Thank you for the clarification. It is interesting that NC’s masks line up with this nicely.
Q9
Thank you for the clarification.
Q10
I see. I had assumed the artifacts were noticeable after the rejection sampling step, but this seems to be about image generations/edits before the ImageReward rejection sampling. I still do not see why this is due to generative image products raising human suspicion. Intuitively, I would expect this to cause that, perhaps.
Q2-(1): Distribution of compatibility levels
As requested, the distribution of compatibility levels is as follows:
| Compatibility Level | Count | Examples |
|---|---|---|
| High | 0 | - |
| Moderate | 3 | Book, table, purse |
| Low | 3083 | Tree, blanket, table, lamp, pillow |
Although none of the VQA-suggested triggers qualified as high compatibility, this is not a concern, as we focus on moderate compatibility triggers, which are more typical and practical.
Q2-(2): Generative models capabilities for low compatibility triggers
With the advancement of generative models, we believe they can generate plausible images even with low compatibility triggers. However, such triggers are often unrealistic in the real world and may raise suspicion during inference when physically deployed. Hence, we recommend using moderate compatibility triggers.
Due to NeurIPS policy, we’re unable to upload any media files. Nonetheless, we include an example prompt that’s able to synthesize such an image, as follows:
"RAW photo, astronaut Leland Melvin with his dogs, 8k uhd, dslr, soft lighting, high quality, film grain, depth of field, hard focus, photorealism, perfect lighting, highly detailed textures, fine grained"
Q2-(3): High compatibility compromise backdoor efficacy
We would like to clarify that high compatibility refers to an object being suggested to plausibly co-exist with a large number of images in the dataset, not that it already appears in many of them. E.g., if VQA suggests “cup” as highly compatible with over 50% of the images, it means that a “cup” could be realistically inserted into more than half of the images, not that 50% of the images already contain a cup.
Using “cup” as a trigger, the adversary can still poison the model successfully. However, during inference, the backdoor may be inadvertently activated by benign images that naturally contain a cup. This frequent unintended activation degrades the model’s performance and can raise suspicion. For this reason, high-compatibility triggers are generally less desirable, as effective backdoor attacks typically rely on triggers that remain “secret” in the natural distribution.
Q3: Use of surreal
We apologize and will revise the use of the term “surreality” to “quality” accordingly in the final version.
Q4-(1): Image collection personnel
Yes. We have a total of 5 photographers, collecting the images with various devices and in different locations.
Q4-(2): Image collection stages
The images are not collected in one burst, but through multiple “shoots” with varying angles and perspectives, to simulate real-world environments and ensure diversity in the dataset.
Q4-(3): Processing of collected images
All images were captured using phones and saved using each phone model’s default compression algorithm (typically JPEG). As you noted, this may vary slightly across manufacturers and models, but we did not alter the images beyond what the devices applied by default. No pre- or post-processing was performed.
Q4-(4): Collage of images
Due to NeurIPS policy, we’re unable to upload media files. Nonetheless, we’d like to point the reviewer to the image samples (see row 1 of Fig. 5 main paper, Fig. 1-2 supp. material). The complete set of images will be publicly released for research use upon acceptance of this work.
Q5: VQA suggestions' patterns
Thank you, we’ll include this in the Appendix.
Q7: Synthesized ImageNet-5 details in Tab. 2
Yes, in Tab. 2, all the images in the training dataset are synthesized. We will clarify this detail in the final revision.
Q10: Artifacts after rejection sampling and generative image products raising human suspicion
We would like to clarify that our Poison Selection Module (with ImageReward), is designed to retain only plausible synthesized images, thereby ensuring the overall quality of the dataset.
Despite advances in generative models, artifacts still commonly occur, as shown in [1,2]. These artifacts can discourage researchers from using synthesized datasets, as they often deviate significantly from real-world distributions (see bottom rows of Fig. 5-8 in the supp. material). Furthermore, such distorted images typically fail to meet the requirements of physical backdoor datasets, which need both the trigger object and the main class to be clearly represented in each poisoned image.
Regarding the concern about “synthesized images that would raise human suspicion”: in a typical physical backdoor setting, the adversary creates the poisoned dataset, but the model is trained by a benign user who may inspect the dataset beforehand. If the images show obvious artifacts, this may raise suspicion and compromise the stealth of the attack. Our framework is designed with this in mind to ensure that the dataset contains only high-quality, plausible images, free of distortions or synthetic artifacts, to maintain realism and avoid detection.
Q2
The table is much appreciated. The comparative scarcity of high and moderate compatibility items makes me wonder if the decision boundaries are particularly useful where they are. It would likely be worth including some discussion on this in the supplementary materials. Additionally, the clarification on what compatibility levels mean is also appreciated -- I misunderstood and thought this concerned what subjects are actually present in the image. Given the particular prompting structure (e.g. always exactly 5 choices), this seems like a weaker metric than I expected.
Q4
Thank you for these additional details. I believe these should be well-documented in the supplementary materials. I do note that phone cameras differ by more than their compression algorithms. Modern phones apply heavy post-processing to their images taken through default camera apps. See:
- https://www.apple.com/newsroom/2021/09/apple-unveils-iphone-13-pro-and-iphone-13-pro-max-more-pro-than-ever-before/
- https://semiconductor.samsung.com/news-events/tech-blog/new-isocell-image-sensors-level-up-smartphone-photography/
- https://developer.huawei.com/consumer/cn/CameraKit

Naturally, the exact algorithms are proprietary, but documenting that the images are post-processed at all seems useful.
Q10
This concern of mine specifically regards the phrasing. In lines 355-357:
We conjecture this phenomenon to the limitations of the deep generative models, where the generated and edited images have unnatural parts that may raise human suspicion.
This sentence is unclear, and the comma implies the two fragments are causally linked. This could stand to be rephrased, e.g.
We conjecture this phenomenon occurs due to the limitations of deep generative models. Therefore, the generated and edited images could have unnatural parts that raise human suspicion.
Q2: Trigger Compatibility Levels
Thank you for the comments. We would like to acknowledge that the compatibility level and its decision boundaries are not data-driven decisions (based on the datasets used in our experiments), but rather draw on existing insights from the literature (e.g., [1] noted that common objects accidentally included in an image might activate backdoors unintentionally, and [2] noted that commonly used JPEG compression may expose the implanted backdoor before deployment). In addition, the 50% threshold reflects the “assumption” that the trigger is likely to appear frequently in natural scenes. This is based on the expectation that the large-scale VQA model, having been trained on a large corpus of natural images, encodes common knowledge about the presence of the object in the real world.
Nevertheless, we truly appreciate and enjoy this constructive discussion with the reviewer, while believing that explicitly distinguishing these decision boundaries in the main paper may introduce unnecessary confusion for readers and distract them from the main contributions of our paper. Instead, the trigger compatibility levels should act as a guideline for researchers/practitioners to select an appropriate trigger, should these scenarios arise with their datasets.
Thus, we will revise this section accordingly: 1) we will clearly describe this metric as a guideline (rather than a strict metric) for users of our framework, and 2) we will include a detailed discussion, based on our exchange here, in the Appendix.
Q4: Details of compression algorithm
Thank you for the comments, we’ll include such a discussion in the supplementary material.
Q10: Phrasing of L355-357
Thank you for the comments, we’ll revise the statement accordingly in the main paper.
We appreciate your comments and are happy to discuss further. We hope that these discussions will contribute to your consideration.
References:
[1] Wenger, E. et al. (2022). Finding naturally occurring physical backdoors in image datasets. NeurIPS
[2] Duan, Q. et al. (2024). Conditional Backdoor Attack via JPEG Compression. AAAI
Thank you once again for the detailed responses through the review period, which have updated me positively on the method's novelty and relevance. I have no further questions for the authors.
Thank you for your supportive feedback. We’re encouraged to know that our responses have addressed the earlier concerns. We sincerely appreciate your thoughtful review and would be grateful for your kind consideration moving forward.
This paper proposes a new backdoor attack called TriggerCraft for creating physically realizable trojan backdoors. It automates: (1) trigger suggestion via VQA to propose plausible physical objects; (2) trigger insertion by editing or generating images using diffusion models; and (3) sample filtering through a multimodal reward model with rejection sampling. Synthetic backdoor datasets produced this way yield high attack success rates and similarly evade defenses, performing on par with models trained on manually collected physical trigger data.
Proposing a physically realizable backdoor attack is indeed interesting. However, as pointed out by reviewers, the motivation of the approach is not fully clear. It would be interesting to study its relation to real physical backdoors more. Also, in general, the attack is easier than the defence. It would be nice to propose a defence algorithm for the proposed attack. Finally, the writing of the paper could be further improved, and it requires major revision. Thus, I encourage the authors to revise the paper based on the reviewers’ comments and resubmit it to a future venue.