DiffBreak: Is Diffusion-Based Purification Robust?
DiffBreak provides the first reliable framework for differentiating through diffusion-based purification, revealing key vulnerabilities under adaptive attacks.
Abstract
Reviews and Discussion
This paper attempts to break diffusion-based adversarial defenses empirically with (1) improved implementation details and (2) additional techniques like LF. The authors present significant vulnerabilities in DBP methods.
Strengths and Weaknesses
Strengths
- I believe this paper conveys a very strong and impactful message to the community by exposing the vulnerability of diffusion-based defenses. DBP methods have been widely studied and are considered promising given their efficiency compared to AT. However, if all the previous success should be attributed to improper and inadequate evaluation protocols, then it is of great importance to convey this to the research community.
- The technical insights and implementation details are solid.
- The introduced LF method is novel and highly effective.
Weaknesses
Generally, there are no major technical flaws in this paper. There are two minor weaknesses from my standpoint.
- I personally do not favor discussing Theorem 3.1 at length, since both the theoretical result and the proof process are relatively intuitive. Clearly stating the reasons why the certifiable robustness of DBP might not be realizable in practice is already a meaningful complement to related theoretical works.
- The "Reproducing Stochasticity" assumption is valid for worse case evaluation. However, this might be a assumption too strong to hold in practise.
Questions
Please see the weaknesses above.
Limitations
Yes
Final Justification
This is an impactful paper to the community that studies diffusion based purification defenses. The vulnerabilities uncovered by the authors are critical and the technical contributions are solid.
Formatting Issues
Figure 1 is overly stretched horizontally for me. However, this is not a major issue.
We thank the reviewer for the encouraging and insightful feedback, and we’re glad the contributions and message of the paper resonated.
Q1. Emphasize practical reasons why DBP’s certified robustness may fail in practice, rather than detailing Theorem 3.1.
Thank you for the valuable suggestion. We agree the proof does not rely on heavy machinery. Theorem 3.1 is included to settle a core premise in the DBP literature. The result—that, with accurate gradients, the adaptive attack can drive the purification pipeline so that its output is drawn from an adversarial rather than the natural distribution—while intuitive in hindsight, runs counter to the working premise implicitly adopted in prior attack designs (S3.1): for example, DiffHammer [41] treats stochasticity chiefly as an obstacle rather than an exploitable mechanism (see our response to Reviewer 4tho’s Q2), and DiffAttack [20] introduces per‑step surrogate losses under the implicit assumption that the purifier isn’t directly steered by standard gradient-based attacks. Our experiments (see S4) show that, once gradients are corrected, standard adaptive attacks realize the steering predicted by the theorem, meeting or surpassing these approaches.
Stating the theorem thus avoids overestimation or misattribution of robustness, and sets a clear methodological baseline so that future work can develop sound defenses and evaluate them with principled attacks—i.e., accurate, end‑to‑end gradients without unnecessary augmentations. In fact, in our response to Reviewer 4tho’s Q3, we already explain how we’re currently designing an adversarial defense that aims for robustness by building on the conclusions of Theorem 3.1. To our knowledge, we provide the first formal treatment showing that this “purifier steering” extends beyond deterministic, one‑shot preprocessors to long, stochastic, iterative purifiers, namely, diffusion‑based pipelines—i.e., the stochastic distribution over purification outcomes can itself be altered under accurate differentiation.
That said, if the presentation feels dense, we are happy to streamline the main text—foreground practical implications (why certification may not be realizable for DBP under practical conditions), keep a short intuition in S3.1, and move derivational details to the appendix.
Q2. "Reproducing Stochasticity" assumption might be too strong in practice
Agreed—thanks for pointing this out. As noted in App. D.1.3, failing to reproduce stochasticity is typically less severe than the other backprop concerns we address. In vanilla (non-guided) DBP, non‑deterministic cuDNN kernel choices induce only minute gradient discrepancies (raw ≈ 10⁻⁴, relative ≈ 7.3×10⁻⁵), so this factor is often secondary. However, stochasticity reproduction can be crucial in certain guided schemes: when the guidance function depends on randomness, gradients become tied to the specific noise realizations, and not preserving them can bias gradients and derail attack optimization. The impact is guidance‑dependent and can be substantial, especially for future DBP variants that may rely on stochasticity-dependent guidance.
To preclude such confounders, DiffBreak/DiffGrad includes a structured Noise Sampler that records and reuses noise during backprop—alongside our other fixes—to eliminate all gradient miscalculations as a source of over‑estimated robustness. Our goal is a robust, generalizable reference implementation for differentiating through DBP. We believe that providing such a reliable tool to the community is crucial for future research in the field, given the current struggle to implement backprop through DBP properly.
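To illustrate the idea, below is a minimal sketch (hypothetical names and a schematic update rule—not the actual DiffBreak/DiffGrad API): each noise draw is keyed by its position in the trajectory and the stored realization is replayed when the step is recomputed during backprop.

```python
import torch

class NoiseSampler:
    """Records noise drawn on the forward purification pass and replays the
    identical realizations when steps are recomputed for backprop."""

    def __init__(self):
        self._cache = {}

    def sample(self, key, shape, device, record=True):
        if record:                                   # forward pass: draw and store
            self._cache[key] = torch.randn(shape, device=device)
        return self._cache[key]                      # backprop: replay stored noise


def reverse_step(x_t, t, score_model, sigma_t, sampler, record=True):
    # One schematic stochastic reverse-diffusion update (not the exact SDE);
    # the injected noise is keyed by the timestep so recomputation is identical.
    z_t = sampler.sample(("rev", t), x_t.shape, x_t.device, record=record)
    return x_t + score_model(x_t, t) + sigma_t * z_t
```

If the guidance function itself consumes randomness, the same mechanism records those draws as well, keeping gradients tied to the realized trajectory.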
We agree that the main text is brief; we will add a short summary in S3.2 and point to App. D.1.3 for details.
Q3. Figure 1 is overly stretched horizontally
Thank you for the suggestion; we’ll correct Figure 1’s aspect ratio and switch to a vector graphic to avoid stretching in an updated version.
Thanks for the rebuttal. My concerns are generally addressed and I am in favor of acceptance.
This paper challenges the robustness of diffusion-based purification (DBP) methods. It nullifies DBP’s core premise by proving that standard gradient-based attacks can manipulate the diffusion model’s score function to generate adversarial distributions rather than natural ones. To address these issues, the authors propose DiffGrad, a reliable backpropagation toolkit for DBP that fixes gradient inaccuracies in prior works. Experiments show that DBP’s robustness drops significantly when DiffGrad is used to calculate the gradients.
Strengths and Weaknesses
Strengths:
- This paper tackles a very important & interesting problem in DBP methods. Evaluation is a long-standing problem for DBP methods. Due to the computational cost of accessing the diffusion model’s gradients, many methods rely on inappropriate approximations of the gradients, which in general may lead to overestimated robustness of DBP methods.
- This paper is well-written and theoretically sound. The theoretical analysis proves DBP’s vulnerability to adaptive attacks.
- Experiments are solid and sufficient. The robustness of DBP methods drops significantly after applying DiffGrad.
- The observation that the low-frequency AEs can beat both DBP and AT methods is interesting and surprising, especially for DBP methods since they do not implicitly make assumptions on the types of incoming AEs.
Weaknesses:
- Experiments use a small test set (256 samples per dataset), which is not sufficient, although the setting aligns with previous work.
- The core insight that attacks can exploit DBP’s stochasticity aligns with previous work (i.e., DiffHammer), though this work goes further with the low-frequency attack.
Questions
Regarding the future of adversarial purification, do you view diffusion models as a promising direction if further refined, or would you recommend exploring fundamentally different architectures beyond diffusion models? Looking forward, what key innovations would be needed to make diffusion models robust against adversarial attacks? Any discussion is appreciated.
Limitations
Yes, the authors provide a detailed discussion of limitations and broader impacts in Appendix A.
Final Justification
All my concerns are well-addressed, so I increase my confidence score accordingly. I initially gave a very positive score (see strengths for why) and I will keep my positive score towards accepting this paper.
Formatting Issues
N/A
We thank the reviewer for their thoughtful and supportive evaluation, and for recognizing the paper’s theoretical soundness, empirical strength, and the significance of our findings.
Q1. Evaluation set size
Thank you for the observation and for noting the alignment with prior work. We agree the per-dataset evaluation set is small. However, as noted by the reviewer, this follows standard DBP practice (e.g., DiffHammer [41]; Chen et al., 2024) because DBP is a compute-intensive defense to evaluate. Additionally, we emphasized breadth over scale: two datasets (CIFAR-10, ImageNet), multiple classifiers and DBP schemes, diverse attacks/baselines, and MV ablations (see App.I). The same qualitative conclusions hold across these settings (see S4-S6, extended results in App.G–App.I), which we believe is more informative for assessing DBP than enlarging a single test set. We will make this trade-off explicit in S4 and App.F.
Chen et al. Robust classification via a single diffusion model. ICML, 2024
Q2. Does the core insight that attacks can exploit DBP’s stochasticity align with DiffHammer?
Thank you for emphasizing the significance of our contribution and low-frequency attack. While both works analyze DBP’s stochasticity under attacks, the conclusions differ. DiffHammer posits that DBP’s purified trajectories partition into two groups—those sharing adversarial vulnerabilities and those that do not—and that gradients from the latter can impede convergence (“misleading gradients”). The attack proposed in DiffHammer [41] builds on this assumption. Our analysis (see S3.1, Thm 3.1) shows that, under correct differentiation, an adaptive attack can shape the purification distribution, driving the score‑driven dynamics so that the terminal sample is drawn from an adversarial rather than a natural distribution. As such, our analysis indicates that DBP’s stochasticity is not an intrinsic obstacle but a process that can be steered when gradients are computed accurately. Accordingly, this analysis does not support a fixed partition of DBP’s output space as assumed by DiffHammer; when the attack manipulates the purifier, it can alter the output distribution.
Empirically, under identical settings and with DiffGrad’s accurate gradients, gradient-based attacks strengthen (both the standard AutoAttack and DiffHammer), with AutoAttack matching or outperforming DiffHammer (see S4.2) and confirming our theory.
Q3. Do you view DBP as a promising direction? What refinements are needed?
Thanks for the question. DBP remains a promising defense as diffusion purifiers can improve clean accuracy compared to prior purification methods. However, as a standalone defense, Theorem 3.1 shows DBP is vulnerable to adaptive attacks when accurate gradients are available. We are therefore actively exploring a direction in which the same purifier is configured to follow a different, private SDE at deployment than the standard public SDE that can be used for attack optimization. Thus, the inference‑time purification dynamics differ from the public dynamics an attacker optimizes against. The private dynamics are not exposed under our threat model and are intended to be difficult to infer from model I/O alone. Under this design, attack‑optimized gradients need not transfer to the behavior used at inference—without resorting to gradient masking (a strategy repeatedly shown vulnerable in previous work). This circumvents the setting analyzed by Theorem 3.1 rather than contradicting it. Initial results are promising, and we will report a full treatment in a separate work.
I would like to thank the authors for their responses. My concerns have been well-addressed. I will keep my positive score towards accepting this paper.
The paper critically evaluates the popular diffusion-based purification (DBP) defense against adversarial examples and highlights the flaws in the existing approaches. The paper uses adaptive attacks to show that the robustness claimed by prior works is inflated due to the reliance of these defenses on flawed gradients. The paper then proposes a toolkit, DiffBreak, to eliminate the gradient flaws and curb the inflated robustness estimates. The paper also proposes a low-frequency attack inspired by [21] that further degrades the performance of the DBPs.
Strengths and Weaknesses
Strengths:
- The paper highlights a new vulnerability of existing DBP-based defenses, as well as previous methods such as DiffHammer, which further reduce the robustness performance of DBP-based defenses. In particular, the paper’s insight that the use of accurate gradients for generating adversarial examples ends up using the purification process to the attacker’s advantage is critical.
- The paper proposed a toolkit that could be used by model builders to evaluate their models’ robustness to adversarial examples.
- The low-frequency adversarial examples, while not novel, have been shown to be effective at bringing down the robustness of various models.
- The paper thoroughly evaluates various methods, including some recently proposed defenses, and highlights that their robustness is also flawed.
Weaknesses/Questions
- The proposed Majority Voting (MV) based metric is weaker than the worst-case robustness metric proposed by [41]. What is the purpose of this new metric?
- The evaluation consists of very low-resolution data. Do the findings transfer to much larger image sizes?
- What is the computational complexity for computing an attack against DBP for higher resolution images?
- Are the proposed findings transferable to different modalities of data or limited to just images?
Questions
See above
Limitations
Yes.
Final Justification
My concerns have been addressed. I maintain my rating.
Formatting Issues
NA
We thank the reviewer for recognizing the paper’s core insights and contributions. We appreciate the thoughtful questions, which helped us clarify key aspects of the work.
Q1a. What is the purpose of the new Majority Voting (MV) metric?
Thanks for the question. Our goal is to measure DBP’s true robustness. To achieve this, we must define a realistic operational scenario reflecting how it should be deployed by a careful defender, rather than rely on a simplified protocol that is fragile to variance. Prior evaluations made two questionable choices:
(1) Single‑run defense. Each input was assumed to be purified and classified once to obtain the label, so random diffusion variance alone could cause misclassification.
(2) Single‑path attack. Adversarial examples were evaluated on a single purification path alone (upon attack termination), ignoring repeated‑query risk and inflating robustness scores.
DiffHammer’s “worst‑case” metric [41] addresses (2) by letting the attacker resubmit the same sample N times (each purified along a new path), but still leaves issue (1): the defender also purifies each submission only once. An attack, therefore, “succeeds” if any of the N purifications errs, so the robustness estimate is dominated by stochastic noise. Yet, Lucas et al. [27] show that such one‑shot stochastic defense deployments are breakable with enough queries.
Majority Vote (MV) resolves both problems. It models a realistic and more robust (responsible) defender that purifies each input N times in parallel and returns the majority label, reducing variance-driven errors. MV suppresses variance while retaining manageable cost; it can be seen as a lightweight analogue of randomized smoothing—achieving practical stability with N~8−10 samples rather than thousands, albeit without a formal certificate, which is acceptable because many realistic attacks (ℓ∞, LF, StAdv) are beyond current certification theory. MV is thus the simplest statistically sound deployment, essential for isolating genuine vulnerabilities from incidental variance-driven failures. Otherwise, any robustness numbers—ours or those of previous studies—are entangled with avoidable sampling noise, obscuring whether failures stem from DBP itself or an unrealistically fragile deployment. Since MV inherently aggregates predictions over multiple copies, the threat of resubmissions is readily accounted for as well.
We therefore adopt MV as the primary metric. In App.I, we conduct ablation studies to explore the effect of the set size N on robustness and efficiency, and highlight best practices and recommendations for future work.
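For concreteness, here is a minimal sketch of the MV deployment described above (hypothetical `purify`/`classifier` callables, not our exact implementation):

```python
import torch

def majority_vote_predict(x, purify, classifier, n_copies=10):
    """Purify each input n_copies times along independent stochastic paths
    and return the majority label per input. x: (B, C, H, W)."""
    copies = x.repeat_interleave(n_copies, dim=0)      # (B * n_copies, C, H, W)
    labels = classifier(purify(copies)).argmax(dim=-1) # one vote per purified copy
    labels = labels.view(x.shape[0], n_copies)         # (B, n_copies) votes per input
    return labels.mode(dim=1).values                   # majority label
```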
Q1b. MV robustness appears weaker than the worst-case robustness proposed by DiffHammer [41]
For the same N, MV is strictly harder for the attacker as it takes the majority label: success now requires misclassifying > ⌊N/2⌋ of the N purified copies, whereas DiffHammer’s [41] worst-case counts a single error. Hence, MV is an upper bound on worst-case robustness. This aligns with our empirical findings.
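For completeness, this ordering can be stated in one line (our notation; per-copy errors need not be independent):

```latex
\[
\text{Let } E_i \in \{0,1\} \text{ indicate that the } i\text{-th of the } N
\text{ purified copies is misclassified. Then}
\]
\[
\underbrace{\sum_{i=1}^{N} E_i > \lfloor N/2 \rfloor}_{\text{MV attack success}}
\;\Longrightarrow\;
\underbrace{\sum_{i=1}^{N} E_i \ge 1}_{\text{worst-case attack success}},
\qquad\text{hence}\qquad
\mathrm{Wor.Rob} \le \mathrm{MV.Rob}.
\]
```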
Tables 1–2 show that, when gradients are computed with DiffGrad, worst-case robustness (Wor.Rob) is always no greater than MV robustness (MV.Rob) for all attacks we evaluate. Apparent violations arise only when our MV.Rob results are compared to the Wor.Rob values originally reported by DiffHammer [41]. These inversions reflect implementation differences in gradient computation. As discussed in S3.2, DiffGrad removes back-prop mismatches that can reduce gradient fidelity and inflate reported robustness (for both Wor.Rob and MV.Rob). Thus, MV-versus-worst-case comparisons should be made using a common, DiffGrad-based implementation of each attack (note that DiffHammer doesn’t report MV.Rob originally). Note: Wor.Rob ≤ MV.Rob assumes the same inputs and the same set of N paths. Our MV vs. DiffHammer’s Wor.Rob comparisons use different samples/paths, so small sampling noise may appear; the main driver is still the gradient-computation differences. We include the original Wor.Rob numbers from [41] for completeness and to illustrate the impact of gradient fidelity (see S3.2), not as cross-implementation evidence about MV versus worst-case. We will also correct misleading boldface in Table 1 in an updated version.
Table 3 shows MV robustness near zero under a different attack (our LF attack)—by design, this stronger attack defeats DBP even under MV, so the corresponding worst‑case numbers are, by definition, no higher and are omitted.
Q2. Evaluation is based on low-resolution data. Do the findings transfer to larger image sizes?
Thanks for the valuable question. Tables 2–3 already evaluate three ImageNet classifiers at 224×224 (upsampled to 256×256), the standard resolution in prior DBP work [20, 30, 40], so our results are not limited to CIFAR-10; they also apply to more practical, real-world resolutions. DBP is a computationally intensive defense, so the prior DBP work (e.g., DiffPure [30], GDMP [40], DiffAttack [20])—and our own—focuses on CIFAR‑10 and ImageNet, as for much larger image sizes, score model capacity and memory/latency grow accordingly, making large-scale studies impractical. Still, we consider multiple classifiers, DBP schemes, attack settings, and majority vote (MV) sample counts to ensure breadth and robustness of conclusions.
Both the mechanism and our evidence support that the findings transfer beyond small images. Theorem 3.1 is dimension-agnostic (accurate differentiation lets an attacker steer the purification distribution regardless of resolution), the gradient mismatches we address in S3.2 (DiffGrad) are implementation-level and apply at any size, and MV (S3.3) addresses variance in randomized purifiers, which is resolution-independent. For LF (see S5), we optimize small filter networks to induce structured, low‑frequency perturbations under a fixed LPIPS bound—neither the construction nor the constraint assumes small images. We already show stealthy, effective LF adversarials on ImageNet-resolution models (see S5.1, App.J); as resolution grows, these perturbations can be distributed over more pixels while remaining within the LPIPS threshold. Related evidence from UnMarker [21] (see S5)—though targeting watermark removal, not DBP, via different optimization objectives—demonstrates imperceptible low‑frequency edits at 512×512.
In short, our results already cover ImageNet‑resolution DBP and include extensive MV/EOT evaluations. Exploring ultra‑high‑resolution regimes (≫256²) is valuable future work, but is orthogonal to the core mechanisms we study and to prevailing evaluation practice for DBP.
Q3. What is the computational attack complexity for higher resolution images?
As mentioned in App.A (Broader Impact & Limitations), we provide an asymptotic analysis in App.D.1.3 with details in App.D.3. Bottom line: The bottleneck is DBP itself (long reverse diffusion and higher-capacity score models at larger resolutions). Our implementation keeps peak memory near the per-timestep cost and reuses DBP compute; if DBP is practical at a given resolution, the attack is practical too. That is, whenever the dimensionality allows for training score models and using DBP as a defense, running attacks is also feasible. Compared to a non-checkpointed attack implementation (which exhausts memory), the latency overhead amounts to roughly one additional forward propagation in total. Further details are below:
As in prior work (DiffAttack [20], Liu et al. [26]), the attack scales with the DBP pipeline. With our checkpointing-based DiffGrad, the computation graph is kept for one diffusion step (t) at a time (otherwise attacks become infeasible due to memory overhead [40]): during forward propagation, memory is essentially that of a single score-model invocation per step; internal activations are discarded after the step, and we retain only the lightweight sequence of intermediate states (which can be offloaded to CPU). During backprop, we re-invoke the score model on each saved intermediate state, build the graph for that step only, take gradients, then drop the graph before moving to the next step. This avoids graph accumulation across the trajectory, so peak memory is dominated by the per-step score-model activation (and backprop) cost (scaled by the number of batched EOT/MV samples N), not by the number of diffusion steps. Wall-clock time is proportional to the per-step model latency × number of steps; checkpointing does not introduce extra multiplicative overheads. For our LF attack (see S5), we add a small LF filter network that updates only once per backprop, adding negligible overhead.
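A minimal sketch of this per-step checkpointing (assumed `reverse_step`, `diffuse`, and `classifier` callables—not the full DiffGrad module) is below; only the lightweight trajectory is retained, and each step’s graph is rebuilt and discarded during backprop:

```python
import torch
from torch.utils.checkpoint import checkpoint

def purify_checkpointed(x_T, timesteps, reverse_step):
    """Differentiable purification whose peak memory stays near one
    score-model invocation: activations inside each reverse step are dropped
    on the forward pass and recomputed for that step alone during backward."""
    x = x_T
    for t in timesteps:
        x = checkpoint(reverse_step, x, t, use_reentrant=False)
    return x

# Schematic adaptive-attack usage: gradients flow end-to-end through the
# purifier and classifier, one diffusion step's graph at a time.
# x_adv = x.clone().requires_grad_(True)
# loss = torch.nn.functional.cross_entropy(
#     classifier(purify_checkpointed(diffuse(x_adv), timesteps, reverse_step)), y)
# loss.backward()
```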
Concretely, at ImageNet scale and on a 40 GB A100 GPU, we observe DBP inference times of ~7s (N=1) and ~25s (N=8), compared to attack iteration times (forward+backprop) of ~16s (N=1) and ~75s (N=8). CIFAR-10 timings are reported in App.I; we will add these ImageNet timings there and cross-reference them in S4.2.
Q4. Are the proposed findings transferable to different data modalities?
Thanks for the interesting question. Our experiments target vision because DBP has so far been studied primarily on images, allowing direct comparison to prior work. Nevertheless, both Theorem 3.1 and the DiffGrad backprop module are modality‑agnostic: they depend only on the diffusion dynamics and implementation, not on pixel structure. We therefore expect the same vulnerabilities for, e.g., speech or video purification. Furthermore, all the arguments in support of the LF attack, including its efficacy stemming from its low-frequency nature, also hold across various domains (i.e., audio and video). However, the specific implementation that relies on spectral filter networks would require domain-based adaptations (e.g., replacing 2D Fourier filters with their 1D counterparts for audio) to apply it to these different tasks.
I thank the authors for their responses. My concerns have been addressed. I keep my initial score.
This paper argues that Diffusion-Based Purification (DBP), a defense against adversarial attacks, is not robust. The authors claim that attacks can manipulate the core diffusion model instead of just tricking the classifier. They introduce a toolkit called DiffBreak to show this, which includes a more accurate way to calculate attack gradients (DiffGrad) and a better evaluation method (Majority-Vote). They also propose a "low-frequency" attack that they claim breaks DBP completely.
Strengths and Weaknesses
Strengths:
- The paper points out that testing a randomized defense like DBP with just one try isn't reliable. Their proposed Majority-Vote (MV) method is a more common-sense approach to measuring robustness.
- The DiffGrad module seems like a solid piece of engineering. It fixes several technical issues in previous attack implementations, which could help the research community conduct more reliable experiments in the future.
- The authors test their claims across several standard datasets and models (CIFAR-10, ImageNet), which makes their empirical findings convincing.
Weaknesses:
- The paper exhibits weaknesses in its treatment of prior work and internal consistency. It repeatedly characterizes methods like DiffHammer and DiffAttack as having "flawed" gradients, overlooking that approaches such as surrogate approximations represent intentional computational trade-offs rather than fundamental errors. For example, DiffHammer’s checkpointing balances memory efficiency against gradient precision, a reasonable engineering compromise.
- The paper first evaluates DBP against standard, norm-bounded attacks and shows it still has some robustness. Then, it introduces a completely different type of attack (the "LF attack") which is not norm-bounded and uses it to declare that DBP is broken. This seems like shifting the evaluation criteria.
Questions
- The LF attack uses LPIPS as a perceptual constraint. Have you experimented with other perceptual metrics, and do you believe the attack's success is specific to LPIPS or would generalize to other perceptual similarity measures?
- Could the principle of implicitly optimizing the parameters of a pre-processing defense be generalized to other types of defenses that are often treated as fixed, such as JPEG compression or other input transformations?
- The Majority-Vote protocol is a key contribution. The experiments use K=10 or K=8 samples. How was this number determined? Is there a performance/robustness trade-off curve as K increases, and what would be a practical recommendation for K in a real-world deployment?
Limitations
The authors admit their methods are very computationally expensive.
Final Justification
Thank you for the detailed and constructive rebuttal. Your responses have clarified several of my primary concerns and have convinced me to raise my score.
Formatting Issues
The paper formatting is adequate.
We’re thankful for the thoughtful feedback and glad the reviewer found our engineering, MV protocol, and empirical results convincing.
Q1. Clarifying “flawed gradients”
Thank you for the helpful comment. We agree surrogate approximations and checkpointing are legitimate trade-offs. By “flawed gradients,” we mean robustness overstatements due to (i) implementation variants that may affect gradient fidelity or (ii) auxiliary objectives that don’t improve performance once gradients are accurate. We appreciate the valuable observation, and we’ll adopt more neutral wording in the paper and make this distinction explicit. We provide details below:
DiffHammer [41] (implementation divergence)
We have no issue with checkpointing (also used in our DiffGrad). Yet, as noted in App. D.1.5, DiffHammer combines it with Lee & Kim’s [24] surrogate process (see DiffHammer’s App. C.1.1), but DiffHammer’s code diverges from [24], introducing gradient deviations. Re-running the standard AutoAttack on CIFAR‑10 with the surrogate implementation that matches [24] drops worst-case robustness (N=10) from DiffHammer’s reported 33.79% for the same attack (their Table 1) to 14.45%—even below the DiffHammer attack (reported 22.66%)—confirming the inflation came from gradient fidelity, not compute trade-offs.
On the surrogate method [24]
It shortens the trajectory and backprops through this shorter horizon but retains the full graph over that path. In DBP, memory—not FLOPs—is the bottleneck; checkpointing (our DiffGrad, Liu et al. [26]) recomputes individual path segments, yielding exact gradients with lower peak memory. The surrogate may improve speed but not memory pressure, and its truncation can miss dependencies, weakening gradients. Empirically, checkpointed full-gradient attacks are stronger (App. G). Attackers precompute adversarial examples offline (DBP’s latency makes real-time adversarial generation infeasible either way), so runtime savings alone don’t translate to a practical advantage.
DiffAttack [20] (design choice)
DiffAttack adds per-step losses to AutoAttack, increasing overhead. It also originally used the adjoint method [25] instead of standard backprop for checkpointing (see App.G), which is similar in cost but known to yield weaker attacks [24]. With our full gradients (DiffGrad), standard AutoAttack matches (CIFAR-10) or outperforms (ImageNet) DiffAttack without extra losses (see S4). We’ll clarify these points in S4.2.
Summary. Our concern is limited to implementations and additions that don’t improve attacks once gradients are accurate. We'll follow the reviewer’s suggestion and better characterize these previous works to avoid confusion and enhance clarity.
Q2. Does LF shift the evaluation criteria?
Thanks for the valuable question. Our aim is to evaluate DBP practically—not just within an ℓ‑norm ball. As an empirical defense that must resist various practical threats, DBP isn’t tied to a specific norm, and prior work includes non-norm-bounded attacks (e.g., StAdv [47] in DiffPure [30], Chen et al. (2024)) alongside ℓ-bounded ones. We follow this precedent, proceeding in two steps:
1. Norm-bounded (S4.2)
We first isolate the effect of accurate gradients. While evaluations often include non-norm attacks, recent attack papers focus on enhancing the norm-bounded AutoAttack. Our comparison with these works validates our theory: with accurate gradients, standard (i.e., gradient-based) attacks can alter DBP (S3.1) without extra surrogates (DiffHammer/DiffAttack). This also allows us to rely on standard gradient-based approaches as strong DBP attacks when later developing an effective strategy.
2. Imperceptible, non-norm-bounded (S5)
We then ask the broader question: How robust is DBP overall? Though robustness drops under ℓ-bounded attacks, some remains—motivating further evaluation for a thorough assessment. We enforce imperceptibility via an LPIPS threshold that is akin to an ℓ-budget but also permits low-frequency perturbations via our optimizable filters. Under MV, StAdv (the standard non-norm baseline) fails at high resolutions (S5/App.J); our LF attack is the first to reliably break this sound DBP setup, showing the key factor is a stronger, DBP-tailored attack—not relaxed constraints.
Summary. We didn’t shift criteria: we validated our theory under standard ℓ-bounds, then assessed general robustness in an established imperceptible, non-norm setting—both under MV. LF shows DBP fails against stealthy attacks, the core practical concern—despite partial robustness to ℓ-bounded attacks, which we acknowledge.
Chen et al. Robust classification via a single diffusion model. ICML, 2024
Q3. Perceptual metrics other than LPIPS for LF
LPIPS is solely a perceptual constraint while LF’s success comes from its structured perturbations (see S5). We chose LPIPS since it’s the standard metric, with user validation and published thresholds enabling a clear, auditable similarity bound (see S5.1). LF isn’t tied to LPIPS; other metrics could work if they have validated thresholds to enforce similarity. Our aim isn’t to benchmark metrics, but to show LF breaks DBP under a proper perceptual constraint. App.J’s qualitative examples and the chosen LPIPS threshold confirm imperceptibility.
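To make the role of the constraint concrete, here is a minimal sketch (assumed names; a simple fixed Fourier low-pass mask stands in for our learned filter networks, and a bare classifier stands in for the full DBP pipeline) of low-frequency optimization under a soft LPIPS penalty:

```python
import torch
import torch.nn.functional as F
import lpips  # perceptual metric; pip install lpips

def low_pass_mask(h, w, cutoff=0.1, device="cpu"):
    # Keep only spatial frequencies below a fraction of the Nyquist radius.
    fy = torch.fft.fftfreq(h, device=device).view(-1, 1)
    fx = torch.fft.fftfreq(w, device=device).view(1, -1)
    return ((fx**2 + fy**2).sqrt() <= cutoff).float()

def lf_attack(x, y, classifier, steps=200, lr=1e-2, tau=0.1, lam=50.0):
    """x in [0,1], shape (B,3,H,W); maximizes classification loss while keeping
    LPIPS(x_adv, x) near the threshold tau via a soft penalty."""
    device = x.device
    perc = lpips.LPIPS(net="alex").to(device)
    mask = low_pass_mask(x.shape[-2], x.shape[-1], device=device)
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        d_lf = torch.fft.ifft2(torch.fft.fft2(delta) * mask).real  # low-freq part
        x_adv = (x + d_lf).clamp(0, 1)
        # LPIPS expects inputs in [-1, 1]; penalize only the excess over tau.
        loss = -F.cross_entropy(classifier(x_adv), y) \
               + lam * F.relu(perc(2 * x_adv - 1, 2 * x - 1).mean() - tau)
        opt.zero_grad()
        loss.backward()
        opt.step()
    d_lf = torch.fft.ifft2(torch.fft.fft2(delta) * mask).real
    return (x + d_lf).clamp(0, 1).detach()
```

In the actual attack, the perturbation is produced by optimizable spectral filter networks and gradients flow through the full DBP pipeline (via DiffGrad) rather than the bare classifier; the sketch only illustrates how the perceptual bound enters the objective.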
Q4. Does the principle of implicitly optimizing the pre-processing defense generalize beyond DBP?
Thanks for the great question. It depends on the pre-processor. Our findings apply when the defense has input-dependent, differentiable parameters—e.g., ML-based purifiers like the score model in DBP or other learned purifiers. They don’t apply similarly to fixed, non-adaptive transforms.
For instance, in DBP, if the exact score were available (as in the idealized analysis by Xiao et al. [48] brought in S3.1), the purification function would be fixed (predetermined). Here, the attacker cannot “optimize the defense,” and the robustness argument from S3.1 would hold even against adaptive attacks, leaving DBP highly robust. In practice, a perfect score oracle is unavailable and replaced by an ML model, whose parameters can be exploited—causing deviation from the ideal behavior (see S3.1)—precisely what we prove in S3.1 and verify empirically: with accurate gradients, DBP can be driven to generate adversarial—not natural—distributions. This vulnerability extends to all learned purifiers; the significance of our work lies in showing that it holds even for stochastic, iterative processes, forcing them to output adversarial distributions. This is more significant than a similar finding for deterministic, parameterized defenses that yield a single output.
A fixed pre-processor (e.g., JPEG or blur) has no input-controllable parameters. Attackers can’t change the defense’s fixed functionality h; they can only optimize the input x s.t. h(x) (where h is fixed) lands in the classifier’s vulnerable region. Conceptually, this is evasion, which yields weaker attacks (e.g., if a perfect score existed, attacks would fail against DBP—see above). Yet, common fixed transforms (unlike a perfect score oracle), such as JPEG, can’t even resist the weaker evasion attacks, as repeatedly shown in prior work.
Takeaway. Implicit optimization applies to input-adaptive (learned) pre-processors. For fixed ones, attackers optimize the transform’s output—not the transform. Our results show that even stochastic, iterative purifiers can be implicitly optimized.
Q5. Determining the MV sample count K
We chose K=10 for CIFAR-10 to match prior work’s worst-case evaluations (see S4.2). Ablations with K=1→10→128 in App.I (referenced in S4.2) show MV.Rob rises with K (DiffPure: 35.16%→39.45%→47.72%; GDMP: 8.59%→16.80%→32.81%), while Wor.Rob drops significantly under repeated resubmissions (larger K). This supports MV’s role (S3.3) in reducing variance. On a 40 GB A100 GPU, inference takes ~5.29s (K=1), ~6.54s (K=10), and ~26.53s (K=128). Batching the K samples, thus, makes K=10 nearly cost-free; K=128 adds ~21s per step. We stop at K=128 due to memory limits (largest feasible batch size)—control experiments with K=256 (split batches) showed negligible robustness gains, indicating saturation. App.I recommends K≈10 as a practical default and K=64–128 for security-critical use. For LF (S5.1), we use K=128 to evaluate our attack in the most robust MV setup.
As in [41], we restrict ablations to CIFAR-10 for feasibility. For ImageNet, we use K=8 due to similar memory limits and diminishing MV returns beyond that. Inference rises from 7s (K=1) to 25s (K=8), with batching offering ~2.2× speedup (56s→25s). Yet, larger models and data size make batching less effective than in CIFAR-10. We recommend K≈8 unless higher values yield clear MV gains. This will be added to App.I.
Q6. Computational Cost
We agree the experiments are compute-heavy—but the cost stems from DBP itself, not our attacks. Our DiffGrad matches prior work in attack-side cost and avoids memory bottlenecks.
DiffGrad uses checkpointing as in [20,26] but fixes backprop issues (S3.2, App. D) without extra cost. Its complexity matches the adjoint method [25] but is more efficient than DiffAttack [20], which adds per-step loss overhead. DiffHammer [41] also adds post-hoc computations. We further accelerate attacks by EOT batching (see S3.2). Backprop-free methods (BPDA) are faster but far less effective (see Table 1). The surrogate [24] improves speed only, but doesn't improve practicality (Q1).
LF overhead
The small LF filter network updates only once per backprop, adding negligible overhead.
MV protocol
MV is costlier than single-purification, but isn’t an optional add‑on—it’s the correct deployment for this stochastic defense (see S3.3). It reduces variance-driven errors, making it essential. Still, its cost is offset by batching and choosing K based on the MV gain-latency trade-off (see Q5).
Dear reviewer, the authors have prepared a detailed rebuttal to address your comments. Please go through the authors’ rebuttal and respond accordingly. Thanks.
Thank you for the detailed and constructive rebuttal. Your responses have clarified several of my primary concerns and have convinced me to raise my score.
- Q1: I agree that "robustness overstatements" is a more precise term. Your rebuttal successfully clarified the distinction between intentional trade-offs and implementation divergences. The empirical evidence for DiffHammer was particularly convincing and resolved my initial concern.
- Q2 & Q4-Q5: My reservation about shifting evaluation criteria has been addressed; the two-step evaluation approach is now well-justified. Furthermore, I found the clarification on the scope of your claims (adaptive vs. fixed pre-processors) and the detailed ablation study on the MV sample count (K) to be insightful. These additions strengthen the paper's contributions.
- Q3: The LF attack is a key contribution, but its evaluation relies solely on the LPIPS metric. While you argue it's not limited to LPIPS, this claim remains speculative without empirical support. As I mentioned in my discussion, supplementing this with preliminary results on 1-2 other common perceptual metrics would have made the claims about "breaking DBP" much more robust and generalizable.
- Q6: The acknowledged high computational cost remains a significant practical barrier. While this is a challenge for the field in general, it does limit the immediate applicability of the proposed methods.
Thanks for the thoughtful and constructive follow-up, and for re-evaluating the paper. We’re glad the main concerns are resolved; we’ll incorporate the clarifications we’ve already promised in our earlier response in the paper and address the two remaining points below for completeness:
On perceptual metrics (LPIPS). We’ll follow the reviewer's guidance and add a short note in S5.1, making the scope explicit and flagging generalization to other perceptual metrics as important future work, inviting follow-up evaluations of LF under alternative visual constraints. As mentioned in our response (Q3), we use LPIPS because it’s widely adopted as a perceptual constraint, and prior work provides known thresholds we can reuse for adversarial optimization. For adversarial examples, calibrated thresholds are essential; otherwise, an overly permissive constraint can label visibly altered adversarial images as “successful,” which defeats the point and inflates attack success (i.e., underestimates robustness). Both independently calibrating other non-norm-bounded metrics (e.g., via user studies) and re-running robustness evaluations are time-consuming and not feasible within the discussion window. Thus, we’ll keep claims scoped to LPIPS and explicitly note this in S5.1, encouraging future work to reproduce our experiments with alternative perceptual metrics for which rigorously calibrated thresholds may exist or become available.
On compute. Thanks for highlighting the distinction between the cost of our method and the cost of the evaluated defense—this helps clarify the source of overhead. We’ll better emphasize in App.A (Broader Impact & Limitations) that even with our efficient implementation, the dominant cost from DBP itself is inherited by attacks and robustness evaluations, and we will add a note that practicality may improve with advanced hardware or more efficient DBP schemes in the future. As promised in our responses (Q5 and Reviewer 7gyq's Q3), we’ll include the ImageNet timing breakdown and point to practical knobs in App.I, with cross-references in S4.2.
Thanks again. Your pointers throughout the discussion phase will materially improve our paper.
The present work is a study on diffusion-based purification (DBP). This study challenges the popular view that DBP is robust thanks to the denoising mechanism of diffusion models. This work identifies several reasons that make existing adversarial attack methods ineffective. Moreover, it proposes a new attack method against DBP. The reviewers agree that this paper makes an interesting contribution and is well written. The majority of the weaknesses pointed out by the reviewers have been properly addressed in the rebuttal.