PaperHub
Score: 6.8 / 10
Decision: Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

FAST: Foreground-aware Diffusion with Accelerated Sampling Trajectory for Segmentation-oriented Anomaly Synthesis

OpenReview · PDF
Submitted: 2025-04-07 · Updated: 2025-10-29
TL;DR

Segmentation-oriented Industrial Anomaly Synthesis

Abstract

Keywords
Anomaly synthesis, anomaly segmentation

Reviews and Discussion

Official Review
Rating: 4

This paper proposes FAST, a diffusion-based framework for generating synthetic industrial anomaly data for segmentation tasks. The framework consists of two main components: first, AIAS, a training-free sampling strategy that reduces the denoising steps from the DDPM standard of 1000 to 10; and second, FARM, a module that reconstructs anomaly-specific content from noisy inputs and injects anomaly-aware noise at each sampling step to preserve anomaly signals throughout the denoising process. The authors evaluate their method by generating 500 synthetic samples per category on two datasets and training multiple segmentation models with the synthetic data, achieving performance improvements.

Strengths and Weaknesses

Strengths

  1. AIAS provides well-formulated theorems that justify the multi-step backward approximation. The supplementary material clearly explains the proof and derivation for this procedure.
  2. Computational efficiency: sampling is reduced to around 10 steps during the generation phase.
  3. Systematic comparison and justification through various ablation studies.
  4. FARM's foreground-aware processing addresses the uniform spatial processing limitation of existing diffusion models.

Weaknesses

  1. It is questionable whether the paper's characteristic feature, reducing sampling steps to around 10, is truly a significant advantage for synthetic data generation. In industrial anomaly detection, fast inference matters only at the anomaly segmentation stage; for synthetic data generation, wouldn't quality be more important? Wouldn't it be better to keep only the foreground-aware processing and use more steps for sampling?
  2. While the fixed $\hat{x}_0$ assumption is somewhat acknowledged, doesn't $\hat{x}_0$ actually change quite significantly over 10 steps? This raises concerns, since Lemma 2 relies on this assumption.
  3. In the experimental section, it is mentioned that 500 synthetic samples were generated, with 1/3 used for training and the rest for evaluation, but a clearer explanation is needed of the methods used, including the masks.
  4. For the core multi-step aggregation formula of AIAS, $x_{t_e} = \Pi x_{t_s} + \Sigma \hat{x}_0 + \epsilon_{t_e}$, there is insufficient theoretical analysis of how approximation errors accumulate with respect to segment length $(t_s - t_e)$. An analysis of how errors are amplified in the actual diffusion process seems necessary.

Questions

Please refer to the Weaknesses section.

Limitations

The core approximation of AIAS relies on assumptions that may not hold in practice, particularly with complex noise schedules or highly variable $\hat{x}_0$ predictions. The theoretical analysis of approximation error bounds and convergence properties is limited, and the error accumulation pattern as a function of segment length is not sufficiently characterized. It is also questionable whether reducing sampling steps to 10 is truly a priority in synthetic data generation. In industrial deployment, final segmentation performance may matter more than generation speed, and the potential quality degradation at extremely few steps has not been sufficiently analyzed.

Final Justification

I appreciate the authors' clear rebuttal. They have addressed some of my concerns well, but others are only partially resolved, so I keep my score, which remains a positive evaluation of the paper.

Formatting Issues

  1. In the Figure 1 caption, "foregrond-aware" should be "foreground-aware".
  2. The main paper and supplementary material should maintain independent equation numbering (e.g., Algorithm 3 and Equation 13 in the main paper).
  3. Abbreviated forms such as "Sec. 4.3" and full forms such as "Section 4.3" are used inconsistently throughout the paper.
Author Response

We sincerely thank you for your detailed comments and positive feedback, such as "clear theoretical justification" and "complete studies." Below are our responses to your questions.

Q1: Practical benefit of fast sampling in anomaly synthesis.

We thank you for this insightful comment. While synthesis quality is important, we argue that fast sampling is highly beneficial in segmentation-oriented industrial anomaly synthesis (SIAS).

  1. In real-world manufacturing, anomaly types evolve quickly, and segmentation models must be updated frequently. FAST enables engineers to generate hundreds of segmentation-aligned anomalies per class in minutes rather than hours or days after training, making rapid retraining feasible.
  2. Our proposed AIAS achieves up to 100× speedup by reducing sampling steps from 1000 to 10, while still maintaining competitive segmentation performance (see Table 3 in the paper).
  3. We observe that longer sampling may enhance visual realism but introduce overfitting to generative details at the cost of anomaly segmentation consistency. Therefore, in our context, slightly trading off fine-grained fidelity for significant speed gains is not only acceptable but necessary for practical deployment.

Q2: Detailed experimental setup.

We appreciate your suggestion and provide further explanation.

In the data synthesis phase, we follow AnomalyDiffusion’s mask generation strategy by employing two complementary methods: (1) augmenting real anomaly masks via geometric and morphological transformations, and (2) training a Latent Diffusion Model to learn the distribution of real masks and generate novel ones. All generated masks are screened to ensure plausibility before being used to synthesize anomaly–mask pairs. Detailed information can be found in the AnomalyDiffusion paper. For each anomaly type within every object class, we generate 500 anomaly–mask pairs to ensure sufficient diversity for segmentation training.

In the downstream segmentation task, we employ SegFormer, BiseNetV2 and STDC as backbone models, as they are widely adopted in industrial scenarios due to their balance of accuracy and efficiency. Besides, we uniformly split the real anomalies from MVTec and BTAD into 2/3 for validation and 1/3 for training. The training set of segmentation models combines this 1/3 subset with the synthetic samples. This design ensures that the segmentation model is evaluated solely on unseen data while benefiting from synthetic samples during training. Additional implementation details, including model settings and training parameters, are available in Supplementary Materials A.5.
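For concreteness, the split described above could be sketched roughly as follows (hypothetical filenames and a uniform random split are assumed for illustration; the actual data pipeline may differ):

```python
import random

def split_real_anomalies(samples, train_frac=1/3, seed=0):
    """Deterministically shuffle, then take 1/3 for training and 2/3 for validation."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = round(len(samples) * train_frac)
    return samples[:n_train], samples[n_train:]

# Hypothetical filenames for illustration.
real = [f"real_{i:03d}.png" for i in range(90)]
synthetic = [f"synth_{i:03d}.png" for i in range(500)]

train_real, val = split_real_anomalies(real)
# The segmentation model trains on the real 1/3 plus the synthetic samples,
# and is evaluated only on the held-out real 2/3.
train = train_real + synthetic
```

The key property this preserves is that the validation set contains only real, unseen anomalies.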


Q3: The validity of the fixed $\hat{x}_0$ assumption in Lemma 2.

We appreciate your concern and provide the following reasons for the fixed $\hat{x}_0$ assumption.

  1. Training Objective Consistency:
    The denoising diffusion model's training objective directly encourages consistency in predicting $\hat{x}_0$ across different timesteps, particularly at lower $t$. This ensures that even when reused over a large segment, a fixed $\hat{x}_0$ remains relatively accurate.

  2. AIAS Theoretical Support:
    In AIAS, the reverse process is divided into a limited number of segments. Within each segment, the trajectory is approximated with a fixed $\hat{x}_0$ to maintain the denoising distribution. At the start of each segment, the network re-calibrates $\hat{x}_0$ based on $x_t$ from the previous segment, effectively preventing error accumulation across segments. Previous work, such as diffusion-based anomaly detection methods, also demonstrates that diffusion models are robust to small perturbations in $\hat{x}_0$, especially when guided by strong cues like masks or backgrounds. Thus, even if $\hat{x}_0$ from the previous segment is slightly biased, the trajectory remains aligned with the data manifold.

  3. Denoising Distribution Dependence:
    The denoising distribution is Gaussian, with its mean given by the posterior mean:

    $\mu_{t-1 \mid t} = A_t x_0 + B_t x_t$,

    where the coefficients are defined as

    $A_t = \frac{\sqrt{\bar{\alpha}_{t-1}} \, \beta_t}{1 - \bar{\alpha}_t}$ and $B_t = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}$.

    As $t$ decreases, $\hat{x}_0$ becomes more reliable, with $A_t$ increasing and $B_t$ decreasing. Therefore, AIAS increasingly relies on the re-calibrated $\hat{x}_0$, which helps mitigate the propagation of error from $x_t$. This also motivates the design of AIAS as a coarse-to-fine aggregation process.

  4. Experimental Results:
    Our experiments show that increasing $K$ provides noticeable improvements within a certain range. However, with $K = 10$, the results are already acceptable.
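The coefficient behavior described in point 3 can be checked numerically. Below is a minimal sketch that assumes a standard linear beta schedule (the paper's actual schedule may differ) and shows $A_t$ growing and $B_t$ shrinking as $t$ decreases:

```python
import numpy as np

# Assumed linear beta schedule for illustration; the paper's schedule may differ.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior_coeffs(t):
    """Coefficients of the DDPM posterior mean mu_{t-1|t} = A_t x_0 + B_t x_t."""
    a_bar_t = alpha_bar[t]
    a_bar_prev = alpha_bar[t - 1] if t > 0 else 1.0
    A_t = np.sqrt(a_bar_prev) * betas[t] / (1.0 - a_bar_t)
    B_t = np.sqrt(alphas[t]) * (1.0 - a_bar_prev) / (1.0 - a_bar_t)
    return A_t, B_t

A_early, B_early = posterior_coeffs(900)  # high-noise timestep: weight sits on x_t
A_late, B_late = posterior_coeffs(10)     # low-noise timestep: weight shifts to x0-hat
print(A_early, B_early, A_late, B_late)
```

This matches the claim that the sampler increasingly trusts the re-calibrated $\hat{x}_0$ late in the trajectory.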

Q4: Theoretical analysis of error accumulation.

Thank you for your suggestion. The sampling approximation error in AIAS originates from two sources:
(1) local segment-wise error, introduced by keeping $\hat{x}_0$ fixed within each segment, and
(2) global cumulative error, resulting from propagation across segments.

Ideally, $\hat{x}_0^{(t)}$ should be adaptively estimated at each timestep $t$.

However, AIAS holds $\hat{x}_0^{(k)}$ fixed within the $k$-th segment, leading to a local approximation error at the endpoint $t_e$:

$\delta_{t_e} = \Delta_{t \rightarrow t_e} := \Sigma \, (\hat{x}_0 - \hat{x}_0^{(t)})$.  (1)

Assuming the full reverse process is divided into $K$ segments with partition points

$T = t_0 > t_1 > t_2 > \cdots > t_K = 0$,

each segment $(t_k \rightarrow t_{k+1})$ contributes a local error, and the global cumulative error can be described recursively. Specifically, the offset at time $t_1$ is computed as
$\delta_{t_1} = \Sigma_0 (\hat{x}_0^{(0)} - \hat{x}_0^{(t_0)})$.

Then, the state at $t_2$ follows
$x_{t_2} = \Pi_1 x_{t_1} + \Sigma_1 \hat{x}_0^{(1)}$,

which leads to the offset propagation
$\delta_{t_2} = \Pi_1 \delta_{t_1} + \Sigma_1 (\hat{x}_0^{(1)} - \hat{x}_0^{(t_1)})$.

By recursion, the total error at the final timestep $t_K$ can be expressed as

$\Delta_{\text{total}} = \delta_{t_K} = \sum_{k=0}^{K-1} \left( \prod_{l=k+1}^{K-1} \Pi_l \right) \Sigma_k \left( \hat{x}_0^{(k)} - \hat{x}_0^{(t_k)} \right)$.  (2)

We further assume that the prediction error of $\hat{x}_0$ is uniformly bounded in each segment:

  • $\|\hat{x}_0^{(k)} - \hat{x}_0^{(t_k)}\| \leq \epsilon_0$
  • $\|\Sigma_k\| \leq C_\Sigma$
  • $\|\Pi_k\| \leq \alpha < 1$

Here, $\alpha$ bounds the norm of the inter-segment state propagation matrices $\Pi_k$, and $C_\Sigma$ bounds the influence of $\hat{x}_0$ on the subsequent state within a segment. Notably, both are deterministic and computable under known noise schedules, as shown in Lemma 1.

Therefore, the total error admits the following conservative upper bound:

$\|\Delta_{\text{total}}\| \leq \sum_{k=0}^{K-1} \alpha^{K-1-k} \, C_\Sigma \epsilon_0 = C_\Sigma \epsilon_0 \sum_{m=0}^{K-1} \alpha^m = C_\Sigma \epsilon_0 \cdot \frac{1 - \alpha^K}{1 - \alpha}$.  (3)

This bound is highly conservative: in practice, $\epsilon_0$ tends to decrease as $K$ increases, since shorter segments reduce the approximation error and the overall procedure approaches standard DDPM sampling. Moreover, the assumption that all errors accumulate in the same direction is pessimistic. Due to the inherent robustness of denoising diffusion models, the error introduced in one segment is typically corrected in the next, as $\hat{x}_0$ is re-estimated from the updated $x_t$. This re-calibration helps mitigate the propagation of prior inaccuracies.

These observations further explain why AIAS remains effective even when using only $K = 10$ segments. The combined effect of theoretical error control, diffusion model robustness, and segment-wise re-estimation of $\hat{x}_0$ ensures that the synthetic samples retain acceptable fidelity.
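To make the bound in Eq. (3) concrete, here is a small numerical sketch with hypothetical scalar values for $\alpha$, $C_\Sigma$, and $\epsilon_0$ (real values depend on the noise schedule) that simulates the offset recursion and checks it against the geometric-series bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalar constants for illustration only.
alpha, C_sigma, eps0, K = 0.8, 1.0, 0.05, 10

delta = 0.0
for k in range(K):
    Pi_k = rng.uniform(-alpha, alpha)         # segment propagation factor, |Pi_k| <= alpha
    Sigma_k = rng.uniform(-C_sigma, C_sigma)  # influence of x0-hat, |Sigma_k| <= C_Sigma
    e_k = rng.uniform(-eps0, eps0)            # per-segment x0-hat error, |e_k| <= eps0
    delta = Pi_k * delta + Sigma_k * e_k      # recursion: delta_{k+1} = Pi_k delta_k + Sigma_k e_k

bound = C_sigma * eps0 * (1 - alpha**K) / (1 - alpha)  # Eq. (3)
print(abs(delta), bound)
```

Since each step is a contraction plus a bounded perturbation, the simulated $|\delta_{t_K}|$ always stays below the bound, and typical realizations fall far below it, illustrating why the bound is conservative.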

Comment

Q1: While I understand the benefits of rapid data generation post-training, I still believe inference speed is more critical. Even though your method is model-agnostic, shouldn't quality take precedence over generation speed if segmentation performance is the ultimate goal? Frankly, I'm not convinced that segmentation models are updated so frequently in real manufacturing settings, and even if they are, I cannot imagine scenarios where existing model inference would be shut down.

Q2-Q4: I believe these concerns have been addressed to some extent. Additionally, after reading other reviewers' comments, I am curious about results on standard evaluation metrics for industrial anomaly detection papers, such as AUROC and PRO. Therefore, while I acknowledge the technical contributions, I have questions about practical priorities and the lack of standard evaluation metrics.

Comment

We thank the reviewer for the follow-up comment. We provide responses below to the raised questions.

Sampling speed and anomaly quality.

We acknowledge the importance of generation quality, but we want to reiterate that FAST is specifically designed for Segmentation-Oriented Industrial Anomaly Synthesis (SIAS), which differs from traditional anomaly detection or generic image generation. In SIAS, the synthesized data is used to adapt segmentation models. Our goal is to generate structurally aligned, mask-consistent anomalies in a timely manner, rather than to maximize visual realism. That is why using too many steps leads to degraded segmentation performance. While increasing the number of steps within a certain range can marginally improve image quality, it comes at the cost of significantly longer inference time, as shown in our ablation study.

Besides, in real-world manufacturing, it is common for production lines to undergo frequent reconfiguration (known as production line changeover) to accommodate new products or materials. These changes often introduce new types of anomalies that existing segmentation models cannot handle. Delays in updating segmentation models during such transitions can lead to increased manual inspection, false positives, missed detections, and even line downtime. FAST is particularly suitable for addressing this issue. As far as we know, there are also domain adaptation studies focused on this problem, aiming to reduce the time required for segmentation model adaptation in order to improve production efficiency. FAST aligns with this motivation by enabling rapid synthesis of segmentation-oriented anomalies, thus supporting timely model updates under real-world constraints.

Different metrics across multiple methods.

In our opinion, our proposed SIAS is inherently different from traditional anomaly detection-based anomaly synthesis. That is why we initially focused exclusively on segmentation-based metrics such as mIoU and accuracy. However, after carefully considering your comments and similar concerns raised by other reviewers, we agree that including anomaly detection metrics such as AUROC, PRO, F1 and AP can offer additional insight into FAST's performance. Therefore, we provide a comparative table below on the MVTec dataset. We hope these results help further demonstrate that FAST not only excels in segmentation-oriented synthesis but is also compatible with detection-based evaluation. Due to character limits, AP and F1 results are included in the comment for Reviewer QVrM. We kindly invite you to refer to that section.

Table: AUROC comparison across various methods

| Category | CutPaste | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | AnoGen | FAST |
|---|---|---|---|---|---|---|---|---|
| bottle | 99.05 | 97.96 | 97.40 | 99.32 | 98.97 | 99.56 | 99.39 | 99.19 |
| cable | 95.29 | 95.95 | 95.07 | 95.57 | 95.36 | 95.74 | 98.07 | 97.37 |
| capsule | 98.11 | 99.56 | 96.50 | 99.26 | 99.19 | 99.16 | 99.57 | 99.64 |
| carpet | 97.14 | 99.82 | 96.42 | 99.36 | 96.34 | 98.30 | 99.23 | 99.40 |
| grid | 98.94 | 99.66 | 95.64 | 99.20 | 99.60 | 99.33 | 99.43 | 99.61 |
| hazel_nut | 99.14 | 99.45 | 95.63 | 99.32 | 98.89 | 99.76 | 99.67 | 99.82 |
| leather | 99.85 | 99.90 | 99.78 | 99.88 | 99.93 | 99.85 | 99.88 | 99.48 |
| metal_nut | 99.32 | 99.54 | 94.80 | 99.50 | 99.46 | 98.80 | 99.47 | 99.88 |
| pill | 95.38 | 96.70 | 98.29 | 99.62 | 99.19 | 99.47 | 99.69 | 99.87 |
| screw | 92.78 | 99.47 | 94.88 | 99.14 | 98.99 | 96.31 | 99.52 | 98.77 |
| tile | 98.60 | 99.73 | 99.07 | 99.74 | 99.31 | 99.53 | 99.67 | 99.77 |
| toothbrush | 88.21 | 98.53 | 85.42 | 97.95 | 96.37 | 96.23 | 98.29 | 99.80 |
| transistor | 96.40 | 92.80 | 94.41 | 97.17 | 95.59 | 98.84 | 98.97 | 99.78 |
| wood | 94.01 | 99.24 | 95.94 | 99.31 | 98.76 | 97.52 | 99.21 | 99.67 |
| zipper | 99.58 | 99.68 | 99.72 | 99.47 | 99.69 | 99.65 | 99.73 | 99.71 |
| Average | 96.79 | 98.53 | 95.93 | 98.92 | 98.38 | 98.54 | 99.32 | 99.45 |

Table: PRO comparison across various methods

| Category | CutPaste | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | AnoGen | FAST |
|---|---|---|---|---|---|---|---|---|
| bottle | 82.79 | 86.55 | 70.20 | 81.10 | 85.00 | 83.53 | 84.27 | 90.77 |
| cable | 55.04 | 69.30 | 50.76 | 65.15 | 62.79 | 68.66 | 64.20 | 76.14 |
| capsule | 44.58 | 68.17 | 33.59 | 56.40 | 66.21 | 39.59 | 57.07 | 68.73 |
| carpet | 64.62 | 73.79 | 65.94 | 65.13 | 65.99 | 63.67 | 65.77 | 74.46 |
| grid | 43.52 | 56.60 | 37.91 | 44.49 | 53.17 | 46.25 | 53.09 | 62.21 |
| hazel_nut | 79.36 | 84.43 | 64.92 | 82.91 | 78.75 | 83.48 | 79.54 | 88.29 |
| leather | 50.32 | 63.34 | 67.41 | 61.91 | 68.63 | 68.07 | 59.70 | 74.46 |
| metal_nut | 69.64 | 87.03 | 59.76 | 83.67 | 80.54 | 73.52 | 80.02 | 91.33 |
| pill | 54.40 | 72.77 | 36.60 | 69.02 | 70.11 | 60.96 | 67.43 | 81.02 |
| screw | 20.63 | 49.35 | 23.06 | 51.86 | 48.42 | 36.11 | 52.22 | 48.70 |
| tile | 82.83 | 87.80 | 83.37 | 88.59 | 82.91 | 83.14 | 85.15 | 90.67 |
| toothbrush | 29.07 | 69.12 | 30.15 | 62.25 | 60.77 | 37.67 | 58.81 | 78.56 |
| transistor | 55.03 | 68.10 | 49.77 | 70.57 | 68.92 | 72.85 | 69.87 | 81.91 |
| wood | 64.46 | 78.80 | 54.97 | 72.97 | 72.49 | 66.84 | 76.84 | 84.74 |
| zipper | 74.62 | 78.02 | 77.09 | 73.73 | 77.11 | 74.82 | 75.54 | 77.92 |
| Average | 58.06 | 72.88 | 53.70 | 68.65 | 69.45 | 63.94 | 68.63 | 77.99 |
Official Review
Rating: 4
  1. Existing methods uniformly process the entire region of defect images and are unable to generate specific defect structures for downstream segmentation. This paper proposes a defect-aware generation method based on diffusion models, namely FAST.
  2. The method consists of two modules. The first module, AIAS, is a training-free sampling algorithm for industrial defect generation. This module reduces the number of sampling steps and accelerates the generation speed through a coarse-to-fine aggregation approach. The second module, FARM, adaptively adjusts the noise in the masked region after each sampling step to preserve the defect features in the denoising process.
  3. Experiments demonstrate that the proposed method outperforms existing anomaly generation methods in terms of metrics on downstream segmentation tasks across multiple datasets.

Strengths and Weaknesses

Strengths

  1. This paper proposes an accelerated sampling algorithm AIAS. Experiments demonstrate that compared with the DDIM and PLMS sampling methods, AIAS can improve the metrics of anomaly segmentation.
  2. This paper proposes FARM, which reconstructs the foreground defect and adds noise during the sampling process, enabling the model to generate realistic local foreground defects. The effectiveness of the FARM module is verified through ablation experiments.

Weaknesses

  1. The paper does not compare with the few-shot defect generation method AnoGen [1]. We worry that the experiments may not be sufficient without it.
  2. Only mIoU and Acc are used as evaluation metrics, which may not be sufficient. We think it is necessary to also report the Inception Score (IS) of the generated images and the Intra-cluster pairwise LPIPS distance (IC-LPIPS) [2].

[1] Gui, Guan, et al. "Few-shot anomaly-driven generation for anomaly classification and segmentation." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.

[2] Ojha U, Li Y, Lu J, et al. Few-shot image generation via cross-domain correspondence[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 10743-10752.

Questions

  1. The strategy of generating 500 images for each object category seems to lack justification. Is this number of generated images sufficient?
  2. The paper requires an input mask during inference. We would therefore like to know how the input mask is obtained during the experiments.
  3. This paper proposes an accelerated sampling algorithm, so we are curious how long it takes to train a defect generation model and how long the inference process takes.
  4. At what value of t does the experiment start to perform the fine-grained standard DDPM?

Limitations

Yes

Final Justification

The authors have addressed all my concerns, so I've raised my score to "Borderline Accept."

Formatting Issues

No

Author Response

We sincerely thank you for your positive comments about our innovations. Below are our responses to your specific questions.


Q1: Comparison to AnoGen.

We thank you for highlighting AnoGen, which is an important and well-recognized contribution to few-shot anomaly synthesis. We will cite AnoGen in the final version, and we now provide a direct comparison between FAST and AnoGen. Our preliminary results show that AnoGen achieves impressive performance in most classes, while FAST also demonstrates excellent segmentation performance (e.g., higher mIoU and Acc in some classes), highlighting its effectiveness in segmentation-oriented anomaly synthesis.

Table: Comparison of mIoU and pixel-wise Accuracy (%) across 15 MVTec categories

| Method | bottle | cable | capsule | carpet | grid | hazel_nut | leather | metal_nut |
|---|---|---|---|---|---|---|---|---|
| AnoGen | 74.93 / 81.21 | 59.24 / 69.12 | 46.68 / 62.52 | 69.72 / 75.80 | 41.60 / 66.50 | 72.30 / 81.34 | 57.71 / 73.48 | 90.07 / 92.72 |
| FAST | 86.86 / 90.90 | 73.71 / 77.94 | 63.22 / 71.12 | 73.84 / 83.53 | 52.45 / 70.70 | 90.81 / 94.79 | 66.60 / 74.18 | 94.65 / 96.88 |

| Method | pill | screw | tile | toothbrush | transistor | wood | zipper | Average |
|---|---|---|---|---|---|---|---|---|
| AnoGen | 80.51 / 86.75 | 43.61 / 59.96 | 86.24 / 91.27 | 58.05 / 72.71 | 72.30 / 82.65 | 68.28 / 78.92 | 66.60 / 74.29 | 65.48 / 75.42 |
| FAST | 90.17 / 94.07 | 49.94 / 57.48 | 90.13 / 93.77 | 74.98 / 88.63 | 91.80 / 94.50 | 78.77 / 86.31 | 72.80 / 84.73 | 76.72 / 83.97 |

Q2: Evaluation Metrics: IS and IC-LPIPS.

We appreciate your suggestion to evaluate IS and IC-LPIPS, and we report the results in the following table. FAST achieves competitive scores, indicating that our method generates visually realistic and diverse samples, even though it is not explicitly optimized for generic image generation. However, segmentation-oriented industrial anomaly synthesis (SIAS) is a novel direction and different from conventional image generation or anomaly detection. Our objective is not to maximize perceptual quality, but to synthesize structurally coherent and spatially aligned anomalies that benefit downstream segmentation. We agree that IS and IC-LPIPS can still offer helpful auxiliary insights, but pixel-level metrics such as mIoU and Acc on real unseen anomalies are more indicative of a method’s value in SIAS.

Table: Comparison of IS and IC-LPIPS for anomaly synthesis across various methods

| Category | CutPaste | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | FAST |
|---|---|---|---|---|---|---|---|
| bottle | 1.38 / 0.19 | 1.65 / 0.26 | 1.52 / 0.22 | 1.95 / 0.44 | 1.75 / 0.20 | 1.47 / 0.25 | 1.78 / 0.27 |
| cable | 1.75 / 0.40 | 1.73 / 0.38 | 1.79 / 0.43 | 1.64 / 0.47 | 1.65 / 0.42 | 1.85 / 0.44 | 2.04 / 0.51 |
| capsule | 1.29 / 0.20 | 1.92 / 0.25 | 1.17 / 0.22 | 1.39 / 0.27 | 1.91 / 0.27 | 1.47 / 0.28 | 1.67 / 0.29 |
| carpet | 0.85 / 0.26 | 1.38 / 0.36 | 1.16 / 0.30 | 1.26 / 0.32 | 1.08 / 0.32 | 1.02 / 0.33 | 1.47 / 0.37 |
| grid | 2.15 / 0.42 | 2.14 / 0.44 | 2.10 / 0.45 | 1.63 / 0.48 | 2.51 / 0.38 | 2.70 / 0.50 | 2.59 / 0.52 |
| hazel_nut | 1.07 / 0.32 | 2.00 / 0.43 | 2.13 / 0.33 | 2.01 / 0.45 | 2.46 / 0.42 | 2.10 / 0.36 | 2.48 / 0.42 |
| leather | 1.10 / 0.31 | 1.38 / 0.46 | 1.12 / 0.41 | 1.39 / 0.49 | 1.48 / 0.44 | 1.59 / 0.47 | 1.65 / 0.48 |
| metal_nut | 1.88 / 0.40 | 2.09 / 0.36 | 1.90 / 0.38 | 2.03 / 0.38 | 2.07 / 0.44 | 2.19 / 0.42 | 2.34 / 0.48 |
| pill | 1.87 / 0.28 | 1.35 / 0.38 | 1.97 / 0.29 | 1.87 / 0.36 | 1.52 / 0.34 | 1.45 / 0.32 | 1.91 / 0.38 |
| screw | 0.95 / 0.42 | 1.06 / 0.40 | 1.17 / 0.42 | 1.23 / 0.31 | 1.20 / 0.46 | 1.26 / 0.49 | 1.22 / 0.51 |
| tile | 1.29 / 0.47 | 1.38 / 0.51 | 1.98 / 0.49 | 1.69 / 0.50 | 1.38 / 0.49 | 1.91 / 0.54 | 2.25 / 0.56 |
| toothbrush | 1.39 / 0.25 | 1.37 / 0.27 | 1.36 / 0.33 | 1.32 / 0.27 | 1.40 / 0.32 | 1.42 / 0.29 | 1.43 / 0.35 |
| transistor | 1.66 / 0.37 | 1.78 / 0.41 | 1.77 / 0.36 | 1.48 / 0.39 | 1.48 / 0.39 | 1.66 / 0.40 | 1.84 / 0.46 |
| wood | 1.61 / 0.34 | 1.48 / 0.42 | 1.20 / 0.44 | 1.75 / 0.48 | 1.44 / 0.38 | 1.83 / 0.39 | 2.34 / 0.46 |
| zipper | 1.44 / 0.25 | 1.39 / 0.32 | 1.28 / 0.26 | 1.47 / 0.26 | 1.55 / 0.22 | 1.60 / 0.28 | 1.72 / 0.34 |
| Average | 1.45 / 0.33 | 1.61 / 0.38 | 1.57 / 0.36 | 1.61 / 0.39 | 1.56 / 0.37 | 1.70 / 0.38 | 1.91 / 0.43 |

Q3: On the 500-images-per-class setting.

We sincerely apologize for not clearly describing our data composition. The downstream segmentation models are trained on a combination of 1/3 real anomaly data and synthetic samples, while the remaining 2/3 real anomaly data are used for validation. Importantly, we generate 500 samples for each anomaly type within each class, rather than just 500 per class. Thus, the actual number of synthesized samples is much larger than 500 per class when multiple anomaly types exist. This design ensures sufficient structural diversity while maintaining training efficiency. We also conducted internal ablations showing that 500 per anomaly type strikes a good trade-off between segmentation performance and computational cost. We will clarify this design and rationale in the revised paper.


Q4: Mask generation clarification.

We apologize for not making this clearer in the paper. Our mask synthesis strategy adopts the protocol of AnomalyDiffusion, which consists of two complementary components:

  1. Geometric augmentation of real anomaly masks through operations such as rotation and flipping.
  2. Synthesis of new masks using a Latent Diffusion Model (LDM) trained on these real examples.

All synthesized masks are manually screened to ensure visual realism, diversity, and consistency with industrial abnormal structures. We refer readers to AnomalyDiffusion for further reference.
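Component (1) above can be sketched as follows. This is a simplified illustration using numpy (the actual augmentation set in AnomalyDiffusion also includes morphological transformations, which are omitted here):

```python
import numpy as np

def augment_mask(mask, seed=0):
    """Random 90-degree rotation plus optional horizontal/vertical flips
    of a binary anomaly mask (geometric augmentation only)."""
    rng = np.random.default_rng(seed)
    out = np.rot90(mask, k=int(rng.integers(0, 4)))
    if rng.integers(0, 2):
        out = np.fliplr(out)
    if rng.integers(0, 2):
        out = np.flipud(out)
    return out.copy()

mask = np.zeros((8, 8), dtype=np.uint8)
mask[1:3, 1:4] = 1              # toy 2x3 anomaly region
aug = augment_mask(mask)
print(aug.sum())                # anomaly area is preserved by rotations and flips
```

Rotations and flips preserve the anomaly area and shape, which keeps the mask plausible while diversifying its position and orientation.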


Q5: Training and inference time & value of t for fine-grained sampling.

As reported in Supplementary A.5, FAST is trained for approximately 80k iterations on a single A100 GPU. However, thanks to our training-free AIAS module, inference is significantly accelerated. To illustrate this, we report the inference time under different numbers of AIAS segments (batch size 8, image size 256 × 256) in the following table:

Table: Inference time for different sampling step settings (MVTec, batch size = 8)

| Step Count | 2 | 5 | 10 | 30 | 50 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|---|
| Time (s) | 1.39 | 2.92 | 3.87 | 10.64 | 18.16 | 36.73 | 74.21 | 183.34 | 367.97 |

As noted in our supplementary material, coarse sampling is applied until t = 2, after which we perform standard posterior updates to refine fine-scale textures. This choice of t is empirically determined through extensive analysis, balancing segmentation performance and inference efficiency. Under this setting, FAST maintains high segmentation quality while significantly reducing computational cost.
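The sampling schedule described here (coarse AIAS segments, then standard DDPM updates below t = 2) can be sketched as follows; the even spacing of segment boundaries is our assumption for illustration and may differ from the actual partition:

```python
def aias_schedule(T=1000, K=10, fine_until=2):
    """Build K coarse segment boundaries T = t_0 > ... > t_K = fine_until,
    followed by standard per-step DDPM updates for t < fine_until."""
    coarse = [T - round(k * (T - fine_until) / K) for k in range(K + 1)]
    fine = list(range(fine_until - 1, -1, -1))  # remaining standard steps: t = 1, 0
    return coarse, fine

coarse, fine = aias_schedule()
print(coarse, fine)
```

Each adjacent pair in `coarse` defines one aggregated segment, so the full trajectory costs K coarse updates plus two fine ones instead of 1000 steps.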

Comment

Thank you for your previous response; it addresses my concerns. However, after considering opinions from other reviewers, I still believe that mIoU and Acc are insufficient for evaluating downstream segmentation models. AUROC and PRO are also crucial, as they better capture the model’s tendency to miss anomalous regions. In practice, beyond accurate segmentation, preventing missed detections is equally important. Therefore, I kindly ask you to report the AUROC and PRO metrics for the detection experiments. Additionally, AP and F1 also reflect the model’s precise segmentation quality, and these metrics were employed in AnomalyDiffusion.

Comment

We thank the reviewer for the valuable suggestion to include AUROC and PRO as additional evaluation metrics. We provide a consolidated response below to the raised questions.

Request for other metrics.

The reason we initially adopted only segmentation-based metrics is that we consider Segmentation-Oriented Industrial Anomaly Synthesis (SIAS) to be a new research direction, distinct from conventional anomaly detection-based methods. That said, we now acknowledge that detection metrics such as AUROC, PRO, F1 and AP can still serve as useful auxiliary references. Therefore, to address your concern and those of other reviewers, we report these results across multiple baselines on the MVTec dataset, summarized in the table below. We hope these additional evaluations offer a more comprehensive understanding of FAST and address your concerns. We will update the table with complete results in the camera-ready version.

Due to character limits, PRO and AUROC results are provided in the official comments for Reviewers Wznz and CbJy. We kindly invite you to refer to those comments.

Table: F1-score comparison across various methods

| Category | CutPaste | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | AnoGen | FAST |
|---|---|---|---|---|---|---|---|---|
| bottle | 78.56 | 83.59 | 65.36 | 78.10 | 82.12 | 81.66 | 80.98 | 88.77 |
| cable | 42.08 | 60.05 | 39.08 | 53.97 | 54.02 | 62.75 | 57.69 | 69.67 |
| capsule | 33.04 | 58.21 | 24.28 | 43.32 | 54.93 | 25.36 | 50.36 | 58.29 |
| carpet | 58.55 | 73.47 | 61.51 | 55.83 | 61.01 | 56.94 | 58.04 | 69.82 |
| grid | 29.60 | 37.27 | 18.67 | 18.90 | 30.89 | 31.63 | 42.99 | 48.13 |
| hazel_nut | 73.84 | 82.07 | 55.25 | 79.31 | 72.50 | 79.81 | 74.84 | 86.95 |
| leather | 36.79 | 51.78 | 57.11 | 50.40 | 56.96 | 55.11 | 44.64 | 66.67 |
| metal_nut | 57.91 | 85.31 | 53.24 | 78.46 | 75.39 | 67.02 | 73.31 | 90.46 |
| pill | 42.99 | 67.05 | 22.91 | 60.39 | 60.54 | 52.58 | 62.00 | 76.60 |
| screw | 15.66 | 41.62 | 18.21 | 43.27 | 40.37 | 28.76 | 47.42 | 40.38 |
| tile | 77.73 | 84.46 | 78.09 | 85.58 | 76.37 | 76.24 | 81.05 | 89.35 |
| toothbrush | 23.94 | 60.48 | 20.95 | 47.94 | 49.35 | 33.76 | 54.44 | 76.74 |
| transistor | 42.97 | 59.39 | 18.29 | 55.97 | 61.32 | 68.60 | 63.41 | 78.07 |
| wood | 59.02 | 73.16 | 46.06 | 64.42 | 65.24 | 61.79 | 70.84 | 82.80 |
| zipper | 66.88 | 72.17 | 70.51 | 66.40 | 71.53 | 68.55 | 69.37 | 72.52 |
| Average | 49.30 | 66.01 | 43.30 | 58.84 | 60.84 | 56.70 | 62.02 | 73.01 |

Table: Average Precision (AP) comparison across various methods

| Category | CutPaste | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | AnoGen | FAST |
|---|---|---|---|---|---|---|---|---|
| bottle | 92.72 | 94.71 | 90.69 | 92.63 | 94.80 | 93.32 | 93.06 | 97.35 |
| cable | 74.09 | 79.94 | 78.01 | 80.27 | 75.98 | 76.15 | 78.57 | 87.97 |
| capsule | 64.55 | 83.16 | 70.58 | 76.68 | 80.46 | 67.95 | 73.96 | 85.42 |
| carpet | 75.72 | 90.52 | 82.11 | 85.07 | 77.90 | 78.20 | 81.71 | 89.59 |
| grid | 56.69 | 74.88 | 65.93 | 71.48 | 73.87 | 61.12 | 65.01 | 78.41 |
| hazel_nut | 91.93 | 94.21 | 86.26 | 93.91 | 92.36 | 93.74 | 90.88 | 96.89 |
| leather | 84.05 | 90.80 | 87.92 | 85.78 | 90.03 | 85.66 | 87.07 | 91.08 |
| metal_nut | 88.55 | 95.21 | 81.01 | 95.86 | 91.88 | 84.73 | 90.44 | 97.66 |
| pill | 69.69 | 86.85 | 77.72 | 87.41 | 85.21 | 77.89 | 80.92 | 82.25 |
| screw | 43.08 | 68.24 | 61.34 | 73.58 | 69.08 | 62.03 | 64.87 | 68.62 |
| tile | 92.94 | 96.78 | 94.37 | 96.69 | 93.55 | 93.43 | 94.66 | 97.52 |
| toothbrush | 36.97 | 80.52 | 43.77 | 77.66 | 67.16 | 53.65 | 64.44 | 88.80 |
| transistor | 73.01 | 77.54 | 73.25 | 82.68 | 78.10 | 82.04 | 85.91 | 94.46 |
| wood | 73.64 | 91.12 | 77.78 | 89.61 | 87.16 | 77.53 | 86.52 | 95.27 |
| zipper | 90.10 | 92.48 | 92.95 | 89.54 | 90.93 | 90.96 | 91.63 | 92.97 |
| Average | 73.85 | 86.48 | 77.58 | 85.24 | 83.23 | 78.55 | 82.01 | 89.62 |
Comment

We sincerely thank the reviewer for the careful observation regarding the AP scores. The discrepancy from the numbers reported in [3] indeed arises from our evaluation protocol. Our work focuses on Segmentation-Oriented Industrial Anomaly Synthesis (SIAS), where the primary goal is to assess how well the synthesized anomalies improve downstream segmentation performance. To ensure fairness across all compared methods, we did not use the full detection pipelines or pretrained detectors from the original papers (including DRAEM, DFMGAN, and RealNet). Instead, we unified the evaluation by:

  1. Generating anomalies using only the synthesis component of each method, rather than their complete pipelines with detection branches.

  2. Feeding the generated data into the same segmentation backbone (e.g., SegFormer), trained from scratch.

  3. Extracting the anomaly map from the SegFormer logits by taking the channel corresponding to the anomaly; this anomaly map is then used to compute AP and the other evaluation metrics.

We did not adopt the official pretrained weights because doing so would couple the synthesis evaluation with model-specific downstream architectures and training procedures, and thus introduce bias unrelated to synthesis quality. This unified backbone ensures that the AP values reflect only the quality of the synthesized anomalies, rather than differences in downstream detection models. As a result, the AP values may differ from those in [3], but they remain directly comparable within our controlled setting. We will clarify this evaluation protocol in the final version to avoid confusion.
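Step 3 can be sketched as below, with a numpy-only rank-based AUROC for illustration (the actual evaluation presumably uses a standard library implementation; the anomaly channel index is assumed to be 1):

```python
import numpy as np

def anomaly_map_from_logits(logits):
    """Softmax over the class axis of (C, H, W) logits, then take the
    anomaly channel (assumed to be channel 1)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    return probs[1]

def pixel_auroc(scores, labels):
    """Pixel-level AUROC via the Mann-Whitney rank statistic (tie averaging
    omitted for brevity)."""
    s, y = scores.ravel(), labels.ravel().astype(bool)
    ranks = np.argsort(np.argsort(s)) + 1.0
    n_pos, n_neg = y.sum(), (~y).sum()
    return (ranks[y].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy check: anomaly pixels receive higher anomaly logits than background.
logits = np.zeros((2, 4, 4))
labels = np.zeros((4, 4))
logits[1, 1:3, 1:3] = 3.0
labels[1:3, 1:3] = 1
amap = anomaly_map_from_logits(logits)
print(pixel_auroc(amap, labels))
```

Because the anomaly map comes from the same from-scratch backbone for every synthesis method, differences in AUROC/AP reflect only the synthesized data, which is the point of the unified protocol.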

Comment

Thank you for the clarification. The experimental setup seems quite reasonable. I hope the authors can provide a more detailed description of the experimental settings, or consider open-sourcing the training code of the paper (including DRAEM, DFMGAN, and RealNet) in the future, which would be a valuable contribution to the community. Taking all feedback into account, I’ve raised my score from 3 to 4.

Comment

We sincerely thank you for the positive feedback and for raising the score. We will include a more detailed description of the experimental settings in the camera-ready version. The code for FAST has already been open-sourced, and we plan to release additional resources (e.g., trained weights) in the future to further benefit the community.

Comment

Thank you to the authors for the detailed response to my concerns. I will consider increasing the score. However, before doing so, I noticed that the AP metrics provided by the authors for the DRAEM method differ significantly from the AP metrics reported in other papers [3]. Was the DRAEM used in the experiments based on the official weights? Additionally, the AP of RealNet also seems to differ somewhat.

[3] Dual-Interrelated Diffusion Model for Few-Shot Anomaly Image Generation.

Review
4

The paper presents a new method to synthesize anomalies in industrial settings. The paper proposes an anomaly-informed accelerated sampling module and a foreground-aware reconstruction module. The experiments are performed on MVTec and BTAD datasets, obtaining state-of-the-art results.

Strengths and Weaknesses

Strengths:

  • The quality of the synthesized anomalies is high.
  • The quantitative results on both datasets are state-of-the-art.

Weaknesses:

  • The paper is not self-contained. A brief introduction of SegFormer, BiSeNetV2 and STDC should be added, at least in the supplementary file.
  • The evaluation metrics used, mean Intersection over Union and pixel-wise accuracy, are not very common in anomaly detection. I understand that in terms of image-level AUROC and pixel-level AUROC the performance is saturated, but most methods report these metrics (including DRAEM, GLASS etc.).
  • The method section is full of formulas, however the underlying intuition behind the method is not present. Therefore, the method is hard to follow.
  • Some datasets, such as VisA and MPDD, are not used.
  • How are the masks generated in the first place?

Overall: The paper presents a novel, high-quality anomaly synthesis model; however, major details are missing (see weaknesses) and the method section is not clear, lacking the underlying intuition behind the proposed method.

Questions

  • The evaluation metrics used, mean Intersection over Union and pixel-wise accuracy, are not very common in anomaly detection. I understand that in terms of image-level AUROC and pixel-level AUROC the performance is saturated, but most methods report these metrics (including DRAEM, GLASS etc.).
  • The method section is full of formulas, however the underlying intuition behind the method is not present. Therefore, the method is hard to follow.
  • Some datasets, such as VisA and MPDD, are not used.
  • How are the masks generated in the first place?

Limitations

I could not find a limitation section where the authors describe the limitations of their method. I would have liked to find one paragraph describing the limitations alongside the conclusion section.

Final Justification

The authors answered all my questions, including additional experiments showing that their method is also useful for industrial anomaly detection.

Formatting Issues

NA

Author Response

We sincerely thank you for the thoughtful and constructive feedback. We address the concerns below:


Q1: Lack of description for segmentation backbones, method intuition and limitations.

Thank you for the suggestion. We will add brief descriptions of SegFormer, BiSeNetV2, and STDC in the supplementary material for completeness. These three models are widely adopted in industrial scenarios due to their favorable trade-off between performance and inference speed. Specifically, SegFormer is a transformer-based model known for its lightweight design and robust accuracy; BiSeNetV2 employs a dual-branch structure for real-time segmentation; and STDC leverages short-term dense concatenation for fast and effective feature extraction. Evaluating performance across these architectures can further validate FAST's practical value in boosting segmentation performance under real-time constraints.

For the method intuition, we will revise Section 3 to better explain the intuition behind AIAS. Traditional diffusion models rely on dense, step-by-step denoising with a predicted noise or clean image at each timestep. In contrast, AIAS accelerates this process by dividing the trajectory into a small number of coarse-to-fine segments. Within each segment, we fix the clean-image estimate \hat{x}_0 instead of re-predicting it at every step, and analytically aggregate multiple DDPM transitions into a single closed-form update. By progressing from coarse to fine segments, AIAS compensates for the accumulated error that temporal sparsity would otherwise introduce, enabling fast yet semantically consistent synthesis.
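The segmented update described above can be sketched as follows. This is an illustrative numpy sketch of the idea under standard DDPM notation, not the authors' implementation; the exact closed-form coefficients are derived in the paper, and all names here are placeholders.

```python
import numpy as np

def segmented_sample(predict_eps, x_T, alpha_bar, boundaries, rng):
    """Coarse-to-fine segmented sampling.

    `boundaries` lists segment endpoints, e.g. [999, 899, ..., 0], so
    len(boundaries) - 1 network calls replace the full DDPM trajectory.
    `alpha_bar[t]` is the cumulative noise schedule \bar{alpha}_t.
    """
    x = x_T
    for t_hi, t_lo in zip(boundaries[:-1], boundaries[1:]):
        eps = predict_eps(x, t_hi)                    # one network call per segment
        a_hi, a_lo = alpha_bar[t_hi], alpha_bar[t_lo]
        # clean-image estimate, held fixed across the whole segment
        x0_hat = (x - np.sqrt(1.0 - a_hi) * eps) / np.sqrt(a_hi)
        # aggregated transition: move x0_hat to the finer timestep in one step
        noise = rng.standard_normal(x.shape) if t_lo > 0 else 0.0
        x = np.sqrt(a_lo) * x0_hat + np.sqrt(1.0 - a_lo) * noise
    return x
```

With, say, 10 boundaries the loop makes 10 network calls instead of 1000, which is where the reported speedup comes from.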

As for limitations, FAST currently relies on binary anomaly masks as input, which may limit applicability in scenarios where such masks are difficult to obtain. Extending it to weakly- or self-supervised mask generation is an exciting direction.


Q2: Uses other metrics instead of common ones in anomaly detection.

We appreciate your concern. As stated in Sec. 1 and Table 1, our work targets a fundamentally different task from anomaly detection. While anomaly detection focuses on image- or region-level classification, which is often evaluated using AUROC, our work, segmentation-oriented industrial anomaly synthesis (SIAS), is a novel and emerging task that focuses on generating fine-grained, plausible anomalies to enhance downstream pixel-level segmentation performance.

Therefore, our evaluation strictly follows segmentation-centric benchmarks. AUROC is unsuitable for evaluating spatial alignment and structural fidelity, and the most appropriate metrics are mIoU and Acc, which are standard choices in segmentation tasks. However, we agree that evaluating generation quality is also reasonable, whether for detection- or segmentation-oriented anomaly synthesis. We additionally provide generation quality assessments, namely Inception Score (IS) and Improved Conditional LPIPS (IC-LPIPS), as follows.

Table: Comparison of IS and IC-LPIPS for anomaly synthesis across various methods

| Category | CutPaste (IS / IC-LPIPS) | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | FAST |
|---|---|---|---|---|---|---|---|
| bottle | 1.38 / 0.19 | 1.65 / 0.26 | 1.52 / 0.22 | 1.95 / 0.44 | 1.75 / 0.20 | 1.47 / 0.25 | 1.78 / 0.27 |
| cable | 1.75 / 0.40 | 1.73 / 0.38 | 1.79 / 0.43 | 1.64 / 0.47 | 1.65 / 0.42 | 1.85 / 0.44 | 2.04 / 0.51 |
| capsule | 1.29 / 0.20 | 1.92 / 0.25 | 1.17 / 0.22 | 1.39 / 0.27 | 1.91 / 0.27 | 1.47 / 0.28 | 1.67 / 0.29 |
| carpet | 0.85 / 0.26 | 1.38 / 0.36 | 1.16 / 0.30 | 1.26 / 0.32 | 1.08 / 0.32 | 1.02 / 0.33 | 1.47 / 0.37 |
| grid | 2.15 / 0.42 | 2.14 / 0.44 | 2.10 / 0.45 | 1.63 / 0.48 | 2.51 / 0.38 | 2.70 / 0.50 | 2.59 / 0.52 |
| hazel_nut | 1.07 / 0.32 | 2.00 / 0.43 | 2.13 / 0.33 | 2.01 / 0.45 | 2.46 / 0.42 | 2.10 / 0.36 | 2.48 / 0.42 |
| leather | 1.10 / 0.31 | 1.38 / 0.46 | 1.12 / 0.41 | 1.39 / 0.49 | 1.48 / 0.44 | 1.59 / 0.47 | 1.65 / 0.48 |
| metal_nut | 1.88 / 0.40 | 2.09 / 0.36 | 1.90 / 0.38 | 2.03 / 0.38 | 2.07 / 0.44 | 2.19 / 0.42 | 2.34 / 0.48 |
| pill | 1.87 / 0.28 | 1.35 / 0.38 | 1.97 / 0.29 | 1.87 / 0.36 | 1.52 / 0.34 | 1.45 / 0.32 | 1.91 / 0.38 |
| screw | 0.95 / 0.42 | 1.06 / 0.40 | 1.17 / 0.42 | 1.23 / 0.31 | 1.20 / 0.46 | 1.26 / 0.49 | 1.22 / 0.51 |
| tile | 1.29 / 0.47 | 1.38 / 0.51 | 1.98 / 0.49 | 1.69 / 0.50 | 1.38 / 0.49 | 1.91 / 0.54 | 2.25 / 0.56 |
| toothbrush | 1.39 / 0.25 | 1.37 / 0.27 | 1.36 / 0.33 | 1.32 / 0.27 | 1.40 / 0.32 | 1.42 / 0.29 | 1.43 / 0.35 |
| transistor | 1.66 / 0.37 | 1.78 / 0.41 | 1.77 / 0.36 | 1.48 / 0.39 | 1.48 / 0.39 | 1.66 / 0.40 | 1.84 / 0.46 |
| wood | 1.61 / 0.34 | 1.48 / 0.42 | 1.20 / 0.44 | 1.75 / 0.48 | 1.44 / 0.38 | 1.83 / 0.39 | 2.34 / 0.46 |
| zipper | 1.44 / 0.25 | 1.39 / 0.32 | 1.28 / 0.26 | 1.47 / 0.26 | 1.55 / 0.22 | 1.60 / 0.28 | 1.72 / 0.34 |
| Average | 1.45 / 0.33 | 1.61 / 0.38 | 1.57 / 0.36 | 1.61 / 0.39 | 1.56 / 0.37 | 1.70 / 0.38 | 1.91 / 0.43 |
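As background, the IS values reported above can be computed from per-image classifier probabilities as follows. This is a minimal self-contained sketch; in practice the probabilities come from a pretrained Inception-v3 network, and IC-LPIPS additionally requires a learned LPIPS network, which is not shown.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) class probabilities for N generated images.
    IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ); higher means sharper,
    more diverse class predictions."""
    p_y = probs.mean(axis=0, keepdims=True)          # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

For example, uniform predictions give IS = 1 (no information), while confident, evenly spread one-hot predictions over C classes give IS = C.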

Q3: Mask generation

We apologize for not clarifying this. Our mask generation strategy follows the setup of AnomalyDiffusion, and employs two approaches:

  1. Augmenting real anomaly masks via transformations (e.g., rotation, flipping).
  2. Generating novel masks using a Latent Diffusion Model (LDM) trained on a small set of real anomaly masks.

All generated masks are visually screened to ensure structural plausibility and diversity. We will provide further implementation details in the supplementary material and refer readers to the AnomalyDiffusion paper for reproducibility.
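A minimal sketch of approach (1) is shown below; the transformation set is illustrative, and the actual augmentation follows AnomalyDiffusion. The LDM-based generator of approach (2) is not shown.

```python
import numpy as np

def augment_mask(mask, rng):
    """Augment a binary anomaly mask with a random 90-degree rotation
    and random horizontal/vertical flips; all of these transforms
    preserve the anomaly area."""
    mask = np.rot90(mask, k=int(rng.integers(4)))
    if rng.integers(2):
        mask = np.flipud(mask)
    if rng.integers(2):
        mask = np.fliplr(mask)
    return np.ascontiguousarray(mask)
```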


Q4: Missing studies on other datasets

We thank you for this observation. In response, we have conducted additional experiments on the VisA dataset and included the results below. These results reinforce our claim that FAST is a downstream-backbone-agnostic and dataset-agnostic framework for segmentation-oriented industrial anomaly synthesis.

Table: Pixel-level segmentation results (mIoU / Accuracy) on VisA using SegFormer trained on synthetic anomalies

| Category | CutPaste | DRAEM | GLASS | RealNet | DFMGAN | AnomalyDiff. | FAST |
|---|---|---|---|---|---|---|---|
| candle | 27.67 / 38.2 | 16.63 / 37.84 | 32.47 / 41.46 | 30.47 / 38.09 | 32.47 / 44.65 | 30.37 / 39.04 | 39.91 / 46.66 |
| capsules | 60.31 / 74.74 | 69.05 / 80.52 | 65.05 / 68.11 | 65.38 / 74.81 | 59.29 / 64.24 | 65.90 / 72.41 | 70.08 / 78.32 |
| cashew | 86.79 / 88.52 | 87.33 / 89.40 | 86.23 / 88.16 | 85.30 / 87.92 | 86.55 / 88.66 | 86.50 / 88.32 | 88.92 / 90.98 |
| chewinggum | 68.52 / 78.29 | 69.37 / 81.15 | 70.96 / 76.67 | 68.50 / 80.91 | 67.41 / 76.32 | 67.03 / 79.36 | 71.15 / 80.21 |
| fryum | 47.17 / 49.01 | 49.15 / 65.11 | 38.65 / 39.97 | 50.80 / 51.98 | 54.08 / 55.13 | 60.07 / 65.22 | 63.65 / 66.13 |
| macaroni1 | 32.60 / 42.13 | 32.70 / 39.20 | 29.70 / 36.65 | 20.02 / 25.35 | 23.91 / 31.06 | 27.96 / 37.06 | 37.19 / 45.18 |
| macaroni2 | 22.68 / 25.46 | 22.98 / 28.21 | 24.90 / 28.66 | 17.45 / 22.42 | 13.20 / 14.66 | 20.31 / 24.50 | 30.78 / 38.68 |
| pcb1 | 13.22 / 13.47 | 13.89 / 14.10 | 12.14 / 12.42 | 5.59 / 5.65 | 7.22 / 7.31 | 23.76 / 24.41 | 27.18 / 26.94 |
| pcb2 | 18.14 / 20.04 | 23.61 / 26.60 | 20.37 / 22.85 | 19.64 / 21.61 | 18.32 / 19.80 | 23.04 / 32.59 | 26.40 / 36.37 |
| pcb3 | 35.40 / 37.61 | 39.21 / 43.14 | 33.22 / 36.02 | 32.70 / 36.43 | 56.87 / 66.84 | 38.19 / 45.93 | 46.75 / 51.46 |
| pcb4 | 55.88 / 64.59 | 57.54 / 68.18 | 55.07 / 66.44 | 54.94 / 68.06 | 57.81 / 68.17 | 57.50 / 68.93 | 60.62 / 72.91 |
| pipe_fryum | 84.54 / 88.64 | 84.10 / 86.04 | 86.75 / 88.65 | 82.59 / 86.47 | 86.78 / 89.09 | 84.08 / 88.23 | 88.72 / 91.14 |
| Average | 46.08 / 51.73 | 45.67 / 53.72 | 46.29 / 50.50 | 44.91 / 48.90 | 47.12 / 52.54 | 47.79 / 56.24 | 54.28 / 60.41 |
Comment

Thank you for the comprehensive reply. Regarding the reply to "Uses other metrics instead of common ones in anomaly detection" -- I understand that the proposed method does not tackle anomaly segmentation directly, but since anomaly segmentation is used to assess the method's improvement, it would still be useful to include more anomaly detection metrics.

I also agree with reviewer Wznz that reducing the steps during generation might not be very useful for the downstream performance.

Comment

We sincerely thank you for raising important questions regarding evaluation metrics and the practical value of accelerated sampling. We provide responses to the raised questions below.

Results on other metrics:

Our initial focus on segmentation-based metrics (e.g., mIoU and pixel-wise Accuracy) stems from our belief that Segmentation-Oriented Industrial Anomaly Synthesis (SIAS) represents a new research direction, where segmentation performance is of primary importance, rather than detection-oriented evaluation. However, we now understand that anomaly detection metrics can provide complementary insights into the effectiveness of synthesized data. In response to this concern, we have additionally reported AUROC, PRO, F1, and AP scores on the MVTec dataset across several baselines. These results are shown in the tables below, and we hope they present a more complete picture of FAST's performance. Due to character limits, AP and F1 results are included in the comment for Reviewer QVrM; we kindly refer you to that section.
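As a brief note on the protocol, the pixel-level AUROC reported below reduces to the Mann–Whitney U statistic over per-pixel anomaly scores. The following is a minimal sketch with tie handling omitted; standard libraries such as scikit-learn's `roc_auc_score` handle ties properly.

```python
import numpy as np

def pixel_auroc(scores, labels):
    """Probability that a random anomalous pixel outscores a random
    normal pixel (Mann-Whitney U; ties ignored for brevity)."""
    s, y = np.ravel(scores), np.ravel(labels)
    ranks = np.empty(s.size)
    ranks[np.argsort(s)] = np.arange(1, s.size + 1)   # rank all pixels by score
    n_pos = y.sum()
    n_neg = y.size - n_pos
    return float((ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))
```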

Table: AUROC comparison across various methods

| Category | CutPaste | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | AnoGen | FAST |
|---|---|---|---|---|---|---|---|---|
| bottle | 99.05 | 97.96 | 97.40 | 99.32 | 98.97 | 99.56 | 99.39 | 99.19 |
| cable | 95.29 | 95.95 | 95.07 | 95.57 | 95.36 | 95.74 | 98.07 | 97.37 |
| capsule | 98.11 | 99.56 | 96.50 | 99.26 | 99.19 | 99.16 | 99.57 | 99.64 |
| carpet | 97.14 | 99.82 | 96.42 | 99.36 | 96.34 | 98.30 | 99.23 | 99.40 |
| grid | 98.94 | 99.66 | 95.64 | 99.20 | 99.60 | 99.33 | 99.43 | 99.61 |
| hazel_nut | 99.14 | 99.45 | 95.63 | 99.32 | 98.89 | 99.76 | 99.67 | 99.82 |
| leather | 99.85 | 99.90 | 99.78 | 99.88 | 99.93 | 99.85 | 99.88 | 99.48 |
| metal_nut | 99.32 | 99.54 | 94.80 | 99.50 | 99.46 | 98.80 | 99.47 | 99.88 |
| pill | 95.38 | 96.70 | 98.29 | 99.62 | 99.19 | 99.47 | 99.69 | 99.87 |
| screw | 92.78 | 99.47 | 94.88 | 99.14 | 98.99 | 96.31 | 99.52 | 98.77 |
| tile | 98.60 | 99.73 | 99.07 | 99.74 | 99.31 | 99.53 | 99.67 | 99.77 |
| toothbrush | 88.21 | 98.53 | 85.42 | 97.95 | 96.37 | 96.23 | 98.29 | 99.80 |
| transistor | 96.40 | 92.80 | 94.41 | 97.17 | 95.59 | 98.84 | 98.97 | 99.78 |
| wood | 94.01 | 99.24 | 95.94 | 99.31 | 98.76 | 97.52 | 99.21 | 99.67 |
| zipper | 99.58 | 99.68 | 99.72 | 99.47 | 99.69 | 99.65 | 99.73 | 99.71 |
| Average | 96.79 | 98.53 | 95.93 | 98.92 | 98.38 | 98.54 | 99.32 | 99.45 |

Table: PRO comparison across various methods

| Category | CutPaste | DRAEM | GLASS | DFMGAN | RealNet | AnomalyDiffusion | AnoGen | FAST |
|---|---|---|---|---|---|---|---|---|
| bottle | 82.79 | 86.55 | 70.20 | 81.10 | 85.00 | 83.53 | 84.27 | 90.77 |
| cable | 55.04 | 69.30 | 50.76 | 65.15 | 62.79 | 68.66 | 64.20 | 76.14 |
| capsule | 44.58 | 68.17 | 33.59 | 56.40 | 66.21 | 39.59 | 57.07 | 68.73 |
| carpet | 64.62 | 73.79 | 65.94 | 65.13 | 65.99 | 63.67 | 65.77 | 74.46 |
| grid | 43.52 | 56.60 | 37.91 | 44.49 | 53.17 | 46.25 | 53.09 | 62.21 |
| hazel_nut | 79.36 | 84.43 | 64.92 | 82.91 | 78.75 | 83.48 | 79.54 | 88.29 |
| leather | 50.32 | 63.34 | 67.41 | 61.91 | 68.63 | 68.07 | 59.70 | 74.46 |
| metal_nut | 69.64 | 87.03 | 59.76 | 83.67 | 80.54 | 73.52 | 80.02 | 91.33 |
| pill | 54.40 | 72.77 | 36.60 | 69.02 | 70.11 | 60.96 | 67.43 | 81.02 |
| screw | 20.63 | 49.35 | 23.06 | 51.86 | 48.42 | 36.11 | 52.22 | 48.70 |
| tile | 82.83 | 87.80 | 83.37 | 88.59 | 82.91 | 83.14 | 85.15 | 90.67 |
| toothbrush | 29.07 | 69.12 | 30.15 | 62.25 | 60.77 | 37.67 | 58.81 | 78.56 |
| transistor | 55.03 | 68.10 | 49.77 | 70.57 | 68.92 | 72.85 | 69.87 | 81.91 |
| wood | 64.46 | 78.80 | 54.97 | 72.97 | 72.49 | 66.84 | 76.84 | 84.74 |
| zipper | 74.62 | 78.02 | 77.09 | 73.73 | 77.11 | 74.82 | 75.54 | 77.92 |
| Average | 58.06 | 72.88 | 53.70 | 68.65 | 69.45 | 63.94 | 68.63 | 77.99 |

The importance of fast sampling speed.

In the SIAS, synthesized data are used to adapt downstream segmentation models. Our objective is to synthesize anomalies that are structurally aligned and mask-consistent within a limited timeframe, rather than pursuing high-fidelity visual realism. This is why using a large number of sampling steps can degrade segmentation performance. While we acknowledge that increasing the number of steps within a certain range may lead to slight improvements in segmentation performance, our ablation study shows that this comes at the cost of a significant increase in inference time, which can hinder its practical applicability.

Moreover, in industrial environments it is common to encounter "production line changeovers": different lines at different sites undergo frequent reconfigurations that introduce new products and materials. Such transitions often lead to previously unseen anomaly types that existing segmentation models are not equipped to detect. Delays in updating these models during transitions can increase reliance on manual inspection, raise false positive or false negative rates, and even cause production downtime. FAST is designed with this scenario in mind. Several domain adaptation studies similarly aim to minimize model switching time to improve operational efficiency; like them, FAST enables fast generation of segmentation-oriented anomalies, facilitating timely updates of segmentation models in dynamic manufacturing settings.

Review
5

This paper proposes FAST, a novel framework for unsupervised anomaly segmentation. Existing self-training methods often suffer from foreground-background imbalance and error accumulation in pseudo-labels. To address this, FAST introduces two key components:

  1. Foreground Guidance Module: Uses CLIP-based saliency to focus learning on foreground objects, reducing background bias in normal-only training.
  2. Two-Branch Self-Training: Combines a soft label branch (probabilistic supervision) and a hard label branch (high-confidence pseudo-labels filtered by saliency) to enhance training stability and segmentation accuracy.

Strengths and Weaknesses

Strengths

  1. Clear and Well-Motivated Problem Setup.
    The paper identifies a key challenge in anomaly segmentation—foreground suppression due to background-dominated normal samples—and addresses it directly.
  2. Foreground-Aware Design.
    Leveraging CLIP-based saliency maps for guidance is a novel and practical way to enhance semantic focus.
  3. Solid Theoretical Justification.
    The framework is conceptually well-grounded, and the proposed design choices are supported with theoretical proofs.

Weakness

  1. Marginal Performance Gains.
    The improvements over strong baselines are relatively small on standard benchmarks, limiting the perceived practical impact.
  2. CLIP Dependency.
    The method relies heavily on CLIP saliency, which may not generalize well to domains with domain shift or non-natural images.
  3. Saliency Map Heuristic.
    The use of CLIP-derived saliency as ground-truth guidance is not learned, and may introduce biases or fail in complex backgrounds.

Questions

  1. CLIP Reliability.
    Have the authors tested the robustness of CLIP-based saliency guidance across different domains (e.g., grayscale, medical, or low-resolution images)?
  2. Inference Latency.
    Can the authors provide runtime analysis or memory benchmarks compared to one-branch baselines? Is the method suitable for real-time applications?

Limitations

Yes

Final Justification

I appreciate the authors’ detailed and well-prepared rebuttal, which has clarified several of my earlier concerns. In particular, the additional experiments on medical datasets, the explanation of CLIP dependency, and the runtime analysis have addressed many of my questions and strengthened the paper. I also acknowledge that the theoretical grounding of the framework and the foreground-aware design are well-motivated and conceptually sound.

However, some issues remain only partially resolved. While the rebuttal provides evidence of performance gains, the improvements over baselines are still perceived as marginal on standard benchmarks, and the dependency on saliency-based guidance continues to raise concerns regarding robustness in more diverse domains. Paper clarity could also be further improved to ensure accessibility to a broader audience.

Taking these points into account, I believe the paper presents a technically solid contribution with a meaningful idea and practical potential, but with limitations in generalization and clarity that prevent me from being fully convinced of its broader impact. For these reasons, I will maintain my current score.

Formatting Issues

N/A

Author Response

We sincerely thank you for your thoughtful comments, particularly the recognition of our solid theoretical grounding and the proposed foreground-aware design. Below we address the key concerns.


Q1: Limited performance gains.

We respectfully note that FAST consistently outperforms all prior anomaly synthesis methods across multiple segmentation architectures and datasets (see Table 1, Table 2, and Supplementary A.6). In particular, FAST improves the average segmentation accuracy by +9.22, +9.46, and +6.67 points for SegFormer, BiSeNetV2, and STDC on MVTec-AD, compared with the second-strongest baselines.

Additionally, ablation studies demonstrate substantial gains over the baseline LDM in both quality and controllability of the synthesized anomalies. FAST achieves these improvements with as few as 10 sampling steps, offering over 100× speedup compared to standard DDPM (Table 3). We believe these improvements are non-trivial and practically meaningful for real-world industrial deployment.


Q2: CLIP dependency and generalization.

We appreciate your concern regarding CLIP-based saliency guidance. In our setup, FARM does not directly rely on CLIP embeddings: it learns to reconstruct anomaly-aware content in a data-driven manner and is fully decoupled from CLIP during inference. FAST therefore does not treat CLIP-based saliency as fixed ground truth.

As also demonstrated in the ablation study (Fig. 8), this strategy significantly improves anomaly localization without relying on saliency maps. Specifically, we follow AnomalyDiffusion by employing BERT/CLIP only to initialize the textual embeddings, which are still optimized during training.

To further assess generalization under domain shift, we conduct additional experiments on a grayscale medical dataset (Montgomery X-ray). We extract 1,200 image–mask pairs (256×256) for training and 400 for testing, and augment the training data with 500 synthetic samples generated by FAST.

The results show encouraging improvements in downstream lesion segmentation, suggesting that FAST holds promise in domains with domain shift or limited visual structure, even when saliency may be less reliable.

Table: Segmentation results on the Montgomery County X-ray dataset using Segformer
Original data refers to training using only real samples, while Augmented data includes both real and FAST-synthesized samples.

| Experiment | mIoU (%) | Accuracy (%) |
|---|---|---|
| Original data | 87.79 | 90.26 |
| Augmented data | 89.41 | 91.82 |

Q3: Inference latency.

Thanks for this important question. As reported in Table 3 and Sec. 4.3, FAST achieves impressive performance with only 10–50 steps, compared to 1000 steps in DDPM.

Furthermore, FAST does not require retraining for different step sizes, making it highly practical for real-time or resource-constrained scenarios. We also report the time cost for varying numbers of AIAS segments (batch size = 8), which shows that FAST achieves up to 100× speedup while maintaining high-quality generation, as shown in the first table below.

In terms of inference latency, FAST is also highly efficient. It operates with minimal additional trainable parameters and achieves competitive results. We report runtime benchmarks under a consistent setting (batch size = 1), comparing FAST with other diffusion-based anomaly synthesis methods, as shown in the second table below.

These results confirm the suitability of FAST for downstream real-time segmentation tasks.

Table: Time cost (in seconds) of pixel-level anomaly segmentation with different diffusion steps on the MVTec dataset (batch size = 8)

| Step Count | 2 | 5 | 10 | 30 | 50 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|---|
| Time (s) | 1.39 | 2.92 | 3.87 | 10.64 | 18.16 | 36.73 | 74.21 | 183.34 | 367.97 |

Table: Inference FLOPs and number of trainable parameters of different models

| Model | Inference FLOPs (TFLOPs) | Trainable Parameters (M) |
|---|---|---|
| AnomalyDiffusion | 0.61 | 356.0 |
| RealNet | 1.11 | 556.7 |
| FAST | 0.51 | 3.40 |
Comment

Thank you for your clear and well-prepared rebuttal. While I appreciate the detailed clarifications, additional examples, and further experiments, which have helped address several of my earlier concerns, some of my concerns remain only partially addressed. I will maintain my current score.

Final Decision

This paper presents FAST, an innovative framework for unsupervised anomaly segmentation. Current self-training approaches frequently grapple with issues such as foreground-background imbalance and the accumulation of errors in pseudo-labels. To tackle these challenges, FAST incorporates two core components:

  1. Foreground Guidance Module: leverages CLIP-based saliency to direct learning toward foreground objects, thereby mitigating background bias in training that relies solely on normal samples.
  2. Two-Branch Self-Training: integrates a soft label branch (providing probabilistic supervision) and a hard label branch (utilizing high-confidence pseudo-labels filtered through saliency) to boost both training stability and segmentation accuracy.

All reviewers are positive about this work, and the authors have addressed all the raised issues.

The AC recommends the acceptance of this work.