PaperHub
Overall rating: 5.4/10 · Rejected (5 reviewers)
Individual ratings: 6, 5, 5, 6, 5 (lowest 5, highest 6, std 0.5)
Average confidence: 4.0
Correctness: 2.6 · Contribution: 2.6 · Presentation: 2.6
ICLR 2025

One Step Diffusion-based Super-Resolution with Time-Aware Distillation

OpenReview · PDF
Submitted: 2024-09-19 · Updated: 2025-02-05

Keywords: Efficient diffusion, Super-resolution, Knowledge distillation

Reviews and Discussion

Review
Rating: 6

The authors propose a time-aware diffusion distillation method, named TAD-SR, in which a novel score distillation strategy is introduced to align the score functions between the outputs of the student and teacher models after minor noise perturbation. This distillation strategy eliminates the inherent bias in score distillation sampling (SDS) and enables the student model to focus more on high-frequency image details by sampling at smaller time steps. Furthermore, a time-aware discriminator is designed to mitigate performance limitations stemming from distillation; it distinguishes the diffused distributions of real and generated images under varying noise disturbance levels by injecting time information.

Strengths

  1. The proposed distillation strategy is simple and straightforward, which can eliminate the inherent bias in score distillation sampling (SDS) and enable the student models to focus more on high-frequency image details.
  2. The proposed time-aware discriminator can differentiate between real and synthetic data, contributing to the generation of high-quality images.
  3. This work is well written and easy to read.

Weaknesses

  1. It is confusing which is the final output of the model at inference, $z_0^{stu}$ or $\hat{z}_0^{stu}$? This is not clearly indicated in Figure 4. Please state it explicitly in the text and figure.
  2. The authors should clarify if the teacher model is used at all during inference, or if it is only used during training. If I understand correctly, only the student model samples one step, and then the teacher model is used later to sample multiple steps to get the final clean latent, so the model performance relies heavily on the performance of the teacher model, and is not exactly efficient.
  3. What is the purpose of setting the weighting function (ω = 1/CS)? Please provide intuition for why this weighting function was chosen, and what effect it has on the training process or results.
  4. To eliminate the dependence of the proposed method on the ResShift teacher model, ablation experiments should be conducted with different teacher models to validate the effectiveness of the proposed method.
  5. The experiments lack comparisons with the most relevant distillation methods, including DMD, DEQ[1], DFOSD[2], etc. Among them, DMD, a new diffusion model, utilizes similar score distillation techniques to the proposed HSD. DEQ and DFOSD are both efficient and relevant diffusion models, which require one-step diffusion distillation or even no distillation.
  6. In the experimental section, the authors compare many GAN and transformer-related methods. However, the proposed method is a diffusion model and should be compared with the most relevant diffusion models to validate its efficiency, especially accelerated diffusion models, including OSEDiff[3], DPM++[4], Unipc[5], etc.
  7. The authors claim that the method is designed to accomplish effective and efficient image super-resolution, but did not include a complexity comparison of the different methods (including parameters, sampling steps, running time, MACs, etc.), which is crucial for diffusion models. Please provide a Table to compare these computational complexity metrics with the key baselines.
  8. Are there any limiting conditions for using the method? The authors should discuss and analyze the limitations of the proposed method. It is recommended to add a discussion of potential limitations or of cases where the proposed method might not perform as well.

References

[1] Geng Z, Pokle A, Kolter J Z. One-step diffusion distillation via deep equilibrium models[C]. Advances in Neural Information Processing Systems, 2024.

[2] Li J, Cao J, Zou Z, et al. Distillation-free one-step diffusion for real-world image super-resolution[J]. arXiv preprint arXiv:2410.04224, 2024.

[3] Wu R, Sun L, Ma Z, et al. One-step effective diffusion network for real-world image super-resolution[J]. arXiv preprint arXiv:2406.08177, 2024.

[4] Lu C, Zhou Y, Bao F, et al. DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models[J]. arXiv preprint arXiv:2211.01095, 2022.

[5] Zhao W, Bai L, Rao Y, et al. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models[C]. Advances in Neural Information Processing Systems, 2024.

Questions

See the Weaknesses part. The authors should carefully describe the details of the method to enhance the readability and clarity of the paper. In addition, comparisons with the most relevant methods (including complexity comparisons) should be added to clarify the innovation and effectiveness of the method, and its advancement should be demonstrated through relevant experiments.

I am inclined to raise my score if the authors can resolve my concerns.

Comment

Table 7: Complexity comparison among different SD-based SR methods. All methods are tested on the ×4 (128→512) SR task, and inference time is measured on a V100 GPU.

| Method | StableSR | PASD | SeeSR | SeeSR+UniPC | SeeSR+DPMSolver | AddSR | OSEDiff | TAD-SR |
|---|---|---|---|---|---|---|---|---|
| NFE | 200 | 20 | 50 | 10 | 10 | 1 | 1 | 1 |
| Inference time (s) | 17.76 | 13.51 | 8.4 | 2.14 | 2.13 | 0.64 | 0.48 | 0.64 |

Q8: Are there any limiting conditions for using the method? The authors should discuss and analyze the limitations of the proposed method. It is recommended to add a discussion of potential limitations or of cases where the proposed method might not perform as well.

A8: Thank you for your suggestion. Although our single-step method demonstrates strong performance, it shares a common limitation with current single-step distillation methods: increasing the number of inference steps alone does not yield better performance. Thus, developing a distillation method that matches the performance of state-of-the-art single-step approaches while enabling additional inference steps to enhance performance is a key area of our ongoing research.

References

[1] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W. T., & Park, T. (2024). One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6613-6623).

[2] Hertz, A., Aberman, K., & Cohen-Or, D. (2023). Delta denoising score. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2328-2337).

Comment

Thank you for your response. I choose to keep my score as is mainly because the performance improvement appears to be somewhat marginal (or, in some cases, the improvement in certain metrics comes at the cost of others), which also validates my previous concerns.

Comment

I appreciate the response from the authors, but I will keep my score. The authors did not fully address my concerns. First, the explanation regarding the proposed model's heavy reliance on the teacher model does not convince me, and the authors did not explain the efficiency of the proposed method. Second, the authors did not provide an intuitive reason for choosing the weighting function or explain how it affects the training process or results. More importantly, the complexity comparison should not just compare inference time; it should include other key figures for diffusion models, such as parameter count, sampling steps, and MACs.

Comment

Table 4: Quantitative results of different SR methods. The best and second best results are highlighted in bold and italic. ∗ indicates that the result was obtained by replicating the method in the paper.

| Methods | ImageNet-test LPIPS ↓ | ImageNet-test CLIPIQA ↑ | ImageNet-test MUSIQ ↑ | RealSR CLIPIQA ↑ | RealSR MUSIQ ↑ | RealSet65 CLIPIQA ↑ | RealSet65 MUSIQ ↑ |
|---|---|---|---|---|---|---|---|
| LDM-15 | 0.269 | 0.512 | 46.419 | 0.384 | 49.317 | 0.427 | 47.488 |
| ResShift-15 | 0.231 | 0.592 | 53.660 | 0.596 | 59.873 | 0.654 | 61.330 |
| SinSR-1 | 0.221 | 0.611 | 53.357 | 0.689 | 61.582 | 0.715 | 62.169 |
| SinSR*-1 | 0.231 | 0.599 | 52.462 | 0.691 | 60.865 | 0.712 | 62.575 |
| DMD*-1 | 0.246 | 0.612 | 54.124 | 0.709 | 63.610 | 0.723 | 66.177 |
| TAD-SR-1 | 0.227 | 0.652 | 57.533 | 0.741 | 65.701 | 0.734 | 67.500 |

Table 5: Generative performance on unconditional CIFAR-10. The best results are highlighted in bold.

| Method | DDPM | DDIM | EDM (Teacher) | DPM-solver2 | UniPC | CD-L2 | CD-LPIPS | DEQ | DMD | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| NFE ↓ | 1000 | 50 | 35 | 12 | 8 | 1 | 1 | 1 | 1 | 1 |
| FID ↓ | 3.17 | 4.67 | 1.88 | 5.28 | 5.10 | 7.90 | 3.55 | 6.91 | 3.77 | 2.31 |

Q6: In the experimental section, the authors compare many GAN and transformer-related methods. However, the proposed method is a diffusion model and should be compared with the most relevant diffusion models to validate its efficiency, especially accelerated diffusion models, including OSEDiff[3], DPM++[4], Unipc[5], etc.

A6: Thank you for your suggestion. Since OSEDiff is an SD-based SR method, we compared our approach to OSEDiff while distilling the SD-based SR model SeeSR. This ensures a fair comparison, as both methods were trained on the same dataset. As shown in Tables 1, 2, and 3, our method outperforms OSEDiff across most evaluation metrics.

In response to the reviewers' suggestions, we have also incorporated the dedicated sampler methods UniPC [5] and DPM++ [4] into Tables 1, 2, and 3. (Note that we did not apply these samplers to ResShift, as ResShift modifies the standard Markov chain, creating challenges for its adaptation to these samplers.) Despite this, the results clearly demonstrate that our method significantly outperforms methods employing these samplers.

Q7: The authors claim that the method is designed to accomplish effective and efficient image super-resolution, but did not include a complexity comparison of the different methods (including parameters, sampling steps, running time, MACs, etc.), which is crucial for diffusion models. Please provide a Table to compare these computational complexity metrics with the key baselines.

A7: Based on the reviewers' feedback, we have included a complexity comparison between TAD-SR and baseline methods, as presented in Tables 6 and 7. Table 6 focuses on comparisons with GAN-based methods and diffusion-based super-resolution methods trained from scratch. The results demonstrate that TAD-SR accelerates the teacher model, ResShift, to a single inference step, improving its speed by approximately tenfold. Table 7 highlights a comparison of inference time with SD-based super-resolution methods, revealing that our method's inference latency is only 7.6% of that of the teacher model, SeeSR.

Table 6: Complexity comparison among different SR methods. All methods are tested on the ×4 (64→256) SR task, and inference time is measured on an A100 GPU.

| Method | ESRGAN | RealSR-JPEG | BSRGAN | SwinIR | RealESRGAN | DASR | LDM | ResShift | SinSR | TAD-SR |
|---|---|---|---|---|---|---|---|---|---|---|
| NFE | 1 | 1 | 1 | 1 | 1 | 1 | 15 | 15 | 1 | 1 |
| Inference time (s) | 0.038 | 0.038 | 0.038 | 0.107 | 0.038 | 0.022 | 0.408 | 0.682 | 0.058 | 0.058 |

Comment

Thank you for your comments and feedback. We address your concerns here.

Q1: It is confusing which is the final output of the model at inference, $z_0^{stu}$ or $\hat{z}_0^{stu}$? This is not clearly indicated in Figure 4. Please state it explicitly in the text and figure.

A1: Thank you for pointing out this issue. $z_0^{stu}$ is the final output of the student model. $\hat{z}_0^{stu}$ denotes the clean value predicted by the teacher model after noise is re-added to the student model's output; it is used only to calculate the loss. We will revise the text and figures in the manuscript to make this clearer.

Q2: The authors should clarify if the teacher model is used at all during inference, or if it is only used during training. If I understand correctly, only the student model samples one step, and then the teacher model is used later to sample multiple steps to get the final clean latent, so the model performance relies heavily on the performance of the teacher model, and is not exactly efficient.

A2: Thank you for your suggestion. As the reviewer understands, the teacher model is only used during training. Additionally, we not only leverage the knowledge from the teacher model but also incorporate the ground truth (GT) into the distillation framework through adversarial learning to provide additional supervision for the model. Therefore, the performance of our method is not solely dependent on the teacher model's performance.

Q3: What is the purpose of setting the weighting function (ω = 1/CS )? Please provide intuition for why this weighting function was chosen, and what effect it has on the training process or results.

A3: Apologies for the confusion. What we intended to convey is that our score distillation loss is averaged over both spatial and channel dimensions, which facilitates model optimization [1][2]. However, there was an error in the formula expression, and we will correct this in the next version of the manuscript.

Q4: In order to eliminate the dependence of the proposed method on the teacher model of ResShift, the relevant ablation experiments should be conducted by replacing the different teacher models to validate the effectiveness of the proposed method.

A4: Thank you for your suggestion. We have included the results of distilling the SD-based SR method SeeSR into a single step using TAD-SR. The quantitative and qualitative experimental results are presented in Tables 1, 2, and 3. As shown in the tables, our proposed distillation method demonstrates strong generalization capabilities, effectively distilling different teacher models into a single step and generating promising results.

Table 1: Quantitative comparison with the state of the art on the RealSR dataset. Following the experimental setup of SeeSR, the LR images in the RealSR dataset were center-cropped to 128 × 128. The best and second best results are highlighted in bold and italic.

| Methods | PSNR ↑ | LPIPS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
|---|---|---|---|---|---|---|---|
| BSRGAN | 26.49 | 0.267 | 141.28 | 5.66 | 0.512 | 63.28 | 0.376 |
| RealESRGAN | 25.78 | 0.273 | 135.18 | 5.83 | 0.449 | 60.36 | 0.373 |
| LDL | 25.09 | 0.277 | 142.71 | 6.00 | 0.430 | 58.04 | 0.342 |
| FeMaSR | 25.17 | 0.294 | 141.05 | 5.79 | 0.541 | 59.06 | 0.361 |
| StableSR-200 | 25.63 | 0.302 | 133.40 | 5.76 | 0.528 | 61.11 | 0.366 |
| ResShift-15 | 26.34 | 0.346 | 149.54 | 6.87 | 0.542 | 56.06 | 0.375 |
| PASD-20 | 26.67 | 0.344 | 122.30 | 6.06 | 0.519 | 62.92 | 0.404 |
| SeeSR-50 | 25.24 | 0.301 | 125.42 | 5.39 | 0.670 | 69.82 | 0.540 |
| SeeSR (UniPC-10) | 25.86 | 0.281 | 122.41 | 5.53 | 0.577 | 67.12 | 0.476 |
| SeeSR (DPMSolver-10) | 25.90 | 0.281 | 122.46 | 5.54 | 0.581 | 67.12 | 0.478 |
| SinSR-1 | 26.16 | 0.308 | 142.44 | 5.75 | 0.630 | 60.96 | 0.399 |
| AddSR-1 | 23.12 | 0.309 | 132.01 | 5.54 | 0.552 | 67.14 | 0.488 |
| OSEDiff-1 | 25.15 | 0.292 | 123.49 | 5.63 | 0.668 | 68.99 | 0.474 |
| TAD-SR-1 | 24.50 | 0.304 | 118.38 | 5.13 | 0.676 | 69.02 | 0.526 |

Comment

Table 2: Quantitative comparison with the state of the art on the RealLR200 dataset. The best and second best results are highlighted in bold and italic. Note that since the RealLR200 dataset lacks high-resolution images, we only computed non-reference metrics.

| Methods | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
|---|---|---|---|---|
| BSRGAN | 4.38 | 0.570 | 64.87 | 0.369 |
| RealESRGAN | 4.20 | 0.542 | 62.93 | 0.366 |
| LDL | 4.38 | 0.509 | 60.95 | 0.327 |
| FeMaSR | 4.34 | 0.655 | 64.24 | 0.410 |
| StableSR-200 | 4.25 | 0.592 | 62.89 | 0.367 |
| ResShift-15 | 6.29 | 0.647 | 60.25 | 0.418 |
| PASD-20 | 4.18 | 0.620 | 66.35 | 0.419 |
| SeeSR-50 | 4.16 | 0.662 | 68.63 | 0.491 |
| SeeSR (UniPC-10) | 4.25 | 0.601 | 66.90 | 0.433 |
| SeeSR (DPMSolver-10) | 4.28 | 0.603 | 66.92 | 0.435 |
| SinSR-1 | 5.62 | 0.697 | 63.85 | 0.445 |
| AddSR-1 | 4.06 | 0.585 | 66.86 | 0.418 |
| OSEDiff-1 | 4.05 | 0.674 | 69.61 | 0.444 |
| TAD-SR-1 | 3.95 | 0.674 | 69.48 | 0.482 |

Table 3: Quantitative comparison with the state of the art on the DIV2K-val dataset. The best and second best results are highlighted in bold and italic.

| Methods | PSNR ↑ | LPIPS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
|---|---|---|---|---|---|---|---|
| BSRGAN | 24.58 | 0.335 | 44.22 | 4.75 | 0.524 | 61.19 | 0.356 |
| RealESRGAN | 24.29 | 0.311 | 37.64 | 4.68 | 0.527 | 61.06 | 0.382 |
| LDL | 23.83 | 0.326 | 42.28 | 4.86 | 0.518 | 60.04 | 0.375 |
| FeMaSR | 23.06 | 0.346 | 53.70 | 4.74 | 0.599 | 60.82 | 0.346 |
| StableSR-200 | 23.29 | 0.312 | 24.54 | 4.75 | 0.676 | 65.83 | 0.422 |
| ResShift-15 | 24.72 | 0.34 | 41.99 | 6.47 | 0.594 | 60.89 | 0.399 |
| PASD-20 | 24.51 | 0.392 | 31.58 | 5.37 | 0.551 | 59.99 | 0.399 |
| SeeSR-50 | 23.68 | 0.319 | 25.97 | 4.81 | 0.693 | 68.68 | 0.504 |
| SeeSR (UniPC-10) | 24.07 | 0.339 | 27.33 | 5.00 | 0.607 | 64.97 | 0.432 |
| SeeSR (DPMSolver-10) | 24.12 | 0.338 | 27.32 | 5.03 | 0.612 | 65.07 | 0.435 |
| SinSR-1 | 24.41 | 0.324 | 35.23 | 6.01 | 0.648 | 62.80 | 0.424 |
| AddSR-1 | 23.26 | 0.362 | 29.68 | 4.76 | 0.573 | 63.69 | 0.405 |
| OSEDiff-1 | 23.72 | 0.294 | 26.33 | 4.71 | 0.661 | 67.96 | 0.443 |
| TAD-SR-1 | 23.54 | 0.311 | 25.96 | 4.64 | 0.664 | 67.01 | 0.470 |

Q5: The experiments lack comparisons with the most relevant distillation methods, including DMD, DEQ[1], DFOSD[2], etc. Among them, DMD, a new diffusion model, utilizes similar score distillation techniques to the proposed HSD. DEQ and DFOSD are both efficient and relevant diffusion models, which require one-step diffusion distillation or even no distillation.

A5: Thank you for your suggestion. We applied DMD to super-resolution tasks and compared it with our proposed method. From Table 4, it can be seen that while DMD achieves promising results when transferred to super-resolution tasks, it remains inferior to our approach. Regarding DEQ[1], its high training cost makes applying it to super-resolution tasks extremely challenging. As noted in its original paper, DEQ experiments were only conducted on the CIFAR-10 dataset due to these limitations. For DFOSD[2], we found that its code is not open source, and the training relied on a self-collected dataset that is not publicly available, making it difficult to perform a fair comparison with our method.

To further validate the effectiveness of our approach, we applied TAD-SR to unconditional generation tasks and compared it with DMD and DEQ on the CIFAR-10 dataset. The experimental results are presented in Table 5. The results demonstrate that our method performs well in unconditional generation tasks, surpassing both DMD and DEQ.

Comment

Thank you for your response. We will address your remaining concerns as follows.

Q1: The explanation that the proposed model relies heavily on the teacher model.

A1: First, we would like to clarify that during inference, only the student model performs single-step sampling to generate samples, while the teacher model supervises the student by generating samples through multi-step sampling during training.
Second, the knowledge distillation technique aims to transfer knowledge from the teacher model to the student model through training, meaning the student model's performance is inevitably influenced by the teacher. However, to prevent the student model's performance from being entirely constrained by the teacher, we have incorporated ground truth into the distillation framework through adversarial learning, providing additional supervision. Experimental results demonstrate that our method even outperforms the teacher model on certain non-reference metrics. Furthermore, in response to the reviewer's comments, we replaced the teacher model in our experiments. As shown in Tables 11, 12, and 13 of the paper, our method continues to generate high-quality images through single-step inference, clearly demonstrating its effectiveness.

Q2: The weighting function of HSD.

A2: Regarding the loss weight, we followed the approach used in DMD [1] and DDS [2], normalizing the loss across both spatial and channel dimensions (i.e., $\omega = 1/CS$). This normalization is commonly applied in the training of prior models, as it facilitates better model optimization. We have also provided results without the weighting function $\omega$ for comparison; the effectiveness of the weighting function is evident from Table 1.
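To make the $\omega = 1/CS$ weighting concrete, here is a minimal sketch of the normalization described above, assuming $(B, C, H, W)$ latent tensors (the function and variable names are ours, for illustration only):

```python
import torch

def normalized_score_loss(pred_stu: torch.Tensor, pred_tea: torch.Tensor) -> torch.Tensor:
    """Squared residual averaged over channel and spatial dims, i.e. weighted by 1/CS."""
    residual = (pred_stu - pred_tea) ** 2         # (B, C, H, W)
    per_sample = residual.flatten(1).mean(dim=1)  # sum / (C*H*W): the 1/CS weighting
    return per_sample.mean()                      # average over the batch
```

Averaging rather than summing keeps the gradient scale independent of the latent resolution, which matches the optimization benefit described above.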

Q3: Complexity comparison.

A3: Finally, we would like to emphasize that both Table 2 in the initial manuscript and Table 6 in the revised manuscript compare the sampling steps, inference time, and parameter count of our method and others. Additionally, in response to the reviewer's comments, we have included a comparison of FLOPs, with the results shown in Tables 2 and 3. Table 2 focuses on comparisons with diffusion-based super-resolution methods trained from scratch. Table 3 highlights a comparison of computational complexity with SD-based super-resolution methods.

Table 1: Ablation studies of the weighting function of HSD on RealSR and RealSet65 benchmarks. The best results are highlighted in bold.

| Settings | RealSet65 CLIPIQA ↑ | RealSet65 MUSIQ ↑ | RealSR CLIPIQA ↑ | RealSR MUSIQ ↑ |
|---|---|---|---|---|
| w/o weighting function | 0.723 | 66.242 | 0.731 | 64.425 |
| Ours | 0.734 | 67.500 | 0.741 | 65.701 |

Table 2: Complexity comparison among different SR methods. All methods are tested on the ×4 (64→256) SR task, and inference time is measured on an A100 GPU.

| Method | LDM | ResShift (teacher) | SinSR | DMD | TAD-SR |
|---|---|---|---|---|---|
| NFE | 15 | 15 | 1 | 1 | 1 |
| #Parameters (M) | 168.92 | 173.91 | 173.91 | 173.91 | 173.91 |
| Inference time (s) | 0.408 | 0.682 | 0.058 | 0.058 | 0.058 |
| FLOPs (G) | 1208.7 | 1506.75 | 100.45 | 100.45 | 100.45 |

Table 3: Complexity comparison among different SD-based SR methods. All methods are tested on the ×4 (128→512) SR task, and inference time is measured on a V100 GPU.

| Method | StableSR | PASD | SeeSR (teacher) | AddSR | OSEDiff | TAD-SR |
|---|---|---|---|---|---|---|
| NFE | 200 | 20 | 50 | 1 | 1 | 1 |
| #Parameters (M) | 1002.95 | 1333.53 | 1703.05 | 1703.05 | 1378.39 | 1703.05 |
| Inference time (s) | 17.76 | 13.51 | 8.4 | 0.64 | 0.48 | 0.64 |
| FLOPs (G) | 157294 | 28675.27 | 11488 | 488.76 | 7995.58 | 488.76 |

[1] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W. T., & Park, T. (2024). One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6613-6623).

[2] Hertz, A., Aberman, K., & Cohen-Or, D. (2023). Delta denoising score. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2328-2337).

Comment

Thank you for your response. I have no further questions and am willing to increase my score.

Comment

Thank you for carefully reviewing the discussion and deciding to increase your score. We are pleased to revise the manuscript based on your suggestions, which have made it more robust and easier to understand.

Review
Rating: 5

This paper introduces TAD-SR, a time-aware diffusion distillation method designed to enhance the efficiency and performance of diffusion-based image super-resolution (SR) models. By aligning the student and teacher models with the proposed score distillation strategy and incorporating a time-aware discriminator to distinguish real and synthetic data across varying noise levels, TAD-SR achieves strong performance across several metrics.

Strengths

  1. The topic is interesting and meaningful.
  2. Extensive experiments demonstrate that TAD-SR achieves results comparable to or exceeding those of multi-step diffusion models, especially on some non-reference IQA metrics.

Weaknesses

  1. The organization of the paper needs improvement, as it is challenging to clearly understand the core idea. For instance, Fig. 2, which aims to illustrate the paper's motivation, has a caption that provides limited information.

  2. The paper lacks essential metrics, such as PSNR and SSIM, to evaluate model fidelity. As shown in previous works, there is a trade-off between PSNR, SSIM, and CLIPIQA, MUSIQ. Reporting only LPIPS and non-reference IQA metrics is insufficient to demonstrate performance. Both the main results and ablation studies should include these metrics.

  3. Although I understand that StableDiffusionXL also employs adversarial loss, it appears less elegant to me due to the inherent limitations of GANs.

  4. In addition to the difficulty of assessing performance without PSNR and SSIM, the reported improvements seem marginal compared to existing methods.

Questions

The motivation is not clear. If the proposed method aims to achieve one-step SR, why is it important for the student model to learn how to deal with the intermediate steps?

Will increasing the inference steps contribute to improving the performance?

Comment

Q5: The motivation is not clear. If the proposed method aims to achieve one-step SR, why is it important for the student model to learn how to deal with the intermediate steps?

A5: We apologize that our description was not clear enough and caused a misunderstanding; we will enhance the readability of the paper in the revised PDF. To clarify, our student model accepts a fixed time step $T$ to generate clean samples in a single step. The intermediate time steps we sample are used solely to calculate the loss. Specifically, we leverage the pre-trained diffusion model's ability to handle intermediate time steps to constrain the single-step output of the student model. Diffusion models typically predict low-frequency information in the early stages of denoising and high-frequency information in the later stages. Therefore, we add varying levels of noise to both the clean samples generated by the student model and those generated by the teacher model, then feed them into a pre-trained diffusion model for prediction. By calculating the distance between the two predicted values, we can constrain the samples generated by the student model to match the high-frequency or low-frequency information in the teacher model's samples.
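The following is a minimal sketch of the training-time mechanism described above, assuming a standard DDPM-style forward process (ResShift actually modifies this chain); all names (`student`, `frozen_diffusion`, `alphas_cumprod`, etc.) are illustrative placeholders, not the authors' code:

```python
import torch
import torch.nn.functional as F

def add_noise(z0, noise, t, alphas_cumprod):
    # DDPM-style forward diffusion q(z_t | z_0); an assumption for illustration.
    abar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return abar.sqrt() * z0 + (1 - abar).sqrt() * noise

def distill_loss(student, frozen_diffusion, z_T, z0_tea, T, alphas_cumprod, t_max):
    z0_stu = student(z_T, T)                      # one-step prediction at fixed T

    # Sample a *small* intermediate time step so the frozen model is queried
    # in its late, high-frequency denoising regime.
    t = torch.randint(1, t_max, (z_T.shape[0],), device=z_T.device)

    noise = torch.randn_like(z0_stu)              # same noise for both branches
    pred_stu = frozen_diffusion(add_noise(z0_stu, noise, t, alphas_cumprod), t)
    with torch.no_grad():
        pred_tea = frozen_diffusion(add_noise(z0_tea, noise, t, alphas_cumprod), t)

    # The distance between the two predictions constrains the student's
    # single-step output; gradients reach the student only through z0_stu.
    return F.mse_loss(pred_stu, pred_tea)
```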

Q6: Will increasing the inference steps contribute to improving the performance?

A6: This is really a good question! Normally, if only a single time step is sampled to train the student model, simply increasing the number of iterations during inference will not lead to any performance improvement. This is because the model has only learned the mapping from noisy data to clean data at that specific time step and lacks the ability to process noisy data at other intermediate time steps. This limitation is common to all single-step distillation methods. Thus, developing a distillation method that matches the performance of state-of-the-art single-step approaches while enabling additional inference steps to enhance performance is a key area of our ongoing research. We will include a discussion on this aspect in the revised PDF.

References

[1] Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024). Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 1-21.

[2] Xie, R., Tai, Y., Zhao, C., Zhang, K., Zhang, Z., Zhou, J., ... & Yang, J. (2024). Addsr: Accelerating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717.

[3] Wu, R., Yang, T., Sun, L., Zhang, Z., Li, S., & Zhang, L. (2024). Seesr: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 25456-25467).

[4] Sauer, A., Lorenz, D., Blattmann, A., & Rombach, R. (2025). Adversarial diffusion distillation. In European Conference on Computer Vision (pp. 87-103). Springer, Cham.

[5] Xu, Y., Zhao, Y., Xiao, Z., & Hou, T. (2024). Ufogen: You forward once large scale text-to-image generation via diffusion gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8196-8206).

Comment

Dear Reviewer B832,

We sincerely appreciate your response, but it seems that you have replied in the wrong place. We will continue to address your concerns here.

Our method demonstrates significant improvements over other methods in most metrics for real-world image super-resolution and blind face restoration tasks, particularly when compared to SinSR, a single-step SR technique. Additionally, we replaced the teacher model (ResShift) with an SD-based SR model (SeeSR) and conducted extensive experiments. The experimental results are presented in Tables 11, 12, and 13 of the manuscript. Our method achieved performance comparable to the teacher model and outperformed other comparison methods in most metrics, effectively validating the effectiveness of our approach. Furthermore, it is noteworthy that previous methods were also unable to consistently outperform comparative methods across all indicators and scenarios, which is a highly challenging task.

Comment

Table 2: Quantitative results of different methods on the dataset of CelebA-Test. The best and second best results are highlighted in bold and italic. ∗ indicates that the result was obtained by replicating the method in the paper.

| Methods | PSNR ↑ | SSIM ↑ | LPIPS ↓ | IDS ↓ | LMD ↓ | FID-F ↓ | FID-G ↓ | CLIPIQA ↑ | MUSIQ ↑ |
|---|---|---|---|---|---|---|---|---|---|
| DFDNET | 10.833 | 0.449 | 0.739 | 86.323 | 20.784 | 93.621 | 76.118 | 0.619 | 51.173 |
| PSFRGAN | 19.662 | 0.582 | 0.475 | 74.025 | 10.168 | 63.676 | 60.748 | 0.630 | 69.910 |
| GFPGANv1.2 | 19.558 | 0.605 | 0.416 | 66.820 | 8.886 | 66.308 | 27.698 | 0.671 | 75.388 |
| RestoreFormer | 19.604 | 0.551 | 0.488 | 70.518 | 11.137 | 50.165 | 51.997 | 0.736 | 71.039 |
| VQFR | 19.979 | 0.622 | 0.411 | 65.538 | 8.910 | 58.423 | 25.234 | 0.685 | 73.155 |
| CodeFormer | 23.576 | 0.661 | 0.324 | 59.136 | 5.035 | 62.794 | 26.160 | 0.698 | 75.900 |
| DiffFace-100 | 24.033 | 0.705 | 0.338 | 63.033 | 5.301 | 52.531 | 23.212 | 0.527 | 66.042 |
| ResShift-15 | 23.413 | 0.671 | 0.309 | 59.623 | 5.056 | 50.164 | 17.564 | 0.613 | 73.214 |
| SinSR*-1 | 22.317 | 0.640 | 0.319 | 60.305 | 4.935 | 55.292 | 21.681 | 0.634 | 74.140 |
| TAD-SR-1 | 22.614 | 0.629 | 0.341 | 59.897 | 5.050 | 41.968 | 16.779 | 0.735 | 75.027 |

Table 3: Ablation studies of the proposed methods on ImageNet-Test benchmarks. The best results are highlighted in bold.

| Score distillation | Discriminator | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CLIPIQA ↑ | MUSIQ ↑ |
|---|---|---|---|---|---|---|
| SDS | – | 24.46 | 0.658 | 0.335 | 0.412 | 41.133 |
| SDS | ✓ | 24.76 | 0.670 | 0.300 | 0.469 | 46.024 |
| SDS | time-aware | 24.69 | 0.671 | 0.278 | 0.522 | 49.932 |
| HSD | – | 24.64 | 0.661 | 0.228 | 0.608 | 53.508 |
| HSD | ✓ | 23.89 | 0.640 | 0.227 | 0.649 | 57.370 |
| HSD | time-aware | 23.91 | 0.641 | 0.227 | 0.652 | 57.533 |

Q3: Although I understand that StableDiffusionXL also employs adversarial loss, it appears less elegant to me due to the inherent limitations of GANs.

A3: Recently, many diffusion-based methods [4][5] have begun integrating adversarial learning into the training process. Experimental results demonstrate that this approach can significantly enhance model performance, underscoring its potential value.

Q4: In addition to the difficulty of assessing performance without PSNR and SSIM, the reported improvements seem marginal compared to existing methods.

A4: In addition to PSNR and SSIM, our method demonstrates significant improvements over SinSR in other metrics. The table below lists the percentage improvements achieved by our method compared to SinSR.

Table 4: Quantitative comparison with the SinSR method on super-resolution tasks.

| Method | ImageNet-Test LPIPS ↓ | ImageNet-Test CLIPIQA ↑ | ImageNet-Test MUSIQ ↑ | RealSR CLIPIQA ↑ | RealSR MUSIQ ↑ | RealSet65 CLIPIQA ↑ | RealSet65 MUSIQ ↑ |
|---|---|---|---|---|---|---|---|
| SinSR* | 0.231 | 0.599 | 52.462 | 0.691 | 60.865 | 0.712 | 62.575 |
| TAD-SR | 0.227 (+1.7%) | 0.652 (+8.8%) | 57.533 (+9.7%) | 0.741 (+7.2%) | 65.701 (+7.9%) | 0.734 (+3%) | 67.5 (+7.9%) |

Comment

Thank you for your comments and feedback. We address your concerns here.

Q1: The organization of the paper needs improvement, as it is challenging to clearly understand the core idea. For instance, Fig. 2, which aims to illustrate the paper's motivation, has a caption that provides limited information.

A1: Thank you for your suggestion. We will carefully describe the details of this method in the revised manuscript to improve the readability and clarity of the paper.

Q2: The paper lacks essential metrics, such as PSNR and SSIM, to evaluate model fidelity. As shown in previous works, there is a trade-off between PSNR, SSIM, and CLIPIQA, MUSIQ. Reporting only LPIPS and non-reference IQA metrics is insufficient to demonstrate performance. Both the main results and ablation studies should include these metrics.

A2: Thank you for your suggestion. We have included PSNR and SSIM metrics in both our main experiments and ablation studies, as shown in Tables 1, 2, and 3. However, our experimental results, along with findings from previous studies, indicate that PSNR and SSIM do not always align with human perception or other indicators such as LPIPS, CLIPIQA, and MUSIQ. Specifically, when image quality improves and these perceptual indicators yield higher values, PSNR and SSIM often decrease; conversely, an increase in PSNR and SSIM typically corresponds to smoother and blurrier images. For instance, while methods such as LDM, ResShift, and DASR achieve higher PSNR and SSIM scores than others, the images they generate tend to appear smoother or blurrier (as shown in Figures 6 and 12). We infer that this discrepancy likely arises because PSNR and SSIM measure image differences in pixel space, whereas human perception and other metrics evaluate images based on perceptual quality. Therefore, we regard PSNR and SSIM as reference metrics rather than primary evaluation metrics in real-world super-resolution tasks, consistent with the conclusions of prior work [1][2][3].
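As a concrete illustration of the pixel-space nature of PSNR discussed above (a generic textbook definition, not tied to the authors' evaluation code):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    # PSNR depends only on the pixel-wise MSE, so an over-smoothed output that
    # stays close to the ground truth on average can still score well even
    # when its perceptual quality is poor.
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)
```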

Table 1: Quantitative results of different methods on the dataset of ImageNet-Test. The best and second best results are highlighted in bold and italic. ∗ indicates that the result was obtained by replicating the method in the paper.

| Methods | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CLIPIQA ↑ | MUSIQ ↑ |
|---|---|---|---|---|---|
| ESRGAN | 20.67 | 0.448 | 0.485 | 0.451 | 43.615 |
| RealSR-JPEG | 23.11 | 0.591 | 0.326 | 0.537 | 46.981 |
| BSRGAN | 24.42 | 0.659 | 0.259 | 0.581 | 54.697 |
| SwinIR | 23.99 | 0.667 | 0.238 | 0.564 | 53.790 |
| RealESRGAN | 24.04 | 0.665 | 0.254 | 0.523 | 52.538 |
| DASR | 24.75 | 0.675 | 0.250 | 0.536 | 48.337 |
| LDM-15 | 24.89 | 0.670 | 0.269 | 0.512 | 46.419 |
| ResShift-15 | 25.01 | 0.677 | 0.231 | 0.592 | 53.660 |
| SinSR-1 | 24.56 | 0.657 | 0.221 | 0.611 | 53.357 |
| SinSR*-1 | 24.59 | 0.659 | 0.231 | 0.599 | 52.462 |
| DMD*-1 | 24.05 | 0.629 | 0.246 | 0.612 | 54.124 |
| TAD-SR-1 | 23.91 | 0.641 | 0.227 | 0.652 | 57.533 |
Review
Rating: 5

This paper proposes a time-aware diffusion distillation method, TAD-SR, to achieve one-step SR inference with competitive performance. It applies a score distillation strategy that works to eliminate the inherent bias of SDS and to focus more on high-frequency image details by sampling at small time steps. A time-aware discriminator is also designed to differentiate between real and synthetic data.

Strengths

  1. This paper proposes a time-aware distillation method that accelerates diffusion-based SR models into a single inference step.
  2. The writing of this paper is good.

Weaknesses

See the questions.

Questions

  1. Since this is a distillation method, please compare more diffusion-based distillation SR methods, like OSEDiff [1], quantitatively and qualitatively. (Why are the comparison with diffusion-based distillation SR methods missing in some tables and figures?)

  2. Since you claim that TAD-SR can achieve better reconstruction of high-frequency information, please present the spectrum images of the LR input, GT, baseline methods’ reconstruction, and TAD-SR’s reconstruction. Examine the differences in the high-frequency patterns around the periphery of the spectrum images.

  3. Please compare the inference time of TAD-SR and baseline methods.

  4. In Fig. 10 and Fig. 12, TAD-SR’s results appear to contain many fragmented particles, which make the images look sharper at first glance; however, this is actually due to the addition of pseudo-textures or unnatural details. Could you explain the cause of this? For instance, could it be due to the adversarial loss?

  5. Following the concern raised in my 4th question, could you please provide more qualitative comparisons that contain fine details or small textures?

[1] Rongyuan Wu, et al. One-Step Effective Diffusion Network for Real-World Image Super-Resolution.

(I apologize for my previous review comments, which were not fully aligned with your article due to a heavy review workload. I am providing corrected feedback here, and if your response addresses these points well, I will consider adjusting the score.)

Comment

Q3: Please compare the inference time of TAD-SR and baseline methods.

A3: Based on the reviewers' feedback, we have included a complexity comparison between TAD-SR and baseline methods, as presented in Tables 4 and 5. Table 4 focuses on comparisons with GAN-based methods and diffusion-based super-resolution methods trained from scratch. The results demonstrate that TAD-SR accelerates the teacher model, ResShift, to a single inference step, improving its speed by approximately tenfold. Table 5 highlights a comparison of inference time with SD-based super-resolution methods, revealing that our method's inference latency is only 7.6% of that of the teacher model, SeeSR.

Table 4: Complexity comparison among different SR methods. All methods are tested on the ×4 (64→256) SR task, and inference time is measured on an A100 GPU.

| Method | ESRGAN | RealSR-JPEG | BSRGAN | SwinIR | RealESRGAN | DASR | LDM | ResShift | SinSR | TAD-SR |
|---|---|---|---|---|---|---|---|---|---|---|
| NFE | 1 | 1 | 1 | 1 | 1 | 1 | 15 | 15 | 1 | 1 |
| Inference time (s) | 0.038 | 0.038 | 0.038 | 0.107 | 0.038 | 0.022 | 0.408 | 0.682 | 0.058 | 0.058 |

Table 5: Complexity comparison among different SD-based SR methods. All methods are tested on the ×4 (128→512) SR task, and inference time is measured on a V100 GPU.

| Method | StableSR | PASD | SeeSR | SeeSR+UniPC | SeeSR+DPMSolver | AddSR | OSEDiff | TAD-SR |
|---|---|---|---|---|---|---|---|---|
| NFE | 200 | 20 | 50 | 10 | 10 | 1 | 1 | 1 |
| Inference time (s) | 17.76 | 13.51 | 8.4 | 2.14 | 2.13 | 0.64 | 0.48 | 0.64 |
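For reference, single-image GPU latency of the kind reported above is commonly measured with warm-up iterations and explicit CUDA synchronization; a generic sketch (the `model` call stands in for any of the compared pipelines and is not the authors' benchmarking script):

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, lr_image, warmup=5, runs=20):
    for _ in range(warmup):          # warm-up excludes CUDA init and cache effects
        model(lr_image)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(lr_image)
    torch.cuda.synchronize()         # wait for all queued kernels to finish
    return (time.perf_counter() - start) / runs
```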

Q4: In Fig. 10 and Fig. 12, TAD-SR’s results appear to contain many fragmented particles, which make the images look sharper at first glance; however, this is actually due to the addition of pseudo-textures or unnatural details. Could you explain the cause of this? For instance, could it be due to the adversarial loss?

A4: Upon careful examination of the images generated by our method and other super-resolution approaches, we observed that various methods may produce pseudo-textures in certain images to differing extents. We found that some of the unnatural textures generated by our method exhibit the same pattern as those produced by the teacher model, which we speculate may stem from inherent properties of the diffusion model itself. Additionally, on real-world datasets, this phenomenon is likely attributable to the mismatch between the degradations seen during training and testing: degradation during training is artificially synthesized and exhibits certain statistical regularities, whereas real-world degradation is more complex and diverse, which can lead to the generation of pseudo-textures.

Q5: Following the concern raised in my 4th question, could you please provide more qualitative comparisons that contain fine details or small textures?

A5: Sure, we provide more qualitative comparisons that contain fine details in Figure 9 and Figure 15 of the revised PDF.

Comment

Dear Reviewer uBAa:

The discussion period between the authors and the reviewer is nearing its end, and we kindly request that you review our clarifications and revisions. If our response addresses your concerns, we hope you can reconsider your score.

Thank you once again for your time and consideration.

Best Wishes!

Authors of Submission 1713

Comment

Table 3: Quantitative comparison with the state of the art on the DIV2K-val dataset. The best and second best results are highlighted in bold and italic.

| Methods | PSNR ↑ | LPIPS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
|---|---|---|---|---|---|---|---|
| BSRGAN | 24.58 | 0.335 | 44.22 | 4.75 | 0.524 | 61.19 | 0.356 |
| RealESRGAN | 24.29 | 0.311 | 37.64 | 4.68 | 0.527 | 61.06 | 0.382 |
| LDL | 23.83 | 0.326 | 42.28 | 4.86 | 0.518 | 60.04 | 0.375 |
| FeMaSR | 23.06 | 0.346 | 53.70 | 4.74 | 0.599 | 60.82 | 0.346 |
| StableSR-200 | 23.29 | 0.312 | 24.54 | 4.75 | 0.676 | 65.83 | 0.422 |
| ResShift-15 | 24.72 | 0.34 | 41.99 | 6.47 | 0.594 | 60.89 | 0.399 |
| PASD-20 | 24.51 | 0.392 | 31.58 | 5.37 | 0.551 | 59.99 | 0.399 |
| SeeSR-50 | 23.68 | 0.319 | 25.97 | 4.81 | 0.693 | 68.68 | 0.504 |
| SeeSR (UniPC-10) | 24.07 | 0.339 | 27.33 | 5.00 | 0.607 | 64.97 | 0.432 |
| SeeSR (DPMSolver-10) | 24.12 | 0.338 | 27.32 | 5.03 | 0.612 | 65.07 | 0.435 |
| SinSR-1 | 24.41 | 0.324 | 35.23 | 6.01 | 0.648 | 62.80 | 0.424 |
| AddSR-1 | 23.26 | 0.362 | 29.68 | 4.76 | 0.573 | 63.69 | 0.405 |
| OSEDiff-1 | 23.72 | 0.294 | 26.33 | 4.71 | 0.661 | 67.96 | 0.443 |
| TAD-SR-1 | 23.54 | 0.311 | 25.96 | 4.64 | 0.664 | 67.01 | 0.470 |

Q2: Since you claim that TAD-SR can achieve better reconstruction of high-frequency information, please present the spectrum images of the LR input, GT, baseline methods’ reconstruction, and TAD-SR’s reconstruction. Examine the differences in the high-frequency patterns around the periphery of the spectrum images.

A2: Thank you for your valuable suggestion. In Figure 10 of the appendix, we present the Fourier transform spectra of low-resolution (LR) images, ground truth (GT) images, and reconstructions from different super-resolution (SR) methods. From these spectra, it is evident that our method preserves more high-frequency information compared to other diffusion-based SR methods.
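For readers who want to reproduce this kind of analysis, a minimal sketch of a centered log-magnitude Fourier spectrum (generic NumPy, not the authors' plotting code; energy far from the center corresponds to high frequencies):

```python
import numpy as np

def log_spectrum(img: np.ndarray) -> np.ndarray:
    # img: 2D array (H, W), e.g. the luminance channel of an SR result.
    freq = np.fft.fftshift(np.fft.fft2(img))  # shift the DC component to the center
    return np.log1p(np.abs(freq))             # log scale makes the periphery visible
```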

Comment

Thank you for providing valuable feedback on our paper despite your busy schedule. We address your concerns here.

Q1: Since this is a distillation method, please compare more diffusion-based distillation SR methods, like OSEDiff [1], quantitatively and qualitatively. (Why are the comparison with diffusion-based distillation SR methods missing in some tables and figures?)

A1: Thank you for pointing out this issue. We have compared our method with OSEDiff, with quantitative results presented in Tables 1, 2, and 3. In the main text, we primarily use the super-resolution model ResShift trained from scratch as the teacher, enabling a fair comparison with SinSR, which also distills ResShift. In the appendix, we employ the SD-based SR method SeeSR as the teacher and mainly compare our approach with other SD-based SR methods and SD-based distillation SR methods. Due to substantial differences in the datasets used to train ResShift and SeeSR, comparisons involving SD-based SR methods are omitted from certain charts in the main text.

Table 1: Quantitative comparison with the state of the art on the RealSR dataset. Following the experimental setup of SeeSR, the LR images in the RealSR dataset were center-cropped to 128 × 128. The best and second best results are highlighted in bold and italic.

| Methods | PSNR ↑ | LPIPS ↓ | FID ↓ | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
|---|---|---|---|---|---|---|---|
| BSRGAN | 26.49 | 0.267 | 141.28 | 5.66 | 0.512 | 63.28 | 0.376 |
| RealESRGAN | 25.78 | 0.273 | 135.18 | 5.83 | 0.449 | 60.36 | 0.373 |
| LDL | 25.09 | 0.277 | 142.71 | 6.00 | 0.430 | 58.04 | 0.342 |
| FeMaSR | 25.17 | 0.294 | 141.05 | 5.79 | 0.541 | 59.06 | 0.361 |
| StableSR-200 | 25.63 | 0.302 | 133.40 | 5.76 | 0.528 | 61.11 | 0.366 |
| ResShift-15 | 26.34 | 0.346 | 149.54 | 6.87 | 0.542 | 56.06 | 0.375 |
| PASD-20 | 26.67 | 0.344 | 122.30 | 6.06 | 0.519 | 62.92 | 0.404 |
| SeeSR-50 | 25.24 | 0.301 | 125.42 | 5.39 | 0.670 | 69.82 | 0.540 |
| SeeSR (UniPC-10) | 25.86 | 0.281 | 122.41 | 5.53 | 0.577 | 67.12 | 0.476 |
| SeeSR (DPMSolver-10) | 25.90 | 0.281 | 122.46 | 5.54 | 0.581 | 67.12 | 0.478 |
| SinSR-1 | 26.16 | 0.308 | 142.44 | 5.75 | 0.630 | 60.96 | 0.399 |
| AddSR-1 | 23.12 | 0.309 | 132.01 | 5.54 | 0.552 | 67.14 | 0.488 |
| OSEDiff-1 | 25.15 | 0.292 | 123.49 | 5.63 | 0.668 | 68.99 | 0.474 |
| TAD-SR-1 | 24.50 | 0.304 | 118.38 | 5.13 | 0.676 | 69.02 | 0.526 |

Table 2: Quantitative comparison with the state of the art on the RealLR200 dataset. The best and second best results are highlighted in bold and italic. Note that since the RealLR200 dataset lacks high-resolution images, we only computed non-reference metrics.

| Methods | NIQE ↓ | CLIPIQA ↑ | MUSIQ ↑ | MANIQA ↑ |
|---|---|---|---|---|
| BSRGAN | 4.38 | 0.570 | 64.87 | 0.369 |
| RealESRGAN | 4.20 | 0.542 | 62.93 | 0.366 |
| LDL | 4.38 | 0.509 | 60.95 | 0.327 |
| FeMaSR | 4.34 | 0.655 | 64.24 | 0.410 |
| StableSR-200 | 4.25 | 0.592 | 62.89 | 0.367 |
| ResShift-15 | 6.29 | 0.647 | 60.25 | 0.418 |
| PASD-20 | 4.18 | 0.620 | 66.35 | 0.419 |
| SeeSR-50 | 4.16 | 0.662 | 68.63 | 0.491 |
| SeeSR (UniPC-10) | 4.25 | 0.601 | 66.90 | 0.433 |
| SeeSR (DPMSolver-10) | 4.28 | 0.603 | 66.92 | 0.435 |
| SinSR-1 | 5.62 | 0.697 | 63.85 | 0.445 |
| AddSR-1 | 4.06 | 0.585 | 66.86 | 0.418 |
| OSEDiff-1 | 4.05 | 0.674 | 69.61 | 0.444 |
| TAD-SR-1 | 3.95 | 0.674 | 69.48 | 0.482 |
Review
Rating: 6

This paper introduces a time-aware diffusion distillation method named TAD-SR, which enables the student model to focus on high-frequency image details at smaller time steps and eliminates inherent biases in score distillation sampling. The authors also design a time-aware discriminator that fully leverages the teacher model’s knowledge by injecting time information to differentiate between real and synthetic data. Experimental results demonstrate the effectiveness and efficiency of the proposed method.

Strengths

  • The paper is well-written.
  • Experimental results demonstrate that the proposed method achieves state-of-the-art performance with high efficiency.

Weaknesses

  • The evaluation is not comprehensive. Some image fidelity metrics are lacking, such as PSNR and SSIM on ImageNet-Test, which the competing methods ResShift and SinSR both report.

  • The improvement over the previous single-step distillation method SinSR is minor. Considering that LPIPS is a crucial metric for perceptual quality, the increase from 0.221 to 0.227 represents a notable drop in quality, not a slight one.

  • The ablation study examines only the presence or absence of the discriminator, neglecting other important aspects—for example, the number of scales used in the discriminator.

Questions

Please refer to the weakness part.

Comment

Thank you for your comments and feedback. We address your concerns here.

Q1: The evaluation is not comprehensive. Some image fidelity metrics are lacking, such as PSNR and SSIM on ImageNet-Test, which the competing methods ResShift and SinSR both report.

A1: Thank you for your suggestion. We have incorporated PSNR and SSIM metrics into the evaluation on the ImageNet dataset. However, we want to emphasize that these two metrics are secondary in real-world super-resolution tasks [1][2][3]. For instance, while methods such as LDM, ResShift, and DASR achieve higher PSNR and SSIM scores than others, the images they generate tend to appear smoother or blurrier (as shown in Figures 6 and 12). This discrepancy likely arises because PSNR and SSIM measure image differences in pixel space, whereas humans and other metrics evaluate images based on perceptual quality. Therefore, PSNR and SSIM should be considered reference points only, which aligns with observations in previous studies [1][2][3].

Table 1: Quantitative results of different methods on the dataset of ImageNet-Test. The best and second best results are highlighted in bold and italic. ∗ indicates that the result was obtained by replicating the method in the paper.

| Methods | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CLIPIQA ↑ | MUSIQ ↑ |
|---|---|---|---|---|---|
| ESRGAN | 20.67 | 0.448 | 0.485 | 0.451 | 43.615 |
| RealSR-JPEG | 23.11 | 0.591 | 0.326 | 0.537 | 46.981 |
| BSRGAN | 24.42 | 0.659 | 0.259 | 0.581 | 54.697 |
| SwinIR | 23.99 | 0.667 | 0.238 | 0.564 | 53.790 |
| RealESRGAN | 24.04 | 0.665 | 0.254 | 0.523 | 52.538 |
| DASR | 24.75 | 0.675 | 0.250 | 0.536 | 48.337 |
| LDM-15 | 24.89 | 0.670 | 0.269 | 0.512 | 46.419 |
| ResShift-15 | 25.01 | 0.677 | 0.231 | 0.592 | 53.660 |
| SinSR-1 | 24.56 | 0.657 | 0.221 | 0.611 | 53.357 |
| SinSR*-1 | 24.59 | 0.659 | 0.231 | 0.599 | 52.462 |
| DMD*-1 | 24.05 | 0.629 | 0.246 | 0.612 | 54.124 |
| TAD-SR-1 | 23.91 | 0.641 | 0.227 | 0.652 | 57.533 |

Q2: The improvement over the previous single-step distillation method SinSR is minor. Considering that LPIPS is a crucial metric for perceptual quality, the increase from 0.221 to 0.227 represents a notable drop in quality, not a slight one.

A2: Thank you for your feedback. We replicated SinSR using its open-source code, and the evaluation results are shown in the third-to-last row of Table 1. Compared to the replicated SinSR, our method improves the LPIPS metric, decreasing it from 0.231 to 0.227. Additionally, our method demonstrates significant improvements over SinSR in most other metrics. The table below lists the percentage improvements achieved by our method compared to SinSR.

Table 2: Quantitative comparison with the SinSR method on super-resolution tasks.

| Method | ImageNet-Test LPIPS ↓ | ImageNet-Test CLIPIQA ↑ | ImageNet-Test MUSIQ ↑ | RealSR CLIPIQA ↑ | RealSR MUSIQ ↑ | RealSet65 CLIPIQA ↑ | RealSet65 MUSIQ ↑ |
|---|---|---|---|---|---|---|---|
| SinSR* | 0.231 | 0.599 | 52.462 | 0.691 | 60.865 | 0.712 | 62.575 |
| TAD-SR | 0.227 (+1.7%) | 0.652 (+8.8%) | 57.533 (+9.7%) | 0.741 (+7.2%) | 65.701 (+7.9%) | 0.734 (+3%) | 67.5 (+7.9%) |

Comment

Q3: The ablation study examines only the presence or absence of the discriminator, neglecting other important aspects—for example, the number of scales used in the discriminator.

A3: Thank you for your valuable suggestion. We also conducted ablation experiments to evaluate the impact of using multi-scale features in the discriminator. We designed an experiment using only the features of the last layer of the diffusion model for discrimination, denoted as "w/o multi-scale". Now, our analysis of the discriminator includes comparisons with and without the discriminator, the incorporation of temporal information, and the use of multi-scale features. From Table 3, it can be seen that the discriminator utilizing multi-scale features and incorporating temporal information achieves the best performance.

Table 3: Ablation studies of our proposed discriminator on RealSR and RealSet65 benchmarks. The best results are highlighted in bold.

| Settings | RealSet65 CLIPIQA ↑ | RealSet65 MUSIQ ↑ | RealSR CLIPIQA ↑ | RealSR MUSIQ ↑ |
|---|---|---|---|---|
| Our discriminator | 0.734 | 67.500 | 0.741 | 65.701 |
| w/o time-aware | 0.729 | 66.904 | 0.711 | 63.550 |
| w/o multi-scale | 0.724 | 67.330 | 0.722 | 65.205 |
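To make the two ablated components concrete, here is a hedged sketch of a discriminator head that pools features from several backbone depths and injects a time embedding; the architecture and all names are our illustration of the idea, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TimeAwareMultiScaleHead(nn.Module):
    def __init__(self, feat_dims=(128, 256, 512), t_dim=128):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, t_dim), nn.SiLU())
        # One linear real/fake head per feature scale ("multi-scale").
        self.heads = nn.ModuleList(nn.Linear(d + t_dim, 1) for d in feat_dims)

    def forward(self, feats, t):
        # feats: list of (B, C_i, H_i, W_i) features from different depths; t: (B,)
        te = self.t_embed(t.float().unsqueeze(-1))   # "time-aware" conditioning
        logits = [
            head(torch.cat([f.mean(dim=(2, 3)), te], dim=-1))
            for f, head in zip(feats, self.heads)
        ]
        return torch.stack(logits).mean(dim=0)       # average over scales
```

Dropping the `t` input corresponds to the "w/o time-aware" row, and keeping only the last feature scale corresponds to "w/o multi-scale".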

References

[1] Wang, J., Yue, Z., Zhou, S., Chan, K. C., & Loy, C. C. (2024). Exploiting diffusion prior for real-world image super-resolution. International Journal of Computer Vision, 1-21.

[2] Xie, R., Tai, Y., Zhao, C., Zhang, K., Zhang, Z., Zhou, J., ... & Yang, J. (2024). Addsr: Accelerating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717.

[3] Wu, R., Yang, T., Sun, L., Zhang, Z., Li, S., & Zhang, L. (2024). Seesr: Towards semantics-aware real-world image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 25456-25467).

Comment

Dear Reviewer gXos:

The discussion period between the authors and the reviewer is nearing its end, and we kindly request that you review our clarifications and revisions. If our response addresses your concerns, we hope you can reconsider your score.

Thank you once again for your time and consideration.

Best Wishes!

Authors of Submission 1713

Review
Rating: 5

This paper proposes a method to distill a super-resolution diffusion model into one step by combining three losses: a direct regression loss, a GAN loss, and a modified score distillation loss. The main contribution is the score distillation part.

Strengths

  1. The paper targets an important problem: the distillation of SR diffusion models. While diffusion distillation is a popular research area, it is interesting to see insights designed specifically for SR models.

  2. The paper introduces a novel technique to reduce the bias of the score estimate of generated samples in SDS, which fits the insights from SR particularly well.

  3. Empirical results show promising improvements.

Weaknesses

  1. The biggest concern is insufficient baselines. The method is compared against a large number of non-diffusion-based methods and diffusion-based iterative methods, but it lacks comparisons against the most closely related methods: other diffusion distillation algorithms. This method distills a pre-trained SR diffusion model into one step with some SR-specific design, but there are many distillation methods designed for general diffusion models, such as consistency models and the family of distribution matching distillation. The authors should run controlled experiments with the same teacher model and different algorithms to emphasize the relative advantage. For example, I personally found that CM works well in distilling SR models into one step, and that DMD and its variants can distill the more complicated T2I models into one step. Their relative performance on SR diffusion is what we really care about.

  2. It seems like the method requires the teacher model to generate clean samples, which can be computationally expensive, even if the data is pre-computed offline.

  3. The background of SDS and how the bias is reduced are unclear to readers without prior knowledge.

Questions

N/A

Comment

Q2: It seems like the method requires the teacher model to generate clean samples, which can be computationally expensive, even if the data is pre-computed offline.

A2: The generation of samples by teacher models does incur additional computational costs; however, these costs remain within an acceptable range, particularly when generating samples offline. We compare our method with SinSR in terms of training time. As shown in Table 3, when generating clean samples online, our training time is only two hours longer than that of SinSR, and the distillation process can be completed within a day. Moreover, generating samples offline further reduces both training time and computational resource consumption. Additionally, we compare the GPU memory usage of our method during training between offline generation clean samples and online generation clean samples. The results show that the online generation of clean samples increases GPU memory usage by less than 5%, which is within an acceptable range. Furthermore, because SinSR requires learning a bidirectional mapping between noise and images, its GPU memory usage is higher than that of our method.

Table 3: A comparison of the training cost on 8 NVIDIA V100 GPUs.

| Method | Num of Iters | s/Iter | Training Time | GPU memory (GB) |
|---|---|---|---|---|
| SinSR | 30k | 2.57 | ~21 hours | 17.30 |
| Ours (Online) | 30k | 2.79 | ~23 hours | 11.72 |
| Ours (Offline) | 30k | 1.05 | ~9 hours | 11.17 |
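A minimal sketch of the offline variant referenced above (the `teacher.sample` interface and the caching format are assumptions, not the authors' code):

```python
import torch

@torch.no_grad()
def cache_teacher_samples(teacher, loader, out_path="teacher_samples.pt"):
    # Run the teacher's multi-step sampler once, up front, so the distillation
    # loop only loads cached clean latents instead of re-sampling every step.
    cache = [teacher.sample(batch) for batch in loader]
    torch.save(torch.cat(cache, dim=0), out_path)
```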

Q3: The background of SDS and how the bias is reduced are unclear to readers without prior knowledge.

A3: Thank you for your valuable suggestion. In the revised manuscript, we will include more background information on SDS and provide a clearer explanation of how we address deviations in SDS.

Comment

Dear Reviewer eiDx:

The discussion period between the authors and the reviewer is nearing its end, and we kindly request that you review our clarifications and revisions. If our response addresses your concerns, we hope you can reconsider your score.

Thank you once again for your time and consideration.

Best Wishes!

Authors of Submission 1713

Comment

Thank you for your comments and feedback. We address your concerns here.

Q1: The biggest concern is insufficient baselines. The method is compared against a large number of non-diffusion-based methods and diffusion-based iterative methods, but it lacks comparisons against the most closely related methods: other diffusion distillation algorithms. This method distills a pre-trained SR diffusion model into one step with some SR-specific design, but there are many distillation methods designed for general diffusion models, such as consistency models and the family of distribution matching distillation. The authors should run controlled experiments with the same teacher model and different algorithms to emphasize the relative advantage. For example, I personally found that CM works well in distilling SR models into one step, and that DMD and its variants can distill the more complicated T2I models into one step. Their relative performance on SR diffusion is what we really care about.

A1: Thank you for your valuable suggestion. We applied both consistency models and distribution matching distillation (DMD) to SR tasks for evaluation. Specifically, we employed consistency distillation under the L2 loss and set the same boundary conditions as consistency models: $c_{skip}(t) = \frac{\sigma_{data}^2}{(\eta_t-\eta_0)^2 + \sigma_{data}^2}$, $c_{out}(t) = \frac{\sigma_{data}(\eta_t - \eta_0)}{\sqrt{\sigma_{data}^2+\eta_t^2}}$, which clearly satisfies $c_{skip}(0) = 1$ and $c_{out}(0) = 0$. For DMD, we alternately update the fake score network and the generator, with the weights of the distribution matching distillation loss and the regression loss set to 1.
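Read literally, the boundary conditions above can be implemented as follows ($\sigma_{data} = 0.5$ and the $\eta_t$ notation follow the consistency-model convention; this is a sketch, not the exact training code):

```python
def c_skip(eta_t, eta_0, sigma_data=0.5):
    # sigma_data = 0.5 is the usual consistency-model default (an assumption here).
    return sigma_data**2 / ((eta_t - eta_0) ** 2 + sigma_data**2)

def c_out(eta_t, eta_0, sigma_data=0.5):
    return sigma_data * (eta_t - eta_0) / (sigma_data**2 + eta_t**2) ** 0.5

# At t = 0 we have eta_t = eta_0, so c_skip = 1 and c_out = 0: the consistency
# parameterization f(z, t) = c_skip(t) * z + c_out(t) * F(z, t) then reduces to
# the identity on clean samples, satisfying the required boundary condition.
```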

The experimental results are presented in Table 1. As shown in the table, the high-resolution images generated by the model using consistency distillation are significantly inferior to those produced by other super-resolution methods across all metrics, which appears to contradict the reviewer's findings. We speculate that this discrepancy may be due to ResShift modifying the standard Markov chain of the diffusion model, making it difficult to apply consistency distillation directly. While applying DMD to super-resolution tasks has yielded promising results, it still falls short of our method. To further validate the effectiveness of our approach, we also transferred it to an unconditional generation task. The results of this evaluation on CIFAR-10 are presented in Table 2. As shown, our method achieves competitive performance, even in unconditional generation tasks, outperforming both consistency models and DMD.

Table 1: Quantitative results of different SR methods. The best and second best results are highlighted in bold and italic. ∗ indicates that the result was obtained by replicating the method in the paper.

| Methods | ImageNet-test LPIPS ↓ | ImageNet-test CLIPIQA ↑ | ImageNet-test MUSIQ ↑ | RealSR CLIPIQA ↑ | RealSR MUSIQ ↑ | RealSet65 CLIPIQA ↑ | RealSet65 MUSIQ ↑ |
|---|---|---|---|---|---|---|---|
| LDM-15 | 0.269 | 0.512 | 46.419 | 0.384 | 49.317 | 0.427 | 47.488 |
| ResShift-15 | 0.231 | 0.592 | 53.660 | 0.596 | 59.873 | 0.654 | 61.330 |
| SinSR-1 | 0.221 | 0.611 | 53.357 | 0.689 | 61.582 | 0.715 | 62.169 |
| SinSR*-1 | 0.231 | 0.599 | 52.462 | 0.691 | 60.865 | 0.712 | 62.575 |
| DMD*-1 | 0.246 | 0.612 | 54.124 | 0.709 | 63.610 | 0.723 | 66.177 |
| CD-L2*-1 | 0.568 | 0.192 | 27.002 | 0.230 | 30.578 | 0.262 | 35.101 |
| TAD-SR-1 | 0.227 | 0.652 | 57.533 | 0.741 | 65.701 | 0.734 | 67.500 |

Table 2: Generative performance on unconditional CIFAR-10. The best and second best results are highlighted in bold and italic.

| Method | DDPM | DDIM | EDM (Teacher) | DPM-solver2 | UniPC | CD-L2 | CD-LPIPS | DEQ | DMD | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| NFE ↓ | 1000 | 50 | 35 | 12 | 8 | 1 | 1 | 1 | 1 | 1 |
| FID ↓ | 3.17 | 4.67 | 1.88 | 5.28 | 5.10 | 7.90 | 3.55 | 6.91 | 3.77 | 2.31 |

Comment

We thank all reviewers for their questions and constructive feedback. Based on these suggestions, we have made significant revisions to the manuscript. Key changes in the revised submission include:

  1. We have applied DMD to super-resolution tasks and compared it with our method. The results are shown in Tables 2 and 3. (Reviewer eiDX, Reviewer Rnto)

  2. We have included a more detailed explanation of the background knowledge related to score distillation sampling (SDS) technology in Section 2. (Reviewer eiDX)

  3. We have incorporated PSNR and SSIM metrics for evaluation in the main experiments included in the revised manuscript. (Reviewer gXos, Reviewer B832)

  4. We have conducted ablation experiments on the multi-scale features utilized by the discriminator, with the results presented in Table 9. (Reviewer gXos)

  5. In addition to applying our method to distill the diffusion-based SR model ResShift trained from scratch, we also distilled the SD-based SR model SeeSR and compared it with other SD-based methods, such as OSEDiff. The results are shown in Tables 11, 12, and 13. (Reviewer uBAa, Reviewer Rnto)

  6. We visualized the frequency spectra of the reconstruction results obtained by different methods through the Fourier transform to highlight the advantage of our method in generating high-frequency details. The results are presented in Figure 10. (Reviewer uBAa)

  7. We have compared the inference time of TAD-SR distillation across different super-resolution models with their respective baseline methods, and the results are presented in Tables 6 and 14. (Reviewer uBAa, Reviewer Rnto)

  8. We have provided more qualitative comparisons that contain fine details or small textures in Figures 9 and 15. (Reviewer uBAa)

  9. We have carefully revised the motivation and methodology sections of the paper to enhance readability and clarity. Furthermore, we remain committed to ongoing revisions of our manuscript to enhance its readability and comprehensibility. (Reviewer B832, Reviewer Rnto)

  10. We have provided the training process of the TAD-SR algorithm in the appendix to enhance the clarity of our method. (Reviewer B832, Reviewer Rnto)

  11. We utilized samplers such as UniPC and DPMSolver to accelerate the teacher model and compared them with our method. The experimental results are presented in Tables 11, 12, and 13. (Reviewer Rnto)

  12. We have included a discussion in the paper on the limitations of our proposed method and potential directions for future research. (Reviewer Rnto)

We hope that these changes strengthen the state of our submission.

Comment

We sincerely appreciate your valuable feedback and insightful suggestions, which have greatly helped us improve our manuscript. We have carefully addressed your concerns in our response and revised the manuscript accordingly. 

We understand that you have a busy schedule, but we would be grateful for any additional feedback or response you may have regarding our paper, as reviewer input is crucial for improving the quality and clarity of our work. Alternatively, if our revisions adequately address the issues raised, we kindly request a reconsideration of the score based on the clarifications and improvements made. 

Thank you once again for your time and consideration.

Best Wishes!

Authors of Submission 1713

AC Meta-Review

This paper receives mixed ratings of (5, 5, 5, 6, 6). The reviewers generally agree that the area this paper explores is interesting and meaningful and appreciate the simplicity of the method, while having concerns about the comparisons with and improvements over existing works. The AC carefully read the paper, reviews, and rebuttal, and agrees with the reviewers overall. In particular, in the authors' response, the improvement over OSEDiff cannot be regarded as significant given the slower speed. As a result, the effectiveness of the method could not be fully verified. While the AC agrees that this paper is an interesting exploration, the AC regretfully recommends rejection.

Additional Comments from the Reviewer Discussion

Reviewers raised concerns mainly about the comparisons and improvements, and the authors managed to resolve most of them. After reading the paper, reviews, and rebuttal, the AC feels that the effectiveness of the proposed method cannot be convincingly verified, and hence recommends rejection.

Final Decision

Reject