PaperHub
Overall score: 7.3/10 · Poster · 4 reviewers
Individual ratings: 4, 5, 4, 5 (min 4, max 5, std 0.5)
Confidence: 3.8
Novelty: 2.5 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

One-Step Diffusion-Based Image Compression with Semantic Distillation

Submitted: 2025-05-08 · Updated: 2025-10-29


Keywords
image compression, generative compression, diffusion models, one-step diffusion, extremely low bitrate

Reviews and Discussion

Review (Rating: 4)

The paper introduces OneDC, a one-step diffusion-based image compression method. The authors distill a multi-step diffusion model for single-step generation, yielding very fast compression. Additionally, the authors replace text-based conditioning with semantic conditioning from a VQ module called the hyperprior (which is further distilled from a generative tokenizer). They train LoRAs on the diffusion model, using hybrid training in both pixel space and latent space. Experiments show that OneDC achieves state-of-the-art performance.

Strengths and Weaknesses

Strengths

  • The visual results seem very strong quality-wise and very compelling.

  • Practically speaking, the move towards one-step diffusion compression is very meaningful due to the decreased latency.

  • The ablation studies are thorough and support the authors’ claims.

Weaknesses

  • The evaluated bitrate range is narrow (between around 0.01 and 0.05), and the method is not tested at extremely low or higher bitrates (L232). For example, PerCo SD [1] tests BPPs of 0.0019–0.12, DDCM [2] tests BPPs of around 0.003–0.2 for the same-sized images, and DiffC tests between 0.002 and 1. This raises questions about generalizability across compression regimes.

  • From the appendix, the authors claim (L93) that the rank-64 LoRA on the U-Net added 928M parameters (in addition to the 394M parameters for the other decoder parts). However, the base model used, SD1.5 (L219), includes only 860M parameters. The authors should explain how using LoRA here is faster and preserves rich priors (L195), since it seems like the added weights amount to training an entirely new denoiser, and the weights are no longer low rank. Additionally, how long did the training actually take (the 800K Stage 1 training steps and the following 1M Stage 2 training steps)? The size of this heavy model should also be compared to previous works; e.g., PerCo only fine-tunes the model, while DDCM uses it in a zero-shot manner.

  • The paper does not discuss the implications of training a new model per bitrate (L203), which is a significant limitation for real-world use, especially considering the size of the model and the training time.

  • Some of the writing is unclear and not very easy to follow.

    • The hyperprior is not a very commonly used module, and therefore the authors should give more background on it.

    • Figure 3 is confusing. g_s, which is part of the decoder, is marked as part of the compression module. I understand that it is trained; however, the visualization is confusing. An additional block might help here.

    • The paper includes multiple comparisons to PerCo, which cannot be done as their internal model was not released. The authors only state that it is "PerCo (SD)" in the supplementary text. This should be clearly stated in the main text.

  • The authors claim in several parts (e.g. L2, L28) that traditional codecs often produce blurry details at low bitrate or focus on perception at the cost of reconstruction fidelity. This is known and expected due to the rate-distortion-perception tradeoff [3] and should be cited accordingly.

  • Several claims are overstated or misleading.

    • The “40% bitrate reduction” is benchmarked against a single metric at a time, which gives little information with regard to the known rate-distortion-perception tradeoff [3] (e.g., maybe OneDC is better at optimizing distortion but substantially hurts perception while other methods hurt it less?). Additionally, these numbers are not compared to SOTA baselines (L246) (e.g. [2]). Tables 1 and 2 include percentage changes, but this is not clearly conveyed to the reader and can be confusing. As the scale of the changes seems small from the graphs, a table of concrete numbers would help here.

    • L8 and L54 seem to imply that no other works have compressed using diffusion models without text, which is untrue (as you mention yourself, [4] does not use them, nor does [2], and surely many other works).

  • Important metrics like PSNR are missing from the main text (Fig. 5) and only appear in the appendix, limiting fair comparison to non-latent-space methods such as MS-ILLM. This would expose the major limitation of latent-space models in compression, namely the loss incurred in pixel space.

  • Regarding L229: FID on Kodak can be computed using smaller patches (which is quite commonplace, see e.g. [2]). Your FID is already calculated using patches, as noted in the appendix. This should be noted in the main text.

  • Missing citation:

    • Foundational diffusion works like DDPM and Sohl-Dickstein’s original paper are not cited.

    • Consistency distillation can be considered for citation here as well due to the distillation of a diffusion model for single-step generation.

[1] N. Körber. “PerCo (SD): Open Perceptual Compression.” Workshop on Machine Learning and Compression, NeurIPS 2024.

[2] Ohayon, Guy, Hila Manor, Tomer Michaeli, and Michael Elad. "Compressed Image Generation with Denoising Diffusion Codebook Models." arXiv preprint arXiv:2502.01189 (2025).

[3] Blau, Yochai, and Tomer Michaeli. "Rethinking lossy compression: The rate-distortion-perception tradeoff." In International Conference on Machine Learning, pp. 675-685. PMLR, 2019.

[4] Zhiyuan Li, Yanhui Zhou, Hao Wei, Chenyang Ge, and Jingwen Jiang. Towards extreme image compression with latent feature guidance and diffusion prior. IEEE Transactions on Circuits and Systems for Video Technology, 2024.

Questions

  • Considering the model’s size, despite it being fast – how much memory does it take at inference time?
  • The results of the fine-tuned MS-ILLM and PerCo baffle me. The original MS-ILLM checkpoints include vlo1 and vlo2, which produce much worse results, while PerCo here shows very poor performance compared to their own reported numbers. Could you please explain this inconsistency with prior work?
  • Why focus on such a low bitrate range?
  • Did you indeed add 928M parameters using the LoRA on a 860M model?
  • Can the model maybe be finetuned easily and fast to support additional bitrates?

Limitations

No. The paper does not discuss the implications of training a new model per bitrate (L203), which is a significant limitation for real-world use, especially considering the size of the model and the training time. Additionally, as mentioned, any latent space-based method suffers from limitations with regards to pixel space distortion.

Final Justification

The authors responded to my concerns. If they fix/mention what was discussed, the paper should be accepted in my opinion. For example:

  • They trained on a wider bitrate range.
  • They clarified details such as the parameter count and promise to clarify details in the final version (such as the use of percentages, which seems to have confused another reviewer as well, etc.).
  • They promise to expand the limitations section.
  • They promise to include concrete evaluation numbers.
  • They promise to add missing citations and comparisons to SOTA related works.

Formatting Issues

none

Author Response

Many thanks for your valuable comments and the recognition of our contributions. We hope the following responses address your concerns.

1. Wider bitrate range

We have trained additional OneDC models spanning a wider bitrate range (0.0034–0.1115) and compare them with PerCo (SD) on full-resolution Kodak images:

| Bpp (OneDC) | LPIPS (OneDC)↓ | PSNR (OneDC)↑ | Bpp (PerCo SD) | LPIPS (PerCo SD)↓ | PSNR (PerCo SD)↑ |
|---|---|---|---|---|---|
| 0.0034 | 0.380 | 17.141 | 0.0031 | 0.529 | 15.662 |
| 0.0101 | 0.220 | 20.631 | | | |
| 0.0165 | 0.183 | 21.513 | | | |
| 0.0245 | 0.154 | 22.220 | 0.0324 | 0.299 | 19.344 |
| 0.0354 | 0.133 | 22.979 | | | |
| 0.0506 | 0.114 | 23.620 | | | |
| 0.0775 | 0.096 | 24.396 | | | |
| 0.1115 | 0.083 | 25.248 | 0.1261 | 0.141 | 22.967 |

These results confirm that OneDC maintains strong performance across a wide bitrate range, including both extremely low and moderately high bitrate settings. We will include results on extended bitrate range in the final version.

2. Comparison with SOTA DDCM

To ensure a fair comparison with DDCM and enable FID calculation on the Kodak dataset, we follow DDCM's evaluation protocol: center-cropping and resizing images to a fixed resolution. The test datasets include Kodak (512×512) and CLIC2020 (512×512, 768×768). All DDCM numbers are taken from their paper.
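
As a side note for readers who wish to reproduce this protocol, the snippet below sketches one way to cut reference and reconstructed images into non-overlapping patches before passing them to any standard FID implementation. The helper name and the use of `torch.Tensor.unfold` are our own illustration, not code from the paper or from DDCM.

```python
import torch

def to_patches(images: torch.Tensor, patch: int = 64) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into non-overlapping patch x patch
    tiles and return them as (B * num_patches, C, patch, patch).
    H and W are assumed to be divisible by `patch` (e.g. 512 = 8 * 64)."""
    b, c, h, w = images.shape
    tiles = images.unfold(2, patch, patch).unfold(3, patch, patch)
    # tiles: (B, C, H//patch, W//patch, patch, patch)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, patch, patch)

# Both the reference and the reconstructed 512x512 images are patchified this
# way, and FID is then computed between the two resulting patch sets with any
# standard FID implementation.
```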

  • The table below shows the absolute numbers on the Kodak (512x512 resolution, FID calculated on 64x64 patches).

    | Bpp (OneDC) | FID (OneDC)↓ | LPIPS (OneDC)↓ | PSNR (OneDC)↑ | Bpp (DDCM) | FID (DDCM)↓ | LPIPS (DDCM)↓ | PSNR (DDCM)↑ |
    |---|---|---|---|---|---|---|---|
    | 0.0034 | 54.292 | 0.394 | 16.756 | | | | |
    | 0.0109 | 36.862 | 0.223 | 20.259 | | | | |
    | 0.0177 | 33.075 | 0.183 | 21.219 | | | | |
    | 0.0257 | 30.947 | 0.154 | 21.970 | 0.030 | 32.031 | 0.222 | 22.066 |
    | 0.0369 | 29.103 | 0.132 | 22.718 | 0.038 | 29.117 | 0.190 | 22.551 |
    | 0.0524 | 27.428 | 0.113 | 23.363 | 0.050 | 25.647 | 0.161 | 23.013 |
    | 0.0795 | 25.452 | 0.095 | 24.202 | 0.095 | 24.215 | 0.138 | 23.606 |
    | 0.1139 | 23.448 | 0.082 | 25.021 | 0.149 | 23.199 | 0.124 | 24.069 |

    The absolute numbers on CLIC2020 are omitted here due to rebuttal character constraint.

  • For a more comprehensive comparison, we also calculate the relative BD-Rate using PerCo as the anchor:

    • Kodak (512x512 resolution, FID calculated on 64x64 patches):

      | Method | BD-Rate (FID)↓ | BD-Rate (LPIPS)↓ | BD-Rate (PSNR)↓ |
      |---|---|---|---|
      | PerCo | 0.00% | 0.00% | 0.00% |
      | DDCM | -60.18% | -57.32% | -73.76% |
      | OneDC | -65.30% | -81.92% | -77.13% |
    • CLIC2020 (512x512 resolution, FID calculated on 128x128 patches):

      | Method | BD-Rate (FID)↓ | BD-Rate (LPIPS)↓ | BD-Rate (PSNR)↓ |
      |---|---|---|---|
      | PerCo | 0.00% | 0.00% | 0.00% |
      | DDCM | -65.97% | -61.58% | -77.37% |
      | OneDC | -69.94% | -84.51% | -86.11% |
    • CLIC2020 (768x768 resolution, FID calculated on 128x128 patches):

      | Method | BD-Rate (FID)↓ | BD-Rate (LPIPS)↓ | BD-Rate (PSNR)↓ |
      |---|---|---|---|
      | PerCo | 0.00% | 0.00% | 0.00% |
      | DDCM | -52.20% | -53.04% | -86.23% |
      | OneDC | -93.37% | -85.98% | -89.16% |

We will add all absolute numbers and the relative BD-rate comparisons in the revision.

3. LoRA parameter count

We appreciate the reviewer's detailed observation and would like to clarify a misunderstanding. The base U-Net from SD1.5 indeed contains ~860M parameters. Our LoRA tuning (rank 64) adds only ~68M trainable parameters, bringing the total to ~928M during training. However, since the LoRA weights can be merged into the base model at inference, the final U-Net remains ~860M. We will revise the paper to clarify this point.
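
As a rough, self-contained illustration of this accounting (not the authors' code), the sketch below shows a rank-64 LoRA wrapper around a frozen linear layer: the adapter adds only 2·r·d parameters per adapted weight, and the update can be folded back into the base weight at inference, so the deployed parameter count is unchanged.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a rank-r update: W_eff = W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 64, alpha: float = 64.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                    # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the low-rank update into the frozen weight for inference, so the
        deployed layer has exactly the original parameter count."""
        self.base.weight += self.scale * (self.B @ self.A)
        return self.base

layer = LoRALinear(nn.Linear(768, 768), r=64)
extra = layer.A.numel() + layer.B.numel()              # 2 * 64 * 768 = 98,304
print(f"trainable LoRA parameters for this layer: {extra:,}")
```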

4. Training time

Stage I is trained for 6 days and Stage II for 12 days on 4×A100 GPUs, since we use high-resolution patches for training.

5. Comparison of model size and memory usage

We compare OneDC with PerCo and DDCM on the Kodak dataset under DDCM’s test setting (center crop and resize to 512×512, FID on 64×64 patches). The memory usage of PerCo (SD) and DDCM is calculated with their open-sourced code.

| Method | Params | BD-Rate (FID)↓ | BD-Rate (LPIPS)↓ | BD-Rate (PSNR)↓ | Decoding steps | Memory |
|---|---|---|---|---|---|---|
| PerCo (SD) | 3.8B + 340M + 955M* | 0.00% | 0.00% | 0.00% | 20 | 22220 MB |
| DDCM | 950M | -60.18% | -57.32% | -73.76% | 1000 | 4186 MB |
| OneDC | 1.4B | -65.30% | -81.92% | -77.13% | 1 | 8038 MB |

* The open-sourced PerCo (SD) includes an additional 3.8B BLIP2 caption model and a 340M CLIP text embedding model.

We will include the model size comparison and memory usage in our final version.

6. On training a separate model per bitrate

We appreciate the reviewer’s comment. This paper focuses on establishing an effective one-step diffusion framework for compression, so we currently use fixed-rate models for clarity and controlled analysis, like many other generative codecs [1][2][4][5]. We believe that OneDC is compatible with existing techniques for variable bitrate (e.g., bottleneck scaling[7] and adjustable quantization[8]), which we leave as future work.

7. Hyperprior Usage

The hyperprior was introduced by Ballé et al. [6] and is widely adopted in methods such as HiFiC [1], MS-ILLM [2], GLC [3], and DiffEIC [4]. In OneDC, we extend its role to also provide semantic guidance for generative decoding. We will add a brief background and proper citations in the final version to improve clarity.
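
For readers less familiar with this component, the schematic below summarizes the two-level transform-coding structure of a hyperprior codec in the spirit of Ballé et al. [6]. The class and argument names are generic placeholders rather than the authors' implementation, and OneDC's additional semantic decoding of the hyper latent is only indicated in the comment.

```python
import torch.nn as nn

class HyperpriorCodec(nn.Module):
    """Schematic two-level transform-coding pipeline (Balle et al., 2018).
    g_a / g_s are the main analysis / synthesis transforms, h_a / h_s the hyper
    transforms; z is a compact side channel whose decoded output predicts the
    entropy model of y. In OneDC, z-hat is additionally decoded into semantic
    context for the one-step diffusion generator."""
    def __init__(self, g_a, g_s, h_a, h_s, quantize):
        super().__init__()
        self.g_a, self.g_s, self.h_a, self.h_s = g_a, g_s, h_a, h_s
        self.quantize = quantize

    def forward(self, x):
        y = self.g_a(x)                   # main latent (low-level details)
        z = self.h_a(y)                   # hyper latent (side information)
        z_hat = self.quantize(z)
        entropy_params = self.h_s(z_hat)  # predicted distribution parameters for y
        y_hat = self.quantize(y)
        x_hat = self.g_s(y_hat)           # reconstruction
        return x_hat, (y_hat, z_hat, entropy_params)
```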

8. On the modification of Figure 3

Thanks for your great suggestion. We will separate $g_s$ into its own module and update Figure 3 accordingly in the final version.

9. About PerCo (SD)

We will clarify the details regarding the use of PerCo (SD) in the revised main paper.

10. On the rate–perception–distortion tradeoff

We fully agree with the reviewer's opinion. We will cite the rate–perception–distortion tradeoff when discussing the limitations of traditional codecs at low bitrates.

11. On missing PSNR in the main text

We will include PSNR results in the revised main paper to enable a fair comparison with pixel-domain methods such as MS-ILLM.

12. On missing citations

We will add the appropriate citations in the revision.

13. Regarding the results of fine-tuned MS-ILLM and PerCo

  • MS-ILLM:

    Our paper focuses on the 0.015–0.05 bpp range on the Kodak dataset, while the official MS-ILLM vlo1/vlo2 checkpoints target <0.01 bpp. To ensure a proper comparison under a similar bitrate range, we fine-tuned MS-ILLM using the official code to produce models operating between 0.015–0.045 bpp. These fine-tuned models outperform the original vlo1/vlo2 checkpoints because of their higher bitrate. For bitrates above 0.045 bpp, we use the official MS-ILLM checkpoints.

    We will include BD-Rate comparisons against all official MS-ILLM checkpoints (vlo1/vlo2 and quality levels 1–4) in the revision, using OneDC extended to a broader bitrate range.

  • PerCo (SD):

    Our results align with the PerCo (SD) technical report, which notes slightly degraded performance compared to the original PerCo (not publicly released) on the Kodak and MS-COCO 30K datasets. On CLIC2020 (full resolution), PerCo performs worse due to limited resolution adaptability.

    We will clarify the use of PerCo (SD) in the main paper and add the comparisons to the original PerCo using data extracted from their paper.

14. On focusing on the low bitrate range

We focus on the extremely low bitrate range because existing generative codecs often fail to preserve acceptable visual quality in this setting. It poses challenges in maintaining semantic content and synthesizing fine details, making it a key research frontier as explored in works like DiffEIC, PerCo, and GLC. Our aim is to advance compression capabilities specifically under these challenging conditions. To broaden applicability, we have also extended OneDC to cover higher bitrate range, and will include these results in the final version.

15. On adapting the model to additional bitrates

Yes, we fine-tune the extra bitrate points from the existing model, which converges faster than training from scratch.

16. On claims

Thanks for your reminders. In the revision, we will carefully revise these claims according to the reviewer's comments. Specifically, we will:

  • Revise our statements about bitrate reduction and about the use of text in existing methods.
  • Clarify the meaning of each number in the tables and supplement concrete numbers.

Citations:

[1] High-fidelity generative image compression, NeurIPS 2020

[2] Improving statistical fidelity for neural image compression with implicit local likelihood models, ICML 2023

[3] Generative latent coding for ultra-low bitrate image compression, CVPR 2024

[4] Towards extreme image compression with latent feature guidance and diffusion prior, TCSVT 2024

[5] Towards image compression with perfect realism at ultra-low bitrates, ICLR 2024

[6] Variational image compression with a scale hyperprior, ICLR 2018

[7] Asymmetric gained deep image compression with continuous rate adaptation, CVPR 2021

[8] EVC: Towards Real-Time Neural Image Compression with Mask Decay, ICLR 2023

Comment

I thank the authors for their extensive answer, for considering my feedback, and for including additional experiments. I have a few questions/remarks left:

  • Will you compute FID on Kodak (L229) using patches?
  • Even though the paper focuses on fixed-rate models, will you mention this drawback (and I guess you can relate it to your proposed future work)? Also, you should note the long training currently required per bitrate.
  • I understand you had no space for tables with concrete metric numbers in the rebuttal response; can you provide at least some of them now?
Comment

Thank you for your great follow-up. We sincerely appreciate your engagement and are glad to address the remaining questions:

  1. Yes, as shown in Comparison with SOTA DDCM in our previous response, we computed FID on Kodak using 64×64 patches, following the evaluation setup in DDCM. In the revision, we will cite DDCM, add the full comparisons (including FID results on Kodak using patches), and clearly state this evaluation protocol in the main paper.

  2. Thank you for the kind reminder. In the revision, we will explicitly mention the fixed-rate design and its drawbacks in the Limitation part. The detailed training time required per bitrate will also be noted in the revision. In addition, we will release the full training code to ensure transparency and reproducibility.

  3. Sure. Below we provide the raw results for OneDC aligning with the evaluation protocol of DDCM (512×512 and 768×768 resolutions on CLIC2020, FID computed using 128×128 patches). The results of DDCM and PerCo (SD) are obtained from the DDCM paper.

    • CLIC2020 (512x512 resolution, FID calculated on 128x128 patches):

      | Method | Bpp | FID↓ | LPIPS↓ | PSNR↑ |
      |---|---|---|---|---|
      | PerCo (SD) | 0.033 | 13.896 | 0.287 | 18.111 |
      | | 0.127 | 7.888 | 0.128 | 22.453 |
      | DDCM | 0.030 | 9.459 | 0.186 | 22.630 |
      | | 0.038 | 8.386 | 0.160 | 23.171 |
      | | 0.050 | 7.755 | 0.137 | 23.748 |
      | | 0.095 | 7.340 | 0.116 | 24.472 |
      | | 0.149 | 7.164 | 0.103 | 25.008 |
      | | 0.309 | 6.825 | 0.088 | 25.782 |
      | OneDC (Ours) | 0.0034 | 21.419 | 0.360 | 16.439 |
      | | 0.0112 | 11.597 | 0.183 | 21.197 |
      | | 0.0179 | 10.587 | 0.149 | 22.505 |
      | | 0.0261 | 9.889 | 0.123 | 23.563 |
      | | 0.0369 | 9.041 | 0.104 | 24.454 |
      | | 0.0515 | 8.352 | 0.088 | 25.340 |
      | | 0.0764 | 7.166 | 0.072 | 26.385 |
      | | 0.1072 | 6.135 | 0.060 | 27.395 |
    • CLIC2020 (768x768 resolution, FID calculated on 128x128 patches):

      | Method | Bpp | FID↓ | LPIPS↓ | PSNR↑ |
      |---|---|---|---|---|
      | PerCo (SD) | 0.003 | 30.409 | 0.517 | 15.339 |
      | | 0.032 | 12.869 | 0.269 | 19.018 |
      | | 0.126 | 5.419 | 0.122 | 23.387 |
      | DDCM | 0.007 | 23.862 | 0.404 | 19.672 |
      | | 0.008 | 19.521 | 0.354 | 20.532 |
      | | 0.010 | 15.559 | 0.314 | 21.207 |
      | | 0.014 | 11.362 | 0.262 | 22.116 |
      | | 0.017 | 8.722 | 0.227 | 22.722 |
      | | 0.022 | 6.753 | 0.192 | 23.366 |
      | | 0.042 | 5.051 | 0.156 | 24.136 |
      | | 0.066 | 4.549 | 0.133 | 24.739 |
      | | 0.137 | 4.132 | 0.108 | 25.650 |
      | OneDC (Ours) | 0.0034 | 15.325 | 0.326 | 17.619 |
      | | 0.0098 | 7.953 | 0.172 | 22.378 |
      | | 0.0155 | 7.390 | 0.140 | 23.600 |
      | | 0.0228 | 7.011 | 0.117 | 24.641 |
      | | 0.0326 | 6.538 | 0.099 | 25.532 |
      | | 0.0461 | 5.874 | 0.084 | 26.416 |
      | | 0.0694 | 4.915 | 0.069 | 27.463 |
      | | 0.0993 | 4.046 | 0.058 | 28.471 |

    In addition, we also provide concrete numbers on MS-COCO 30K dataset, following the evaluation protocol of PerCo (512x512 resolution, FID computed using 512x512 patches):

    | Method | Bpp | PSNR↑ | MS-SSIM↑ | LPIPS↓ | DISTS↓ | FID↓ |
    |---|---|---|---|---|---|---|
    | MS-ILLM | 0.0092 | 20.411 | 0.641 | 0.397 | 0.255 | 72.693 |
    | | 0.0196 | 22.708 | 0.766 | 0.257 | 0.169 | 17.992 |
    | | 0.0296 | 23.948 | 0.821 | 0.200 | 0.145 | 9.041 |
    | | 0.0488 | 25.517 | 0.875 | 0.138 | 0.122 | 4.100 |
    | | 0.0835 | 26.972 | 0.912 | 0.095 | 0.099 | 2.032 |
    | | 0.1510 | 28.734 | 0.944 | 0.063 | 0.075 | 0.990 |
    | | 0.2850 | 30.930 | 0.969 | 0.039 | 0.058 | 0.457 |
    | DiffEIC | 0.0217 | 18.778 | 0.676 | 0.318 | 0.171 | 6.151 |
    | | 0.0407 | 20.564 | 0.770 | 0.229 | 0.133 | 3.929 |
    | | 0.0653 | 22.710 | 0.840 | 0.159 | 0.099 | 2.578 |
    | | 0.0975 | 24.606 | 0.886 | 0.116 | 0.077 | 1.911 |
    | | 1.4219 | 24.358 | 0.909 | 0.096 | 0.065 | 0.130 |
    | PerCo (SD) | 0.0036 | 14.126 | 0.388 | 0.545 | 0.245 | 4.467 |
    | | 0.0329 | 18.124 | 0.676 | 0.311 | 0.159 | 2.748 |
    | | 0.1267 | 22.802 | 0.869 | 0.134 | 0.080 | 1.152 |
    | DiffC | 0.0083 | 16.170 | 0.495 | 0.524 | 0.271 | 90.989 |
    | | 0.0133 | 18.306 | 0.611 | 0.427 | 0.229 | 57.542 |
    | | 0.0198 | 20.159 | 0.705 | 0.334 | 0.187 | 28.909 |
    | | 0.0303 | 22.029 | 0.786 | 0.241 | 0.140 | 9.763 |
    | | 0.0514 | 24.170 | 0.859 | 0.155 | 0.095 | 2.805 |
    | | 0.0544 | 24.390 | 0.865 | 0.148 | 0.091 | 2.566 |
    | | 0.0583 | 24.646 | 0.872 | 0.140 | 0.087 | 2.334 |
    | OneDC (Ours) | 0.0034 | 16.083 | 0.460 | 0.376 | 0.211 | 13.607 |
    | | 0.0112 | 20.696 | 0.721 | 0.201 | 0.128 | 3.496 |
    | | 0.0179 | 21.924 | 0.779 | 0.166 | 0.111 | 2.817 |
    | | 0.0260 | 22.924 | 0.820 | 0.139 | 0.097 | 2.379 |
    | | 0.0371 | 23.809 | 0.852 | 0.118 | 0.086 | 2.044 |
    | | 0.0521 | 24.650 | 0.878 | 0.101 | 0.076 | 1.719 |
    | | 0.0789 | 25.674 | 0.905 | 0.082 | 0.064 | 1.365 |
    | | 0.1137 | 26.662 | 0.924 | 0.069 | 0.055 | 1.043 |

    If you need any additional concrete numbers, please let us know, and we are happy to provide them.

Comment

Thank you for the answers. I will raise my score, if the aforementioned matters are fixed the paper should be accepted in my opinion.

Comment

Thank you for your thoughtful engagement throughout the review process. We're glad our clarifications addressed your concerns, and we sincerely appreciate your support and constructive feedback. We will carefully revise the paper to incorporate the discussed points.

Review (Rating: 5)

This paper introduces a one-step diffusion-based generative image codec designed for ultra-low bitrate scenarios. The authors argue that multi-step sampling is not essential for generative image compression and propose a lightweight alternative that leverages a semantic distillation mechanism from a pretrained tokenizer into the hyperprior. The system combines latent compression with one-step conditional generation using a pretrained diffusion backbone, and further employs hybrid-domain optimization to improve perceptual realism and fidelity. Experimental results demonstrate that OneDC achieves state-of-the-art visual quality and over 40% bitrate reduction with 20× faster decoding compared to existing multi-step diffusion codecs.

Strengths and Weaknesses

[Strength]

This paper is well-written and clearly structured, maintaining coherence throughout in articulating its main claims. The contributions are clearly stated, and the authors support their claims with a comprehensive set of experiments.

In particular, the discussion on the limitations of using text as a global context for generative models is compelling, and the proposed alternative of utilizing hyperprior for semantic guidance introduces a meaningful novelty. Even considering that some training techniques are adopted from prior works, such as GLC and DMD2, the use of code prediction loss for semantic context in the first training phase, and the balanced integration of latent space distillation and pixel-level fidelity optimization in the second phase are notably impressive.

As a result, the proposed approach not only significantly outperforms existing generative compression methods but also largely mitigates the complexity overhead typically associated with multi-step diffusion models. The reviewer believes that this method could serve as a significant milestone in the field.

A few points of concern that arose during the review process are detailed in the following weakness section.

[Weakness]

(1) One of the key questions is how much bitrate is allocated to the low-level representation $\hat{y}$, and how its exclusion would affect the reconstruction. While the paper focuses heavily on the use of semantic context (particularly as demonstrated in the results of Fig. 2, which suggest that a significant portion of the reconstruction is determined by the semantic context derived from the hyperprior), the influence of the $\hat{y}$ (or $\tilde{y}_t$) representation is comparatively underemphasized. Providing the relative bitrate portion of the two bitstreams at different rates, along with an ablation study that shows the impact of removing the $y$-branch both quantitatively and qualitatively, would further improve the completeness of the paper.

(2) In a similar context, it is unclear why the subscript $t$ is used in $\tilde{y}_t$. Since $\tilde{y}_t$ is reconstructed from $\hat{y}$, it should follow a distribution distinct from random noise. However, there appears to be no time-dependent operation involved in the process of reconstructing $\tilde{y}_t$. It would be helpful for the authors to revisit this notation and, if possible, visualize how this representation differs from random noise.

(3) The paper would benefit from a more detailed explanation of how the semantic context (C) is injected spatially into the model via cross-attention. Since this process is critical to overcoming the limitations of textual tokens, as claimed by the paper, it should be explicitly described in the main text or, if necessary, included in the supplementary material.

(4) (Minor) In Fig. 4, the semantic context (C) is not depicted as being injected into the one-step U-Net, which may lead to confusion. It would be helpful to revise this figure to reflect the actual flow.

(5) (Minor) In Eq. (1), the expression $\text{logits} = P_{\text{aux}}(c)$ is somewhat misleading, as logits generally refer to pre-softmax activations rather than actual probabilities. Simply using $P_{\text{aux}}(c)$ would likely be sufficient.

(6) (Minor) In Table 1, it would be helpful to include results for a setting without second-stage training (“w/o second-stage training”) to better understand the contribution of the hybrid-domain fine-tuning step.

Questions

One important request I would like to make is that, for the sake of the broader advancement of our research field, I strongly encourage the authors to release the training code in a reproducible form. Some state-of-the-art methods in neural compression, such as the DCVC family in video compression, have demonstrated strong performance, but their lack of publicly available training code has severely hindered reproducibility. Please keep in mind that reproducibility is an essential attribute of any high-quality research paper.

Limitations

Yes.

Final Justification

I maintain the original rating, taking into account the reviewer’s withdrawal and the subsequent discussion.

The authors have sincerely responded to the remaining inquiry regarding the semantic context injection method.

Formatting Issues

None.

Author Response

Many thanks for your valuable comments and the recognition of our contributions. We hope the following responses address your concerns.

1. On the role and bitrate portion of $\hat{y}$

The hyperprior latent $\hat{z}$ provides high-level semantic guidance, while the main latent $\hat{y}$ preserves low-level details for reconstruction. To analyze their respective contributions, we conducted an ablation study on the CLIC2020 dataset. We fixed the bitrate for the semantic hyperprior $\hat{z}$ at 0.0035 bpp and progressively increased the bitrate allocated to the main latent $\hat{y}$.

| Bpp $\hat{z}$ (ratio) | Bpp $\hat{y}$ (ratio) | Bpp Total | PSNR↑ | MS-SSIM↑ | LPIPS↓ | DISTS↓ | FID↓ |
|---|---|---|---|---|---|---|---|
| 0.0035 (100%) | 0.0 (0%) | 0.0035 | 19.31 | 0.629 | 0.290 | 0.169 | 14.885 |
| 0.0035 (43%) | 0.0047 (57%) | 0.0082 | 23.13 | 0.790 | 0.163 | 0.089 | 6.223 |
| 0.0035 (27%) | 0.0094 (73%) | 0.0129 | 24.20 | 0.826 | 0.139 | 0.077 | 5.560 |
| 0.0035 (18%) | 0.0157 (82%) | 0.0192 | 25.25 | 0.856 | 0.119 | 0.068 | 4.979 |

The results demonstrate the division of roles:

  • Semantic Foundation ($\hat{z}$): Even with zero bits for $\hat{y}$, the model reconstructs a coherent image (FID = 14.89), confirming that $\hat{z}$ provides the core semantic structure.
  • Fidelity Enhancement ($\hat{y}$): Allocating more bits to $\hat{y}$ leads to substantial improvements across all fidelity and perceptual metrics (e.g., LPIPS drops from 0.290 to 0.119). This confirms that $\hat{y}$ is crucial for encoding fine-grained details.

Thanks for your great suggestion; we will include this analysis in the revision.

2. Regarding the notation

Thanks for your great suggestion. To clarify, we will revise the notation $\tilde{y}_t, \tilde{y}_0$ to better reflect the actual data flow:

  • subscript $t \rightarrow$ in: the latent decoded from the compressed bitstream, serving as the input to the one-step diffusion model.
  • subscript $0 \rightarrow$ out: the denoised latent produced by the one-step diffusion model.

The new notation removes the unnecessary implication of a timestep and more clearly represents the flow of inference in our one-step framework. We will revise the figure and text accordingly in the revision.

3. Regarding how semantic context is injected via cross-attention

The semantic context $c \in \mathbb{R}^{B \times N \times D}$, extracted from the quantized hyperprior $\hat{z}$ via a semantic decoder $h_{\text{sem}}$, is injected into the cross-attention layers of the one-step diffusion U-Net. Here $N = H \times W$ corresponds to the spatially flattened hyperprior feature map and $D$ is the embedding dimension.

In each cross-attention block, the latent feature serves as the query $Q$, while the semantic context $c$ provides the key $K$ and value $V$:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{D}}\right) V$$

This design allows every spatial location in the latent to reference relevant semantic context adaptively, improving reconstruction quality.
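
A minimal single-head sketch of this injection step, assuming flattened U-Net features as queries and flattened hyperprior tokens as keys/values (our own illustration, not the released code):

```python
import torch

def inject_semantic_context(latent: torch.Tensor, context: torch.Tensor,
                            w_q, w_k, w_v) -> torch.Tensor:
    """latent:  (B, L, D) flattened U-Net features (queries)
    context: (B, N, D) flattened hyperprior tokens, N = H * W of z-hat
    w_q / w_k / w_v: linear projections, e.g. torch.nn.Linear(D, D)."""
    q, k, v = w_q(latent), w_k(context), w_v(context)
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, L, N)
    return latent + attn @ v   # residual connection around the attention output
```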

Thanks for your great suggestion; we will revise the paper to clarify this process in the final version.

4. Open-sourcing

We fully agree with the reviewer's opinion on open-sourcing. We commit to releasing both the training and testing code for our paper.

5. Minor issues

Thank you for these kind reminders.

  • We will revise Fig. 4 to explicitly show the injection of the semantic context $c$ into the one-step U-Net for clarity.

  • We will revise Eq. (1) to use $P_{\text{aux}}(c)$ directly, avoiding the potentially misleading use of “logits.”

  • We will include the “w/o second-stage training” setting in Table 1 to clarify the impact of hybrid-domain fine-tuning.

Comment

Thank you for the clarification. I'll keep my rating.

I have one more quick question about injecting the semantic context via cross-attention. How did you handle the irregular sizes of (full-size) input images? Did you resize after the "h_sem" decoder?

Comment

Thank you for your thoughtful follow-up. We're glad our previous clarifications were helpful, and we sincerely appreciate your continued engagement.

Regarding your additional question:

We do not resize the semantic context $c$ after the $h_{\text{sem}}$ decoder. For example, given a 1024×768 input image $x$:

  • The corresponding latents $y$ and $\hat{y}$ have a spatial resolution of 64×48 (16× downsampling);

  • The hyperprior $z$, $\hat{z}$, and the decoded semantic context $c$ have a resolution of 16×12 (64× downsampling);

  • We flatten $c$ into a 1-D token sequence of length 192 = 16×12, which serves as the key and value in the cross-attention layers.

For inputs whose resolution is not divisible by 64, we apply padding to the nearest multiple and remove it after decoding to restore the original resolution. This approach follows common practice in prior neural codecs (e.g., HiFiC, MS-ILLM).
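
A small sketch of this pad-and-crop convention is shown below; the choice of replicate padding is our assumption for illustration, since the response does not specify the padding mode.

```python
import torch
import torch.nn.functional as F

def pad_to_multiple(x: torch.Tensor, multiple: int = 64):
    """Pad an image tensor (B, C, H, W) on the right/bottom so that H and W
    become multiples of `multiple`; return the padded tensor and original size."""
    h, w = x.shape[-2:]
    pad_h = (multiple - h % multiple) % multiple
    pad_w = (multiple - w % multiple) % multiple
    return F.pad(x, (0, pad_w, 0, pad_h), mode="replicate"), (h, w)

def crop_to_original(x: torch.Tensor, size) -> torch.Tensor:
    """Remove the padding after decoding to restore the original resolution."""
    h, w = size
    return x[..., :h, :w]
```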

The model is trained with both 512 and 1024 resolutions to ensure generalization across high-resolution inputs.

This design avoids distortions from resizing or interpolation, while preserving semantic context in a compact and spatially aligned form. We will add these implementation details for clarity in the revision.

Comment

Thank you for your prompt response.

Both 1024x768 and 768x1024 (or 1536x512, and so on) inputs will result in the same 192 flattened embeddings; however, the spatial positions of the elements differ between the two configurations. Are you suggesting that this spatial information mismatch can be overcome by the training technique of using both 512x512 and 1024x1024 data?

Comment

Thank you for your thoughtful follow-up. We appreciate the opportunity to further clarify this point.

In our framework, the semantic context $c$ is decoded from the hyperprior $\hat{z}$ and then injected into the one-step diffusion model.

For the injection process, we leverage the cross-attention mechanism: the semantic context $c$ is flattened into a 1-D sequence, which serves as the key and value. In our current implementation, each query attends to all keys and values and extracts the most relevant information without positional encoding.

However, our semantic context $c$ is derived from the hyperprior $\hat{z}$. Meanwhile, we also use $\hat{z}$ to generate the distribution parameters of the low-level detail latent $\hat{y}$, so $\hat{z}$ captures spatial information. As a result, the semantic context $c$ is also implicitly aware of spatial information. Thus, even without explicit positional encoding, the model benefits from the implicit spatial correlation between $\hat{z}$ and $c$.

We appreciate your insightful question, which has motivated us to consider the potential benefits of explicitly incorporating spatial information during the cross-attention. As part of our future work, we plan to explore advanced image positional embedding techniques to further enhance the model’s ability, while supporting arbitrary input resolutions.

Review (Rating: 4)

The paper proposes OneDC, a one-step diffusion-based image codec designed for high-quality compression at ultra-low bitrates. The method combines a latent compression module with a one-step diffusion generator, guided by a hyperprior enhanced via distillation. A two-stage training strategy is used to improve efficiency and reconstruction quality, and experiments show competitive perceptual performance with fast decoding.

Strengths and Weaknesses

Pros:

  • The writing is generally fluent, making the paper easy to read and follow.
  • Figure 3 is clear and well-designed, which helps readers understand the proposed method more easily.

Cons:

  • The proposed approach seems overly complicated, as it involves a large number of modules and a two-stage training paradigm where each stage requires coordinating multiple loss functions. As a result, the conceptual clarity of the work is reduced, and readers may find it difficult to extract clear insights from such an engineering-heavy solution.
  • Writing could be further improved. I suggest the authors better articulate the key contributions. It is somewhat unclear whether the main novelty lies in proposing a new overall architecture or in the design of certain modules. I recommend explicitly stating this and providing more comparisons with existing work to help position the paper more clearly in the context of related literature.
  • In lines 54–55, the authors mention that text-based guidance often leads to significant computational overhead, and that their method is more efficient. However, the experiments do not provide any comparison in terms of model size. I suggest including a parameter comparison (as done in DiffC) to better support the claimed efficiency.
  • Some equations are incomplete. For example, in Equation (2), loss inputs are not specified.

Questions

  • What’s the difference between scalar quantization and finite scalar quantization in Figure 3? Why not adopt a unified choice—either using finite scalar quantization throughout or sticking to general scalar quantization?
  • What is the connection between Figure 3 and Figure 4? Several symbols used in Figure 4 are not defined in the main text, which makes it difficult to follow.
  • What is the input to the model during inference, and where does it come from?
  • The values for DISTS, LPIPS, and FID are all 0.00 in Tables 1 and 2. Is this normal?

Limitations

One limitation of the proposed method is that it involves multiple training stages and multiple loss functions, which can make tuning the loss-related hyperparameters particularly challenging in practice. This increases implementation complexity and may hinder reproducibility or practical deployment.

Final Justification

I updated my score based on the authors' reply.

Formatting Issues

N/A

Author Response

Many thanks for your valuable comments. We hope that the following responses can address your concerns.

1. Complexity of Architecture & Training

Our design philosophy balances performance with simplicity. Many recent generative codecs require sophisticated designs, whereas we have streamlined our architecture and training to be more integrated.

  • Architectural Simplicity: Our OneDC integrates a one-step diffusion generator into a standard hyperprior codec framework. This design is self-contained and avoids the heavy external dependencies or complex add-ons seen in other methods. For instance:
    • PerCo [1] relies on large, separate vision-language (BLIP2) and text-embedding (CLIP) models for conditioning.
    • DiffEIC [2] incorporates a ControlNet and a custom Latent Feature-Guided Compression Module, increasing structural complexity.
    • DiffC [3] uses reverse-channel coding over many steps, requiring specialized CUDA kernels and complex scheduling.

In contrast, OneDC replaces external conditioning with a simple yet semantic-aware hyperprior.

  • Training Strategy: We adopt a two-stage training strategy, a well-established and effective paradigm used by leading methods like HiFiC[4] / MS-ILLM[5].
    • Stage I focuses on learning the compression part.
    • Stage II refines the decoding for perceptual realism.

This principled approach avoids the more intricate multi-stage pipelines of other models like GLC[6], which involves a three-stage process of training a VQGAN, transform coding, and then end-to-end fine-tuning.

We believe our approach represents a proper trade-off between innovation and practicality. We will release our full training code to ensure transparency.

2. Novelty & comparison

We appreciate the opportunity to clarify our contributions. The primary novelty lies in the design of a new codec architecture that successfully integrates a one-step diffusion generator with a hyperprior-based latent compressor. This unified design achieves SOTA performance while being significantly more efficient and self-contained than previous diffusion-based codecs.

Our key technical contributions that enable this framework are:

  • Hyperprior as Semantic Guidance: Unlike some prior works that rely on heavy external models (e.g., BLIP2, CLIP) for text-based conditioning, we are the first to leverage the hyperprior as a direct source of semantic guidance for the diffusion model. This approach is not only more efficient but also provides spatially-aware context that is better aligned with the image content, eliminating the need for textual prompts.

  • Semantic distillation: To tap the potential of the hyperprior for its new role, we introduce a semantic distillation mechanism. During training, we distill knowledge from a powerful pretrained generative tokenizer into the hyperprior branch. This novel step enriches the hyperprior’s semantic representation, making it an effective conditioning signal for the generator.

  • Integrated Architecture and Training: Our framework is designed for simplicity and reproducibility.

    • Architecture: While methods like DiffEIC or PerCo bolt on complex modules like ControlNet or large language models, our design is a self-contained system. It replaces complex conditioning with a learnable, integrated hyperprior.
    • Training: We adopt a principled two-stage training strategy (similar to HiFiC/MS-ILLM) that first establishes a robust latent representation and then refines for perceptual realism. This hybrid-domain approach (pixel and latent) ensures stable optimization and balances fidelity with realism, avoiding the more complex multi-stage pipelines of other generative codecs.

In summary, while we build upon established concepts like hyperpriors and diffusion models, our novelty stems from their synergistic integration into a new, streamlined framework specifically designed for efficient, high-fidelity, one-step generative compression.

3. Model size & parameter

We provide parameter counts, runtime comparisons (on 1024×1024 images), and BD-Rate results on the MS-COCO 30K dataset:

| Model | Params | Enc. Time (s) | Dec. Time (s) | BD-Rate (LPIPS)↓ | BD-Rate (DISTS)↓ | BD-Rate (FID)↓ |
|---|---|---|---|---|---|---|
| MS-ILLM | 181M | 0.14 | 0.17 | 76.9% | 225.3% | 247.9% |
| DiffEIC | 1.4B | 0.32 | 12.4 | 247.8% | 100.1% | 163.3% |
| PerCo (SD) | 3.8B + 340M + 955M* | 0.58 | 8.80 | 410.8% | 265.9% | 75.2% |
| DiffC | 950M | 3.9~15.6 | 6.9~10.8 | 114.9% | 96.2% | 144.2% |
| OneDC | 1.4B | 0.15 | 0.34 | 0.00% | 0.00% | 0.00% |

* The open-sourced PerCo (SD) includes an additional 3.8B BLIP2 caption model and a 340M CLIP text embedding model.

A BD-Rate of 0.00% means that OneDC is used as the anchor for comparison. See our response to question 8 for more details.

Compared to MS-ILLM, diffusion-based methods typically use larger models but achieve superior perceptual quality (e.g., lower BD-Rate with FID) due to stronger generative capacity. Unlike other diffusion-based codecs, OneDC avoids external caption models and multi-step sampling, enabling over 20× faster decoding (0.34 s vs. 6.9~12.4 s) while also achieving better rate-distortion performance. We will include these parameter and runtime comparisons in the revision to support the claimed efficiency.

4. Equation input

Thanks for your suggestions. We will clarify the input of each equation in the revision.

5. Choice of Scalar quantization (SQ) / finite scalar quantization (FSQ)

Our model uses FSQ for the hyperprior branch and standard SQ for the main latent branch. This design is intentional and crucial for balancing semantic representation and reconstruction fidelity.

  • FSQ for Hyperprior (Semantic Representation): The hyperprior is designed to capture high-level semantic information. FSQ, an improved variant of vector quantization, excels at learning a discrete, categorical representation from a large codebook. This makes it ideal for modeling the semantic-rich features in the hyperprior, providing effective guidance for the generative model.

  • SQ for Main Latent (Low-Level Detail Preservation): The main latent must preserve fine-grained visual details for high-fidelity reconstruction. SQ provides a larger, more continuous-like quantization space that is better suited for this task, minimizing information loss for low-level textures and structures.

To validate this, we compare our approach against two alternatives: using SQ for both branches (Dual SQ) and FSQ for both branches (Dual FSQ). All models, including the compared OneDC, are trained from scratch with the same Stage I training duration for a fair comparison. The results below show the performance degradation (BD-Rate increase) relative to our OneDC model.

| Variant | BD-Rate (DISTS)↓ | BD-Rate (FID)↓ |
|---|---|---|
| Dual SQ | 48.3% | 45.9% |
| Dual FSQ | 79.1% | 39.8% |

The results confirm our hypothesis:

  • Using SQ for the hyperprior (Dual SQ) weakens its semantic modeling capability, leading to a significant drop in perceptual quality (higher BD-Rate with FID).
  • Using FSQ for the main latent (Dual FSQ) harms its ability to preserve fine details, resulting in a substantial loss of fidelity (higher BD-Rate with DISTS).

This verifies our hybrid quantization strategy, which assigns the proper quantization method to each branch based on its specific role.
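
To make the distinction concrete, the sketch below contrasts the two quantizers in their simplest textbook form: uniform scalar rounding with a straight-through gradient versus FSQ-style bounded rounding onto a few levels per channel. This is a generic illustration of the techniques, not the paper's implementation.

```python
import torch

def scalar_quantize(y: torch.Tensor) -> torch.Tensor:
    """Uniform scalar quantization with a straight-through gradient:
    rounding in the forward pass, identity gradient in the backward pass."""
    return y + (torch.round(y) - y).detach()

def finite_scalar_quantize(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """FSQ-style quantization: bound each channel with tanh, then round it onto
    `levels` evenly spaced values in [-1, 1]; the implicit codebook size is
    levels ** num_channels (straight-through gradient as above)."""
    half = (levels - 1) / 2
    z_bounded = torch.tanh(z) * half            # values in (-half, half)
    z_quant = torch.round(z_bounded) / half     # one of `levels` grid points
    return z_bounded / half + (z_quant - z_bounded / half).detach()
```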

6. Connection between Figures 3 and 4 & symbol definitions

The two figures are complementary and serve different purposes:

  • Fig. 3 shows the inference pipeline, i.e., the interaction between the latent compression module and the one-step diffusion generator. It highlights two key points:

    • The integration of a latent compression module and a one-step diffusion model.

    • The use of hyperprior-based semantic guidance instead of text-based conditioning for the diffusion U-Net.

  • Fig. 4 presents the training pipeline, detailing the two-stage process and indicating which components are updated or used only during training:

    • Stage I trains the compression module and the one-step diffusion model for reconstruction, with a distillation mechanism to enhance the hyperprior’s semantic representation.

    • Stage II fine-tunes the one-step diffusion model using latent-domain diffusion distillation to improve generation realism, combined with a pixel-domain loss to preserve fidelity.

We will improve cross-referencing between the figures and explicitly define all symbols in the revision.

7. Input during inference

| | Input | Output |
|---|---|---|
| Encoding | uncompressed RGB image | compressed bit-stream |
| Decoding | compressed bit-stream | reconstructed RGB image |

8. The 0.0 value in Table 1 / 2

In Tables 1 and 2, we report the Bjontegaard Delta Rate (BD-Rate) [7] to reflect the bitrate change under a specific quality metric compared with the anchor method. A negative BD-Rate value means bitrate savings over the anchor, while a positive value means a bitrate increase. The BD-Rate of the anchor against itself is 0.0%.
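
For reference, a compact NumPy sketch of the standard BD-rate computation [7] is given below (cubic fit of log-rate versus quality over the overlapping quality range); for lower-is-better metrics such as LPIPS or FID, the metric values can simply be negated before the fit. This is a generic reference implementation, not the authors' evaluation script.

```python
import numpy as np

def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test) -> float:
    """Bjontegaard delta-rate in percent: average bitrate change of the test
    codec relative to the anchor over the overlapping quality range, using a
    cubic fit of log-rate as a function of quality (VCEG-M33 style).
    Negative values mean bitrate savings over the anchor."""
    p_anchor = np.polyfit(quality_anchor, np.log(rate_anchor), 3)
    p_test = np.polyfit(quality_test, np.log(rate_test), 3)
    lo = max(min(quality_anchor), min(quality_test))
    hi = min(max(quality_anchor), max(quality_test))
    int_anchor, int_test = np.polyint(p_anchor), np.polyint(p_test)
    avg_anchor = (np.polyval(int_anchor, hi) - np.polyval(int_anchor, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)
    return (np.exp(avg_test - avg_anchor) - 1.0) * 100.0
```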

In our paper, we use the final OneDC model as the anchor. Accordingly:

  • In Table 1, BD-Rate reflects performance degradation when individual components of OneDC are ablated.
  • In Table 2, BD-Rate reflects the degradation of other codecs compared to OneDC.

In the revision, we will clarify this and annotate the table headers as “BD-Rate (Quality Metric)” for improved clarity.

[1] Towards image compression with perfect realism at ultra-low bitrates, ICLR 2024.

[2] Towards extreme image compression with latent feature guidance and diffusion prior, TCSVT 2024.

[3] Lossy compression with pretrained diffusion models, ICLR 2025.

[4] High-fidelity generative image compression, NeurIPS 2020.

[5] Improving statistical fidelity for neural image compression with implicit local likelihood models, ICML 2023.

[6] Generative latent coding for ultra-low bitrate image compression, CVPR 2024.

[7] Calculation of average PSNR differences between RD-curves, VCEG-M33.

Comment

Thank you for the response. I updated my score accordingly.

Comment

Thank you for your thoughtful follow-up. If you have any further questions, please let us know, and we are happy to provide any needed clarification.

Review (Rating: 5)

This paper proposes a novel method for ultra-low-rate image compression. It merges hyperprior latents as semantic conditioning, one-step diffusion decoding, and a semantic distillation mechanism. Performance comparisons in the ultra-low-rate regime are provided.

Strengths and Weaknesses

Strengths

  • This paper provides a mature diffusion-based compressor in the ultra-low-rate regime. To me, it fulfills some of the promises of rate-distortion-perception. It also seems to integrate some of the latest techniques in deep learning in a successful fashion for image compression, which is a non-trivial accomplishment.
  • The combination of the chosen components allows for great tradeoffs in speed and performance.
  • Experiments are fairly extensive, with ablation studies showing the benefit of each component.

Weaknesses:

  • A discussion of training time would be helpful.
  • Performance could potentially be improved by doing full fine-tuning rather than LoRA. I suppose it would be more resource-intensive, but I am not sure whether this was compared.

Questions

See weaknesses

Limitations

yes

Final Justification

I think this is a strong paper and valuable contribution to the conference. I maintain my original positive score.

Formatting Issues

N/A

Author Response

Many thanks for your valuable suggestions and the recognition of our contributions. We hope that the following responses address your comments.

1. Discussion of training time

During training, image patches of size {512×512, 1024×1024} are randomly cropped with probabilities {0.6, 0.4} to enhance generalization to high-resolution inputs. We use a two-stage training strategy and 4×A100 80G GPUs to train our models.

  • Stage I trains the latent compression module and adapts it to the pre-trained one-step diffusion model for the reconstruction. It takes approximately 6 days as we use high-resolution patches for training.

  • Stage II further fine-tunes the one-step diffusion model to enhance the realism via latent-domain distillation and preserves fidelity through pixel-domain supervision. This stage requires around 12 days, as diffusion distillation introduces additional computational cost in addition to high-resolution training.

Thanks for your great suggestion. We will add these details to our revision and will also release our training code for reproducibility. In addition, we will explore accelerated training strategies, such as improved distillation methods, to reduce training cost in the future.

2. Full fine-tuning vs. LoRA

We conducted additional experiments using full fine-tuning (FT) starting from the λ = 7.4 checkpoint originally used in our LoRA-based model. We denote this variant as LoRA + FT in the table below. Owing to the significantly slower convergence of full fine-tuning compared to LoRA, we were limited to roughly 15% of the Stage II training iterations during the rebuttal period.

The preliminary results on the CLIC2020 dataset are below:

| Method | Bpp | LPIPS↓ | DISTS↓ | FID↓ |
|---|---|---|---|---|
| LoRA | 0.013 | 0.139 | 0.077 | 5.56 |
| LoRA + FT | 0.013 | 0.138 | 0.076 | 5.48 |

These early findings suggest that full fine-tuning can yield some improvements over LoRA under limited training conditions. In future work, we plan to explore more efficient training methods and further improve generation performance.

Comment

Thank you for addressing my comments. I maintain my positive score.

Comment

Thank you for your thoughtful follow-up. We're glad our clarifications addressed your concerns, and we sincerely appreciate your support and constructive feedback.

Final Decision

This paper introduces OneDC, a novel framework for one-step diffusion-based image compression that leverages a hyperprior for semantic guidance. The approach is elegant and highly effective, demonstrating state-of-the-art perceptual quality and a dramatic 20x decoding speedup. The authors engaged commendably during the rebuttal period, providing substantial new experiments that successfully addressed the major initial concerns, including the limited bitrate range and the lack of direct SOTA comparisons.

However, despite these strengths, the initial reviews raised several critical questions that could have undermined the work, centering on the evaluation protocol, model complexity, and comparisons against the latest SOTA methods. The authors' comprehensive rebuttal proved decisive, providing extensive new quantitative results, clarifying all ambiguities around the model's parameters and training, and ultimately resolving every key concern raised by the review panel.

This successful rebuttal was reflected in the final reviewer stances; it not only satisfied but successfully swayed reviewers who were initially borderline, leading them to raise their scores and forge a strong, unanimous consensus. Therefore, because the work presents a significant contribution and the authors have thoroughly resolved all critical points raised during the review process, my recommendation is to Accept.