Distillation of Discrete Diffusion through Dimensional Correlations

ICML 2025 (Poster) | Overall score: 6.3/10 | 3 reviewers, ratings 4 / 3 / 3 (min 3, max 4, std 0.5)
Submitted: 2025-01-17 | Updated: 2025-07-24
TL;DR

We conduct a theoretical analysis of current discrete diffusion models and propose a method to effectively capture the element-wise dependency that is ignored in conventional models.

Abstract

Keywords
diffusion model, discrete diffusion, distillation, consistency model, dimensional correlation, convergence analysis

Reviews and Discussion

Official Review (Rating: 4)

This paper focuses on the important research question of distilling discrete diffusion models, which poses unique challenges because the joint distribution over multiple discrete states must be modeled and its size grows with combinatorial complexity. This paper proposes Di4C, a principled, model-agnostic approach for distilling discrete diffusion models. Specifically, a mixture model with enough expressivity is employed as the student model and is learned with a consistency-trajectory-distillation-style loss, along with several auxiliary losses and a designed control variate. Experimental results demonstrate the effectiveness of the proposed method on both image generation and text generation.
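
To make the mixture-student idea concrete, here is a minimal sketch of how a latent-conditioned student can induce dimensional correlations even though each conditional output factorizes over dimensions. This is an illustration only, not the authors' implementation; `MixtureStudent`, `denoiser`, and `lam` are assumed names, and a PyTorch-style interface is assumed.

```python
import torch

class MixtureStudent(torch.nn.Module):
    """Illustrative sketch: a student whose denoiser is conditioned on a latent lam.
    For a fixed lam the output distribution factorizes over the D dimensions, but
    marginalizing over lam yields a mixture that carries dimensional correlations."""

    def __init__(self, denoiser):
        super().__init__()
        self.denoiser = denoiser  # maps (x_t, t, lam) -> logits of shape (batch, D, vocab)

    @torch.no_grad()
    def sample_x0(self, x_t, t):
        lam = torch.rand(x_t.shape[0], device=x_t.device)  # lam ~ Uniform(0, 1)
        logits = self.denoiser(x_t, t, lam)                 # factorized given lam
        return torch.distributions.Categorical(logits=logits).sample()
```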

Questions for the Authors

  • How does the proposed method compare to advanced samplers for discrete diffusion models such as [1]?
  • Does the time step schedule (i.e., the choice of $s$, $u$, $t$) play an important role as in the consistency models [2,3]?

[1] Zheng, Kaiwen, et al. "Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling." (ICLR 2025).

[2] Song, Yang, and Prafulla Dhariwal. "Improved techniques for training consistency models." (ICLR 2024)

[3] Geng, Zhengyang, et al. "Consistency models made easy." (ICLR 2025).

Claims and Evidence

The claims are well supported with mathematical derivation and empirical evidence.

Methods and Evaluation Criteria

The proposed method is reasonable with comprehensive evaluation across several standard benchmarks.

Theoretical Claims

Though I didn't go through every detail in the proof, the conclusion makes sense to me.

Experimental Designs or Analyses

The experimental designs for image generation seem nice to me.

For the text generation experiments, I have several questions:

  • Why do you opt to apply your method on top of another specific distilled model? Does the proposed method work well on top of a vanilla teacher model compared to the other distillation method?
  • It is known that the generative perplexity has crucial flaws [1,2], so it would be great to consider additional metrics for the text generation task.

[1] Shi, Jiaxin, et al. "Simplified and generalized masked diffusion for discrete data." (NeurIPS 2024)

[2] Zheng, Kaiwen, et al. "Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling." (ICLR 2025).

Supplementary Material

I reviewed Appendix A, E, and F in detail.

Relation to Broader Literature

Modeling complex correlations / joint distributions of multiple high-dimensional random variables is an important research question, and this paper presents several nice ideas to tackle it.

Essential References Not Discussed

Several papers discussing the evaluation criteria of discrete diffusion models for text generation (e.g., [1]) and developing fast samplers (e.g., [2]) should be discussed.

[1] Shi, Jiaxin, et al. "Simplified and generalized masked diffusion for discrete data." (NeurIPS 2024)

[2] Zheng, Kaiwen, et al. "Masked diffusion models are secretly time-agnostic masked models and exploit inaccurate categorical sampling." (ICLR 2025).

Other Strengths and Weaknesses

Strength

  • This paper studies an important and underexplored research question of distilling discrete diffusion models.
  • This paper is well executed on both technical and empirical aspects.
  • The writing is easy to follow.

Weakness

  • The primary weakness of the current manuscript is the presentation of the main text, likely due to space constraints. As a result, readers must consult the supplementary material to fully grasp the paper's core technical contributions. Additionally, for clarity, a formal description of the sampling scheme should be included in the main text.
  • Please refer to the "Experimental Designs Or Analyses" part for my question about the metric and the teacher model used for text generation.

Other Comments or Suggestions

  • In Eqn. (4), should the expectation be taken over $x_T$?
  • On the top of Figure 1 (right), should the second term be $p^1(x; \beta)\,p^2(y; \beta)$?
Author Response

Thank you for your positive evaluation of the paper. Let us answer your questions. We will also correct the typos you pointed out.

[q-1] Why we chose the SDTT model as the teacher

Why do you opt to apply your method on top of another specific distilled model? Does the proposed method work well on top of a vanilla teacher model compared to the other distillation method?

We chose the SDTT model as the teacher for the following reasons. Firstly, we wanted to see whether our method works even on top of a well-distilled model. While the teacher models (sdtt-6 / sdtt-7) went through many rounds of distillation, their modeling remains dimensionally independent, so we wanted to confirm that there is still room for improvement by introducing dimensional correlations. Secondly, we were not aware of any other distillation method for discrete diffusion (even though SDTT applies only to masked diffusions) with which to compare our method. By examining its performance gain (i.e., sdtt-6 -> sdtt-7), we could at the same time see how our method performs compared to another distillation method (please also see [q-2] for the actual comparison).

[q-2] Generative perplexity and diversity

It is known that the generative perplexity has crucial flaws [1,2], so it would be great to consider additional metrics for the text generation task.

Generative perplexity can indeed be easily hacked as it does not consider the diversity of the generated samples. Following our teacher model's work, we measured the Self-BLEU metric (similar to the sentence entropy used in [2] from your reference list, but lower is better) in addition to generative perplexity and plotted their trade-off curve in Figure 4(b). Our finding is that our distillation method (sdtt-6 -> sdtt-6+di4c(^2)) does not worsen the Self-BLEU compared to a round of SDTT (sdtt-6 -> sdtt-7).
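
For readers unfamiliar with the metric, below is a minimal sketch of how Self-BLEU can be computed: each generated sample is scored with BLEU against the other generations used as references, and the scores are averaged (lower means more diverse). This is only an illustration using NLTK's `sentence_bleu`; the function name `self_bleu` and the `max_refs` cap are our assumptions, not the evaluation protocol used in the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(samples, max_refs=100):
    """Average BLEU of each sample against the other samples (lower = more diverse).
    `samples` is a list of token lists; `max_refs` only caps the reference set for speed."""
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(samples):
        refs = [s for j, s in enumerate(samples) if j != i][:max_refs]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)

# Toy usage with three whitespace-tokenized "generations".
gens = ["the cat sat on the mat".split(),
        "the dog sat on the rug".split(),
        "a completely different sentence".split()]
print(self_bleu(gens))
```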

[q-3] Relation to an advanced fast sampler

How does the proposed method compare to advanced samplers for discrete diffusion models such as [1]?

Thank you for pointing out this relevant work. We understand that the fast sampling method in the reference [1] (from your reference list) suggests that, in masked diffusion models, we can reduce the size of categorical sampling by first choosing the index (or indices in the parallel decoding variant) to unmask. In their parallel decoding variant, they do not model the dimensional correlations, so our mixture modeling can be combined to further improve sampling quality.
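
As a rough illustration of the parallel-unmasking pattern discussed here (a generic sketch, not the sampler of [1] and not our code): positions to reveal are chosen first, and each revealed token is then drawn from its own categorical independently, so no dimensional correlation is modeled. The `MASK` id, the random index selection, and the tensor shapes are illustrative assumptions.

```python
import torch

MASK = 0  # hypothetical id of the [MASK] token

@torch.no_grad()
def parallel_unmask_step(x, logits, k):
    """One parallel-decoding step for a masked model: pick k masked positions to
    reveal, then sample their categories independently. x: (D,), logits: (D, vocab)."""
    masked = (x == MASK).nonzero(as_tuple=True)[0]
    if masked.numel() == 0:
        return x
    k = min(k, masked.numel())
    # Which masked positions to reveal this step (uniform here; confidence-based
    # selection is a common alternative).
    chosen = masked[torch.randperm(masked.numel())[:k]]
    probs = torch.softmax(logits[chosen], dim=-1)
    x = x.clone()
    x[chosen] = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return x
```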

[q-4] Time step scheduling

Does the time step schedule (i.e., the choice of $s$, $u$, $t$) play an important role as in the consistency models [2,3]?

We suppose so, as the quality of our distillation target can vary depending on the time step scheduling. While we did not optimize the teacher sampling steps in our work, there is research optimizing the time step schedules in discrete diffusion [Par25], which can be combined with our method. With this approach, we can also optimize the step schedule after training the model (i.e., the step schedule in few-step sampling).

References

[Des25] Deschenaux and Gulcehre. Beyond Autoregression: Fast LLMs via Self-Distillation Through Time. ICLR 2025.

[Par25] Park et al. Jump your steps: Optimizing sampling schedule of discrete diffusion models. ICLR 2025.

Official Review (Rating: 3)

This paper studies the distillation problem for discrete diffusion models. The authors identify a key challenge in capturing dimensional dependencies and provide theoretical analyses to support their findings. To address this, they propose a mixture student model with tailored loss functions to facilitate distillation. The proposed method is demonstrated to be effective in both vision and language domains.

Questions for the Authors

How should the sampling distribution of $\lambda$ be chosen? Have the authors ever considered other common distributions than uniform? How sensitive is model performance to this choice?

Claims and Evidence

Yes, the claims are well supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes, the proposed method and evaluation criteria are mostly appropriate for the problem. However, I find one requirement somewhat restrictive in practical applications. Specifically, the method relies on reference distributions for $r_\delta$ and $r_t$, which must either be sampled from real training data or approximated using multiple teacher steps. The first approach requires access to real data, which is not always feasible, for example when only a pretrained model is available or when copyright restrictions apply. The second approach, while avoiding this issue, may introduce significant computational overhead (e.g., when approximating $r_\delta$ with $\delta \ll 1$), making the method less practical for large-scale applications.

Theoretical Claims

I reviewed the main theorems and their proofs and found no issues.

Experimental Designs or Analyses

I reviewed the experimental design and discussion and found no major issues. However, I did find that parts of the experiments could be further improved:

  1. Table 1 only provides the evaluation metrics for 10, 20, and 40 inference steps. It would be more informative to include results for other step counts in a graph similar to Figure 3. For example, I would like to see how much acceleration the proposed distillation method can achieve while maintaining performance (in FID/IS) similar to the teacher model using 20 inference steps. Additionally, could the authors explain how the results were obtained? Based on Campbell et al. [1] Figure 4, the CIFAR-10 FID of $\tau$LDR-0 (the "teacher model" here) does not drop to near 8 until over 256 NFEs. Why is the result of the teacher model using 40 steps in Table 1 already close to 8?
  2. Additionally, in [1], with additional corrector steps, the results of $\tau$LDR can be significantly improved: 3.74 in FID and 9.49 in IS for $\tau$LDR-10. However, it seems that the authors did not compare their results with it. Does it indicate that the proposed distillation method is not compatible with the predictor-corrector sampler? Otherwise, I would suggest including results combining the student model with the predictor-corrector sampling strategy.
  3. Could the authors also include the sampling results with one inference step? Considering that distillation methods for continuous diffusion models can already achieve performance on par with their teacher models using just one sampling step, I feel there is still a huge gap between the proposed method and its counterpart for continuous diffusion. While the proposed distillation method does open a new door for accelerating discrete diffusion sampling, I do not find the proposed method especially impressive or convincing based on these experimental results.

[1] Campbell, Andrew, et al. "A continuous time framework for discrete denoising models." Advances in Neural Information Processing Systems 35 (2022): 28266-28279.

Supplementary Material

I found the codebase in the supplementary material but did not analyze it in detail.

Relation to Broader Literature

The paper makes important contributions to discrete diffusion distillation through both theoretical analysis and practical algorithm design.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

The paper is well-written with solid theoretical grounding and mathematical details. The proposed algorithm is flexible, practical, and applicable across different model architectures and tasks.

Other Comments or Suggestions

None.

Author Response

Thank you for your feedback and positive comments. Let us respond to your questions.

[d-1] On the reference distributions

the method relies on reference distributions for $r_\delta$ and $r_t$, which must either be sampled from real training data or approximated using multiple teacher steps.

It is a valid point that our method has limitations on the source of reference samples. However, in the relatively large-scale experiments (masked image/language modeling), we used only a small portion of the original dataset. For example, we used 200K out of 9M samples in OpenWebText (similarly for ImageNet). Additionally, in our current implementation, each sample $x_0$ generates only a single $x_t$ (with a single $t$) throughout the training run. This suggests that more efficient batch designs could reduce the essential number of training samples. That said, we acknowledge that accessing or constructing reference samples is not always straightforward. Future research should explore the use of partial samples or samples from alternative distributions to address this limitation.
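
To illustrate what "each sample $x_0$ generates only a single $x_t$ (with a single $t$)" amounts to in practice, here is a minimal sketch for a masked forward process. The keep-probability schedule `alpha`, the `mask_id`, and the shapes are illustrative assumptions, not the paper's exact setup.

```python
import torch

def make_reference_pair(x0, mask_id, alpha):
    """Construct one reference pair (t, x_t) from a data sample x0.
    alpha(t) is the assumed per-token keep probability of the noise schedule."""
    t = torch.rand(())                                        # a single t per x0
    keep = torch.rand_like(x0, dtype=torch.float) < alpha(t)  # keep or mask each token
    x_t = torch.where(keep, x0, torch.full_like(x0, mask_id))
    return t, x_t

# Example with a linear schedule alpha(t) = 1 - t (an illustrative choice).
x0 = torch.randint(1, 100, (16,))
t, x_t = make_reference_pair(x0, mask_id=0, alpha=lambda t: 1.0 - t)
```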

[d-2] Regarding $\tau$LDR models

Why is the result of the teacher model using 40 steps in Table 1 already close to 8?

This discrepancy arises from the different sampling methods used. As detailed in Table 3 in Section F.2.2, the $\tau$-leaping sampler does not work well with 40 steps. Instead, we used the analytical sampler to evaluate the teacher model. We suppose this difference stems from the non-time-homogeneous nature of the forward diffusion. Specifically, $\tau$-leaping approximates the transition rate using a constant matrix over a time interval of length $\tau$, which fails to accurately capture the actual transition rates when $\tau$ is large and the rates vary significantly within the interval. In contrast, the analytical sampler avoids this issue, although it still does not account for dimensional correlations.
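
To spell out this intuition with a standard continuous-time Markov chain argument (our illustration, not a derivation from the paper), suppose the forward rate has the common form $R_s = \beta(s) R_b$ with a fixed base matrix $R_b$ and a scalar schedule $\beta$. Then

$$
P_{\mathrm{exact}}(x_{t+\tau}\mid x_t)=\Bigl[\exp\Bigl(\Bigl(\int_t^{t+\tau}\beta(s)\,\mathrm{d}s\Bigr)R_b\Bigr)\Bigr]_{x_t,\,x_{t+\tau}},
\qquad
P_{\tau\text{-leap}}(x_{t+\tau}\mid x_t)=\bigl[\exp\bigl(\tau\,\beta(t)\,R_b\bigr)\bigr]_{x_t,\,x_{t+\tau}},
$$

so the two agree only when $\tau\,\beta(t)\approx\int_t^{t+\tau}\beta(s)\,\mathrm{d}s$, i.e., when the schedule varies little over the interval; for large $\tau$ and a rapidly changing $\beta$, the frozen-rate kernel misestimates the transition probabilities, which is the failure mode described above.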

Does it indicate that the proposed distillation method is not compatible with the predictor-corrector sampler?

We can use predictor-corrector (PC). However, PC requires additional network evaluations (NFEs; each as expensive as one denoising step) and does not perform well in the small-NFE regime [Cam22, Fig. 4]. Below are the FIDs (based on 10K images; IS omitted due to the character limit) for different PC settings under a total NFE of 20. 'n+2*m' means we used m corrector steps before each of the final 2 out of n denoising steps (imitating [Cam22]).

| NFE | 6+2*7 | 10+2*5 | 14+2*3 | 20+2*0 |
|---|---|---|---|---|
| teacher | 57.57 | 32.74 | 21.32 | 14.42 |
| student | 44.57 | 22.63 | 14.25 | 11.81 |

[d-3] Inference with other numbers of steps

It would be more informative to include results for other step counts in a graph similar to Figure 3. For example, I would like to see how much acceleration the proposed distillation method can achieve while maintaining performance (in FID/IS) similar to the teacher model using 20 inference steps.

Here, we have computed additional FID/IS for our CIFAR-10 experiment using 10K samples (fewer than the 50K samples used in the paper, so the numbers may be slightly worse) for 2-20 steps:

FID:

| model | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| teacher | 392.24 | 173.29 | 78.24 | 49.44 | 34.70 | 26.33 | 21.47 | 18.05 | 15.80 | 14.42 |
| student | 411.70 | 147.67 | 59.62 | 33.85 | 22.57 | 17.46 | 14.37 | 12.86 | 12.28 | 11.81 |

IS:

| model | 2 | 4 | 6 | 8 | 10 | 12 | 14 | 16 | 18 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| teacher | 1.17 | 2.99 | 5.90 | 7.01 | 7.45 | 7.82 | 7.96 | 8.16 | 8.35 | 8.46 |
| student | 1.25 | 3.48 | 6.71 | 7.68 | 8.17 | 8.33 | 8.37 | 8.50 | 8.38 | 8.39 |

In terms of FID, our method achieves approximately 1.4 times acceleration in the 10–20 step range. However, it does not perform well in very few steps (e.g., 2–4 steps).
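
As a rough reading of the FID table above (our own arithmetic, shown only to illustrate where a factor of about 1.4 comes from): the teacher needs 20 steps to reach FID 14.42, while the student already reaches FID 14.37 at 14 steps, and the teacher's 16-step FID of 18.05 is roughly matched by the student at 12 steps (17.46), giving

$$
20/14 \approx 1.43, \qquad 16/12 \approx 1.33.
$$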

I feel there is still a huge gap between the proposed method and its counterpart for continuous diffusion.

To further reduce the number of steps, it is necessary to model high-dimensional discrete distributions more efficiently, as we lack a deterministic formulation like probability-flow ODEs (available in the continuous case). While mixture modeling provides some improvement, further optimization is required, including adjustments to the architecture, initial distribution, and dimensionality of $\lambda$. As you noted, our work represents a first step in building the theoretical foundation for this goal.

[d-4] Sampling distribution of $\lambda$

How should the sampling distribution of $\lambda$ be chosen? Have the authors ever considered other common distributions than uniform?

We prioritized ablations of different components over exploring alternative distributions for $\lambda$. As noted at the end of [d-3], this remains an area for future investigation.

References

[Sah24] Sahoo et al. Simple and effective masked diffusion language models. NeurIPS 2024.

[Cam22] Campbell et al. A continuous time framework for discrete denoising models. NeurIPS 2022.

Official Review (Rating: 3)

The paper proposed an improved model for distilling discrete diffusion models (DDMs). The key idea is that traditional DDMs factorize the latent distributions into products of marginal distributions, while the proposed model represents the latent distributions as products of bi-dimensional distribution pairs, which is able to approximate the ground-truth distribution accurately in fewer denoising steps.

Questions for the Authors

It is confusing why the authors insist on distilling DDMs, which are already fast approximate models, rather than slow high-quality models. Is it possible to use the model for distilling original or modified diffusion models of higher quality, or for training from scratch?

Claims and Evidence

The authors provided FID measurements as well as image samples to show that the distilled model is consistent with the teacher model. They claim that the distilled model with 4 steps reaches performance similar to that of the teacher model with 8 steps, while the overhead for each step is minimal (according to supp. F.1). Although this claim is rational, it should be noted that this can frequently be observed for certain distillation models because they benefit from the hints of the teacher models.

Methods and Evaluation Criteria

The evaluation criteria make sense but are rather coarse. FID/IS is a very limited metric for image generation quality: it mainly focuses on the average distribution of Inception features, without considering the diversity and detailed fidelity of the images.

Theoretical Claims

The authors provided a theoretical proof that the total variance decreases with increasing training steps.

Experimental Designs or Analyses

Firstly, as discussed above, the authors mainly compare their model with the teacher model using FID/IS scores, which is inherently limited. Also, they made this comparison on the CIFAR-10 and ImageNet datasets, where the FID/IS scores converge for each model in very few steps; this cannot justify the benefit of the method because DDMs are not slow in this case. The authors did introduce other metrics (e.g., PPL and MAUVE score) for conditional generation, but each time only a single metric is used for each dataset, and no statistical variance analysis was given. Third and most importantly, the method is not compared against other distillation methods.

Supplementary Material

I have checked the experimental results in the supplementary material.

Relation to Broader Literature

N/A

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Strength: The authors actually proposed an improvement for DDMs.

Weakness: If this improved structure can only be applied to distill pre-trained DDMs, the benefit is very limited.

Other Comments or Suggestions

N/A

Author Response

Thank you for your review. Let us reply to your comments.

[1-1] Added metrics beyond FID/IS

the authors mainly compare their model with the teacher model using FID/IS scores, which is inherently limited

We have additionally computed precision and recall metrics for the ImageNet experiment (see the table below), following established practices in the current literature (e.g., [Tian24]). Due to character limits, we present the results for each model at the classifier-free guidance (CFG) coefficient that achieves the best FID (see Table 6). We can see di4c models achieve better scores than the teacher model with the same number of steps.

| Model (#steps) | Precision | Recall |
|---|---|---|
| teacher (4) | 0.7737 | 0.5057 |
| di4c (4) | 0.7910 | 0.5363 |
| di4c-d (4) | 0.7866 | 0.5391 |
| teacher (8) | 0.7939 | 0.5499 |
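
For readers unfamiliar with these metrics, here is a minimal sketch of one common way such precision/recall numbers are computed, the k-NN manifold estimate of Kynkäänniemi et al. (2019). This is an illustration under assumed names and a default `k=3`, not necessarily the exact protocol behind the table above, and feature extraction (e.g., Inception embeddings) is omitted.

```python
import torch

def knn_radii(feats, k=3):
    # Radius of each point's k-th nearest neighbour (the +1 skips the zero self-distance).
    d = torch.cdist(feats, feats)
    return d.kthvalue(k + 1, dim=1).values

def precision_recall(real, fake, k=3):
    """k-NN manifold precision/recall:
    precision = share of fake features inside some real point's k-NN ball,
    recall    = share of real features inside some fake point's k-NN ball.
    real/fake are (N, d) feature tensors."""
    r_real, r_fake = knn_radii(real, k), knn_radii(fake, k)
    d = torch.cdist(fake, real)                           # (N_fake, N_real)
    precision = (d <= r_real.unsqueeze(0)).any(dim=1).float().mean()
    recall = (d.t() <= r_fake.unsqueeze(0)).any(dim=1).float().mean()
    return precision.item(), recall.item()
```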

[1-2] DDMs are not "fast and approximate" in general

It is confusing why the authors insist on distilling DDMs, which are already fast approximate models, rather than slow high-quality models. Is it possible to use the model for distilling original or modified diffusion models of higher quality, or for training from scratch?

The characterization of DDMs as approximations of 'original diffusion models' (as implied in your latter comment) is not typically correct. Indeed, DDMs are not "approximate" or "fast" in general.

  • Not Approximate: The teacher DDMs were generally trained from scratch (except for SDTT models we used in language modeling) without relying on any other diffusion models. While the parallel decoding (analytical sampling) in teacher DDM is indeed an approximate inference (with guarantees given in Theorem 1), a similar approximation (unimodal Gaussian approximation) is commonly employed in continuous diffusion models, as noted in [Li24]. Therefore, DDMs are not particularly "approximate" compared to continuous diffusion models.

  • Not Fast: DDMs typically require tens to hundreds of sampling steps, depending on the complexity of the model/data. For the experiment on Discretized Gaussian Diffusion, the teacher model requires 30-40 steps to reach FID<10, which we believe is no longer "very few steps". For comparison, in the continuous case, while EDM on CIFAR-10 requires 35 steps [Kar22], notable distillation methods [Son23] start from EDM and distill it into a few-step model. MaskGIT(-pytorch) is one of the few examples that achieve a small number of steps (8 steps to reach FID~7.0), enabled by some heuristics including confidence-based sampling and CFG. We tested our method in this scenario because we believe it is worth investigating whether our distillation method can also work in combination with such heuristics.

Thus, we believe our method can be applied to "original or modified diffusion models". Learning dimensional correlations in DDMs from scratch by incorporating the consistency training of [Son23] into our method is an interesting direction for future work.

[1-3] We mostly compute multiple metrics for each dataset

each time only a single metric is used for each dataset, and no statistical variance analysis was given.

It is not accurate to state that "each time only a single metric is used for each dataset". In fact, we computed both FID and IS for the CIFAR-10 and ImageNet experiments. For the OpenWebText experiment, we computed Gen. PPL for unconditional generation, and three different metrics (Gen. PPL, Self-BLEU, and MAUVE) for conditionally generated texts with prompts from the WebText dataset.

Regarding statistical variance analysis, we provided variance values for the latency analysis in Table 2. While it is ideal to retrain and resample everything 5-10 times, the large number of training checkpoints and experimental data points makes it common practice in the literature to report numbers from a single run (as seen in most references cited in this reply).

[1-4] We compare our method with SDTT distillation

most importantly, the method is not compared against other distillation methods.

While there are few distillation methods for discrete diffusion available (except for some concurrent works), we did compare our method with one round of distillation by SDTT [Des25] in the language modeling experiment (Figure 6). Specifically, "sdtt-7" is obtained after one round of SDTT distillation upon "sdtt-6", while "sdtt-6 + di4c" (or di4c^2) is obtained by Di4C training using the same teacher (sdtt-6). We explicitly compare them in Section 5.3.

References

[Tian24] Tian et al. Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction. NeurIPS 2024.

[Li24] Li et al. Soft mixture denoising: Beyond the expressive bottleneck of diffusion models. ICLR 2024.

[Kar22] Karras et al. Elucidating the design space of diffusion-based generative models. NeurIPS 2022.

[Son23] Song et al. Consistency models. ICML 2023.

[Des25] Deschenaux and Gulcehre. Beyond Autoregression: Fast LLMs via Self-Distillation Through Time. ICLR 2025.

Final Decision

This paper presents a contribution to the field of discrete diffusion models by tackling the challenge of capturing dimensional dependencies while maintaining computational feasibility. The authors propose a "mixture" model approach for discrete diffusion distillation alongside tailored loss functions, supported by mathematical analysis and empirical validation across both image and text generation tasks. While reviewers raised some valid concerns about experimental design, evaluation metrics, and the practicality of requiring reference distributions, the authors provided rebuttals with additional results and clarifications. Given the theoretical rigor, cross-domain applicability, and the authors' responses to all concerns, I recommend accepting this paper as a step toward more efficient discrete diffusion modeling.