PaperHub
Overall: 5.5 / 10 (Poster; 4 reviewers; min 4, max 7, std 1.1)
Ratings: 6, 7, 4, 5
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.8
NeurIPS 2024

Fast Sampling via Discrete Non-Markov Diffusion Models with Predetermined Transition Time

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
discrete diffusion, sampling acceleration, text generation

Reviews and Discussion

Review (Rating: 6)

The paper proposes an accelerated method for sampling from standard discrete diffusion models like multinomial diffusion and absorbing-state diffusion, using a non-Markovian forward process where the stochasticity is modelled by sampling a single transition time for each token, after which the process is fully determined. The model is trained similarly to a standard discrete diffusion model, but the formalism allows for a different type of sampler: the transition times are sampled in advance at the beginning, and the denoiser neural network simply parameterizes the transitions that happen at those times in the reverse process. This allows skipping redundant steps where no transition would occur in the standard discrete diffusion models, speeding up sampling significantly for large step counts. The model provides improved results and faster sampling speeds over a prior accelerated sampling method for sequence-to-sequence generation tasks, and similarly improved results for unconditional language modelling with the text8 and enwik8 datasets when compared with a prior multinomial diffusion model.

Strengths

  • Clearly improves over a standard multinomial and absorbing-state schedule in sampling speed and results
  • Improves over previously published work on accelerating discrete diffusion models
  • The framework could be a useful conceptual tool for people who work with standard discrete diffusion models in the style of D3PM and want to accelerate sampling. Methods like ARDM from Hoogeboom et al. and the tau-leaping of Campbell et al. are similar in function, but they either require reframing the modelling problem or, in the case of ARDM, only allow absorbing-state transitions.
  • In contrast, this paper presents a method that turns a model in the D3PM framework into a faster one that skips over redundant steps, and shows how it is connected to the original model as a model with transition times as latent variables.
  • The method is simple and easy to use

Weaknesses

  • The method is, in practice, somewhat similar to prior work (Campbell et al. and Hoogeboom et al.), and in that sense it would be more beneficial for the community if the paper went deeper into analysing potential speed improvements in discrete diffusion models or theoretical connections between their method and other work.
  • It would be useful to have results with fewer steps in Table 3: it seems that the biggest time improvements are obtained with large step counts, but the BLEU scores are not drastically different from low-step-count BLEU scores. If both methods already work quite well with low step counts, then the results do not seem like the best showcase of the improvements due to increased step counts. Overall, it would be useful to have a clearer picture of the situations in which the method provides practical benefits.

[1] "A Continuous Time Framework for Discrete Denoising Models", Campbell et. al. [2] "Autoregressive Diffusion Models", Hoogeboom et. al.

Questions

  • Since the accelerated multinomial model now does only one step for each token, I suppose it is not quite equivalent to the standard multinomial diffusion, where multiple transitions per token can happen during the generative process? Could the authors clarify the similarities and differences?

Limitations

Addressed.

Author Response

Thank you for your support and insightful comment.


Q1. The method is, in practice, somewhat similar to prior work (Campbell et al. and Hoogeboom et al.), and in that sense it would be more beneficial for the community if the paper went deeper into analysing potential speed improvements in discrete diffusion models or theoretical connections between their method and other work.

A1. While there are some conceptual parallels with prior work, our method offers unique and significant contributions as follows:

  • Unified framework: Unlike ARDM [1], which is limited to absorbing-state transitions, our approach provides a unified framework applicable to both multinomial and absorbing diffusions.

  • Non-Markovian continuous-time framework: Different from Campbell et al.'s (2022) Markovian framework [2], we study the non-Markovian setting, offering new insights for discrete diffusion models.

  • Theoretical foundations: Our rigorous theoretical analysis (Theorems 3.1, 3.5, and D.1) establishes connections between our non-Markovian process and standard discrete diffusion models.

  • Bridging discrete and continuous processes: We investigate the transition from finite to infinite step sampling, providing new insights into bridging the gap between discrete and continuous-time processes.

  • Practical implementation and scalable efficiency: Our method allows for the seamless adaptation of existing multinomial and absorbing discrete diffusion models and demonstrates significant speed improvements, achieving a 30x speedup at 1000 steps while maintaining generation quality.

Regarding your suggestion for deeper analysis, we agree this would be valuable. Due to page limitations, we had to place some of the detailed discussions and analyses in the appendix (e.g., lines 845-853). We will incorporate these extended discussions into the additional page allowed for the camera-ready version if accepted, providing a more comprehensive comparison with ARDM and Campbell et al.'s work [1,2].

[1] Hoogeboom et al., "Autoregressive Diffusion Models." ICLR 2022.

[2] Campbell et al., "A Continuous Time Framework for Discrete Denoising Models." NeurIPS 2022.


Q2. Lack of results from fewer steps in Table 3, making it unclear in which situations the method provides practical benefits.

A2. We appreciate your feedback. Tables 2, 3, 6, and 7 highlight the practical benefits of our method across various step counts:

  • At 25 steps: Our method achieves approximately 2x speedup.

  • At 50 steps: We observe about a 3x speedup.

  • At 1000 steps: The speed increases significantly to around 30x.

These results show that our method offers benefits even at moderate step counts like 25, with the advantage becoming more significant as the number of steps increases. This scalability makes our approach particularly valuable for tasks requiring higher quality generation and more computational steps.

Below, we present additional results for IWSLT14 with fewer sampling steps in an absorbing generation process, complementing the experiment in Table 3 of our submission. As shown in the table, when the sampling step count is significantly reduced below 25, the quality of the generated examples decreases noticeably (per Table 3, at step counts of 25 or higher, performance remains consistently above 31.5 BLEU for both RDM-Absorb and DNDM-Absorb).

Steps | RDM-Absorb BLEU | RDM-Absorb Time (s) | DNDM-Absorb BLEU | DNDM-Absorb Time (s)
5     | 29.92           | 27.36               | 30.30            | 24.1
10    | 31.02           | 48.08               | 31.49            | 37.9
15    | 31.26           | 69.22               | 32.13            | 50.0

Performance in low-step settings is a known challenge for discrete diffusion models, which falls outside the scope of our current work. Our primary focus is on accelerating the algorithm while maintaining high performance, which is a crucial factor for real-world applications requiring high-quality sample generation. Our comprehensive experiments demonstrate that DNDM provides a competitive speed-quality trade-off across a broad range of step sizes, with its performance particularly excelling as the number of steps increases.


Q3. Since the accelerated multinomial model now takes only one step for each token, is it not quite equivalent to the standard multinomial diffusion, where multiple transitions per token can happen during the generative process? Could the authors clarify the similarities and differences?

A3. Thank you for your question. Our accelerated multinomial model, based on the DNDM framework, indeed performs only one transition per token. This approach is mathematically equivalent to the standard multinomial diffusion in terms of the final generated distribution, and it leads to the same training objective, as detailed in Appendix B. Here are the key similarities and differences:

Similarities:

  • Final Distribution: Both methods produce samples from the same learned distribution.

  • Training Process: The training process remains unchanged, allowing the use of the same neural network.

Differences:

  • Sampling Process: Our method pre-samples transition times, allowing more efficient sampling by skipping unnecessary steps.

  • Number of Transitions: While standard multinomial diffusion allows multiple transitions, our method consolidates these into a single, more efficient transition.

  • Computational Efficiency: Our approach significantly reduces the number of function evaluations, especially for large step counts, which leads to faster sampling.

We will include a detailed discussion of these points in our paper to clarify this important aspect.

Comment

I thank the authors for the comprehensive answers to the questions. I appreciate the promise to add more discussion on the connections to prior work, and the additional experiments on low step counts. Given that these were the main concerns, I will raise the score, although it would be better if the extended discussions had been in the original paper and if there were some additional speed-up tricks that make the method clearly stand out from previous work on speeding up discrete diffusion. The new theoretical approach is interesting as well.

Comment

Thank you for your positive feedback and increased score. We appreciate your valuable insights on our additional experiments and theoretical approach.

Review (Rating: 7)

This paper proposes a new formulation for discrete diffusion models whereby the corruption process is defined as a non-Markovian process. At each point, a decision is made to either stay in the current state or switch to a noise sample; crucially, however, this noise sample is constant throughout the process, so only a single transition happens for each dimension. During generation, the model then just needs to step from transition time to transition time and can ignore superfluous simulation steps where no transition happens. The model is trained using an ELBO which the authors derive for their transition-time-conditional process. The authors test their model on machine translation, examining quality versus sampling speed, and also on unconditional text generation with the text8 dataset.
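For concreteness, here is a minimal NumPy sketch of the corruption process as summarized above; the uniform noise draw, the `alphas` schedule, and all names are illustrative assumptions rather than the paper's implementation:

import numpy as np

def corrupt(x0, t, alphas, vocab_size, rng):
    """Sketch: one fixed noise draw w and one transition time per token.

    alphas[s] is assumed to be P(token still equals x0 at step s),
    decreasing from alphas[0] = 1 towards alphas[T] ~ 0.
    """
    N, T = len(x0), len(alphas) - 1
    w = rng.integers(0, vocab_size, size=N)          # noise sample, constant over the whole process
    step_probs = alphas[:-1] - alphas[1:]            # P(transition at step s) = alpha_{s-1} - alpha_s
    tau = rng.choice(np.arange(1, T + 1), size=N, p=step_probs / step_probs.sum())
    x_t = np.where(tau <= t, w, x0)                  # each token flips from x0 to w exactly once, at time tau
    return x_t, tau

By construction each dimension transitions at most once, while the marginal $q(x_t | x_0)$ still matches the schedule $\alpha_t$.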

Strengths

The authors present an original and interesting idea; it is a benefit to the community to point out the fact that standard discrete corruption processes have unnecessary transitions in them and that the same $q(x_t | x_0)$ distribution can be obtained by simply switching between a data token and a noise token. The idea to condition on the transition time itself within the framework also enables some significant simplifications for general styles of corruption, whereas in the past this advantage may have only been noticed for absorbing-state processes.

The paper is well written, it is fairly easy to get a good understanding of the proposed method on a first read of the paper and this is helped by the fact that the authors move some nuances and complexity to the appendix. Effort has been made to make the main text readable and intuitive which is greatly appreciated.

I believe the paper will have some impact in the community because it is a quick win when users implement a discrete diffusion model. Instead of the standard sampling procedure of calling the network T times, users can switch to the proposed (quite simple in reality) algorithm and step directly from transition to transition without needing to change much implementation and gain a lot of speed up over the naive algorithm.

Weaknesses

I think there should be some discussion relating to quite a simple baseline that can be implemented for the absorbing state case. One can sample an absorbing state diffusion model in the same way as a standard discrete diffusion model, by stepping from each timestep to the next, however, instead of carrying out a forward pass of the neural network for every timestep, you first check if any tokens perform a transition and only if at least one token transitions, you then do the forward pass. This is possible in the absorbing state case because the unmasking rate is independent of the neural network output. Therefore, this simple algorithm would have the same stated advantages as your method where at most the neural network is called N times but if T is less than N, it will be called T times. I see that for other diffusion styles, your algorithm is fundamentally different due to conditioning on the transition time, however, in the absorbing state case, it seems very similar to this simple method.
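To make the baseline concrete, here is a small sketch of what is described above (our reading, with hypothetical names such as `denoise_fn`; not code from the paper): loop over all T steps, decide unmasking from the schedule alone, and run the network only when at least one token actually unmasks.

import numpy as np

def absorbing_baseline_sample(T, N, alphas, mask_id, denoise_fn, rng):
    """Baseline sketch: full T-step loop, but the forward pass happens only on steps
    where the schedule reveals at least one masked token."""
    x = np.full(N, mask_id)                                          # start from an all-[MASK] sequence
    for t in range(T, 0, -1):
        masked = x == mask_id
        p_reveal = (alphas[t - 1] - alphas[t]) / (1.0 - alphas[t])   # independent of the network output
        reveal = masked & (rng.random(N) < p_reveal)
        if reveal.any():                                             # only here do we pay for a network call
            x0_pred = denoise_fn(x, t)
            x[reveal] = x0_pred[reveal]
    return x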

How can you explain the fact that you have better performance in terms of sample quality than the baselines you compare to? In your method, you propose no methodological advancement that should improve sample quality as your method is to speed up sampling to make sure the neural network is only called an appropriate number of times. Especially for a large number of timesteps, in Tables 2 and 3, you get across the board improvement, but your algorithm should be very similar to a standard discrete diffusion model in this regime but simply with fewer network evals.

For your text8 experiment, you say you use a 9e7/5e6/5e5 train/val/test split however I believe this is incorrect because this only adds up to 9.55e7 tokens but the text8 dataset contains 1e8 tokens. This may be inherited from a typo in https://arxiv.org/pdf/2107.03006 pg.25 but in a more recent paper https://arxiv.org/pdf/2308.07037 pg 45, the test set is 5e6 tokens. Please confirm which size of test set you used.

The naming of the methods seems slightly inconsistent; for example, in Figure 1, DNDM-T is referenced but never actually defined in the full text.

Questions

In the end, do you think it will be possible to move beyond the idea of a time variable altogether? Since the final sampling algorithm steps from transition to transition, it seems that the more important variable is which tokens become denoised, in what order, and how many in one go. This is decoupled from time through your algorithm so perhaps there is even more simplification to be made.

Do you think it is possible to learn the transition times on a per dimension basis? This could be useful to train models that can generate tokens in an intelligent ordering rather than currently, the order of generation is completely decided before any tokens have been generated when the transition times are sampled and the ordering is completely uniform over all orderings of the dimensions.

Limitations

The authors adequately discuss the limitations in the appendix.

Author Response

Thank you for your support and valuable feedback. We address your major question as follows.


Q1. I think there should be some discussion relating to quite a simple baseline that can be implemented for the absorbing state case. [...] However, in the absorbing state case, it seems very similar to this simple method.

A1. Thank you for your insightful observation and suggestion. In the absorbing state case, as you accurately pointed out, one could implement a sampling method where:

  • The process steps from each timestep to the next, similar to a standard discrete diffusion model.

  • Before performing a forward pass of the neural network, it first checks if any tokens perform a transition.

  • Only if at least one token transitions is a forward pass of the neural network performed.

This method would indeed be a special case of our approach, potentially reducing the number of neural network calls to at most N (number of tokens), or T (number of timesteps) if T < N. When $T \rightarrow \infty$, the algorithm becomes ARDM [1] (as detailed in Section G.1). This simplification works for the absorbing state case due to the unmasking rate's independence from neural network output. For other diffusion styles, our framework differs fundamentally due to transition time conditioning. In a revised version, we would include this discussion to provide a more comprehensive comparison and to help readers better understand the advantages of our approach across different diffusion scenarios.

[1] Hoogeboom, et al. "Autoregressive Diffusion Models." ICLR 2022.


Q2. How can you explain the fact that you have better performance in terms of sample quality than the baselines you compare to? [...] but your algorithm should be very similar to a standard discrete diffusion model in this regime, simply with fewer network evals.

A2. Thank you for this insightful observation. You're correct that our primary goal was to accelerate sampling rather than directly improve quality. The superior performance in terms of sample quality was indeed an unexpected but welcome outcome. We hypothesize that this improvement may be attributed to the non-Markovian nature of our process. Similar to how DDIM [1] improved upon DDPM, our non-Markovian approach might lead to more coherent generation by allowing the model to leverage information from key timesteps throughout the entire sequence rather than just the immediately preceding step. Additionally, the reduced number of network evaluations using our method might actually be beneficial. By focusing on key transition points, we may be avoiding unnecessary noise introduced by intermediate steps, leading to cleaner, more focused generations. We will add this discussion to the paper in our revision. In future work, we plan to conduct a more in-depth analysis to elucidate the exact mechanisms behind this quality improvement as well as its relation with different types of transition times.

[1] Song, J., Meng, C., & Ermon, S. "Denoising Diffusion Implicit Models." ICLR 2021.


Q3. Typos: 1) Possible incorrect data split for the text8 experiment. 2) Confusion in method naming (e.g., DNDM-T referenced but not defined).

A3. Thank you for pointing out those typos. We'll revise the text8 experiment data split and correct DNDM-T to DNDM-k in our revision. We appreciate your attention to detail in helping us improve the clarity of our paper.


Q4. In the end, do you think it will be possible to move beyond the idea of a time variable altogether?

A4. Thank you for this insightful suggestion. While our current approach still relies on a time variable, your idea of focusing solely on token denoising order and grouping could indeed lead to further simplification and efficiency. This aligns well with our goal of optimizing the sampling process. Below we present additional experiments. We explored the impact of transition times based on the position of the tokens: from left to right and from right to left. In the left-to-right approach, tokens positioned on the left are transitioned to $x_0$ earlier, and vice versa for the right-to-left approach. As the table shows, the left-to-right approach consistently outperforms the right-to-left approach across all datasets and step counts, supporting the significance of the choice of the transition time.

Steps | Direction     | IWSLT14 | WMT14 | WMT16
25    | Left-to-right | 31.08   | 24.41 | 31.67
25    | Right-to-left | 30.54   | 23.33 | 31.33
50    | Left-to-right | 32.87   | 26.46 | 33.37
50    | Right-to-left | 32.47   | 25.18 | 32.78
1000  | Left-to-right | 34.45   | 27.93 | 34.43
1000  | Right-to-left | 34.04   | 27.02 | 34.15

We'll discuss this promising direction for future research in our revision, as it could potentially take our work a step further.


Q5. Do you think it is possible to learn the transition times on a per-dimension basis?

A5. Thank you for this insightful suggestion. Learning transition times on a per-dimensional basis is indeed an intriguing idea that could lead to more intelligent and efficient token generation. This approach could offer greater flexibility compared to our current method, where transition times are sampled uniformly before generation begins. The key challenges in pursuing this promising avenue would be:

  • Designing an appropriate neural network structure for the predictor to learn the transition times on a per-dimension basis.

  • Formulating an effective training loss that incorporates the learned transition times.

Implementing this could allow for adaptive ordering of token generation, potentially improving the quality and efficiency of the generated text. However, it would also increase the complexity of the model and training process. In our revision, we will discuss this idea as a direction for future research.

Comment

I appreciate the clear and detailed rebuttal from the authors. The additional experiment exploring token orderings for the corruption is also very interesting and it will be a nice addition to the paper. I will increase my score to 7.

Comment

Thank you for your support and the increased score. We appreciate your positive feedback on our rebuttal.

Review (Rating: 4)

This paper presents a non-Markov process for discrete diffusion to reduce sampling time. The authors introduce a transition time to de-randomize the sampling process and study the non-Markov processes from finite to infinite step sampling. The conditional and unconditional text generation results demonstrate the effectiveness of the proposed method.

Strengths

  1. The writing of the article is very clear and easy to understand.

  2. Compared with the image domain, discrete diffusion is more prominent in the text domain, so I think the conditional text generation and unconditional text generation experiments are sufficient.

Weaknesses

W1: The authors claim that Eq.(1) and Eq.(6) are different because $w_t$ in Eq.(1) is independently drawn from the noise distribution $q_{noise}$ and $w$ in Eq.(6) is time-invariant. But $q_{noise}$ is a Dirac distribution for the absorbing diffusion and $w_t = w$ for $t = 1, \dots, T$. So Eq.(1) and Eq.(6) are equal for the absorbing diffusion, which means the proposed non-Markov process is the same as the Markov process.

W2: Besides W1, we can further deduce that for the absorbing process, the proposed DNDM sampling algorithm is equivalent to the original sampling algorithm of the Markov process.

For the original Markov process, given $x_t$, we first sample $x_0 \sim p_{\theta}(x_0|x_t)$ and then sample $x_{t-1} \sim q(x_{t-1}|x_t, x_0)$ as shown in Eq.(4). I use $[M]$ to represent the absorbing state.

$$q(x_{t-1}|x_t, x_0) = \frac{q(x_t|x_{t-1}, x_0)\,q(x_{t-1}|x_0)}{q(x_t|x_0)} = \frac{q(x_t|x_{t-1})\,q(x_{t-1}|x_0)}{q(x_t|x_0)}$$

Firstly, if $x_t$ is not the absorbing state, we have that $x_{t-1}=x_t$ because $q(x_t = a \mid x_{t-1} \neq a) = 0$ for $a \neq [M]$.

Secondly, if $x_t$ is the absorbing state, based on $q(x_{t-1}=[M] \mid x_0)=1-\alpha_{t-1}$, $q(x_{t-1}=x_0 \mid x_0)=\alpha_{t-1}$, $q(x_t=[M] \mid x_{t-1}=[M])=1$ and $q(x_t=[M] \mid x_{t-1} \neq [M]) = 1 - \beta_t = 1 - \frac{\alpha_t}{\alpha_{t-1}}$, we can get:

$$q(x_{t-1}=[M] \mid x_t=[M], x_0) = \frac{1- \alpha_{t-1}}{1 - \alpha_t}, \qquad q(x_{t-1}=x_0 \mid x_t=[M], x_0) = \frac{\alpha_{t-1} - \alpha_{t}}{1 - \alpha_t}$$

Based on the above analysis, the sampling process of the absorbing Markov diffusion can be simplified as follows: we start from an all-$[M]$ sequence, and during sampling we draw $x_0 \sim p_{\theta}(x_0|x_t)$. If $x_t=[M]$, then $x_{t-1}$ stays in the $[M]$ state with probability $\frac{1- \alpha_{t-1}}{1 - \alpha_t}$ and transfers to $x_0$ with probability $\frac{\alpha_{t-1} - \alpha_{t}}{1 - \alpha_t}$. If $x_t \neq [M]$, it stays unchanged.

In order to further illustrate the relationship with DNDM sampling, I denote the first time $x_t$ transitions from $[M]$ to a non-$[M]$ state as $\tau$ (consistent with the transition time in this paper). We can deduce that $p(\tau=k) = \frac{\alpha_{k-1} - \alpha_{k}}{1 - \alpha_k} \prod_{t=k+1}^{T} \frac{1- \alpha_{t-1}}{1 - \alpha_t} = \alpha_{k-1} - \alpha_{k}$. This expression for $p(\tau=k)$, together with Theorem 3.5, further verifies that the proposed non-Markov method is the same as the original Markov method for absorbing diffusion.
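Spelling out the telescoping product behind this expression (under the assumption that $\alpha_T = 0$, i.e., the sequence is fully masked at time $T$):

$$p(\tau = k) = \frac{\alpha_{k-1} - \alpha_k}{1 - \alpha_k} \prod_{t=k+1}^{T} \frac{1 - \alpha_{t-1}}{1 - \alpha_t} = \frac{\alpha_{k-1} - \alpha_k}{1 - \alpha_k} \cdot \frac{1 - \alpha_k}{1 - \alpha_T} = \alpha_{k-1} - \alpha_k.$$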

W3: The proposed non-Markov method of multinomial diffusion is the same as the method in DDIM [1], Appendix A, but with a different formulation. We can use the $p(\tau=k)$ in W2 to further verify it.

In conclusion, this paper presents a non-Markov process for multinomial diffusion and absorbing diffusion to accelerate the sampling of the original Markov-process methods; this is the main contribution of this paper. However, for the multinomial diffusion, the relationship with DDIM [1] (Appendix A) is not clearly stated. For the absorbing diffusion, the non-Markov processes are the same as the Markov processes.

If I have misunderstood your approach, please feel free to point it out, and I will adjust my score.

[1] Denoising Diffusion Implicit Models, ICLR 2021

Questions

  1. In lines 281-282, the authors denote RDM and RDM-k as the sampling methods with and without top-k selection; is this a typo?

  2. In the experiment, do you use a pre-trained model or train it yourself?

  3. The authors claim that they use a model consisting of an encoder and a decoder. This is confusing: do you remove the causal mask in the transformer decoder? For the unconditional text generation, what is the input of the encoder?

Limitations

yes

Author Response

Thank you for your constructive feedback. We have addressed your questions and provided clarifications below.


Q1. The authors claim that Eq.(1) and Eq.(6) are different, but they are equal for the absorbing diffusion, which means the proposed non-Markov process is the same as the Markov process.

A1. We appreciate the reviewer's observation. While Eq.(1) and Eq.(6) are indeed equivalent specifically for absorbing diffusion, we respectfully disagree that this equivalence undermines our method. For more general cases beyond absorbing diffusion, these equations are not equivalent, which is why we term our approach the Discrete Non-Markov Diffusion Models (DNDM) framework. The key innovation of our paper is sampling transition times upfront to develop fast sampling algorithms (Lines 5 and 7 in Algorithm 1). Sampling transition times upfront allows the algorithm to skip function evaluations for steps that are not transition times. This reduces the number of neural network calls, leading to faster sampling. DNDM provides a unified framework to introduce the notion of transition time across various diffusion types, including but not limited to absorbing diffusion. In the revised version of our paper, we will add a remark to clarify this special case and highlight the broader applicability of our approach beyond this special case.
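To illustrate the "sampling transition times upfront" idea in isolation, here is a minimal sketch; the helpers `predict_x0` and `update_tokens` are hypothetical placeholders (matching the names used in the pseudo-code later in this discussion), and the `alphas` schedule is an assumption rather than the paper's code.

import numpy as np

def dndm_sample(x_T, alphas, predict_x0, update_tokens, rng):
    """Sketch: draw every token's transition time before the reverse loop starts,
    then evaluate the network only at steps that appear in the transition-time set."""
    N, T = len(x_T), len(alphas) - 1
    step_probs = alphas[:-1] - alphas[1:]                # P(tau = t) = alpha_{t-1} - alpha_t
    tau = rng.choice(np.arange(1, T + 1), size=N, p=step_probs / step_probs.sum())
    transition_times = set(tau.tolist())                 # at most min(N, T) distinct steps need a call
    x_t = x_T
    for t in range(T, 0, -1):
        if t in transition_times:
            x_0_pred = predict_x0(x_t, t)                # the only place the network is evaluated
            x_t = update_tokens(x_t, x_0_pred, t, tau)   # reveal exactly the tokens with tau == t
        # otherwise x_{t-1} = x_t and the step is skipped for free
    return x_t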


Q2. For the absorbing process, the proposed DNDM sampling algorithm is equivalent to the original sampling algorithm of the Markov process.

A2. We believe this is a misunderstanding. While Eq.(1) and Eq.(6) are equivalent for absorbing diffusion, our DNDM algorithm is fundamentally different from the original sampling algorithm. As explained in Section 3, the key innovation of our approach is determining the transition times at the beginning of the algorithm. This significantly reduces the number of function evaluations from T (time steps) to the number of transition times, which is typically much smaller than T. In contrast, the original absorbing diffusion method [1] requires a function evaluation at every step, resulting in T function evaluations. This difference leads to substantial computational savings in our approach.

[1] Austin et al. "Structured denoising diffusion models in discrete state-spaces." NeurIPS 2021.


Q3. The proposed non-Markov method of multinomial diffusion is the same as the method in DDIM[1], Appendix A but with a different formulation.

A3. While the DDIM paper [1] proposed a model for discrete diffusion in its appendix, our approach differs significantly. DDIM's discrete process is still randomized, as whether $x_{t-1}$ equals $x_0$ or $x_t$ is controlled by some latent random variables (those random variables are actually analogous to the transition times $\tau$ in DNDM). Our method, in contrast, offers full de-randomization using the transition time argument $\tau$, with only one transition per token occurring during our sampling process. Crucially, the introduction of transition time in our derandomized process allows DNDM to achieve faster sampling speed under the same number of sampling steps, a feature not reported in DDIM. Furthermore, our work is specifically designed for discrete spaces, providing a comprehensive framework and detailed theoretical analysis connecting finite and infinite step sampling. These key differences underscore that our method is a novel and significant contribution to discrete diffusion models, distinct from DDIM's approach. We will add a remark to clarify the relationship between our approach and DDIM's, emphasizing our method's unique features and empirical advantages in the discrete diffusion setting.

[1] Song et al. "Denoising diffusion implicit models," ICLR2021.


Q4. In lines 281-282, the authors denote RDM and RDM-k as the sampling method with and without top-k selection. Is this a typo?

A4. Thank you for pointing out this typo. We will fix it in the revision.


Q5. In the experiment, do you use a pre-trained model or train it yourself?

A5. As detailed in Appendix F (Experiment details), our approach varied based on the experiment type. For conditional discrete sampling experiments, we utilized pre-trained models (saved checkpoints) provided by the original authors when available to ensure fair comparison [1]. For continuous sampling experiments, no pre-trained checkpoints were available, so we trained the corresponding models ourselves.

[1] Zheng et al. "A reparameterized discrete diffusion model for text generation." arXiv 2023.


Q6. The authors claim that they use a model consisting of an encoder and a decoder. This is confusing. Do you remove the causal mask in the transformer decoder? For unconditional text generation, what is the input of the encoder?

A6. We use different model architectures for conditional and unconditional tasks, but all self-attention blocks within the models are bi-directional and do not use causal masks. This design choice allows each token to attend to both past and future tokens during both training and inference, differentiating discrete diffusion from standard autoregressive models. The use of bi-directional attention in the decoder means the model isn't constrained to generate tokens sequentially, allowing for more flexible and potentially faster generation. For conditional text generation tasks like machine translation, we employ an encoder-decoder architecture. The encoder processes the source text, while the decoder generates the target text. For unconditional text generation tasks like text8 and enwik8, we use a decoder-only architecture similar to GPT models, without an encoder since there's no input sequence to encode - thus, there is no encoder input for these tasks. The 12-layer Transformer mentioned for these experiments refers to this decoder-only model. We will add a more explicit explanation in the experiment settings to distinguish these architectures for different tasks.
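As a concrete illustration of the masking difference (a generic PyTorch-style sketch, not the authors' code): the only change between autoregressive decoding and the bi-directional decoding described above is whether a causal attention mask is supplied to the self-attention blocks.

import torch

def self_attention_mask(seq_len: int, causal: bool):
    """Causal: position i attends only to positions <= i (autoregressive decoding).
    Non-causal: return None, so every token attends to the full sequence (diffusion decoding)."""
    if causal:
        return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    return None  # e.g. passed as attn_mask to torch.nn.MultiheadAttention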

Comment

Thank you for your response. I believe Q1, Q2, and Q3 have been resolved, but I think the issues in W1, W2, and W3 still exist.

R1. I appreciate the author's emphasis that the proposed methods significantly reduce the number of function evaluations from T to much less than T. However, for Markov diffusion, we can also use far fewer than T steps for sampling as follows:

$$p_{\theta}(x_s|x_t) = \int q(x_s|x_t, x_0)\, p_{\theta}(x_0|x_t)$$

where $s < t$. The above expression has been widely used in Markov diffusion without further proof [1, 2]. Therefore, I think that the authors' emphasis on DNDM achieving sampling in fewer than T steps is a minor contribution.

R2. My main concern is that for the absorbing process, regardless of the number of sampling steps (whether equal to or less than T), the proposed DNDM sampling algorithm is equivalent to the original sampling algorithm of the Markov process, as indicated in W1 and elaborated upon in W2. I believe my main concern has not been directly addressed.

R3. Regarding the relationship between DNDM and DDIM, my concern remains unresolved. I believe that DNDM is a special case of DDIM (Appendix A). Specifically, when the hyperparameter $\sigma_t = \frac{1-\alpha_{t-1}}{1 - \alpha_t}$ in DDIM (Appendix A), DDIM and DNDM become equivalent. When $\sigma_t = \frac{1-\alpha_{t-1}}{1 - \alpha_t}$, the probability that $x_{t-1}=x_t$ is $\frac{1-\alpha_{t-1}}{1 - \alpha_t}$, while the probability that $x_{t-1}=x_0$ is $\frac{\alpha_t-\alpha_{t-1}}{1 - \alpha_t}$. In the rebuttal, the authors claim that DDIM and the proposed DNDM are "analogous". But actually, we can prove that they are equivalent based on W2.

[1] Bao et al. "Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models." ICLR2022

[2] He et al. "DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models." ACL2023.

Comment

Thank you for your feedback and for acknowledging the resolution of Q1, Q2, and Q3. We address your remaining concerns regarding W1, W2, and W3 as follows:


W1. I appreciate the authors' emphasis that the proposed methods significantly reduce the number of function evaluations from T to much less than T. However, for Markov diffusion, we can also use far fewer than T steps for sampling as follows: $p_\theta(x_s|x_t) = \int q(x_s|x_t,x_0)\,p_\theta(x_0|x_t)$ where $s < t$. The above expression has been widely used in Markov diffusion without further proof [1, 2]. Therefore, I think that the authors' emphasis on DNDM achieving sampling in fewer than T steps is a minor contribution.

A1. The method you mention is a standard technique to calculate the reverse transition probability between any two time steps $s$ and $t$. While it's true that diffusion models can use $p_\theta(x_s|x_t) = \int q(x_s|x_t,x_0)\,p_\theta(x_0|x_t)$ to accelerate the sampling process, choosing good $s$ and $t$ rigorously while preserving the sample quality is highly nontrivial. For example, if you uniformly downsample to get a set of time steps $0, 2, 4, \dots, T$, you can indeed skip many time steps, but you cannot guarantee the resulting samples are of high quality due to discretization error. Our algorithm for DNDM provides a provable approach to select the sampling steps for each token while maintaining high sample quality.

Instead of uniformly skipping time steps across all tokens, we only skip those deemed unimportant, i.e., not in the transition time set. We denote the transition time for the $n$-th token in the sequence by $\tau_n$, and further denote the transition time set $\mathcal{T} := \{\tau_n\}_{n=1}^{N}$. This set captures the key time step at which each token transitions from noise to the target distribution. Given the transition times $\tau_n \in \mathcal{T}$, our DNDM can be written as $x_{t-1,n} = \mathbb{1}(\tau_n = t)\,x_{0,n} + \mathbb{1}(\tau_n \neq t)\,x_{t,n}$. Our algorithm can be written as

def sample(x_t, t, transition_times):
    # Only update tokens at their specific transition times; otherwise carry x_t forward.
    if t in transition_times:
        x_0_pred = predict_x0(x_t, t)          # neural network forward pass
        x_prev = update_tokens(x_t, x_0_pred, t, transition_times)
    else:
        x_prev = x_t                           # no transition at this step: skip the network call
    return x_prev

In conclusion, although sampling with fewer than T steps is crucial for accelerating reverse sampling, simply reducing the number of time steps does not necessarily preserve sample quality. Our DNDM offers a rigorous and adaptive method for reducing sampling steps by precomputing transition times and overcoming the limitations of uniform downsampling or other heuristic approaches.


Comment

W3. Regarding the relationship between DNDM and DDIM, my concern remains unresolved. I believe that DNDM is a special case of DDIM (Appendix A). Specifically, when the hyperparameter $\sigma_t = \frac{1-\alpha_{t-1}}{1-\alpha_{t}}$ in DDIM (Appendix A), DDIM and DNDM become equivalent. When $\sigma_t = \frac{1-\alpha_{t-1}}{1-\alpha_{t}}$, the probability that $x_{t-1} = x_t$ is $\frac{1-\alpha_{t-1}}{1-\alpha_{t}}$, while the probability that $x_{t-1} = x_0$ is $\frac{\alpha_{t} - \alpha_{t-1}}{1-\alpha_t}$. In the rebuttal, the authors claim that DDIM and the proposed DNDM are "analogous." But actually, we can prove that they are equivalent based on W2.

[1] Bao et al. "Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models." ICLR2022

[2] He et al. "DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models." ACL2023.

A3. While there are similarities between DNDM and DDIM (Appendix A), they are fundamentally different models, and DNDM is not a special case of DDIM. DNDM introduces a framework specifically designed for discrete spaces, while DDIM was originally developed for continuous diffusion models. Let me clarify the key differences for multinomial diffusion:

  • DDIM: By Eq. (19) in Appendix A of the DDIM paper, $q(x_{t-1}|x_t, x_0) = \text{Cat}\big(\sigma_t x_t + (\alpha_{t-1} - \sigma_t \alpha_t)x_0 + ((1 - \alpha_{t-1}) - (1 - \alpha_t)\sigma_t)1_K\big)$. Even with $\sigma_t = \frac{1-\alpha_{t-1}}{1-\alpha_t}$, the process remains stochastic: $q(x_{t-1}|x_t, x_0) = \text{Cat}\big(\sigma_t x_t + (1- \sigma_t)x_0\big)$. This means that at every step there is a probability of choosing $x_0$, regardless of whether it has already transitioned to $x_0$ or not. Unlike absorbing discrete diffusion, no [Mask] exists in multinomial diffusion. Therefore, DDIM cannot distinguish whether $x_t$ already equals $x_0$. In particular, although the sampling process becomes less stochastic in the DDIM setting, it will still predict $x_0$ with high probability $1-\sigma_t = \frac{\alpha_{t-1}- \alpha_t}{1-\alpha_t}$.

  • DNDM: Achieves full de-randomization using the transition time $\tau$: $x_{t-1} = \mathbb{1}(\tau = t)x_0 + \mathbb{1}(\tau \neq t)x_{t}$ (Equation 8 in our paper). Here, $\tau$ follows $P(\tau = t) = \alpha_{t-1} - \alpha_t < \frac{\alpha_{t-1}- \alpha_t}{1-\alpha_t}$. Such a crucial difference allows DNDM to achieve full de-randomization once $\tau$ is sampled, leading to a deterministic evolution that DDIM cannot replicate.

  • Sanity Check via a Concrete Example: For sampling $x_1$ based on $x_2$, consider the probability of calling $\hat{x}_0 \sim p_{\theta}(\hat{x}_0|x_t)$. DDIM: $P(\text{call } \hat{x}_0) = \frac{\alpha_{1} - \alpha_2}{1-\alpha_2}$. DNDM: $P(\text{call } \hat{x}_0) = \alpha_1 - \alpha_2$. Crucially, $\alpha_1 - \alpha_2 < (\alpha_1 - \alpha_2)/(1-\alpha_2)$ because $\alpha_2 < 1$ (a worked numeric instance is given below).
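With purely illustrative numbers (not from the paper), take $\alpha_1 = 0.9$ and $\alpha_2 = 0.5$:

$$P_{\text{DDIM}}(\text{call } \hat{x}_0) = \frac{0.9 - 0.5}{1 - 0.5} = 0.8, \qquad P_{\text{DNDM}}(\text{call } \hat{x}_0) = 0.9 - 0.5 = 0.4.$$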

The above illustration shows that DNDM is not a special case of DDIM. We say that DNDM is analogous to DDIM because both of them are Non-Markov models. We will add a discussion to clarify this point in the revision to avoid any confusion.


Comment

Thank you for your response!

I understand that DNDM is not a special case of DDIM (Appendix A), thanks! I believe that a comparison between DNDM and DDIM experiments will be very interesting.

For the absorbing process, the author provides a detailed explanation of the differences between the proposed DNDM and the original sampling algorithm of the Markov process. For each sampling step (e.g., $p_{\theta}(x_s|x_t)$, $s<t$), the DNDM and D3PM methods are consistent. The advantage of the proposed DNDM lies in how it selects which time steps should be omitted. It's important to emphasize that this advantage of DNDM does not stem from the non-Markov process as claimed by this paper. For the absorbing process, the Markov and non-Markov processes are entirely identical.

In both this paper and references [1, 2], the absorbing process has consistently outperformed the multinomial process and has garnered increasing attention. Therefore, I hope that the claim in this paper that the non-Markov process can accelerate sampling of absorbing diffusion does not mislead the community, and I remain inclined to reject it.

This paper presents promising experimental results. If the authors can accurately explain the reasons behind the effectiveness of DNDM, which would require substantial revisions, it will undoubtedly significantly enhance the quality of the paper.

[1] Austin et al. "Structured denoising diffusion models in discrete state-spaces." NeurIPS 2021.

[2] Lou et al. "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution." ICML2024.

Comment

Thank you for your further feedback and for acknowledging the technical and empirical contributions of our paper. We're glad that we have resolved all of your misunderstandings and questions.

Note that the modifier Non-Markov in our algorithm name refers to both the time-invariant noise $w$ and the predetermined transition time $\tau$ (see Eq. (7) in the paper). To emphasize the importance of the predetermined transition time in DNDM, we will add a sentence in the abstract: "The proposed DNDM model naturally induces the predetermined transition time set, which enables a training-free sampling algorithm that can accelerate a large family of discrete diffusion models." We will also highlight this in the paper.

While we believe we have clearly explained the reasons behind the effectiveness of DNDM in Sections 3.1 and 3.2 with comprehensive pseudo-algorithms in the paper (see Algorithms 1-4), we will also incorporate the clarifications from our rebuttal discussion into the final version.

Given that we have resolved all of your concerns and questions in your review, and we believe the promised changes would not be a major revision, we would greatly appreciate it if you could consider raising your score in light of these points. In particular, a score of 3 indicates 'technical flaws, weak evaluation, inadequate reproducibility, and/or incompletely addressed ethical considerations.' We believe our paper does not fit this description and deserves a higher rating.

Comment

Thank you for incorporating the clarifications from our rebuttal discussion into the final version. I believe this is crucial for clearing up any misunderstandings for the readers.

However, I find using the modifier "Non-Markov" completely unacceptable. This term appears frequently in the title, abstract, and throughout the main text. As I have repeatedly emphasized in our previous discussions, the original Markov diffusion and the proposed non-Markov diffusion are identical for the more effective and widely recognized absorbing process. The non-Markov assumption is redundant and misleading.

After several days of careful consideration and thoroughly reading the comments from other reviewers, I still lean towards rejecting this paper.

Comment

W2. My main concern is that for the absorbing process, regardless of the number of sampling steps (whether equal to or less than T), the proposed DNDM sampling algorithm is equivalent to the original sampling algorithm of the Markov process, as indicated in W1 and elaborated upon in W2. I believe my main concern has not been directly addressed.

A2. First of all, let's recall the forward processes of D3PM and DNDM:

$$x_{t} = b_{t}x_{t-1} + (1-b_t)w_t, \quad \forall t = 1, \dots, T \qquad \text{(D3PM, Eq. 1)}$$

$$x_{t} = b_{t}x_{t-1} + (1-b_t)w, \quad \forall t = 1, \dots, T \qquad \text{(DNDM, Eq. 6)}$$

The only difference between Equation 1 and Equation 6 is $w_t$ vs. $w$. Since for absorbing diffusion $w_{t} = w = [\text{Mask}]$, D3PM and DNDM are indeed equivalent. However, for multinomial diffusion or other diffusion processes, $w_t \neq w$, so D3PM and DNDM are different.

In addition, even for absorbing diffusion, our proposed reverse sampling algorithm for DNDM is still different from that for D3PM. To elucidate the key differences between the sampling algorithm in DNDM and that in D3PM for absorbing diffusion, let's directly compare the algorithms:

  • For the D3PM-Absorb algorithm: We begin with an all-$[M]$ sequence. At each time step $t$, we sample $x_0 \sim p_{\theta}(x_0|x_t)$. If $x_t=[\text{Mask}]$, $x_{t-1}$ transitions to $[\text{Mask}]$ with probability $(1-\alpha_{t-1})/(1-\alpha_t)$ and to $x_0$ with probability $(\alpha_{t-1} - \alpha_t)/(1-\alpha_t)$. If $x_{t}\neq [\text{Mask}]$, it remains unchanged.

  • For the DNDM-Absorb algorithm: We also start with an all-$[\text{Mask}]$ sequence, but crucially, we first determine the transition time set. During sampling, if $x_t=[\text{Mask}]$, the transition probabilities for $x_{t-1}$ are identical to D3PM. However, we only sample $x_0 \sim p_{\theta}(x_0|x_t)$ when at least one token needs to change, as determined by our pre-computed transition set. This selective sampling is the key to our algorithm's efficiency.

Therefore, you can see that DNDM will skip many steps during the sampling process to avoid function evaluation and save computational cost. A natural question is how many time steps can be skipped. Let's do the calculation as follows.

For a specific time $t$ and token position $n$, the token will change at time $t-1$ only if:

  • It hasn't already changed (probability: $\prod_{s=T}^{t}\frac{1-\alpha_s}{1-\alpha_{s+1}} = 1-\alpha_t$)

  • It will transfer to $x_0$ (probability: $\frac{\alpha_{t-1} - \alpha_t}{1-\alpha_t}$)

Thus, the probability of the $n$-th token changing at time $t-1$ is $(1-\alpha_t)\cdot \frac{\alpha_{t-1}-\alpha_t}{1-\alpha_t} = \alpha_{t-1} - \alpha_t$. Consequently, the probability that no tokens change at time $t$ for the entire sequence is $\big(1 - (\alpha_{t-1}-\alpha_{t})\big)^{N}$, where $N$ is the sequence length. These are precisely the time steps that our DNDM algorithm will skip to save computational time, unlike D3PM, which does a function evaluation at every time step.
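As a back-of-the-envelope check of how much this saves (assuming, purely for illustration, a linear schedule $\alpha_t = 1 - t/T$ with $T = 1000$ and sequence length $N = 50$, not the schedule from the paper):

# Expected number of network calls = sum over t of P(at least one token transitions at step t).
T, N = 1000, 50                                  # timesteps and sequence length (illustrative)
p_step = 1.0 / T                                 # alpha_{t-1} - alpha_t for the linear schedule
expected_calls = T * (1 - (1 - p_step) ** N)     # ~48.8, i.e. close to N and far below T
print(expected_calls)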

To sum up, even though the forward process of DNDM is the same as that of D3PM for absorbing diffusion, our DNDM approach introduces a clever and provable algorithm design in the sampling process by pre-computing the transition time set and selectively applying function evaluations. This distinguishes DNDM from the D3PM algorithm, offering a more computationally efficient approach to inference in discrete diffusion.


Comment

Thank you for your candid feedback and for acknowledging the improvements in our paper.

Regarding your remaining concern 'Non-Markov', we would like to provide further clarification on why this terminology is accurate and necessary for our work:

  • Our model handles a broader set of discrete diffusion models beyond the absorbing process. In general cases, including multinomial diffusion, our forward process is non-Markovian.

  • The Non-Markov modifier refers to both the time-invariant noise $w$ and the predetermined transition time $\tau$ (Eq. 7). Only when $w = [\text{Mask}]$, which is deterministic, would the distribution for the absorbing state become Markovian. However, the non-Markovian nature is still fundamental to our DNDM model's full generality, and it provides readers with an accurate understanding of the process's properties and how transition time gets introduced.

  • The use of 'Non-Markov' in our terminology aligns with similar practices in the field, accurately highlighting key characteristics of our model that deviate from strict Markovian properties, even though DNDM can degenerate to a Markovian process under specific settings. In the DDIM framework, for instance, when $\sigma_t = \sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\sqrt{1-\alpha_t/\alpha_{t-1}}$, the diffusion process becomes Markovian, and the forward/generative process becomes a DDPM. Similarly, our use of 'Non-Markov' emphasizes the general case while acknowledging special conditions where Markovian properties may emerge.

The added clarifications in our paper should help readers understand the specific characteristics of our approach and its relationship/differences to Markovian processes.

With that being said, we are open to changing the term 'Non-Markov' if it would persuade you to raise your rating to the acceptance level. If you have any suggestion for the replacement of 'Non-Markov', we would be happy to take it.

Review (Rating: 5)

The paper introduces a discrete non-Markov diffusion model (DNDM) aimed at accelerating the sampling process in discrete diffusion models. The proposed method reduces the number of neural network function evaluations to speed up the sampling process while maintaining sample quality. The paper explores the transition from finite to infinite step sampling, providing new insights into bridging the gap between discrete and continuous-time processes. Experiments on natural language generation and machine translation tasks illustrate the competitive performance of the method in terms of speed and quality compared to existing methods.

Strengths

  • The introduction of a discrete non-Markov diffusion model (DNDM) provides a new method for accelerating the sampling process in discrete diffusion models in a training-free manner. It reduces the number of neural network function evaluations, improving sampling efficiency with roughly a 3x speedup at 50 steps.
  • The authors conducted extensive experiments on natural language generation and machine translation tasks, demonstrating the effectiveness of the proposed method, for both multinomial and absorbing diffusions.

Weaknesses

  • The method involves a complex process that might be challenging to easily follow and implement. More details or visualizations on how the transition time distribution is determined and whether it can be adapted for different types of discrete diffusion models will be helpful for a better understanding of the motivation and methodology.
  • The comparison with other acceleration methods is not very convincing, especially at a practical smaller number of sampling steps. Instead of the RDM baseline, how does the proposed method compare with other existing acceleration techniques for discrete diffusion models in terms of both efficiency and quality?
  • While the method is tested on natural language generation and machine translation tasks, its applicability to other modalities such as image or video generation is unknown, which might limit the scope of the proposed method.

Questions

I suggest addressing the concerns in the weaknesses section.

Limitations

yes

Author Response

Thank you for your support. Below, we address the questions.


Q1. The method involves a complex process that might be challenging to easily follow and implement. More details or visualizations on how the transition time distribution is determined and whether it can be adapted for different types of discrete diffusion models will be helpful for a better understanding of motivation and methodology.

A1. We thank the reviewer for their feedback. However, we respectfully disagree that our method is overly complex or challenging to implement. We have taken several steps to ensure clarity and ease of implementation:

  • Training-free approach: Our sampling method is designed to be training-free, making it straightforward to integrate with existing discrete diffusion models.

  • Detailed algorithms: We have already included comprehensive pseudo-algorithms in the paper (see Algorithms 1-4) to guide implementation.

  • Discussion about transition time distribution: We have provided a detailed explanation and ablation study in Appendix C, which thoroughly covers how the transition time distribution is determined and its impact on performance. Figure 3 in the paper provides clear visualizations of different distribution types (e.g., linear, cosine, Beta) to aid understanding.

  • Adaptability: Our method can be adapted to different types of discrete diffusion models, including the most popular multinomial and absorbing diffusions, the results of which are demonstrated in Sections 4.1 and 4.2.

  • Ease of implementation: Our method can be implemented in just a few lines of code. For example, the core sampling logic can be expressed in the following block, where the key intuition is the condition line: "if t in transition_times:". It determines whether to update tokens at a given time step, significantly reducing computation while maintaining quality.

def sample(x_t, t, transition_times):
    # Call the network only at transition times; otherwise carry x_t forward unchanged.
    if t in transition_times:
        x_0_pred = predict_x0(x_t, t)
        x_prev = update_tokens(x_t, x_0_pred, t, transition_times)
    else:
        x_prev = x_t
    return x_prev

We believe these points demonstrate that our method, while mathematically sophisticated, is conceptually simple and easy to implement. However, if the reviewer still finds additional clarification necessary, we are open to adding more visualizations or explanations in the revised version.


Q2. The comparison with other acceleration methods is not very convincing, especially at a practical smaller number of sampling steps.

A2. Our method, DNDM, consistently shows acceleration across various step sizes, particularly for moderate to high step counts. For example, in Tables 2 and 3, we demonstrate significant speedups for 50 and 1000 steps across different datasets (IWSLT14, WMT14, WMT16). Even at 25 steps, we still see improvements in most cases. Below is an additional experiment for IWSLT14 with fewer sampling steps in an absorbing generation process (additional results for Table 3 in our submission). As we can see from the table, when the sampling step count is significantly below 25, the quality of the generated examples does not match those with larger sampling steps (per Table 3, at step counts of 25 or higher, performance remains consistently above 31.5 BLEU for both RDM-Absorb and DNDM-Absorb).

Steps | RDM-Absorb BLEU | RDM-Absorb Time (s) | DNDM-Absorb BLEU | DNDM-Absorb Time (s)
5     | 29.92           | 27.36               | 30.30            | 24.1
10    | 31.02           | 48.08               | 31.49            | 37.9
15    | 31.26           | 69.22               | 32.13            | 50.0

It's important to note that performance in low-step settings poses a significant challenge for discrete diffusion models, which is beyond the scope of our current work. Our primary focus is on accelerating the algorithm while maintaining good performance—a crucial factor for numerous real-world applications that demand high-quality sample generation. Our comprehensive experiments demonstrate that DNDM offers competitive speed-quality trade-offs across a broad spectrum of step sizes, with its performance notably excelling as the number of steps increases.


Q3. Instead of RDM baseline, how does the proposed method compare with other existing acceleration techniques for discrete diffusion models in terms of both efficiency and quality?

A3. Our work introduces the first training-free acceleration technique specifically designed for finite-step discrete diffusion models such as D3PM and multinomial diffusion, filling a gap in the field.

We chose RDM as our primary baseline due to its state-of-the-art results in discrete diffusion. Given the novelty of our method in the context of finite-step discrete diffusion models, direct comparisons with other acceleration techniques are limited. Our focus was on developing and demonstrating the effectiveness of this training-free approach, which opens up new possibilities for accelerating discrete diffusion models.


Q4. The method's applicability to other modalities, such as image or video generation, is unknown, which might limit its scope.

A4. While our current work focuses on text generation, DNDM's core principles have the potential for broader applicability. The method is designed for discrete data, which naturally suits text but can also apply to other modalities through appropriate discretization or quantization. Numerous follow-up studies can benefit from our fast sampling algorithm for tasks such as Electronic Health Record (EHR) data generation and protein sequence generation. These areas all involve discrete data structures that could benefit from our acceleration technique.

Comment

Thank you for your valuable feedback. We sincerely hope that we have adequately addressed your questions and concerns. Specifically,

  • We have elaborated on DNDM's simplicity and ease of implementation. As a training-free approach, DNDM integrates seamlessly with existing discrete diffusion models. We've provided comprehensive pseudo-algorithms in the paper and detailed explanations of the transition time distribution in Appendix C and Figure 3.

  • Regarding DNDM's performance across various step sizes, we've demonstrated its effectiveness for both moderate and high step counts. Our additional experimental results for IWSLT14 with fewer sample steps demonstrate DNDM's capabilities even at lower step counts.

  • We've added the comparison with other acceleration techniques by highlighting DNDM's unique feature as the first training-free acceleration method specifically designed for finite-step discrete diffusion models. We've also discussed its potential applicability to other discrete data modalities beyond text generation.

We sincerely hope our response adequately addresses your questions and provides clarity on our method. Thank you for your time and careful consideration of our work.

Comment

Thank you to the authors for the detailed response on the simplicity and ease of implementation of the proposed method, as well as the additional experiments in the small number of sampling step regimes. It would still be interesting to see comparisons with other fast sampling methods beyond the RDM baseline, such as DDIM or others. Additionally, while I appreciate the novelty of the approach, there are already many training-free acceleration methods in the literature, even though they may not have been tested in the specific problem setting of this paper. Therefore, I suggest toning down the claim of being the "first training-free acceleration method." After reading other reviews and careful consideration, I prefer to keep my original rating.

Comment

Thank you for your valuable feedback and for taking the time to consider our response. We sincerely appreciate your acknowledgement of our improvements on the simplicity and ease of implementation of the proposed method, as well as the additional experiments with smaller sampling steps. We would like to address your remaining concerns as follows:

  • We appreciate your point about the phrasing of our novelty claims during previous discussion. We agree to tone down any perceived claim of being the "first training-free acceleration method". To clarify, we developed a training-free acceleration method specifically designed for finite-step discrete diffusion models. We'd like to note that we didn't include such a phrase as "first training-free acceleration method" in our submission.
  • Additional Experiment on DDIM Appendix A: Our focus was on accelerating finite-step discrete diffusion models, which differ fundamentally from continuous diffusion models. Techniques primarily designed for continuous diffusion are not directly applicable to discrete diffusion, like multinomial diffusion and absorbing diffusion. While DDIM proposes a version for multinomial diffusion in Appendix A, it doesn't consider the transition time or provide any code or experiments. Inspired by your feedback, we've implemented DDIM Appendix A ourselves and conducted additional experiments. We've included results both with and without top-k sampling (denoted by '-k' in the table headers). Below are the results for IWSLT14 with a wide range of sample steps in multinomial diffusion generation (These could serve as additional results for Table 2 in our submission):
Steps | DDIM-multi BLEU | DDIM-multi Time (s) | DNDM-multi BLEU | DNDM-multi Time (s)
5     | 28.88           | 30.2                | 28.04           | 28.1
10    | 30.46           | 55.8                | 30.57           | 44.4
15    | 30.87           | 80.1                | 30.77           | 50.6
25    | 31.30           | 130.4               | 30.95           | 52.9
50    | 31.63           | 257.2               | 31.45           | 83.9
1000  | 31.79           | 5064.8              | 31.82           | 191.3

Steps | DDIM-k-multi BLEU | DDIM-k-multi Time (s) | DNDM-k-multi BLEU | DNDM-k-multi Time (s)
5     | 28.93             | 31.1                  | 30.30             | 28.3
10    | 30.69             | 56.7                  | 31.79             | 44.3
15    | 30.85             | 81.5                  | 32.14             | 50.1
25    | 31.38             | 132.7                 | 32.30             | 52.6
50    | 31.64             | 260.1                 | 32.80             | 93.2
1000  | 31.89             | 5121.3                | 33.15             | 191.5

These results suggest that compared with DDIM (Appendix A), DNDM achieves comparable or better performance with reduced computation time across various step sizes, with the advantage becoming more significant as the number of steps increases. This scalability makes our approach particularly valuable for tasks requiring higher quality generation and more computational steps.

We appreciate your careful consideration and are open to suggestions on how to better position our work within the broader context of diffusion model acceleration techniques. Thank you again for your valuable feedback.

Comment

Dear Reviewers,

Thanks for writing your reviews of the paper. Now the authors' rebuttals are in. Please go through them and see if they have addressed your questions. Please start discussions with the authors if you have further comments.

Regards, AC

Final Decision

The research presents a new approach called DNDM, designed to make sampling in discrete diffusion models faster. This is achieved by decreasing the number of times the neural network needs to be evaluated, thereby speeding up the process without sacrificing the quality of the generated samples. The paper is well written with thorough experiments showing the effectiveness of the proposed approach. The proposed method is an important contribution to the community.