PaperHub
Overall rating: 6.3 / 10 (Poster; 3 reviewers; min 6, max 7, std 0.5)
Individual ratings: 7, 6, 6
Confidence: 3.3
COLM 2024

A Reparameterized Discrete Diffusion Model for Text Generation

OpenReview · PDF
Submitted: 2024-03-22 · Updated: 2024-08-26
TL;DR

This work develops a reparameterized discrete diffusion model that implies effective training and sampling algorithms and delivers significant performance improvements for text generation.

Abstract

Keywords
discrete diffusion, text generation, non-autoregressive generation

Reviews and Discussion

Official Review
Rating: 7

The authors suggest a reparameterized discrete diffusion model to improve the performance of diffusion models in natural language processing. The proposed reparameterization enables more effective training as well as more effective sampling of the model. The authors evaluated the proposed method on machine translation, question generation, and paraphrasing. Using the proposed method, the authors could significantly improve the performance of existing diffusion models in terms of translation quality. However, the comparison to autoregressive models lacks recent advances: the model is compared only to the baseline Transformer model from 2017 and not to state-of-the-art models (e.g., those submitted to WMT). The authors nicely investigate the influence of the different training and decoding methods independently, and the influence on speed is also shown. However, it is unclear why the proposed model with 20 iterations is faster than the baseline with 10. Furthermore, the authors compare against a GPT-2 base model, but not against the Transformer baseline used in the quality comparison.

Reasons to Accept

The authors present an interesting idea to improve the quality of diffusion models for text generation. The idea is evaluated on several tasks and shows good performance and significant improvements in quality.

Reasons to Reject

The evaluation of the proposed method should be improved with respect to the autoregressive baselines.

Author Response

Thank you for your valuable feedback! We appreciate the recognition of our work's potential in improving the quality of diffusion models for text generation.

Q1: However, it is unclear why the proposed model with 20 iterations is faster than the baseline with 10? Furthermore, the authors compared to a GPT-2 base model, but not to the Transformer baseline used in the quality comparison.

A1: The baseline with 10 steps is slower than our model with 20 iterations due to additional intermediate processing steps required by the baseline model (DiffuSeq). Additionally, we have compared our model to transformer baselines in our main translation results, as detailed in Table 1 of our paper.

Q2: The evaluation of the proposed method should be improved with respect to the autoregressive baselines.

A2: We demonstrate that our discrete diffusion models (RDMs) not only achieve competitive performance compared to autoregressive baselines but also offer greater flexibility in text generation. Our experiments show faster generation with fewer decoding steps (Figure 2b) and support for non-autoregressive generation (Tables 4, 11, 12, 13, and 14).

It is also worth noting that most non-autoregressive text generation models (including most diffusion models) usually lag behind conventional autoregressive baselines in terms of performance. However, our RDMs still exhibit promising results and effectively bridge this performance gap.

We hope these points effectively address the concerns raised and highlight the strengths and contributions of our work.

Comment

Thanks for the clarification; the first point is clear now. Sorry for being unclear about the GPT-2 and Transformer comparison: it is unclear why you only compared against the Transformer base model in Table 1 in terms of quality, but did not compare its speed against your work; this is only done for GPT-2. Even though we do not expect a non-autoregressive approach to outperform autoregressive ones, it is important to do a fair comparison to estimate the difference.

Comment

Thank you for your clarification! Our preliminary experiments indicated that encoder-decoder Transformers with a similar parameter count to GPTs achieve comparable or sometimes even lower throughput, especially when generating sequences shorter than 128 tokens. We will include the results and analysis of encoder-decoder baselines in the revised manuscript and provide a more comprehensive evaluation of our models. We are grateful for your engagement in the rebuttal process.

Official Review
Rating: 6

This paper proposes to do discrete diffusion by explicitly modelling which tokens are noise and which tokens are correct. At x_0 all the tokens are correct, and at x_T all the tokens are noise. In between, they explicitly model the transitions to and from correct tokens and noise tokens (the v variables). They derive a lot of math about these transition distributions and the associated token generation distributions, and show that it has some nice properties for training. At generation time, they add an easy-first heuristic to decide which tokens should (stochastically) transition from noise to correct.

Or, at least, that is how I understand it, which could be over-simplistic. The paper is written in the opposite order, and I didn't read it backwards.
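To make this reading concrete, the following is a minimal sketch of one easy-first reverse step as described above; the function name, the confidence-based scoring, and the deterministic top-k selection are illustrative assumptions, not the paper's actual algorithm (which, per the description, transitions tokens stochastically).

```python
import torch

def easy_first_reverse_step(logits, x_t, noise_mask, k):
    """Hypothetical sketch of one reverse (denoising) step.

    logits:     [seq_len, vocab_size] model predictions of the clean tokens
    x_t:        [seq_len] current, partially noisy token ids
    noise_mask: [seq_len] bool, True where the token is still "noise"
    k:          number of noise tokens to turn into "correct" tokens this step
    """
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)  # per-position confidence and predicted token

    # Easy-first: among the positions still marked as noise, pick the k
    # positions the model is most confident about (a deterministic stand-in
    # for the stochastic transition described in the review).
    conf = conf.masked_fill(~noise_mask, float("-inf"))
    num_noisy = int(noise_mask.sum())
    denoise_idx = conf.topk(min(k, num_noisy)).indices

    x_next = x_t.clone()
    x_next[denoise_idx] = pred[denoise_idx]  # noise -> correct transition
    new_mask = noise_mask.clone()
    new_mask[denoise_idx] = False
    return x_next, new_mask
```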

The empirical evaluations show consistent improvements when adding their methods to previous discrete diffusion models. An ablation study shows that both the easy-first decoding strategy and the new loss function are responsible for improvements.

Reasons to Accept

The idea is interesting and there are lots of nice derivations (which I didn't check). Using an easy-first heuristic to do non-autoregressive generation is a good idea, and they propose a diffusion model which allows them to do that (which is not trivial). They show strong improvements when they add this idea to other discrete diffusion models, and show that both the modified decoding strategy and the modified loss result in improvements, individually and together.

Reasons to Reject

It was very hard to follow the presentation of the model. Diffusion is often complicated, but I think this was unnecessarily difficult to read. Take for example section 4.3. This section is supposed to be about decoding, but it spends a lot of time talking about training objectives. Then it assumes that you know the ground truth. Then it says that they compute the ground truth from the sampled v variables. This is the second-to-last paragraph before the experiments, on page 6, and it was only at this point that I understood what they are trying to model. Then I could go back and understand the paragraph before that (what I call the easy-first heuristic), at which point the paper started to make sense.

As another example of the poor presentation, on page 5 the paper says "So far, we mainly consider the case of diffusing over a single token", after 2.5 pages discussing the model without mentioning this fact. At the end of section 2 they even say "for each token", while this section only makes sense "for a given token".
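(For reference, the point at issue is the standard token-wise factorization assumed in discrete diffusion, under which the corruption process, and hence the single-token derivations, factorizes independently over the N positions of a sequence; the notation below is ours, not necessarily the paper's.)

```latex
q(x_t \mid x_0) \;=\; \prod_{i=1}^{N} q\!\left(x_t^{(i)} \,\middle|\, x_0^{(i)}\right)
```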

Going back to the introduction, I can see that there is a summary of the main idea. But it is worded in such a vague ambiguous way that I couldn't understand it when I read the introduction.

I think that the paper should really be rewritten before it is published. It is possible to fix these issues for the camera-ready, but it would be a lot to expect.

Questions for the Authors

Do you think that, in time for the camera-ready version, you can add a precise but intuitive explanation of each contribution before going into the details?

Author Response

Thank you for your valuable feedback! We appreciate your positive remarks about the novelty and effectiveness of our approach. We also recognize the concerns raised regarding the presentation and clarity of our paper.

  • We will streamline Section 4 and add a more intuitive summary before detailing the formulated diffusion process. The discussion of training objectives will be moved to Section 4.2 to maintain a logical flow and avoid confusion.
  • Your note about the late mention of “diffusion over a single token” is well taken! We will fix the conflicts that lead to confusion and revise the wording to emphasize this concept earlier in the paper, providing a clear context before delving into the detailed description.
  • We will address the issue of scattered wording in the introduction of the main idea and rewrite this section to provide a more precise and intuitive explanation of our contributions.

We are committed to addressing these issues comprehensively in the camera-ready version. Thanks again for your feedback!

Comment

The rebuttal demonstrates an understanding of my concerns and willingness to address them. But I will leave it to the AC to decide whether this level of promised modifications in the camera-ready version is acceptable.

Comment

Thank you for your engagement in the rebuttal process! Should you have further queries, we are more than willing to provide additional clarifications. We are grateful for your valuable time to review our work.

Official Review
Rating: 6

This paper presents a reparameterization of discrete diffusion that can be applied on top of existing diffusion architectures. In this work, the authors use it to train diffusion language models. The primary contribution lies in rearranging the term q(x_{t-1} | x_t, x_0), which allows defining it as a sum of two terms: one involving staying at the current token with some probability if it is close to the target, and the other simply adding noise at the current step. Based on this observation, the authors redefine the training process and derive a loss function which they show is similar to a cross-entropy-based language modeling loss. The resulting sampling algorithm requires far fewer iterations than baselines.
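A schematic way to write the decomposition described above (the mixing weight λ_t and the branch distributions are illustrative placeholders, not the paper's exact notation):

```latex
q(x_{t-1} \mid x_t, x_0)
  \;=\; \lambda_t \, q_{\text{denoise}}(x_{t-1} \mid x_0)
  \;+\; \left(1-\lambda_t\right) q_{\text{noise}}(x_{t-1} \mid x_t)
```

Here λ_t would depend on whether the current token x_t already matches the target x_0: the first branch keeps or recovers the target token, while the second re-injects noise. It is this routing between the two branches that the review connects to the cross-entropy-like training loss.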

With this approach, the authors train discrete diffusion models with two kinds of noise formulation (namely multinomial and absorbing) on translation, paraphrasing, and question generation. Analysis shows that this model is much faster than other diffusion-based baselines.

Reasons to Accept

  1. All theoretical contributions are non-trivial and interesting: the reparameterization, the new training loss, and the cross-entropy formulation. They are clearly explained, and the method section is easy to follow.
  2. The improved decoding speed owing to needing fewer iterations is a huge benefit of this approach as diffusion models are known to be super-slow for language generation.
  3. The improvements in performance over other considered diffusion models are significant.

Reasons to Reject

  1. Several diffusion-based baselines are missing [1, 2]. A quick skim through these papers shows that these baselines offer better performance on translation, QG, and QQP than the proposed method. These approaches, however, are not discrete, and the reparameterization in its current form cannot be applied to them.
  2. There is no performance gain over autoregressive models. In addition, a general weakness of discrete diffusion models is that they do not allow adding control via guided diffusion.

[1] TESS: https://arxiv.org/pdf/2305.08379

[2] AR-Diffusion: https://arxiv.org/pdf/2305.09515

Questions for the Authors

  1. The tasks the authors consider involve reasonably short sequences (length 256); will the speed improvements hold for longer sequences?

Details of Ethics Concerns

NA

Author Response

Thank you for your thoughtful and constructive feedback on our paper! Below, we provide our responses to the points mentioned in the review:

Q1: Several diffusion-based baselines are missing [1,2]

A1: We appreciate the suggestion to include a broader set of baselines for comparison. However, the baselines referenced

  • [1] starts fine-tuning from pretrained language models (e.g., RoBERTa) and uses many more inference steps (e.g., 1000 steps), while our approach trains from scratch and decodes within 25 steps.
  • [2] generates a substantially larger number of samples (50 to 500) for selection when evaluating task performance, whereas our method uses at most 10 samples. These differences in setup make it challenging to perform a direct, fair comparison with these baselines.

We appreciate this feedback and will address these points in the revision.

Q2: There is no performance gain over autoregressive models.

A2: It is worth noting that most non-autoregressive text generation models (including most diffusion models) usually lag far behind conventional autoregressive baselines in terms of performance. However, our RDMs still exhibit promising results and effectively bridge this performance gap to autoregressive models. In addition, the RDM family offers greater flexibility in text generation, such as faster generation with fewer decoding steps (Figure 2b) and non-autoregressive generation (e.g., Tables 4, 11, 12, 13, and 14).

Q3: In addition, a general weakness of discrete diffusion models is they do not allow adding control via guided diffusion.

A3: While it is true that discrete diffusion models currently do not allow gradient-based control, these control signals can instead be materialized directly into the context (similar to the approach used in DiffuSeq [3]). This is analogous to prompting with instruction-tuned language models, with the added advantage that control signals can be inserted anywhere in the sequence thanks to the non-autoregressive nature of diffusion models.

[1] Mahabadi, Rabeeh Karimi, et al. "Tess: Text-to-text self-conditioned simplex diffusion." arXiv preprint arXiv:2305.08379 (2023).

[2] Wu, Tong, et al. "Ar-diffusion: Auto-regressive diffusion model for text generation." Advances in Neural Information Processing Systems 36 (2024).

[3] Gong, Shansan, et al. "Diffuseq: Sequence to sequence text generation with diffusion models." arXiv preprint arXiv:2210.08933 (2022).

Final Decision

In this paper, the authors derive an alternative yet equivalent formulation for sampling from discrete diffusion processes, leading to a family of reparameterized discrete diffusion models. Extensive experiments on machine translation and text generation demonstrate significant improvements over existing diffusion models. The major concerns from the reviewers are: (1) a lack of comparison with several diffusion-model baselines; (2) performance that is not better than that of autoregressive language models; and (3) the need for improved presentation. The authors have committed to making the necessary revisions to address all these concerns. I recommend acceptance, provided that all promised revisions are incorporated to address the concerns outlined above.