PaperHub

Overall rating: 7.3/10 · Poster · 4 reviewers (scores 4, 4, 5, 5; min 4, max 5, std 0.5) · Confidence: 3.3
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking

OpenReview · PDF
Submitted: 2025-04-06 · Updated: 2025-10-29

Abstract

Keywords
Discrete Diffusion Models, Masked Diffusion Models, Generative Models

Reviews and Discussion

Review (Rating: 4)

This paper introduces MDM-Prime, a partial masking scheme designed to improve Masked Diffusion Models (MDMs). The core innovation is to allow tokens to exist in intermediate states between fully masked and unmasked, which the authors conceptualise as decomposing tokens into finer "sub-token" representations. This approach facilitates a more fine-grained denoising process by enabling the model to make predictions based on partially observed information. Empirically, Prime demonstrates superior performance across tasks, significantly outperforming previous MDM and autoregressive models on the OpenWebText benchmark and achieving competitive results on image generation datasets.

Strengths and Weaknesses

Strengths

Overall, the paper is well-written and presents very simple yet effective ideas. It is very promising that the proposed method outperforms autoregressive models on the OpenWebText dataset and achieves performance comparable to diffusion models on image modelling.

Weaknesses

My main concern lies in the absence of theoretical analysis explaining why partial-mask modelling yields better performance compared to MDM. While the use of sub-tokens enables smoother transitions, the parameterization of $p(y_0 \mid y_t)$ remains independent across different tokens. Does the proposed approach provide a tighter lower bound on the likelihood?

Questions

  • It would be helpful to include a sampling algorithm in the paper for clarity. Specifically, does the posterior $q(x_s^{i,l} \mid x_t^{i,l}, x_0^{i,l})$ follow the same formulation as Eq. (3)? Additionally, how is sampling from $p(x_s \mid x_t)$ performed in practice? Given that $p(x_0^i \mid x_t)$ is no longer independent across sub-tokens, we do not have a closed-form expression for $p(x_s^i \mid x_t^i)$. In this case, is the sampling procedure carried out by first sampling $x_0^i \sim p(x_0^i \mid x_t)$, followed by sampling $x_s^{i,l} \sim q(x_s^{i,l} \mid x_t^{i,l}, x_0^{i,l})$?
  • It would be helpful to include the training curve in the paper. Does the training cost increase as $\ell$ increases? Additionally, how does the training cost of MDM-Prime compare to that of MDM?
  • The paper does not report the model size. Could you clarify whether the model size is similar to that of the baselines?
  • While MDM includes more idle steps, these can be practically skipped as the model's output is not time-dependent, meaning logits do not need recomputation. In this regard, given the same number of sampling steps, MDM-Prime has a higher computational cost than MDM due to having fewer idle steps. This raises a critical question: given the same computational budget, such as FLOPs, at test time, does MDM-Prime still maintain its performance advantage over MDM?
  • In Figures A3 and A4, there are cases where $\ell < 1$ for CIFAR-10. Could you clarify what this means? My understanding is that $\ell = 1$ corresponds to the original MDM parameterisation without sub-tokens.
  • Everything has two sides. While MDM-Prime currently performs very well in terms of likelihood, what do you think are its downsides?

Limitations

Overall, the paper is well-written and the results are impressive, particularly the achievement of outperforming autoregressive models in language modelling. However, some experimental and implementation details are missing, and the paper lacks theoretical analysis to support the proposed method. I would be inclined to raise my score if the authors can address these concerns.

Final Justification

The paper introduces a new masked diffusion model that significantly improves performance. However, it is still unclear why MDM-Prime theoretically achieves a better likelihood. Therefore, I think 4 is an appropriate score.

Formatting Issues

Nope

Author Response

We sincerely appreciate the reviewer’s valuable feedback and questions. The responses to the reviewer’s questions are as follows.


Questions

Q1. My main concern lies in the absence of theoretical analysis explaining why partial-mask modelling yields better performance compared to MDM. While the use of sub-tokens enables smoother transitions, the parameterization of $p(y_0 \mid y_t)$ remains independent across different tokens. Does the proposed approach provide a tighter lower bound on the likelihood?

Response. We thank the reviewer for the comment, and are pleased to share some preliminary thoughts about the potential relationship between the likelihood bounds (Eqs. (4) and (5)).

Both objectives aim to approximate the same target, $-\log p_{\theta}(\boldsymbol{x}_0)$, using a discrete diffusion process with the same scheduling function $\alpha_t$. The key difference lies in the expectations:

$$\mathbb{E}_{q(\boldsymbol{x}_t \mid \boldsymbol{x}_0)}\left[\sum_{i=1}^L \log p^{\text{MDM}}_{\theta}(x_0^i \mid \boldsymbol{x}_t)\right] \quad \text{and} \quad \mathbb{E}_{q(\boldsymbol{y}_t \mid \boldsymbol{y}_0)}\left[\sum_{i=1}^L \log p^{\text{Prime}}_{\theta}(\boldsymbol{y}_0^i \mid \boldsymbol{y}_t)\right].$$

A possible direction for theoretical exploration is to analyze the per-step difference between these expectations:

$$\mathbb{E}_{q(\boldsymbol{y}_t \mid \boldsymbol{y}_0)}\left[\sum_{i=1}^L \log p^{\text{Prime}}_{\theta}(\boldsymbol{y}_0^i \mid \boldsymbol{y}_t)\right] - \mathbb{E}_{q(\boldsymbol{x}_t \mid \boldsymbol{x}_0)}\left[\sum_{i=1}^L \log p^{\text{MDM}}_{\theta}(x_0^i \mid \boldsymbol{x}_t)\right] \triangleq d.$$

If this difference ($d$) is consistently positive, it could suggest that MDM-Prime provides a tighter bound. However, since $p^{\text{MDM}}_{\theta}$ and $p^{\text{Prime}}_{\theta}$ are parameterized by distinct neural networks operating on different input spaces, establishing this relationship analytically is nontrivial and requires specific assumptions about the model families.

To investigate this empirically, we estimated the expectation difference using trained models at a number of discretized timesteps. The results show positive values as follows:

| $t$ | 0.25 | 0.5 | 0.75 | 1.0 |
|-|-|-|-|-|
| $d$ | 199.38 | 208.80 | 59.25 | 11.47 |

We leave a rigorous theoretical analysis that fully explains this behavior for future work.
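
For concreteness, the estimate above can be reproduced with a simple Monte Carlo average over data samples at a fixed timestep. The sketch below is illustrative, not the authors' code: all callables (`q_sample_mdm`, `q_sample_prime`, `mdm_logprob`, `prime_logprob`, `tokens_to_subtokens`) are hypothetical wrappers around the two trained models and their forward masking processes.

```python
import torch

def estimate_d(x0_batch, t, q_sample_mdm, q_sample_prime,
               mdm_logprob, prime_logprob, tokens_to_subtokens):
    """Monte Carlo estimate of the per-step difference d at timestep t (sketch).

    Hypothetical callables:
      q_sample_mdm(x0, t)     -> x_t, a masked token sequence
      q_sample_prime(y0, t)   -> y_t, a partially masked sub-token sequence
      mdm_logprob(x0, x_t)    -> sum_i log p_MDM(x_0^i | x_t), shape (batch,)
      prime_logprob(y0, y_t)  -> sum_i log p_Prime(y_0^i | y_t), shape (batch,)
      tokens_to_subtokens(x0) -> y0, the bijection f applied per token
    """
    y0_batch = tokens_to_subtokens(x0_batch)
    x_t = q_sample_mdm(x0_batch, t)      # x_t ~ q(x_t | x_0)
    y_t = q_sample_prime(y0_batch, t)    # y_t ~ q(y_t | y_0)
    diff = prime_logprob(y0_batch, y_t) - mdm_logprob(x0_batch, x_t)
    return diff.mean().item()
```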


Q2. (a) It would be helpful to include a sampling algorithm in the paper for clarity. (b) Does the posterior $q(x_s^{i,l} \mid x_t^{i,l}, x_0^{i,l})$ follow the same formulation as Eq. (3)? (c) Additionally, how is sampling from $p(x_s \mid x_t)$ performed in practice? Given that $p(x_0^i \mid x_t)$ is no longer independent across sub-tokens, we do not have a closed-form expression for $p(x_s^i \mid x_t^i)$. In this case, is the sampling procedure carried out by first sampling $x_0^i \sim p(x_0^i \mid x_t)$, followed by sampling $x_s^{i,l} \sim p(x_s^{i,l} \mid x_t^{i,l}, x_0^{i,l})$?

Response. (a) We thank the reviewer for the suggestion and will enhance our description regarding the sampling process in Section 3.1 in an updated manuscript.

(b) The posterior $q(y_s^{i,j} \mid y_t^{i,j}, y_0^{i,j})$ follows the same formulation as Eq. (3) and is defined as follows:

$$q(y_s^{i,j} \mid y_t^{i,j}, y_0^{i,j}) = \begin{cases} \delta_{y_t^{i,j}}(y_s^{i,j}) & \text{if } y_t^{i,j} \in \mathcal{Y} \\ \frac{1-\alpha_s}{1-\alpha_t}\, \delta_{\mathtt{m}}(y_s^{i,j}) + \frac{\alpha_s-\alpha_t}{1-\alpha_t}\, \delta_{y_0^{i,j}}(y_s^{i,j}) & \text{if } y_t^{i,j} = \mathtt{m} \end{cases}$$

**(c)** Yes, the sampling procedure is carried out by first sampling $\boldsymbol{y}_0^i \sim p_{\theta}(\boldsymbol{y}_0^i \mid \boldsymbol{y}_t)$, followed by sampling each sub-token $y_s^{i,j} \sim q(y_s^{i,j} \mid y_t^{i,j}, y_0^{i,j})$ (a minimal sketch of this two-stage update is included after this response).

---

**Q3**. **(a)** It would be helpful to include the training curve in the paper. **(b)** Does the training cost increase as $\ell$ increases? Additionally, how does the training cost of MDM-Prime compare to that of MDM?

**Response**. **(a)** Due to NeurIPS requirements, we are unable to add figures during the rebuttal phase, but we would be pleased to include the curves in the final manuscript version. We provide the following table that compares the evaluation perplexity (PPL) of MDLM-Prime and MDLM at various training steps:

| Training Steps | 50K | 100K | 250K | 500K | 1,000K |
|-|-|-|-|-|-|
| MDLM | 29.21 | 27.18 | 24.81 | 23.84 | 22.98 |
| MDLM-Prime ($\ell=4$) | 25.92 | 19.45 | 18.32 | 16.87 | 15.62 |
| MDLM-Prime ($\ell=6$) | 22.72 | 18.54 | 16.77 | 15.95 | 15.36 |

**(b)** A runtime analysis of MDLM-Prime with varying $\ell$ is presented in Appendix A.5.2 and Table A3. Larger $\ell$ generally incurs higher training cost due to more complex filtering in the carry-over parameterization. The table below presents the runtime of MDLM ($\ell = 1$) and MDLM-Prime with varying $\ell$, along with the corresponding perplexity (PPL). Based on these results, we recommend $\ell=4$ as an ideal balance between efficiency and performance:

| | Runtime (sec.) |
|-|-|
| $\ell=1$ (PPL: 22.98) | 1.14e-1 |
| $\ell=2$ (PPL: 17.90) | 1.29e-1 |
| $\ell=3$ (PPL: 16.36) | 1.34e-1 |
| $\ell=4$ (PPL: 15.62) | 1.43e-1 |
| $\ell=6$ (PPL: 15.36) | 1.75e-1 |

---

**Q4**. The paper does not report the model size. Could you clarify whether the model size is similar to that of the baselines?

**Response**. For the image experiments, MDM-Prime has 114M parameters, which matches the MDM baseline. For the text experiments, MDLM-Prime has 130M parameters, the same as the MDLM baseline. We will include this information in the final version.

---

**Q5**. While MDM includes more idle steps, these can be practically skipped as the model's output is not time-dependent, meaning logits do not need recomputation. In this regard, given the same sampling steps, MDM-Prime has a higher computational cost compared to MDM due to having fewer idle steps. This raises a critical question: given the same computational budget, such as FLOPs, at test time, does MDM-Prime still maintain its performance advantage over MDM?

**Response**. For the image experiment, we follow [1] and adopt the ablated diffusion model (ADM) architecture, which is conditioned on the time embeddings, so the sampling updates are not cacheable (i.e., updates cannot be skipped). Therefore, MDM-Prime still maintains its performance advantage over MDM according to Tables 3 and 4 in the manuscript.

For the text experiment, we perform additional experiments to measure the generative perplexity (Gen PPL) under approximately aligned sampling time. The corresponding number of non-cacheable function evaluations (N-NFE) is also presented in the table. The results indicate that MDLM-Prime exhibits lower Gen PPL when N-NFE is higher.

- MDLM

| Runtime (sec.) | 2.89 | 6.02 | 8.87 |
|-|-|-|-|
| Discretized Timesteps | 262 | 644 | 1,685 |
| N-NFE | 256 | 512 | 768 |
| Gen PPL ($\downarrow$) | 112.0 | 99.0 | 87.0 |

- MDLM-Prime ($\ell=4$)

| Runtime (sec.) | 2.83 | 5.99 | 8.72 |
|-|-|-|-|
| Discretized Timesteps | 230 | 500 | 772 |
| N-NFE | 230 | 500 | 768 |
| Gen PPL ($\downarrow$) | 123.0 | 90.5 | 77.0 |

We would also like to highlight that MDLM requires a large number of discretized timesteps to achieve high N-NFE due to the existence of idle steps (see Eq. (A4) in the Appendix). Since determining the unmasked elements (i.e., resampling) in the idle steps causes computational overhead (even without a forward pass of the model), the sampling cost of MDLM becomes increasingly high at higher N-NFE:

- MDLM

| N-NFE | 768 | 1,003 | 1,023 |
|-|-|-|-|
| Discretized Timesteps | 1,685 | 25,000 | 500,000 |
| Runtime (sec.) | 8.87 | 23.30 | 159.81 |

- MDLM-Prime ($\ell=4$)

| N-NFE | 768 | 1,003 | 1,023 |
|-|-|-|-|
| Discretized Timesteps | 772 | 1,022 | 1,044 |
| Runtime (sec.) | 8.72 | 12.31 | 12.58 |

---

**Q6**. In Figures A3 and A4, there are cases where $\ell<1$ for CIFAR-10. Could you clarify what this means? My understanding is that $\ell=1$ corresponds to the original MDM parameterisation without sub-tokens.

**Response**. Yes, $\ell = 1$ corresponds to the standard MDM configuration. For cases where $\ell < 1$ appears, this refers to *super-tokenization*, where multiple pixel values are grouped into a single token before applying MDM. For example, $\ell = \frac{2}{3}$ indicates that three 8-bit RGB channels are merged into two 12-bit super-tokens, resulting in a larger token class size (e.g., from 256 to 4,096).

---

**Q7**. Everything has two sides. While MDM-Prime currently performs very well in terms of likelihood, what do you think are its downsides?

**Response**. We thank the reviewer for the question. The current MDM-Prime has the following limitations:

(i) Non-monotonic performance scalability with $\ell$: As demonstrated in our experiments (Appendix A.2.4), increasing $\ell$ generally leads to better performance. However, the trend is not strictly monotonic, as larger $\ell$ does not always yield improvement. Although we can properly select this value based on the elbow-point method presented in Appendix A.2.4, we believe that the development of advanced training techniques to enable improved performance scalability with larger $\ell$ constitutes an interesting future avenue.

(ii) Fixed design of the sub-tokenization function $f$: When extending MDM-Prime to applications such as text imputation, the choice of the sub-tokenization function $f$ and its inverse $f^{-1}$ becomes critical (see Q3 from Reviewer uEyH). The grouping of semantically related tokens may improve generalization in these tasks, and we consider this a promising direction for further investigation.

---

**References**

[1] Gat et al. Discrete Flow Matching. NeurIPS 2024.
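
As a companion to the response to Q2 (c), here is a minimal sketch of one reverse step $t \to s$. It is not the authors' implementation: `MASK`, `model_logits`, and `f_decode` are hypothetical stand-ins for the mask sub-token, the filtered predictor $p_\theta(\boldsymbol{y}_0^i \mid \boldsymbol{y}_t)$, and the bijection $f$.

```python
import torch

MASK = -1  # hypothetical sentinel id for the masked sub-token `m`

def reverse_step(y_t, model_logits, f_decode, alpha_s, alpha_t):
    """One reverse step t -> s of MDM-Prime (sketch).

    y_t          : (L, ell) long tensor of sub-tokens, MASK where still masked
    model_logits : callable, y_t -> (L, V) logits over original tokens,
                   assumed already filtered to candidates consistent with y_t
    f_decode     : callable mapping token ids (L,) -> sub-tokens (L, ell), i.e. f
    """
    logits = model_logits(y_t)                                     # (L, V)
    x0 = torch.distributions.Categorical(logits=logits).sample()   # y_0^i ~ p_theta(. | y_t)
    y0 = f_decode(x0)                                              # (L, ell) predicted sub-tokens
    # posterior q(y_s | y_t, y_0): unmasked sub-tokens are carried over;
    # masked ones are revealed with probability (alpha_s - alpha_t) / (1 - alpha_t)
    reveal_prob = (alpha_s - alpha_t) / (1.0 - alpha_t)
    reveal = (torch.rand_like(y_t, dtype=torch.float) < reveal_prob) & (y_t == MASK)
    return torch.where(reveal, y0, y_t)
```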
Comment

We appreciate the reviewer’s valuable feedback and recognition of our work. We will incorporate the rebuttal discussion into an updated version of the paper. Thank you once again for your thorough review.

Comment

Thanks for the detailed response. My main questions have been addressed, although the theoretical justification remains unclear; I think it is a very interesting question to leave as future work. Moreover, it would be great to include the discussion of limitations in the revision. Thanks for the great work again.

Review (Rating: 4)

This paper introduces a partial masking technique that lets each token in the masked diffusion model go beyond just the masked or unmasked state. In particular, each token is decomposed into sub-tokens, which can be masked individually, making the original token partially masked. A bijective mapping allows translation between tokens and sub-tokens. Empirical results show improvements over traditional MDMs.

Strengths and Weaknesses

Strengths:

  1. The idea of partial masking via sub-tokens is novel and elegant. It makes a lot of sense to have partial masking for discrete diffusion models, as continuous diffusion models have many intermediate states, but it is non-trivial to design a mechanism to do so. The design of sub-tokens and the bijective mapping is a clever way to tackle the problem.

  2. The empirical results are strong. The method surpasses MDM by a large margin in perplexity and FID.

  3. The paper is well written and the idea is presented nicely. The figures are also very helpful.

Weaknesses:

  1. Some discussion on the computational complexity is needed. Does it add any overhead to training and inference?

Questions

N/A

Limitations

yes

Final Justification

The authors addressed my concern, and I will maintain my score of weak accept.

Formatting Issues

no

Author Response

We sincerely appreciate the reviewer’s valuable feedback and questions. The responses to the reviewer’s questions are as follows.


Comments

C1. Some discussion on computational complexity is needed. Does it add any overhead to training and inference?

Response. A runtime analysis of MDLM-Prime with varying $\ell$ is presented in Appendix A.5.2 and Table A3. Larger $\ell$ generally incurs higher training cost due to more complex filtering in the carry-over parameterization. The table below presents the runtime of MDLM ($\ell = 1$) and MDLM-Prime with varying $\ell$, along with the corresponding perplexity (PPL). Based on these results, we recommend $\ell=4$ as an ideal balance between efficiency and performance:

| | Runtime (sec.) |
|-|-|
| $\ell=1$ (PPL: 22.98) | 1.14e-1 |
| $\ell=2$ (PPL: 17.90) | 1.29e-1 |
| $\ell=3$ (PPL: 16.36) | 1.34e-1 |
| $\ell=4$ (PPL: 15.62) | 1.43e-1 |
| $\ell=6$ (PPL: 15.36) | 1.75e-1 |


For inference, we perform additional experiments to measure the generative perplexity (Gen PPL) under approximately aligned sampling time. The corresponding number of non-cacheable function evaluations (N-NFE) is also presented in the table. The results indicate that MDLM-Prime exhibits lower Gen PPL when N-NFE is higher.

  • MDLM

| Runtime (sec.) | 2.89 | 6.02 | 8.87 |
|-|-|-|-|
| Discretized Timesteps | 262 | 644 | 1,685 |
| N-NFE | 256 | 512 | 768 |
| Gen PPL ($\downarrow$) | 112.0 | 99.0 | 87.0 |

  • MDLM-Prime ($\ell=4$)

| Runtime (sec.) | 2.83 | 5.99 | 8.72 |
|-|-|-|-|
| Discretized Timesteps | 230 | 500 | 772 |
| N-NFE | 230 | 500 | 768 |
| Gen PPL ($\downarrow$) | 123.0 | 90.5 | 77.0 |

Comment

Thank you again for your thoughtful review.

Review (Rating: 5)

This paper introduces MDM-Prime, an enhancement to Masked Diffusion Models (MDMs) designed to address computational inefficiency arising from "idle steps" where the model's input remains unchanged during the sampling process. The core idea is to replace the binary masked/unmasked states with a more granular system that represents each discrete token as a sequence of sub-tokens using a base-b encoding. Masking is then performed at the sub-token level. The method achieves a lower perplexity on the OpenWebText dataset than previous MDM and autoregressive models.

Strengths and Weaknesses

Strengths:

  • The paper is well written. The illustrations are nicely plotted.
  • The motivation for the work is clear and compelling. The concept of "idle steps" provides a strong intuition for the inefficiency of standard MDMs. The proposed solution of partial masking through sub-token decomposition is an elegant and simple way to create a more fine-grained and efficient denoising process.
  • MDM-Prime requires only minimal architectural changes to a standard MDM.
  • The method achieves impressive results, notably outperforming autoregressive models in text perplexity (15.36 PPL vs. 17.54 PPL) on the OpenWebText dataset.

Weaknesses:

  • The connection between improved model utilization and superior likelihood estimation is not thoroughly explored. While the reduction in "idle steps" is the primary motivation, the paper does not provide a deep analysis of why this leads to such a dramatic improvement in modeling capabilities, to the point of outperforming autoregressive models.
  • The paper does not report the impact of MDM-Prime on training and inference time. Handling sub-tokens and a more complex parameterization should introduce overhead. A direct comparison of throughput against the baseline MDM is necessary for a complete picture of the method's practical efficiency.

Questions

None

Limitations

  • For the text generation task, performance is evaluated solely based on perplexity. Augmenting the evaluation with other automated metrics or a human study would substantially strengthen the claims of superiority.

Final Justification

The authors fully addressed my concerns in the rebuttal, therefore I would like to increase the rating from 4 to 5.

Formatting Issues

None

Author Response

We sincerely appreciate the reviewer’s valuable feedback and questions. The responses to the reviewer’s questions are as follows.


Comments

C1. The connection between improved model utilization and superior likelihood estimation is not thoroughly explored. While the reduction in "idle steps" is the primary motivation, the paper does not provide a deep analysis of why this leads to such a dramatic improvement in modeling capabilities, to the point of outperforming autoregressive models.

Response. We thank the reviewer for the comment. In Section 1, we provide the core intuition behind our method, and in Appendix A.2.4, we analyze how the target length $\ell$ (which controls the number of idle steps) affects the model's likelihood estimation as measured by perplexity. While our current focus centers on providing empirical evidence and intuitive motivation, we acknowledge that a deeper analysis would constitute a valuable contribution and consider this an important direction for future work.


C2. The paper does not report the impact of MDM-Prime on training and inference time. Handling sub-tokens and a more complex parameterization should introduce overhead. A direct comparison of throughput against the baseline MDM is necessary for a complete picture of the method's practical efficiency.

Response. A runtime comparison between the MDLM baseline and MDLM-Prime with varying $\ell$ is presented in Appendix A.5.2 and Table A3 (throughput can be derived by taking the inverse of the reported runtime values). Larger $\ell$ generally incurs higher training cost due to more complex filtering in the carry-over parameterization. The table below presents the runtime of MDLM ($\ell = 1$) and MDLM-Prime with varying $\ell$, along with the corresponding perplexity (PPL). Based on these results, we recommend $\ell=4$ as an ideal balance between efficiency and performance:

| | Runtime (sec.) |
|-|-|
| $\ell=1$ (PPL: 22.98) | 1.14e-1 |
| $\ell=2$ (PPL: 17.90) | 1.29e-1 |
| $\ell=3$ (PPL: 16.36) | 1.34e-1 |
| $\ell=4$ (PPL: 15.62) | 1.43e-1 |
| $\ell=6$ (PPL: 15.36) | 1.75e-1 |


For inference, we perform additional experiments to measure the generative perplexity (Gen PPL) under approximately aligned sampling time. The corresponding number of non-cacheable function evaluations (N-NFE) is also presented in the table. The results indicate that MDLM-Prime exhibits lower Gen PPL when N-NFE is higher.

  • MDLM

| Runtime (sec.) | 2.89 | 6.02 | 8.87 |
|-|-|-|-|
| Discretized Timesteps | 262 | 644 | 1,685 |
| N-NFE | 256 | 512 | 768 |
| Gen PPL ($\downarrow$) | 112.0 | 99.0 | 87.0 |

  • MDLM-Prime ($\ell=4$)

| Runtime (sec.) | 2.83 | 5.99 | 8.72 |
|-|-|-|-|
| Discretized Timesteps | 230 | 500 | 772 |
| N-NFE | 230 | 500 | 768 |
| Gen PPL ($\downarrow$) | 123.0 | 90.5 | 77.0 |


C3. For the text generation task, performance is evaluated solely based on perplexity. Augmenting the evaluation with other automated metrics or a human study would substantially strengthen the claims of superiority.

Response. We appreciate this valuable suggestion. We selected perplexity as the primary metric in the main manuscript due to its widespread adoption and its suitability for the evaluation of likelihood-based models such as MDM-Prime. In addition to standard perplexity, we also report generative perplexity (Gen PPL) in Appendix A.5.3, which evaluates the quality of model-generated samples through pretrained large language models. We agree with the reviewer and consider the reviewer's suggestion regarding additional automated metrics or human evaluation an interesting and promising future research direction.
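
As an illustration of how Gen PPL is typically computed, the sketch below scores generated samples with a pretrained causal LM via Hugging Face Transformers. The choice of "gpt2-large" as the scorer is an assumption for illustration; the exact scoring model used in Appendix A.5.3 is not restated here.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def generative_perplexity(samples, scorer_name="gpt2-large"):
    """Score generated texts with a pretrained LM and return Gen PPL (sketch)."""
    tok = GPT2TokenizerFast.from_pretrained(scorer_name)
    lm = GPT2LMHeadModel.from_pretrained(scorer_name).eval()
    nlls, n_tokens = [], 0
    with torch.no_grad():
        for text in samples:
            ids = tok(text, return_tensors="pt").input_ids
            out = lm(ids, labels=ids)                 # mean NLL over predicted tokens
            nlls.append(out.loss * (ids.numel() - 1)) # convert to total NLL
            n_tokens += ids.numel() - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```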

Comment

Thank you for the response. My concerns have been addressed.

Comment

We appreciate the reviewer’s response and valuable feedback. Thank you again for your thoughtful review.

Review (Rating: 5)

The authors propose a partial masking scheme (Prime) to address redundant and perhaps suboptimal computation in the masked diffusion model's unmasking generative process, wherein some tokens are sampled early and then copied over and over while others are unmasked to be consistent with those already chosen. To challenge this setup, instead of the original masking/unmasking scheme, they introduce latent variables that are "partially masked" tokens, which allow the model to gradually sample a consistent set of tokens rather than specializing/committing some dimensions too early and copying them over the rest of the diffusion process. They propose a variational training scheme for handling these latent partial masks ("intermediate states") and report perplexities and FIDs on text and image data.

While the two goals (1. minimizing steps of doing nothing, 2. introducing a richer latent denoising space) may seem orthogonal at first, I think the idea is roughly that time spent doing nothing could be time spent throwing extra computation into a more expressive model, and a more fine-grained latent space is one way to do this.

Strengths and Weaknesses

Strengths

  • Generalizes masking and unmasking to partial masking, where a partial mask is a base-$b$ sequence of length $\ell$ of "sub-tokens".
  • Adapting MDMs to MDM-Prime requires some embedding changes but few changes to the architecture or the diffusion formalism.
  • Good PPL/FIDs per number of function evaluations

Weaknesses

See questions.

Questions

  • [conceptual clarification] Do I understand correctly that by explicitly defining the mapping between original tokens and sub-tokens, partially-masked sub-token sequences essentially represent a set of possible tokens that the current noisy state for this token can be mapped to? And this set of "tokens to denoise to" shrinks as we unmask more sub-tokens? Therefore, this is sort of a diffusion over the space of sets of tokens?

  • [detailed clarification of NFEs vs sampling steps vs number of model calls] If I understand correctly, it seems like one needs a joint model over sub-tokens given state $y_t$. Is it right that this joint model factors over tokens ($y_0^i$ independent of $y_0^j$ given $y_t$) but is correlated across sub-tokens ($y_0^{i,j}$ dependent with $y_0^{i,k}$ given $y_t$)? As a result, in the likelihood bound, there is a marginalization over sub-tokens consistent with a given observed partial masking, is that right? Is that what equations (6) and (7) get at? If so, any considerations computationally? Does this marginalization govern that $\ell$ should be kept somewhat low?

Specifically, if multiple model calls are made in the marginalization sum, then what exactly do the NFEs in the tables mean? Does 500 NFEs mean 500 denoising steps, where each denoising step might call the model many times to marginalize? In either case, the relationship between NFEs, number of denoising steps, and number of model calls should be clarified. If multiple model calls are made per what the table refers to as "1 NFE", then this should be made more prominent in the text. Clarification will be appreciated.

  • [exploratory question] On an intuitive level, it seems that which original tokens to "pair" via shared subtoken subsequences is potentially an important detail that could affect the learning problem. e.g. if "cat" is 0001 and "bird" is 0000 and the partial mask is 000m then the model knows from [000m] that the word is "cat" or "bird" and statistical strength can be shared in sentences like "Lucy has a pet [000m]". But less such strength could be shared if it were "bird" or "chair". Any particular thoughts on automated ways to assign the mapping? How was it done in your work?

  • [exploratory question] Any thoughts about data-driven ways to set $\ell$?

Limitations

Yes.

Final Justification

I maintain my positive score for this careful and insightful work.

Formatting Issues

None.

Author Response

We sincerely appreciate the reviewer’s valuable feedback and questions. The responses to the reviewer’s questions are as follows.


Questions

Q1. [conceptual clarification] (a) Do I understand correctly that by explicitly defining the mapping between original tokens and sub-tokens, partially-masked sub-token sequences essentially represent a set of possible tokens that the current noisy state for this token can be mapped to? And this set of "tokens to denoise to" shrinks as we unmask more sub-tokens? (b) Therefore, this is sort of a diffusion over the space of sets of tokens?

Response. (a) Yes, this is a valid and correct interpretation. When some sub-tokens are masked, the noisy state corresponds to a subset of possible original tokens consistent with the unmasked parts. As more sub-tokens are revealed, this subset shrinks, narrowing down the candidate tokens. (b) Yes, this process can be intuitively viewed as diffusion over sets of tokens. Each intermediate state (i.e., partial sub-token sequence) defines a set of possible tokens, and the diffusion moves from larger sets (more uncertainty) to singleton sets (fully denoised tokens).


Q2. [detailed clarification of NFEs vs sampling steps vs number of model calls] (a) If I understand correctly, it seems like one needs a joint model over sub-tokens given state $y_t$. Is it right that this joint model factors over tokens ($y_0^i$ independent of $y_0^j$ given $y_t$) but is correlated across sub-tokens ($y_0^{i,j}$ dependent with $y_0^{i,k}$ given $y_t$)?

(b) As a result, in the likelihood bound, there is a marginalization over sub-tokens consistent with a given observed partial masking, is that right? Is that what equations (6) and (7) get at? If so, any considerations computationally? Does this marginalization govern that $\ell$ should be kept somewhat low?

(c) Specifically, if multiple model calls are made in the marginalization sum, then what exactly do the NFEs in the tables mean? Does 500 NFEs mean 500 denoising steps, where each denoising step might call the model many times to marginalize? In either case, the relationship between NFEs, number of denoising steps, and number of model calls should be clarified. If multiple model calls are made per what the table refers to as "1 NFE", then this should be made more prominent in the text. Clarification will be appreciated.

Response. (a) Yes, this is correct. Eq. (7) defines a joint distribution $p_\theta(\boldsymbol{y}_0^i \mid \boldsymbol{y}_t)$ over sub-tokens at position $i$ given $\boldsymbol{y}_t$.

(b) No, the likelihood bound does not require explicit marginalization over sub-token combinations. Instead, the distribution $p_\theta(\boldsymbol{y}_0^i \mid \boldsymbol{y}_t)$ in Eq. (7) is defined as a softmax over the set $\mathcal{V}(\boldsymbol{y}_t^i)$, and this formulation is guaranteed to satisfy the marginalization condition in Eq. (6), as shown in Proposition A.4.

To construct the softmax over $\mathcal{V}(\boldsymbol{y}_t^i)$, we apply a filtering mechanism that masks out the logit entries corresponding to candidates $\boldsymbol{y}_0^i$ that are inconsistent with the observed $\boldsymbol{y}_t^i$. This procedure, detailed in Appendix A.4 and illustrated in Fig. A2 (a), performs logical AND operations across $\ell$ binary filters applied to the logits, and does not require additional model evaluations.
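
To make the filtering step concrete, here is a minimal sketch of masking out inconsistent candidates before the softmax. The tensor layout and names are our assumptions for illustration, not the code of Appendix A.4.

```python
import torch

def filter_logits(logits, y_t, token_subtokens, mask_id=-1):
    """Mask logits of candidate tokens inconsistent with observed sub-tokens (sketch).

    logits          : (L, V) raw logits over original tokens at each position
    y_t             : (L, ell) observed sub-tokens, mask_id where still masked
    token_subtokens : (V, ell) sub-token decomposition f(v) of every vocabulary entry
    """
    L, ell = y_t.shape
    # one binary filter per sub-token position: candidate v is kept if its j-th
    # sub-token matches the observed one, or if that sub-token is still masked
    keep = torch.ones(L, token_subtokens.shape[0], dtype=torch.bool)
    for j in range(ell):
        observed = y_t[:, j].unsqueeze(1)                       # (L, 1)
        match_j = (token_subtokens[:, j].unsqueeze(0) == observed) | (observed == mask_id)
        keep &= match_j                                         # logical AND across filters
    return logits.masked_fill(~keep, float("-inf"))             # softmax over V(y_t^i)
```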

Regarding the computational cost, larger $\ell$ generally incurs higher training cost due to more complex filtering in the carry-over parameterization. The table below presents the runtime of MDLM ($\ell = 1$) and MDLM-Prime with varying $\ell$, along with the corresponding perplexity (PPL). Based on these results, we recommend $\ell=4$ as an ideal balance between efficiency and performance:

| | Runtime (sec.) |
|-|-|
| $\ell=1$ (PPL: 22.98) | 1.14e-1 |
| $\ell=2$ (PPL: 17.90) | 1.29e-1 |
| $\ell=3$ (PPL: 16.36) | 1.34e-1 |
| $\ell=4$ (PPL: 15.62) | 1.43e-1 |
| $\ell=6$ (PPL: 15.36) | 1.75e-1 |

(c) Each NFE corresponds to a single denoising step, which includes one forward pass of the model. The marginalization condition is handled via filtering and does not require additional model calls as described above.


Q3. [exploratory question] On an intuitive level, it seems that which original tokens to "pair" via shared subtoken subsequences is potentially an important detail that could affect the learning problem. e.g. if "cat" is 0001 and "bird" is 0000 and the partial mask is 000m then the model knows from [000m] that the word is "cat" or "bird" and statistical strength can be shared in sentences like "Lucy has a pet [000m]". But less such strength could be shared if it were "bird" or "chair". Any particular thoughts on automated ways to assign the mapping? How was it done in your work?

Response. We appreciate the reviewer’s insight regarding the impact of sub-token assignments. In our current implementation, we directly apply the base-$b$ encoding to the original token indices (i.e., [0, ..., 255] for images and [0, ..., 50,256] for text). An interesting extension for future work is to reorder token indices based on semantic similarity, which could be automatically determined by a word embedding model, prior to base-$b$ encoding. This would allow related tokens (like "cat" and "bird") to share more sub-token prefixes, enhancing the model’s ability to support fine-grained imputation in textual settings, analogous to the sub-token completion task demonstrated for images in Fig. 8.
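
For illustration, a minimal sketch of the base-$b$ sub-tokenization $f$ and its inverse $f^{-1}$ applied directly to integer token indices, as described above (the specific base and length are experiment-dependent):

```python
def to_subtokens(index, base, length):
    """Base-b encoding of a token index into `length` sub-tokens (most significant first)."""
    digits = []
    for _ in range(length):
        digits.append(index % base)
        index //= base
    return digits[::-1]

def to_token(subtokens, base):
    """Inverse mapping f^{-1}: recover the original token index."""
    index = 0
    for d in subtokens:
        index = index * base + d
    return index

# e.g. with base=2 and length=8, pixel value 5 -> [0, 0, 0, 0, 0, 1, 0, 1]
assert to_token(to_subtokens(5, base=2, length=8), base=2) == 5
```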


Q4. [exploratory question] Any thoughts about data-driven ways to set $\ell$?

Response. An effective approach to setting $\ell$ is to monitor the Idle Step Ratio (ISR), which can be calculated using Eq. (A5). As shown in our experiments in Appendix A.2.4, the 'elbow point' in the ISR curve often aligns with optimal performance. Therefore, $\ell$ can be selected by gradually increasing its value and choosing the smallest point at which the ISR plateaus. This selection can be performed efficiently prior to training a model.
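
A minimal sketch of this selection rule, assuming the ISR values have already been computed via Eq. (A5) for each candidate $\ell$; the plateau tolerance and the example numbers are illustrative, not figures from the paper:

```python
def select_ell(ell_values, isr_values, tol=0.01):
    """Return the smallest ell at which the ISR curve plateaus (sketch).

    ell_values : candidate target lengths, sorted in increasing order
    isr_values : Idle Step Ratio (Eq. (A5)) computed for each candidate
    tol        : illustrative plateau tolerance on successive ISR changes
    """
    for i in range(1, len(ell_values)):
        if abs(isr_values[i] - isr_values[i - 1]) < tol:
            return ell_values[i - 1]
    return ell_values[-1]

# illustrative numbers only: the ISR change flattens between 4 and 6, so 4 is chosen
assert select_ell([1, 2, 3, 4, 6], [0.62, 0.35, 0.18, 0.10, 0.095]) == 4
```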

Final Decision

This paper introduces partial masking for masked discrete diffusion by decomposing each token into a small set of sub-tokens to define intermediate, partially masked states. All reviewers agreed that the approach is simple, the paper is clear, and the results are strong. There are some weaknesses, a particular one being that there is no clear theoretical reason for why partial masking should improve results. Other comments and concerns raised by reviewers were generally resolved satisfactorily, with some additions to the paper promised (sampling algorithm, training curves, parameter counts, better comparisons); these will all improve the paper. I am recommending an accept and strongly encourage the authors to add all the promised items in the camera-ready.