Discrete Modeling via Boundary Conditional Diffusion Processes
Abstract
Reviews and Discussion
The authors propose a discrete diffusion model that operates in the embedding space of vectors that represent discrete tokens. They propose a method that takes into account the fact that whole regions of the space will decode to the same discrete token because during decoding, the closest embedding vector is selected. The method is that at training time, instead of interpolating between and , they interpolate from to where is the point at which the line between and intersects the 'discrete boundary' of which is the region of embedding space that all decodes to . The authors test their method on language tasks as well as CIFAR-10 image modeling.
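To make the geometry in this summary concrete, here is a minimal sketch. It is not the authors' implementation. It assumes Euclidean nearest-neighbour decoding and a straight-line path from a token embedding to a noise sample, and the names `boundary_crossing`, `embeddings`, `token`, and `noise` are illustrative. Under those assumptions the region that decodes to a token is a Voronoi cell, so the first boundary crossing along the segment has a closed form.

```python
import numpy as np

def boundary_crossing(embeddings, token, noise):
    """Find where the segment x(t) = x0 + t * (noise - x0) leaves the cell of `token`.

    Returns (t_star, x_boundary), or (None, None) when the segment never
    leaves the cell for t in (0, 1], i.e. the noise still decodes to `token`.
    Assumes decoding picks the Euclidean-nearest embedding (an assumption).
    """
    x0 = embeddings[token]
    d = noise - x0                                      # direction of the segment
    diff = np.delete(embeddings, token, axis=0) - x0    # e_j - x0 for every j != token
    # ||x(t) - x0||^2 = ||x(t) - e_j||^2  <=>  t = ||e_j - x0||^2 / (2 d.(e_j - x0))
    num = np.sum(diff * diff, axis=1)
    denom = 2.0 * (diff @ d)
    valid = denom > 0                                   # the segment only approaches e_j if d.(e_j - x0) > 0
    if not np.any(valid):
        return None, None
    t_star = float(np.min(num[valid] / denom[valid]))
    if t_star > 1.0:                                    # crossing lies beyond the noise sample
        return None, None
    return t_star, x0 + t_star * d
```

The `(None, None)` branch corresponds to the case raised later in the review, where a noise sample falls inside the token's own region and no boundary point exists on the segment.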
Strengths
The idea of trying to exploit the fact that the embedding space has a very specific structure consisting of volumes that all decode to the same discrete token is interesting and novel.
The experimental results look quite promising with good performance compared to other discrete diffusion approaches on machine translation and text summarization.
The authors are working on a significant problem as finding diffusion methods that can perform on par with autoregressive models for discrete modeling would be very impactful.
Weaknesses
The paper's main weakness is lack of clarity and precision in the description of the method. I really struggle to understand what the authors are doing exactly and how it is justified.
- It is not clear to me that your training objective and sampling regime are justified. The flow matching paradigm is to construct a probability path and then find a vector field that generates this probability path through the transform defined by ODE integrating along the vector field. The final step is to say that a vector field that generates the unconditional probability path is equal to and this can be learned tractably with an objective of . However, I don't see how your method fits in this framework. This is for two reasons: 1) at no point do you define the conditional vector field that generates your rescaled probability path . How are you able to generate along these probability paths without defining ? 2) You sample using for the original but with the prediction from the neural network. This is valid in the original flow matching case because is linear and so the objective can be re-arranged into prediction . However we don't necessarily know if that generates is linear so we can't be sure it is valid to re-arrange to prediction.
- I think after the method is explained in section 3, the authors should have text that revisits the probability contour intuition they provide during the introduction. I think this is a nice mental model to understand the method but it is very unclear how exactly the method presented in section 3 gives you the desired probability contours that are promised in the introduction. For example, if a contour is very long in embedding space, then definitely the far away points on the contour with have low probability under so therefore probability is not constant along the discrete boundary and therefore the discrete boundary does not line up with the probability contour. These kinds of doubts would be alleviated with a clearer presentation linking back to the original probability contour view and some more precise statements.
- The discussion relating to the confidence factor is very unclear. Firstly, I'm unsure why it is even called a confidence factor, we can be fully confident that our data is discrete so why is there uncertainty about this? It is also concerning that the method can oscillate and mode collapse when r=1 which is your full method. This sounds like a major flaw of the approach if it cannot work at all when your new method is used in isolation. This warrants more discussion regarding this failure case. Furthermore L195 should be explained more as it is very unclear what "fixing the initial path" means and it sounds very important if it is required for your method to work.
Regarding the experiments:
- It seems unfair to allow your method to be re-ranked in Table 1 if you don't also re-rank the transformer baseline
- In Table 3, it is hard to get a good sense of how your method compares to other discrete diffusion approaches. Since they all use different embedding strategies it seems that it is only a good comparison to compare methods within the same embedding strategy. In this case, you are really only comparing to bit diffusion but your reproduction seems to perform much worse than the originally stated results in the bit diffusion paper. It is therefore difficult to place your method with confidence amongst the other discrete diffusion methods.
My score is given on the basis of the very unclear presentation, which means I cannot understand the method the authors are proposing. If this can be cleared up in the rebuttal period, I am happy to raise my score.
Questions
Here I have listed some more minor questions for the authors that can help improve the exposition. My main concerns are given in the previous discussion.
- How does your formulation handle the fact that for some and samples, there will never be a boundary crossing because may be on the outside of the space or is sampled within 's discrete boundary?
- On L187 (24) is already deterministic because it's an ODE, so I don't understand why you then propose a deterministic alternative.
- In Table 2 it is unclear what ablation you are actually doing, what does it mean to only rescale the forward process versus rescaling the forward and backward?
Limitations
I don't believe the authors adequately describe the limitations of their method when r is near 1, they say that it can be unstable but provide no further details or explanation regarding this failure case.
We sincerely appreciate your meticulous review and the valuable comments. Please kindly find our response below.
Q1: The rescaled vector field and probability path
Due to the character limit, this is only an outline. Please refer to comments for more details.
- Our neural network predicts not .
- is tractable but we did not derive it because it is not used.
- We generate the probability path with the deterministic reverse diffusion process (Equation 25).
- The training objective is valid. Derivations in Comments.
Q2: Method and the probability contour
Due to the character limit, this is only an outline. Please refer to comments for more details.
- We have demonstrated the rescaled probability contours of in Figure 2, where sample points are exactly calculated by our proposed method (Equation 22).
- is a series of functions, where the consistency to the boundary contour is inversely proportional to the subscript .
Q3: Confidence factor
Due to the character limit, this is only an outline. Please refer to comments for more details.
- The confidence factor is named after some thoughts of sampling, such as the confidence interval. Our data is discrete and the uncertainty comes from the learned diffusion model.
- The mode collapse when comes from two parts: one is the uncertainty of the learned diffusion model as described above. The other involves a traditional problem of diffusion processes.
- Fixing the initial path means the Trajectory Alteration (Line 8 in Algorithm 2) is optional in practice. This is a negligible detail of our model.
Q4: Re-ranking
This is a fair comparison because beam searching the generated results of the transformer is re-ranking with the transformer.
Re-ranking is a traditional strategy widely used in non-autoregressive translation [7,8]. We use the generated results of our method re-ranked by the transformer to demonstrate their upper bound. Since the generation of diffusion LMs is highly random, their own probability scores often cannot reflect the upper bound of their performance. Re-ranking the generated results with an auto-regressive model can better demonstrate the potential capability of the diffusion model. Besides, re-ranking the transformer baseline just returns what it already generates, because beam search is exactly re-ranking the generated contents with the transformer's own probabilities.
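The re-ranking step described above can be sketched as follows. This is only an illustration: `rerank` and the `score` callable are hypothetical names, and in the rebuttal's setting the scorer would be the autoregressive transformer's probability of a candidate given the source sentence, which is not modelled here.

```python
from typing import Callable, Sequence

def rerank(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that an external scorer rates highest.

    `score` stands in for an autoregressive model's (log-)probability of a
    candidate; any callable returning a float works for this sketch.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)
```

With several diffusion samples drawn from different initial noises as `candidates`, the selected output reflects the best sample under the scorer rather than a typical one, which is why the authors describe the re-ranked numbers as an upper bound.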
Q5: Table 3
- We believe it is a reasonable comparison because we strictly follow the setting of BitDiffusion [9] as in Table 1 of their paper. We want to clarify that the embedding strategy only applies to continuous diffusion models; there is no embedding stage for discrete diffusion models. We compare the continuous diffusion models and the discrete diffusion models under the same level of continuous information (Continuous Pixels, Discrete Ordinal Pixels, and Categorical Pixels), where each later setting gives the models less continuous information.
- We have clarified in Lines 285-287 that since they only provide the model code without a training script with detailed hyperparameters or a checkpoint, we have to reproduce with exactly the same configuration based on their code and paper. We have made our best effort to reproduce the results and we promise to release our code and reproduction. Besides, we achieve the best reproduction result in the open source community. We are confident in the reliability of our experimental results and hope to address the reviewer's concerns.
- Even setting the above issues aside, our method is still effective. The original BitDiffusion (FID 3.48) cannot beat DDPM (FID 3.17) under Gaussian sampling, which reflects the gap introduced by continuous information. Even so, our method (FID 3.86) improves over DDIM (FID 4.04), and both are deterministic diffusion processes. This means that our approach can achieve better results with less continuous information.
Q6: Samples in the boundary
- It is possible that a random noise is sampled within 's discrete boundary. However, the probability of this happening in a high-dimensional representation space is very low. According to our statistics during the training phase, this probability is less than 0.1%.
- Our solution is demonstrated in the pseudo code (Line 639, Appendix F), where we mask out this situation when . This means that random noise sampled within 's discrete boundary will not be rescaled and its trajectory is the same as original DDPM.
Q7: Line 187
Similar to 3.2 in Q1, as demonstrated in Lines 186-189, we provide an alternative for the ODE decoding, which is a deterministic reverse process. The word 'deterministic' is used to reflect the difference between our approach and traditional Gaussian process in DDPM, because both of them model the reverse process.
Q8: Ablation in Table 2
- The first three lines in Table 2 reveal where the performance improvement of our method mainly comes from. Rescaling only the forward process means we just train the diffusion model with the rescaled trajectory and reverse the trajectory as in DDPM. Rescaling the forward and backward processes means we use the rescaled trajectory for both training and inference. The results in Table 2 show that training with the rescaled trajectories makes a major contribution to the final performance of our method.
- The last line in Table 2 reveals that our method is robust to different trajectory types, and that the optimal transport trajectory used in Flow Matching can sometimes achieve better results.
Thank you for the detailed rebuttal. I am happy with the responses to my questions regarding the probability contour, confidence factor, samples on the boundary, the deterministic sampler and Table 2.
However, I am still confused about your overall method. I appreciate the derivation of however this still leaves me with questions about the loss you use. You justify the prediction by saying that we are trying to match with which yes they would be equal if you set . However this misunderstands what the loss is actually doing. When we do flow matching, we use the L2 loss because we would like to have which is the solution to min . If you then solve min you will obtain . But we don't necessarily have which would be required for your loss to be justified. This is ok in standard flow matching since is linear but it is not clear in your case.
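The reviewer's point about the L2 loss can be written out as a short worked derivation. The notation below ($x_t$, $x_0$, $u_t(x_t \mid x_0)$, $A_t$, $b_t$) is assumed for illustration, since the original symbols did not survive in this thread; the statements themselves are standard properties of least-squares regression.

```latex
% L2 regression returns a conditional mean:
\arg\min_{f}\; \mathbb{E}\,\lVert f(x_t, t) - y \rVert^2 \;=\; \mathbb{E}[\,y \mid x_t\,]

% Hence the x_0-prediction and vector-field objectives have minimizers
x_\theta^\star(x_t, t) = \mathbb{E}[\,x_0 \mid x_t\,], \qquad
v_\theta^\star(x_t, t) = \mathbb{E}[\,u_t(x_t \mid x_0) \mid x_t\,]

% If u_t is affine in x_0, i.e. u_t(x_t \mid x_0) = A_t x_0 + b_t(x_t),
% the expectation passes through and the two parameterizations agree:
\mathbb{E}[\,u_t(x_t \mid x_0) \mid x_t\,]
  = A_t\,\mathbb{E}[\,x_0 \mid x_t\,] + b_t(x_t)
  = u_t\big(x_t \mid x_\theta^\star(x_t, t)\big)

% For a u_t that is not affine in x_0 this equality can fail:
\mathbb{E}[\,u_t(x_t \mid x_0) \mid x_t\,] \;\neq\;
  u_t\big(x_t \mid \mathbb{E}[\,x_0 \mid x_t\,]\big) \quad \text{in general}
```

The open question in the thread is therefore whether the rescaled conditional vector field is affine in the predicted quantity; if it is not, plugging the learned prediction into it targets a different quantity than the marginal vector field.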
Regarding Table 1, am I to understand that you used beam search for the transformer baseline, and that this is why you say it is equivalent to re-ranking your method? Why didn't you just use normal sampling for the transformer baseline and normal sampling for your method? I feel that would be a much clearer narrative and would not imply that your method requires re-ranking to work well.
For Table 3 it is good that you implemented Bit Diffusion and got the best open source results. However, it doesn't really help make your narrative clear, since the results are worse than those reported in the original Bit Diffusion paper. Why didn't you just use standard flow matching on the same discrete embedding space as your method, but without your boundary method? That would have been the closest and simplest baseline and would have made the precise benefits of your method clear.
Q1: The rescaled vector field and probability path
We do not define the vector field in our framework, because our neural network predicts and we generate the probability path with the deterministic reverse diffusion process.
- Why do we predict ?
is the key to our method during both the training and inference stages, so we need a low-error estimate of (Lines 180-182). Predicting or will amplify the errors in the neural network's output.
- How to define the vector field if we want?
Preliminary: (Equation 11-13 in Flow Matching [1])
In our framework: (Equation 22)
Therefore:
Given: (Equation 19), we have
Let denote
Hence: , which is actually tractable.
- How to generate the probability paths?
- 3.1 One direct solution is Equation 24, which is derived from Theorem 3 in Flow Matching [1]
Let be a probability path, given equation 22, there is .
Replacing in with , we have .
Therefore, we generate the probability path with .
However, this may be inefficient in practical use because we have to solve the equation to get the with respect to the change of in real time. This is tractable for special cases such as original Flow Matching (Equation 42 in Appendix E), but it will be much more complex for Diffusion trajectories (Equation 43 in Appendix E). Since our approach is a general framework for all continuous diffusion processes, including both Diffusion Model and Flow Matching, we actually use an alternative approach similar to the DDIM[2].
It's worth noting that we only present the vector field function in Equation 24 to show its complexity. We do not provide the derivation of because we do not want to distract readers from our actual generation method by spending too much time introducing and deriving a function that is not used at all. However, as the reviewer points out in this question, we will add the above derivation to the appendix to ensure the completeness of our framework.
- 3.2 Our alternative solution is Equation 25 and Algorithm 2, which works like the DDIM[2].
In short, our solution can be understood as deterministic reverse diffusion process or ODE with discrete time steps. We use a state transfer probability (Equation 25) to extend the traditional reverse process , where . As in diffusion processes[3], if we set the timestep interval as 1, the probability path is obtained with .
In our framework, we get .
Since the state transfer probability is a Dirac delta function, there is no randomness in the reverse process, where the pair can be iteratively generated deterministically (Algorithm 2).
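For readers unfamiliar with the DDIM-style update that this answer refers to, the following sketch shows one standard deterministic DDIM step (the zero-variance case) driven by an $x_0$ prediction. It is the unrescaled baseline update, not the authors' Equation 25, and the names `ddim_step`, `alpha_bar_t`, and `alpha_bar_prev` are illustrative.

```python
import numpy as np

def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (no added noise) from an x0 prediction.

    alpha_bar_* are the cumulative signal coefficients of a variance-preserving
    schedule, so that x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    Requires alpha_bar_t < 1.
    """
    # Noise implied by the current state and the predicted clean sample.
    eps_pred = (x_t - np.sqrt(alpha_bar_t) * x0_pred) / np.sqrt(1.0 - alpha_bar_t)
    # Move to the previous noise level along the same implied noise direction.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

Because no fresh noise is injected, repeating this step from the same starting noise always yields the same trajectory, which is the sense of "deterministic" used in the rebuttal.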
- How to derive the training objective ?
- 4.1 From the perspective of the deterministic reverse process
We utilize the deterministic reverse diffusion process to generate data (Equation 25). Similar to DDPM[3], the learning objective for the unrescaled deterministic reverse diffusion process is derived in Appendix C.3 Equation 38, which can still be simplified to . For our rescaled reverse process , where both and are inputs, the objective is an equation set:
$$
\left\{ \begin{aligned} \mathbf{u}_{\tau_\Delta}\mathbf{x}_0 + \mathbf{v}_{\tau_\Delta}\hat{\boldsymbol{\epsilon}} &= \mathbf{u}_{\tau_\Delta}\mathbf{x}_\theta + \mathbf{v}_{\tau_\Delta}\hat{\boldsymbol{\epsilon}} \\ \mathcal{T}(t-\Delta t, G(\mathbf{x}_0,\hat{\boldsymbol{\epsilon}})) &= \mathcal{T}(t-\Delta t, G(\mathbf{x}_\theta,\hat{\boldsymbol{\epsilon}})) \end{aligned}\right. \Rightarrow \left\{ \begin{aligned} \mathbf{x}_0 &= \mathbf{x}_\theta \\ G(\mathbf{x}_0,\hat{\boldsymbol{\epsilon}}) &= G(\mathbf{x}_\theta,\hat{\boldsymbol{\epsilon}}) \end{aligned}\right.
$$
where is a constant. Besides, is a constant for the above equation.
The unique solution to this equation set is .
- 4.2 From the perspective of re-arranging
Given , we can define . To get the , we can directly calculate the partial derivative. For any dimension of , there is:
We can drop , because it cannot provide a valid value of without the term of .
Solving the equation is difficult, but it is easy to prove that is a solution.
Since is deterministic, it is an injection function. This means that the same input will not produce different outputs. Therefore, can always get the minimum value of and we simplify the objective to for convenience and training stability.
Q2: Method and the probability contour
- We have demonstrated the rescaled probability contours of in Figure 2, where sample points are exactly calculated by our proposed method (Equation 22).
As illustrated in Figure 2, the probability contours calculated by our method can be easily adapted to different discrete boundaries and noising trajectories, which is in line with our expectations in Figure 1 of the Introduction Section.
Thank you for this suggestion and we will add a quick revision at the end of the Method Section and link our formula to the Figure 2.
- is a series of functions, where the consistency to the boundary contour is inversely proportional to the subscript .
As illustrated in Figure 2 and Equations 21 and 22, is exactly the boundary contour while is a Gaussian distribution. When a boundary contour is very long in embedding space, far away points will gradually have a lower probability density under as increases. This means that, during the inference stage, the probability density will gradually become consistent with the contour when approaching the boundary, which is in line with Lines 46-51 and Figure 1B.
Besides, in practical applications we will not face the extreme case where the boundary contour is an infinitely long straight line, because the embedding space is finite and the values of the embedding points are normalized into a finite range. Therefore, when we set the diffusion space a bit larger than the embedding space, there is always a series of probability functions from Gaussian distributions to the boundaries.
[1] Flow Matching for Generative Modeling
[2] Denoising Diffusion Implicit Models
[3] Denoising Diffusion Probabilistic Models
[4] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
[5] Rectified Flow: A Marginal Preserving Approach to Optimal Transport
[6] Improving and Generalizing Flow-Based Generative Models with Minibatch Optimal Transport
[7] Non-Autoregressive Neural Machine Translation
[8] Understanding Knowledge Distillation in Non-Autoregressive Machine Translation
[9] Analog Bits: Generating Discrete Data Using Diffusion Models with Self-Conditioning
Q3: Confidence factor
- The confidence factor is named after some thoughts of sampling, such as the confidence interval. Our data is discrete and the uncertainty comes from the learned diffusion model.
Consider a simplified situation of only two tokens 'A' and 'B', which is the binary classification. When the confidence factor , points on the boundary of 'A' strictly follow . Since the learned diffusion model is expected to guide a random noise to this boundary, any subtle perturbation to the output of our learned diffusion model may lead to a reversal of the 's attribution. If we decrease the factor so that points on the boundary of the token 'A' have higher probability density, e.g., , we will be more confident that our generated points of the learned diffusion model belong to the token 'A'. (In this situation, is the discrete area of A, is the discrete area of B, and is the unattributed area.)
- The mode collapse when comes from two parts: one is the uncertainty of the learned diffusion model as described above. The other involves a traditional problem of diffusion processes.
- 2.1 From the perspective of uncertainty
For tasks with strong conditions, we are confident in the model's prediction since this is an easy task. Therefore, works perfectly (Lines 227-229 and Table 1), as the strong conditions greatly reduce the uncertainty of the learned model during inference.
When the condition is weak or there is no condition, uncertainty increases rapidly. This means that the sampled noisy training data constitutes a more difficult task with . When the task difficulty exceeds the model's capability, the collapse occurs.
- 2.2 From the perspective of diffusion processes
This is also a problem similar to the well studied one in Flow Matching with Dynamic Optimal Transport [4,5,6], i.e., the path from the noise to target is too long and learning this path is hard. For example, suppose there are two unconditional targets 'A' and 'B', the sampled noisy data of 'A' when may have a shorter path to 'B' than 'A'. Likewise, can be closer to A. The diffusion models are required to map to 'A' and to 'B', but it is easier to learn the mapping of to 'B'. This is a general problem for all diffusion models, not just ours. We believe the stability and performance of our method can be further improved with the Dynamic Optimal Transport, but we want to have a fair comparison with our baselines to demonstrate our effectiveness. Therefore, we just use a smaller to stabilize the training process and use the sub-optimal results to compare with the baselines.
Besides, the failure case is just the unexpected loss value during training, e.g., NaN, where these models are unable to generate contents.
- Fixing the initial path means the Trajectory Alteration (Line 8 in Algorithm 2) is optional in practice. This is a negligible detail of our model.
- 3.1 From the perspective of uncertainty
Line 8 in Algorithm 2 updates the trajectory. If the learned diffusion model predicts a different , we have to update the original noise because we are now on a different trajectory. If the learned model keeps changing the predicted target, which means it is uncertain about where to go, updating the trajectory to correspond to an unreliable target may not be beneficial for the denoising process. If the learned model rarely changes its prediction, it means we are on the right path and no updates are needed. Therefore, it would be a viable option to discard this step (Line 8 in Algorithm 2) to make decoding faster.
- 3.2 From the perspective of experiments
As illustrated in Table 3, if we train with the rescaled forward process and decode with unrescaled trajectory, there will be only a slight performance degradation. The performance improvement of our model mainly comes from the training process, and some small changes in the decoding process may not have much impact. In practical application, we just discard the trajectory update (Line 8 in Algorithm 2) for simplicity in our experiments and find almost no performance degradation.
We sincerely appreciate your response, and we are delighted to have addressed some of your concerns. We hope to better align our contributions with your understanding.
Q1: loss function
is a simplified objective where we drop the coefficient (Equation 38 in Appendix C.3).
Rewriting the expectation in the sum form, we have:
where is the function of and is a dynamic coefficient ranging from to . Our current objective minimizes the upper bound of , which is also effective.
As illustrated in Equation 38 of Appendix C.3, we simplify the training objective and drop the coefficient during training, which is similar to Equation 14 in DDPM. This coefficient dynamically assigns weights to different samples based on the noise and x0. Therefore, if we want to make , we need to calculate the dynamic coefficient during training, which may be time-consuming.
Q2: re-ranking
The transformer generates with a beam size of 5. We do not use sampling because
- We strictly follow our baselines, all of which use beam search generation.
- Diffusion LMs are not stable enough to keep achieving good results with simple sampling like the transformer, as illustrated in Appendix I and our baselines. They have the potential to generate high quality contents, but random initialization will have a great impact on the generated results. Transformers do not face this problem of randomness.
Take IWSLT14 as an example.
| | beam search | sampling |
|---|---|---|
| transformer | 34.31 | 34.05 |
| Ours Rerank | 35.02 | 33.30 |
When we convert the beam search to sampling, there is a large drop for the reranked diffusion model. As illustrated in Table 1, we do not intend to claim that our method surpasses transformers, but rather to show that our method has the potential to produce results comparable to transformers in addition to outperforming existing diffusion models.
Q3: Table 3
We want to clarify that the closest and simplest baseline is DDIM, not Flow Matching. Our method is a deterministic diffusion process, or an ODE with discrete time steps. We use the ODE framework to model the deterministic forward process (because the DDPM or DDIM framework cannot do this), so that our method is theoretically compatible with both the Diffusion process and Flow Matching. However, our method currently needs to predict in application and thus reverses the diffusion process step by step, which makes it a deterministic diffusion model, not a standard Flow Matching model. Therefore, the closest and simplest baseline is DDIM, not Flow Matching. (The DDIM sampling of BitDiffusion is 11.37, slightly worse than Gaussian sampling.)
If we want to apply the flow matching to the same discrete embedding space, there are several problems:
- If we predict , the only difference to BitDiffusion is the optimal transport trajectory. As in the ablation study (Table 2) for language, this sometimes improves performance, but the effect is not large.
Due to the high training cost, we currently only have results of 200K steps, where changing the trajectory to optimal transport does not make a significant difference.
| Model | FID |
|---|---|
| BitDiffusion repro | 22.12 |
| BitDiffusion Flow | 21.05 |
| Ours | 8.17 |
| Ours Flow | 8.08 |
- If we predict , our method is currently not adaptable to this objective. As demonstrated in Q1-3.1, we have to solve the equation to get the with respect to the change of in real time, which is currently inefficient to apply.
We want to add that the difficulty of solving the equation to get the with respect to the change of in real time is why we use Equation 25 as an alternative. Equation 25 discretizes to finite steps and keeps track of previous to make it tractable. The discrete time steps therefore make our model a deterministic diffusion process rather than ODE-based flow matching with infinitely many continuous timesteps.
In addition, if the reviewer still wants to know the performance of Flow Matching on binary coding, (although our method is not closely related to Flow Matching and is currently not efficient to the Neural ODE framework in application), we are training the model with the TorchCFM framework and will provide the results before the end of the discussion period.
Thank you for the continued and detailed engagement with my questions.
I appreciate this new analysis of the conditional vector field. As you have shown (the quantity you use during sampling) is an upper bound on (the quantity that you actually need to be using to be performing principled generative modelling). This is quite concerning since even if you train the network perfectly, you are not targeting the required quantity. I think this analysis should be included in the paper and it should be clearly stated that this departs from standard generative modelling objectives. You should also add discussion about why you still expect this to work, it is entirely not clear now what your framework is doing.
I understand better now your experimental contributions and see that BitDiffusion is indeed very much like your method but without the boundary considerations. In light of your good experimental results, it seems that your method does have promise even if I don't believe right now it is well justified.
I think the paper is quite borderline because of the unclear derivation of the methodology but good experimental results. I will raise my score to 5.
We sincerely appreciate your response, and we are delighted to have addressed most of your concerns. We attach great importance to your suggestions and have conducted some analytical experiments for better understanding.
Q: Why does our objective work?
- In practice, errors come not only from the theory but also from the capability of neural networks. And the error caused by optimizing the upper bound is much smaller than the error from the neural network. Therefore, the theoretical error caused by optimizing the upper bound is negligible.
The symbols in the formula may sometimes be confusing, and substituting numbers for them will lead to a more intuitive understanding. We add an additional analysis experiment on IWSLT14 to reveal the thinking behind our formula.
| Objective | | | | BLEU (BLEU-1/2/3/4) |
|---|---|---|---|---|
| | 8.4358 | 1.5613 | 51.81% | 33.42 (68.0/42.0/27.7/18.6) |
| | 8.4061 | 1.5518 | 52.34% | 33.49 (68.0/41.9/27.7/18.6) |
We report the expectation of 's and 's errors on the test set. It is easy to observe that, with the dynamic coefficient , the value of 's error (8.4358) is much larger (than 1.5613). This supports Lines 180-182 that predicting will amplify the error on , whereas predicting will reduce the error of neural networks on . Therefore, predicting is beneficial to reducing the impact of the prediction error of the neural network compared with .
In addition, we convert the objective to where the neural network still predicts . We find that the error expectations of and decrease 0.0297 (0.35%) and 0.0095 (0.61%), respectively. This means that the error caused by optimizing the upper bound is basically negligible compared to the error of the neural network. Furthermore, predicting the upper bound has almost no impact on the final performance (0.07 of the BLEU score).
- Discrete modeling may be more robust to neural network errors than the continuous modeling.
We compare DDPM and BitDiffusion with different image embedding types over the eval set of CIFAR-10.
| Models | |
|---|---|
| DDPM | 1.18% |
| BitDiffusion | 25.74% |
For continuous embedding, each pixel is represented with a float number in [-1,1], where the predictions will be discretized to integers in [0,255]. It is really hard to recover the original pixels with only one step (1.18%). For binary coding, where the pixel representation is an 8-dimensional vector, it is easier to recover due to a larger absorption state (25.74%). This means the discrete embedding space can be more robust to errors of neural networks. (It's worth noting that although discrete embedding is more robust, it weakens the continuity between adjacent pixels and currently cannot exceed the performance of continuous models.)
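The difference in error tolerance between the two pixel representations can be illustrated with a small sketch. It assumes the usual analog-bits encoding (8 bits per pixel mapped to {-1, +1} and decoded by thresholding at 0) and a linear map from [0,255] to [-1,1] for the continuous case; the function names are illustrative, and the sketch does not reproduce the percentages in the table above.

```python
import numpy as np

def pixel_to_bits(pixels):
    """Map uint8 pixels to 8 analog bits in {-1, +1}, most significant bit first."""
    pixels = np.asarray(pixels, dtype=np.uint8)
    bits = np.unpackbits(pixels[..., None], axis=-1)    # shape (..., 8), values 0/1
    return bits.astype(np.float64) * 2.0 - 1.0

def bits_to_pixel(analog_bits):
    """Threshold analog bits at 0 and pack them back into uint8 pixels."""
    hard = (np.asarray(analog_bits) > 0).astype(np.uint8)
    return np.packbits(hard, axis=-1)[..., 0]

def pixel_to_unit(pixels):
    """Map uint8 pixels linearly to floats in [-1, 1]."""
    return np.asarray(pixels, dtype=np.float64) / 127.5 - 1.0

def unit_to_pixel(x):
    """Round floats in [-1, 1] back to the nearest integer level in [0, 255]."""
    return np.clip(np.rint((np.asarray(x) + 1.0) * 127.5), 0, 255).astype(np.uint8)
```

In this encoding each bit tolerates a perturbation of magnitude just under 1 before its sign flips, whereas a continuous value decodes to the wrong level once its error exceeds half a level, roughly 1/255 on the [-1,1] scale. This is one way to read the "larger absorption state" remark.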
We would like to express our sincere gratitude to you again. Your suggestions are of great help in improving the quality of our paper.
We believe the following minor revisions will improve the clarity of our paper.
- In the Training Objective part of Section 3.3, we will change Equation 24 to a quick introduction of the and move the equation to this part. Then we will add the derivation from to as discussed above, including the upper bound. Therefore, our method covers the entire framework of Flow Matching in theory.
- In the Reverse Process part of Section 3.3, we will add an explanation that solving the equation to get the with respect to the change of in real time is difficult. Therefore, the theory and practice are distinguished in this part, where we explain how to implement our method.
- In Section 4, we will add analyses about the expectation of errors as demonstrated above.
- Other derivations and discussions that do not affect reading will be added to the appendix.
I appreciate the empirical analysis of the error induced when switching from the based objective to the based objective. I also think the paper's clarity will be greatly improved by including this derivation of the loss in the main text together with its relation to flow matching; it will definitely help readers understand your method. I will raise my score to 6, as my main concern about the unclear derivation of the method is mostly resolved.
Thank you again for your patience and kindness, your suggestions and guidance have been very helpful.
In this study, the authors propose a framework to address the discrepancy between discrete data and continuous modeling in diffusion processes. The approach involves a two-step forward process that first estimates the boundary as a prior distribution, then rescales the forward trajectory to build a boundary conditional diffusion model. The reverse process is proportionally adjusted to ensure the learned contours yield more accurate discrete data. Experimental results show the method's strong performance in language modeling and discrete image generation tasks, outperforming previous models in several areas and establishing a new state-of-the-art for categorical image generation on the Cifar-10 dataset.
Strengths
- The authors have conducted comprehensive experiments, providing thorough and complete details in their exploration.
- The novelty of the study lies in its discussion of the concept of "discrete area", presenting a fresh perspective in the field.
Weaknesses
- The explanation as to why this new framework can enhance performance is not clearly articulated in the study.
- From my understanding, Algorithm 1 appears to be equivalent to training original diffusion models with as opposed to .
Questions
- Is there a possibility that the boundaries of the discrete areas of two distinct data points might intersect? If so, why would pushing the noise to the boundary be beneficial? Could certain perturbations potentially lead to inaccuracies in the final result?
- According to Equation (14), shouldn't the input of be the boundary point and the "crossing boundary time"? If so, why does line 8 of Algorithm 2 suggest that a sequence of ranging from to can serve as the input of ?
- Please also refer to the weaknesses.
If these questions are addressed, I would consider revising the score.
Limitations
The authors do acknowledge the limitations of their work, which is commendable. Furthermore, I recommend conducting theoretical analysis, such as exploring the convergence properties of their new framework.
Q3: Boundaries of discrete areas
- The discrete area of two distinct data points will not intersect.
As illustrated in Lines 34-37 and Equation 6, points in the discrete area will not be attributed to any other area by definition.
Take language modeling as an example: the discrete area of a token 'A' is the set of all points in the embedding space whose dot-product similarity to token 'A' is larger than to any other token in the vocabulary. This means almost every point in the embedding space is attributed to exactly one token. When the dot-product similarities of a point to several tokens are equal, the point lies on the boundary of these discrete areas.
Besides, the discrete area is convex. This means all points within the boundary belong only to this area, i.e., we can safely describe the area by its boundaries as in Section 3.1. A quick proof of convexity is:
Suppose the embedding of token 'A' is , given two points and in the embedding space.
If and are inside the discrete area of token 'A' and for all embeddings of other tokens,
there is: and .
For any other points ,
there is .
Therefore, the discrete area is convex.
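The convexity argument can be written out explicitly. The notation here is assumed for illustration, since the symbols of the original comment are not preserved: $e_A$ denotes the embedding of token 'A', $e_j$ the embedding of any other token, and similarity is the dot product as described above.

```latex
\begin{align*}
&\text{Let } x_1, x_2 \text{ satisfy } x_1 \cdot e_A \ge x_1 \cdot e_j
  \text{ and } x_2 \cdot e_A \ge x_2 \cdot e_j \text{ for all } j \ne A. \\
&\text{For } x = \lambda x_1 + (1-\lambda) x_2,\ \lambda \in [0,1]: \\
&x \cdot e_A = \lambda\, x_1 \cdot e_A + (1-\lambda)\, x_2 \cdot e_A
  \ge \lambda\, x_1 \cdot e_j + (1-\lambda)\, x_2 \cdot e_j = x \cdot e_j .
\end{align*}
```

Every point on the segment between two points of the area is therefore again attributed to token 'A'. Equivalently, the area is an intersection of half-spaces, one per competing token.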
- Boundaries of two areas can overlap with the confidence factor , but this is impossible when .
When the confidence factor , as illustrated in Equations 7 and 8, the boundary is where the probabilities of points belonging to the neighbour areas equal. Consider a simplified situation of only two tokens 'A' and 'B', which is the binary classification, the boundary of token 'A' overlaps the boundary of token 'B', where .
When the confidence factor decreases, the range of each discrete area is gradually reduced. For example, we may take as the boundary of 'A' and as the boundary of 'B'. In this case, is an unattributed area and there is no overlap between the boundaries of 'A' and 'B'.
- Pushing the noise to the boundary is beneficial.
- Empirically, previous work such as Difformer and Dinoiser, as well as our Table 1, has shown that increasing the minimal noise scale can benefit performance (Lines 217-219). In both their experiments and our Table 1, compared to the original diffusion LMs like DiffuSeq and SeqDiffuSeq, increasing the minimal noise scale (Difformer and Dinoiser) improves performance. Pushing the noise to the boundary is precisely what increases the minimal noise scale.
- Theoretically, probability density inside the discrete area can be ignored.
Let's make a simple analogy. If we want to launch a satellite around the moon, we can treat the moon as a point mass. This is similar to the traditional continuous diffusion process. When it comes to landing on the moon, we have to take into account the shape of the moon, while the gravity field inside the moon does not matter. This is what our discrete problem looks like.
Based on the observation that any point inside the discrete area will be attributed to the corresponding token (Lines 34-37), the derivation is straightforward: we don't need to model the probability density inside the discrete area because our diffusion model only needs to guide a noise sample into this area, and the exact end position doesn't matter. Therefore, pushing the noise to the boundary does not cause theoretical damage to the probability modeling. In addition, the diffusion model is less disrupted by invalid data, i.e., points inside the area, during the training stage.
- Certain perturbations can lead to inaccuracies in the final result without our confidence factor.
We use the confidence factor (Lines 137-147) to control the influence of perturbations. Given the above example of only two tokens 'A' and 'B'. When the confidence factor , points on the boundary of 'A' follow . Therefore, any perturbation to the prediction of the learned diffusion model may lead to a reverse of the attribution of . To alleviate this problem, we can decrease the factor so that points on the boundary have higher confidence, e.g., , which is more robust to perturbations. In practice, works well for tasks with strong conditions and we should decrease for unconditional tasks.
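The two-token example can be made concrete with a toy calculation. This is a sketch under stated assumptions, not the paper's Equations 7 and 8: the embeddings `E_A` and `E_B`, the softmax over dot products, and the boundary probabilities 0.5 and 0.7 are all hypothetical values, chosen only to show why a boundary with higher confidence tolerates a small perturbation.

```python
import math

# Hypothetical 1-D embeddings for two tokens.
E_A, E_B = 1.0, -1.0


def prob_a(x):
    """Softmax probability of token 'A' from dot-product similarities."""
    logit_a, logit_b = x * E_A, x * E_B
    return 1.0 / (1.0 + math.exp(logit_b - logit_a))


def boundary_point(confidence):
    """Point x at which prob_a(x) equals the requested confidence."""
    # prob_a(x) = sigmoid(2x)  =>  x = 0.5 * logit(confidence)
    return 0.5 * math.log(confidence / (1.0 - confidence))


def decode(x):
    """Attribute x to the token with the larger probability."""
    return "A" if prob_a(x) >= 0.5 else "B"


perturbation = -0.1  # small prediction error pointing towards 'B'
for confidence in (0.5, 0.7):
    x = boundary_point(confidence)
    print(confidence, decode(x), "->", decode(x + perturbation))
```

A point sitting exactly where both tokens are equally likely changes its attribution under the perturbation, while a point on the tighter boundary keeps it.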
We sincerely appreciate your meticulous review and the valuable comments. Please kindly find our response below.
Q1: Why can our framework enhance performance
- From the perspective of Motivation
As illustrated in the Introduction and Figure 1, we find that continuous diffusion processes oversimplify the objective of discrete modeling. The discrete data are treated as single points, while they are actually areas in the diffusion space. This means diffusion models learned on the oversimplified objective generate a sub-optimal probability density contour, which is inconsistent with the discrete priors (Lines 38-41 and Figure 1A). Therefore, our method takes the discrete prior into consideration and designs a more appropriate diffusion trajectory for discrete modeling problems (Lines 42-65 and Figure 1B).
- From the perspective of Methodology
We theoretically derive the diffusion process conditioned on the discrete priors and obtain the sampling distribution of the corresponding forward process (Equations 21 and 22). It is easy to calculate that the is exactly the discrete boundary and is a Gaussian distribution. During the inference stage, i.e., reversing the time step from to , the probability density will gradually become consistent with the contour when approaching the boundary, which is in line with Lines 46-51 and Figure 1B. Hence, our method will precisely guide the random noise into the discrete area. Figure 2 shows sample points calculated exactly by our proposed method (Equation 22). The probability contours calculated by our method can be easily adapted to different discrete boundaries and noising trajectories, which is in line with our expectations in Figure 1 of the Introduction.
- From the perspective of Experiment
Our framework is built on Difformer and BitDiffusion with the same configurations for both language and image generation tasks. As shown in Tables 1 and 3, our method demonstrates significant improvements over them. Besides, the ablations in Table 4 reveal that, as the confidence factor increases, i.e., as the influence of the discrete prior increases, the discrete modeling performance improves in all aspects.
Q2: Algorithm 1
You can consider . However, is not a constant but a random variable as illustrated in equation 12.
Sampling the from and then transforming it to with is equivalent to directly sampling from . We choose the former procedure because it is easy to perform in parallel. Algorithm 1 is just a commonly used variable substitution technique in sampling.
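The equivalence claimed here is an instance of the usual change of variables in sampling. As a minimal sketch (the Gaussian form, the mean `mu`, and the scale `sigma` are placeholder assumptions, and the actual transform of Algorithm 1 is not reproduced), drawing a simple base variable and then transforming it yields the same distribution as sampling the target directly, while the first route is a single vectorized draw.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5  # placeholder parameters, not taken from the paper
n = 200_000

# Route 1: sample a standard base variable, then transform it.
eps = rng.standard_normal(n)
transformed = mu + sigma * eps

# Route 2: sample the target distribution directly.
direct = rng.normal(mu, sigma, size=n)

print(transformed.mean(), direct.mean())
print(transformed.std(), direct.std())
```

Both routes agree in mean and standard deviation up to Monte Carlo error, and the base draw in the first route does not depend on the per-sample parameters, which is what makes it easy to batch.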
Q3: Boundaries of discrete areas
Due to the character limit of the rebuttal, this is only an outline of our answer. Please refer to the comment for more details.
- The discrete area of two distinct data points will not intersect.
As illustrated in Lines 34-37 and Equation 6, points in the discrete area will not be attributed to any other area by definition.
- Boundaries of two areas can overlap with the confidence factor , but this is impossible when .
When the confidence factor , as illustrated in Equations 7 and 8, the boundary is where the probabilities of points belonging to the neighbour areas equal. When the confidence factor decreases, the range of each discrete area is gradually reduced and there is no overlap between boundaries.
- Pushing the noise to the boundary is beneficial.
- Empirically, previous work such as Difformer and Dinoiser, as well as our Table 1, has shown that increasing the minimal noise scale can benefit performance (Lines 217-219). Pushing the noise to the boundary is precisely what increases the minimal noise scale.
- Theoretically, probability density inside the discrete area can be ignored. Based on the observation that any point inside the discrete area will be attributed to the corresponding discrete data (Lines 34-37), the derivation is straightforward: we don't need to model the probability density inside the discrete area because our diffusion model only needs to guide a noise sample into this area, and the exact end position doesn't matter. Therefore, pushing the noise to the boundary does not cause theoretical damage to the probability modeling. In addition, the diffusion model is less disrupted by invalid data, i.e., points inside the area, during the training stage.
- Certain perturbations can lead to inaccuracies in the final result without our confidence factor.
We use the confidence factor (Lines 137-147) to control the influence of perturbations. When we set a smaller , the learned reverse process will be robust to perturbations.
Q4: Input of function
The input of is valid for any arbitrary pair of point and time . Functions and in Equation 14 are directly extended from Equations 4 and 5, which construct the invertible relationship between any point-time pair (-) and their initial noise .
In Equation 14 of Section 3.1, we take the boundary point and the crossing boundary time pair as input because we just want to estimate this boundary. However, this does not mean the input of can only be the pair of boundary point and crossing boundary time, while any other point-time pairs are still valid. This is why we use to update the trajectory in Line 8 of Algorithm 2.
Thanks for your detailed reply.
Now, I have no concerns about the motivation of this work; it is reasonable. But I am still confused about the design of the sampling procedure (Algorithm 2). Can you provide more illustration of it? A clearer presentation would be beneficial.
We sincerely appreciate your response, and we are delighted to have addressed some of your concerns.
Q: The design of sampling (Algorithm 2)
Let's start from the sampling process of DDIM under our notation.
- for do
3. (Pseudo Target)
4. (Trajectory Alteration)
5. (Previous Sample)
This is a synchronous trajectory update where the algorithm directly updates the start point of the trajectory based on the current prediction .
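For reference, the synchronous update can be sketched as a standard deterministic DDIM loop. This is a minimal sketch, assuming a toy linear schedule `alpha_bar` and an oracle `predict_x0` standing in for the trained network (both hypothetical). The mapping of the three step names onto the lines of the loop is an interpretation of the description above, not the paper's Algorithm 2.

```python
import numpy as np


def ddim_sample(predict_x0, x_T, alpha_bar):
    """Deterministic DDIM sampling (eta = 0).

    alpha_bar[t] is the cumulative signal level at step t, with
    alpha_bar[0] == 1 (clean data) and alpha_bar[-1] close to 0 (noise).
    """
    x_t = x_T
    for t in range(len(alpha_bar) - 1, 0, -1):
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        # Pseudo Target: predict the clean sample from the current state.
        x0_hat = predict_x0(x_t, t)
        # Trajectory Alteration: re-estimate the noise that, together with
        # x0_hat, explains x_t (this updates the trajectory's start point).
        eps_hat = (x_t - np.sqrt(a_t) * x0_hat) / np.sqrt(1.0 - a_t)
        # Previous Sample: move along the updated trajectory to step t - 1.
        x_t = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x_t


# Toy usage: an oracle predictor that always returns the true target.
target = np.array([0.3, -1.2, 0.7])
alpha_bar = np.linspace(1.0, 0.01, 50)
noise = np.random.default_rng(0).standard_normal(3)
sample = ddim_sample(lambda x, t: target, noise, alpha_bar)
```

Because the noise estimate is recomputed from the current prediction at every iteration, the prediction and the trajectory are always updated together, which is the synchronous behaviour described above.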
Our method requires one more step of timestep rescaling , which takes the current trajectory start point as the input. However, the trajectory alteration (step 4 above) also requires the current timestep as the input . Therefore, our procedure cannot be synchronous and we must choose or to be asynchronous.
As discussed with reviewer 1VFK (Q3.3 fixing the initial path), we find that even discarding the step of Trajectory Alteration leads to almost no performance degradation.
(1. From the perspective of uncertainty, if the learned diffusion model predicts a different , we have to update the original noise because we are now on a different trajectory. If the learned model keeps changing the predicted target, which means it is uncertain about where to go, updating the trajectory corresponding to an unreliable target may not be beneficial for the denoising process. If the learned model rarely changes its prediction, it means we are on the right path and no updates are needed. Therefore, it would be a viable option to discard this step of Trajectory Alteration to make decoding faster. 2. From the perspective of experiments, as illustrated in Table 3, if we train with the rescaled forward process and decode with the unrescaled trajectory, there will be only a slight performance degradation. The performance improvement of our model mainly comes from the training process, and some small changes in the decoding process may not have much impact. In practice, we simply discard the trajectory update in our experiments and find almost no performance degradation.)
Therefore, we choose to be asynchronous and the steps in the for loop (line 2) become:
3. (Pseudo Target)
4. (Trajectory Rescaling)
5. (Previous Sample)
6. (Asynchronous Trajectory Alteration)
where we add the Trajectory Rescaling step and move the Asynchronous Trajectory Alteration to the end. The other steps in Algorithm 2 keep track of the current and .
It's worth noting that we can not loop the as because is dynamic and different for each pair of and . In the same iteration round, different corresponds to different .
Therefore, Algorithm 2 can complete the iterative prediction of with subtle errors. And in the analytical experiments discussed with reviewer 1VFK, the errors in the sampling stage have almost no impact on the final results.
(There is a typo in Algorithm 2 that in Line 6 should be )
We hope to supplement the above description in a more vivid way.
The sampling process of DDIM can be demonstrated as:
In this process, the reverse trajectory is . When the prediction of changes, the trajectory is updated with as the fixed point , which can be demonstrated as:
Therefore, DDIM algorithm can be simplified as:
- for do
(Pseudo Target)
(Previous Sample)
In our Algorithm 2, the timestep rescaling takes both and as the input. This means errors in the predicted will be magnified if we update before rescaling the timestep. Therefore, there are two different solutions:
- One is our implementation, where we directly discard the Trajectory Alteration step, and the fixed point of our reverse trajectory is :
It is worth noting that, in this situation, the Trajectory Alteration step acts as a placeholder and does not change the value of , because the Previous Sample (Step 5) and Asynchronous Trajectory Alteration (Step 6) are exactly the same equation.
- The Asynchronous Trajectory Alteration works on the other solution where we still take as the fixed point. In this situation, the Previous Sample step is . This means we calculate the with but generate with the updated . The corresponding reverse trajectory is:
As mentioned in Lines 195-196 of our paper and discussion with reviewer 1VFK, there is almost no performance gap between the two different fixed point solutions of our algorithm. Therefore, we choose the former one which is simple and stable. Besides, the Asynchronous Trajectory Alteration step (Line 8 in Algorithm 2) is optional in practice.
Thanks for your detailed reply.
I will raise the score to 5. More illustration on the algorithm should be added in revision.
Thank you for your patience and kindness, your suggestions and guidance have been very helpful.
We will make the following minor revisions to make it clearer.
- Replace the current iteration notation with a simpler one.
- Add the explanation about the asynchronous conflict between and and the trajectory alteration strategy as discussed above.
- Add comparison with DDIM in the appendix for better understanding.
- Other derivations and discussions that do not affect reading will be added to the appendix.
We believe these revisions can improve the clarity of our algorithm and we would like to express our sincere gratitude to you again.
Besides, the asynchronous conflict problem is the same as the problem of discussed with reviewer 1VFK.
In the Reverse Process part of Section 3.3, we will add an explanation that solving the equation to get the with respect to the change of in real time is difficult. Therefore, the theory and practice are distinguished in this part, where we explain how to implement our method.
Therefore, we will combine these two explanations.
A discrepancy exists when using continuous diffusion models to model discrete data, which has not been sufficiently addressed in past methods. This paper redesigns the forward and backward processes of the diffusion model to eliminate this issue. Experiments on translation tasks and CIFAR-10 effectively validate the proposed method's effectiveness.
Strengths
- The writing is clear and concise.
- This paper aims to address an important and interesting question.
- Experiments demonstrate that the proposed method surpasses all baselines, validating its effectiveness.
Weaknesses
Please see the Questions.
Questions
- In Sec. 4, the authors present an untrainable given the embedding layer. If we employ a trainable (e.g., Diffusion-LM [1]), will the discrepancy between discrete data and continuous modeling still exist? If it does, will your proposed method still bring improvements in this scenario?
- This paper presents the results of translation tasks. I am curious to know whether the proposed method would be effective for models like Plaid [2], which can compute text likelihood. The discrepancy between discrete data and continuous modeling mainly occurs during the step of decoding the embedding into text, while the likelihood calculation does not require this decoding step.
- In Table 3, the authors report FIDs of 51.27 and 30.97 for D3PM. However, the best generative FID on CIFAR-10 in the original D3PM paper is 7.34. Why didn't the authors report this result as a baseline? This requires further explanation.
[1] Diffusion-LM Improves Controllable Text Generation
[2] Likelihood-Based Diffusion Language Models
Limitations
yes
We sincerely appreciate your meticulous review and the valuable comments. Please kindly find our response below.
Q1: The embedding layer
- We want to clarify that the embedding layer we use is actually trainable, as illustrated in Lines 200 and 208. The is only used to denote the diffusion parameters in our paper.
- The discrepancy between discrete data and continuous modeling exists for both fixed and trainable embedding layers.
- Besides, we are fully aware that it has been demonstrated in Diffusion-LM that a trainable embedding layer achieves better performance than the fixed one.
Q2: Application scenarios of our framework
- There seems to be an embedding layer in Plaid as well (Section 3.1 in Plaid), so our framework is theoretically applicable. Likelihood calculation is also a similarity function that has the same effect as the dot product in our work, which is compatible with the framework we defined.
- Besides, as illustrated in Lines 28-30, our framework is specifically designed for continuous diffusion models, whereas discrete diffusion processes with a discrete state space do not face the problem in Figure 1A and Lines 38-41. Therefore, our framework extends the generality of the widely used continuous diffusion processes, but does not help discrete diffusion processes.
Q3: FID of D3PM
We want to clarify that we have reported the original D3PM in Table 3 with the FID of 7.34, in the first line (D3PM GAUSS) of the Discrete Ordinal Pixels part.
As demonstrated in Table 3 and Section 5, we use different types of image embeddings. Different embedding types possess different levels of continuous information. The D3PMs with FIDs of 51.27 and 30.97 are the UNIFORM and ABSORBING versions with Categorical Pixels.
Thank you for your response. I've decided to maintain my score.
Thank you for your response and kindness, your suggestions have been very helpful.
The paper discusses a novel approach to discrete modeling using boundary conditions and diffusion processes. The primary contributions include:
- Development of a framework for discrete image generation that incorporates binary coding and pixel embedding.
- Introduction of an intermediate state to illustrate the correlation between discreteness and modeling difficulty.
- Extensive experimental evaluation on datasets like CIFAR-10, demonstrating competitive results compared to state-of-the-art models.
- Proposal of a new method for stochastic processes that improve model performance, especially in cases of categorical pixels.
Strengths
- Originality: The paper introduces a unique method for addressing discrete modeling through a combination of binary coding and pixel embedding, which is innovative in the context of diffusion models.
- Quality: The experimental setup is robust, with detailed evaluation metrics and comparisons to existing methods. The results show significant improvements, indicating the effectiveness of the proposed approach.
- Clarity: The paper is well-structured, with clear explanations of the methodology and the underlying mathematical formulations. Figures and tables are used effectively to illustrate the results.
- Significance: The approach has broad applicability in various fields requiring discrete data generation, such as image and language modeling. The improvements in FID scores and other metrics highlight the potential impact of this work on the research community.
Weaknesses
- The approach is primarily tested on specific datasets like CIFAR-10. Expanding the evaluation to a wider range of datasets with high-resolution images could provide a more comprehensive assessment of the model's generalizability.
- Language model experiments are mainly on translation tasks. Could you provide experiments on language modeling tasks, which are more common for benchmarking diffusion LMs?
Questions
See weaknesses.
Limitations
Limitation is mentioned in one sentence in the conclusion section, but not comprehensively discussed.
We sincerely appreciate your meticulous review and the valuable comments. Please kindly find our response below.
Q: Expanding datasets for images and languages
- For both language and image tasks, our framework is built on our baselines, Difformer and BitDiffusion, with the same configurations. We strictly follow their benchmarks and datasets to ensure a fair comparison.
- Our experiments across languages and images may have reflected the generalizability of our framework.
- Since training the image model takes a week and training the language model takes at least two days (lines 719-720), we do not have enough computing resources to scale to a larger and more complex benchmark. We sincerely accept your suggestions and will try to expand our method to larger benchmarks when we have enough computing resources.
- We tried to run the experiment on ImageNet 64, but our computing resources are currently insufficient to complete it. We show the preliminary results of the first 100K steps as follows.

| Binary Coding | FID |
|---|---|
| BitDiffusion | 47.82 |
| Ours | 31.68 |
We sincerely thank the reviewers for their efforts and suggestions.
We would like to briefly summarize our discussions with the reviewers.
- All reviewers agree that our method is novel and meaningful, which incorporates discrete areas as priors when modeling discrete problems with continuous diffusion processes.
- All reviewers agree that our experiments are effective with comprehensive baselines and promising performances.
- Reviewers are mainly confused about the equations and algorithms in our method, and we provide detailed derivations and explanations in the discussion and have addressed their concerns.
- Reviewer 1VFK raises questions about why our theoretical design is effective, and we supplement additional analytical experiments and theoretical explanations to address these concerns.
According to the suggestions of reviewers uQEt and 1VFK, we will make the following minor revisions to improve the clarity of our paper:
- In the Training Objective part of Section 3.3, we will change Equation 24 to a quick introduction of the and move the equation to this part. Then we will add the derivation from to as discussed, including the upper bound. Therefore, our method covers the entire framework of Flow Matching in theory.
- In the Reverse Process part of Section 3.3, we will add an explanation that solving the equation to get the with respect to the change of in real time is difficult. Therefore, the theory and practice are distinguished in this part, where we explain how to implement our method.
- In the Algorithm 2 part of Section 3.3, we will add an explanation about the asynchronous conflict between and (which will be connected to the above explanation of ) and the trajectory alteration strategy as discussed. Besides, we will replace the notation of Algorithm 2 with a simpler one.
- In Section 4, we will add analyses about the expectation of errors as demonstrated in the response to reviewer 1VFK.
- Other derivations and discussions that do not affect reading will be added to the appendix.
We again express our gratitude to the ACs and reviewers for their efforts.
The authors present a model that adapts continuous diffusion processes to discrete modeling. The proposed method is focused on injecting guidance for discrete boundaries. The approach is validated on vision and language tasks.
The reviewers agreed on the novelty of the presented approach compared to reference methods. Two of them raised some theoretical concerns, but the authors addressed them in detail in the rebuttal stage. There are also some concerns regarding the experimental part of the work, especially for images, where only CIFAR-10 is used. However, the results for language applications and the novelty of the work are satisfactory enough, and the paper should be accepted.