PaperHub
Overall rating: 7.3/10 (Poster; 4 reviewers; scores 4, 5, 5, 4; min 4, max 5, std 0.5)
Confidence: 3.0 · Novelty: 2.5 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

Transition Matching: Scalable and Flexible Generative Modeling

Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

Transition matching (TM) is a discrete-time, continuous-state generative modeling framework that advances both flow/diffusion and autoregressive models. TM variants achieve state-of-the-art text-to-image generation.

Abstract

Keywords

generative models · flow matching · diffusion · large scale · multimodal

Reviews and Discussion

Review (Rating: 4)

This paper introduces a general framework called Transition Matching (TM), which unifies continuous-state autoregressive models. The framework is built upon three core design choices: the supervision process, kernel parameterization, and the modeling paradigm. By making specific selections within these dimensions, the authors propose three powerful new variants: DTM, ARTM, and FHTM. The empirical results show that the proposed FHTM can match the quality of full-sequence flow models in image generation.

Strengths and Weaknesses

[Distinctions to previous works]

The paper proposes a general framework for continuous-state autoregressive models by defining three major design spaces.

However, the overall writing was not easy to follow, with unconventional notation and missing justifications and explanations. First, design-space analysis has been widely explored in previous works: [1] (diffusion), [2-3] (diffusion and flow), [4] (diffusion, flow, and CTMC). To me, Sec. 2.1 reads as a preliminary, not a core component of the paper, and the authors' claim of unifying flow and diffusion under TM is too bold.

In Sec. 2.2, the paper specifically focuses on continuous-state autoregressive models, and I could not spot the differences between DTM and MAR [5]. While the authors propose modeling the latent difference, rather than the conventional $\epsilon$-prediction in diffusion models or velocity prediction in flow models, I am not convinced why this is needed or why such a design choice should give better results. Similarly, the use of the independent linear process is not justified. What does it mean exactly for the conditional reverse kernel to have better regularity?

[DTM results]

The authors show that DTM achieves the best performance compared to full-sequence denoising methods such as FM. However, I find the results contradictory, since DTM also outperforms AR-based methods that adopt causal attention. Additionally, how does DTM compare against FM in inference time or training stability?

Minor typos: well-explored (L2); mixed use of discrete and continuous time (L125-126).

[1] Denoising Diffusion Implicit Models

[2] Flow Matching for Generative Modeling

[3] Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

[4] Generator Matching: Generative modeling with arbitrary Markov processes

[5] Autoregressive Image Generation without Vector Quantization

Questions

Please address the weaknesses above.

Limitations

Yes.

Final Justification

The paper presents a new parameterization that benefits from larger mode coverage. Additionally, the comprehensive results show promise for AR-diffusion models and provide apples-to-apples comparisons that can benefit future research. I recommend this paper be accepted to NeurIPS 2025.

Formatting Issues

None.

Author Response

The overall writing was not easy to follow with unconventional notations and missing justifications and explanations.

We thank the reviewer for the feedback and are sorry to hear that parts of the paper were difficult to follow. We made an honest effort to find a notation that describes the design space of TM. If the reviewer can provide more specifics on what is unclear, we will try to clarify further.

Design space analysis has been widely explored in previous works [1] (diffusion), [2-3] (diffusion and flow), [4] (diffusion, flow, and CTMC). For me Sec. 2.1 seems as a preliminary, not a core component of the paper.

The reviewer correctly notes that previous works have explored learning transition kernels of Markov chains with a simulation-free loss similar to equations 4 and 5, particularly in the context of diffusion and flow-based models. However, we identify several key differences that make Section 2.1 more than just a preliminary: (1) [1-4] focus on factorized transition kernels (e.g., Gaussian/deterministic kernels), with the exception of MAR [5], which is patch-wise factorized; we further elaborate on DTM vs. MAR in an answer below. In contrast, we formulate a general, simulation-free learning methodology for arbitrary transition kernels. (2) We provide a formulation that allows arbitrary supervision processes and a systematic way to consider different kernel parameterizations. (3) We feel the level of generality we chose allowed us to provide a full exposition of TM, including all design choices, within two (hopefully clear) pages. We expect this to let future researchers quickly become familiar with the field in a general manner, capturing the main design choices of existing methods.

The author's claims of unifying flow, diffusion under TM is too bold.

TM includes diffusion/flow matching and AR; however, we do not claim it to be the first framework to include both diffusion and flow matching (or AR, for that matter). If the reviewer can point to where such a claim is made in the paper, we are happy to clarify it in the text.

In Sec. 2.2, the paper specifically focuses on continuous state autoregressive models, which I could not spot the differences of DTM from MAR [5].

DTM and MAR indeed both use the same modeling, i.e., a DiT backbone with a small flow/diffusion head, but they are distinct in their supervising process and parameterization, which lead to different training and generation processes and, notably, to very different performance, with MAR considerably sub-par to DTM.

In detail: Supervising Process:

  • DTM: $X_t = \left(1-\frac{t}{T}\right)X_0 + \frac{t}{T}X_T$, where $X_0 \sim \mathcal{N}(0, I)$.

  • MAR: $X_t = (1 - B_t) \circ M + B_t \circ X_T$, where $M$ is a masked image, $B_t^i,\ i = 1, \ldots, d$ are i.i.d. Bernoulli$\left(\frac{t}{T}\right)$ random variables, and $\circ$ is the Hadamard product.

Parameterization:

  • DTM: $Y = X_T - X_0$.
  • MAR: $Y = X_T$.

Hence, (i) the DTM model’s input ($X_t$) is a noisy image, while the MAR model’s input ($X_t$) is an image where some patches are clean and some are masked. (ii) The DTM model’s output ($Y$) is a direction $X_T - X_0$ intersecting $X_t$, while the MAR model’s output ($Y$) is an estimate of the clean image. (iii) On generation, DTM is non-AR: it predicts a direction and takes a small step in that direction, changing all patches at each step. In contrast, MAR is arbitrary-order AR: it predicts a clean image and masks back some of the patches, thus changing only a “set of tokens” at each step. A set of tokens of size one yields an autoregressive model in the order of patch generation.
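The distinction between the two supervising processes and parameterizations can be sketched in a few lines of NumPy (a minimal illustration with made-up shapes and a zero mask value; this is our sketch, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, t = 16, 8, 3        # latent dim, number of steps, current step (illustrative)
x_T = rng.normal(size=d)  # stands in for a clean latent image

# DTM supervising process: linear interpolation from Gaussian noise.
x_0 = rng.normal(size=d)                     # X_0 ~ N(0, I)
x_t_dtm = (1 - t / T) * x_0 + (t / T) * x_T  # noisy image
y_dtm = x_T - x_0                            # target Y = X_T - X_0 (a direction)

# MAR supervising process: per-patch Bernoulli masking.
M = np.zeros(d)                  # masked image (zeros stand in for the mask value)
b_t = rng.random(d) < t / T      # B_t^i ~ Bernoulli(t/T), i.i.d.
x_t_mar = np.where(b_t, x_T, M)  # some patches clean, some masked
y_mar = x_T                      # target Y = X_T (the clean image itself)
```

Note that the DTM input satisfies `x_t_dtm == x_0 + (t/T) * y_dtm`, so the target direction indeed passes through the current state, while the MAR input differs from the clean image only on masked coordinates.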

While the authors propose modeling the latent difference, rather than the conventional $\epsilon$-prediction in diffusion models or velocity prediction in flow models, I am not convinced why this is needed or why such a design choice should give better results.

First, we would like to emphasize that in practice, as shown in the paper, DTM compares considerably favorably to the flow matching (FM) method.

Second, as the reviewer is probably aware, all diffusion-type methods reproduce the training set at a global minimum of the loss, so the hope of theoretically distinguishing generalization abilities among them is probably futile. However, we can offer an intuition: given a current state $X_t = x_t$, FM learns to approximate the expected transition $Y_t = \mathbb{E}[X_T - X_0 \mid X_t = x_t]$, while DTM learns to sample the underlying distribution of transitions $Y_t \sim p_{Y|t}(\cdot \mid X_t = x_t)$. We hypothesize that in large-scale settings (as in our case), where model capacity is not a constraint, the more elaborate supervision of DTM is beneficial for model training.
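A toy one-dimensional example (ours, not from the paper) of why regressing the expectation differs from sampling the transition distribution:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy bimodal transition target: Y is +1 or -1 with equal probability.
y_samples = rng.choice([-1.0, 1.0], size=10_000)

# An FM-style regression converges to the conditional mean, which sits
# between the two modes and is itself never an actual transition.
mean_y = y_samples.mean()

# A DTM-style sampler instead draws from the distribution, always
# landing on one of the two modes.
one_draw = rng.choice(y_samples)
```

Here `mean_y` is close to 0 while every individual draw is ±1; in FM this averaging is resolved over many small integration steps, whereas DTM resolves it inside each transition.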

Similarly, the use of the independent linear process is not justified. What does it mean exactly for the conditional reverse kernel to have better regularity?

We respectfully disagree that the independent process is unjustified.

First, we refer the reviewer to Figure 10 in the Appendix, which shows a significant advantage for the independent linear process; therefore, using this process is justified empirically.

Second, regarding regularity: given a current state $X_t = x_t$, Figure 6 in the paper illustrates that the support of the transition kernel of the independent linear process (equation 15) is significantly larger than the support of the linear process (equation 10). In practice, using the linear process with TM steps $> 1$ for training an autoregressive kernel results in trivial overfitting (since a linear interpolant is fully determined by any two points on the trajectory) and leads to poor results. Thus, the independent linear process, which has wider support, yields a better-regularized training objective and leads to state-of-the-art results.
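The overfitting argument can be checked numerically (our own sketch; the dimensions are illustrative): under the linear process, consecutive states fully determine the next step, while the independent linear process redraws the noise at every step.

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, t = 4, 8, 3
x_T = rng.normal(size=d)

# Linear process: one shared X_0 generates the whole trajectory, so
# consecutive states pin down the line and the next state is deterministic.
x_0 = rng.normal(size=d)
x_t = (1 - t / T) * x_0 + (t / T) * x_T
x_t1_linear = (1 - (t + 1) / T) * x_0 + ((t + 1) / T) * x_T

# Independent linear process: a fresh X_{0,t+1} ~ N(0, I) is drawn for the
# next step, so the conditional kernel X_{t+1} | X_t has wide support.
x_0_fresh = rng.normal(size=d)
x_t1_indep = (1 - (t + 1) / T) * x_0_fresh + ((t + 1) / T) * x_T
```

The deterministic step of the linear process is exactly `(x_T - x_0) / T`, which is why an autoregressive kernel trained on it can trivially overfit.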

The authors show that DTM achieves the best performance compared to full-sequence denoising methods such as FM. However, I find the results contradictory, since DTM also outperforms AR-based methods that adopt causal attention.

We are not sure we understand the reviewer’s comment: why is it contradictory that DTM outperforms AR-based methods? In our experiments, which were performed in a controlled, fixed setting, and as also shown in [5], which the reviewer mentions, we found AR (causal) methods to be generally sub-par to non-AR (bidirectional) methods in generated image quality.

Additionally, how does DTM compare against FM in inference time or training stability?

To compare the inference time of DTM versus FM, we conducted several experiments and added several relevant tables in our response to Reviewer ag7i.

Overall, Table 1 in our response to Reviewer ag7i compares the wall-clock time of FM and DTM; in particular, DTM (16 backbone NFE and 4 head NFE) achieves superior performance to FM (128 backbone NFE) with a 7x speedup.

In more detail, Tables 2(a–b) in our response to Reviewer ag7i present the dependence of DTM’s CLIPScore and PickScore on the number of function evaluations (NFE) in the flow head and backbone. Table 4 reports the corresponding inference times. Tables 3(a–b) present flow matching’s (FM) CLIPScore and PickScore as a function of backbone NFE.

Regarding training stability: the reported DTM and FM models were trained with the exact same hyper-parameters and training steps, and we did not observe any stability issues.

I could not find limitations nor societal impact of the work in the paper.

The limitations and societal impact statement appears in the conclusions section, in lines 286-287 and 289-290. We will elaborate the limitations to include: “The improved performance of ARTM/FHTM comes at the price of a higher sampling cost, i.e., NFE counts are proportional to the number of transition steps, see e.g., Table 1 in the paper.”

Comment

As the discussion period comes to an end, we would like to draw the reviewer's attention to the comprehensive rebuttal and new results we produced to address their comments, and would greatly appreciate a response. Thanks!

Comment

I appreciate the authors resolving most of my concerns.

Here are further clarifications of my previous comments.

  • From my understanding, DTM and FM are the same in the sense that they both use bidirectional attention and condition on the previous sample $x_t$ to sample $x_{t+1}$. Then is the performance boost (FM to DTM) coming from architectural advantages or from the parameterization? Which component leads to the major performance improvements?
  • Would FM perform better if we used the independent linear process for supervision?
  • Timesteps $T$ and $1$ are used interchangeably. This makes the notation confusing as to whether $x_1$ indicates a one-step or fully generated sample (L125).
  • There are some typos in the algorithms (Fig. 20-22): rand_like -> randn_like.
  • There are some discrepancies between the figures and algorithms. The velocity head in Alg. 7 (FHTM) seems to be conditioned on both $t, s$ (subscript), but the notation in Fig. 5 implies it is conditioned only on $s$.
  • From Fig. 5, it was not clear to me how the initial tokens ($x_0$) were sampled from boi. Does it condition on previous Gaussian patches to sample Gaussian samples? Sampling a Gaussian from a velocity head seemed unintuitive to me.

These are some of the points I found hard to understand about the method. I would appreciate it if the authors could provide some clarifications.

Comment

We are pleased to hear that most of the reviewer's concerns have been addressed. Below, we provide further clarifications.

From my understanding, DTM and FM are the same in the sense that they both use bidirectional attention and condition on the previous sample $x_t$ to sample $x_{t+1}$. Then is the performance boost (FM to DTM) coming from architectural advantages or from the parameterization? Which component leads to the major performance improvements?

We would like to emphasize that DTM and FM both use the same linear supervising process $X_t = (1-\frac{t}{T})X_0 + \frac{t}{T}X_T$ and the same parameterization $Y = X_T - X_0$; however, they are distinct in their modeling. Given a current state $X_t = x_t$:

  1. FM: at each step, FM takes as input the current state $X_t = x_t$ and outputs the velocity, which is exactly the expectation of $Y = X_T - X_0$ conditioned on $X_t = (1-\frac{t}{T})X_0 + \frac{t}{T}X_T = x_t$ (denoted before as $\mathbb{E}[Y \mid X_t = x_t]$).

  2. DTM: at each step, DTM takes as input the current state $X_t = x_t$ and outputs a sample of $Y = X_T - X_0$ conditioned on $X_t = (1-\frac{t}{T})X_0 + \frac{t}{T}X_T = x_t$.

To learn a sampler of $Y$ conditioned on $X_t = x_t$ in a scalable manner, DTM must use a flow head. The flow head is an additional small generative model, parameterized by a 40M-parameter MLP and trained (end-to-end with the backbone) with a flow matching loss where the target is the distribution of $Y$ conditioned on $X_t = x_t$, as in Algorithm 3 in the paper.

To summarize, we believe that the addition of the flow head, which amounts to an architectural and loss change compared to FM, is the reason for the improved performance.
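The head's training target can be sketched as an inner flow-matching regression (a schematic with a stand-in zero-output head and made-up dimensions; the real head is a 40M-parameter MLP trained end-to-end with the backbone):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, t = 16, 8, 3

# Outer (DTM) supervising process and target.
x_T = rng.normal(size=d)
x_0 = rng.normal(size=d)
x_t = (1 - t / T) * x_0 + (t / T) * x_T
y = x_T - x_0                          # Y = X_T - X_0, the sample target

# Inner flow-matching problem for the head: move noise z toward Y
# along an inner time s in [0, 1].
s = rng.random()
z = rng.normal(size=d)                 # head-level noise
y_s = (1 - s) * z + s * y              # linear interpolant toward Y
target_velocity = y - z                # the velocity the head should regress

def flow_head(h, y_s, s):
    """Stand-in for the small MLP head conditioned on backbone features h."""
    return np.zeros_like(y_s)          # untrained head: predicts zero velocity

h_t = np.tanh(x_t)                     # stand-in for backbone features of x_t
loss = np.mean((flow_head(h_t, y_s, s) - target_velocity) ** 2)
```

At sampling time, a trained head would integrate its velocity field from `z` to produce a draw of $Y \mid X_t = x_t$, which then determines the next state.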

Would FM perform better if we used independent linear for the supervising process?

The independent linear process cannot be used as supervision for FM. The intuition can be drawn from Figure 6 in the paper: given a current state $X_t = x_t$, since $X_{0,t+1}$ is independent of $X_t$, the line connecting $X_{0,t+1}$ and $X_T$ does not necessarily cross $x_t$. Thus, in the limit of continuous time, the transition kernel of the independent linear process involves a jump probability and cannot be modeled solely by a velocity field (as done in FM).

Timesteps $T$ and $1$ are used interchangeably. This makes the notation confusing as to whether $x_1$ indicates a one-step or fully generated sample (L125).

We apologize for the confusion in line 125; indeed, it should read $s=1$ instead of $t=1$, indicating a sample from $B_1$, which in the case of DTM would be $B_1 = Y$. We will fix this in the camera-ready version.

With some typos in the algorithms (Fig. 20-22): rand_like -> randn_like.

Thank you for noting this; we will fix it in the camera-ready version.

Some discrepancies between the figures and algorithms. The velocity head in Alg. 7 (FHTM) seems to be conditioned on both $t, s$ (subscript), but the notation in Fig. 5 implies it is conditioned only on $s$.

Thanks for raising this typo; indeed, Figure 5 is the correct one (as in equation 18). We will remove the $t$ subscript for the head in Alg. 7 in the camera-ready version. We also note that $t$ is implicitly given through the hidden state $h_t$, and in practice, further experiments show that explicitly inputting $t$ into the flow head is redundant.

From Fig. 5, it was not clear to me how the initial tokens ($x_0$) were sampled from boi. Does it condition on previous Gaussian patches to sample Gaussian samples? Sampling a Gaussian from a velocity head seemed unintuitive to me.

The initial token $X_0$ for DTM and ARTM is sampled from Gaussian noise, i.e., $X_0 \sim \mathcal{N}(0, I)$, while for FHTM, $X_0$ is constant and always taken to be a “boi” token.

We hope this answers your questions and we would be happy to clarify any further concerns.

Comment

Thank you to the authors for addressing the comments. But it is confusing what the authors mean by “While for FHTM $X_0$ is constant and always taken to be a ‘boi’ token.” How can $X_0$ be a constant? I believed this should also be sampled from a random Gaussian. Could the authors clarify how exactly the sampling is done for FHTM?

Comment

Thank you to the authors for addressing the comments. But it is confusing what the authors mean by “While for FHTM $X_0$ is constant and always taken to be a ‘boi’ token.” How can $X_0$ be a constant? I believed this should also be sampled from a random Gaussian. Could the authors clarify how exactly the sampling is done for FHTM?

We are sorry for the confusion; let us explain. When training FHTM (Figure 5), the head and loss are taken only for $t \geq 1$. That is, $h_0^1, \ldots, h_0^n$ do not have gradients. This is also consistent with equation 17.

Sampling from FHTM is done by concatenating [“boi”, $X_0$] and then sampling consecutively $X_1^1, X_1^2, \ldots, X_1^n, X_2^1, \ldots, X_2^n, X_3^1, \ldots$ one after the other with causal attention (as is standard in AR).

In practice, the concatenated $X_0$ serves as dummy tokens since, as described above, no loss is taken on $h_0^1, \ldots, h_0^n$. Therefore, it is sufficient (and more efficient) to take $X_0$ as “boi” (i.e., “constant”); see the pseudocode in Figure 22. We will also clarify this in the paper.
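The sampling loop described above can be sketched as follows (our own NumPy-level sketch; `sample_next_token` stands in for a causal-attention forward pass plus flow-head sampling and just returns noise here):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, T = 4, 8, 3   # tokens per image, token dim, transition steps (illustrative)

def sample_next_token(history):
    """Stand-in for one causal-attention forward pass + flow-head sample."""
    return rng.normal(size=d)

# The "boi" block plays the role of X_0: constant dummy tokens, with no loss
# taken on their hidden states h_0^1, ..., h_0^n.
boi_block = np.zeros((n, d))
sequence = [boi_block[i] for i in range(n)]

# Sample X_1^1, ..., X_1^n, X_2^1, ..., X_T^n one token after another.
for step in range(1, T + 1):
    for i in range(n):
        sequence.append(sample_next_token(np.stack(sequence)))

x_T = np.stack(sequence[-n:])   # the final block is the generated sample
```

Because nothing is learned from the first block, replacing a Gaussian $X_0$ with a constant “boi” block changes nothing about the model while avoiding unnecessary sampling.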

We would be happy to clarify further if the above is still not clear and/or try to address any remaining concerns of the reviewer.

Comment

Thank you for the clarification. That makes much more sense. I believe the general framework presented will have a great impact on the generative modeling community. I'm now convinced that this paper should be presented at the conference. I will revise my score accordingly. Thank you to the authors for providing clarifications.

Comment

We thank the reviewer for their engagement and positive feedback, and are happy to learn that they are willing to update their score accordingly.

Review (Rating: 5)

The paper proposes a novel auto-regressive generative model based on "Transition Matching". The idea is to form a Markov model that gradually takes easily sampled data to the target distribution, based on a fixed process called the "supervising process". Various transition modeling strategies are then trained to generate data at test time. This results in a generic and rich family of generative models, of which diffusion models and flow matching can be considered special cases. Because of the generality of the model, one would expect greater flexibility (e.g., it can be made auto-regressive as a result of the modeling choices) and potentially improved generative capabilities compared to previous models, whose design choices are mostly saturated. Three variants of the proposed model were trained on a massive dataset ("350M licensed Shutterstock" images) and tested on the PartiPrompts and MS-COCO benchmarks, using various metrics such as CLIPScore, PickScore, etc.

Strengths and Weaknesses

The paper overall looks promising and clear. I have a few concerns:

  • Text-to-image generation is a multi-faceted problem, and many previous works focused on improving specific aspects, e.g., compositionality, quality, bias, style transfer, etc. It is not clear which of these aspects is expected to be improved by the proposed method. With regard to compositionality, one would expect benchmarks such as T2I-CompBench and other related ones to be tested against. Furthermore, many of the proposed metrics lack alignment with human perception, making the results less reliable.
  • The fact that DTM (Difference Transition Matching) gave better results than the more complex ARTM and FHTM sounds puzzling and was not discussed in detail in the paper. My question to the authors is why DTM outperforms such models.
  • As the authors mentioned in line 286, "The improved performance of DTM/ARTM/FHTM comes at the price of a higher sampling cost." But we know certain test-time scaling methods, such as noise optimization, improve image generation quality by large margins when given a higher test-time compute budget (e.g., see ReNO). The authors should elaborate on how their method compares to such schemes.

Questions

Please see above.

Limitations

N/A

Final Justification

All my concerns are addressed.

Formatting Issues

N/A

Author Response

With regard to compositionality, one would expect benchmarks such as T2I-CompBench and other related ones to be tested against.

We have followed the reviewer’s suggestion and evaluated our models and baselines on the recommended T2I-CompBench benchmark. Results are provided below in Table 7, and the global ranking of each method (defined as the sum of its rankings across all benchmarks: GenEval, T2I-CompBench, PartiPrompts, and MS-COCO) is shown in Table 8. Note that the new benchmark does not change the fact that DTM is best among non-AR methods and FHTM is best among AR methods.

For Tables 7-8 below: the best score is denoted by bold with ★ and the second best by bold only. NFE* is the number of function evaluations in the backbone. † indicates inference with activation caching.

Table 7. T2I-CompBench

| Kernel | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Non-Spatial | Complex |
|---|---|---|---|---|---|---|---|---|
| MAR-discrete | 0.6666 | 0.4535 | 0.5316 | 0.1474 | 0.2693 | 0.4538 | 0.3090 | 0.3096 |
| MAR | 0.7378★ | 0.5174★ | 0.6588 | 0.1638 | 0.3002 | 0.4962 | 0.3082 | 0.3392★ |
| MAR-Fluid | 0.6997 | 0.4768 | 0.6149 | 0.1454 | 0.2938 | 0.4681 | 0.3037 | 0.3289 |
| FM | 0.6855 | 0.4511 | 0.5615 | 0.1372 | 0.2706 | 0.4526 | 0.3026 | 0.3138 |
| DTM | 0.7316 | 0.4865 | 0.6597★ | 0.1839★ | 0.3113★ | 0.5043★ | 0.3075 | 0.3382 |
| AR-discrete | 0.6068 | 0.4757 | 0.5958 | 0.1095 | 0.2535 | 0.4423 | 0.3098★ | 0.3097 |
| AR | 0.5062 | 0.3669 | 0.5061 | 0.1041 | 0.2441 | 0.4210 | 0.2989 | 0.2983 |
| ARTM-2 | 0.6520 | 0.4430 | 0.5870 | 0.1475 | 0.2748 | 0.4800 | 0.3074 | 0.3267 |
| ARTM-3 | 0.6555 | 0.4738 | 0.5842 | 0.1459 | 0.2832 | 0.4855 | 0.3062 | 0.3227 |
| FHTM-2 | 0.6318 | 0.4318 | 0.5730 | 0.1403 | 0.2818 | 0.4830 | 0.3058 | 0.3229 |
| FHTM-3 | 0.6604 | 0.4640 | 0.5839 | 0.1394 | 0.2755 | 0.4810 | 0.3066 | 0.3223 |
| FHTM-3 LLM | 0.6166 | 0.4618 | 0.5945 | 0.1688 | 0.3081 | 0.5010 | 0.3079 | 0.3310 |

Table 8. Global ranking: for each metric in GenEval, T2I-CompBench, PartiPrompts, and MS-COCO we rank all models from 1 to 12 (where 1 is the best model) and sum the ranks across all metrics. Lower is better.

| Kernel | Rank↓ |
|---|---|
| MAR-discrete | 200 |
| MAR | 127 |
| MAR-Fluid | 220 |
| FM | 179 |
| DTM | 58★ |
| AR-discrete | 245 |
| AR | 321 |
| ARTM-2 | 184 |
| FHTM-2 | 185 |
| ARTM-3 | 130 |
| FHTM-3 | 130 |
| FHTM-3 LLM | 99 |

Text-to-image generation is a multi-faceted problem, and many previous works focused on improving specific aspects, e.g., compositionality, quality, bias, style transfer, etc. It is not clear which of these aspects is expected to be improved by the proposed method; … Furthermore, many of the proposed metrics lack alignment with human perception, making the results less reliable.

We have collected standard and widely used metrics found in similar papers including for example:

GenEval eval in: SD3 [1], Fluid [2], Transfusion [3], EMU3 [4], Janus [5], Flow-GRPO [6].

CLIPScore eval in: SD3 [1], Transfusion [3], EMU3 [4], DALLE-3 [7], DPO [8], SDXL [9], Imagen [10].

PickScore eval in: Flow-GRPO [6], DPO [8].

We have now added T2I-CompBench. If the reviewer feels we have missed other well-accepted, open-source and open-license auto-evals, we would be happy to incorporate them as well.

The fact that DTM (Difference Transition Matching) gave better results compared to more complex ARTM and FHTM sounds puzzling and was not discussed in detail in the paper. My question to the authors is why DTM outperforms such models.

First, we would not characterize DTM as “simpler” or “less performant” than ARTM/FHTM. If anything, its bidirectional attention makes it more expressive, and in our empirical apples-to-apples comparison it was favorable.

As the authors mentioned in line 286 "The improved performance of DTM/ARTM/FHTM comes at the price of a higher sampling cost". But we know certain test-time scaling methods, such as noise optimization, that improve the image generation quality by large margins when given higher test-time compute budget (e.g. see ReNO). The authors should elaborate how their method compares to such schemes.

Regarding the time-performance of DTM: DTM is in fact considerably faster than FM. We conducted several experiments and added several relevant tables in our response to Reviewer ag7i. Mainly, Table 1 there compares the wall-clock time of FM and DTM; in particular, DTM (16 backbone NFE and 4 head NFE) achieves superior performance to FM (128 backbone NFE) with a 7x speedup. In more detail, Tables 2(a–b) present the dependence of DTM’s CLIPScore and PickScore on the number of function evaluations (NFE) in the flow head and backbone, Table 4 reports the corresponding inference times, and Tables 3(a–b) present flow matching’s (FM) CLIPScore and PickScore as a function of backbone NFE.

Regarding test-time scaling methods: we did compare to one rather popular test-time scaling method, Restart [11]; see Tables 5(a–b) and 6 in our response to Reviewer QC47. Note that although Restart (with up to 2400 NFE) shows some improvement over FM, it is still considerably sub-par to DTM. ReNO [12] utilizes an additional reward loss and therefore cannot be considered a strictly apples-to-apples comparison. Incorporating reward-based optimization into TM is, however, a very interesting future research avenue!

Regarding ARTM/FHTM test-time scaling: we are not aware of any useful test-time scaling methods for AR image generation. The ARTM/FHTM variants do not offer any improvement in sampling efficiency; however, compared to the AR baseline they achieve far superior performance, and we do not claim these variants outperform the FM baseline.

[1] Esser, Patrick, et al. "Scaling rectified flow transformers for high-resolution image synthesis." Forty-first international conference on machine learning. 2024.

[2] Fan, Lijie, et al. "Fluid: Scaling autoregressive text-to-image generative models with continuous tokens." arXiv preprint arXiv:2410.13863 (2024).

[3] Zhou, Chunting, et al. "Transfusion: Predict the next token and diffuse images with one multi-modal model." arXiv preprint arXiv:2408.11039 (2024).

[4] Wang, Xinlong, et al. "Emu3: Next-token prediction is all you need." arXiv preprint arXiv:2409.18869 (2024).

[5] Chen, Xiaokang, et al. "Janus-pro: Unified multimodal understanding and generation with data and model scaling." arXiv preprint arXiv:2501.17811 (2025).

[6] Liu, Jie, et al. "Flow-grpo: Training flow matching models via online rl." arXiv preprint arXiv:2505.05470 (2025).

[7] Betker, James, et al. "Improving image generation with better captions." OpenAI, openai.com/papers/dall-e-3.pdf (2023).

[8] Wallace, Bram, et al. "Diffusion model alignment using direct preference optimization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[9] Podell, Dustin, et al. "Sdxl: Improving latent diffusion models for high-resolution image synthesis." arXiv preprint arXiv:2307.01952 (2023).

[10] Saharia, Chitwan, et al. "Photorealistic text-to-image diffusion models with deep language understanding." Advances in neural information processing systems 35 (2022): 36479-36494.

[11] Xu, Yilun, et al. "Restart sampling for improving generative processes." Advances in Neural Information Processing Systems 36 (2023): 76806-76838.

[12] Eyring, Luca, et al. "Reno: Enhancing one-step text-to-image models through reward-based noise optimization." Advances in Neural Information Processing Systems 37 (2024): 125487-125519.

Comment

Thanks for addressing my major concerns. I would be happy to raise my score.

Review (Rating: 5)

This paper presents Transition Matching (TM), a new class of generative models that merges the strengths of diffusion and flow-based approaches. TM broadens the generative design space by introducing non-deterministic probabilistic transitions and flexible, non-consecutive supervision. The authors develop three variants: Difference (DTM), Autoregressive (ARTM), and Full History (FHTM) Transition Matching. DTM enhances image quality and text consistency while speeding up sampling. As partially and fully causal models, ARTM and FHTM achieve generation quality on par with or exceeding non-causal methods and integrate easily with existing text generation technologies. In particular, FHTM is the first fully causal model in the continuous domain to outperform flow-based methods, excelling in text-to-image synthesis. The paper validates the advantages of the TM framework through extensive comparisons and outlines future work on improving efficiency and integrating FHTM into larger multimodal systems.

Strengths and Weaknesses

Strengths:

  1. The paper presents a novel generative model called Transition Matching (TM) that combines the advantages of diffusion and flow models. The authors introduce non-deterministic probability transfer kernels and arbitrary non-continuous supervised processes, significantly expanding the design space. They propose three different TM variants - Difference Transition Matching (DTM), Autoregressive Transition Matching (ARTM), and Full History Transition Matching (FHTM) - each achieving performance optimization in specific aspects.

  2. The proposed TM framework offers a powerful tool for generating high-quality samples while maintaining computational efficiency. It addresses limitations of existing approaches and provides a new direction for future research in the field of generative modeling. The authors highlight the significance of their work in terms of its potential impact on various applications such as image generation, text-to-image synthesis, and multi-modal systems.

Weaknesses:

  1. As authors stated, the improved performance of DTM/ARTM/FHTM comes at the price of a higher sampling cost.

  2. The analysis of the model's performance is simplistic.

Questions

  1. Has this method been validated on large-scale generative models?

Limitations

Yes.

Final Justification

Thank you for your response. The authors have addressed some of my concerns, and I appreciate the thorough exploratory experiments conducted in the paper. I have decided to raise my score to 5.

Formatting Issues

The paper checklist does not fully match the requirements.

Author Response

As authors stated, the improved performance of DTM/ARTM/FHTM comes at the price of a higher sampling cost.

DTM is in fact considerably faster than FM. We conducted several experiments and added several relevant tables in our response to Reviewer ag7i. Mainly, Table 1 there compares the wall-clock time of FM and DTM; in particular, DTM (16 backbone NFE and 4 head NFE) achieves superior performance to FM (128 backbone NFE) with a 7x speedup. In more detail, Tables 2(a–b) present the dependence of DTM’s CLIPScore and PickScore on the number of function evaluations (NFE) in the flow head and backbone, Table 4 reports the corresponding inference times, and Tables 3(a–b) present flow matching’s (FM) CLIPScore and PickScore as a function of backbone NFE.

The analysis of the model's performance is simplistic.

We kindly ask the reviewer to be more specific. We have made an honest effort to collect all popular benchmarks and auto-metric that are open-source and open-license: The GenEval benchmark and additional 6 auto-metrics: CLIPScore, PickScore, ImageReward, UnifiedReward, Aesthetic, and DeQA Score, where the metrics are evaluated on two benchmarks: MS-COCO and PartiPrompts. The GenEval benchmark is widely accepted metric for evaluating text-to-image and has been used by many acclaimed previous works: SD3 [1], Fluid [2], Transfusion [3], EMU3 [4], Janus [5], Flow-GRPO [6], and so does the metrics we report. For example, CLIPScore has been used by SD3 [1], Transfusion [3], EMU3 [4], DALLE-3 [7], DPO [8], SDXL [9], Imagen [10], and PickScore by Flow-GRPO [6], DPO [8].

During the rebuttal we also added the benchmark suggested by Reviewer jNpW, T2I-CompBench; see Table 7 in our response to Reviewer jNpW.

If the reviewer feels we have missed other well-known, open-source and open-license auto-evals, we would be happy to incorporate them as well.

Lastly, we also added two fully discrete baselines, discrete-AR and discrete-MAR, using the tokenizer of Chameleon [11]. The results are summarized in Tables 5(a–b) and 6 below and are consistent with the submitted paper’s claims. To summarize all the evaluations, we added Table 8 in our response to Reviewer jNpW, showing the global ranking of each method (defined as the sum of its rankings across all benchmarks: GenEval, T2I-CompBench, PartiPrompts, and MS-COCO).
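The rank-sum aggregation behind the global ranking can be sketched as follows. The benchmark names and the three example scores are taken from the tables in this response, but only two benchmark/metric pairs are shown for brevity, and `global_ranking` is a hypothetical helper name, not code from the paper.

```python
# Sketch of rank-sum aggregation: a method's global ranking is the sum of its
# per-benchmark ranks (1 = best); lower total = better overall.

scores = {  # benchmark -> {method: score}, higher score is better
    "GenEval":      {"FM": 0.47, "DTM": 0.54, "FHTM-3": 0.52},   # Overall
    "PartiPrompts": {"FM": 26.0, "DTM": 26.8, "FHTM-3": 27.0},   # CLIPScore
}

def global_ranking(scores):
    totals = {}
    for per_bench in scores.values():
        # Rank methods within this benchmark, best score first.
        ordered = sorted(per_bench, key=per_bench.get, reverse=True)
        for rank, method in enumerate(ordered, start=1):
            totals[method] = totals.get(method, 0) + rank
    # Lower rank-sum = better global ranking.
    return sorted(totals.items(), key=lambda kv: kv[1])

print(global_ranking(scores))  # DTM and FHTM-3 tie at 3; FM trails with 6
```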

For Tables 5–6 below: the best score in each column is marked with ★. NFE* is the number of function evaluations in the backbone. † indicates inference with activation caching.

Table 5.a: Main Table. PartiPrompts

| Model | Attention | Kernel | Arch | NFE* | CLIPScore↑ | PickScore↑ | ImageReward↑ | UnifiedReward↑ | Aesthetic↑ | DeQAScore↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | Full | MAR-discrete | DiT | 256 | 26.80 | 20.70 | 0.14 | 4.31 | 5.15 | 2.48 |
| Baseline | Full | MAR | DiT | 256 | 27.00★ | 20.70 | 0.33 | 4.26 | 4.95 | 2.36 |
| Baseline | Full | MAR-Fluid | DiT | 256 | 26.00 | 20.50 | 0.07 | 3.82 | 4.74 | 2.36 |
| Baseline | Full | FM | DiT | 256 | 26.00 | 21.00 | 0.23 | 4.78 | 5.29 | 2.55 |
| Baseline | Full | FM | DiT | 2400 | 26.00 | 21.10 | 0.24 | 4.81 | 5.29 | 2.55 |
| Baseline | Full | FM-Restart | DiT | 2400 | 26.10 | 21.10 | 0.34 | 4.83 | 5.31 | 2.53 |
| TM | Full | DTM | DiT | 32 | 26.80 | 21.20★ | 0.53★ | 5.12★ | 5.42★ | 2.65★ |
| Baseline | Causal | AR-discrete† | DiT | 256 | 26.70 | 20.40 | -0.01 | 3.74 | 4.81 | 2.38 |
| Baseline | Causal | AR† | DiT | 256 | 24.90 | 20.10 | -0.43 | 3.41 | 4.50 | 2.27 |
| TM | Causal | ARTM-2† | DiT | 2×256 | 26.80 | 20.80 | 0.29 | 4.49 | 5.03 | 2.37 |
| TM | Causal | FHTM-2† | DiT | 2×256 | 26.80 | 20.80 | 0.30 | 4.59 | 5.13 | 2.44 |
| TM | Causal | ARTM-3† | DiT | 3×256 | 27.00 | 20.90 | 0.38 | 4.77 | 5.21 | 2.53 |
| TM | Causal | FHTM-3† | DiT | 3×256 | 27.00 | 20.90 | 0.31 | 4.77 | 5.15 | 2.44 |
| TM | Causal | FHTM-3† | LLM | 3×256 | 27.00 | 21.00 | 0.43 | 5.02 | 5.30 | 2.54 |

Table 5.b: Main Table. MS-COCO

| Model | Attention | Kernel | Arch | NFE* | CLIPScore↑ | PickScore↑ | ImageReward↑ | UnifiedReward↑ | Aesthetic↑ | DeQAScore↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | Full | MAR-discrete | DiT | 256 | 26.57 | 20.63 | 0.01 | 4.14 | 5.27 | 2.41 |
| Baseline | Full | MAR | DiT | 256 | 26.12 | 20.66 | 0.17 | 4.62 | 5.06 | 2.34 |
| Baseline | Full | MAR-Fluid | DiT | 256 | 25.46 | 20.45 | -0.11 | 3.94 | 4.86 | 2.38 |
| Baseline | Full | FM | DiT | 256 | 25.78 | 21.11 | 0.09 | 5.00 | 5.45 | 2.47 |
| Baseline | Full | FM | DiT | 2400 | 25.78 | 21.11 | 0.09 | 5.00 | 5.45 | 2.47 |
| Baseline | Full | FM-Restart | DiT | 2400 | 25.78 | 21.11 | 0.15 | 5.11 | 5.48 | 2.44 |
| TM | Full | DTM | DiT | 32 | 26.16 | 21.19★ | 0.22 | 5.38 | 5.55★ | 2.58★ |
| Baseline | Causal | AR-discrete† | DiT | 256 | 26.69★ | 20.31 | -0.06 | 3.83 | 4.93 | 2.34 |
| Baseline | Causal | AR† | DiT | 256 | 24.83 | 20.11 | -0.48 | 3.60 | 4.76 | 2.34 |
| TM | Causal | ARTM-2† | DiT | 2×256 | 25.90 | 20.75 | 0.07 | 4.70 | 5.19 | 2.41 |
| TM | Causal | FHTM-2† | DiT | 2×256 | 25.91 | 20.79 | 0.07 | 4.78 | 5.27 | 2.44 |
| TM | Causal | ARTM-3† | DiT | 3×256 | 26.07 | 20.92 | 0.11 | 4.99 | 5.35 | 2.46 |
| TM | Causal | FHTM-3† | DiT | 3×256 | 26.14 | 20.98 | 0.15 | 5.23 | 5.38 | 2.41 |
| TM | Causal | FHTM-3† | LLM | 3×256 | 26.14 | 21.08 | 0.24★ | 5.51★ | 5.53 | 2.51 |

Table 6: Main Table. GenEval

| Model | Attention | Kernel | Arch | NFE* | Overall↑ | Single-object↑ | Two-objects↑ | Counting↑ | Colors↑ | Position↑ | Color Attribute↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | Full | MAR-discrete | DiT | 256 | 0.44 | 0.86 | 0.43 | 0.37 | 0.66 | 0.13 | 0.29 |
| Baseline | Full | MAR | DiT | 256 | 0.52 | 0.98★ | 0.56 | 0.43 | 0.73 | 0.11 | 0.38 |
| Baseline | Full | MAR-Fluid | DiT | 256 | 0.44 | 0.90 | 0.33 | 0.37 | 0.76 | 0.12 | 0.28 |
| Baseline | Full | FM | DiT | 256 | 0.47 | 0.91 | 0.52 | 0.27 | 0.71 | 0.12 | 0.34 |
| Baseline | Full | FM | DiT | 2400 | 0.47 | 0.91 | 0.51 | 0.25 | 0.72 | 0.14 | 0.36 |
| Baseline | Full | FM-Restart | DiT | 2400 | 0.49 | 0.89 | 0.59★ | 0.29 | 0.73 | 0.13 | 0.38 |
| TM | Full | DTM | DiT | 32 | 0.54★ | 0.93 | 0.58 | 0.35 | 0.79★ | 0.20★ | 0.46★ |
| Baseline | Causal | AR-discrete† | DiT | 256 | 0.41 | 0.96 | 0.40 | 0.33 | 0.60 | 0.07 | 0.19 |
| Baseline | Causal | AR† | DiT | 256 | 0.34 | 0.86 | 0.26 | 0.15 | 0.63 | 0.06 | 0.15 |
| TM | Causal | ARTM-2† | DiT | 2×256 | 0.49 | 0.95 | 0.51 | 0.39 | 0.79★ | 0.11 | 0.27 |
| TM | Causal | FHTM-2† | DiT | 2×256 | 0.48 | 0.96 | 0.48 | 0.25 | 0.78 | 0.09 | 0.37 |
| TM | Causal | ARTM-3† | DiT | 3×256 | 0.51 | 0.95 | 0.54 | 0.41 | 0.79★ | 0.16 | 0.28 |
| TM | Causal | FHTM-3† | DiT | 3×256 | 0.52 | 0.98★ | 0.54 | 0.44★ | 0.74 | 0.16 | 0.34 |
| TM | Causal | FHTM-3† | LLM | 3×256 | 0.49 | 0.94 | 0.55 | 0.37 | 0.69 | 0.17 | 0.29 |

Has this method been validated on large-scale generative models?

We would like to bring to the reviewer’s attention that the results reported in the paper include 10 models of 1.7B parameters, trained from scratch on a large dataset of 350M image-caption pairs for 500k training iterations. We believe this is among the largest, most comprehensive, and fairest large-scale evaluations openly available. For comparison, SD3 [1], which we consider the largest available fair comparison of diffusion and flow models, performed most of its evaluation on models with 2B parameters or fewer and trained only a single 8B-parameter model.

[1] Esser, Patrick, et al. "Scaling rectified flow transformers for high-resolution image synthesis." Forty-first international conference on machine learning. 2024.

[2] Fan, Lijie, et al. "Fluid: Scaling autoregressive text-to-image generative models with continuous tokens." arXiv preprint arXiv:2410.13863 (2024).

[3] Zhou, Chunting, et al. "Transfusion: Predict the next token and diffuse images with one multi-modal model." arXiv preprint arXiv:2408.11039 (2024).

[4] Wang, Xinlong, et al. "Emu3: Next-token prediction is all you need." arXiv preprint arXiv:2409.18869 (2024).

[5] Chen, Xiaokang, et al. "Janus-pro: Unified multimodal understanding and generation with data and model scaling." arXiv preprint arXiv:2501.17811 (2025).

[6] Liu, Jie, et al. "Flow-grpo: Training flow matching models via online rl." arXiv preprint arXiv:2505.05470 (2025).

[7] Betker, James, et al. "Improving image generation with better captions." Computer Science. openai.com/papers/dall-e-3.pdf 2.3 (2023): 8.

[8] Wallace, Bram, et al. "Diffusion model alignment using direct preference optimization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[9] Podell, Dustin, et al. "Sdxl: Improving latent diffusion models for high-resolution image synthesis." arXiv preprint arXiv:2307.01952 (2023).

[10] Saharia, Chitwan, et al. "Photorealistic text-to-image diffusion models with deep language understanding." Advances in neural information processing systems 35 (2022): 36479-36494.

[11] Team, Chameleon. "Chameleon: Mixed-modal early-fusion foundation models." arXiv preprint arXiv:2405.09818 (2024).

评论

As the discussion period is coming to an end, we would like to draw the reviewer's attention to the comprehensive rebuttal and new results we produced to address their comments, and we would greatly appreciate a response. Thanks!

审稿意见
4

This paper introduces Transition Matching (TM), a general framework for generating discrete-time continuous-state sequences. TM unifies both flow-based methods and continuous autoregressive generation. The framework explores three novel variants: Difference Transition Matching, Autoregressive Transition Matching, and Full History Transition Matching. Experiments demonstrate that FHTM outperforms flow-based methods in the text-to-image task for continuous domains.

优缺点分析

Strengths

  1. ARTM and FHTM achieve causal continuous-state AR generation with quality comparable to non-causal approaches.

  2. The proposed methods achieve state-of-the-art image quality and text adherence in text-to-image generation. They significantly improve text adherence compared to flow matching and set a new state of the art on this task.

Weaknesses

Experiments indicate that the Non-Formal Error (NFE) for ARTM/FHTM is high. Further analysis or experiments specifically focusing on the trade-off between quality and NFE for these causal models would be beneficial.

问题

Please refer to the weaknesses.

局限性

I’m unable to find the discussion about the limitation except line 157.

格式问题

No

作者回复

Experiments indicate that the Non-Formal Error (NFE) for ARTM/FHTM is high. Further analysis or experiments specifically focusing on the trade-off between quality and NFE for these causal models would be beneficial.

We assume the reviewer means Number of Function Evaluations (NFE). DTM is in fact considerably faster than FM. We have conducted several experiments and added several relevant Tables below.

Mainly, Table 1 compares the wall-clock time of FM and DTM; in particular, DTM (16 backbone NFE and 4 head NFE) achieves superior performance to FM (128 backbone NFE) with a 7x speedup. In more detail, Tables 2(a–b) show how DTM’s CLIPScore and PickScore depend on the number of function evaluations (NFE) in the flow head and in the backbone, and Table 4 reports the corresponding inference times. Tables 3(a–b) present flow matching’s (FM) CLIPScore and PickScore as a function of backbone NFE.
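The reported 7x speedup follows from a back-of-the-envelope cost model: total time ≈ TM steps × backbone cost + TM steps × head NFE × head cost. The per-call constants below are illustrative values fitted roughly to Table 4, not separately measured numbers.

```python
# Sketch of an NFE-based cost model for DTM vs. FM sampling.
# Assumption: wall-clock time is dominated by backbone calls, with a much
# cheaper flow head; the per-call costs are illustrative fits to Table 4.

def est_time(tm_steps, head_nfe, backbone_cost=0.084, head_cost=0.004):
    """Each TM step runs the backbone once plus `head_nfe` flow-head steps."""
    return tm_steps * backbone_cost + tm_steps * head_nfe * head_cost

fm_time = est_time(tm_steps=128, head_nfe=0)   # FM = 128 backbone Euler steps
dtm_time = est_time(tm_steps=16, head_nfe=4)   # DTM setting from Table 1
print(f"speedup ≈ {fm_time / dtm_time:.1f}x")  # prints: speedup ≈ 6.7x
```

With these constants the model reproduces Table 4's corner entries (10.8 s for FM at 128 steps, 1.6 s for DTM at 16 steps with 4 head NFE), giving the roughly 7x speedup quoted above.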

Table 1: DTM vs. FM sampling time.

| Kernel | Time (sec) | CLIPScore | PickScore |
|---|---|---|---|
| FM | 10.8 | 26.0 | 21.0 |
| DTM | 1.6 | 26.8 | 21.1 |

Table 2.a: DTM CLIPScore, head NFE (rows) vs. TM steps (columns).

| Head NFE / TM steps | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| 1 | 15.8 | 17.0 | 20.4 | 22.8 | 23.2 | 23.2 | 23.0 | 22.8 |
| 2 | 16.1 | 18.6 | 24.2 | 26.2 | 26.4 | 26.4 | 26.2 | 26.2 |
| 4 | 17.9 | 21.1 | 25.4 | 26.7 | 26.8 | 26.7 | 26.5 | 26.5 |
| 8 | 18.8 | 21.2 | 25.5 | 26.6 | 26.8 | 26.6 | 26.5 | 26.5 |
| 16 | 18.9 | 21.3 | 25.5 | 26.7 | 26.8 | 26.6 | 26.5 | 26.4 |
| 32 | 19.0 | 21.2 | 25.5 | 26.7 | 26.7 | 26.7 | 26.6 | 26.5 |
| 64 | 19.0 | 21.3 | 25.4 | 26.7 | 26.8 | 26.6 | 26.4 | 26.5 |
| 128 | 18.9 | 21.3 | 25.4 | 26.7 | 26.9 | 26.7 | 26.5 | 26.4 |

Table 2.b: DTM PickScore, head NFE (rows) vs. TM steps (columns).

| Head NFE / TM steps | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| 1 | 17.6 | 17.8 | 18.6 | 19.4 | 19.6 | 19.7 | 19.6 | 19.6 |
| 2 | 17.7 | 18.3 | 19.7 | 20.6 | 20.9 | 21.0 | 21.0 | 21.0 |
| 4 | 18.1 | 18.8 | 20.0 | 20.8 | 21.1 | 21.1 | 21.1 | 21.1 |
| 8 | 18.3 | 18.8 | 20.0 | 20.8 | 21.1 | 21.1 | 21.1 | 21.2 |
| 16 | 18.3 | 18.8 | 20.0 | 20.9 | 21.1 | 21.1 | 21.1 | 21.1 |
| 32 | 18.3 | 18.8 | 20.0 | 20.9 | 21.1 | 21.1 | 21.1 | 21.1 |
| 64 | 18.3 | 18.8 | 20.0 | 20.9 | 21.1 | 21.1 | 21.1 | 21.1 |
| 128 | 18.3 | 18.8 | 20.0 | 20.8 | 21.1 | 21.1 | 21.1 | 21.1 |

Table 3.a: FM CLIPScore vs. Euler steps (columns).

| Euler steps | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| 0 (FM) | 15.8 | 16.6 | 19.7 | 23.8 | 25.6 | 25.9 | 25.9 | 26.0 |

Table 3.b: FM PickScore vs. Euler steps (columns).

| Euler steps | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| 0 (FM) | 17.9 | 18.0 | 18.7 | 20.0 | 20.8 | 21.0 | 21.0 | 21.0 |

Table 4: DTM inference time (in seconds) for different combinations of head NFE and TM steps on a single H100 GPU. Note that 0 head steps refers to FM. Head NFE (rows) vs. TM steps (columns).

| Head NFE / TM steps | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 0.2 | 0.3 | 0.7 | 1.3 | 2.7 | 5.4 | 10.8 |
| 1 | 0.1 | 0.2 | 0.4 | 0.7 | 1.4 | 2.8 | 5.6 | 11.2 |
| 2 | 0.1 | 0.2 | 0.4 | 0.7 | 1.5 | 2.9 | 5.8 | 11.6 |
| 4 | 0.1 | 0.2 | 0.4 | 0.8 | 1.6 | 3.1 | 6.3 | 12.5 |
| 8 | 0.1 | 0.2 | 0.4 | 0.9 | 1.8 | 3.6 | 7.2 | 14.3 |
| 16 | 0.1 | 0.3 | 0.6 | 1.1 | 2.2 | 4.5 | 9.0 | 17.9 |
| 32 | 0.2 | 0.4 | 0.8 | 1.6 | 3.1 | 6.3 | 12.5 | 25.1 |
| 64 | 0.3 | 0.6 | 1.2 | 2.5 | 4.9 | 9.9 | 19.7 | 39.4 |
| 128 | 0.5 | 1.1 | 2.1 | 4.3 | 8.5 | 17.0 | 34.0 | 68.1 |

I’m unable to find the discussion about the limitation except line 157.

More limitations appear in the conclusions section (lines 286–287). We will expand the limitations to include: “The improved performance of ARTM/FHTM comes at the price of a higher sampling cost, i.e., NFE counts are proportional to the number of transition steps; see e.g. Table 1 in the paper.”

评论

As the discussion period is coming to an end, we would like to draw the reviewer's attention to the comprehensive rebuttal and new results we produced to address their comments, and we would greatly appreciate a response. Thanks!

最终决定

The paper introduces Transition Matching (TM), a novel discrete-time generative modeling framework that unifies diffusion/flow models and autoregressive generation through three variants: DTM, ARTM, and FHTM. Key contributions include DTM achieving 7× speedup over flow matching while maintaining superior performance, and FHTM representing the first fully causal continuous-domain model to match non-causal methods. The comprehensive evaluation on 1.7B parameter models across multiple benchmarks with controlled experimental conditions provides strong empirical validation.

The rebuttal period was highly productive, with authors successfully addressing all major reviewer concerns through additional experiments and clarifications, leading three reviewers (QC47, jNpW, kPcj) to raise their ratings. While presentation clarity could be improved, the core technical contributions are substantial and theoretically sound.

Overall, the work advances state-of-the-art in generative modeling, opens new research directions for multimodal systems, and demonstrates significant practical improvements in both efficiency and generation quality. Hence, the AC recommends acceptance.