On Fitting Flow Models with Large Sinkhorn Couplings
A new, effective way to train flow matching models
Abstract
Reviews and Discussion
This paper re-examines flow-based generative modeling by flow matching and investigates the benefit of using large-scale entropic optimal transport (EOT) couplings, i.e., Sinkhorn couplings, to match source and target samples. The authors argue that the current practice of using small mini-batch sizes (e.g., 256) and fixed entropic regularization parameters imposes a ceiling on performance. The paper suggests:
- Using much bigger batch sizes (up to 2 million).
- Adding a batch-size- and dimension-invariant, renormalized entropy scale, facilitating more efficient tuning of ε.
- Scaling the Sinkhorn algorithm efficiently across multi-GPU platforms using OTT-JAX.
- Demonstrating that low-entropy, large-batch couplings produce better training signals for flow models, reducing curvature, reconstruction error, NLL, BPD, and FID scores across synthetic and real datasets (e.g., CIFAR-10, ImageNet-32).
- Implementing the method as a meta-dataloader: the coupling step is separated from training and generation, making the procedure modular and computationally realistic.
Strengths and Weaknesses
Strengths
- The paper has a clean motivation and an insightful problem formulation. The authors identify a common drawback of previous work: small mini-batches and a constant ε for EOT.
- The paper gives an original definition of a scale-free, renormalized entropy, which provides an interpretable, dimension-invariant way to calibrate Sinkhorn couplings.
- The paper provides massive-scale experiments, i.e., 2M×2M coupling matrices are employed, a scale that has not yet appeared in the literature.
- The proposed methods give substantial practical gains, i.e., experiments demonstrate consistent improvements over independent flow matching (IFM) on synthetic tasks and image generation, particularly on curvature and FID scores (Figures 1–4).
- The paper also provides an efficient implementation. The authors provide technical contributions in rescaling ε through std(C), replacing Euclidean norms with dot products for stability, and sharding Sinkhorn computations.
Weaknesses
- The role of coupling size and entropy in training stability or convergence speed is not adequately explored. The authors state they did not ablate learning rates for each scale (p. 8, l. 280).
- While the paper claims the Sinkhorn step is relatively low-cost, 2M×2M couplings take 8+ GPUs and substantial memory (p. 5, l. 192–207), which will hold back adoption.
- Experiments lack standard deviations/error bars, which would help to estimate variability across runs (Checklist point 7, p. 13).
- The paper has not investigated the usage of other variants of OT which have been shown to perform better in a mini-batch setting [1] [2].
[1] Unbalanced minibatch Optimal Transport; applications to Domain Adaptation, Fatras et al.
[2] Improving Mini-batch Optimal Transport via Partial Transportation, Nguyen et al.
Questions
- The paper [2] proposes a two-stage approach which utilizes RAM memory to scale up the OT problem in terms of memory storage. Can this approach help to scale up the OT problem in the paper even further?
[2] Improving Mini-batch Optimal Transport via Partial Transportation, Nguyen et al.
Limitations
Yes
Final Justification
The authors addressed my questions.
Formatting Concerns
No
**The paper has a clean motivation and an insightful problem formulation** […] The paper provides massive-scale experiments, i.e., 2M×2M coupling matrices are employed, a scale that has not yet appeared in the literature […] The proposed methods give substantial practical gains
Many thanks for your very encouraging comments!
The role of coupling size and entropy in training stability or convergence speed is not adequately explored. The authors state they did not ablate learning rates for each scale (p. 8, l. 280).
We are happy to report that learning rates have been ablated for ImageNet-32 and ImageNet-64. It seems that the original choices reported in the literature for IFM (lr=1e-4) are suitable for OTFM at all n and ε scales.
While sharing plots would be more convenient, we cannot. Hence, here are two tables describing FID along iterations for ImageNet-32. We do observe, for ImageNet-32, a decrease then an increase of the FID metric along iterations, even for IFM and other settings; this was mentioned in Table 1 of the supplementary. We do not observe this in our runs on ImageNet-64.
FID@DOPRI5
| Step | LR = 1e-4 (default) | 3e-4 (bigger) | 7e-5 (smaller) |
|---|---|---|---|
| 30001 | 67.29 | 112.34 | 61.89 |
| 90003 | 7.02 | 390.12 | 7.61 |
| 150005 | 5.43 | 34.32 | 5.89 |
| 210007 | 5.00 | 30.47 | 5.35 |
| 270009 | 4.94 | 11.95 | 5.24 |
| 330011 | 5.16 | 9.77 | 5.47 |
| 438015 | 5.88 | 9.25 | 6.19 |
FID@4
| Step | LR = 1e-4 (default) | 3e-4 (bigger) | 7e-5 (smaller) |
|---|---|---|---|
| 30001 | 66.92 | 144.98 | 50.47 |
| 90003 | 30.52 | 394.15 | 31.68 |
| 150005 | 29.52 | 58.67 | 30.45 |
| 210007 | 29.65 | 51.21 | 30.39 |
| 270009 | 30.01 | 34.60 | 30.53 |
| 330011 | 30.57 | 35.07 | 31.06 |
| 438015 | 31.43 | 35.26 | 31.69 |
While the paper claims the Sinkhorn step is relatively low-cost, 2M×2M couplings take 8+ GPUs and substantial memory (p. 5, l. 192–207), which will hold back adoption.
This is a valid point, although 8-GPU nodes are by now the norm. We observe that most config files shared nowadays to train even small FM models on ImageNet-32 assume access to such a node.
In our response to Reviewer zUcn we have considered several ways to improve the picture:
1. Warmstart (to reduce the overall number of iterations);
2. Computing matchings in PCA Space (to alleviate per-iteration cost and lower memory requirements for couplings);
3. Precomputing Noise/Data Pairs (to split preprocessing from training).
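To make point 2 concrete, here is a toy numpy sketch of matching in PCA space (illustrative only: the greedy nearest-neighbour assignment is a hypothetical stand-in for the Sinkhorn matching, and `match_in_pca_space` is not a function from our codebase; the point is only that the pairwise costs are evaluated on k-dimensional projections while the returned pairs index the original full-dimensional samples):

```python
import numpy as np

def match_in_pca_space(noise, data, k=8):
    """Pair noise/data points using costs computed in PCA space.

    The PCA basis is fit on the data; both clouds are projected with it, so
    the O(n^2) cost evaluation happens in k dimensions instead of d. A greedy
    nearest match stands in here for the Sinkhorn coupling.
    """
    mean = data.mean(axis=0)
    _, _, Vt = np.linalg.svd(data - mean, full_matrices=False)  # principal axes
    zp = (noise - mean) @ Vt[:k].T                              # k-dim projections
    xp = (data - mean) @ Vt[:k].T
    cost = ((zp[:, None, :] - xp[None, :, :]) ** 2).sum(-1)     # k-dim sq. distances
    pairs, free = [], set(range(len(data)))
    for i in range(len(noise)):
        j = min(free, key=lambda j: cost[i, j])                 # cheapest unused data point
        pairs.append((i, j))
        free.remove(j)
    return pairs                                                # indices into full-dim samples
```

The pairs can then feed a standard flow-matching loss on the original, full-dimensional samples.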
Experiments lack missing standard deviation/error bars, which would help to estimate variability across runs (Checklist point 7, p. 13).
As mentioned in our response to W4 of Reviewer 7qcL, we cannot realistically provide so many results along with error bars, as this would make our compute budget explode.
We notice that other researchers have recently complained specifically about this requirement (Ben Recht, Standard error of what now?, blog post) because it is not realistic.
The paper has not investigated the usage of other variants of OT which are shown to be better in a mini-batch setting [1] [2].
Thanks for these references! We have added these to the draft. However, we observe that:
- These approaches explore partial or unbalanced OT, which are (for now) unrelated to our setting.
- The code provided for these approaches cannot scale as they both rely on POT and GeomLoss solvers, which are intrinsically either CPU or single-GPU.
- We also note that [2] claims, regarding Sinkhorn, in its introduction, “that storing an n × n matrix is unavoidable in this [Sinkhorn] approach, thus the memory capacity needs to match the matrix size”. Interestingly, we believe that this is incorrect; this is one of the pillars of our contribution, as explained in L.200. See also our point 2 (Computing matchings in PCA Space) in our response to Reviewer zUcn for more details on online recomputation of cost matrices, a technique that was first leveraged by GeomLoss.
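For concreteness, here is a toy numpy sketch of such online recomputation (illustrative only: plain squared-Euclidean cost, uniform marginals, and an arbitrary block size; our actual implementation relies on OTT-JAX sharding):

```python
import numpy as np

def _lse(A, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = A.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(A - m).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn_online(x, y, eps, iters=100, block=256):
    """Log-domain Sinkhorn that never materializes the full n x n cost matrix.

    Each dual update streams over row (resp. column) blocks and rebuilds the
    corresponding cost slice on the fly, so peak memory is O(block * n) rather
    than O(n^2). Returns the dual potentials (f, g); coupling entries can
    likewise be recomputed on demand from them.
    """
    n, m = len(x), len(y)
    f, g = np.zeros(n), np.zeros(m)
    loga, logb = -np.log(n), -np.log(m)  # uniform marginal weights, in log space
    for _ in range(iters):
        for s in range(0, n, block):     # update f block by block
            Cs = ((x[s:s + block, None, :] - y[None, :, :]) ** 2).sum(-1) / 2
            f[s:s + block] = eps * (loga - _lse((g[None, :] - Cs) / eps, axis=1))
        for s in range(0, m, block):     # update g block by block
            Cs = ((x[:, None, :] - y[None, s:s + block, :]) ** 2).sum(-1) / 2
            g[s:s + block] = eps * (logb - _lse((f[:, None] - Cs) / eps, axis=0))
    return f, g
```

Blockwise updates are mathematically identical to the dense ones; only the memory footprint changes.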
As explained in our answer to point S1, Reviewer 52g8, we do not see, at the moment, an alternative solver to the Sinkhorn algorithm that can simultaneously
- be implemented to utilise efficiently all GPUs of 8-GPU nodes, in order to scale to millions of points;
- return couplings that can be arbitrarily close to the solution of exact OT.
Of course, we hope that other researchers will be encouraged, after seeing our work, to target the 1 million point scale with other solvers / variants, and demonstrate their relevance to flow matching.
We are happy to hear that our response was useful, and thank you again for your time reviewing our work!
I would like to thank the authors for the response. I will keep my positive score.
The paper experimentally tests the potential advantages of initializing flow matching algorithms with Sinkhorn couplings instead of the independent coupling. That is, for a given regularization parameter ε, they sample n source and n target points and solve the corresponding entropic optimal transport problem by means of Sinkhorn's iterations. The optimal coupling is then used as a starting point for flow matching instead of the independent coupling. The motivation for doing this lies in the fact that the usage of the independent coupling may result in irregular vector fields near the initial and final times. Moreover, one of the goals of the article, in contrast with previous results in the literature, is to consider much larger values of n. The tests carried out on datasets such as CIFAR-10 and ImageNet-32 as well as on synthetic data lead the authors to conclude that working with large Sinkhorn couplings has beneficial effects for mid-sized problems, enabling faster training and inference. The authors also introduce a notion of "renormalized entropy" to tune the value of the regularization parameter and recommend keeping the renormalized entropy around 0.2.
Strengths and Weaknesses
- The strength of the paper lies for sure in the thorough design of the numerical implementation and in the extent of the numerical simulations across various datasets. Though I am not an expert in the implementation details of the Sinkhorn algorithm and generative algorithms in general, I find the empirical conclusions of the article quite convincing. The benefit of working with large Sinkhorn couplings seems to be confirmed by several different metrics such as FID and curvature.
- The main weakness is undoubtedly the lack of any form of theoretical justification to support the experimental conclusions, though achieving this may be beyond the scope of this work.
Questions
- It appears to me that the usage of the Sinkhorn coupling as initialization fits well with using stochastic rather than deterministic interpolants. In particular, I would use Brownian bridges instead of straight lines. Have the authors thought in this direction? How do they believe their conclusions would change considering stochastic interpolants?
- I understand the general recommendation is to have a renormalized entropy around 0.2. What are the recommendations for n? In general, I find the plots tell little about the influence of n. What was the choice of n for the plots in Figures 1 and 2?
Limitations
yes.
Final Justification
I have increased the score to 5. This is because the authors addressed in a compelling way all my questions. In particular, I appreciated their clarification about the opportunity of replacing straight lines with stochastic interpolants, as well as the numerical experiments accompanying their explanation. Even though it is only a very first step, I appreciate the authors' effort to comment on the theoretical justification for their empirical findings.
Formatting Concerns
no concern
Many thanks for your detailed comments.
The method is a meta-dataloader: the coupling step is separated from training and generation, thus making the procedure modular and computationally realistic.
Many thanks for highlighting this point! We will do our best to underline this further in our revision.
The main weakness is undoubtedly the lack of any form of theoretical justification to support the experimental conclusions, though achieving this may be beyond the scope of this work.
While our main focus remains on scaling up Sinkhorn and comprehensive experiments to study its effect, we are also able, following your comment and that of Reviewer 7qcL, to provide some theoretical intuition into the necessity of using large OT batch sizes. This result is provided as Proposition 1 in response to W3 of Reviewer 7qcL.
We highlight that this argument will only be shown in the appendix, as it is only meant to provide insights into the necessity of large batch size and is not a core contribution of our work.
It appears to me that the usage of Sinkhorn coupling as initialization fits well with using stochastic rather than deterministic interpolants. In particular, I would use brownian bridges instead of straight lines. Have the authors thought in this direction? How do they believe their conclusion would change considering stochastic interpolants?
We agree that this is a promising direction. At the moment we do not exploit the stochasticity of the coupling, but rather simply sample from it (Step 4) to form pairs, assuming straight paths, in accordance with the flow matching framework. Incorporating this stochasticity would be more akin to switching from ODE flow models to SDE Schrödinger-bridge-type models. We might explore this ourselves, or others might use our implementation to do so first.
In that vein, we have also tested the ability of OTFM to help training for modality translation (i.e., flowing from one modality to another, without having access to pairings, rather than starting from noise). This setup was typically considered by Schrödinger bridge models.
We tried this with the cat→wild image translation for AFHQ-64. We follow the evaluation protocol mentioned in [De Bortoli et al. 24]. We use their network architecture, increasing the number of channels in the U-Net to 192, and train the network for 400k iterations with an effective batch size of 512.
As noted by De Bortoli et al., visual inspection is the best way to assess the quality of results. While the pictures we obtain look good, we are not allowed to share these images this year, and might instead add them in the appendix.
As a result we can only report metrics (LPIPS, MSD and FID). As highlighted by De Bortoli et al., FID is not really meaningful here, as it is computed on a very small set of points (~4.5K training images of the wild domain and ~450 validation cat→wild transferred images). The metrics show, however, a very clear failure of IFM vs. OTFM.
These preliminary results should be compared to the values reported in Figure 19 of [De Bortoli et al. 24].
| n | 𝜀 | LPIPS | MSD | FID |
|---|---|---|---|---|
| 512 | 0.003 | 0.381 | 0.049 | 23.325 |
| 512 | 0.01 | 0.385 | 0.048 | 24.687 |
| 512 | 0.03 | 0.383 | 0.049 | 23.496 |
| 512 | 0.1 | 0.391 | 0.052 | 23.983 |
| 1024 | 0.003 | 0.393 | 0.044 | 25.278 |
| 1024 | 0.01 | 0.397 | 0.045 | 26.114 |
| 1024 | 0.03 | 0.389 | 0.047 | 25.817 |
| 1024 | 0.1 | 0.384 | 0.05 | 25.76 |
| 8192 | 0.003 | 0.407 | 0.069 | 45.939 |
| 8192 | 0.01 | 0.411 | 0.068 | 44.239 |
| 8192 | 0.03 | 0.405 | 0.058 | 33.942 |
| 8192 | 0.1 | 0.381 | 0.05 | 25.356 |
| IFM | NaN | 0.372 | 0.09 | 111.216 |
[V. De Bortoli, I. Korshunova, A. Mnih, A. Doucet, Schrödinger Bridge Flow for Unpaired Data Translation, NeurIPS 2024]
I understand the general recommendation is to have a renormalized entropy around 0.2. What are the recommendations for n?
Our recommendation at the moment is to choose an n as large as possible, within what is available in the preprocessing compute budget, since our generation experiments so far show that a larger n always improves all metrics, and notably enables faster generation.
Additionally, our implementation is now fully efficient, as we have transitioned to a meta-dataloader (see our point 3, Precomputing Noise/Data Pairs, shared with Reviewer 52g8) in order to split NN training entirely from the pairing effort, making this computation a one-off effort.
In general, I find the plots tell little about the influence of n. What was the choice of n for the plots in Figures 1 and 2?
We apologise for not being clearer in the legends of Figs. 1 & 2 (please also look at Figs. 10, 12, 16, 18 in the supplementary). In all our figures, n is color-coded (yellow = large). We will add the mentions n=2048, n=16384, ... in all legends. As can be seen, n impacts performance in all settings.
I thank the authors for addressing all my comments in detail. The answers are quite satisfying and I will consequently increase my score.
We would like to thank you for taking the time to read our rebuttal. We are very happy to hear that our responses answered some of your concerns, and we are very grateful for your score increase. Above all, we thank you for helping us improve our draft!
This paper studies in which regimes of mini-batch size n, regularization ε, and data dimension batch-OT flow matching works well. The paper interpolates between batch-OT and independent FM via the regularization ε in entropic OT (EOT). The paper also modifies the Sinkhorn algorithm by dropping the squared norms and focusing on the dot product between points. Finally, the paper experiments on various tasks with substantially different regimes of n and ε to show the thoroughness of the proposed study.
Strengths and Weaknesses
Strengths:
- The paper is clearly organized and sets its central focus on the modification and influencing factors of the Sinkhorn algorithm.
- The paper experiments on various synthetic and real benchmarks with detailed implementation to show the thoroughness of the study.
Weaknesses:
- The contribution of the paper is limited. The main innovative point is dropping norms and focusing on the dot-product between points in the Sinkhorn algorithm for better stability, which is a narrow view to some extent. The overall contribution is mostly empirical. Some experimental details like the types of GPU could be moved to the appendix to give space to further discussions.
- The paper states that larger batch size is needed to cover the bias that cannot be traded off with more iterations on the flow matching loss. On the other hand, Figure 1 and Figure 2 show that larger batch size suffers from longer training time.
- It would be beneficial to include some theoretical complexity analysis like [1] that includes the sample complexity n to further explain the “necessity of large batch size”.
- Multiple independent runs of the experiments could be done to obtain the confidence interval of the results.
[1] Pham, K., Le, K., Ho, N., Pham, T., & Bui, H. (2020, November). On unbalanced optimal transport: An analysis of sinkhorn algorithm. In International Conference on Machine Learning (pp. 7673-7682). PMLR.
Questions
- To balance between training time and mini-batch size n, is there any criterion on how to choose the proper n when the total training time is limited?
- Is it possible to give a theoretical complexity analysis like [1] to show the effectiveness of the modification of the Sinkhorn algorithm?
- In the final row of Figure 1, why are there 6 different curves when n only has 5 choices? The colors and the marks of the curves do not match the legend (there are two blue lines, and the colors are mismatched).
[1] Pham, K., Le, K., Ho, N., Pham, T., & Bui, H. (2020, November). On unbalanced optimal transport: An analysis of sinkhorn algorithm. In International Conference on Machine Learning (pp. 7673-7682). PMLR.
Limitations
The paper addressed the limitations. It would be beneficial to add a theoretical complexity analysis that includes the sample complexity n. Criteria for selecting a proper n given a limited time budget remain to be established.
Final Justification
I have increased the score to 4. I am delighted to see the additional theoretical analysis (the lower bound proposition) on further explaining the “necessity of large batch size” , which could provide potential insights on the mathematical understanding of the proposed implementation tricks, even though a comprehensive theoretical justification is yet to be established in future works. The additional experimental detail discussions in the rebuttal of the Reviewer zUcn and the Reviewer 52g8 empirically showed the effectiveness of the implementation tricks. My other questions are also adequately addressed by the authors.
Formatting Concerns
No.
Many thanks for taking the time to review our paper, and providing several comments.
The paper is clearly-organized
Thanks!
W1. The contribution of the paper is limited. The main innovative point is dropping norms and focusing on the dot-product between points in the Sinkhorn algorithm for better stability, which is a narrow view to some extent. The overall contribution is mostly empirical.
We agree that our contributions are driven by empirical concerns, more precisely by the goal of scaling up to unprecedented regimes, while tracking convergence rigorously (small tolerance, L.213) and staying arbitrarily close to sharp couplings (small ε).
In that sense, our paper is informed by the “bitter lesson” (Rich Sutton) applied to OT computations, and is trimmed to the minimum to work in this case.
While our findings may seem simple in retrospect, from Lemma 2, to the adoption of the dot-product cost in L.190 and the rescaling by std(C), we are confident that these modifications will be impactful and become the norm when running Sinkhorn.
In addition to this, we have incorporated additional tricks (see 1. Warmstart and 2. Computing matchings in PCA Space in our response to Reviewer zUcn) and demonstrated that they speed up computations significantly in our setting.
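To make the std rescaling concrete, here is a toy numpy sketch (the scale-free value `eps0` and the small log-domain loop are illustrative choices, not our OTT-JAX implementation):

```python
import numpy as np

def _lse(A, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = A.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(A - m).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn_with_scaled_eps(x, y, eps0=0.5, iters=2000):
    """Run Sinkhorn with eps = eps0 * std(C).

    Because eps is set relative to the spread of the cost entries, the same
    scale-free eps0 can be reused across batch sizes, dimensions and datasets.
    """
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1) / 2  # squared-Euclidean cost
    eps = eps0 * C.std()                                    # renormalized regularization
    n, m = C.shape
    f, g = np.zeros(n), np.zeros(m)                         # dual potentials
    loga, logb = -np.log(n), -np.log(m)                     # uniform marginals
    for _ in range(iters):
        f = eps * (loga - _lse((g[None, :] - C) / eps, axis=1))
        g = eps * (logb - _lse((f[:, None] - C) / eps, axis=0))
    P = np.exp((f[:, None] + g[None, :] - C) / eps)         # coupling matrix
    return P, eps
```

The same `eps0` then transfers between, say, a 2-D toy problem and image-space batches, without retuning the absolute ε.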
Some experimental details like the types of GPU could be moved to the appendix
Hardware specifications only take a few characters in our draft. We prefer to be sure the reader is aware of these aspects, as they matter practically speaking, and can help get a clearer picture. Are there other details you would like to see moved?
W2. The paper states that larger batch size is needed to cover the bias that cannot be traded off with more iterations on the flow matching loss. On the other hand, Figure 1 and Figure 2 show that larger batch size suffers from longer training time.
Indeed, using a larger n and decreasing 𝜀 incurs a larger compute effort. We ask, however, that the following aspects be considered:
- Coupling noise and images more carefully is not, strictly speaking, a training effort; it is a data-preprocessing effort, independent of model training (L.142), as highlighted in L.141 and expanded in item 3 (Precomputing Noise/Data Pairs) shared with Reviewer 52g8.
- The cost to compute such couplings is parameterised by n and is independent of the number of model parameters, which can run anywhere from hundreds of millions to billions and will always dominate the overall training cost.
- We believe, following discussions on Warmstarting and PCA, that one may continue improving on the efficiency of computing couplings.
- A larger effort spent on better couplings is always associated in our image experiments with a higher quality when inference time compute is constrained. This is usually desirable.
W3. It would be beneficial to include some theoretical complexity analysis like [1] that includes the sample complexity n to further explain the “necessity of large batch size”.
Thanks for this great suggestion. We will gladly add [1] to the 5 references we have provided in L.145-150. We will expand this section and include any other suggestions you may have.
Following your comment that this would be beneficial, we can add to our appendix a simple mathematical argument, similar to [Chewi et al. 2024, Thm. 2.14], that justifies the use of large batch sizes. We emphasize that this argument is only meant to provide insights into the necessity of a large batch size and is not a core contribution of our work. Using this insight to establish a tight characterization of the curvature of flow models as a function of the OT batch size would require significantly more work, which can be an interesting research direction.
Assumption 1: The data distribution $\mu$ is such that if $X$ and $X'$ are independently drawn from $\mu$, then $\mathbb{P}[\Vert X - X' \Vert \leq t] \leq Ct^r$ for all $t \geq 0$ and some constants $C > 0$, $r \geq 1$.
This assumption holds, for example, if $\mu$ is supported on a manifold of dimension $r$ with bounded density w.r.t. the volume measure on the manifold. Specifically, $r$ can be seen as the effective dimension of the support of $\mu$, and for structured data distributions we expect $r \ll d$.
Proposition 1: Given two probability measures $\mu, \nu$ where $\mu$ satisfies Assumption 1, let $\mu_n, \nu_n$ denote empirical measures of $n$ i.i.d. samples from $\mu$ and $\nu$ respectively. Let $\pi$ denote any coupling supported on the samples of $\mu_n$ and $\nu_n$, including (entropic) optimal transport. Then $\mathbb{E}\left[\operatorname{Var}_{\pi}(X \mid Z)\right] \geq c\, n^{-2/r}$, where the expectation is taken over the draw of the samples, for some constant $c > 0$ depending on $C$ and $r$.
In Proposition 1, we look at the variance (sum of coordinate variances) of the data conditioned on the noise it is paired to. Note that for IFM, this is simply the variance of the data distribution, as the noise/data coupling is independent. Ideally, with OT couplings we would always couple noise to a unique data sample, so this variance would be zero and the flow trajectory would be straight. The above lower bound shows the necessity of a large batch size due to the slow rate $n^{-2/r}$, which motivates us to scale Sinkhorn to as large an $n$ as possible. While we do not include it for brevity, one can obtain matching upper bounds through standard law-of-large-numbers arguments [Chewi et al. 2024, §2.5.1].
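The slowness of this rate can be illustrated with a quick simulation (a sketch of the intuition, not of the proof; `mean_nn_sq_dist` is a hypothetical helper): even a perfect matching cannot pair a fresh noise point with data closer than its nearest sample, and for data of effective dimension r the average squared nearest-sample distance decays like n**(-2/r).

```python
import numpy as np

def mean_nn_sq_dist(n, r, n_query=200, seed=0):
    """Average squared distance from fresh query points to the nearest of n samples.

    For uniform data on [0, 1]^r this quantity decays like n**(-2/r), the slow
    rate that drives the variance lower bound discussed above.
    """
    rng = np.random.default_rng(seed)
    data = rng.uniform(size=(n, r))
    query = rng.uniform(size=(n_query, r))
    d2 = ((query[:, None, :] - data[None, :, :]) ** 2).sum(-1)  # all squared distances
    return d2.min(axis=1).mean()                                # nearest-sample average

# With r = 2, multiplying n by 16 should shrink the value by roughly 16 (= 16^(2/2)).
v_small = mean_nn_sq_dist(100, r=2)
v_large = mean_nn_sq_dist(1600, r=2)
```

For larger effective dimension r the decay is much slower, which is precisely why structured data (small r) is the favourable regime.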
W4. Multiple independent runs of the experiments could be done to obtain the confidence interval of the results.
Thanks for this suggestion. Unfortunately, at this time, most papers in flow matching or large-scale generative modeling do not include error bars for image generation, e.g., Table 5 in [Tong et al.] and Tables 1 & 2 and Figure 3 in [Pooladian et al.]. This is because each run takes anywhere between 4 days and 2 weeks.
To make things worse, in our case we report dozens of variants. Producing error bars would multiply by 5 our already very substantial compute budget.
Additionally, the randomness in coupling computations is independent from that resulting from neural-network parameter initialization, increasing the challenge further (to get error bars for FM metrics, one would need to rerun multiple seeds for couplings × multiple seeds for NN training).
This being said, we observe qualitatively fairly stable results, as long as we reuse similar config files. For instance, we ran the learning-rate ablations (in answer to Reviewer 3ETv) using different NN seeds and obtained similar results. We sometimes observed variability in FID computations across experiments, due to the batch of 50k images that is sampled. We always ensured, however, that the multiple runs done within a single experiment use the same batch, but we were not always able to guarantee this across experiments.
Q1. To balance between training time and mini-batch size n, is there any criterion on how to choose the proper n when the total training time is limited?
Compute involves three independent terms:
- pre-processing, to select pairs of noise / data → this is where OTFM happens.
- training time: pick FM model, optimizer, learning rate, epochs, etc., fed with pairs from above.
- inference time: ODE integrator, # of steps.
Any rule to select n must depend on how these compute budgets are prioritised, as with LLM training.
Nowadays, pre-processing steps are allocated a high budget, the training budget is even higher, and the inference budget is assumed to be limited. A message of our paper is that n should be as large as possible (although probably not larger than the dataset size, a lesson learned with the relatively toy-ish CIFAR-10 experiment).
In practice, it is easy to estimate the time required to run a single Sinkhorn computation: Sinkhorn time is very stable across batches for a given choice of (n, 𝜀) (see e.g. Fig. 8, top, in the supplementary). The practitioner can then fairly easily calibrate their training setup with a dry run, adjusting n according to the preprocessing compute they wish to spend, and store the results as per item 3 (Precomputing Noise/Data Pairs) shared with Reviewer 52g8.
Q2. Is it possible to give a theoretical complexity analysis like [1] to show the effectiveness of the modification of Sinkhorn algorithm?
The modifications we provide are not theoretically motivated. For instance,
- the std rule is only designed to help set 𝜀 adaptively, robustly, and consistently across many setups;
- removing norms does not change the convergence analysis of Sinkhorn, as described in Remark 4.12 of [Peyré / Cuturi]. It can be interpreted as a way to bypass norm information completely in the dual scalings, whereas the convergence theory of Sinkhorn focuses on lower-bounding the improvement from one iteration to the next.
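The norm-removal point can be checked numerically: the squared norms only shift the dual potentials, so Sinkhorn run on the full squared-Euclidean cost and on the negated dot-product cost returns the same coupling. A toy numpy sketch with uniform marginals (illustrative, not our implementation):

```python
import numpy as np

def _lse(A, axis):
    """Numerically stable log-sum-exp along an axis."""
    m = A.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(A - m).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn(C, eps=1.0, iters=1000):
    """Plain log-domain Sinkhorn with uniform marginals; returns the coupling."""
    n, m = C.shape
    f, g = np.zeros(n), np.zeros(m)
    loga, logb = -np.log(n), -np.log(m)
    for _ in range(iters):
        f = eps * (loga - _lse((g[None, :] - C) / eps, axis=1))
        g = eps * (logb - _lse((f[:, None] - C) / eps, axis=0))
    return np.exp((f[:, None] + g[None, :] - C) / eps)

rng = np.random.default_rng(0)
x, y = rng.normal(size=(32, 4)), rng.normal(size=(32, 4))
C_full = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1) / 2  # ||x||^2/2 + ||y||^2/2 - <x, y>
C_dot = -x @ y.T                                             # norms dropped
P_full, P_dot = sinkhorn(C_full), sinkhorn(C_dot)
# The two couplings agree: the dropped norms are absorbed by the dual scalings.
```

The shift by row/column terms changes the transport cost only by a constant under fixed marginals, hence the optimizer is unchanged.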
Q3. In the final row of Figure 1, why are there 6 different curves when n only has 5 choices? The colors and the marks of the curves do not match the legend (there are two blue lines, and the colors are mismatched).
We apologise for this mistake and will redraw this plot. The lowest curve can be dropped; it corresponds to a setting that we decided to remove to improve legibility. See also our response to W1 of Reviewer zUcn.
It would be beneficial to add a theoretical complexity analysis that includes the sample complexity n. Criteria for selecting a proper n given a limited time budget remain to be established.
Thanks, as mentioned above, we will discuss [1] and add the discussion above.
I thank the authors for the detailed rebuttal. I am delighted to see a mathematical argument (the lower bound proposition) further explaining the “necessity of large batch size”. I also read the additional experimental details in the rebuttals to Reviewer zUcn and Reviewer 52g8, which showed the effectiveness of the implementation tricks such as computing matchings in PCA space and using larger OT batch sizes. Based on the responses and the discussion I will increase my score.
I hope the responses and revisions made in the rebuttal make their way into the paper.
We would like to thank you for taking the time to read our response. Of course, we commit to including all of the elements discussed in this rebuttal process in the final draft. We are grateful for your time and consideration!
This paper provides an ablation study of various techniques and parameter settings for coupling the samples used for training flow-matching models. In flow-matching, samples from a reference distribution are linearly interpolated with samples from a target distribution, and the resulting interpolations are used to learn a velocity field for dynamic transport between the target and the reference. In most implementations of flow-matching, the samples from the target and reference are coupled independently to create the endpoints of the interpolations, but recent work has suggested that it may be beneficial to couple the points using optimal transport (OT), or approximations thereof, in order to stabilize training and decrease the computational burden of inference. In practice, points from the target and reference can be coupled using "minibatch optimal transport," wherein subsets of points are coupled to each other using, e.g., the Sinkhorn algorithm. This paper experiments with using couplings of size n much larger than the oft-used n = 256, obtained using the Sinkhorn algorithm with various regularization levels ε. The paper also suggests some practical guidelines for choosing ε, computing coupling entropies, and preparing data for the computation of large couplings. The numerical results generally suggest that using large couplings with low regularization is beneficial in comparison to using the independent coupling.
Strengths and Weaknesses
Strengths
- The paper computes the data couplings over multiple GPUs and provides details on how the data are distributed, sharded, and regathered. I am not well-acquainted with the SOTA in GPU computing for Sinkhorn/OT, but, according to the authors, GPU computations of this scale for obtaining Sinkhorn couplings have never been reported in the literature.
- For the image generation benchmarks, the authors show flow-matching results not only for various coupling settings but also for different ODE solvers used in generation. Choice of ODE solver is an important practical consideration when implementing flow-matching, and having some insight on how that choice interacts with other hyperparameters is useful. In particular, using Dopri5 instead of forward Euler with low numbers of steps seemed to remove the impact of the couplings to a large extent. This is a salient finding: if one does not want to tune coupling hyperparameters in flow-matching, one can just use a better ODE solver for generation.
Weaknesses
- The paper is poorly written and the wording used is frequently ambiguous, unclear, or perhaps even mathematically incorrect. As a result, I found the paper difficult to parse in spite of the facts that the contributions are not too complicated and I am well-versed in flow-matching and optimal transport.
-
The methodological contributions of the paper are marginal at best. For example,
-
One stated contribution of the paper (on page 2) is "Leveraging the fact that all [OT-based coupling] approaches can be interpolated using EOT: Hungarian [exact OT] corresponds to the case where 𝜀 → 0, while IFM [using the independent coupling] is recovered with 𝜀 → ∞" -- I think this continuum is fairly widely understood.
-
The automatic rescaling of 𝜀 amounts to multiplying a scale-free 𝜀 by the standard deviation of the entries of the cost matrix and is not very different from the scaling used in [Cuturi 2013], wherein 𝜀 was multiplied by the median of the distances. Moreover, the rationale for using the standard deviation is unclear from the paper.
-
Similarly, the scale-free renormalized coupling entropy amounts to the usual entropy divided by a normalization depending on the number of points n. It is definitely important to use scale-free quantities when comparing results across numbers of points n, but using scale-free quantities is standard practice in most areas of applied mathematics where dimension/number of points is varied.
-
Many of the plots containing the numerical results are presented in a (I hope unintentionally) deceiving manner. As alluded to in the summary, the y-axes on the problematic plots do not start at zero and actually span a very narrow range of values. If one does not look at the y-axis limits, it appears that there are strong trends in the performance metrics as functions of n and 𝜀, but these trends would not be significant were the y-axes to start at zero. Figures which contain plots with massively truncated y-axes include Figure 1 (and Figure 11 in the supplemental, especially for the loss and BPD), Figure 3 (almost all plots), and Figures 16, 18, and 20 in the supplemental (almost all plots).
-
The numerical results for most of the benchmarks presented do not paint a clear picture of whether using large n and small 𝜀 is helpful, especially in high-dimensional sampling tasks. The exception is perhaps the Korotin et al. benchmark.
Questions
-
As mentioned under "Weaknesses," the y-axes of the plots used to display performance metrics as a function of n and 𝜀 all need to start at 0 so as not to suggest that significant trends are present when they are really not. Please fix these axes in your next revision of the manuscript.
Even though it is perhaps disappointing that the benefit of increasing n and decreasing 𝜀 was unclear for many of the experiments, if you were to run more experiments and report trends (or lack thereof) in a clear way, I wonder if a pattern would emerge hinting at a distinction between types of problems that benefit strongly from couplings with large n and low 𝜀, and those that don't. As far as I can tell, the only example where such benefit was clearly and strongly evident was the Korotin et al. benchmark. What makes the Korotin et al. benchmark different from the other examples you tried? If you were to run more experiments along these lines, could you hypothesize what differentiates the Korotin et al. benchmark and other examples where large low-entropy couplings are beneficial from examples where these couplings are not? From a practical standpoint, even being able to say "this type of problem doesn't benefit from large couplings, so don't bother deviating from IFM" or "this problem does benefit from large couplings, so put in the effort to get one before you train" could go a long way to inform effective flow-matching practice.
-
What is the rationale for scaling the entropy level by the standard deviation of the cost matrix C, rather than the median or mean, as in [Cuturi 2013]? The stated reasoning -- that "the dispersion of the costs around its mean has more relevance than the mean itself" -- doesn't clarify for me why using the standard deviation is a good practical choice. Do you have numerical results comparing this choice to, e.g., scaling by the mean, that demonstrate clear benefit? Is there a deeper mathematical reason for this choice?
-
There are many instances in the paper where the mathematical language could be more precise. Removing mathematical ambiguities would go a long way toward making the paper more clear and readable. For instance:
-
In equation (1) on page 2, the Wasserstein distance denoted W is specifically the Wasserstein-2 distance (or a scalar multiple thereof). Consider changing the notation to W_2.
-
On lines 115-117, it is stated that "as 𝜀 → 0, the solution converges to the optimal transport matrix solving (1)" -- Equation (1) is the continuum optimal transport problem, and therefore its solution is a coupling, not a matrix. Its solution could, however, be viewed as a matrix if the measures μ and ν are taken to be discrete/empirical measures. Please clarify/correct this statement.
-
On line 126 the random variable t in the flow-matching loss is described simply as "a random variable in [0, 1]". Shouldn't t specifically be uniform over [0, 1]?
-
On line 169, it is stated that the renormalized entropy provides a measure of how close the coupling P is to an optimal assignment matrix. Wouldn't it be more accurate to say that it provides a measure of how close P is to any deterministic assignment matrix? The coupling does not have to be optimal to have zero entropy. That is, the renormalized entropy does not actually tell us whether we are close to the optimal assignment, just whether we are close to any 1-1 assignment.
Limitations
Yes
Final Justification
While the initial manuscript had some serious clarity issues and included methodological contributions that were marginal at best, based on discussion with the authors I feel that the clarity of the final manuscript will improve. Moreover, the authors have devised more computational strategies which, when combined with the existing rules-of-thumb in the first submission, provide a more comprehensive toolbox for computing large couplings using the Sinkhorn algorithm. These factors, combined with the authors' results which indicate that using large couplings to obtain flow-matching training pairs really does seem to improve performance, lead me to believe that the contributions of this paper have the potential to inform future flow-matching practice.
Formatting Issues
The citation styles used are inconsistent and out-of-step with NeurIPS guidelines. Specifically, multiple citations are included as hyperlinks of an author's name with no year (e.g., "Monge", "Benamou and Brenier", "Sinkhorn"). Please ensure that your citation style is consistent and agrees with NeurIPS guidelines (i.e., is either author-year or numeric). Many of these hyperlinks could probably just be removed (e.g., there's no need to cite Sinkhorn every time you refer to the "Sinkhorn algorithm").
Many thanks for your detailed review and your insights.
We also found a few unfortunate typos (residual cardinalities in Alg. 1, a wrong transpose in L.185, a missing "+", etc.), and apologize for this. We do, however, stand by the maths in our paper.
various OT solvers[…] including the Hungarian algorithm
Let us clarify that we never use the Hungarian algorithm, we only use Sinkhorn.
S1. according to the authors, GPU computations of this scale[…]have never been reported in the literature.
Thanks for raising this point. Indeed, our computations are unprecedented for two reasons:
Scale. The hardness of an OT problem hinges on many parameters, typically: the dimension d; the number of points n; the constraint tolerance 𝝉 (e.g. in Alg. 1); the intended closeness to the optimal assignment/cost (e.g. 𝜀).
To our knowledge, no paper has pushed computations of Sinkhorn to this extent, e.g. our largest couplings for ImageNet-32, while precisely tracking constraint violations (in 1-norm) and sharpness (small 𝜀). The largest setup we know of is the computation of a few couplings in [Kassraie & al. 2024, Fig. 6].
While low-rank approximate OT solvers have been used recently to compute fairly large couplings (see [Halmos, 2025], [Klein & al. 2025]), these reformulations are not convex, and offer no guarantees to solve the OT problem. Using them in large scale OT-FM could be an interesting follow-up work.
Repeats. We compute large couplings many times: e.g. for ImageNet-32, with large batches of points, we do so ~850 times, to cover 430k training steps.
P. Kassraie & al., Progressive Entropic Optimal Transport Solvers, Neurips 2024
Halmos & al., Hierarchical Refinement: Optimal Transport to Infinity and Beyond, ICML 2025
This is a salient finding — if one does not want to tune coupling hyperparameters in flow-matching, one can just use a better ODE solver for generation.
This finding is of little use in the context of FM/diffusion.
- The number of ODE steps is an inference-time parameter: in practice, for an end-user, better quality = more steps = longer wait.
- Computing larger/better couplings is a preprocessing effort.
Because inference and preprocessing are done by different actors on different hardware (e.g. consumer devices/lower grade GPU vs. GPU servers with interconnect), one cannot easily trade off one for the other. When models are shared with the public, inference cost will always dominate training cost. This is why we need fast flowing models.
One stated contribution […] this continuum is fairly widely understood.
We agree, this continuum is well understood in OT papers, but we stress
- in L.69 that it has not been leveraged enough in the OT-FM literature.
- Pooladian / Tong have not ablated 𝜀 in their studies (L.70) and emphasized instead a distinction between Hungarian / EOT / IFM.
- In L.73, we claim that these methods can be unified by varying 𝜀 in Sinkhorn, as long as convergence 𝝉/max_iters is rigorously tracked (L.213)
- In L.79 we facilitate this comparison across tasks and scales using the renormalized entropy.
many of the apparent trends reported on plots […] are actually marginal-to-nonexistent when one examines the y-axis limits.
numerical results are presented in a (I hope unintentionally) deceiving manner […] the y-axes on the problematic plots do not start at zero[…] trends would not be significant were the y-axes started at zero.
all need to start at 0[…] Please fix
We acknowledge your concern that FID or BPD sometimes vary in a narrow y-range in our plots, but this is usual in the diffusion/FM literature. Adding a 0 would be highly unusual and clutter plots. Instead we follow the practice of very highly cited papers in the literature (and more generally ML) to let matplotlib choose the y-axis bounds:
- Lipman & al, Flow Matching for Generative Modelling FID: Fig. 5 & 7
- Song & al., Score-based Generative Modeling through Stochastic Differential Equations FID: Fig. 10
- Liu & al., Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow FID: Fig. 8; Straightness (our 𝜅): Fig. 9
- Nichol & al., Improved Denoising Diffusion Probabilistic Models NLL/BPD: Fig. 15 & 16; FID: Fig. 17
While adding a 0 in our y-axes would be problematic, we commit, however, to
- provide more tables (as done in this rebuttal),
- highlight in the captions the narrow BPD and loss ranges
- remove Fig. 11 & 13, as they just present the data in Fig. 1 & 2 without the 🔻 IFM / green▲ ground-truth lower bound.
The numerical results […] do not paint a clear picture […]exception is perhaps the Korotin benchmark.
For the Piecewise quadratic benchmark (Fig. 1)
- very large n (1M, 2M) improves on small n (16k) and IFM on all metrics.
- The curvature 𝜅 for large n is 25x to 4x smaller than IFM's.
- The reconstruction loss of large-n OTFM is always 10~20% better than IFM's.
- BPDs are comparable, but this is usual: BPD values tend to cluster when comparing similar methods (see e.g. Table 7 in [Papamakarios & al., Masked Autoregressive Flow for Density Estimation]).
We agree that CIFAR-10 results lack significance, because CIFAR-10 is too small (the 50k database size is smaller than our larger batch sizes n). We will move CIFAR-10 to the appendix ([Pooladian et al.] skipped it altogether).
For ImageNet-32 and ImageNet-64, all FID metrics for small NFE are significantly better than IFM, and 𝜅 improves as well (Figs. 16 & 18, Tables 1 & 2). This is visually striking in Fig. 17 and Fig. 19.
Even though it is perhaps disappointing […] inform effective FM practice.
Given your lowest possible rating of 1 to the significance & originality of our work, we felt encouraged to see that it still managed to trigger so many open questions.
Our findings are, however, clear: using large-n / small-𝜀 Sinkhorn couplings always improved (or left unchanged) FM metrics in all problems we considered.
The magnitude of these gains depends on data, as you mention, but also on model capacity & training method. Predicting the magnitude of these gains will necessitate a "scaling laws" approach that we leave for follow-up work.
What is the rationale for scaling the entropy level by the std of the cost-matrix[…] numerical results comparing this choice[…]
As an example, consider a cost matrix C with values distributed in [0, 1], and the shifted cost C + 100.
- Mathematically, the EOT solution for C + 100 is the same as that for C. Indeed, because the total mass of the coupling P is fixed, ⟨P, C + 100⟩ differs from ⟨P, C⟩ only by a constant.
- Hence, EOT solutions are invariant to the addition of a constant to the cost when using the same 𝜀 (a particular case of L.187 with a global shift).
- Hence, a data-dependent rule to set 𝜀 (or its scale) should retain this shift-invariance property.
- The mean, median or max rules introduced in [Cuturi ’13] are not shift-invariant. They were introduced to avoid underflow in the kernel e^{−C/𝜀}. Most solvers used nowadays avoid this problem by relying on log-sum-exp computations (L.109).
- Yet, the 𝜀 selected with the mean, median or max rules would be ~100x larger for C + 100 compared to C, despite the fact that these problems are highly similar.
To scale 𝜀 in a shift-invariant way, we propose to quantify dispersion with std. While std is not the only way to achieve this, it is cheap and robust.
We do observe qualitatively that std scaling results in lower variance of # Sinkhorn iterations / renormalised entropy across experiments, compared to mean. We will clarify in the appendix.
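The shift-invariance argument can be checked numerically. This minimal numpy sketch (the matrix size and the +100 shift are arbitrary choices for illustration) contrasts the mean rule with the std rule:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((512, 512))   # cost values in [0, 1]
C_shift = C + 100.0          # same OT problem, shifted by a constant

# Shift-sensitive rules (mean/median/max, as in Cuturi '13) blow up...
print(C_shift.mean() / C.mean())   # about 200x larger epsilon scale
# ...while the std-based scale ignores the global shift entirely.
print(C_shift.std() / C.std())     # 1.0 (up to float error)
```

Since the EOT solution is unchanged by the shift, only the std-based rule keeps 𝜀 consistent across the two equivalent problems.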
changing notation to W_2
We will remove the Wasserstein notation entirely, and only introduce OT couplings.
On lines 115-117 […] Eq. (1) is the continuum OT problem, and therefore its solution is a coupling, not a matrix.[…]
Initially, we instantiated Eq. (1) with discrete (empirical) measures. We removed this when shortening the paper. We will revert and clarify.
the random variable […] Shouldn't t specifically be uniform[…]?
Time is not necessarily sampled uniformly, see e.g. paragraph above Eq. 4.27 in [Flow Matching Guide and Code]. We will clarify.
On line 169, it is stated […] does not actually tell us whether we are close to the optimal assignment, just whether we are close to any 1-1 assignment.
Thanks for the opportunity to clarify.
While we define the renormalized entropy for any coupling (L.167), we write in L.168 that it “provides a simple measure of the proximity of [the coupling] to an optimal assignment matrix”. We evaluate it exclusively on Sinkhorn EOT solutions P_𝜀 along the regularization path.
On that path:
- P_𝜀 interpolates between the optimal permutation (assuming it is unique, which holds w.h.p.) as 𝜀 → 0 and the independent coupling as 𝜀 → ∞.
- The entropy of P_𝜀 strictly increases as 𝜀 grows (e.g. Eq. 4.5 in [Peyré and Cuturi ’19]), so the renormalized entropy sweeps [0, 1] monotonically along the path.
- Therefore, a small renormalized entropy certifies that P_𝜀 is close to the optimal assignment.
Hence, qualitatively speaking, the renormalized entropy is a cheap proxy to quantify the closeness of P_𝜀 to the optimal assignment without having to compute that assignment itself.
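As a concrete illustration, here is one plausible dimension-free normalization of the coupling entropy, (H(P) − log n)/log n for an n × n coupling P; this formula is an assumption for the sketch, not necessarily the paper's exact definition. It maps any 1-1 assignment to 0 and the independent coupling to 1:

```python
import numpy as np

def renormalized_entropy(P):
    """Assumed normalization (H(P) - log n) / log n for an n x n coupling P
    whose entries sum to 1: 0 for a permutation, 1 for the independent coupling."""
    n = P.shape[0]
    p = P[P > 0]                       # drop zero entries (0 log 0 = 0)
    H = -np.sum(p * np.log(p))
    return (H - np.log(n)) / np.log(n)

n = 1024
perm = np.eye(n) / n               # a deterministic 1-1 assignment
indep = np.ones((n, n)) / n**2     # the independent coupling

print(renormalized_entropy(perm))    # ≈ 0.0
print(renormalized_entropy(indep))   # ≈ 1.0
```

Note that, as the review points out, any permutation coupling reaches 0, not just the optimal one; the monotonicity along the Sinkhorn regularization path is what makes the quantity a proxy for closeness to the optimal assignment.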
citation styles used are inconsistent and out-of-step with NeurIPS guidelines.[…]
We follow the NeurIPS guidelines (Section 4.1) and use the natbib \citeauthor macro to highlight famous names/results, to facilitate reading. We can remove this.
I thank the authors for their detailed response.
-
Thank you for clarifying that you do not use the Hungarian algorithm; what I was alluding to is that in the piecewise affine and Korotin benchmarks you have access to a ground-truth OT coupling and use it as a basis of comparison to the Sinkhorn OT couplings. I will update my summary.
-
With regard to the axis limits of the plots, while I agree that it is not necessary to start the y-axis exactly at zero on every single plot, it is necessary to include a significant range of y-values on the axis so that the reader can easily judge the magnitude of the trend in the data. The extremely narrow y-limits resulting from the use of matplotlib defaults are misleading because the resulting plots suggest that dramatic trends exist when they in fact do not. For instance, the limits on the y-axis on the leftmost “BPD” plot in Figure 1 are (2.765, 2.775) — meaning that the upper limit differs from the lower limit by less than 1%. By contrast, the plots that the authors cite to justify the use of the matplotlib defaults all contain y-axis limits which differ by at least 100% (with the exception of the straightness plot in the rectified flow paper, which I would also argue does not demonstrate best practice). Thus, suggesting that the plots I flagged as problematic are in line with plots in highly cited ML papers isn’t credible. Presenting the data in question in tabular form would be a good alternative for the experiments where the metrics varied in a very narrow range, but if you do not want to widen the y-axes on the problematic plots I think it would be best to remove the plots entirely. This is a matter of clarity and credibility: we want it to be easy for readers to accurately assess the results of your experiments. Including plots with very narrow y-ranges will confuse readers, cause readers to think that you are intentionally trying to deceive them, or both.
-
Regarding the improvement in FID on ImageNet, I see that using a large OT coupling results in marked improvement over the independent coupling when integration is performed with Euler. However, the differentiation between various settings of n and 𝜀 seems small to me, especially for ImageNet-64. From a practical standpoint, will small decreases in FID resulting from increased n be noticeable in the generated images? Figures 17 & 19 don’t show strong improvements with increased batch size to my eye. Maybe there are smaller values of n that one could use and still obtain similar images? I.e., could you identify a “threshold” value of n that resulted in acceptable/saturated image quality? I see that you are running some follow-up experiments with smaller n in response to reviewer zUcn, which should help elucidate this threshold. As you mentioned, saturation of image quality with increasing n could be an interesting question to address theoretically in follow-up work.
-
Thank you for the clarification about scaling by the standard deviation; I see the rationale now. If you have room, I would suggest adding some of these clarifying comments in the main body of the paper, since the choice of 𝜀 is one of the main stated contributions of the paper.
Thank you for clarifying that you do not use the Hungarian algorithm [...] I was alluding to is that in the piecewise affine and Korotin benchmarks you have access to a ground-truth OT coupling and use it as a basis of comparison to the Sinkhorn OT couplings.
Thanks! You are in fact alluding to an extremely important point that is, in our opinion, often missed when using a ground truth OT map to benchmark OT / flow solvers.
The most important difference when training with a ground truth OT baseline is not that we provide the ground truth coupling to the ground truth approach, but rather the way training batches are constructed:
- For ground truth OT, in the same training batch, every ground-truth origin point is associated to its corresponding ground-truth target, see L.258.
- For our method and IFM, we feed independently generated batches of source and target (L.254): all of the subtlety lies in using independently resampled source points. In other words, even "disambiguating" an unpaired batch by running an exact OT coupling on such batches does not provide paired sources and their corresponding "true" targets, since the exact OT transport of a source point is never in the target set. This naturally makes it much harder for estimators to handle high dimensions, but this is the only realistic setting to benchmark OT approaches.
With regard to the axis limits of the plots, while I agree [...], it is necessary to include a significant range of y-values on the axis so that the reader can easily judge the magnitude of the trend in the data. [...] This is a matter of clarity and credibility [...]
We understand and acknowledge your concern. In our opinion, the clearest "offender" regarding your point is our BPD plots. We commit to moving BPD results to the appendix (this will free space for material following your and other reviewers' recommendations) and only provide them as a table, removing that part of the plots.
We feel that the other metrics that matter most (the reconstruction loss, 𝜅, FID) are fairly represented in their y-axes. If anything, we sometimes even preferred understating the much better performance of our methods vs. IFM 🔻 w.r.t. curvature 𝜅 in the Piecewise quadratic benchmark, by using arrows/numerical values for IFM 🔻 (top of Figure 1 / 10). We felt that a comparison to the ground-truth lower bound was more useful there.
[...] I see that using a large OT coupling results in marked improvement over the independent coupling when integration is performed with Euler. However, the differentiation between various settings of n and 𝜀 seems small to me, especially for ImageNet-64.
Maybe there is a bit of information overload, as we targeted too many results across n and 𝜀. In that regard, Tables 1 and 2 may be easier to interpret, as they use a single 𝜀, and drive home more simply the "large n is better" message.
From a practical standpoint, will small decreases in FID resulting from increased n be noticeable in the generated images? Figures 17 & 19 don’t show strong improvements with increased batch size to my eye.
While this is subjective, we would argue that there is a gradual improvement in image quality as n grows larger (images look sharper) in the leftmost columns. Ultimately, this is reflected in the smaller FIDs that we see for Euler integration.
Maybe there are smaller values of n that one could use and still obtain similar images? I.e., could you identify a “threshold” value of n that resulted in acceptable/saturated image quality?
Measuring this "acceptable" quality is a difficult problem... To discuss such improvements, we see no other alternative than FID at this point, for lack of a better measuring stick. We used other accepted metrics in the cat→wild modality translation experiment presented to Reviewer Lgwo.
I see that you are running some follow-up experiments with smaller n in response to reviewer zUcn
Our point to reviewer zUcn was mostly about re-adding the poor to very poor results that we saw at small n, and we committed to adding them (see also our latest answer above). Other than that, we think that the trends, if one digests the wide variety of settings, point very clearly to improving results with larger n (e.g. Tables 1 / 2).
Thank you for the clarification about scaling by the standard deviation; [...] suggest adding some of these clarifying comments in the main body of the paper, since the choice of is one of the main stated contributions of the paper.
Definitely! We absolutely intend to add our answers to all of the comments you and other reviewers have provided so far to improve the readability of this draft.
In that sense, we are grateful for your comments that have helped improve our work, and thank you again for your reviewing time.
Following the comment of Reviewer zUcn on small OT batch sizes n, we had committed to include small-n experimental results in the draft. This is also a topic that appeared in your latest response to our rebuttal, hence we take this opportunity to share our latest results.
We are happy to report preliminary experiments with small n on ImageNet-32, which confirm that while using small OT batch sizes can offer some FID gains over IFM for small NFE, they may also perform worse than IFM for larger NFE / adaptive solvers. Small n is, on the other hand, always significantly worse than large n.
The following table shows our experiment with a small OT batch size (n = 256 images, split across devices) on ImageNet-32, and compares it with IFM and a larger OT batch size. As seen from the table,
- the large OT batch size is substantially better than the small one for all metrics.
- For large NFE, we even see that IFM edges above n = 256 regardless of 𝜀.
| OT Batch Size | 𝜀 | FID@NFE=4 | FID@NFE=8 | FID@NFE=16 | FID@Dopri5 | Checkpoint (steps) |
|---|---|---|---|---|---|---|
| IFM | NA | 65.8 | 23.9 | 12.1 | 5.38 | 180K |
| 256 | 0.3 | 42.1 | 20.2 | 13.8 | 8.48 | 180K |
| 256 | 0.1 | 39.7 | 19.4 | 13.4 | 8.16 | 180K |
| 256 | 0.03 | 39.7 | 19.7 | 13.4 | 8.15 | 180K |
| 524288 | 0.03 | 30.2 | 14.9 | 9.54 | 5.18 | 180K |
These comparisons are made at the 180K-step checkpoint (out of the full training schedule) because our runs with the largest batch size haven’t finished yet. We note, however, that measures like FID@NFE=4/8 have already plateaued and will not improve at later checkpoints (see also our response to Reviewer 3ETv showing that ImageNet-32 metrics tend to plateau from 150k iterations, and might even slightly increase again after ~250k iterations).
We are not able to show generated images (as in Fig. 17) per the rebuttal policy this year, but these images come out as expected (notably blurrier at n = 256 for small NFE, compared to large n).
Many thanks again to you (and Reviewer zUcn) for suggesting this experiment; we agree that it completes our message by also looking at what happens at the "left end of the range", not just at very large n.
Many thanks to the authors for their detailed responses and for following up with new experimental results. I have enjoyed and learned from the discussion. Many of my concerns have been addressed and I think the manuscript will improve based on the items discussed above, so I will increase my score.
We are very grateful for your time and engagement during this conversation, and we are very thankful for your score increase. We are in particular very happy to hear that we have addressed your concerns. We commit to including our answers above to the draft. Many thanks again for helping us improve our draft!
The authors study the problem of using mini-batch optimal transport to improve the generation capabilities of flow matching models. The authors make use of entropy-regularized OT to interpolate between optimal couplings (mini-batch OT) and independent couplings (as used in practice). They leverage the setup of [1] to keep the entropy regularization parameter in the range [0, 1] instead of [0, ∞), which allows for interpretable results. They then study the effects of 𝜀 across different batch sizes and datasets. Their results find that using mini-batch OT can improve generation results, especially when the batch size is large.
[1] Cuturi, Marco. "Sinkhorn distances: Lightspeed computation of optimal transport." Advances in neural information processing systems 26 (2013).
Strengths and Weaknesses
Strengths
- The paper is very well written, with a nice introduction to the topics and literature review
- The experimental setup is comprehensive, including synthetic datasets where the ground truth is available, as well as more realistic datasets like ImageNet
Weaknesses
- The authors considered a minimum batch size of 2048; it would be important to know the effect of mini-batch OT in the range below that, as it is common in practice to use smaller batch sizes
- The results seem to be done in an unconditional setup; the conditional setup is, however, of great importance to the community. Would the authors be able to comment/provide evidence of how these results change when considering, for instance, class labels? It seems like the task of providing the coupling can be a little trickier there, which may limit the applicability to higher-scale problems
Questions
- Could the authors confirm if ImageNet was done in an unconditional setup?
- The authors mentioned that the results for the main paper were not complete, could you please provide the new results?
- See weakness section
Limitations
yes
Final Justification
The use of mini-batch OT as a coupling to train flow models is a technique that was not completely well understood. This paper sheds light on when one should use mini-batch OT. Initially the authors were lacking some results, but after the discussion they have agreed to include them. I believe that after adding those, the paper paints a complete picture of the role of mini-batch OT.
Formatting Issues
Many thanks for your encouragements and detailed review.
They leverage the setup of [1] to keep entropy regularization parameter in the range [0, 1] instead of [0, ∞)
Let us clarify this:
- The renormalised entropy is a metric for EOT solutions taking values in [0, 1] (L.176). This metric assesses whether the solution lies close to the optimal assignment (when it is close to 0) or to the independent coupling (when it is close to 1), see also our answer to Rev. 52g8.
- We still need to set 𝜀: we set it as a multiple of the std of the cost (L.166). Choosing that multiple in [0.003, 0.3], we got a sufficiently wide coverage of renormalised entropies in [0, 1] in our experiments.
The paper is very well written, with a nice introduction to the topics and literature review
Thanks!
W1. The authors considered a minimum batch size of 2048, it would be important to know the effect of mini-batch OT in the range below that […]
This is a great point.
We experimented with small n early on in the piecewise affine benchmark and observed fairly poor results for the reconstruction loss and 𝜅 (we were not tracking BPD at that time); see the tables below, which should be compared with the other n's in the left/right columns of Fig. 1 (preferably retrieved as Fig. 10 in the supplementary).
| | 𝜀 = 0.003 | 𝜀 = 0.01 | 𝜀 = 0.03 | 𝜀 = 0.1 | 𝜀 = 0.3 |
|---|---|---|---|---|---|
| ℒ(𝜃) | 2.69 | 2.67 | 2.72 | 2.80 | 3.01 |
| 𝜅(𝜃) | 0.25 | 0.25 | 0.25 | 0.27 | 0.43 |
| | 𝜀 = 0.003 | 𝜀 = 0.01 | 𝜀 = 0.03 | 𝜀 = 0.1 | 𝜀 = 0.3 |
|---|---|---|---|---|---|
| ℒ(𝜃) | 24.83 | 24.99 | 25.17 | 25.06 | 25.48 |
| 𝜅(𝜃) | 0.51 | 0.50 | 0.50 | 0.52 | 0.60 |
Performance drops compared to larger n. More surprisingly, the reconstruction metric in this small-n regime is much worse than both the IFM and OTFM baselines.
- For small 𝜀, we hypothesize that this is due to the statistical bias of exact OT estimated from small batches in high dimension.
- For large 𝜀, one would expect results to be, in principle, more similar to IFM (they both result in independent sampling). This difference is likely due to the way we implemented Alg. 1, more specifically Step 3:
- When using IFM, we use the arbitrary identity assignment (L.136);
- For OTFM, we sample pairs from the Sinkhorn coupling, which approaches the independent coupling as 𝜀 grows.
- Both procedures are roughly equivalent for large n but differ for very small n, because OTFM is then less sample-efficient: it picks samples with replacement and likely duplicates some data points while dropping others (stratified vs. non-stratified sampling).
Generally, using very small n was challenging in our earlier implementation, as it resulted in coupling sizes that were potentially smaller than the default batch size needed to train FM models.
We will rerun some small-n experiments for ImageNet-32/64 and add this discussion to the appendix.
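The sample-efficiency gap between the identity assignment and sampling from a (large-𝜀) coupling is easy to see in a toy numpy experiment; the batch size is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024

# IFM-style pairing: the arbitrary identity assignment, each point used once.
ifm_idx = np.arange(n)

# Large-eps OTFM-style pairing: the coupling approaches the independent
# coupling, so drawing n pairs with replacement duplicates some points
# and drops others (on average only ~63% of points are kept).
ot_idx = rng.integers(0, n, size=n)

print(len(np.unique(ifm_idx)) / n)   # 1.0: stratified, every point appears
print(len(np.unique(ot_idx)) / n)    # < 1: non-stratified sampling
```

This is exactly the stratified vs. non-stratified distinction invoked above to explain why very small n with large 𝜀 can underperform IFM.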
W2. Could the authors confirm if Imagenet was done in an unconditional setup?
Yes, all results in the paper & supplementary were unconditional.
W2. […] conditional set up is however, of great importance to the community. Would the authors be able to comment/provide evidence of how these results change when considering, for instance, class labels?
[Chemseddine 2023] and others have applied OT-FM to the class-conditional setup, augmenting the feature space with a one-hot class label. We extended our ImageNet-32 experiments to conditional generation over 1000 classes. The following FID results are for 𝜀 = 0.1 / I-FM after 180k steps; the renormalized entropy is consistently ~0.1.
ImageNet-32: conditional generation FID
| n | Dopri5 | Euler4 |
|---|---|---|
| I-FM | 3.68 | 32.37 |
| | 4.17 | 29.37 |
| | 3.74 | 26.06 |
| | 3.52 | 24.79 |
| | 3.47 | 23.68 |
When testing conditional generation, we plan to move to latent-space representations. We prioritized unconditional / pixel-space generation for now because the community wants to see generation methods operate reasonably well in pixel space (performance is harder to judge visually when rendered using high-quality latent spaces).
[Chemseddine J. et al., Conditional Wasserstein distances with applications in Bayesian OT flow matching]
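Following [Chemseddine 2023], conditioning amounts to augmenting features with a (weighted) one-hot label before computing the pairing cost. Below, a minimal numpy sketch; the label weight `w` and the argmin pairing are illustrative stand-ins, not our exact setup. With a large enough weight, pairing happens within classes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, w = 60, 4, 3, 100.0                 # w: hypothetical label weight

labels = rng.integers(0, k, size=n)          # ground-truth data labels
data = rng.standard_normal((n, d))
noise = rng.standard_normal((n, d))
noise_labels = labels[rng.permutation(n)]    # random train labels for noise

def augment(z, lab):
    # append a sqrt(w)-scaled one-hot label to each feature vector
    return np.concatenate([z, np.eye(k)[lab] * np.sqrt(w)], axis=1)

# Augmented squared-Euclidean cost: base cost + 2w penalty if labels differ.
C = ((augment(noise, noise_labels)[:, None, :] -
      augment(data, labels)[None, :, :]) ** 2).sum(-1)
match = C.argmin(axis=1)
print((labels[match] == noise_labels).mean())  # -> 1.0
```

In practice the augmented cost feeds into the same Sinkhorn pairing as in the unconditional case, and the approach extends to continuous conditions.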
The authors mentioned that the results for the main paper were not complete, could you please provide the new results?
Indeed, the main paper results were incomplete at time of submission, as some of our jobs (e.g. Imagenet32/64) took a few days longer than expected. We apologise for this.
The complete results were submitted in the supplementary file, 1 week after the main paper deadline. They are in the .zip file at the top of this page.
Speeding up Sinkhorn
Since submission, we have progressed on our agenda to scale up Sinkhorn for OTFM to unprecedented scales. We report on the following changes, which we plan to add to our draft.
1. Warmstart
We leverage warmstarting for Sinkhorn. Rather than always reinitializing the dual potentials as in L.1 of Alg. 1, we extrapolate the solution obtained when coupling the previous batch of (noise, data) to initialize the solver for the next batch.
More precisely, given the pair of optimal n-dimensional dual vectors output at Line 7 of Alg. 1, corresponding to the noise and data vectors, we follow [Eq. 9, Thornton & Cuturi 2022] to initialize the potentials for a new batch, with an adaptive rescaling.
For ImageNet-32, we observe substantial (~1.7x) speedups for large n, as reported below. We report runtime in seconds; we have checked that counting iterations instead results in the same trends.
ImageNet-32, average Sinkhorn time (in seconds) to solve Alg. 1; columns are batch sizes n
| 𝜀 | 16384 | 65536 | 262144 | 524288 |
|---|---|---|---|---|
| 0.003 | 5.97 | 223.93 | 2300.03 | 9207.89 |
| 0.01 | 2.02 | 73.64 | 710.4 | 2893.43 |
| 0.03 | 0.65 | 22.01 | 218.63 | 836.82 |
| 0.1 | 0.18 | 5.8 | 61.25 | 229.85 |
| 0.3 | 0.09 | 2.37 | 18.32 | 66.73 |
Same as above, using warmstarting
| 𝜀 | 16384 | 65536 | 262144 | 524288 |
|---|---|---|---|---|
| 0.003 | 3.71 | 133.54 | 1271.23 | 4916.29 |
| 0.01 | 1.4 | 48.68 | 466.47 | 1791.55 |
| 0.03 | 0.49 | 16.09 | 153.78 | 600.5 |
| 0.1 | 0.14 | 3.16 | 31.75 | 126.1 |
| 0.3 | 0.06 | 1.72 | 17.6 | 67.5 |
Please note that warmstarting is not an approximation; it is simply a way to leverage previous solutions to solve Alg. 1 faster.
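As an illustration of why reusing previous dual solutions helps, here is a numpy sketch of warmstarting across batches; for transferring potentials to the new points we use a simple nearest-neighbour stand-in, not the actual extrapolation of [Eq. 9, Thornton & Cuturi 2022], and all sizes are illustrative:

```python
import numpy as np

def lse(z, axis):
    m = z.max(axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True)), axis=axis)

def sinkhorn(x, y, eps, f0=None, tol=1e-4, max_iter=10_000):
    # Log-domain Sinkhorn with uniform marginals; returns (f, g, #iterations).
    n = x.shape[0]
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    f = np.zeros(n) if f0 is None else f0.copy()
    for it in range(1, max_iter + 1):
        g = -eps * lse((f[:, None] - C) / eps - np.log(n), axis=0)
        f_new = -eps * lse((g[None, :] - C) / eps - np.log(n), axis=1)
        if np.abs(f_new - f).max() < tol:
            return f_new, g, it
        f = f_new
    return f, g, max_iter

rng = np.random.default_rng(0)
n, d, eps = 256, 2, 0.05
x1, y1 = rng.standard_normal((n, d)), rng.standard_normal((n, d)) + 2.0
f1, _, _ = sinkhorn(x1, y1, eps)

# Next batch, drawn from the same (noise, data) distributions.
x2, y2 = rng.standard_normal((n, d)), rng.standard_normal((n, d)) + 2.0
_, _, cold = sinkhorn(x2, y2, eps)             # L.1 init: zero potentials
# Transfer previous potentials to the new points via nearest neighbours
# (a stand-in for the adaptive extrapolation used in the paper).
nn = ((x2[:, None, :] - x1[None, :, :]) ** 2).sum(-1).argmin(axis=1)
_, _, warm = sinkhorn(x2, y2, eps, f0=f1[nn])  # warmstarted init
print(f"cold: {cold} iters, warm: {warm} iters")
```

The warmstarted run reaches the same fixed point, only faster, which is why warmstarting is exact rather than approximate.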
2. Computing matchings in PCA Space
To fit computations for large n in memory, our implementation re-instantiates the cost matrix C (L.190) block by block at each Sinkhorn iteration (L.201), as proposed originally in the GeomLoss package and later in OTT-JAX. This is needed because at n = 1M, storing C is infeasible, even across GPUs (L.204): a 1M x 1M cost matrix in float32 would require 4 TB of memory. As a consequence, the compute cost per Sinkhorn iteration is O(n²d).
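The block-by-block evaluation can be sketched as follows (a numpy stand-in for the GeomLoss / OTT-JAX online scheme; block and problem sizes are illustrative): a single f-update touches only one column block of the cost at a time, so the full n x n matrix is never materialized.

```python
import numpy as np

def f_update_dense(C, g, eps, n):
    # Reference f-update using the fully materialized cost matrix C.
    z = (g[None, :] - C) / eps - np.log(n)
    m = z.max(axis=1, keepdims=True)
    return -eps * (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True))).ravel()

def f_update_blockwise(x, y, g, eps, block=100):
    # Same update, re-instantiating cost blocks on the fly: peak memory is
    # O(n * block) instead of O(n^2), at O(n^2 d) compute per iteration.
    n = x.shape[0]
    acc = np.full(n, -np.inf)                   # running logsumexp per row
    for s in range(0, n, block):
        Cb = ((x[:, None, :] - y[None, s:s + block, :]) ** 2).sum(-1)
        zb = (g[None, s:s + block] - Cb) / eps - np.log(n)
        m = np.maximum(acc, zb.max(axis=1))
        acc = m + np.log(np.exp(acc - m) + np.exp(zb - m[:, None]).sum(axis=1))
    return -eps * acc

rng = np.random.default_rng(0)
n, d, eps = 330, 8, 0.5                         # n not a multiple of the block
x, y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
g = rng.standard_normal(n)

C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
assert np.allclose(f_update_blockwise(x, y, g, eps), f_update_dense(C, g, eps, n))
```

Both updates agree to numerical precision; only the memory footprint differs.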
When both n and d are huge, we propose to alleviate this by projecting both noise and data onto the top-k PCA subspace of the data, and computing the cost between these k-dimensional projections. This results in a per-iteration cost of O(n²k). Note that FM training (Steps 5-8 in Alg. 2) still happens in the original space, since PCA is only used to compute pairings.
When applying PCA to ImageNet-64 (full dimension d = 12,288),
- with n = 131,072 and 𝜀 = 0.1, we see up to 10x speedup over the full dimension, with no impact on renormalized entropy nor on downstream FM metrics;
- we can increase to n = 1,048,576 using 𝜀 = 0.01 (rightmost columns), which was not feasible with full dimensionality, and achieve even better FID for low NFE.
| PCA dimension | 500 | 1000 | 3000 | 12288 (full) | 500 | 1000 |
|---|---|---|---|---|---|---|
| (n, 𝜀) | (131,072, 0.1) | (131,072, 0.1) | (131,072, 0.1) | (131,072, 0.1) | (1,048,576, 0.01) | (1,048,576, 0.01) |
| Sinkhorn time (s) | 1.45 | 1.82 | 4.05 | 14.1 | 445 | 727 |
| FID@NFE=4 | 48.4 | 47.8 | 47.2 | 47.2 | 46.7 | 45.7 |
| FID@NFE=8 | 24.4 | 24.1 | 23.8 | 23.8 | 23.9 | 23.5 |
| FID@NFE=16 | 15.2 | 15.1 | 15.2 | 15.2 | 15.1 | 15.0 |
| FID@Dopri5 | 8.44 | 8.37 | 8.69 | 8.64 | 8.26 | 8.53 |
| Ren. Entr. | 0.247 | 0.239 | 0.232 | 0.236 | 0.072 | 0.072 |
We use the entire dataset to estimate FID statistics for this table, hence the numbers are more accurate than in the original draft.
3. Precomputing Noise/Data Pairs
Our implementation, to be added to OTT-JAX upon publication, can now precompute pairs of noise / data, as claimed in L.143: we can now decouple Steps 1-4 of Alg. 2 from FM training (Steps 5-8).
- As data points are retrieved from a dataloader `DL`, and Gaussian noises are resampled, the outputs of Steps 1-4 (Sinkhorn + categorical sampling) are accumulated and buffered in a new augmented dataloader `DL~`.
- To avoid storing noise vectors, we generate each noise vector using random integer `rng` keys; rather than storing pairs of large vectors in `DL~`, we accumulate the output of Step 4 as pairs of data identifier `id_i` and seed `rng_{𝜎(i)}` in `DL~`.
- When training FM (Steps 5-8), we load pairs of indices from `DL~`. For each `(rng, id)` pair retrieved from `DL~`, the corresponding data vector is retrieved and the noise vector is regenerated using the `rng`.
We use this approach to ablate any hyperparameter of FM training, e.g. learning rate as discussed in the answer to Reviewer 3ETv, avoiding Sinkhorn recomputations.
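The seed-based buffering above can be sketched as follows (names such as `noise_from_seed` and the toy `buffered_pairs` list are ours, standing in for `DL~`; the pairing `sigma` is a made-up permutation, not a Sinkhorn output):

```python
import numpy as np

d = 3072  # e.g. a flattened 32x32x3 image

def noise_from_seed(seed, d):
    # Regenerating the noise from its integer key avoids storing the vector.
    return np.random.default_rng(seed).standard_normal(d)

# Steps 1-4 (done offline): pair data ids with noise seeds after Sinkhorn +
# categorical sampling; sigma here is an illustrative permutation.
rng = np.random.default_rng(42)
seeds = rng.integers(0, 2**31, size=8)
sigma = rng.permutation(8)
buffered_pairs = [(int(i), int(seeds[sigma[i]])) for i in range(8)]  # "DL~"

# Steps 5-8 (training): reload a pair, regenerate the identical noise vector.
data_id, seed = buffered_pairs[0]
z1 = noise_from_seed(seed, d)
z2 = noise_from_seed(seed, d)
assert np.array_equal(z1, z2)  # same key -> bit-identical noise
```

Storing two integers per pair instead of a d-dimensional vector is what makes buffering millions of precomputed pairs cheap.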
[J. Thornton and M. Cuturi, Rethinking Initialization of the Sinkhorn Algorithm, AISTATS23]
I thank the authors for their detailed answers. Most of my questions are resolved, but I would like to emphasize on the following point.
The results for small OT batch sizes (perhaps including even smaller values than those shown would be good) should be included in the main text, not the appendix. Without this information, the picture remains incomplete, and it becomes unclear when mini-batch OT is helpful. The reasons outlined by the authors as to why mini-batch OT might fail for small batch sizes, even resulting in worse results, are of great importance.
If the authors could include this complete study, I believe this would be an immense contribution to the community.
Many thanks for acknowledging our rebuttal, and for your encouragement and detailed recommendations.
Adding small-n results: We agree that adding these small-n results (which leave very few datapoints per device when running on an 8-GPU node) to the paper would be beneficial for the community.
Initially, we did not report these numbers because they were out of range in the corresponding column of our piecewise-affine synthetic benchmark (see our table above). We commit to:
- Report numbers for all dimensions and benchmarks in Fig. 10 / 12; for small n, do as we do when the IFM 🔻 is out of range on the y-axis (in Fig. 10), i.e. with an arrow and numerical values, or possibly use a broken y-axis (we toyed with this idea but gave up at the time due to its complexity; we will try again).
- Shorten the discussion above on sampling with/without replacement and add it to the main text
- Run small-n experiments for ImageNet-32 and ImageNet-64.
In the interest of providing additional perspectives on this specific point around ultra-small OT batches: we believe that very small n not only yields worse FM metrics, but also brings additional challenges in our implementation, which is tailored to larger scales:
- Small n incurs a lot of compute overhead: to take a somewhat extreme example, while running 8192 small Sinkhorn problems is on paper much faster than running a single large one, significant time is wasted on the GPU with our implementation when re-instantiating these problems, leading to irregular / inefficient GPU utilization, even in arguably the most compute-intensive setting, 𝜀 = 0.003:
Average GPU Utilization, Piecewise Affine Benchmark
| 𝜀 | n = 256 | n = 2,097,152 |
|---|---|---|
| 0.003 | 35.82% | 90.66% |
- Some of our speed-up improvements, such as warmstarting and PCA, won't work, as they explicitly leverage the better quality of the dual solution / of the dimensionality reduction obtained as n grows.
- For instance, if we merge the two tables above [ImageNet-32, average Sinkhorn time (in seconds) to solve Alg. 1] and [Same as above, using warmstarting] into a single speed-up view, and include the smaller n = 2048 (we initially omitted that column above because we are missing one point), we see that warmstarting is often detrimental for the smallest n.
ImageNet-32, speedup obtained when using warmstarting over naive implementation (>1 = better)
| 𝜀 | 2048 | 16384 | 65536 | 262144 | 524288 |
|---|---|---|---|---|---|
| 0.003 | 0.09 | 1.61 | 1.68 | 1.81 | 1.87 |
| 0.01 | 1.3 | 1.44 | 1.51 | 1.52 | 1.62 |
| 0.03 | 0.77 | 1.32 | 1.37 | 1.42 | 1.39 |
| 0.1 | 0.9 | 1.36 | 1.84 | 1.93 | 1.82 |
| 0.3 | - | 1.4 | 1.38 | 1.04 | 0.99 |
- This degradation for smaller n is intuitive: because the sample size is small, the dual potentials estimated by Sinkhorn do not generalize well to unseen data (especially for the smallest 𝜀) and fail to provide a useful initialization for the next batch.
We are extremely grateful for the time you spent reading our draft and our comments, and thankful for your recommendations. Thanks again!
The authors
I thank the authors for their commitment to improving the paper. I have increased my score.
We are very grateful that you read our 2nd response, as well as for your score increase.
Following your comment on small OT batch sizes and our commitment to include such experiments in the paper, we are happy to report our preliminary experiments with small n on ImageNet-32.
These preliminary experiments confirm that while using small OT batch sizes can offer some FID gains over IFM for the smallest NFEs, it may also perform worse relative to IFM for larger NFE / adaptive solvers. In either case, small n is always significantly worse than large n.
The following table shows our experiment with a small OT batch size (n = 256) on ImageNet-32, and compares it with IFM and a larger OT batch size (n = 524,288). As seen from the table,
- a large OT batch size is substantially better than a small one for all metrics;
- for large NFE, we even see that IFM edges above n = 256, regardless of 𝜀.
| OT Batch Size | 𝜀 | FID@NFE=4 | FID@NFE=8 | FID@NFE=16 | FID@Dopri5 | Checkpoint (steps) |
|---|---|---|---|---|---|---|
| IFM | NA | 65.8 | 23.9 | 12.1 | 5.38 | 180K |
| 256 | 0.3 | 42.1 | 20.2 | 13.8 | 8.48 | 180K |
| 256 | 0.1 | 39.7 | 19.4 | 13.4 | 8.16 | 180K |
| 256 | 0.03 | 39.7 | 19.7 | 13.4 | 8.15 | 180K |
| 524288 | 0.03 | 30.2 | 14.9 | 9.54 | 5.18 | 180K |
Please note that we make the comparison at the 180K-step checkpoint (out of the planned total) since our runs with the large batch size haven't finished yet. However, measures like FID@NFE=4/8 have already plateaued and will not improve at later checkpoints (see also our response to Reviewer 3ETv, showing that ImageNet-32 metrics tend to plateau from 150k iterations, and might even slightly increase again after ~250k iterations).
We are seeing very similar trends on ImageNet-64 (if not worse ones), but we need to wait longer for the runs to conclude.
Many thanks again to you (and Reviewer 52g8) for suggesting this experiment; we agree that it completes our message by also looking at what happens at the "left end of the range", not just at very large n.
The authors consider the Flow Matching (FM) algorithm and study its training with minibatch Optimal Transport (OT) plans, computed via the Sinkhorn algorithm. Unlike the existing literature, which focuses on small minibatches (e.g., 256), the authors explore the use of extremely large batches (up to millions of points) and demonstrate that this approach can benefit FM training. However, utilizing such large batches introduces numerical and computational challenges, which the authors address through several technical contributions, including the automatic rescaling of 𝜀 in Sinkhorn, a scale-free renormalized coupling entropy, modifications to the transport cost calculation, and a multi-GPU implementation for scaling Sinkhorn. Experiments are conducted on both synthetic ("W2 benchmark" [1]) and real-world data (CIFAR, ImageNet).
The initial reviews were mixed. Primary concerns included the limited range of OT batch sizes (with a focus on large batches), a lack of conditional generation experiments, potentially misleading plots, and the absence of error bars. During extensive discussions, the authors addressed most of these concerns, leading the reviewers to update their scores positively.
In summary, based on the author-reviewer discussion and the meta-reviewer's own reading of the paper, the positive aspects are:
-
The paper presents a comprehensive study on the use of large batches in Flow Matching, offering practical and useful recipes.
-
It explores minibatch OT in FM at an unprecedented scale, employing significant engineering effort and providing clear guidelines.
The negative aspects are:
-
The paper is highly technical, and one could argue it reads more like an engineering/technical report than a scientific paper.
-
The entire methodology is inapplicable to modern conditional problems (e.g., text-to-image generation), where minibatches cannot be constructed in the same way (as for a single condition there is typically just one target).
While the first bullet point is common nowadays, the second negatively distinguishes the current work from many other papers of the same kind (which propose technical improvements for generative models). However, in private discussion, reviewers additionally suggested that this paper could hold value not only for the generative modeling community but also for the optimal transport community.
Given the above, the meta-reviewer considers this paper to be borderline. During the AC-SAC discussion, it was decided that the paper should be rejected in its current form. It was also noted that the paper would benefit from addressing additional aspects related to other flow matching and broader research, such as:
-
Synergy with other techniques: Whether the proposed large optimal transport (OT) batch techniques are orthogonal to other engineering-focused FM research (such as [2]), in the sense that they could be applied together to further improve performance. This includes applying iterative flow matching techniques (e.g., rectified flow, ReFlow [3]) on top of the proposed large-batch OT-FM and evaluating whether this yields additional improvements.
-
Connection to Schrödinger Bridge Methods: Whether the proposed large-batch OT techniques -- specifically Sinkhorn for entropic OT (also known as the static Schrödinger Bridge) -- can improve the performance (e.g., in terms of FID) of methods based on bridge matching for Schrödinger Bridges. While a complete answer may be beyond the paper's scope, it would be relevant to at least mention the connection between the Schrödinger Bridge and entropic OT, along with the relevant literature. One relevant paper [4] has already been cited in the current study.
The authors are also advised to rework their conclusions to make them more transparent and understandable from a state-of-the-art perspective.
[1] Korotin, A., Li, L., Genevay, A., Solomon, J. M., Filippov, A., & Burnaev, E. (2021). Do neural optimal transport solvers work? A continuous Wasserstein-2 benchmark. Advances in Neural Information Processing Systems, 34, 14593-14605.
[2] Kim, B., Hsieh, Y. G., Klein, M., Ye, J. C., Kawar, B., & Thornton, J. Simple ReFlow: Improved Techniques for Fast Flow Models. In The Thirteenth International Conference on Learning Representations.
[3] Liu, X., & Gong, C. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In The Eleventh International Conference on Learning Representations.
[4] Tong, A. Y., Malkin, N., Fatras, K., Atanackovic, L., Zhang, Y., Huguet, G., ... & Bengio, Y. (2024, April). Simulation-Free Schrödinger Bridges via Score and Flow Matching. In International Conference on Artificial Intelligence and Statistics (pp. 1279-1287). PMLR.
Our paper received five positive (4,4,4,5,5), high-quality & confident reviews. With 5 engaged reviewers, the rebuttal was both fruitful and extremely time-consuming not only for authors, but also for reviewers. Meanwhile the AC/SAC both stayed silent, and did not reply to our private messages.
All reviewers unanimously supported the paper, and wrote final recommendations (post-rebuttal) that supported acceptance in very clear terms.
The AC / SAC / PCs rejected the paper citing two weaknesses listed in the metareview:
▪️ "the paper is highly technical. it reads more like an engineering/technical report than a scientific paper."
This criticism is vague, not actionable, and was raised by none of the reviewers. The paper lacks or has too much of what, to be deemed technical vs. scientific? Such a comment would be deemed weak even for a regular review.
▪️ "The entire methodology is inapplicable to modern conditional problems (e.g., text-to-image generation), where minibatches cannot be constructed in the same way (as for a single condition there is typically just one target)."
This impossibility claim is baffling: it is confidently laid out, yet weakly supported by naive reasoning ("cannot be…"). Most problematically, it is wrong.
A direct refutation of this naive claim was in our rebuttal, in which we cited works that used OTFM for conditional generation, notably https://arxiv.org/pdf/2403.18705v3 (JMLR), which was 18 months old at the time of decision. We used the simple method proposed in that paper to provide class-conditional generation results in response to Reviewer zUcn.
The approach is simple and works for continuous conditions: pair samples of (noise, random condition in train) with samples of (data, ground-truth condition) using an augmented cost. Reviewer zUcn (the only reviewer to mention conditional generation in their review) was satisfied with our answer. Reviewer zUcn followed up with unrelated questions & increased their score to 5.
Additional papers that apply and discuss OT couplings methods to conditional FM include https://arxiv.org/abs/2404.04240 (Neurips 24), https://arxiv.org/abs/2503.10636 (ICCV 25).
Note that the AC / SAC / PCs are not expressing a scientific opinion (or doubt) on the feasibility / practicality / scalability of using OT for conditional FM (which we could have discussed), they are blindly stating that the method is "inapplicable". This shows that the AC / SAC / PC chain that validated this meta review is not knowledgeable enough in this area, and they should have exerted caution.
🔺 In summary, in spite of unanimous acceptance ratings (reached after hundreds of hours of work and compute for the rebuttal), the AC/SAC/PCs rejected our paper arguing weaknesses that are poorly phrased or wrong.
Why?
We reached out to the PCs. When confronted with the evidence above, their final argument, after a lengthy exchange, was that unfair decisions happen and that, while accountable, they do not feel responsible for final decisions. That nuance is lost on me. This is an excuse of last resort.
Remarkably, the final request in the metareview to add connections to rectified flows (a costly post-processing approach that distills a high-quality pretrained flow model to make it straighter, and which cannot be compared with the pre-processing approach presented here, training an FM model from scratch) or to Schrödinger bridges reads as a defensive action by an AC or SAC who would rather silence papers that threaten their research agenda than simply bow to the opinion of reviewers and admit they may not hold all the cards.
Ultimately, the PCs sided with the opaque motivations and gate-keeping mindset of the AC/SAC, while ignoring both authors' and reviewers' colossal efforts. This is shameful, and symptomatic of the lack of accountability of the Neurips program committee.
It also shows a wider disregard for authors' efforts witnessed at NeurIPS this year, through many disruptive changes to established practices that systematically penalized authors (e.g. a last-minute change to the pdf rebuttal policy, the refusal to allow links to compensate for this, extensions of the rebuttal period, the inability for authors to reach out to unresponsive reviewers, limits on the number of messages, and a last-minute request that authors write "Author final remarks" for the AC).