PaperHub
Rating: 5.3/10 (Poster, 4 reviewers; min 4, max 7, std 1.1)
Individual ratings: 5, 7, 5, 4
Confidence: 3.8
Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Progressive Entropic Optimal Transport Solvers

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

We propose a progressive algorithm for the estimation of OT maps and plans. We prove its statistical consistency and demonstrate its performance on synthetic and single-cell data.

Abstract

Optimal transport (OT) has profoundly impacted machine learning by providing theoretical and computational tools to realign datasets. In this context, given two large point clouds of sizes $n$ and $m$ in $\mathbb{R}^d$, entropic OT (EOT) solvers have emerged as the most reliable tool to either solve the Kantorovich problem and output an $n\times m$ coupling matrix, or to solve the Monge problem and learn a vector-valued push-forward map. While the robustness of EOT couplings/maps makes them a go-to choice in practical applications, EOT solvers remain difficult to tune because of a small but influential set of hyperparameters, notably the omnipresent entropic regularization strength $\varepsilon$. Setting $\varepsilon$ can be difficult, as it simultaneously impacts various performance metrics, such as compute speed, statistical performance, generalization, and bias. In this work, we propose a new class of EOT solvers (ProgOT), that can estimate both plans and transport maps. We take advantage of several opportunities to optimize the computation of EOT solutions by *dividing* mass displacement using a time discretization, borrowing inspiration from dynamic OT formulations, and *conquering* each of these steps using EOT with properly scheduled parameters. We provide experimental evidence demonstrating that ProgOT is a faster and more robust alternative to *standard solvers* when computing couplings at large scales, even outperforming neural network-based approaches. We also prove statistical consistency of our approach for estimating OT maps.
Keywords
Optimal Transport · Entropy Regularization

Reviews and Discussion

Review (Rating: 5)

In this paper, the authors propose ProgOT, a new entropic optimal transport solver, as an alternative to the commonly used Sinkhorn algorithm. This solver has three main properties:

(i) It is less sensitive to the choice of the entropic regularization parameter than the Sinkhorn algorithm;

(ii) When computing couplings between point clouds, the runtime of ProgOT is no longer than that of the Sinkhorn algorithm;

(iii) The resulting optimal transport map estimator is consistent and stable.

The authors provide both theoretical and empirical results to demonstrate the above claims.

Strengths

  1. The proposed ProgOT is a new entropic optimal transport solver built on McCann-type interpolation.

  2. The theoretical results in the paper are sound and provided with rigorous proofs.

  3. The paper is well written and organized. The background section is very useful, particularly for readers who are not familiar with optimal transport.

Weaknesses

  1. Since the optimal transport map $T_0$ is unknown in practice, the assumption (A.2), which says that the inverse map $T_0^{-1}$ has at least three continuous derivatives, is quite strong.

  2. In lines 33-36, when comparing the efficiency of the Sinkhorn algorithm to the linear programs for solving the EOT problems, the authors should make it more explicit by stating the computational complexity of both algorithms.

  3. In lines 238-239, when initializing the entropic regularization parameter $\varepsilon_0$, the authors should at least briefly present the intuition for setting it to be the average of the values in the cost matrix between sources and targets.

  4. Minor issues: There are some undefined notations and grammatical mistakes:

  • In line 141, I think the term $S^{(1)}$ should be $S^{(0)}$.

  • In the inequality between lines 196-197, the notation $\lesssim_{\log(n),k}$ has not been defined yet.

  • In line 167, the notation $\alpha(k)$ has not been defined. Is it a constant depending on $k$?

  • In the assumption (A.3), what does the notation $D$ stand for?

  • In line 204, "$\mu^{(k)}$ is corresponds a location" --> grammatical mistake.

  5. The authors should add a discussion about the limitations of the proposed method as well as future directions. For instance, can ProgOT be generalized so that it applies to entropic unbalanced optimal transport? I believe that such a discussion would make the paper more complete.

Questions

  1. Is there any chance that assumption (A.2) can be relaxed?

  2. What is the computational complexity of the ProgOT algorithm (Algorithm 2)?

  3. Could the authors please explain more clearly why line 3 in Algorithm 2 helps improve the runtime?

  4. Are there any instructions on how to choose the number of steps $K$?

Limitations

The limitations have not been discussed in the paper.

Author Response

Thank you for taking the time to review our work. We have fixed the notation and typo issues; thank you for pointing them out.

Since the optimal transport map is unknown in practice, the assumption (A.2), which says that the inverse map has at least three continuous derivatives, is quite strong. Is there any chance that assumption (A.2) can be relaxed?

We highlight that requiring up to second-order derivatives on the inverse map is a standard requirement, see [1, 2, 3, 4]. The additional derivative comes from the assumptions in [4], which is, to date, the only work that demonstrates that the entropic OT map can statistically estimate the OT map. It is largely believed that the assumption is not superfluous, and that the extra derivative is required to make this statistically viable [5]. There are, however, restricted scenarios under which the third derivative is unnecessary, e.g., if $\nu$ is a discrete measure [5]. One can hope that ultimately these results can be bridged, but for now this is outside the scope of this work.

What is the computational complexity of the ProgOT algorithm (Algorithm 2)? In lines 33-36, when comparing the efficiency of the Sinkhorn algorithm to the linear programs for solving the EOT problems, the authors should make it more explicit by stating the computational complexity of both algorithms.

The exact computational complexity depends on the problem geometry, since $\varepsilon_i$ is data-driven.

We can specify the worst-case computational complexity of ProgOT compared to Sinkhorn. The algorithm runs the Sinkhorn subroutine $K$ times, with a decreasing sequence of regularizations $\varepsilon_i$. ProgOT also performs linear interpolations, yielding a total of $\mathcal{O}(nd + \sum_{i=1}^K C_{\mathrm{Sink}}(\varepsilon_i))$ operations, where $n$ is the size of the input domain, $d$ the dimension, and $C_{\mathrm{Sink}}(\varepsilon_i)$ denotes the worst-case complexity of Sinkhorn with regularization $\varepsilon_i$. We added a paragraph to Appendix A discussing this matter.
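To make the accounting concrete, here is a minimal sketch of the outer loop this count refers to; `sinkhorn_map`, `alphas` and `epsilons` are hypothetical placeholders rather than the exact Algorithm 2:

```python
import numpy as np

def progot_sketch(x, y, alphas, epsilons, sinkhorn_map):
    # Divide-and-conquer EOT: K local Sinkhorn calls on progressively
    # closer point clouds. `sinkhorn_map` stands for a subroutine that
    # solves one EOT problem and returns an entropic map estimate
    # (a callable); each call costs C_Sink(eps_k) operations.
    x_k = np.asarray(x, dtype=float).copy()
    for alpha_k, eps_k in zip(alphas, epsilons):
        t_k = sinkhorn_map(x_k, y, epsilon=eps_k)          # C_Sink(eps_k)
        x_k = (1.0 - alpha_k) * x_k + alpha_k * t_k(x_k)   # O(nd) step
    return x_k
```

Each of the $K$ steps thus contributes its own Sinkhorn cost plus an $\mathcal{O}(nd)$ interpolation, which adds up to the total above.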

In lines 238-239, when initializing the entropic regularization parameter $\varepsilon$, the authors should at least briefly present the intuition for setting it to be the average of the values in the cost matrix between sources and targets.

We have updated that paragraph and included a brief explanation. This choice is not ours, but is largely followed by the community. To avoid simple scaling effects (such as multiplying all features by a constant), cost matrices in entropic OT are rescaled, typically by their maximum value (as commonly done in POT [7]) or by their mean value (as done in OTT [6]). This rescaling is often absorbed into $\varepsilon$, and helps balance out cost/entropy in EOT optimization.
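As an illustration, a minimal sketch of this rescaling convention; the helper name and the relative factor `rel` are assumptions for illustration, not values from the paper:

```python
import numpy as np

def scaled_epsilon(x, y, rel=0.05):
    # Tie epsilon to a summary statistic of the squared-Euclidean cost
    # matrix so that it is invariant to rescaling the features.
    cost = 0.5 * ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return rel * cost.mean()  # max-based variants use cost.max() instead
```

Multiplying all features by a constant rescales the cost matrix and this $\varepsilon$ by the same factor, leaving the EOT coupling unchanged.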

The authors should add a discussion about the limitations of the proposed method as well as future directions. For instance, can we generalize the ProgOT so that it applies for entropic unbalanced optimal transport. I believe that such discussion would make the paper more complete.

This is indeed an interesting research direction. We have updated the conclusions section with a discussion on directions such as unbalanced OT. We note that such an extension is far from being straightforward. Crucially, there is no equivalent of an "entropic potential map", or McCann interpolation, that can be extended out of sample.

Could the authors please explain more clearly why line 3 in Algorithm 2 helps improve the runtime?

Line 3 implements a warm start for the Sinkhorn algorithm. The learned potentials at iterate $k$ might still be relevant for the next Sinkhorn subroutine. Our update is essentially a back-of-the-envelope calculation: since the distances between point clouds decrease at each iteration by roughly a factor of $(1-\alpha_k)$, we decrease the dual potentials by a similar amount. We observe that this simple update works well across various costs. Warm-starting Sinkhorn is an active area of work [6, 7, 8, 9].
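A one-line sketch of this warm start (hypothetical helper, following the rescaling argument above):

```python
def warm_start_potentials(f_prev, g_prev, alpha_k):
    # Distances shrink by roughly (1 - alpha_k) after one interpolation
    # step, so the previous dual potentials are scaled down by the same
    # factor before initializing the next Sinkhorn call.
    return (1.0 - alpha_k) * f_prev, (1.0 - alpha_k) * g_prev
```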

Are there any instructions on how to choose the number of steps?

Intuitively, the number of steps should correlate positively with the distance between source and target, but for now, we do not have explicit instructions for this.

A simple solution can be choosing $K$ depending on the available compute. Choosing a large $K$ typically does not affect performance negatively, but a smaller $K$ may create a less stable algorithm (as we get back to Sinkhorn). In our experiments we have used $K = 4, 8, 16$, and noticed diminishing returns when increasing $K$.

References


[1] Hutter and Rigollet. "Minimax estimation of optimal transport maps", (2021)

[2] Manole, et al. "Plugin estimation of smooth optimal transport maps", (2021)

[3] Muzellec, et al. "Near-optimal estimation of smooth transport maps with kernel sums-of-squares", (2021)

[4] Pooladian, Aram-Alexandre, and Jonathan Niles-Weed. "Entropic estimation of optimal transport maps." arXiv preprint (2021)

[5] Pooladian et al. “Minimax estimation of discontinuous optimal transport maps: The semi-discrete case” ICML (2023)

[6] Cuturi, Marco, et al. "Optimal transport tools (ott): A jax toolbox for all things wasserstein." arXiv preprint (2022)

[7] Flamary, Rémi, et al. "Pot: Python optimal transport." Journal of Machine Learning Research (2021)

[8] Amos, B., Cohen, S., Luise, G., & Redko, I. Meta optimal transport. arXiv preprint (2022)

[9] Thornton, James, and Marco Cuturi. "Rethinking initialization of the sinkhorn algorithm." AISTATS (2023)

Comment

Dear Authors,

I would like to thank you for your detailed response, which consolidates my positive evaluation of the paper. I highly encourage the authors to incorporate our discussion into the revision of the paper. This would help strengthen the paper substantially.

Best,

Reviewer 9mT1

Comment

We will integrate the discussion into our final version, and we thank you for the many comments you have made.

Between our code implementation and the significantly larger experiments, we believe the paper has indeed improved further; this is one of the merits of the rebuttal process.

We humbly ask: since you mention that the discussion has consolidated your positive opinion of the paper, and your soundness / presentation / contribution scores all stand at "3: good", would you consider increasing your score? The acceptance threshold at NeurIPS this year is likely going to be around 5.5; hence a score of 5 is, relative to all other papers, negative in this context.

The Authors

Review (Rating: 7)

This paper introduces a new class of entropic optimal transport (EOT) solvers called PROGOT. This work aims to address the challenge of selecting the entropic regularization strength $\varepsilon$ for original EOT. As we know, $\varepsilon$ significantly affects the performance of EOT, e.g. computation time and convergence rate. PROGOT utilizes a progressive framework that splits the whole entropic transportation process into multiple steps, drawing on dynamic OT formulations. This work proposes algorithms to set the parameters throughout the interpolation process. As claimed, the new framework enhances the robustness and performance of EOT and avoids the headache of tuning the value of $\varepsilon$. Experimental results show that PROGOT surpasses classic EOT on both synthetic and real datasets.

Strengths

  • The major contribution of this work is the proposal of a novel framework to estimate entropic transport plans. This framework addresses the challenge of selecting an appropriate entropic regularization strength $\varepsilon$ in traditional EOT algorithms.
  • The mathematical notation and theoretical analysis are clear and easy to follow, and the theoretical analysis looks good to me.
  • The convergence of the PROGOT map is supported by both theoretical proofs and experimental results.
  • The methodology for selecting hyper-parameters (regularization and threshold schedule) and the corresponding justifications are well provided.

Weaknesses

  • While the effort in setting the hyper-parameters and the justifications are acknowledged, from a broader perspective, to replace the selection of $\varepsilon$, PROGOT introduces a series of new hyper-parameters (step/regularization schedules and the length of the schedules, $K$) and theoretical assumptions on the inputs. One may find this less favorable, as it replaces one hyper-parameter with several others.
  • Some of the experiments would benefit from including the real OT cost (or map) as a reference. For example, adding the real OT map in Figure 1 and adding the real OT cost data point in Figure 5 would be helpful.
  • The gradient of PROGOT is missing, which is important for machine learning applications.
  • In Figure 5, could you provide the actual number of iterations instead of just indicating different marker sizes? For example, comparing the size of markers for $\beta=0.08$ vs. $K=8$ is difficult. Additionally, since all the sub-figures use the same legend, it should be clarified whether they run the same number of iterations for the same configuration.

Questions

  • Line 141: do you mean $S^{(0)}$ instead of $S^{(1)}$ in the equation $\mu^{(1)}=S^{(1)}\mu$? The same issue in line 143.
  • Figure 5: why does PROGOT with a larger $K$ value have a larger cost and appear closer to the original Sinkhorn point? Please correct my intuition if wrong: PROGOT with smaller $K$ runs fewer interpolation steps, leading to fewer iterations and a cost closer to the original Sinkhorn cost. When $K=0$, PROGOT should be the same as EOT. In Figure 5, the number-of-iterations results align with my intuition, but the "distances" to the Sinkhorn results look inconsistent.

Limitations

None

Author Response

Many thanks for your careful review, and thank you for your positive score, encouraging comments and questions. We did our best to answer them.

… One may find this less favorable, as it replaces one hyper-parameter with several others.

While we certainly agree with this assessment, one message we want to convey is that having to choose a single variable $\varepsilon$ for everything (Sinkhorn, map estimation) holds too much sway over EOT. A "good" $\varepsilon$ can be difficult to nail down, notably for map estimation, and most practitioners want to bypass this. As it stands, this is likely one of the factors that limits the usage of EOT.

Our goal is to "divide and conquer" $\varepsilon$ too, translating it into simpler quantities. In practice, a user would only need to choose the number of steps $K$, according to the means of compute available to them, since we provide automatic routines for choosing the rest of the hyper-parameters: the step schedule is linear (a.k.a. constant-speed), the threshold scaling is log-linear, and the epsilon schedule is tuned as in Algo. 4.
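For illustration, here is one way a constant-speed schedule can be materialized; the parameterization below (local step $\alpha_k$ derived from global progress $t_k = k/K$) is our reading of "linear", not necessarily the exact formula in the paper:

```python
def constant_speed_alphas(K):
    # Traverse a straight path in K equal increments: with global
    # progress t_k = k / K, the fraction of the *remaining* distance
    # covered at step k is alpha_k = (t_{k+1} - t_k) / (1 - t_k).
    return [1.0 / (K - k) for k in range(K)]

# constant_speed_alphas(4) -> [0.25, 0.333..., 0.5, 1.0]:
# each step covers the same absolute fraction (1/4) of the full path.
```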

The gradient of PROGOT is missing, which is important for machine learning applications.

Differentiability is an important point. We would like to provide a satisfactory answer by splitting the various meanings this could take in the context of ProgOT.

ProgOT can return three objects that a user may want to differentiate:

  • The "Wasserstein-like" OT transport cost between two point clouds, equal to $t(\mathbf{X},\mathbf{Y}):=\langle \mathbf{P}, [h(x_i-y_j)]_{ij}\rangle$, where $\mathbf{P}$ is the solution outputted by Algo. 2. This is the quantity displayed on the x-axis of Fig. 5 (as well as in our new plots). We believe this is the quantity you have in mind when you mention a gradient. Differentiating this quantity w.r.t. $\mathbf{X}, \mathbf{Y}$ is, in fact, very easy, since it amounts to using a "Danskin-like" linearization, i.e. using the chain rule to differentiate $t$ w.r.t. $\mathbf{X}$ or $\mathbf{Y}$ (or any other parameter) while keeping $\mathbf{P}$ fixed (a minimal JAX sketch is given after this list). We illustrate this in Figure H of the uploaded PDF, and have also provided a colab link to the AC, who can share it with you.
  • The Jacobian of the coupling matrix $\mathbf{P}$ returned by Algo. 2. This can be achieved by differentiating chained Sinkhorn iterations. This is the strategy used, in a different context, when differentiating the optimal coupling obtained in Gromov-Wasserstein, for instance. It is costly, but doable.
  • The Jacobian of the ProgOT map (the one returned by Algo. 3) at a given point $x$, w.r.t. parameters. This is the same as above, and will require a chained differentiation of successive OT solutions.

Hence, it is very easy to use ProgOT to obtain descent directions to minimize OT-costs. Differentiating higher-order objects (couplings / map estimator) is more involved.
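The sketch announced in the first bullet: the "Danskin-like" linearization is a one-liner in JAX. `solve_coupling` below is a placeholder for any EOT solver returning an $n\times m$ coupling, not a specific ProgOT API:

```python
import jax
import jax.numpy as jnp

def ot_cost(x, y, solve_coupling):
    # Danskin-style gradient: freeze the coupling P with stop_gradient
    # and differentiate only the cost matrix [h(x_i - y_j)]_ij,
    # here with h = 0.5 * squared Euclidean norm.
    cost = 0.5 * jnp.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    p = jax.lax.stop_gradient(solve_coupling(x, y))
    return jnp.sum(p * cost)

# jax.grad(ot_cost)(x, y, solve_coupling) then yields a descent direction
# for the point cloud x without back-propagating through Sinkhorn.
```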

In Figure 5, could you provide the actual number of iterations instead of just indicating different marker sizes?

We have updated the figures in the paper to show the number of iterations; see Figure F for the updated version of Figure 5 in the paper. We have also added 4 new figures to the appendix with new data configurations; Fig. G in the uploaded PDF shows an example. To answer your question, we have prepared Table E with the exact number of iterations corresponding to Figure F (the updated variant of Figure 5).

Line 141: do you mean $S^{(0)}$ instead of $S^{(1)}$ in the equation $\mu^{(1)}=S^{(1)}\mu$? The same issue in line 143.

Thank you for pointing this typo out, we have updated the text.

Figure 5: why does PROGOT with a larger $K$ value have a larger cost and appear closer to the original Sinkhorn point? Please correct my intuition if wrong: PROGOT with smaller $K$ runs fewer interpolation steps, leading to fewer iterations and a cost closer to the original Sinkhorn cost. When $K=0$, PROGOT should be the same as EOT. In Figure 5, the number-of-iterations results align with my intuition, but the "distances" to the Sinkhorn results look inconsistent.

Even if $K$ is small, the cost values might be different if the regularization parameter is different. We visualize Sinkhorn using different values of $\beta$, and in many cases (cf. Fig. F and Fig. G) ProgOT with $K=2$ lies close to a Sinkhorn instance. Perhaps your intuition is better reflected in the updated plots for this experiment.

We also highlight that the two axes of cost and entropy should be viewed together. One possible scenario is that by choosing a larger $K$, we target a smaller $\varepsilon$ in the last iteration, gaining a sharper entropy at the price of a larger cost.

Some of the results in Figure 5 in our initial draft were slightly off because we were using a maximum number of iterations when implementing Sinkhorn and ProgOT. While the overall message of these plots has not changed, this error sometimes contradicted our claim that all methods were converging to target accuracy (L. 322). We now set that maximum number of iterations to infinity, and therefore all solutions converge correctly within the desired threshold, irrespective of the number of iterations. See Fig. F and Fig. G in the uploaded PDF for an example.

Comment

Thank you for your response. I appreciate your effort in updating the figures and providing the Colab link to address my concerns. Your response addressed many of my questions, and I think your work is worth publishing.

Comment

Many thanks for taking the time to read our rebuttal!

We take pride in hearing that your concerns have been addressed, and that you think our work is worthy of being published.

The Authors.

Review (Rating: 5)

The choice of an appropriate entropic trade-off term is one of the main headaches when finding maps between data distributions with sample access using Optimal Transport (OT) with entropic regularization. While selecting sufficiently small regularization terms leads to unstable learning, picking large ones causes biased solutions. Considering the EOT problem with Euclidean quadratic cost, the authors propose to divide the entropic problem into small sub-problems between intermediate distributions, each with its own regularization term. The final entropic map between the initial data distributions is a composition of maps from the sub-problems. The authors provide theoretical guarantees that the constructed map does not differ much from the true one. Moreover, the authors offer an algorithm for selecting appropriate regularization terms for each step, and demonstrate the method's performance on synthetic data and single-cell experiments.

优点

  • The paper provides a new methodology for solving OT problems with entropy regularization which does not simply combine the ideas of prior works. The paper is well-written, well-organized and clear. The storyline is perfect and understandable throughout the paper.

  • Since the method alleviates the tuning process for entropic regularization terms and allows one to avoid unstable learning as well as primitive OT maps, this work might be interesting to the ML community. I am sure that other researchers might apply this methodology to Flow Matching or Schrödinger bridge methods that aim to build interpolations from one dataset to another.

  • ProgOT alleviates the tuning process for entropic regularization terms, and thereby does not suffer from unstable learning or primitive solutions.

Weaknesses

Although the paper offers a unique theoretical approach, a crucial shortcoming of this method is scalability. Indeed, the estimate provided by Theorem 3 demonstrates poor scalability of the method, since the rate of convergence to the true OT map depends on the dimension $d$. From my point of view, this issue cannot be ignored, since it severely limits the practical usability of the approach.

This concern of mine is supported by the set of experiments considered in the paper. It seems that the authors avoid empirical evaluation of the method on high-dimensional problems (which might be useful for mitigating the concern). The performance of the method is tested only on low-dimensional ($d\leq 64$) synthetic data experiments using the benchmark with available ground-truth OT maps (Korotin et al., 2021), as well as biological experiments without sufficiently large data dimensionality ($d\leq 256$). I am especially confused by the fact that the authors do not test their approach using the provided benchmark pairs for $d$ larger than 64, and at the same time apply the method to an important biological task in the case of $d=256$. However, I believe that prior to adapting the algorithm to tasks from single-cell biology (especially ones as important as predicting cancer cell responses to drug perturbations), it should be thoroughly tested on synthetic tasks.

Overall, although the proposed approach is as fast as the Sinkhorn algorithm and demonstrates better performance on low-dimensional problems, there are no guarantees or understanding of the method's behavior on high-dimensional problems, and there are no comparisons with other EOT solvers in this case. This is the main reason for my current score. However, I am open to adjusting the score based on the authors' answers.

Questions

  • From an intuitive point of view, it is understandable that a sub-problem is easier and better conditioned than the initial problem. Anyway, could you provide a theoretical understanding of why this is so? It seems that this fact depends on the step size of the McCann interpolator.

  • You probably use the entropic OT formulation with Euclidean quadratic cost due to the convenient use of linear McCann interpolation between distributions. Is there a generalization of the method for arbitrary transport cost functions?

  • Does the method’s algorithm need a pre-trained OT map between initial distributions that approximately matches one distribution to another?

  • Is there an intuition or methodology for picking a more appropriate step size for the McCann interpolator at step $k$?

  • How does the speed of convergence of the proposed algorithm compare to the speed of the Sinkhorn algorithm for the EOT problem between the initial distributions? Could you provide a theoretical explanation and a plot of convergence, at least for a synthetic Gaussian example?

  • In accordance with the developed methodology, the geodesic curves between the initial distributions should be straight. Could you provide practical evidence that the obtained OT curves are really straight, at least in synthetic Gaussian experiments?

I suggest the authors test their algorithm on the benchmark with available ground-truth OT maps for larger values of $d$.

References.

A. Korotin, L. Li, A. Genevay, J. M. Solomon, A. Filippov, and E. Burnaev. Do neural optimal transport solvers work? A continuous Wasserstein-2 benchmark. Advances in neural information processing systems, 2021.

Limitations

In accordance with Theorem 3 of the main text, the main limitation of the method is scalability. However, it seems that the authors did not mention this fact.

Author Response

We would like to thank you for your review, and for the many thought-provoking questions you have asked. We did our best to answer all of them.

a crucial shortcoming of this method is scalability. […] severely limits its practical usability.

There might be a confusion: Theorem 3 proves a theoretical recovery guarantee for ProgOT as a map estimator. This is a worst-case statistical analysis. This worst-case study has little impact on prediction performance on real tasks (e.g. Table 1), and no impact on computational aspects (Figure 5) or coupling estimation, which are all practically governed by other factors ($K, \alpha_k, \varepsilon_k, \tau_k$ for instance).

Theorem 3 guarantees the soundness of the ProgOT map. In contrast, none of the NN-based methods (e.g. those compared in the [Korotin et al. 21] benchmark) has a theoretical statistical guarantee: their estimation is hard because of non-convexity, whereas ProgOT is recovered as a sequence of convex problems. The fact that ProgOT has such a guarantee (even if pessimistic) cannot therefore be seen as a weakness.

the authors do not test their approach using the provided benchmark pairs for $d$ larger than 64 […] it should be thoroughly tested on synthetic tasks.

Thanks for this suggestion. To demonstrate the scalability of ProgOT to larger $n$ and larger $d$, we have added two new experiments: MSE error on the Korotin benchmark for $d=128, 256$ (as you requested) and a new large-scale experiment on all $n=60{,}000$ grayscale images ($d=1024$) of the CIFAR10 dataset.

GMM Experiment: We use the code from the [Korotin et al., 2021] benchmark to create high-$d$ synthetic problems, reaching the maximum dimension available, $d=256$. Tab. D in the uploaded PDF presents the results. ProgOT (with default settings) dominates the other baselines. This is remarkable, because: (1) this benchmark favors Neural OT solvers by design, as the ground-truth transport is itself generated as the gradient of an ICNN; (2) running ProgOT takes about 3 minutes, while Neural solvers take 10-20X longer (about 30 minutes for ICNN and 60 minutes for the Monge Gap), all on an A100 GPU.

CIFAR10 Experiment: We designed a novel benchmark task to match images and their blurred versions. As described in our general response, the ground-truth OT coupling should be the diagonal. We only compare ProgOT to Sinkhorn (ICNN and Monge Gaps do not return couplings). See Fig. A & Tab. B in the uploaded PDF for the results, and refer to the general response for more details.

Note that a sample size of $n=60{,}000$ is, to our knowledge, unheard of for entropic OT at such high $d$ (we overwhelmingly see $n\leq 10{,}000$ in papers). We achieve this through a tight implementation of ProgOT within OTT-JAX and JAX, getting the immediate benefits of sharding across GPUs.

You use the entropic OT formulation with Euclidean quadratic cost due to […] Is there a generalization of the method for arbitrary transport cost functions?

This generalization is already in the paper: we defined ProgOT using arbitrary translation-invariant (TI) costs, $c(x,y) = h(x-y)$ (see L. 81). Alg. 2 and 3 both use $h$ explicitly. We use a $p$-norm (with $p=1.5$) in the bottom of Fig. 5. We only need $h$ to be $\tfrac12\|\cdot\|^2$ when proving map guarantees in Section 3.2, see L. 184.

To complete the empirical overview on general costs, we have added a new table to Appendix A, which benchmarks map estimation with general costs (similar to Table 1).

it is understandable that a sub-problem is easier and better conditioned than the initial problem […] could you provide a theoretical understanding of why this is so?

Because the intermediate problems lie along the geodesic path, the distances/transport costs for them are smaller (L. 154-163). Calling Sinkhorn for problems with smaller transport costs is easier/more stable w.r.t. $\varepsilon$. This also allows warm-starting to recycle solutions from the previous step (L. 3 in Algo. 2).

Does the method need a pre-trained OT map?

No, ProgOT runs off-the-shelf using source/target data (see L.229, inputs to Algo.2) and is deterministic, as it relies on sub-Sinkhorn calls. Could you please clarify your question?

Is there an intuition or methodology for picking a more appropriate step size for the McCann interpolator at step $k$?

We investigated this thoroughly. We found that scheduling the step size does not seem to play a key role, neither theoretically nor empirically (see Fig. 7 in the Appendix). We stick to the simplest choice, the constant-speed, a.k.a. linear, schedule.

How does the speed of convergence of the proposed algorithm compare to the speed of the Sinkhorn algorithm for the EOT […]

The convergence rates are comparable because ProgOT uses calls to local Sinkhorn problems as a subroutine. Theorem 3 guarantees theoretically that by using ProgOT we remain statistically sample-efficient, while gaining a computational edge. Theorem 3 is illustrated empirically in Figure 4 (A), where we plot the MSE for various $n$, using the ground-truth OT map from [Korotin et al., 2021].

Could you provide practical evidence that obtained OT curves are really straight, at least synthetic Gaussian experiments?

Using Algo. 2, curves can be obtained either by sampling straight lines from the final coupling matrix $\mathbf{P}$ returned by ProgOT, or by using the map (Algo. 3) out of sample. Sampling from the coupling yields straight lines by construction. This might be used in, e.g., flow-matching estimation. The "straightness" of lines sampled with Algo. 3 will hinge on the various $\varepsilon_i$, as represented schematically in Fig. 3.
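For instance, straight interpolants can be sampled from the coupling as follows (hypothetical helper, e.g. to build flow-matching batches):

```python
import numpy as np

def sample_interpolants(x, y, P, t, rng):
    # Draw index pairs (i, j) with probability P_ij, then return points
    # on the straight lines (1 - t) x_i + t y_j.
    n, m = P.shape
    flat = rng.choice(n * m, size=n, p=P.ravel())
    i, j = np.unravel_index(flat, (n, m))
    return (1.0 - t) * x[i] + t * y[j]
```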

I suggest the authors test their algorithm on the benchmark with available ground-truth OT maps for larger values of $d$

Many thanks for this suggestion. We hope our new results [Tab. D] assuage your concerns.

Comment

References

[1] Vacher, Adrien, and François-Xavier Vialard. "Parameter tuning and model selection in optimal transport with semi-dual Brenier formulation." Advances in Neural Information Processing Systems 35 (2022): 23098-23108.

[2] Van Assel, Hugues, et al. "Optimal Transport with Adaptive Regularisation." arXiv preprint arXiv:2310.02925 (2023).

[3] Scetbon, Meyer, Marco Cuturi, and Gabriel Peyré. "Low-rank sinkhorn factorization." International Conference on Machine Learning. PMLR, 2021.

[4] Pooladian, Aram-Alexandre, and Jonathan Niles-Weed. "Entropic estimation of optimal transport maps." arXiv preprint arXiv:2109.12004 (2021).

[5] Hutter and Rigollet. "Minimax estimation of optimal transport maps", 2021

[6] Manole, et al. "Plugin estimation of smooth optimal transport maps", 2021

[7] Muzellec, et al. "Near-optimal estimation of smooth transport maps with kernel sums-of-squares", 2021

[8] Bunne, Charlotte, et al. "Learning single-cell perturbation responses using neural optimal transport." Nature Methods 20.11 (2023): 1759-1768.

[9] Dessein, Arnaud, Nicolas Papadakis, and Jean-Luc Rouas. "Regularized optimal transport and the ROT mover's distance." arXiv preprint arXiv:1610.06447 (2016).

Comment

Dear Reviewer Z1nb,

The discussion period is closing, and we will soon not be able to interact with you.

Still, we hope that you can consider our answers / additional material in coming days, during the reviewers-AC discussion.

Let us emphasize again that, while we have added results on the benchmark you requested, we have also designed and run a large-scale CIFAR-10 experiment ($n=60{,}000$, $d=1024$), motivated in part by your comments on scalability and high dimensions. Such numbers are unheard of in the EOT literature. Hence, we are very grateful for your detailed feedback on the paper, which has triggered this exploration.

We have also provided a colab that illustrates that ProgOT is not only differentiable, but also ready to be used "off the shelf", as we claim in the paper.

You mentioned that you were open to adjusting your score based on the authors' answers. If these experiments seem convincing to you, we would of course be very grateful if you could do so.

If you have any remaining questions, we might be able to answer them through the AC (in principle we should be able to communicate with the AC following the closing of the discussion period).

Respectfully, The Authors

Comment

I appreciate that the authors provided detailed responses and new experiments in order to address my concerns. The high-dimensional experiments on the Wasserstein-2 benchmark dataset and the CIFAR-10 dataset show that the method is applicable in high dimensions. After reading the entire discussion with the other reviewers, I still have some concerns regarding the soundness and practical usability of the proposed approach. However, the authors have done a lot of work through the rebuttal phase; thus, I increase my score to 5.

Comment

We are very grateful for your score increase.

As we mentioned above, we are also thankful for your comments and for your time reviewing our work. Your insightful remarks led to this large-scale CIFAR-10 experiment, which, we insist, we have never seen elsewhere in EOT at these scales for $n$ and $d$. This has definitely strengthened our paper.

We also want to highlight that users can already try our code, as showcased in the colab shared with you by the AC.

While we understand that you may have some remaining concerns, unfortunately, as the discussion period is closing in a few minutes, we won't have the time to discuss them. Still, if you have remaining concerns you would like to share with us, you might share them with the AC, who might relay them to us. We will then do our best to answer them.

Many thanks again for your time.

The Authors.

Review (Rating: 4)

The authors proposed ProgOT, a method that solves a sequence of EOT problems so that practitioners do not have to tune the entropic regularization parameter $\varepsilon$, and that strikes a good balance between computational and statistical complexity.

Strengths

The manuscript is well-written and easy to understand.

Weaknesses

I have several concerns about the contributions and the assumptions of this work:

  • First of all, I think the contribution of this work is rather limited, since the only benefit is to avoid the tuning of the parameter $\varepsilon$. In my experience, tuning $\varepsilon$ is usually not a big issue since Sinkhorn is rather robust. In addition, the work does not discuss any computational tradeoff in doing so. Tuning the parameter $\varepsilon$ only affects the accuracy of EOT, which is rather minor, since the main bottleneck of large-scale Optimal Transport is the number of support points $n$.
  • Secondly, this work uses a lot of strong assumptions to establish theoretical guarantees. For example, it assumes a Euclidean cost, convexity and compactness conditions, and that the inverse mapping has at least three continuous derivatives. These assumptions are very restrictive and would limit the applicability of the theoretical analysis.

Questions

  • How would $\nu_{\min}, \nu_{\max}$ affect the theoretical analysis?
  • I would recommend the authors to consider quadratic regularized OT/UOT formulations [1], [2] and entropic-UOT [3].
  • In addition, I would recommend the authors to cite and discuss the following works as well.

References: [1] Smooth and Sparse Optimal Transport. Mathieu Blondel, Vivien Seguy, Antoine Rolet. In Proc. of AISTATS 2018. https://arxiv.org/abs/1710.06276

[2] "On Unbalanced Optimal Transport: Gradient Methods, Sparsity and Approximation Error". Quang Minh Nguyen, Hoang Huy Nguyen, Lam Minh Nguyen, Yi Zhou. Journal of Machine Learning Research, (JMLR), 2023

[3] Pham, K., Le, K., Ho, N., Pham, T., and Bui, H. "On Unbalanced Optimal Transport: An Analysis of Sinkhorn Algorithm." In Proc. of ICML 2020, pp. 7673-7682. https://proceedings.mlr.press/v119/pham20a.html

Limitations

See above.

Author Response

Many thanks for your review and for your comments. Our response follows.

I think the contribution of this work is rather limited since the only benefit is to avoid the tuning of the parameter ε\varepsilon.

In light of your comment, we will delete the sentence "Setting $\varepsilon$ can be difficult, … and bias" in our abstract, so that readers do not get confused about what constitutes our contribution.

Our paper is not "only about avoiding tuning $\varepsilon$" in Sinkhorn. While this was the main focus of other recent works (such as [1, 2], L. 59), this is not our goal.

Our solver blends McCann-type interpolation, local Sinkhorn computations and entropic map estimation to yield both a coupling and a map estimator (L. 12~18). While easing the burden of picking $\varepsilon$, our solver does a lot more than that, as listed in Contributions (L. 65~76), and demonstrated in Fig. 5 (computations, highlighting very different outputs compared to Sinkhorn) or Table 1 (out-of-sample prediction).

Please also take a look at our new experimental results [Fig. A, Tab. B] on the entire CIFAR-10 dataset.

In my experience, tuning is usually not a big issue since Sinkhorn is rather robust.

Use cases for Sinkhorn vary a lot, and we do trust that you may not have run into such problems in the past. In practice, however, EOT solvers are widely used, for many tasks, impacting many user profiles.

In the simplest cases, e.g. when computing couplings on normalized data (e.g. embeddings on the sphere), we also observe that Sinkhorn's outputs can be fairly stable across various $\varepsilon$ values. On real data sources (such as the genomics point clouds), this is not necessarily the case: $\varepsilon$ has a considerable impact on stability and compute time.

$\varepsilon$ also disproportionately impacts the predictive performance of map estimation. Not tuning it often results in poor outcomes, which is why all Sinkhorn baselines in our draft use cross-validation. To convince you, we have added a line with "untuned Sinkhorn" (i.e. using the default given by OTT-JAX) to our benchmarks [see Tab. D in the PDF], showing that it performs significantly worse than cross-validated Sinkhorn and ProgOT. Thank you for this suggestion.

In addition, the work does not discuss any computational tradeoff in doing so.

We did target this tradeoff explicitly in Figure 5, see L.314. We have updated appendix A with more experimental results to highlight that tradeoff.

Tuning the parameter only affects the accuracy of EOT,...

Can you clarify what you mean by accuracy of EOT? Do you mean proximity to the unregularized solution?

...which is rather minor since the main bottleneck of large-scale Optimal Transport is the number of support nn

Each Sinkhorn iteration indeed has a cost in $n^2$. However, the number of iterations varies tremendously with $\varepsilon$, e.g. from 10 iterations for large $\varepsilon$ to 10k for small ones, as shown e.g. in our paper, and [Fig. F & Fig. G].

To our knowledge, the only linear time RegOT solver is the low-rank (LRSink) approximation from [3]. That solver does not yield a map estimator, as it does not solve the dual OT problem. Exploring hybridizing ProgOT with LRSink is an interesting direction.

...this work uses a lot of strong assumptions to establish theoretical guarantees. For example, it assumes a Euclidean cost, convexity and compactness conditions, and that the inverse mapping has at least three continuous derivatives. These assumptions are very restrictive and would limit the applicability of the theoretical analysis.

To prove guarantees for estimators, the strength of assumptions should naturally be commensurate with the difficulty of the task.

Proving theoretical recovery of OT maps is hard, and to our knowledge, the entropic map estimator [4] that we build upon is the only tractable estimator that has such guarantees (L. 126~129). We follow their assumptions, and we do not see any alternatives at present. The fact that the inverse mapping must have three continuous derivatives is a strong assumption, but we believe this is a fundamental issue, and not an artefact of our analysis. Other approaches (e.g., [5, 6, 7]) share many of the same regularity assumptions as we do.

Note that many OT map estimators are routinely used without any theoretical guarantees (e.g. all neural-OT approaches, flow-matching like formulations), even in biological settings [8]. Aside from the entropic map, ProgOT is the only other tractable map estimator that comes with convergence guarantees.

To summarize, apart from the entropic map estimator, ProgOT is the only currently known tractable OT map estimator with guarantees, and we show that it performs better in practice.

How would $\nu_{\min}, \nu_{\max}$ affect the theoretical analysis?

$\nu_{\min}$ and $\nu_{\max}$ appear as multiplicative constants in the bounds, as established by [4]. Their role is analogous in the other papers mentioned above.

I would recommend the authors to consider quadratic regularized OT/UOT formulations [1], [2] and entropic-UOT [3]. In addition, I would recommend the authors to cite and discuss the following works as well.

This is an interesting research direction. However, extending our work to the settings you mention (quadratic regularization / unbalanced settings) is far from straightforward. Crucially, there is no equivalent of an "entropic potential map" that can be extended out of sample, neither for the dual solutions outputted by quadratic solvers (first proposed in [9]), nor for the unbalanced case.

We have added references that you mentioned to the conclusion section, to highlight possible future work around ProgOT.

Comment

Thank you very much for your detailed and insightful responses. Though I'm still not very convinced of the significance of the work, the responses have addressed many of my concerns. I will increase the score.

Comment

We are very thankful for your various comments. With these clarifications and the first (to our knowledge) demonstration that an EOT-based solver can run at large scales (60,000 points, $d=1024$) and still return a meaningful coupling, we believe we have a significant and convincing contribution.

We are also grateful for your score increase.

While there is little time left in the discussion, we remain available to answer any other questions you may have.

The Authors

Author Response

We would like to thank all reviewers for taking the time to read about our new method. We take pride in seeing many appreciative comments, notably from reviewers Z1nb, mHCn and 9mT1:

I am sure that other researchers might apply this methodology to Flow Matching or Schrödinger bridge methods that aim to build interpolations from one dataset to another.

The paper provides a new methodology for solving OT problems with entropy regularization which does not simply combine the ideas of prior works.

The storyline is perfect and understandable throughout the paper.

The theoretical results in the paper are sound and provided with rigorous proofs.

The paper is well written and organized. The background section is very useful, particularly for readers who are not familiar with optimal transport.

We enjoyed very much, during this past week, doing our best to come up with clarifications, given separately for each reviewer below.

A common thread found in Reviewers Agip's and Z1nb's comments was about scalability and applicability. We have added various experiments in the PDF (notably higher-$d$ results from the [Korotin et al.] benchmark, and a concrete example showing the importance of tuning $\varepsilon$ for vanilla Sinkhorn), but we would like to highlight in our general response a new experiment, on real data, that we designed to assuage specific concerns on scalability.

We wanted to provide an experiment that was large-$n$, large-$d$, challenging enough to be of interest, on real data to stay relevant, and for which some form of ground truth was known.

We propose an OT problem on the entire (grayscaled) CIFAR10 dataset, that is, $n=60{,}000$ images, each of size $d=32\times32=1024$.

The transport task we propose is the following. The isotropic Gaussian blur operation on an image is a linear operator. Following notations from the textbook [Peyre & Cuturi, 2019, Remark 4.17], for $N\times N$-sized images (here $N=32$), writing $K=\left[\exp\left(-\frac{(i-j)^2}{N^2\sigma}\right)\right]_{ij}$ for $i,j\leq N$, the Gaussian blur operation takes an image stored in matrix form as $U\in\mathbb{R}^{N\times N}$ and transforms it into $L(U) = KUK\in\mathbb{R}^{N\times N}$.

Crucially, the Gaussian blur linear map is positive-definite.

Proof: Let $U_1, \dots, U_s$ be $s$ images, and form the kernel matrix $A_{ij} = \langle U_i, L(U_j) \rangle$. Since $K$ is itself a PSD Gaussian kernel matrix, each entry can be re-written, using cyclicity of the trace, as $\langle U_i, L(U_j) \rangle = \langle U_i, K U_j K \rangle = \langle K^{1/2} U_i K^{1/2}, K^{1/2} U_j K^{1/2} \rangle$. This is a dot-product (Gram) matrix of the elements $K^{1/2} U_i K^{1/2}$, and is therefore always PSD. This proves that $L$ is a positive-definite linear operator.

As a result, following Brenier's theorem, $L$ is necessarily an $\ell_2^2$ OT Monge map in $\mathbb{R}^{N\times N}$, from any family of images onto their blurred counterparts ($L$ is the gradient of $h(U)=\tfrac12\langle U, L(U) \rangle$, which is a convex potential, therefore an OT map, see [Santambrogio 2015, S.1.3.1]).

Practically, if one solves an assignment problem (with $\ell_2^2$ cost) between a family of $n$ images and their $n$ blurred counterparts, the optimal assignment is necessarily the one that maps each image to itself, blurred, regardless of the value of $\sigma$: the optimal permutation is the identity.
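A small NumPy sketch of this construction, checking numerically that the kernel matrix $A$ from the proof above is PSD (the image count, $\sigma$ and seed are arbitrary choices for illustration):

```python
import numpy as np

N, sigma, s = 32, 2.0, 8
idx = np.arange(N)
K = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (N**2 * sigma))

def blur(U):
    # Isotropic Gaussian blur as the linear map L(U) = K U K.
    return K @ U @ K

rng = np.random.default_rng(0)
U = rng.uniform(size=(s, N, N))             # s random "images"
V = np.stack([blur(u) for u in U])          # their blurred versions
A = np.einsum('iab,jab->ij', U, V)          # A_ij = <U_i, L(U_j)>
eigs = np.linalg.eigvalsh(A)
print(eigs.min() >= -1e-9 * eigs.max())     # PSD up to round-off
```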

We create these two blurred datasets using a Gaussian kernel with $\sigma=2$ and $\sigma=4$ (visualized in Figure A of the uploaded PDF). We then use ProgOT and Sinkhorn to match the blurred dataset to the original CIFAR10 (de-blurring). We test various $\varepsilon$ and schedules, and, as always in our experiments, focus on relevant regimes for both methods.

As explained above, the ground-truth (unregularized) OT coupling matrix $\boldsymbol P^\star$ should be the diagonal matrix $q \boldsymbol{I}$, with $q=1/60{,}000$.

We evaluate the performance of an OT solver by checking how close the trace of the recovered coupling $\mathrm{Tr}(\hat{\boldsymbol P})$ is to $1.0$, or with the KL divergence to the ground truth, that is: $\mathrm{KL}(\boldsymbol P^\star\|\hat{\boldsymbol P}) = \log q - \tfrac{1}{n}\sum_{i} \log (\hat{\boldsymbol P}_{ii})$. Table A in the uploaded PDF shows the performance of ProgOT and Sinkhorn, along with the number of iterations needed to achieve this performance. Both algorithms scale well and show high accuracy, while requiring a similar amount of computation.
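A sketch of these two metrics, assuming a dense in-memory coupling with positive diagonal entries (at the real $n=60{,}000$ scale, the diagonal would instead be gathered across shards):

```python
import numpy as np

def deblurring_metrics(P_hat):
    # Trace (ideally 1.0) and KL(P* || P_hat) against the ground-truth
    # diagonal coupling P* = q I with q = 1/n.
    n = P_hat.shape[0]
    q = 1.0 / n
    diag = np.diag(P_hat)
    trace = diag.sum()
    kl = np.log(q) - np.mean(np.log(diag))  # = log q - (1/n) sum_i log P_ii
    return trace, kl
```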

We highlight that at this scale, simply storing the cost or coupling matrices would require 28.8 GB. The experiment has to happen across multiple GPUs. Thanks to its integration in JAX and OTT-JAX, ProgOT supports sharding by simply changing ~3 lines of code. The algorithm scales seamlessly, and each run takes about 15 minutes on a single node of 8 A100 GPUs.

We hope that this experiment sets a convincing example of how ProgOT scales to much larger (in $n$ and $d$) problems than considered previously. This ties in with the comment from Reviewer Z1nb on guided generation or flow-matching, which could benefit from our implementation to handle very large batches.

We would be happy to answer all follow up questions.

The Authors

Final Decision

For the entropic optimal transport (EOT) problem, the paper proposes a dynamic solver which breaks down the original difficult problem into a set of simpler problems via time discretization. Numerical experiments demonstrate the efficiency of the proposed method.

From the reviewers' comments and the author-reviewer discussions, below is a summary of the reviewers' concerns and how they were addressed by the authors:

  • Reviewer Agip: 1) "the only contribution is avoiding tuning $\varepsilon$" (the authors argue that their contribution is not about avoiding tuning $\varepsilon$, and that their contributions include a scheme that can recover both Kantorovich couplings and Monge maps, good practical performance and map estimation, etc.); 2) large-scale problems (the authors provide numerical experiments on large-scale problems showing that scalability is not an issue); 3) "strong assumptions" (the authors argue that they are standard assumptions). Author-Reviewer discussion: the reviewer commented that the concerns are addressed.
  • Reviewer Z1nb: the main concern is about scalability; the authors' rebuttal addressed this problem, and this is acknowledged by the reviewer. Author-Reviewer discussion: the reviewer commented that the concerns in the reviews are addressed, but still has some concerns regarding the soundness and practical usability of the proposed approach.
  • Reviewer mHCn: the hyper-parameters introduced. The authors argued that the only parameter that needs to be tuned is $K$, and they provide some discussion on how to choose it, though not perfect. Author-Reviewer discussion: the reviewer commented that the concerns in the reviews are addressed.
  • Reviewer 9mT1: 1) a strong assumption, as one needs to know the "optimal transport map $T_0$" (the authors argue that "requiring up to second-order derivatives on the inverse map is a standard requirement"); 2) a more explicit comparison against the Sinkhorn algorithm in terms of complexity (the authors provide some derivations). Author-Reviewer discussion: the reviewer commented that the concerns in the reviews are addressed.

In general, the reviewers are satisfied with the authors' responses.

During the AC-Reviewer discussion, Reviewers Agip, Z1nb and 9mT1 joined the discussion, and their main concerns are:

  • the contribution is limited.
  • Reviewer 9mT1 still thinks that requiring the "optimal transport map $T_0$" is a strong assumption.

Given the overall reviews and discussions, I think the submission is worth a chance, hence I recommend acceptance.