Learning Elastic Costs to Shape Monge Displacements
a deep dive into how changing cost functions impacts transport, and a new algorithm to learn the parameters of a new family of costs
Abstract
Reviews and Discussion
Efficiently mapping one distribution of points to another is crucial in numerous machine learning applications. Optimal transport (OT) theory has become a favored approach for solving these matching problems across various scientific disciplines. This work zeroes in on a significant OT subtask: solving the Monge problem. The objective is to determine a map that converts high-dimensional source data into target data. The map should act as a pushforward map, ensuring that the transformed source samples align with the target distribution, and be efficient, meaning that the transformed points are, on average, close to the original points. Efficiency is defined by a cost function that evaluates the cost of mapping a source point to a target point.
Elastic costs in optimal transport (OT), i.e., the squared-Euclidean cost augmented with a regularizer τ, promise structured OT maps, but understanding of them is limited beyond the early experiments of Cuturi et al. (2023). This contrasts with Brenier's detailed characterization of maps for the squared-Euclidean cost. Regularizers also introduce hyperparameter selection challenges. To address this, the paper shows that OT maps for any elastic cost can be generated using proximal gradient descent with the proximal operator of τ. This allows visualization of Monge maps beyond standard Brenier maps (gradients of convex functions) and synthetic generation of ground-truth maps in high dimensions. The paper introduces subspace elastic costs, which promote displacements in a low-dimensional subspace defined by a matrix A. It provides sample-complexity estimates for the MBO estimator of Cuturi et al. (2023), and links them to the spiked transport model. Since different regularizers create diverse OT map structures, the paper also proposes parametrized families of regularizers and a loss function to adaptively select their parameters. Results demonstrate the effectiveness of the MBO estimator and the recovery of A from i.i.d. samples on synthetic and single-cell data, showing improved predictive ability over baseline estimators.
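For concreteness, here is a minimal NumPy sketch of what such a subspace elastic cost looks like; the exact scaling conventions (the 1/2 factor and where γ enters) are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def subspace_elastic_cost(x, y, A, gamma=1.0):
    """Squared-Euclidean cost plus a penalty on the part of the displacement
    that lies outside the column span of A (A assumed orthonormal, d x p)."""
    z = x - y                                 # displacement
    inside = A @ (A.T @ z)                    # component within span(A)
    return 0.5 * np.sum(z ** 2) + gamma * np.sum((z - inside) ** 2)

# toy usage
rng = np.random.default_rng(0)
d, p = 5, 2
A, _ = np.linalg.qr(rng.normal(size=(d, p)))  # orthonormal columns (Stiefel point)
x, y = rng.normal(size=d), rng.normal(size=d)
print(subspace_elastic_cost(x, y, A, gamma=10.0))
```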
Strengths
A first strength of this paper is that it provides the first visualizations of Monge maps beyond standard Brenier maps (gradients of convex functions), as well as synthetic generation in high dimensions, showcasing the practical applicability of the proposed methods. The results, particularly the improved performance of the MBO estimator and the recovery of the subspace matrix A in synthetic and single-cell data tasks, provide empirical validation of the proposed methods.
Weaknesses
It would be good if the authors elaborated on the meaning of jargon terms from the optimal transport literature in the introduction itself, before discussing the main contributions and prior work.
Questions
I do not have any questions.
Limitations
Authors have addressed limitations.
We are very grateful for your careful review and positive grade, and were happy to see that you mentioned several times the originality of our contribution.
A first strength of this paper is that it provides the first visualizations of Monge maps beyond standard Brenier maps (gradients of convex functions), as well as synthetic generation in high dimensions, showcasing the practical applicability of the proposed methods.
We appreciate this comment. We believe the ability to create examples for such unusual costs was untapped.
The results, particularly the improved performance of the MBO estimator and the recovery of the subspace matrix A in synthetic and single-cell data tasks, provide empirical validation of the proposed methods.
We appreciate this comment. We were also surprised to see quasi-perfect recovery of the subspaces, notably when the displacements are not strongly concentrated in a given subspace, as in the left-most plot.
It would be good if the authors elaborated on the meaning of jargon terms from the optimal transport literature in the introduction itself, before discussing the main contributions and prior work.
We agree that OT has, by now, accumulated a lot of jargon, which slows down first-time readers. As highlighted by Reviewer zNJm for MBO, we will revisit the introduction section to ensure all terms are extensively defined.
Thanks to the authors for addressing my question, and looking forward to improved writing. My score remains the same.
We are grateful for your time during this reviewing process, and for supporting the paper.
the authors
This paper explores the optimal transport problem with an elastic regularizer, to help deal with an underparametrized model. Specifically, the regularizer acts in a specific subspace, defined by a matrix A, which is also learned. A convex optimization problem is proposed, together with a method to find this optimal transport map, and the approach is applied to a single-cell application with improved observed performance.
Strengths
The math seems sound and interesting, and the application seems to benefit from this framework. The novelty of the work is unclear to me, but I am not aware of any major competing work.
Weaknesses
I got a bit confused as to all the moving parts in this paper. Each particular component makes sense, but how does it all fit together? I think this paper would benefit from having some kind of "algorithms box", either in the main paper or appendix, that lists from start to finish how an application problem is solved using their method. In particular, the interplay between learning A and learning the transport map, and any decisions required to tune that part of the method.
There should be some numerical comparisons of this method compared to others, attempting to solve similar problems. An obvious one is OT without the subspace constraint. Are there also non-OT methods that are competitive in this space?
Questions
Eq (8), is this really an entropy term? Seems more like a kernel function.
Is this prox operator really easy to use for all types of tau? Also, are there non-euclidean distances that you look at when looking at this elastic prox?
What happens if A is not on the Stiefel manifold? Does it have to be? Are there cases where you may want some "directional penalization" but not necessarily wiping out all movement in that direction? Is it challenging to constrain A to be on the Stiefel manifold?
What does MBO stand for?
What is the computational complexity for this method? What kind of data sizes is it meant for? Is it challenging to find T? is there some kind of method for choosing the hyperparameters, such as for the soft min?
What are the p hats and such in Fig 4 signifying?
Limitations
no negative societal impact
We are very thankful for your encouraging grades and for your great comments.
A convex optimization problem is proposed […]
There is a mild confusion that we would like to clarify: while the task of computing the entropic map given a cost and weighted samples is convex (Eq. 10), as you point out, the main task we tackle is that of learning the parameters of a regularizer τ (S.5). While we pair learning cost parameters with the MBO estimator to improve predictive performance on a real task (S.6.3), learning the cost is not a convex pipeline.
It is unclear to me the novelty of the work, but I am not aware of any major competing work.
To the best of our knowledge, the task of estimating an OT map whose displacements mostly happen in a subspace is indeed new (S.5).
Crucially, without devising a scheme for computing ground-truth maps for elastic costs (S.3), it would be literally impossible to verify our methodology. Our contributions are (i) a method for computing ground-truth OT mappings for arbitrary elastic costs (L.10), (ii) a method to estimate the parameters of such costs (L.15), and (iii) demonstrated applicability in both synthetic and real settings (L.18).
I think this paper would benefit from having some kind of "algorithms box"
We thank you for this great suggestion, and added these algorithms in the attached pdf.
There should be some numerical comparisons of this method compared to others [...]. An obvious one is OT without the subspace constraint.
We have such experiments in Fig. 5. The dotted line is our baseline: the “standard” MBO estimator that does not use any subspace assumption, computing OT maps with a squared-Euclidean cost (L.284). The other lines are "our" MBO estimators computed with learned costs. Remarkably, they perform better, even in the squared-Euclidean sense.
Are there also non-OT methods that are competitive in this space?
Not to our knowledge. One might speculate that flow-matching methods could be extended to estimate a map (Sec.6.1), but we do not see how they could also learn a subspace (Sec.6.2).
Eq (8), is this really an entropy term?
This duality relation links the (primal) entropy-regularized OT solution, with dual potentials in Eq. 7, see [Peyre & Cuturi, 19, Prop. 4.3 & proof]. We will clarify.
Is this prox operator really easy to use for all types of tau?
A good question! The proximal operator is known in closed form for a huge family of regularizers τ. See for example [1].
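As a concrete instance of such a closed-form prox (an illustrative sketch only, with a generic threshold `lam`), soft-thresholding is the proximal operator of the L1 norm, one of the regularizers discussed in this thread:

```python
import numpy as np

def prox_l1(z, lam):
    """Prox of lam * ||.||_1: argmin_u 0.5 * ||u - z||^2 + lam * ||u||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(prox_l1(np.array([-2.0, 0.3, 1.5]), lam=0.5))  # -> [-1.5  0.   1. ]
```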
Also, are there non-euclidean distances that you look at when looking at this elastic prox?
The original MBO paper looked at L1 and k-overlap norms. In our revision, we have added a plot with the k-overlap norm for the ground-truth OT map.
What happens if A is not on the Stiefel manifold?
An inverse pops up in the computation of the prox, Eq.14. That inverse disappears when A is on the Stiefel manifold, Eq.15. Since we compute a gradient of the loss in Eq.19 w.r.t. A, if A were not on the Stiefel manifold, one would need to differentiate through this inverse.
are there cases where you may want some "directional penalization" but not necessarily wiping out all movement in that direction?
This is a very good question. Note that we do not "wipe out" all movements in the orthogonal of A; we penalize them (uniformly over all directions orthogonal to A) with a weight γ. This is illustrated in the two rightmost plots in Fig.1: they only differ in the strength γ of this penalization. The rightmost displacements agree “more” with A, i.e., their component orthogonal to A is smaller, because γ is larger.
If γ → ∞, all displacements (and points) would lie only along A, which would be equivalent to first projecting all data onto A. This is the spiked model discussed in S.4.2.
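To illustrate the two paragraphs above, here is a hedged sketch (not the paper's exact Eq. 15, whose scaling may differ) of the proximal operator of z ↦ (γ/2)·||(I − A Aᵀ) z||² with A orthonormal: it leaves the component of z inside span(A) untouched and shrinks the orthogonal component, which vanishes as γ → ∞ (the spiked / projected regime).

```python
import numpy as np

def prox_subspace(z, A, gamma):
    """Prox of (gamma/2) * ||(I - A A^T) z||^2, for A with orthonormal columns."""
    inside = A @ (A.T @ z)          # component in span(A): left untouched
    outside = z - inside            # orthogonal component: shrunk
    return inside + outside / (1.0 + gamma)

rng = np.random.default_rng(0)
d, p = 4, 1
A, _ = np.linalg.qr(rng.normal(size=(d, p)))
z = rng.normal(size=d)
for gamma in (0.0, 1.0, 1e6):       # gamma -> infinity ~ projection onto span(A)
    print(gamma, prox_subspace(z, A, gamma))
```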
Is it challenging to constrain A to be on the Stiefel manifold?
At each Riemannian gradient update, this only requires the SVD of a d × p matrix. This is usually not challenging, and we did not encounter any difficulties. We will detail this further at L.224.
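For reference, a minimal sketch of one standard way to get back onto the Stiefel manifold after a gradient update, via a thin SVD (polar retraction); this is illustrative and not the paper's code.

```python
import numpy as np

def retract_to_stiefel(M):
    """Closest matrix with orthonormal columns (Frobenius norm), via thin SVD."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
A = retract_to_stiefel(rng.normal(size=(50, 4)))
print(np.allclose(A.T @ A, np.eye(4)))   # True: A^T A = I_p
```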
What does MBO stand for?
MBO stands for Monge-Bregman-Occam, the estimator in [Cuturi et al 2023] that extends the entropic map [Pooladian and Niles-Weed, 2021] to arbitrary translation-invariant costs. We went too fast; we will clarify.
What is the computational complexity for this method? What kind of data sizes is it meant for? Is it challenging to find T? is there some kind of method for choosing the hyperparameters, such as for the soft min?
The approach outlined in S.5 requires computing Sinkhorn solutions and differentiating them. This has a compute/memory footprint quadratic in the batch size n, with constants that depend on the regularization of the soft-min. These blocks are very well implemented by now. For hyperparameters, we could think of cross-validation, but this was not used.
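For context, here is a plain log-domain Sinkhorn loop (a generic sketch, not the implementation used in the paper); the n × m cost matrix is what drives the quadratic compute/memory footprint mentioned above, and `eps` plays the role of the soft-min regularization.

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(C, a, b, eps=0.1, n_iter=200):
    """Log-domain Sinkhorn: returns dual potentials f, g and the coupling P."""
    f, g = np.zeros(C.shape[0]), np.zeros(C.shape[1])
    for _ in range(n_iter):
        f = -eps * logsumexp((g[None, :] - C) / eps, b=b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps, b=a[:, None], axis=0)
    P = a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - C) / eps)
    return f, g, P

n, m, d = 16, 24, 5
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
C = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # n x m cost matrix
a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
f, g, P = sinkhorn_log(C, a, b)
print(np.abs(P.sum(axis=0) - b).max())                     # ~0 after convergence
```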
What are the p hats and such in Fig 4 signifying?
A ground-truth matrix A* has dimension d × p*. Our goal is to recover the subspace it spans in R^d. However, we will likely never know p* in practice. Yet, we need to choose the size of our estimator Â, set as d × p̂.
In synthetic settings, we picked p̂ to be exactly p*, or a bit larger (the latter being more realistic in practice), L.270. We show that, regardless, we are able to accurately recover the subspace described by the ground-truth matrix A*, L.272.
On real data (Fig. 5), we only considered .
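For the synthetic recovery experiments above, one possible way to quantify how well Â captures the span of A* (a hedged sketch, not necessarily the metric used in the paper): the singular values of A*ᵀÂ are the cosines of the principal angles between the two subspaces, and are all close to 1 when the ground-truth subspace is contained in the recovered one.

```python
import numpy as np

def subspace_alignment(A_star, A_hat):
    """Cosines of principal angles between span(A_star) and span(A_hat)."""
    return np.linalg.svd(A_star.T @ A_hat, compute_uv=False)

rng = np.random.default_rng(0)
d, p_star, p_hat = 50, 4, 6
A_star, _ = np.linalg.qr(rng.normal(size=(d, p_star)))
extra, _ = np.linalg.qr(rng.normal(size=(d, p_hat)))
A_hat, _ = np.linalg.qr(np.hstack([A_star, extra])[:, :p_hat])  # contains A_star
print(subspace_alignment(A_star, A_hat))                        # ~ [1, 1, 1, 1]
```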
[1] G. Chierchia, E. Chouzenoux, P. L. Combettes, and J.-C. Pesquet. The Proximity Operator Repository. https://proximity-operator.net/
- There is a mild confusion that we would like to clarify: while the task of computing the entropic map given a cost and weighted samples is convex (Eq. 10), as you point out, the main task we tackle is that of learning the parameters of a regularizer τ (S.5). While we pair learning cost parameters with the MBO estimator to improve predictive performance on a real task (S.6.3), learning the cost is not a convex pipeline.
Got it, yes that is an important point
- We thank you for this great suggestion, and added these algorithms in the attached pdf.
Thanks, this is helpful.
- We have such experiments in Fig. 5. The dotted line is our baseline: the “standard” MBO estimator that does not use any subspace assumption, computing OT maps with a squared-Euclidean cost (L.284). The other lines are "our" MBO estimators computed with learned costs. Remarkably, they perform better, even in the squared-Euclidean sense.
Nice, these are great.
- The proximal operator is known in closed form for a huge family of regularizers τ. See for example [1].
Ok, I think I was caught up on: if tau = ||Ax||_2^2, then you at least have to solve a linear system with A. But I guess you consider that a small cost, if A is small, and you can compute its SVD easily?
- The original MBO paper looked at L1 and k-overlap norms. In our revision, we have added a plot with the k-overlap norm for the ground truth OT map.
Is the k overlap norm the same as the latent group norm by Obozinski et al? If so, actually the prox is a bit complex, but the dual norm is often easier to compute. I wonder if that's something you can leverage.
- The approach outlined in S.5 requires computing Sinkhorn solutions and differentiating them. This has a compute/memory footprint quadratic in the batch size n, with constants that depend on the regularization of the soft-min. These blocks are very well implemented by now. For hyperparameters, we could think of cross-validation, but this was not used.
I think we still need some hard numbers for the numerical experiment section (dataset sizes). Not to say that the experiments provided aren't valuable, and certainly for an OT paper I do expect smaller datasets than for a typical DNN paper, but still, we need a reference.
For all the other comments, I read them and am happy with them.
Many thanks for acknowledging our rebuttal!
The proximal operator is known in closed form for a huge family of regularizers τ. See for example [1].
Ok, I think I was caught up on: if tau = ||Ax||_2^2, then you at least have to solve a linear system with A. But I guess you consider that a small cost, if A is small, and you can compute its SVD easily?
This is a great question that was on our mind for a while when drafting the idea of the paper.
While it might be natural to start from τ(z) = ||Az||_2^2, three things happen when making that choice:
(1) as you mention, one needs to compute an inverse to evaluate the proximal operator. While this can be managed, the real downside is that one would need to differentiate through that inverse (more precisely, through a linear system) when computing the Jacobian of our regularizer, i.e., the term that appears in Eq. 19; this sounds fairly costly.
(2) if one uses that regularizer in our loss (Eq. 19), then one seeks a projection A for which, ultimately, ||Az|| is small on displacements z. For very high-dimensional data, it's usually very easy to find a subspace where "nothing happens", and ||Az|| is more or less always zero. Therefore using this approach would likely select small / noisy / uninformative projections, which is somewhat the contrary of what we seek.
(3) In the absence of any constraint on A, one would certainly need to deal with the magnitude of A in the optimization, as one can imagine that letting its scale go to 0 or infinity may easily impact most relevant objectives.
Taken together, these reasons largely explain why we converged instead towards the definition τ(z) = ||A^⊥ z||^2, the squared norm of z in the orthogonal complement of A, with A a matrix on the Stiefel manifold: that way, A is instead where most of the action in the transport occurs, and the parameterization is easy to optimize and properly normalized.
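A tiny numerical illustration of points (1)-(3) above (the formulas below are our shorthand, not verbatim from the paper): with τ(z) = ||Az||² and no constraint on A, rescaling A collapses the penalty, whereas the orthogonal-complement penalty with A on the Stiefel manifold is insensitive to such rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 20, 3
A, _ = np.linalg.qr(rng.normal(size=(d, p)))   # orthonormal columns
z = rng.normal(size=d)

for scale in (1.0, 1e-3, 1e-6):
    B = scale * A
    pen_unconstrained = np.sum((B.T @ z) ** 2)               # -> 0 as scale -> 0
    pen_orth_complement = np.sum((z - A @ (A.T @ z)) ** 2)   # fixed by the Stiefel constraint
    print(scale, pen_unconstrained, pen_orth_complement)
```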
We have explained this intuition in Appendix A, "More on Subspace Elastic Costs", but given the many questions you have asked, we are now convinced this will benefit from more details.
The original MBO paper looked at L1 and k-overlap norms. In our revision, we have added a plot with the k-overlap norm for the ground truth OT map.
Is the k overlap norm the same as the latent group norm by Obozinski et al? If so, actually the prox is a bit complex, but the dual norm is often easier to compute. I wonder if that's something you can leverage.
We believe you refer to Obozinski et al's group overlap norm [1].
The k-overlap norm that we use (and which was used in the MBO paper) was proposed in [2] and generalized in [3]. The k-overlap norm can be viewed as an L1-norm-type regularization that does not suffer from shrinkage.
Because of this, while still being sparse (i.e., most displacements go down or left), the arrows shown in Figure A exhibit more displacements than those shown with the L1 norm (second plot of Fig. 1 in the draft), and none stays put.
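On the reviewer's remark about dual norms, a hedged sketch: as we understand [2], the dual of the k-support (k-overlap) norm is the l2 norm of the k largest-magnitude entries, which is trivial to evaluate compared to the primal norm or its prox.

```python
import numpy as np

def k_support_dual_norm(x, k):
    """l2 norm of the k largest-magnitude entries of x."""
    top_k = np.sort(np.abs(x))[-k:]
    return float(np.sqrt(np.sum(top_k ** 2)))

print(k_support_dual_norm(np.array([3.0, -1.0, 0.5, 2.0]), k=2))  # sqrt(13)
```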
[1] Laurent Jacob, Guillaume Obozinski, and Jean-Philippe Vert. Group Lasso with Overlap and Graph Lasso. ICML 2009.
[2] Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse Prediction with the k-Support Norm. NIPS 2012.
[3] Andrew M. McDonald, Massimiliano Pontil, and Dimitris Stamos. Spectral k-Support Norm Regularization. NIPS 2014.
The approach outlined in S.5 requires computing Sinkhorn solutions and differentiating them. This has a compute/memory footprint quadratic in the batch size n, with constants that depend on the regularization of the soft-min. These blocks are very well implemented by now. For hyperparameters, we could think of cross-validation, but this was not used.
I think we still need some hard numbers for the numerical experiment section (dataset sizes). Not to say that the experiments provided aren't valuable, and certainly for an OT paper I do expect smaller datasets than for a typical DNN paper, but still, we need a reference.
This is indeed a very good point; we did not realize we had not given these numbers, and we apologize for this oversight. These datasets are part of the Sciplex dataset [Srivatsan et al. 2020] (a Science paper), which has gained much popularity in OT in recent years, and is featured prominently in
[4] Bunne et. al, Learning single-cell perturbation responses using neural optimal transport, Nature Methods 2023
The data we use has the following sizes (number of cells):
| Cell line | Control | Dac | Bel | Quis |
|---|---|---|---|---|
| A549 | 3274 | 558 | 669 | 475 |
| K562 | 3346 | 388 | 656 | 339 |
| MCF7 | 6346 | 1562 | 1684 | 1520 |
For each drug, we therefore have 3 source/target dataset pairs. We average results over these cell lines.
We have run experiments for 2 other drugs (of similar size), as mentioned in our general response. We will report them in the final draft. Our method is substantially better with Givinostat (better than with Dac, Bel and Quis shown in the draft) and is slightly worse on Hesperadin.
Thanks for all the responses. I have no further questions and am happy with all answers.
Your comments and questions have helped improve our draft; we have incorporated these points.
We remain available until the closing of the discussion period to address any other aspect of our draft.
the authors
The paper studies the problem of computing Monge maps between two distributions. Such maps are usually found with respect to the squared-Euclidean cost function; however, the focus of the paper is on elastic cost functions, which have an additional penalty term (denoted by τ). A numerical method is formulated for estimating the Monge maps w.r.t. elastic costs. Moreover, a loss function is proposed for learning the parameter of a parametrized regularizer τ, and is applied to the case where the regularizer corresponds to a low-dimensional subspace. Experiments are performed on synthetic data and also on single-cell data tasks.
Strengths
The paper builds on the work of Cuturi et al. 2023, which proposed the idea of using elastic costs. However, that paper proposes an estimator for finding an approximate Monge map (via the MBO estimator), while the present paper proposes a projected gradient descent scheme whose iterates converge to the Monge map. The paper also provides insights for the case where the regularizer involves a subspace constraint, and the results are likely to be of interest for researchers working on related problems.
Weaknesses
- The exposition of the paper can be improved substantially. At present, it is difficult to parse the technical details since the notation is unclear in several places. For example, in eq. (7), where are the quantities a, b, X and Y defined?
- The first main contribution involves showing that a projected gradient scheme recovers the ground-truth Monge map as the iterates approach infinity. So, for finitely many iterates, we would have an approximation to the ground truth. But even for the MBO estimator, the estimate is a good approximation to the ground truth provided the number of samples is sufficiently large. So, while this is a nice observation, I feel that it's a bit incremental in terms of contribution.
- The second contribution involves studying the case where the regularizer corresponds to a subspace constraint penalty, for which estimation rates are derived for the MBO estimator. However, these appear to follow readily from existing work, so the theoretical contributions are a bit weak in my view.
Questions
- In Theorem 1, the rate depends exponentially on d but the subspace dimension p does not appear anywhere. Then what is the benefit of the (low-dimensional) subspace constraint?
Limitations
I do not see the limitations discussed anywhere.
We thank the reviewers for their time and for formulating many interesting questions.
A numerical method is formulated for estimating the Monge maps w.r.t elastic costs.
We believe there might be a minor confusion here. You are referring to the proximal gradient method in Prop. 2. Please notice that this method is only used to construct ground-truth OT maps for elastic costs (as mentioned in the title of Section 3, L.113, "On Ground Truth Monge Maps for Elastic Costs"). This method cannot, therefore, be used to estimate Monge maps (this was done by [Cuturi+23] with their MBO method), but only to create examples for which one can recover Monge maps for elastic (not squared-Euclidean) costs.
However that paper proposes an estimator for finding an approximate Monge map (via the MBO estimator), while the present paper proposes a projected gradient descent scheme whose iterates converge to the Monge map.
Please see our response above. Prop. 2 is only used to generate ground-truth OT examples. It starts from a cost and arbitrary concave functions to define a ground-truth map in L.117. By contrast, the MBO estimator only takes weighted samples as input.
The exposition of the paper can be improved substantially. At present, it is difficult to parse the technical details since the notation is unclear in several places. For example, in eq. (7), where are the quantities a, b, X and Y defined?
We went too fast, apologies, this will be improved, notably with any additional page we might get.
In L. 203, “Given input and target measures characterized by point clouds X, Y and probability weights a, b”, we mean that the data is given as two weighted point clouds, μ = Σ_i a_i δ_{x_i} and ν = Σ_j b_j δ_{y_j}, where (a_i), (b_j) are probability vectors summing to 1. In our experiments, a_i and b_j are uniform weights, 1/n and 1/m, with n and m changing. We will also improve other notation, see e.g. comments to Rev. EFNS and zNJm.
But even for the MBO estimator, the estimate is a good approximation to the ground-truth provided the number of samples is sufficiently large. So, while this is a nice observation, I feel that its a bit incremental in terms of contribution.
Let us mention again that in real-life applications we do not have access to the ground truth. The method in Section 3 therefore cannot be used with real data; it was only used to benchmark the MBO estimator (in Fig. 3) and our "learning elastic costs" procedure (Fig. 4). For instance, this method is not used at all on single-cell data (S.6.3, Fig. 5).
The second contribution involves studying the case where the regularizer corresponds to a subspace constraint penalty, for which estimation rates are derived for the MBO estimator. However these appear to follow readily from existing work, so the theoretical contributions are a bit weak in my view.
The results from Thm1 indeed follow from the work of [Pooladian & Weed, 2021], using a change of variable. While the proof technique is simple, this is a new result that sheds an important light on the subspace cost function.
We also want to stress that our parameterization is instrumental in two aspects:
- it allows us to establish a link with spiked transport models when γ → ∞
- it enables the learning of a Stiefel matrix A using the loss mentioned in Sec.5. This approach recovers the ground-truth subspace matrix A* from samples, as demonstrated in Fig.4, and is useful to better predict transport on real single-cell data in Fig.5.
While the cost presented in Section 4 is simple (it is indeed a quadratic norm), a crucial piece is its parameterization, which fits well with our method in Section 5, to recover a subspace cost. In that sense, Section 4 only takes its full meaning when paired with Section 5.
In Theorem 1, the rate depends exponentially on d but the subspace dimension p does not appear anywhere. Then what is the benefit of the (low-dimensional) subspace constraint?
Our method does not constrain displacements to happen in a subspace of dimension p. We use a regularization mechanism (illustrated in Fig. 1) that promotes displacements taking place mostly in that subspace (depending on γ). Notice that this is the biggest difference with the subspace projection methods mentioned in L.37, which can be seen as a degenerate / collapsed case of our method.
Because our algorithm uses the data in its original dimension, and Theorem 1 makes no specific assumption on the ground-truth generating mechanism, there is no immediate way to leverage p. This result is there to highlight that the MBO estimator, in the case where A is known, is sound.
Thank you for your response to my queries, I acknowledge that I have read the rebuttal.
We thank you for the time you have taken to read our rebuttal and have already benefitted from your comments (as well as those of all other reviewers) to improve presentation.
Although remaining discussion time is short, we are happy to answer any further concerns you may have on the paper.
the authors
This paper studies the problem of optimal transport with elastic costs. They demonstrate the performance of the MBO estimator on subspace losses. They then introduce a loss for learning the elastic cost function and provide experimental results on the performance of this cost learning scheme.
Strengths
- Clearly and well-written (except for some notational confusion that I mention in the Questions section)
- Optimizing over cost functions to find low-dimensional subspace is a cool idea.
Weaknesses
- The theoretical contributions of this work (sections 3 and 4) seem limited. The main machinery used is the application of the MBO estimator, which is from pre-existing work, to straightforward cost functions. It would be more interesting if some theoretical results for the setting of sections 5 and 6 were demonstrated (such as, i) does the true parameter optimize the elastic costs loss when the data is drawn from a corresponding ground-truth model, and ii) what is the sample complexity of parameter recovery?)
Questions
- Why is it called elastic costs?
- What are a and b in equation 7?
- What is $\tilde{T}$ in Theorem 1?
- Am I missing something, or is the setting of section 4 very simple given the pre-existing results for $h=\tfrac12\|\cdot\|^2$? In particular, because $h$ is a squared Mahalanobis norm, this OT problem seems easily solvable by first pre-processing the data s.t. (on the new space) the cost becomes the usual squared-Euclidean one, and then, second, solving the usual OT map with that cost. It seems like the proof of Theorem 1 acknowledges this with the map $W$, but I was wondering if there is anything more to these results than rescaling the data.
- Related to the previous question, but can the results of section 4 be extended directly to general squared Mahalanobis norms?
Limitations
The authors have addressed their limitations.
We are very grateful to your detailed reviews and many thought-provoking questions.
i) does the true parameter optimize the elastic costs loss when the data is drawn from a corresponding ground-truth model
We cannot prove that the ground-truth parameter is the global optimum of our loss, but we do observe empirically, in all our synthetic experiments, that the loss of the ground-truth parameter lower-bounds that of our iterates.
To follow your intuition, we agree that if the regularizer admits a parameter for which it is identically 0, then our loss would be trivially minimized at that parameter. This cannot happen with the Stiefel-manifold parameterization presented in the paper. For regularizers where this can occur, our loss might need a contrastive term to avoid such trivial parameters.
A possible way to give more intuition for the loss (Eq. 19) is to cast it in a continuous formulation. With, as you suggest, a ground-truth map $T^\star$, obtained from a ground-truth parameter $\theta^\star$ parameterizing the cost and a ground-truth potential, the loss for $\theta^\star$ reads (using Prop. 1 and 2) as an integral against the source measure $\mu$ of a quantity involving the displacements of $T^\star$. For a different $\theta$, with the same source/target measures, we would first need to recompute an OT map $T_\theta$ from $\mu$ to $\nu$ using the cost parameterized by $\theta$, and then use that map to define the loss as an integral of the form $\int \left(\cdots T_{\theta}(x)\right)\mu(dx)$.

ii) what is the sample complexity of parameter recovery?

This sounds hard. In the much simpler spiked model, $\gamma\rightarrow\infty$ (S4.2), for the projected Wasserstein metrics, existing results [Niles-Weed et al. 22, Lin et al. 21] do not study the statistical complexity of parameter recovery, but only that of the modified distances between sample vs. population measures (taking a max, or even a mean as in [Lin et al. 21], over subspace projections).

Why is it called elastic costs?

The $\ell^2_2$ squared norm + regularizer form in (Eq. 5) was first introduced as the elastic net (https://en.wikipedia.org/wiki/Elastic_net_regularization) by [Zou & Hastie, 2005], whom we will cite. This split ($\ell^2_2$ + regularizer) is natural, and leads to explicit displacements in Eq. (6), where $T(x) = x - \mathrm{prox}_{\tau}(\cdots)$. Splitting the cost as $\ell^2_2 + \tau$ leads to the first term in $x$, and to a displacement inheriting the "regularity" of $\tau$ encoded in its prox, as shown in [Cuturi et al. '23].

What are $\mathbf{a}$ and $\mathbf{b}$ in equation 7?

We went too fast, apologies. In L. 203, "Given input and target measures characterized by point clouds $\mathbf{X}, \mathbf{Y}$ and probability weights $\mathbf{a}, \mathbf{b}$", we mean that the data is given as two weighted point clouds,

$$\mu=\sum_{i=1}^n a_i \delta_{\mathbf{x}_i},\quad\nu=\sum_{j=1}^m b_j \delta_{\mathbf{y}_j}.$$

$(a_i), (b_j)$ are probability vectors summing to 1. In our experiments, $a_i$ and $b_j$ are uniform weights, $1/n$ and $1/m$, with $n$ and $m$ changing.

What is $\tilde{T}$ in Theorem 1?

We apologize for this. $\tilde{T}$ is defined in L.462, in Theorem 1's proof. It is the Brenier OT map between the source and target measures pushed forward by the linear map $W$ in L.456, defined from the matrix $A$. We will simplify the exposition of this theorem to focus on the rate given between L.177 and L.178, since $\tilde{T}$ only influences constants.

Am I missing something or is the setting of section 4 very simple given the pre-existing results for $h=\tfrac12\|\cdot\|^2$? [...] It seems like the proof of Theorem 1 acknowledges this with the map $W$, but I was wondering if there is anything more to these results than rescaling the data.

The elastic cost seeded with regularizer $\|A^{\perp} z\|^2$ in (Eq. 13) is indeed a Mahalanobis norm as written in L.456, and the results from Thm 1 follow from the proof technique you mention. This is a new result that sheds important light on the subspace cost function, and, as you point out, it relies on the results of [Pooladian & Weed, 2021]. We also want to stress that our parameterization $\ell_2^2 + \gamma\|A^{\perp}x\|^2_2$ is instrumental in two aspects:
- it allows us to establish a link with spiked transport models when $\gamma\to\infty$
- it enables the learning of a subspace $A$ using the loss Eq. 19 described in Sec. 5, either to recover a ground-truth parameter $A^*$ used to generate synthetic data (in Sec. 6.2, Fig. 4), or as a parameterized transport that improves predictive performance on real data (Sec. 6.3).

Can the results of section 4 be extended directly to general squared Mahalanobis norms?

As you correctly point out, the results from Sec. 4.1 can be applied to any Mahalanobis norm $h(x) = \frac12 x^T W x$ with $W$ positive definite. We will clarify this in the text.
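As a quick numerical check of the change of variables discussed just above (an illustrative sketch with toy matrices): a Mahalanobis cost $\tfrac12 (x-y)^T W (x-y)$ matches the squared-Euclidean cost after mapping points through $W^{1/2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))
W = M @ M.T + d * np.eye(d)                          # positive definite
evals, evecs = np.linalg.eigh(W)
W_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T   # symmetric square root

x, y = rng.normal(size=d), rng.normal(size=d)
maha = 0.5 * (x - y) @ W @ (x - y)
eucl = 0.5 * np.sum((W_half @ x - W_half @ y) ** 2)
print(np.isclose(maha, eucl))                        # True
```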
The interest of our parameterization becomes, however, fully apparent when trying to *learn* subspaces **for displacements**, while still estimating an OT map valid in full dimensionality between samples. Learning a "valid" full-rank $W$ directly would be computationally intractable for large $d$; learning a low-rank $W$ would be degenerate, and is equivalent to the spiked transport model (Sec. 4.2). The spiked model only studies OT maps once the measures have been *entirely* projected onto a lower-dimensional subspace. We would not know how to extend Sec. 5 to *learn* the parameters of a general Mahalanobis distance.

Thanks for your response. The response clarifies the contributions and I will increase my score.
We sincerely appreciate your time reading our rebuttal.
We are very grateful for your score increase, and remain available to answer any further questions on the paper you may have.
The Authors
We would like to thank all reviewers for the quality of their review, and for taking the time to formulate many insightful questions. It was a pleasure on our end to do our best during this rebuttal week to clarify some of these concerns. As a result, our rebuttal text is fairly long, and we thank again reviewers for their time reading our answers.
Many reviewers have mentioned the complexity of our paper. Indeed, the task we consider is fairly complex, and is at the forefront of research in OT: to our knowledge, ours is the first work that considers structured priors for costs (beyond the squared-Euclidean cost), studies ground-truth OT maps for such costs, and proposes a path to recover their parameters exclusively from samples. This works in synthetic settings, but also, crucially, reaches demonstrated gains on real datasets when compared to standard pipelines.
Because of this, we agree that some parts may not have been clear, notably in our experiments, where even the synthetic parts are fairly complex. We agree with Reviewer zNJm that, taken together, these multiple moving parts may be challenging to follow, and that algorithmic boxes would have helped.
Following this very good suggestion, we have drafted an "Algorithms" section in the attached pdf. With these algorithms, we have tried to strike a balance between rigor (especially when it comes to low-level routines) and intuition (to describe data generation). We have then succinctly mapped our experiments to these algorithms, to highlight which parts matter for each task. For instance, our parameter recovery pipeline Recover-Theta can be used with the MBO estimator (Section 6.3), but can also be used on its own (Section 6.2) if one only seeks a subspace for interpretability.
We have also added two figures in the pdf.
- Figure A will be added to the current Figure 1. It shows a ground-truth OT map beyond the squared-Euclidean and subspace norms, using the k-overlap norm, as was formerly explored in [Cuturi et al. 23]. This illustrates better, in our opinion, the interest of being able to generate ground-truth displacements for arbitrary structured norms.
- Figure B will replace Figure 4. While we were already positively impressed by the ability of our recovery pipeline to estimate the ground-truth subspace exclusively from samples (using Recover-Theta, Algorithm 5 in the pdf), we noticed, looking at the experiments in more detail, that the few failure cases we were observing were simply due to suboptimal seeding of the initialization. Running the algorithm 3 times, with 3 different initializations, and keeping the parameter with the lowest loss significantly improves performance (as with, e.g., k-means); see the short sketch after this list.
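A minimal sketch of this multi-seed strategy (here `fit_subspace` is a hypothetical stand-in for the Recover-Theta routine, returning a candidate matrix and its loss; its dummy body only exists to make the snippet runnable).

```python
import numpy as np

def fit_subspace(seed):
    """Hypothetical stand-in for Recover-Theta: returns (candidate, loss)."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.normal(size=(50, 4)))   # dummy Stiefel candidate
    return A, float(rng.uniform())                  # dummy loss

def best_of_restarts(n_restarts=3):
    runs = [fit_subspace(seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda r: r[1])            # keep the lowest-loss run

A_best, loss_best = best_of_restarts()
print(loss_best)
```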
We have also run more experiments on drugs, adding 2 more datasets. We do not report these for lack of space, but observe similar trends: our subspace regularizer has a predictive edge.
Many thanks again for your time,
The Authors.
Overall reviewers felt that this paper makes a solid contribution towards extending our understanding of Monge maps under regularized Euclidean loss functions. The paper should serve as a foundation for more follow up work on optimal transport problems with such loss functions. While reviewers felt that the paper was largely well-written, they made several suggestions for improvements that we hope the authors will take into account when preparing the final version.