PaperHub
6.6 / 10
Poster · 4 reviewers
Ratings: 3, 4, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

Optimal Transport Barycenter via Nonconvex-Concave Minimax Optimization

OpenReview · PDF
Submitted: 2025-01-17 · Updated: 2025-07-24

Abstract

The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the *unregularized* barycenter for discretized probability distributions on point clouds is a challenging task when the domain dimension $d > 1$. Most practical algorithms for approximating the barycenter problem are based on entropic regularization. In this paper, we introduce a nearly linear time $O(m \log{m})$ and linear space complexity $O(m)$ primal-dual algorithm, the *Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent* (WDHA) algorithm, for computing the *exact* barycenter when the input probability density functions are discretized on an $m$-point grid. The key success of the WDHA algorithm hinges on alternating between two different yet closely related Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. Under reasonable assumptions, we establish the convergence rate and iteration complexity of WDHA to its stationary point when the step size is appropriately chosen. Superior computational efficacy, scalability, and accuracy over the existing Sinkhorn-type algorithms are demonstrated on high-resolution (e.g., $1024 \times 1024$ images) 2D synthetic and real data.
Keywords
Wasserstein barycenter; optimal transport; nonconvex-concave minimax optimization; convergence rate;

Reviews and Discussion

Review (Rating: 3)

This paper investigates the Wasserstein barycenter problem. The approach involves transforming the problem into the dual formulation of Kantorovich's optimal transport problem, which is then reformulated as a nonconvex-strongly concave minimax optimization problem. To solve this, the standard gradient descent-ascent method is employed, with theoretical guarantees provided for its convergence. Numerical experiments demonstrate the effectiveness of the proposed algorithm.

Questions for Authors

In line 253, I would like to understand why the projection operator onto the convex gradient smooth function class is guaranteed to be unique. Could the authors provide further clarification or a formal justification for this claim?

Claims and Evidence

Yes, it is clear.

Methods and Evaluation Criteria

The algorithm is tested on standard machine learning tasks.

Theoretical Claims

The overall theoretical claims are reasonable. The specific dual formulation of Kantorovich's problem allows the original Wasserstein barycenter problem to be verified as a strongly concave minimax problem. The author should emphasize that the problem is not merely concave but strongly concave, as this distinction is crucial for ensuring the validity of the derived results. Additionally, the complexity bound introduced in line 210 does not hold for general nonconvex-concave problems and should be clarified accordingly.

Experimental Design and Analysis

The algorithm tested in the numerical section differs from the one analyzed in the theoretical parts. In addition to the discretization of the function, the practical implementation approximates the projection onto the convex function space using the convex envelope. To enhance the rigor of the analysis and the credibility of the numerical results, it would be beneficial for the paper to either provide a more precise theoretical justification for these approximations or conduct numerical experiments on the exact algorithm analyzed.

Supplementary Material

I have reviewed the supplementary material but have not conducted a detailed verification of its correctness.

Relation to Prior Work

This paper is a valuable supplement to the existing literature on computing the Wasserstein barycenter through its reformulation approach.

Missing Important References

Not found.

Other Strengths and Weaknesses

Please see above.

Other Comments or Suggestions

  1. Line 215: it should be "Let".
  2. Several citations in the paper, such as "Algorithm 3.3" and "Theorem 1," are incorrect and should be properly referenced.
Author Response

Thanks for your constructive comments and careful reading. We will fix minor errors in the revision. Please find our response to your main questions below.

The author should emphasize that the problem is not merely concave but strongly concave, as this distinction is crucial for ensuring the validity of the derived results. Additionally, the complexity bound introduced in line 210 does not hold for general nonconvex-concave problems and should be clarified accordingly.

Thanks for pointing out that our algorithmic convergence rate and the existing Euclidean result rely on strong concavity. Indeed, Theorem 1 works for $0 < \alpha < \beta < \infty$, under which Problem (4) is a nonconvex-strongly-concave problem. In the revision, we will revise line 106 to "as in the Euclidean nonconvex-strongly-concave optimization problems", revise line 207 to "(Lin et al., 2020) proved that, if $f$ is convex in $x$ and strongly concave in $y$, with suitable choices of step sizes $(\eta, \tau)$, the following bound holds:", and revise line 233 to "Fix $\nu$ and $0 < \alpha < \beta < \infty$; the objective functional $\mathcal{J}(\nu, \varphi_1, \dots, \varphi_n)$ is strongly concave in each $\varphi_i$." to emphasize that the dual program is strongly concave.

The algorithm tested in the numerical section differs from the one analyzed in the theoretical parts. In addition to the discretization of the function, the practical implementation approximates the projection onto the convex function space using the convex envelope. To enhance the rigor of the analysis and the credibility of the numerical results, it would be beneficial for the paper to either provide a more precise theoretical justification for these approximations or conduct numerical experiments on the exact algorithm analyzed.

Results produced by the projection and the conjugate envelope (i.e., double convex conjugate) are identical if $\alpha=0$ and $\beta=\infty$, and are close if we set $\alpha$ small and $\beta$ large. In Appendix, Subsection C.2, we compared the two algorithms on 1D distributions, where we set $\alpha=0.001$ and $\beta=1000$ in Algorithm 3 (projection), which performs comparably well to Algorithm 4 (convex envelope) for this example, i.e., no significant difference. Specifically, projection achieves $\mathcal{W}_2$-distance $7.610\times 10^{-5}$ (std: $3.367\times 10^{-5}$) and $L^2$-distance $0.608$ (std: $0.194$), while the convex envelope has $\mathcal{W}_2$-distance $7.771\times 10^{-5}$ (std: $3.32\times 10^{-5}$) and $L^2$-distance $0.6434$ (std: $0.451$).

To compute the projection for 2D distributions, one can use the algorithm proposed in Simonetto, "Smooth Strongly Convex Regression". However, this involves solving a large convex quadratically constrained quadratic problem. For 2D distributions discretized on a grid of size $1024^2$, this problem would have $1024^4$ quadratic constraints, potentially taking weeks to solve. Due to the rebuttal time constraint, we are unable to report numerical results of running the projection in 2D.

In line 253, I would like to understand why the projection operator onto the convex gradient smooth function class is guaranteed to be unique...

This projection is defined as $\mathcal{P}_{\mathbb{F}_{\alpha,\beta}}(\varphi) = \arg\min_{\psi\in\mathbb{F}_{\alpha,\beta}} \| \varphi - \psi \|_{\dot{\mathbb{H}}^1}$ for every $\varphi\in\dot{\mathbb{H}}^1$. First, we note that $\mathbb{F}_{\alpha,\beta}$ is a convex set. In fact, for every $\varphi_0, \varphi_1\in\mathbb{F}_{\alpha,\beta}$, by definition, we have

$$\frac{\alpha}{2} \| x-y \|_2^2 \;\leq\; \varphi_i(x) - \varphi_i(y) - \langle\nabla\varphi_i(y),\, x-y\rangle \;\leq\; \frac{\beta}{2} \| x-y \|_2^2, \quad \forall\, i=0,1.$$

This implies that for every $\lambda \in [0, 1]$, with $\varphi_\lambda = \lambda \varphi_1 + (1-\lambda) \varphi_0$, it holds that

$$\frac{\alpha}{2} \| x-y \|_2^2 \;\leq\; \varphi_\lambda(x) - \varphi_\lambda(y) - \langle \nabla \varphi_\lambda(y),\, x-y\rangle \;\leq\; \frac{\beta}{2} \| x-y \|_2^2.$$

So, the convexity of the set $\mathbb{F}_{\alpha,\beta}$ is proved. Next, we show that the projection is unique, which follows from the standard argument for projections onto convex subsets of Hilbert spaces. For $\varphi \in \dot{\mathbb{H}}^1$, if both $\psi_0$ and $\psi_1$ in $\mathbb{F}_{\alpha,\beta}$ minimize $\| \cdot - \varphi \|_{\dot{\mathbb{H}}^1}$, then, taking $\widetilde \psi = (\psi_0 + \psi_1)/2 \in \mathbb{F}_{\alpha,\beta}$, we have (by the parallelogram identity)

$$\| \widetilde\psi - \varphi \|_{\dot{\mathbb{H}}^1}^2 = \frac{1}{2} \| \psi_0 - \varphi \|_{\dot{\mathbb{H}}^1}^2 + \frac{1}{2} \| \psi_1 - \varphi \|_{\dot{\mathbb{H}}^1}^2 - \frac{1}{4} \| \psi_1 - \psi_0 \|_{\dot{\mathbb{H}}^1}^2.$$

The optimality of $\psi_0$ and $\psi_1$ implies that $\psi_0 = \psi_1$. Thus, the uniqueness of the projection is proved.

Review (Rating: 4)

The paper proposes a novel Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent algorithm for the barycenter on the $W_2$-space, which is an important problem lacking an accurate and computationally efficient solution. The proposed method takes the gradient of the functional in the dual formulation of the $W_2$ optimization and applies it to the min-max problem for the barycenter, extending the Euclidean gradient descent-ascent scheme. Convergence is proved under the assumption of bounded densities. The computational complexity is also discussed, with a 2D numerical study showing the effectiveness of the method.

Questions for Authors

  1. Same as above: how does the dimension influence the algorithm, and can more numerical studies be carried out on 3D cases?
  2. How does the grid resolution influence the proposed algorithm? What is the sample complexity if we hope to extend the method to continuous distributions?
  3. Please clarify the technical contribution of the theoretical results. I am not sure whether novel theoretical tools or insights are included. I think the main bonus comes from the computational complexity in the $\dot{\mathbb{H}}^1$ space, but that is attributed to [Jacobs & Leger]. I do not think it is necessary to have very big technical novelty for a paper that solves the barycenter problem; I just hope the authors can justify this.

Update after Rebuttal

Thanks for the authors' efforts. Most of my concerns are well addressed.

Claims and Evidence

Yes. The claims in the paper are almost self-contained, except for the verification of the assumptions, but this does not much affect its applications.

Methods and Evaluation Criteria

Yes. The method makes sense: the proposed approach designs the optimization for the min-max problem with gradients taken in the $\dot{\mathbb{H}}^1$ space.

Theoretical Claims

Yes. I checked the proof of the convergence of WDHA under the assumptions and the proof of the computational complexity.

Experimental Design and Analysis

Yes. A 2D barycenter problem is designed for different shapes. A table might be reported for better visualization. If possible, the authors should also clarify which algorithms use GPUs, or whether all of them do; I am not sure whether GPU versions are used for the compared methods.

Supplementary Material

Yes. I verified the technical details of the theorems and took a quick look at the proofs of the lemmas. I also checked the additional details and the implementation for the algorithm verification.

Relation to Prior Work

A computable method for the $W_2$ barycenter is very significant. From my expertise in differential privacy, I believe it is deeply related to the data privacy level and will help better solve problems in that field. More effective applications in data processing, control, or editing based on SDE formulations could also be derived, and more applications in 3D geometry (such as basic morphing) could be improved.

Missing Important References

Almost all the possible and important solutions for the $W_2$ barycenter are referenced, except for sampling methods like "Sampling From the Wasserstein Barycenter" [Daaloul 2021].

Other Strengths and Weaknesses

Pros:

  1. The barycenter problem is significant, and the method is novel and computationally effective in 2D.
  2. Writing is great and clear for understanding.

Cons:

  1. For the min-max problem, we can only find a stationary point, not the global optimum, so the convergence proof is actually not for the optimal solution of the barycenter problem. As mentioned in the paper, this is not a problem given the complexity considerations, but simple examples where the global optimum coincides with a local one would give more insight to practitioners.
  2. I am not sure how the dimension influences the algorithm. For most barycenter computations, high dimension brings problems, and I think the proposed method also suffers from this. But this is also not a deal-breaker, since dimension reduction can be adopted. What I actually care about are 3D cases, but only 2D results are shown. I hope a fast 3D computation can be achieved.

Other Comments or Suggestions

Line 318: there is no Theorem 1.

Ethics Review Concerns

N/A

Author Response

Thanks for your constructive comments. In the revision, we will cite the paper [Daaloul 2021], which proposed an interesting notion for approximating and sampling from unregularized Wasserstein barycenter. Please find our response to your main questions below.

the authors should also clarify which algorithms use GPUs, or whether all of them do...

All algorithms are executed on Google Colab using the same hardware configuration. We compared the runtimes on L4 GPU and CPU for Example 1:

         Ours    CWB     DSB
L4-GPU   674     373     17249
CPU      854     467     18887

Note that our WDHA algorithm can naturally be parallelized because the core operations, such as the convex conjugate for the $n$ dual programs (the primary computing bottleneck), can be solved independently on high-performance computing (HPC) clusters (CPU/GPU). Our current implementation is not optimized for HPC-acceleration since the core code is written in C++ with plain for loops. It is possible that the WDHA algorithm can fully leverage the power of HPC frameworks (e.g., CUDA) and substantially benefit from parallelization. Engineering an HPC-accelerated version of WDHA is a valuable future direction.
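As a rough, hypothetical illustration of this decoupling (not the paper's C++ implementation), the sketch below dispatches the $n$ conjugate computations to independent worker processes; the brute-force $O(m^2)$ 1D conjugate and the synthetic potentials are stand-ins for the fast $O(m \log m)$ conjugate and the real dual iterates used in WDHA.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def conjugate_1d(phi):
    # Stand-in for the fast convex conjugate on a 1D grid over [0, 1]:
    # phi*(y_j) = max_i (x_i * y_j - phi(x_i)), brute force O(m^2).
    m = phi.shape[0]
    x = np.linspace(0.0, 1.0, m)
    return (x[None, :] * x[:, None] - phi[None, :]).max(axis=1)

def main():
    n, m = 4, 1024                         # number of input measures, grid size
    rng = np.random.default_rng(0)
    grid = np.linspace(0.0, 1.0, m)
    potentials = [0.5 * grid**2 + 0.01 * rng.random(m) for _ in range(n)]
    # The n dual subproblems are independent, so they can be solved in parallel.
    with ProcessPoolExecutor(max_workers=n) as pool:
        conjugates = list(pool.map(conjugate_1d, potentials))
    return conjugates

if __name__ == "__main__":
    main()
```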

For the min-max problem, we can only find for the stationary point but not global opt. Therefore the convergence proof is actually not for the optimal solution for barycenter problems.

If the distributions are regular enough, then a stationary solution is indeed a global optimum because the barycenter functional is convex in the linear structure (Proposition 7.19 in Santambrogio, Optimal Transport for Applied Mathematicians) [even though not geodesically convex]. This implies that any global minimizer $\overline{\nu}$ is a barycenter of $\mu_1, \dots, \mu_n$ iff $0 = \mathrm{id} - \frac{1}{n} \sum_{i=1}^n \nabla \varphi_{\overline{\nu}}^{\mu_i}$. Moreover, if the $\mu_i$'s have densities, the barycenter functional is strictly convex and $\overline{\nu}$ is unique.

Now, suppose that $\varphi_{\overline{\nu}}^{\mu_i} \in \mathbb{F}_{\alpha, \beta}$ for all $i$ (i.e., no approximation error due to the assumed regularity). Then, by Lemma 3.2 and Definition 3.3, any smooth stationary point $\nu$ satisfies that the Wasserstein gradient of $\mathcal{F}_{\alpha, \beta}(\nu)$, which equals $\mathrm{id} - \frac{1}{n} \sum_{i=1}^n \nabla \varphi_{\nu}^{\mu_i}$, vanishes, implying that $\nu$ is a barycenter. This justifies that our convergence result to a first-order stationary point solves the global optimization problem. This discussion was provided in Remark 3.7.

A simple example with known ground truth is provided in Appendix, Subsection C.1, where the barycenter of four uniform round disks is computed under the same setting as in Examples 1 and 2. The true barycenter is known to be a round disk in the middle. Our method performs uniformly the best in recovering the ground truth.

How does the dimension influence the algorithm, and can more numerical studies be carried out on 3D cases?

Yes, you are right. A key factor impacting computation speed is the total number of grid points, which grows exponentially in the dimension and constitutes the major computational burden in higher-dimensional cases. One solution is to use parallel computing on HPC clusters to accelerate via the decoupled dual programs. Reviewer 9GjF raised the same question and asked for a 3D example. Please find our reply therein.

How does the grid resolution influence the proposed algorithm? What is the sample complexity if we hope to extend the method to continuous distributions?

A larger number of grid points leads to more accurate approximations but also increases computational cost, creating a tradeoff. Compared to existing methods in the literature, we demonstrate that a grid size of 500 or 1000 achieves the highest accuracy with the least computation time in Examples 1 and 2. Sample complexity to accommodate discretization error is a challenging question. Given the stability estimate (Lemma 3.1), we conjecture that the WDHA sample complexity depends on the smoothness of the continuous dual potentials and the domain dimension $d$ in a nonparametric fashion.

Please clarify the technical contribution for the theoretical results.

We acknowledge that Lemma 3.1 is derived based on the results of [Jacobs & Leger]. However, the other foundational Lemmas 3.2, 3.4, and 3.5 are novel and original. Notably, the bound shown in Lemma 3.4 (see Eqn. (6)) improves upon existing results in the literature regarding the quantitative stability of OT maps. This is achieved by restricting $\varphi$ to be $\alpha$-strongly convex and $\beta$-smooth. Lemma 3.2 establishes the first variation of $\max_{\varphi \in \mathbb{F}_{\alpha, \beta}} \mathcal{I}_{\nu}^{\mu}(\varphi)$, which is both original and nontrivial due to the presence of the maximization. These lemmas serve as fundamental building blocks for Theorem 3.6, integrating techniques from applied mathematics, OT theory, and optimization in Euclidean space.

Review (Rating: 4)

This work introduces a coordinate optimization algorithm for the computation of Wasserstein barycenters for probability measures discretized on compact Euclidean domains (such as images). By reformulating the traditional barycenter problem as a nonconvex-concave minimax optimization problem and employing a gradient descent-ascent (GDA) approach, the authors alternate between updates of the barycenter atoms in the Wasserstein space and updates of the n Wasserstein dual maximizers (Kantorovich potentials) in the Sobolev space. The Sobolev gradient, under reasonable assumptions, allows for faster numerical convergence and more precise barycenters. Multiple theoretical analyses highlight the properties of the proposed approach, the Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent algorithm (WDHA), and 2-dimensional examples demonstrate that WDHA outperforms state-of-the-art OT algorithms.

Questions for Authors

  1. It is not straightforward to me why Problem (4) is nonconvex (w.r.t. the barycenters) and concave (w.r.t. the potentials). Could you please provide deeper details for these claims? Moreover, it is not clear whether the parameters alpha and beta, defining the subset of the Sobolev space, are fixed a priori, related (in some sense) to the input data, or optimized. Could you clarify this point?
  2. In Sec. 3.5, to reduce the computational cost you suggest replacing the dual potential projections with computing their second convex conjugate. Did you run experiments to check the difference between the two versions of the algorithm for 2D cases (with visualizations)? For example, how does the barycenter change in the scenario of Example 1 when estimated using Algorithm 3?
  3. Is it possible to test the behaviour of the WDHA solution in a 3D context? For example, interpolating between shapes?

Claims and Evidence

Yes, all the claims presented in the paper are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The authors are clear about the context and limitations of the problem addressed and the proposed approach. They say, “the current approach is mainly limited to computing the Wasserstein barycenter of 2D or 3D distributions supported on a compact domain”. Moreover, comparison methods and evaluation criteria are coherent with the presented application (images). However, exploring the solution of such approach in 3D scenarios could be interesting.

Theoretical Claims

No

Experimental Design and Analysis

Yes, experiments 1 and 2 are consistent. As for experiment 3 (in Section C.1 and Figure 3), it seems that spatial notions are introduced in the OT problem: space, understood as the coordinates (x, y) of the four input distributions considered. I am not convinced that this scenario is coherent with the paper and the previous applications, where the distributions are discretized on m-point grids. Furthermore, the notion of space implies a natural embedding of the distributions as point clouds characterizing circles, and consequently the comparison with techniques such as CWB and DSB is unfair.

Supplementary Material

No

Relation to Prior Work

The proposed approach (the WDHA algorithm) extends the optimal transport (OT) literature in an innovative way and further expands the wide range of methodologies for computing barycenters of continuous probability measures. Although the Wasserstein barycenter problem has been deeply studied, under different frameworks (discrete and continuous cases, addition of entropic regularizations, etc.) and across a wide range of domains (images, shapes, high-dimensional distributions, etc.), the WDHA algorithm establishes a direct connection between "classical" OT optimization and differential optimization. The key step of this approach is to pass from solving n independent OT problems (primal problems), where n is the number of input measures, to solving the respective dual problems in a Sobolev space. Updating the dual potentials via gradient ascent with the homogeneous Sobolev gradient leads to a smaller computational cost per iteration, which is very interesting in the case of high-resolution images.

Missing Important References

To the best of my knowledge, no. The bibliography is comprehensive and sufficient for covering the different areas of interest discussed in the article. Moreover, accurate and consistent citations are provided for the general understanding.

Other Strengths and Weaknesses

This work incorporates modern gradient-based optimization techniques with classical optimal transport problems in an original way. The switch to the computation of gradients of the dual potential in the Sobolev space is very interesting and proves to be effective for a more precise and detailed barycenter. I am very interested to see the potential applications of this approach, as it opens up new avenues for the application of modern OT techniques in machine learning.

Other Comments or Suggestions

List of minor errors:

  • Throughout the paper, there are numerous references to Theorems, Lemmas and Algorithms (introduced in the paper) that are inconsistent. For example, Algorithm 4 is referred to in the experimental section as Algorithm 3.5. The same applies to the theoretical statements on pages 5 and 6.
  • I assume that the method used for the computation of the second barycenter in Figure 2 (row 2, column 2) is CWB. WSB has never been defined.
  • Page 4, Line 215, l.h.s. : Missing an "S" for “(S)et”
  • Page 4, Line 212, r.h.s.: First equation in Sec. 3.3, "The dual formulation of the Kantorovich problem implies": the min should be a max (am I right?)
  • Page 5, Line 229, r.h.s. : Error in writing the functional “I_{alpha, beta}” should be “I_{nu}^{mu}”
Author Response

Thanks for your constructive comments and careful reading. We will fix minor errors in the revision. Please find our response to your main questions below.

As for experiment 3, it seems that spatial notions are introduced in the OT problem: space, understood as the coordinates (x, y) of the four input distributions considered ... the comparison with techniques such as CWB and DSB is unfair.

Experiments 1-3 were arranged on the same $1024 \times 1024$ grid, except that they use uniform distributions with different supports/shapes. We agree that the spatial notion of placing the shapes (e.g., centers of disks/circles) generally implies an embedding of distributions. However, in our experimental setting, the 4 shapes were arranged equilaterally in the corners around the grid center. This means that the coordinate information $(x, y)$ does not impact the barycenter. For instance, we can move all shapes towards the center of the square grid domain (squeezing), and the barycenter does not change. Effectively, all 4 input distributions sit in a common domain centered around $(0,0)$, and the problem can be seen as a discretization on a fixed grid. Thus the comparison with CWB and DSB is fair in this particular configuration. To avoid confusion, we will revise line 911 to "Here, the goal is to compute the barycenter of four uniform distributions, each supported within $[0,1]^2$ and taking the shape of a round disk with a radius of 0.15, centered equilaterally at $(0.2, 0.2), (0.2, 0.8), (0.8, 0.2), (0.8, 0.8)$, respectively. Their densities are discretized on a $1024 \times 1024$ grid."

It is not straightforward to me why Problem (4) is nonconvex (w.r.t. the barycenters) and concave (w.r.t. the potentials).

We first provide some insights about the nonconvex-concave structure of the barycenter problem. As mentioned in our reply to Reviewer 9GjF, the barycenter problem is defined as $\min_\nu \sum_{i=1}^n \alpha_i W_2^2(\nu, \mu_i)$, where the objective function is geodesically nonconvex in $\nu$ for $n \geq 3$ (in fact, it is concave over Gaussians!). However, the barycenter objective is convex in the linear structure (cf. our reply to Reviewer sfZf). This is connected to the concavity of the dual program in the potentials, whose $L_2$ gradients are key to defining the Wasserstein gradient of $\mathcal{J}$. Hence, the Wasserstein barycenter functional above can be recast into Problem (4) as a primal-nonconvex and dual-concave program in two different geometries. Below we give more rigorous details about this structure.

Strong concavity in $\varphi$. Lemma 3.1 implies that $\mathcal{I}_\nu^\mu(\varphi)$ is (strongly) concave on the function space $\mathbb{F}_{\alpha,\beta}$ w.r.t. the $\dot{\mathbb{H}}^1$-norm when $\mu$ admits a density function bounded away from zero and uniformly upper bounded. As the objective functional $\mathcal{J}(\nu, \varphi) = \frac{1}{n}\sum_{i=1}^n \mathcal{I}_\nu^{\mu_i}(\varphi_i)$ is separable in the potentials, we know that $\mathcal{J}(\nu,\varphi)$ is strongly concave w.r.t. $\varphi$.

Nonconvexity in $\nu$. Recall that $\mathcal{I}_\nu^\mu(\varphi) = \int \big(\frac{\|x\|_2^2}{2} - \varphi(x)\big)\, d\nu + \int \big(\frac{\|x\|_2^2}{2} - \varphi^\ast(x)\big)\, d\mu$, where $\varphi^\ast$ is the convex conjugate of $\varphi$. Only the first integral in the above definition depends on $\nu$. It can be shown that the functional $\nu\mapsto\int \big(\frac{\|x\|_2^2}{2} - \varphi(x)\big)\, d\nu$ is geodesically convex if and only if the function $\frac{\|x\|_2^2}{2} - \varphi(x)$ is convex, which cannot be guaranteed in our WDHA algorithm.
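For concreteness, a minimal instance (our own illustration, not from the paper) of how this criterion can fail: take $\varphi(x) = \frac{\beta}{2}\|x\|_2^2$ with $\beta > 1$, so that
$$\frac{\|x\|_2^2}{2} - \varphi(x) = \frac{1-\beta}{2}\|x\|_2^2$$
is strictly concave; along any Wasserstein geodesic $\nu_t = \big((1-t)\,\mathrm{id} + t\,T\big)_{\#}\nu_0$, the map $t \mapsto \int \big(\frac{\|x\|_2^2}{2} - \varphi(x)\big)\, d\nu_t(x)$ is then concave rather than convex.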

It is not clear whether the parameters alpha and beta, defining the subset of the Sobolev space, are fixed a priori, related (in some sense) to the input data, or optimized.

Parameters $\alpha$ and $\beta$ are fixed: they are used for the theoretical purpose of the convergence analysis of the WDHA algorithm. In practice, we suggest using Algorithm 4, which takes the conjugate envelope (i.e., double convex conjugate) at each iterate; this helps reduce the computational time compared to the projection. We verified on a numerical example that the practical differences between the small-$\alpha$-large-$\beta$ projection and the conjugate envelope are small (see also our reply to Reviewer rvS7 on Experimental Designs).
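To make the conjugate-envelope step concrete, here is a minimal 1D sketch (our own illustration, with a made-up nonconvex profile and a brute-force $O(m^2)$ transform rather than the fast conjugate used in the paper): the double discrete Legendre transform $f^{**}$ returns the convex envelope of $f$.

```python
import numpy as np

def discrete_legendre(values, primal_grid, dual_grid):
    # f*(s_j) = max_i (primal_grid[i] * s_j - values[i]); brute force O(m^2).
    return np.max(np.outer(dual_grid, primal_grid) - values[None, :], axis=1)

# Toy nonconvex "potential": a double well on [-2, 2].
x = np.linspace(-2.0, 2.0, 401)
f = (x**2 - 1.0) ** 2
slopes = np.linspace(-30.0, 30.0, 1201)             # dual grid of slopes

f_star = discrete_legendre(f, x, slopes)            # first conjugate
f_envelope = discrete_legendre(f_star, slopes, x)   # f** = convex envelope of f
# f_envelope is approximately 0 on [-1, 1] and close to f outside that interval.
```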

You suggest replacing the dual potential projections with computing their second convex conjugate. Did you run experiments to check the difference between the two versions of the algorithm for 2D cases (with visualizations)?

Reviewer rvS7 raised the same question. Please find our reply therein.

Is it possible to test the behaviour of the WDHA solution in a 3D context?

Reviewer 9GjF raised the same question and asked for a 3D example. Please find our reply therein.

Review (Rating: 3)

This paper introduces the Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent (WDHA) algorithm, a primal-dual method for computing the exact Wasserstein barycenter with nearly linear time and space complexity. WDHA alternates between Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. The algorithm outperforms existing Sinkhorn-type methods in terms of computational efficacy, scalability, and accuracy, as demonstrated on high-resolution 2D synthetic and real data.

Questions for Authors

Is this algorithm limited to barycenter weights set to $1/n$? How would it handle other barycenter weight settings?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes, the proposed algorithm seems to work according to Figure 1.

Theoretical Claims

Yes, I have checked the proof of Theorem 3.6 in Appendix B.5.

Experimental Design and Analysis

Yes, the performance is shown in Figure 1, and the running time and time complexity are given in Sections 3.5 and 4.

Supplementary Material

Yes, the code is given in Supplementary Material.

Relation to Prior Work

This paper proposes a GPU-friendly primal-dual descent method for exact Wasserstein barycenter calculation, whereas previous works mainly focus on regularized barycenter calculations, as shown in [1,2,3].

[1] Iterative Bregman projections for regularized transportation problems, 2016

[2] Fast optimal transport averaging of neuroimaging data

[3] Fast computation of Wasserstein barycenters

Missing Important References

I think [1] should be discussed: one can resort to first-order methods such as subgradient descent on the dual for exact barycenter calculations.

[1] Numerical methods for matching for teams and Wasserstein barycenters

Other Strengths and Weaknesses

Strengths: The writing is good, the idea is interesting, and the theory is complete, with full proofs provided for every lemma and theorem. Additionally, the authors have provided a clear and complete algorithm along with its code.

Weaknesses: The experiments seem too simplistic.

Other Comments or Suggestions

I suggest the authors include more comprehensive experiments.

Author Response

Thanks for your constructive comments. Please find our response to your questions below.

I think [1] should be discussed: one can resort to first-order methods such as subgradient descent on the dual for exact barycenter calculations.

Thanks for pointing out this relevant first-order method; we will discuss the paper [1], "Numerical methods for matching for teams and Wasserstein barycenters," in the revised version.

The experiments seem too simplistic... I suggest the authors include more comprehensive experiments...

We extended our WDHA implementation to handle 3D distributions and conducted an example in this setting. The goal is to compute the barycenter of three uniform densities supported on different ellipsoid shapes contained in $[0,1]^3$. Our simulation setup is $\frac{(x-0.5)^2}{a^2} + \frac{(y-0.5)^2}{b^2} + \frac{(z-0.5)^2}{c^2} \leq 0.8^2$, where the parameters are

$$(a,b,c) \in \{ (0.25, 0.25, 0.5),\ (0.5, 0.25, 0.25),\ (0.25, 0.5, 0.25) \}.$$

These three densities are discretized on a grid of size $200 \times 200 \times 200$. In addition, we include a weighted barycenter setting, where the weights are set to $(1/4, 1/4, 1/2)$ (see below for the extension to the weighted barycenter). We ran our algorithm on a cluster for 2000 iterations, and it took 5 hours to compute the barycenters. The figure showing the barycenters computed by our method can be accessed via the anonymous link https://drive.google.com/file/d/1k5Yg4YF9FzkEOi7PC03jep5coXKNb2Vz/view?usp=sharing. Moreover, we note that the current version of the Python package POT does not support the computation of barycenters for 3D distributions.
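For readers who want to reproduce the 3D setup described above, a minimal NumPy sketch (our own, not the authors' code) that discretizes the three uniform ellipsoid densities on a regular grid in $[0,1]^3$ is given below; a smaller grid than $200^3$ is used so that it runs quickly.

```python
import numpy as np

m = 100                                    # grid points per axis (200 in the rebuttal)
ax = (np.arange(m) + 0.5) / m              # cell centers in [0, 1]
x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")

semi_axes = [(0.25, 0.25, 0.5), (0.5, 0.25, 0.25), (0.25, 0.5, 0.25)]
densities = []
for a, b, c in semi_axes:
    # Indicator of the ellipsoid (x-0.5)^2/a^2 + (y-0.5)^2/b^2 + (z-0.5)^2/c^2 <= 0.8^2.
    inside = ((x - 0.5) ** 2 / a**2 + (y - 0.5) ** 2 / b**2 + (z - 0.5) ** 2 / c**2) <= 0.8**2
    rho = inside.astype(np.float64)
    densities.append(rho / rho.sum())      # uniform probability mass on the support

weights = (0.25, 0.25, 0.5)                # weighted-barycenter coordinates used above
```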

Is this algorithm limited to barycenter weights set to $1/n$? How would it handle other barycenter weight settings?

Our algorithm can be generalized to the computation of the weighted Wasserstein barycenter with any barycentric coordinates $(\alpha_1, \dots, \alpha_n)$ on the probability simplex:

$$\overline{\nu} = \arg\min_\nu \sum_{i=1}^n \alpha_i W_2^2(\nu, \mu_i).$$

Precisely, we can replace the $\dot{\mathbb{H}}^1$-gradient of $\mathcal{J}$ with $\nabla_{\varphi_i} \mathcal{J}(\nu, \varphi) = \alpha_i (-\Delta)^{-1}\big(-\nu + (\nabla \varphi_i^{*})_{\#} \mu_i\big)$ and replace the Wasserstein gradient of $\mathcal{J}(\nu,\varphi)$ with $\mathrm{id} - \sum_{i=1}^n \alpha_i \nabla\varphi_i$.
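As a loose illustration of this update direction (our own sketch, not the paper's implementation), the snippet below applies $(-\Delta)^{-1}$ spectrally on a periodic 2D grid to a synthetic residual $-\nu + (\nabla\varphi_i^{*})_{\#}\mu_i$; the periodic boundary condition, the grid size, and the random densities are all assumptions made for illustration, and the paper's compact-domain setting may handle the Poisson solve differently.

```python
import numpy as np

def inv_neg_laplacian_periodic(g, h=1.0):
    # Spectral solve of (-Delta) u = g on a periodic 2D grid (zero-mean right-hand side).
    n0, n1 = g.shape
    k0 = 2.0 * np.pi * np.fft.fftfreq(n0, d=h)
    k1 = 2.0 * np.pi * np.fft.fftfreq(n1, d=h)
    ksq = k0[:, None] ** 2 + k1[None, :] ** 2
    g_hat = np.fft.fft2(g - g.mean())      # project out the constant (zero-frequency) mode
    u_hat = np.zeros_like(g_hat)
    nonzero = ksq > 0
    u_hat[nonzero] = g_hat[nonzero] / ksq[nonzero]
    return np.real(np.fft.ifft2(u_hat))

# Synthetic stand-ins for the barycenter density nu and the pushforward density
# (grad phi_i^*)_# mu_i, both discretized on the same 128 x 128 grid.
rng = np.random.default_rng(0)
nu = rng.random((128, 128)); nu /= nu.sum()
pushforward = rng.random((128, 128)); pushforward /= pushforward.sum()

alpha_i = 0.25                             # barycentric weight of the i-th measure
ascent_direction = alpha_i * inv_neg_laplacian_periodic(-nu + pushforward)
```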

Final Decision

In the paper, the authors propose Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent (WDHA), a nearly linear time and linear space complexity algorithm, for computing the exact barycenter when the input probability measures are discretized on an $m$-point grid.

All the reviewers are positive about the contributions of the paper, including: (1) the novelty of the proposed WDHA algorithm, which has solid theoretical guarantees; (2) the experiments, which are sufficiently extensive and appropriate to evaluate the proposed algorithm; and (3) the good writing and presentation of the paper. After the rebuttal, most of the reviewers' concerns were addressed, and all the reviewers are happy with the current state of the paper.

In my opinion, the contributions and originality of the proposed algorithm WDHA are sufficient for acceptance at ICML. Therefore, I recommend accepting it in its current form. However, I encourage the authors to address the reviewers’ suggestions and integrate their feedback into the camera-ready version of their paper.