6.3

/10

Rejected4 位审稿人

最低4最高7标准差1.3

3.8

置信度

正确性3.5

贡献度3.0

表达2.8

NeurIPS 2024

Partial Gromov Wasserstein Metric

Yikun Bai,Rocio Diaz Martin,Abihith Kothapalli,Hengrong Du,Xinran Liu,Soheil Kolouri

OpenReview PDF

提交: 2024-05-15更新: 2024-11-06

TL;DR

We introduce a novel partial Gromov-Wasserstein metric for unbalanced settings

摘要

关键词

Optimal transportmachine learning

评审与讨论

审稿意见

评分: 7置信度: 42024-07-13

This paper introduces the an unbalance Gromov-Wasserstein distance, which adopted the formulation of unbalance optimal transport into the Gromov-Wasserstein with total variance (TV) penalty instead of KL. This new distance allows comparison of probability measures from different space with partial amount of mass, and so they name it Partial Gromov-Wasserstein (PGW). They further prove the metric properties of this distance and proposed two algorithms to solve PGW: Frank-Wolfe algorithms and Line search method. In experiment section, they test PGW with different tasks including: shape-matching, shape retrieval and barycenter problem.

优点

This paper fills a gap in the topic of unbalance (GW) problems by introducing a TV-relaxed unbalanced GW, they call it Partial Gromov-Wasserstein (PGW). This work shows the theoretic metric properties of this distance with a solid proof, making a contribution within this topic. They proposed two solvers to compute this distance and show diverse experimental applications. The experiment results show that this new variance of unbalance GW performs effectively with data containing outliers, aligning with the anticipated performance of unbalanced Wasserstein or unbalanced Gromov-Wasserstein methods on noisy data.

缺点

It's worth to note in literature review the similar works formulating partial Waserstein as a metric with TV constraints [1] [2].
The proof of the metric properties is not well displayed in the main text, as this is the main highlight of this work.
Further comparison with KL version was lack as regards to robustness against outliers. And also, further discussion on the scalability (very large dataset) will be appreciated.
In experiment section, the selection of hyperparameter is not clear. The justification of choosing $\lambda$ value for both PGW and MPGW method was not presented.

[1] Raghvendra, Sharath, Pouyan Shirzadian, and Kaiyi Zhang. "A New Robust Partial $p$ -Wasserstein-Based Metric for Comparing Distributions." arXiv preprint arXiv:2405.03664 (2024).

[2] Nietert, Sloan, Rachel Cummings, and Ziv Goldfeld. "Robust estimation under the Wasserstein distance." arXiv preprint arXiv:2302.01237 (2023).

问题

Could you add the parameter sensitivity analysis in experiment section?
The table under line 271 doesn't have a title, and in (a) the performance of MPGW is $0$ on Dataset II which looks abnormal to me. Is this the actual experiment result or just a typo?

局限性

None

作者回复

2024-08-07

It's worth noting in the literature review the similar works formulating...

Answer: The authors thank the reviewer for their point. These two references will be added in the Introduction.

The proof of the metric properties is not well displayed in the main text, as this is the main highlight of this work.

Answer: The proof of the metric property is proposed in Appendix D. We apologize that we cannot present it in the main text due to the page limit. The formal statement is based on the definition of equivalence classes among mm-spaces (Remark D.1), and the main technique of the proof relies on the relation between PGW and GW (see Appendix D.3 where we prove the triangle inequality). Due to the page limit, we cannot present these concepts in the main text, but we have added these comments in the main text (right before the statement of Proposition 3.4) to give the reader some insights about the main ideas behind the proof.

Further comparison with KL version was lacking...

Answer:

Comparison with the KL version: In Experiment 5.1, we explicitly compare PGW and UGW (which uses the KL divergence) in outlier detection. Both UGW and PGW can detect outlier points when parameters are suitably selected. However, since UGW utilizes the Sinkhorn solver, it tends to favor a mass-splitting transportation plan. As a result, we can still observe some matching between clean data and outlier data. This phenomenon is also observed implicitly in other experiments.
Scalability: In Appendix K, we will include the time complexity of our algorithms, which is derived from our convergence analysis in the same appendix. In particular, we refer to the "computational complexity" section in the author rebuttal for details.

This can be further improved if the gradient computation step (optimal partial transport solving step) is replaced by the Sinkhorn method:
$\mathcal{O}\left(\frac{\max^2(2L_1, \max(|\mu|, |\nu|) \max(2|C^X|^2+ 2|C^Y|,2\lambda))n^2}{\epsilon^2} \frac{1}{\epsilon} \ln(n) n^2\right).$

In the experiment section, the selection of hyper-parameters is not clear.

Answer: We refer to the "parameter selection" section in the author rebuttal for a discussion of the parameter settings of the PGW method.

In the three experiments presented in the main text, we require full matching for the measure that has less mass. Thus, we set $\lambda$ to be sufficiently large (see Lemma E.2). Similarly, $\rho$ is set to be $\min(|\mu|,|\nu|)$ in MPGW. Additionally, we set $\rho_1=\rho_2$ to be suitably large for UGW. In PU learning experiment, presented in appendix H, $\lambda$ in PGW and $\rho$ in MPGW are required to be set to be sufficiently large. In UGW, we test different $\rho_1,\rho_2$ such that the transported mass equals the prior $\pi=0.2$ .

In particular:

In the point cloud matching experiment (Section 5.1 and Appendix M.2), we explain how $\lambda$ (for PGW), $\rho_1, \rho_2$ (in UGW), and $\rho$ (for MPGW) are selected in lines 1077-1081.
In the point cloud interpolation experiment (Section 5.3 and Appendix M.1), we explain how $\lambda$ is selected in line 1043.
In the shape retrieval experiment (Section 5.2 and Appendix N), we explain the setting of $\lambda$ (for PGW), $\rho_1, \rho_2$ (for UGW), and $\rho$ (for MPGW) in lines 1093-1097.
In the PU learning experiment (Section P), we explain the settings of $\lambda$ (for PGW) and $\rho$ (for MPGW) in lines 1198-1202. The setting of $\rho_1, \rho_2$ for UGW is explained in lines 1189-1190.

Could you add the parameter sensitivity analysis in the experiment section? The table under line 271 doesn't have a title, and in (a) the performance of MPGW is on Dataset II which looks abnormal to me. Is this the actual experiment result or just a typo?

Answer:

We've add one section to explain the parameter sensitivity in the experiments into the paper. In particular, $\lambda$ in PGW, $\rho$ in MPGW, $\rho_1,\rho_2$ in UGW has a (sufficient) threashold, whenever the parameter is greather than the threashold, the performance demonstrated in this paper will be achieved.
We apology for the title missing of the table below 271. The title should be "accuracy and wall-clock comparision in shape retrival problem." and it has been added.

2024-08-13

I appreciate your rebuttal. Your response addressed my concerns. I think this work is worth for publication and so I raise my score.

评论- Comment

2024-08-14

Thank you for your comments. These two references will be added. We will also make other modifications (e.g. a section about parameter selection/sensitivity) based on the reviewer's comments.

审稿意见

评分: 7置信度: 32024-07-15

The paper considers an unbalanced version of the Gromov-Wasserstein distance, where the discrepancy terms correspond to the total variation between certain product measures for the given marginal and the marginal of the solution, respectively. Different (re)formulations of this distance is considered in both the discrete case and for general measures, existence of optimal solution and metric properties are shown for certain cost functions, as well as numerical methods for computing (local) optimal solutions. Finally a number of numerical experiments are considered.

The paper is well written and extensive. In the main paper the results are stated and with proofs in the appendix.

优点

The paper is strong in terms of both content and the presentation. In particular the results about the metric properties of the PGW problem. It is also a quite complete paper in terms that is contains substantial results on theory, computational methods, and numerical simulations.

缺点

One weakness with the methods in the paper is that it only provides local optimal solutions. This is a problem with many non-convex problems, but in some cases global solutions can be guaranteed. In the abstract it is stated the methods solve the PGW problem. This is probably to strong statement since they are not guaranteed to converge to the global solution.

问题

In the abstract it is stated the methods solve the PGW problem. This is probably to strong statement since they are not guaranteed to converge to the global solution.
I think that some early references on the partial/unbalanced OT problem are missing. Neither of the following two papers are mentioned in the introduction. Both papers contain the problem formulation (3).

Georgiou, Tryphon T., etal. "Metrics for power spectra: an axiomatic approach." IEEE Transactions on Signal Processing 57.3 (2008): 859-867.

Piccoli, Benedetto, and Francesco Rossi. "Generalized Wasserstein distance and its application to transport equations with source." Archive for Rational Mechanics and Analysis 211 (2014): 335-358.

Some minor typos: line 174: he --> the line 175: con --> cost line 574: spit --> split

局限性

作者回复

2024-08-07

One weakness with the methods in the paper...

Answer: The authors apologize for the misunderstanding regarding the English term "Partial Gromov-Wasserstein solver" in the main text. As discussed in the paper, GW and its variants UGW/MPGW/PGW are non-convex problems. To the best of our knowledge, both classical algorithms (Frank-Wolfe algorithm) and the Sinkhorn algorithm converge to local minima rather than global minima. Thus, in the optimal transport community, it is generally accepted to say these methods are "solvers" or that these methods "solve the GW/UGW/MPGW problem", rather than stating "these methods are computational algorithms that achieve a local minimum" (see, e.g., [45] abstract and introduction section; [44] section 3.1). We have added a footnote in the introduction section to explain this, maintaining both readability and rigor, and ensuring consistency with previous works.

In the abstract it is stated the methods solve the PGW problem...

Answer: The claim that our method solves the PGW problem follows the convention of English explanations in previous works [44], [45]. We've added a footnote in the introduction section to maintain consistency with related works and provide a rigorous explanation. See the answer for weaknesses for details.

I think that some early reference...

Answer: These two references will be added to the paper. Equation (3), Optimal partial transport, is a classical unbalanced OT, and many works discuss its theory and applications. We apologize for only citing the classical reference from the authors who introduced this concept (Mccann and Figalli) initially.

Some minor typos...

Answer: The authors thank the reviewer, and we have fixed the typos.

2024-08-13

Thank you for your rebuttal and the additional clarifications.

About the comment: "We apologize for only citing the classical reference from the authors who introduced this concept (Mccann and Figalli) initially."

No need to apologize. However, note that the reference above introduced the same "optimal partial transport" formulation (3) and was published 2 years before the earliest references on this that was provided in the paper.

评论- comment

2024-08-14

Thank you for your comments. These two references will be added to the paper. We will also make other modifications based on the reviewer's comments.

审稿意见

评分: 7置信度: 32024-07-16

This paper introduces the Partial Gromov-Wasserstein (PGW) metric as a means to handle unbalanced Gromov-Wasserstein problems between non-probability mm-spaces. The authors develop and demonstrate two computationally efficient variants of the Frank-Wolfe algorithm for solving the PGW problem. They establish that PGW is a well-defined metric, providing theoretical proofs and applications in shape matching, shape retrieval, and interpolation. The metric and algorithms are validated through numerical experiments against established baselines, showcasing improved results in handling outlier data with a robust performance.

优点

The paper presents a novel approach to defining a metric in the context of unbalanced Gromov-Wasserstein, which has been a challenging issue in the field.
The quality of the research is high, evident from rigorous theoretical developments, proofs, and comprehensive experiments that validate the effectiveness of the PGW metric in practical applications.
The paper is well-organized, with clear explanations of complex concepts. The use of examples and experimental results helps in understanding the practical implications and advantages of the PGW metric.

缺点

While the paper provides a comparison with existing methods, it could benefit from a broader range of comparative baselines, especially newer techniques that might provide a stiffer benchmark.
The paper does not extensively discuss the sensitivity of the PGW algorithm to the choice of its parameter (e.g., the regularization coefficient - lambda), which is crucial for understanding its robustness and adaptability in diverse real-world scenarios.
There is limited discussion on the scalability (or the time complexity) of the proposed methods, particularly in high-dimensional or large-scale settings, which is vital for their applicability in big data applications.

问题

The regularization coefficient lambda is referenced in Equations (9), (10), and (14) of $\tilde{M}$ . However, it appears that two of the algorithms do not explicitly incorporate lambda. Could you clarify whether these algorithms require the use of lambda?
If lambda is indeed utilized in these algorithms, how does the selection of lambda influence the performance of the PGW metric? Are there guidelines or methods for choosing this parameter optimally based on the dataset characteristics?
In practical applications, especially in high-dimensional spaces, what are the computational limitations of the proposed algorithms, and how might these be mitigated?
Are there potential extensions of the PGW metric that could handle non-metric spaces, given the increasing interest in such spaces in various applications?

局限性

Including limitations on the scalability and time complexity, such as the maximum solvable problem size within one hour, would be beneficial for its applications.

作者回复

2024-08-07

While the paper provides a comparison with existing methods...

Answer: The baseline methods we selected follow from the paper [45] [44] and Beier et al., 2023. As the main topic of this paper is to introduce a partial GW formulation that defines a metric for two positive measures in different spaces, we only implemented our method in the classical experiments from related works and compared our method with existing baselines in these papers.

The paper does not extensively discuss ...

Answer: The authors agree that the choice of $\lambda$ is important, whether in the classical partial OT setting or the partial GW setting. We refer to the parameter selection section in AU for details. This section will be added to the paper.

There is limited discussion on the scalability...

Answer: In Section O, we provide the wall-clock time comparison for the PGW solvers, with the size of tested data varying from 100 to 10,000. In Section K, we've discussed the convergence rate of the proposed algorithms. We refer to the "computational complexity" section and Equation (TC) in the Author Rebuttal for the theoretical time complexity. We will add this conclusion as a direct result in Appendix K and the main text.

The regularization coefficient lambda is referenced in Equations (9), (10), and (14)...

Answer: In Algorithm 1, $\lambda$ is incorporated in the gradient computation. See Equations (59) and (60) in Step 1. Additionally, $\lambda$ is incorporated in the line search step; see Equations (65) and (66).

Similarly, in Algorithm 2, $\lambda$ is applied in the gradient computation step (see Equation (61)) and the line search step (see lines 886-887).

If lambda is indeed utilized in these algorithms...

Answer: Selection of $\lambda$ highly depends on the task where PGW is applied. When full matching for one measure is desired, by Lemma E.2, we should set $\lambda$ to be sufficiently large (i.e., $2\lambda=\max( (C^X)^2+(C^Y)^2)$ ). We refer to the "parameter selection" section in the author rebuttal for details.

Intuitively, $2\lambda$ plays the role of an "upper bound" on transportation cost. Suppose $(|x_1 - x_2|^2 - |y_1 - y_2|^2)^2 > 2\lambda$ ; in this case, we will not apply the pairing $x_1 \to y_1, x_2 \to y_2$ or $x_2 \to y_1, x_1 \to y_2$ .

If we aim to transport less mass and destroy/create more mass, a smaller $\lambda$ is required. If we aim to transport more mass, we need to set a larger $\lambda$ .

In practical applications, especially in high-dimensional spaces...

Answer: To the best of our knowledge, the computation of GW/MPGW/UGW/PGW only depends on the size of the dataset and is independent of the dimension of the space. The dimension of the space may affect the computation of the cost matrices $C^X$ and $C^Y$ ; however, theoretically, the computation cost in this step is always $O(n^2 + m^2)$ .

Are there potential extensions of the PGW metric that could handle non-metric spaces...

Answer: In general, GW/UGW/MPGW/PGW can be defined in a gauged measure space, where the gauge function (symmetric, L2 function) is a generalized version of the metric function. The main idea of GW/MPGW/UGW/PGW is to adapt the similarity of each space to build a measure that describes the similarity between data in two different spaces. We require a structure to describe the similarity for each pair of points in each space. Such a structure can be a metric or a gauge mapping. If such a structure is missing, then it is beyond the scope of the GW method.

2024-08-14

Thank you for your detailed response. Since some of my concerns have been addressed, I will maintain my positive score.

评论- comment

2024-08-14

The authors thank the reviewer's comment.

The related results in "computational complexity" section will be added to the paper.

审稿意见

评分: 4置信度: 52024-07-23

This paper proposes a partial Gromov -Wasserstein (PGW) formulation, which relaxes the original constraints present in Gromov-Wasserstein (GW) formulation. In PGW, the marginal equality constraints of GW are replaced by marginal inequality constraints. Following existing works in partial GW setting, the paper additional imposes TV-based marginal regularization in the objective. The paper showed that the proposed PGW approach defines a metric between metric measure spaces. The PGW problem is solved via Frank-Wolfe (FW) and empirical results are shown on shape interpolation in the main paper.

优点

The paper explores partial mass transportation setting in the GW problem. As discussed around lines 149, an existing work [45] has also explored similar concepts in GW setting. [45], in particular, involves an additional hyperparameter (\rho) for overall mass of the learned transport plan. The proposed problem (16) as well as [45] employs FW algorithm.
- Proposition L.1 in the appendix shows that if \gamma is an optimal solution of proposed PGW problem, then \gamma is also an optimal solution of the mass constrained PGW problem (MPGW) of [45] with \rho=|\gamma|.
- It is unclear what the paper implies by "mathematically this equivalence relation is not verified." in line 955? How is the paper defining the term "equivalence" which is used multiple times in Appendix L.
- For a given \lambda=\lambda_0 in PGW, does there exist a \rho=\rho_0 for MPGW such that the set of first order critical points for PGW and MPGW are same?
- The steps of the FW algorithm for proposed PGW and MPGW [45] seem quite similar.

缺点

The paper is poorly written due to the following reasons:
- The abstract and introduction states that the paper propose two variants of FW algorithm for solving the proposed PGW. However, the main paper does not describe two variants of FW algorithm. Only one variant is discussed in Sections 4-5. The other variant is described only in Appendix G. If the algorithm Typically, the main paper should be self contained, having the necessary details of the contributions claimed in the abstract and introduction. It should be noted that going through the supplementary material is a reviewer's discretion.
- The paper provides very less discussion on how its differences with MPGW [45] in the main paper (lines 148-149). While Appendix L contains this discussion, important parts of it should be discussed in the main paper. As mentioned above, going through the supplementary material is a reviewer's discretion.
- There are multiple grammatical and spelling errors throughout the paper. Eg. he -> the, con -> cost, etc.
The main paper contained experiments only on synthetic dataset. Tables 2 and 4 in appendix discuss experiments on real-world datasets. MPGW obtains same generalization performance as PGW in both the tables. This should be discussed in the main paper as well.

问题

In (14), let \beta = minimum entry in cost tensor M. Then, what is the solution of PGW (16) when \beta > 2\lambda ?
Why does MPGW obtains 0.0813 and 0 mean accuracy in Table (a) while other methods get > 0.89 and > 0.78 scores, respectively? It should be noted that in Tables 2 and 4 in appendix, MPGW performs at par with PGW.

局限性

N/A

作者回复

2024-08-07

It is unclear what the paper implies by "mathematically this equivalence relation is not verified." in line 955?

Answer: The authors thank the reviewer's point. The sentence "mathematically this equivalence relation is not verified" will be removed. The section "relation between PGW and MPGW" in author rebuttal will be added into the paper to clarify the relation between PGW and MPGW.

For a given \lambda=\lambda_0 in PGW ....

Answer: Answer is No. See example 1 and "relation between PGW and MPGW" section in AR.

the steps of the FW algorithm...

Answer: The formulation PGW and MPGW are related and thus the FW iteration steps of the two problem are similar. They have difference in Gradient computation steps and line search step. The algorithms for PGW requires $\lambda$ in these two steps. The solver for $MPGW$ requires $\rho$ in the gradient computation step.

The paper is poorly written due to the following reasons: ...

Answer: The authors thank the reviewer for their points

We have not include the two variants of the FW solver in the main text due to the page limit. However, for clarity, in the abstract and in the introduction section, we have modified the text:
- Abtract: "We then propose variants of the Frank-Wolfe algorithm for solving the PGW problem."
- Introduction: line 74 "Based on this relation, we propose two (mathematical equivalent) variants of Frank-Wolfe algorithms for the discrete PGW problem. One is presented in the main text, and the other is discussed in Appendix G." We refer section "relation between the two FW algorithms" and "motivation of algorithm 2" in Author Rebuttal.
We refer "relation between the two FW algorithms" section in Author Rebuttal for details. This section will be added into the paper.

The paper provides very less discussion...

Answer: The modified statement of Proposition L.1, which demonstrates the relation between PGW and MPGW, will be added into the main text. Especially,
$PGW_\lambda(\mu,\nu)=MPGW_\rho(\mu,\nu)+\lambda(|\mu|^2+|\nu|^2-2|\rho|^2)$ where $\rho$ is determined by $\lambda$ .

There are multiple grammatical and spelling errors...

Answer:The authors thank the reviewer. We have fixed the typos and grammar mistakes.

The main paper contained experiments...

Answer:

Datasets: both synthetic and real datasets are applied in this paper:
- In the shape retrieval experiment (Section 5.2), the dataset is given in [51].
- In the point cloud interpolation experiment (Section 5.3), the dataset is from Github.
- In the positive unsupervised learning experiment (Section P), the datasets include MNIST, EMNIST, Amazon, Webcam, and DSLR (See [60]).
We refer "Positive label unsupervised learning" section in Author Rebuttal to see the explanation about why PGW, MPGW admit same performance in this particular task. The discussion will be added in the main text as a numerical evidence of Proposition L.1.

In (14), let \beta = minimum entry in cost tensor M.

Answer: Note, $M_{i,j,i',j'}=|C^X_{i,i'}-C^Y_{j,j'}|^2,\forall i,i'\in[1:n], j,j'\in[1:m]$ . Thus $\min M=|C^X_{1,1}-C^Y_{1,1}|^2=|0-0|^2=0$ . In this case, $\lambda=0$ . zero measure will be one optimal solution. Our algorithm will return $\gamma=0_{n\times m}$ . Run the code link for a numerical example.

Why does MPGW obtains 0.0813 and 0 mean accuracy in Table (a)...

Answer:

We refer "limitations of MPGW" section in Author Rebuttal to see the why MPGW does not has good performance in shape retrieval experiment. In one sentence, when one shape is similar to part of another shape (e.g. "square" is similar to part of "house"), MPGW value will be closed to 0.
For the PU learning experiment, (table (2)(4), we refer section "PU learning" section in Author Rebuttal for details. In this experiment, we only need transportation plans from the GW-based methods. In the experimental setting, full matching is required for PGW/MPGW, and based on Proposition L.1 and Lemma E.2, MPGW and PGW yield the same solution sets in this scenario.

评论- Response to authors

2024-08-11

I thank the authors for their rebuttal. Please find my (further) questions below.

Regarding global response Statement 1 example 1: Given $lambda=0$ , solve PGW and obtain the solution $\gamma_{PGW}$ . Then set $\rho=|\gamma_{PGW}|$ . Now, would the first order critical points of MPGW contain $\gamma_{PGW}$ ?
Regarding the authors' response "both synthetic and real datasets are applied in this paper:"

By the statement "In the shape retrieval experiment (Section 5.2), the dataset is given in [51].", are the authors claiming that this experiment use real-world dataset?
By the statement "In the point cloud interpolation experiment (Section 5.3), the dataset is from Github (link)", are the authors claiming that this experiment use real-world dataset?
Regarding experiments in Section P, I have already acknowledged that they are the only real-world experiments in the whole manuscript.

Regarding, "In (14), let \beta = minimum entry in cost tensor M", I agree with the authors that the optimal solution will be a zero measure if lambda is sufficiently large. However, does the author response contract their Lemma E.2.
In the appendix L.1 example, does $MPGW_\rho(X_i,X_j)$ = 0 for all values of rho? What about $\rho=|\gamma_{PGW}|$ ?

Overall, I have found the author response underwhelming. Hence, for now, I am reducing my score by one.

评论- answer to the further questions 4

2024-08-12

In the appendix L.1 example, does $MPGW_\rho(X_i,X_j)=0$ for all values of $\rho$ ? What about $\rho = |\gamma_{PGW}|?$

Answer: Yes, $MPGW_\rho(X_i,X_j)=0$ for all values of $\rho$ , including $\rho = |\gamma_{PGW}|$ .

Below, we provide the reasoning to clarify this point. In this example:

$\mathbb{X}^1=(\mathbb{R}^3,\|\cdot\|,\sum_{i=1}^{1000} \alpha \delta_{x_i}),$ $\mathbb{X}^2=(\mathbb{R}^3,\|\cdot\|,\sum_{i=1}^{800} \alpha \delta_{x_i}),$ $\mathbb{X}^3=(\mathbb{R}^3,\|\cdot\|,\sum_{i=1}^{400} \alpha \delta_{x_i}),$

where $\alpha=1e-3$ , $\lambda=1e+1$ , and $x_i \in [0,1]^2$ for all $i$ .

Let $\gamma_{1,2}$ be the optimal solution for $PGW_\lambda(X_1,X_2)$ , then we have $|\gamma_{1,2}|=0.8$ . Similarly, we define $\gamma_{1,3}$ and $\gamma_{2,3}$ , where $|\gamma_{1,3}|=0.4$ and $|\gamma_{2,3}|=0.4$ . The numerical results are consistent with our Lemma E.2. We can verify the following:

$MPGW_{0.8}(\mathbb{X}^1,\mathbb{X}^2)=0, \text{ where } |\gamma_{1,2}|=0.8,$ $MPGW_{0.4}(\mathbb{X}^1,\mathbb{X}^3)=0, \text{ where } |\gamma_{1,3}|=0.4,$ $MPGW_{0.4}(\mathbb{X}^2,\mathbb{X}^3)=0, \text{ where } |\gamma_{2,3}|=0.4.$

For the case of arbitrary $\rho > 0$ , we may first consider the problem $MPGW_{\rho}(\mathbb{X}^1,\mathbb{X}^2)$ . Let $\gamma^*=\sum_{i=1}^{800}\alpha\delta_{(x_i,x_i)}$ denote the transportation plan induced by the Monge mapping $x_i \mapsto x_i$ for all $i \in [1:800]$ . Then, $\frac{\rho}{|\gamma^*|}\gamma$ will be an optimal solution for $MPGW_{\rho}(\mathbb{X}^1,\mathbb{X}^2)$ . Consequently, $MPGW_{\rho}(\mathbb{X}^1,\mathbb{X}^2) = 0$ . We may apply similar reasoning to see that $MPGW_{\rho}(\mathbb{X}^1,\mathbb{X}^3),MPGW_{\rho}(\mathbb{X}^2,\mathbb{X}^3)=0$ for any $\rho\in[0,0.4]$ as well, where 0.4 is the largest value we can choose for $\rho$ in these two MPGW problems.

Run this link (cell 1 and cell 3) to see a numerical example. Note that, as we set the tolerance of PGW/MPGW algorithms to 1e-5, the values of $MPGW_{0.8}(\mathbb{X}^1,\mathbb{X}^2)$ , $MPGW_{0.4}(\mathbb{X}^1,\mathbb{X}^3)$ , and $MPGW_{0.4}(\mathbb{X}^2,\mathbb{X}^3)$ are of the order $\mathcal{O}(1e-5)$ , rather than exactly 0.

评论- answers to the further questions 1,2,3

2024-08-12

We appreciate the reviewer’s engagement with our work and the response to our rebuttal. While we respect the reviewer's perspective, we have concerns that the current evaluation may not fully reflect the scientific and technical merits of our paper. The original review did not identify any theoretical flaws, instead raising minor clarification questions that we have thoroughly addressed. We also recognize the reviewer's comments regarding the paper's organization and the request for additional experiments on real-world data. However, these concerns do not appear to undermine the fundamental theoretical contributions of our work, which we feel may have been overlooked. Furthermore, we would like to highlight that the scope of our numerical experiments is consistent with prior work on Gromov-Wasserstein (GW) and unbalanced GW. We would welcome further discussion rooted in specific scientific reasoning and detailed feedback, which would enhance the constructive nature of this review process.

"Regarding global response Statement 1 example 1:..."

Answer: Yes.

In this case, if we select the zero measure, which is an optimal solution for PGW, and set $\rho=|\gamma|=0$ , the global solution coincides with the first-order critical point of MPGW. In fact, the space $\Gamma_\leq^\rho(\mu,\nu)$ contains only a single element for MPGW.

Otherwise, as explained in the author rebuttal, if we choose $\gamma=\delta_{(x_i,y_j)}$ , where $i \in [1:n]$ and $j \in [1:m]$ , which is also an optimal solution for $PGW_0(\mathbb{X},\mathbb{Y})$ , and we set $\rho=|\gamma|=1$ , then $\gamma=\delta_{(x_i,y_j)}$ would be one of the first-order critical points of $MPGW_1(\mathbb{X},\mathbb{Y})$ and a solution to the MPGW problem.

"Regarding the authors' response "both synthetic and real datasets are applied in this paper..."

Answer: Our statement regarding our use of both synthetic and real-world data remains to be correct. To clarify:

In the shape retrieval experiment (Section 5.2), “Dataset I” is derived from [51] and “Dataset II” is entirely synthetic. Dataset I is also synthetic; we do not mean to claim otherwise, only to outline the sources of data in our experiments.
The data for the point cloud interpolation experiment comes from [41]. This dataset is also synthetic, but nevertheless used in prior GW papers for method evaluation.
MNIST and EMNIST are real-world yet small-scale problems commonly used for demonstration purposes, and the Amazon, Webcam, and DSLR are clearly real-world datasets.

"Regarding, "In (14), let \beta = minimum entry in cost tensor M", I agree with the authors that the optimal solution will be a zero measure if lambda is sufficiently large. However, does the author response contract their Lemma E.2.

Answer: No, it does not contradict with Lemma E.2.

Here, we point out that the reviewer might have misinterpreted our response or there might have been a confusion. The optimal solution will be a zero measure if $\lambda$ is sufficiently small (e.g., $\lambda \leq\min(M)$ ). When $\lambda$ is sufficiently large, based on our Lemma E.2., the $|\gamma|$ will achieve its maximum, that is $|\gamma|=\min(|\mu|,|\nu|)$ .

评论- Response to authors #2

2024-08-13

Answer: Yes. In this case, if we select the zero measure, which is an optimal solution for PGW, and set , the global solution coincides with the first-order critical point of MPGW. In fact, the space $\Gamma_{\leq}^{\rho}(\mu,\nu)$ contains only a single element for MPGW.

Then, in the setting of Example 1 of the global response, it does seem that the solution set of PGW and MPGW can be identical - with appropriate value of $\rho$ .

Our statement regarding our use of both synthetic and real-world data remains to be correct. To clarify:

From the author response, I think we are on the same page. In the original review, I had written - "The main paper contained experiments only on synthetic dataset. Tables 2 and 4 in appendix discuss experiments on real-world datasets. MPGW obtains same generalization performance as PGW in both the tables."

No, it does not contradict with Lemma E.2.

Please excuse my oversight. I had written incorrectly statement 3 in my previous response.

The response of one of my questions in my original review was a bit unclear. The question was "In (14), let \beta = minimum entry in cost tensor M. Then, what is the solution of PGW (16) when \beta > 2\lambda ?"

The solution of (16) seems to be zero when \beta > 2\lambda. If this is indeed the case, then the PGW distance between two distributions with marginals p and q, respectively, and with \lambda < \beta/2 will be \lambda(|q|^2 + |p|^2) (from 15) - which is a constant dependent on hyper-parameter and marginals. This restricts PGW's ability to distinguish the source distribution with target distributions of same marginal norm (|q|=|q_1|=|q_2|). No geometric information is being taken into account.

评论- answers to new comments 2

2024-08-13

Then, in the setting of Example 1 of the global response, it does seem that the solution set of PGW and MPGW can be identical - with appropriate value of $\rho$ .

Answer: There may have been some confusion here. In Example 1, the solution sets of PGW and MPGW are NOT identical for any $\rho\in [0, \min{(|\mu|,|\nu|)}]$ .

The solution set of $PGW_0(\mu, \nu)$ contains $\{\delta_{(x_i, y_j)} : i\in[1:n], j\in[1:m]\} \cup \{0\}$ .
When $\rho>0$ , the zero measure is a solution for $PGW_0(\mu,\nu)$ , but it is not a solution for $MPGW_\rho(\mu,\nu)$ .
When $\rho=0$ , $\delta_{(x_i,y_j)}$ is a solution of $PGW_\lambda(\mu,\nu)$ , but it is not a solution for $MPGW_0(\mu,\nu)$ .

From the author response, I think we are on the same page. In the original review, I had written - "The main paper contained experiments only on synthetic dataset. Tables 2 and 4 in appendix discuss experiments on real-world datasets. MPGW obtains same generalization performance as PGW in both the tables."

Answer: Thank you for the clarification. We would once again like to note that the scope of our numerical experiments is consistent with existing literature on GW and unbalanced GW. If the reviewer is instead concerned with the similar performance of PGW and MPGW in Tables 2 and 4 in the appendix, this is discussed in the global response under the heading "Positive Label Unsupervised Learning", as stated earlier.

The response of one of my questions in my original review was a bit unclear. The question was "In (14), let \beta = minimum entry in cost tensor M. Then, what is the solution of PGW (16) when \beta > 2\lambda ?"

The solution of (16) seems to be zero when \beta > 2\lambda. If this is indeed the case, then the PGW distance between two distributions with marginals p and q, respectively, and with \lambda < \beta/2 will be \lambda(|q|^2 + |p|^2) (from 15) - which is a constant dependent on hyper-parameter and marginals. This restricts PGW's ability to distinguish the source distribution with target distributions of same marginal norm (|q|=|q_1|=|q_2|). No geometric information is being taken into account.

Answer: There still seems to be some confusion here. When $\beta=\min M$ as specified in the reviewer's original comment, this results in $\beta = 0$ . To be precise, since $\lambda$ cannot be negative number, we first correct $2\lambda<\beta=0$ to $2\lambda\leq \beta=0$ , i.e., $\lambda = 0$ . As a result, $PGW_0(\mu, \nu)$ admits the zero measure as one solution, and the PGW distance will be $\lambda(|q|^2+|p|^2)=0$ as $\lambda=0$ .

This restricts PGW's ability to distinguish the source distribution with target distributions of same marginal norm (|q|=|q_1|=|q_2|).

As stated in Proposition 3.4, PGW is a metric which can distinguish two measures if $\lambda>0$ . In this case, $\lambda=0$ and thus PGW cannot distinguish the difference between the source and target distributions. The resulting analysis given by the reviewer simply reflects the fact that $\lambda=0$ is a bad choice of hyperparameter when we aim to apply PGW as a metric.

评论- Reviewer response

2024-08-14

Regarding

There may have been some confusion here.

Let me rephrase my question. In this example 1, will the solution of PGW be a first order critical point of MPGW with appropriate value of $\rho$ ? In my earlier question "Given $lambda=0$ , solve PGW and obtain the solution $\gamma_{PGW}$ . Then set $\rho=|\gamma_{PGW}|$ . Now, would the first order critical points of MPGW contain $\gamma_{PGW}$ ?", the author response was "yes"

If the reviewer is instead concerned with the similar performance of PGW and MPGW in Tables 2 and 4 in the appendix, this is discussed in the global response under the heading "Positive Label Unsupervised Learning", as stated earlier.

In the global response, it has been stated that PGW has the advantage of being a metric (while MPGW is not). Then, perhaps some experiments where the utility of PGW being a metric comes out - should be performed.

There still seems to be some confusion here.

I am not sure why $\min M$ should be zero. It would be great if the authors could explain it. My understanding is as follows: M is computed using (14) - using $C^X$ and $C^Y$ matrices. If none of the entries of $C^X$ is equal to $C^Y$ , $\min M$ would not be zero. So $\beta = \min M$ seems to be a positive entry and for any $0<\lambda < \beta/2$ , the PGW distance has a closed form solution $\lambda(|p|^2 + |q|^2)$ . So there is no question of setting $\lambda=0$ .

评论- Response to Answers to new comments 5

2024-08-14

This statement contradicts the reviewer's original claim (from Response to authors #2):

Previously, I was wondering if the set of solution can be shown to be same for individual values of hyper-parameters. The authors have shown with example that this is not the case.

However, please note that my previous response was more in the form of a question to the authors (albeit without a question mark) rather than a definite claim about the paper's contribution.

As stated previously, the authors hoped to clarify exactly this point; namely, that the solution sets to the two problems are different, although particular solutions can coincide.

While this is true for specific values of hyper-parameter values, practically, one does not run the model on one hyper-parameter value. A common practice is to run the model on multiple hyper-parameter values (tuning/cross-validation) and choose one (based on some other criteria). Hence, can the solutions of PGW as one tunes PGW over $lambda$ be obtained by MPGW as one tunes over $\rho$ ?

I would also increase my score to the original score.

评论- answers to new comments 3

2024-08-14

Let me clarify my question. In Example 1, will the solution of PGW be a first-order critical point of MPGW for an appropriate value of $\rho$ ?

Answer:

In Example 1, we assert that the solution sets of $PGW_0(\mu,\nu)$ and $MPGW_\rho$ are different. We have already explained the distinction between these sets in our author rebuttal and in previous comment.

In my earlier question, "Given $\lambda=0$ , solve PGW and obtain the solution $\gamma_{PGW}$ . Then set $\rho=|\gamma_{PGW}|$ . Now, would the first-order critical points of MPGW contain $\gamma_{PGW}$ ?", the author responded "yes."
The answer to this earlier question remains yes, as we previously explained. This particular $\gamma_{PGW}$ is also a solution for $MPGW_\rho(\mu,\nu)$ since $\rho=|\gamma_{PGW}|$ . However, this does not imply that another solution of $PGW_0(\mu,\nu)$ will also be a solution for this specific $MPGW_\rho(\mu,\nu)$ .
Therefore, there is no contradiction to our claim "the solution sets of the two problems are different", although particular solutions might coincide.

In the global response, it was stated that PGW has the advantage of being a metric (whereas MPGW is not). Perhaps some experiments should be conducted to demonstrate the utility of PGW as a metric.

Answer: In the shape retrieval experiment, we obtained GW/UGW/MPGW/PGW costs to describe the similarity between two shapes. Since PGW and GW are metrics, they induce better performance than the other two methods. That is shown in table on page 8.

I am not sure why $\min M$ should be zero....

Answer: The definition of $M$ is provided in Eq (14), and the definitions of $C^X$ and $C^Y$ are given in Eq (11).

In particular $(C^X)_{i,i}=d_X(x_i,x_i')$ where $d_X$ is a metric.

Thus, when $i=i'$ , $C^X_{i,i'}=0$ . That is, diagonal elements of matrix $C^X$ are zero. Similarly, diagonal elements of matrix $C^Y$ are zero.

Based on the definition of $M$ , we have $\min M=|C_{1,1}^X-C_{1,1}^Y|^2=|d_X(x_1,x_1)-d_Y(y_1,y_1)|^2=|0-0|^2=0.$

评论- Reviewer response

2024-08-14

However, this does not imply that another solution of of $PGW_{0}(\mu,\nu)$ ...

I am not sure what the phrase another solution of $PGW_{0}(\mu,\nu)$ implies. The solution of PGW was taken to be $\gamma_{PGW}$ , without any assumption.

In the shape retrieval experiment,...

While the author response is true for synthetic shape retrieval experiments, it would be great to see the practical utility of PGW in real world applications involving real-world datasets. Appendix sections contained such experiments where PGW and MPGW had same generalization performance.

The definition of $M$ is provided ...

I thank the authors for clarifying my doubts for this question.

评论- answers to new comments 4

2024-08-14

I am not sure what the phrase another solution of ...

Answer:

We have explained this in Author Rebuttal example 1 and in the comment “answers to new comments 2”.

When $\gamma_{PGW}$ is the zero measure, we set $\rho=|\gamma_{PGW}|=0$ :

The zero measure is a solution for $PGW_0(\mu,\nu)$ and also for $MPGW_0(\mu,\nu)$ .
Choose $i\in[1:n],j\in[1:m]$ , $\delta_{(x_i,y_j)}$ is another solution for $PGW_0(\mu,\nu)$ . But $\delta_{(x_i,y_j)}$ is not a solution for $MPGW_0(\mu,\nu)$ .

评论- Response to answers to new comments 4

2024-08-14

Choose $i\in[1:n],j\in[1:m]$ , $\delta_{(x_i,y_j)}$ is another solution for $PGW_0(\mu,\nu)$ . But $\delta_{(x_i,y_j)}$ is not a solution for $MPGW_0(\mu,\nu)$ .

In this case, we could set $\rho=|\gamma_{PGW}|$ where $\gamma_{PGW}$ is the "another solution". Then $\gamma_{PGW}$ should be a solution for $MPGW_{|\gamma_{PGW}|}(\mu,\nu)$ . Since $\lambda$ and $\rho$ are not comparable hyper-parameters, one should not expect that all PGW solutions corresponding to fixed $\lambda$ would be obtained by a MPGW with a fixed $\rho$ and vice-versa.

评论- Answers to new comments 5

2024-08-14

In this case, we could set $\rho = |\gamma_{PGW}|$ where $\gamma_{PGW}$ is the "another solution". Then $\gamma_{PGW}$ should be a solution for $MPGW_{|\gamma_{PGW}|}(\mu, \nu)$ .

This follows immediately from Statement 1 in the global response.

Since $\lambda$ and $\rho$ are not comparable hyper-parameters, one should not expect that all PGW solutions corresponding to fixed $\lambda$ would be obtained by a MPGW with a fixed $\rho$ and vice-versa.

This statement contradicts the reviewer's original claim (from Response to authors #2):

Then, in the setting of Example 1 of the global response, it does seem that the solution set of PGW and MPGW can be identical - with appropriate value of $\rho$ .

As stated previously, the authors hoped to clarify exactly this point; namely, that the solution sets to the two problems are different, although particular solutions can coincide.

作者回复

2024-08-07

Relation between PGW and MPGW

Statement 1: The relation between PGW and MPGW can be described as follows (Proposition L.1. will be updated):

For each $\lambda \ge 0$ $λ \geq 0$ , there exists $\rho \in [0, \min(|\mu|, |\nu|)]$ $ρ \in [0, min (∣ μ ∣, ∣ ν ∣)]$ such that:
- For each $\gamma \in \Gamma_\leq(\mu, \nu)$ with $|\gamma| = \rho$ , $\gamma$ is optimal for $PGW_\lambda(\mu, \nu)$ iff $\gamma$ is optimal for $MPGW_\rho(\mu, \nu)$ .
- $PGW_\lambda(\mu, \nu) = MPGW_\rho(\mu, \nu) + \lambda(|\mu|^2 + |\nu|^2 - \rho^2).$
- Note, in this case, the sets of solutions for $PGW_\lambda(\mu, \nu)$ and $MPGW_\rho(\mu, \nu)$ are not necessarily identical.

Example 1: Suppose $\lambda = 0$ , $\mu = \sum_{i=1}^n \delta_{x_i}, \nu = \sum_{j=1}^m \delta_{y_j}$ . Then the set of optimal solutions for $PGW_0(\mu, \nu)$ contains all the elements of the following form: $\{\delta_{(x_i, y_j)} \mid i \in [1:n], j \in [1:m]\} \cup \{0\},$ where $0$ is the zero measure.

If we set $\rho = 0$ , the solution $\delta_{(x_i, y_j)}$ will not be an optimal solution for $MPGW_0(\mu, \nu)$ . If we set $\rho = 1$ , the solution $0_{m \times n}$ will not be an optimal solution for $MPGW_1(\mu, \nu)$ . Thus, the sets of solutions of $PGW_0(\mu, \nu)$ and $MPGW_\rho(\mu, \nu)$ are not consistent for any $\rho \in [0, \min(|\mu|, |\nu|)]$ .

Statement 2: We claim the following:

There exist $MPGW_\rho(\mu, \nu)$ problems where $\rho \in [0, \min(|\mu|, |\nu|)]$ , such that, for each $\lambda > 0$ , the solution of $MPGW_\rho(\mu, \nu)$ is not a solution of $PGW_\lambda(\mu, \nu)$ .

Example 2: Suppose $\mu = \sum_{i=1}^{10} \delta_{x_i}, \nu = \sum_{i=1}^{5} \delta_{x_i}$ , $\rho = 2$ . For each $\lambda > 0$ , choose an optimal $\gamma$ for $PGW(\mu, \nu)$ , we have $|\gamma| = 5 > 2$ . Thus, the solution of $MPGW_\rho(\mu, \nu)$ is not a solution for $PGW_\lambda(\mu, \nu)$ , and we cannot build an equivalence relation between the two problems.

Positive Label Unsupervised Learning:

Table 2 and Table 4 are part of the positive unsupervised learning section. In this experiment, we require transporting all the mass from positive samples to the target domain. In this case, by Proposition L.1. and Lemma E.2, PGW and MPGW admit the same set of solutions. As numerical evidence, we obtain the same performance from the two methods. We will add this discussion to the paper as a numerical verification of Proposition L.1. However, the PGW method has the extra advantage of giving rise to a metric, which is the main topic of this paper.

Limitation of MPGW:

Example: We refer to the example in Appendix L.1 for understanding the limitation of MPGW in this experiment: $MPGW\left(\sum_{i=1}^{400} \delta_{x_i}, \sum_{i=1}^{800} \delta_{x_i}\right) = 0,$ although these two measures (datasets) are distinct.

Performance of MPGW in Shape Retrieval Experiment:

About the MPGW in Tables (a) and (b) in Section 5.3: In this experiment, we use MPGW/UGW/PGW to measure similarity between different shapes. The reason MPGW exhibits poor performance is that whenever shape 1 is similar to a part of shape 2 (e.g., a "square" is similar to part of a "house"), MPGW will return an approximate value of 0. The classifier relies on the similarity measure, leading to poor performance.

Relation between the Two FW Algorithms in PGW:

The following will be added to the paper to clarify the equivalence relation between the proposed two algorithms:

In the gradient computation step, the gradient of Algorithm 2 $\hat{M} \circ \hat{\gamma}$ can be written in terms of $\tilde{M} \circ \gamma$ , which is the gradient of Algorithm 1, where $\tilde{M} = \hat{M}[1:n, 1:m, 1:n, 1:m]$ and $\gamma = \hat{\gamma}[1:n, 1:m]$ (See Lemma H.2 and Lemma H.3).
The $a, b$ values in the line search step in Algorithms 1 and 2 are the same based on (68).

Motivation of Algorithm 2:

Mathematically and numerically, the two algorithms are equivalent, as discussed in the paper (Appendix G). The reason for proposing Algorithm 2 is to provide a numerical implementation of Proposition G.1, which demonstrates that Macunn's conclusion [12] can be extended to the GW setting, thus establishing an equivalence relation between PGW and GW.

Parameter Selection for PGW

Shape retrieval experiment, shape interpolation, PU learning experiment: We require full matching for the measure that has less mass than the other. In this case, we only require $\lambda$ to be sufficiently large (see Lemma E.2).
When the data contains noise/outliers (e.g., shape matching experiment): If we assume that the distance between outliers and clean data is large, a suitable $2\lambda$ should be less than this distance and greater than the pairwise distance within the clean data.

Computational Complexity

The time complexity1 to obtain an $\epsilon$ -accurate solution for algorithm 1 is

\mathcal{O}\left(\frac{\max^2(2L_1, n^2\min(|\mu|, |\nu|)\cdot \max(\max\{2(C^X)^2 + 2(C^Y)^2,2\lambda)\}}{\epsilon^2} \cdot n^3\right), \tag{TC}

where $L_1$ depends on the initial guess $\gamma^{(1)}$ .

Note, the first term is the upper bound of iteration numbers. The $n^2$ in this term is obtained by the upper bound of the Lipschitz constant of the gradient of $\gamma$ in the PGW problem, as follows from Theorem 1 in [38]. In Frank-Wolfe algorithms for GW/MPGW, this term is also contained in the convergence rate. In practice, we generally set the number of iterations to a fixed number (the default value is 1000, as follows from [44] and PythonOT), and in all of our experiments, this number is not reached.

The second term $n^3$ can be improved to $1/\epsilon \ln(n)n^2$ if the Sinkhorn algorithm is applied.

最终决定Reject

2024-09-25

The reviewers noted that the paper presents a variation on existing unbalanced Gromov-Wasserstein (GW) methods and emphasized that preserving the triangular inequality is an important feature. However, they felt that the contribution is too incremental compared to the method proposed by Chapel et al. ("Partial Optimal Transport with Applications on Positive-Unlabeled Learning"). Methodologically, the proposed contribution (a penalized formulation, denoted PGW) is closely related to the method of Chapel et al. (a constrained formulation, denoted MPGW). From a numerical perspective, the proposed approach relies on the same Frank-Wolfe (F-W) algorithm.

In making this decision, there was a discussion with the reviewers, who argued that "solutions of PGW can be obtained via MPGW through appropriate hyper-parameters." While there may be cases where the solutions are non-unique, meaning the mapping is not a bijection of parameters, it remains unclear whether this difference is significant from a practical standpoint, particularly when the parameters are cross-validated. Furthermore, the reviewers pointed out that "Tables 2 and 4 in the appendix discuss experiments on real-world datasets. In those experiments, MPGW achieves the same generalization performance as PGW in both tables." These numerical experiments do not clearly demonstrate that preserving the triangular inequality is a crucial feature for machine learning applications of unbalanced GW.

For these reasons, I cannot recommend acceptance.