PaperHub
6.0
/10
Rejected4 位审稿人
最低5最高8标准差1.2
8
5
5
6
3.5
置信度
正确性2.8
贡献度2.8
表达3.3
ICLR 2025

Projection Optimal Transport on Tree-Ordered Lines

OpenReviewPDF
提交: 2024-09-27更新: 2025-02-05
TL;DR

In this paper, we replace one-dimensional lines in Sliced Optimal Transport with a more intricate structure, called tree systems, where OT problems on tree systems still allow closed-form solutions

摘要

关键词
projection optimal transportoptimal transport

评审与讨论

审稿意见
8

This paper introduces a new tree Sliced-Wasserstein distance. This new distance is obtained by projecting on trees instead of 1d lines. The authors provide a complete theoretical analysis of the tree structures introduced, of the Radon transform used to project on them, as well as a discussion on how to sample such trees. Finally, they compare their method with the original Sliced-Wasserstein distance and some of their variants on several tasks such as gradient flows, color transfer, generative modeling with SWGAN and for diffusion models.

优点

This paper provides a new tree Sliced-Wasserstein distance, which in contrast with previous works, project on trees, and is thus more flexible.

The authors provide first a full theoretical analysis of the tree space introduced, as well as a way to sample them in practice. Defining the right Radon transform, this analysis allows defining naturally a Tree Sliced-Wasserstein distance, which generalizes the original Sliced-Wasserstein distance. They show that this is well a metric, and that it can be approximated efficiently in the same complexity as Sliced-Wasserstein (w.r.t samples). This construction is thus original, well motivated and with good properties.

Finally, they compare their distance with Sliced-Wasserstein and variants thereof, on different experiments. On these different tasks, they show consistently better performances than Sliced-Wasserstein. Moreover, their experiments are done on classical baselines such as gradient flows on 2D datasets and higher dimensional gaussians, or color transfer, but also on high dimensional data such as images with generative modeling.

缺点

Some minor weaknesses are that there are no statistical analysis for the Tree Sliced-Wasserstein distance proposed, e.g. no sample complexity, nor topological analysis. One can wonder how it relates with other distances, it TSW-SL a lower-bound of the Sliced-Wasserstein or of the Wasserstein distance?

The theoretical section on trees is very interesting, but some parts are challenging to read (for instance Section 3.2). I note however that an effort has been made on being clear, notably through Figure 1 and 2, which help to understand the constructions.

问题

When citing references about work which project on lower dimensional subspaces, some references are missing such as [1,2,3].

I think in Equation (2), there is an abuse of notation as the Radon transform of a density is a density, and not a measure.

The construction given in Algorithm 1 only samples a chain tree. Did you also test with sampling trees where nodes have different childs?

For the experiments on SW generators (Section 6.3), the number of projections (50 and 500) seems very low compared to the original paper, where I think they use something like 100K projections.

Typos:

  • Line 102: "(Helgason & Helgason, 2011)" -> (Helgason, 2011)
  • Line 322: "If tree systems in T\mathbb{T} consists only one line"

[1] Lin, T., Zheng, Z., Chen, E., Cuturi, M., & Jordan, M. I. (2021, March). On projection robust optimal transport: Sample complexity and model misspecification. In International Conference on Artificial Intelligence and Statistics (pp. 262-270). PMLR.

[2] Huang, M., Ma, S., & Lai, L. (2021, July). A riemannian block coordinate descent method for computing the projection robust wasserstein distance. In International Conference on Machine Learning (pp. 4446-4455). PMLR.

[3] Muzellec, B., & Cuturi, M. (2019). Subspace detours: Building transport plans that are optimal on subspace projections. Advances in Neural Information Processing Systems, 32.

评论

We appreciate the reviewer’s feedback and have provided the following responses to address the concerns raised about our paper. Below, we summarize the weaknesses and questions highlighted by the reviewer and provide our answers accordingly.


W1. Some minor weaknesses are that there are no statistical analysis for the Tree Sliced-Wasserstein distance proposed, e.g. no sample complexity, nor topological analysis. One can wonder how it relates with other distances, it TSW-SL a lower-bound of the Sliced-Wasserstein or of the Wasserstein distance?

Answer. Given the paper's emphasis on the construction of tree systems and the corresponding Radon Transform, and considering the already extensive content, we have chosen to defer the analytical and statistical examination of TSW-SL to future work.

It is important to highlight that analyzing these aspects of TSW-SL presents unique challenges due to the inclusion of splitting maps. This feature is exclusive to Tree-Sliced Wasserstein variants and sets them apart from Sliced Wasserstein variants. We are actively investigating the properties of splitting maps, which appear to be a highly promising research direction.

Q1. When citing references about work which project on lower dimensional subspaces, some references are missing such as [1,2,3].

Answer. We appreciate the reviewer for highlighting those citations. We have incorporated them into our revised paper (Line 57).

Q2. I think in Equation (2), there is an abuse of notation as the Radon transform of a density is a density, and not a measure.

Answer. We acknowledge the reviewer's observation regarding the misuse of notation. We have carefully reviewed the paper and addressed similar errors in the revision.

Q3. The construction given in Algorithm 1 only samples a chain tree. Did you also test with sampling trees where nodes have different childs?

Answer. We evaluated the performance of models using alternative tree structures and found their performance to be comparable. Since the chain tree structure offers a straightforward approach for sampling trees and calculating Eq. (13), we opted to focus exclusively on the chain tree structure in the main body of the paper.

Considering the paper's emphasis on the construction of tree systems and the associated Radon Transform, and given its already comprehensive content, we have chosen to reserve further analysis on tree sampling and other aspects of TSW-SL for future work.

Q4. For the experiments on SW generators (Section 6.3), the number of projections (50 and 500) seems very low compared to the original paper, where I think they use something like 100K projections.

Answer. In the GAN experiment described in Section 6.3, we evaluated the performance of our TSW-SL method in comparison with SW using a total of 50 and 500 projection directions. In the existing literature, [1] also introduced SW with control variates for GANs and tested it using 10 and 1000 projection directions. Interestingly, we observed that increasing the number of projection directions from 500 to 1000 does not improve the FID score. For instance, [1] reported an FID score of 10.05 on CelebA when using SW with 1000 projection directions, which is higher than the FID score of 9.62 achieved with SW using 500 projection directions, as detailed in Section 6.3 of our paper. Therefore, we choose to conduct experiments to compare TSW-SL and SW with 50 and 500 projecting directions.


Reference.

[1] Khai Nguyen and Nhat Ho. Sliced wasserstein estimation with control variates. In The Twelfth International Conference on Learning Representations, 2024.


Once again, we sincerely thank the reviewer for their feedback. Please let us know if there are any additional concerns or questions from the reviewer regarding our paper.

评论

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

评论

Thank you for addressing my (minor) concerns. I am satisfied with the content of this paper and would recommend acceptance.

评论

Thanks for your response, and we appreciate your endorsement.

审稿意见
5

The paper introduces a variant of the Sliced-Wasserstein distance that, rather than projecting distributions onto lines, projects them onto tree systems. This approach leverages the closed-form solution of optimal transport (OT) for trees to develop a fast algorithm based on the Radon transform applied to tree structures. It uses splitting maps to define how distributions are projected onto these structures. This tree-sliced Wasserstein distance on a System of Lines (TSW-SL) is shown to be a metric. Experiments compare the original SW with TSW-SL, demonstrating that TSW-SL provides slightly better performance than SW, with comparable computation times.

优点

Paper introduces a new variant of sliced-Wasserstein: relying on a tree metric is original and allows benefiting from closed-form solution that exists, allowing preserving computational efficiency. The paper is clear, well illustrated, experimental section validates the approach in different contexts.

缺点

Note: I have reviewed an extension of TSW-SL for the same venue.

As noted in the conclusion, the paper "introduces a straightforward alternative to SW by replacing one-dimensional lines with tree systems," aiming to provide a more geometrically meaningful space. This objective aligns with several SW variants (as mentioned in the related work section of the introduction) that have achieved performance improvements in learning tasks involving SW and/or offer alternative schemes with distinct properties (e.g., statistical characteristics, behavior in high-dimensional spaces, the ability to provide a transport plan). However, given that TSW-SL achieves similar (or slightly better) performance to SW while retaining the same limitations (e.g., difficulties in sampling meaningful lines in high dimensions, as noted in Table 2), it remains unclear why and under what circumstances TSW-SL should be preferred over SW or its many variants.

There is also no discussion regarding the impact of the number of lines, which seems to be a critical aspect of the method—especially as using only one line recovers the standard SW. Additionally, the influence of the splitting maps is not addressed, despite being central to the new Radon transform.

Other Comments:

  • In the gradient flow experiment, differences in ground cost (e.g., L2L_2, tree metric) between methods make it challenging to compare results at a fixed number of iterations.
  • Table 1: By Wasserstein distance, do you mean W2W_2? Additionally, could you clarify the timings reported in Table 2?
  • Line 467: The loss should not be denoted as L\mathcal{L} since this letter represents a line.
  • Section 6.2: It is difficult to visually confirm that TSW-SL produces images that most closely resemble the target.
  • Appendix p.17, line 870: L\overline{L} should be L\overline{\mathcal{L}}.
  • Example A.14 is unclear: nin_i should have length 6 (i.e.,i=0i = 0 to 5). What do the arrows signify? Additionally, an arrow seems to be missing on line 1035. The progression through step ii is confusing.
  • Page 25: Please provide more detail about the transition from eq. (46) to eq. (47).

问题

  • What are the main arguments in favor of TSW-SL wrt other many variants of SW? Could you highlight the unique advantages of TSW-SL?

  • Can you give some stronger argument to the claim that tree systems allow avoiding a loss of topological information? Is this claim related with the number of lines that should be sampled ? What is the impact of this parameter onto the perfomances?

  • Is it possible to derive additional properties of TSW-SL? For instance, does it metrizes the weak convergence? What are its statistical properties?

  • It is stated in the conclusion that improved performances are expected by adapting recent advanced sampling schemes or techniques of SW to TSW-SL. It seems that the extension is not that straightforward: can you comment on that, and how improved performances can be expected?

  • It is stated in the beginning of section 6 that "the root is positioned near the mean of the target distribution": what is the impact of this setting?

评论

Q2. Can you give some stronger argument to the claim that tree systems allow avoiding a loss of topological information? Is this claim related with the number of lines that should be sampled ? What is the impact of this parameter onto the perfomances?

Answer. Given that the complexity of TSW-SL, as presented in Section 5.1 of our main text, is O(Lknlogn+Lkdn)\mathcal{O}(Lknlog⁡n+Lkdn) (L is the number of trees, k is the number of lines per tree, n is the number of supports for each distribution, and d is the number of dimensions in the original space), while the computational complexity of SW is O(Lnlogn+Ldn)\mathcal{O}(Lnlog⁡n+Ldn) (LL is the total number of projection directions in SW), we conducted experiments such that the total number of projection directions in SW equals the product of the number of trees and lines in TSW-SL for a fair comparison. For example, when we set the total number of projection directions in SW to 50 in Section 6.3, we conducted experiments with TSW-SL using 33 and 55 lines per tree, selecting the number of lines accordingly to ensure that the total number of projection directions for both TSW-SL and SW was approximately the same. Additional experimental results on the impact of the number of lines per tree are provided in Table 5 in Appendix E.3, where we explore different configurations with varying numbers of lines per tree in TSW-SL. The results show that although the number of lines in each tree does have a different impact on performance, given the same number of total projection directions, our TSW-SL consistently yields better results compared to SW and Orthogonal SW.

Q3. Is it possible to derive additional properties of TSW-SL? For instance, does it metrizes the weak convergence? What are its statistical properties?

Answer: Given the paper's emphasis on the construction of tree systems and the corresponding Radon Transform, and considering the already extensive content, we have chosen to defer the analytical and statistical examination of TSW-SL to future work.

It is important to highlight that analyzing these aspects of TSW-SL presents unique challenges due to the inclusion of splitting maps. This feature is exclusive to Tree-Sliced Wasserstein variants and sets them apart from Sliced Wasserstein variants. We are actively investigating the properties of splitting maps, which appear to be a highly promising research direction.

评论

We appreciate the reviewer’s feedback and have provided the following responses to address the concerns raised about our paper. Below, we summarize the weaknesses and questions highlighted by the reviewer and provide our answers accordingly.


W2. There is also no discussion regarding the impact of the number of lines, which seems to be a critical aspect of the method—especially as using only one line recovers the standard SW. Additionally, the influence of the splitting maps is not addressed, despite being central to the new Radon transform.

Answer. Given that the complexity of TSW-SL, as presented in Section 5.1 of our main text, is O(Lknlogn+Lkdn)\mathcal{O}(Lknlog⁡n+Lkdn) (LL is the number of trees, k is the number of lines per tree, n is the number of supports for each distribution, and d is the number of dimensions in the original space), while the computational complexity of SW is O(Lnlogn+Ldn)\mathcal{O}(Lnlog⁡n+Ldn) (LL is the total number of projection directions in SW), we conducted experiments such that the total number of projection directions in SW equals the product of the number of trees and lines in TSW-SL for a fair comparison.

For example, when we set the total number of projection directions in SW to 5050 in Section 6.3, we conducted experiments with TSW-SL using 33 and 55 lines per tree, selecting the number of lines accordingly to ensure that the total number of projection directions for both TSW-SL and SW was approximately the same. Additional experimental results on the impact of the number of lines per tree are provided in Table 5 in Appendix E.3, where we explore different configurations with varying numbers of lines per tree in TSW-SL.

The results show that although the number of lines in each tree does have a different impact on performance, given the same number of total projection directions, our TSW-SL consistently yields better results compared to SW and Orthogonal SW.

Q1 + Q2. What are the main arguments in favor of TSW-SL wrt other many variants of SW? Could you highlight the unique advantages of TSW-SL?"

Can you give some stronger argument to the claim that tree systems allow avoiding a loss of topological information?

Answer. The motivation for this paper arose from a simple yet intriguing idea: In the framework of Sliced Wasserstein (SW), a probability distribution on Rd\mathbb{R}^d is pushed forward onto a line. This raises the question: what does the resulting distribution reveal about the original one? It is evident that distinct distributions, when projected onto the same line, can become indistinguishable.

Now, let us compare a tree system composed of two lines, aa and bb in R2\mathbb{R}^2, along with a distribution μ\mu on R2\mathbb{R}^2. For simplicity, assume that μ\mu is a Dirac delta distribution.

  • Many Dirac delta distributions in R2\mathbb{R}^2 become identical after being projected onto line aa, and the same holds true for line bb. However, in most cases, the projections of μ\mu on both lines of the tree system can uniquely identify the original Dirac delta distribution. ("Most cases" excludes exceptional scenarios, such as when aa and bb are the same line.)
  • The above reasoning might appear insufficient because it evaluates the tree system by considering only one of its lines at a time. What happens if we evaluate the tree system using both lines together? This is where the splitting map becomes essential. The splitting map enables a more versatile allocation of mass between the two lines, rather than concentrating all the mass onto one line.

In summary, with the same number of lines—and thus the same computational cost—tree systems in TSW-SL provide a significantly deeper understanding of probability distributions compared to individual lines in SW.

A natural question arises: if a better representation space is desired, why not replace one-dimensional lines with higher-dimensional subspaces of Rd\mathbb{R}^d? The answer lies in computational feasibility. Optimal Transport in Rd\mathbb{R}^d for d>1d>1 is computationally prohibitive due to the lack of a closed-form solution. In contrast, both SW and TSW-SL offer efficient closed-form expressions, making them more practical.

We believe this explanation adequately addresses the reviewer's concerns.

评论

Comment 3. Line 467: The loss should not be denoted as L\mathcal{L} since this letter represents a line.

Answer. Thank you for your correction. We have made the adjustments in our manuscript.

Comment 4. Section 6.2: It is difficult to visually confirm that TSW-SL produces images that most closely resemble the target.

Answer. Figure 5 showcases both qualitative and quantitative results. Qualitatively, methods that yield images with colors more closely matching the target image are considered superior. The figure displays the generated images and the corresponding 2-Wasserstein distances at the final epoch, with a smaller 2-Wasserstein distance indicating better performance.

For a more detailed qualitative comparison of our methods with SW, MaxSW, and other SW variants, we provide an additional example of the color-transfer task in Appendix E.2. This example further demonstrates that our TSW-SL and MaxTSW-SL loss functions, when used to measure the distance between two color distributions, achieve superior results both qualitatively and quantitatively.

Comment 5. Appendix p.17, line 870: L\overline{L} should be L\mathcal{L}.

Answer. Thank you for your correction. We have made the adjustments in our revision.

Comment 6. Example A.14 is unclear: nin_i should have length 6 (i.e., i=0i = 0 to 55). What do the arrows signify? Additionally, an arrow seems to be missing on line 1035. The progression through step ii is confusing.

Answer. Thank you for your correction. We have revised the Example A.14 in our revision as follows:

Definition A.13. Let TT be a nonnegative integer, and n1,,nTn_1, \ldots, n_T be TT positive integer. A sequence s=xi_i=0Ts = \\{x_i\\}\_{i=0}^T, where xix_i is a vector of nin_i nonnegative numbers, is called a tree representation if x0=[1]x_0 = [1], and for all 1iT 1 \le i \le T, nin_i is equal to the sum of all entries in vector xi1x_{i-1}.

Example A.14. For T=5T = 5 and {ni}i=15={1,3,4,2,3}\{n_i\}_{i=1}^{5} = \{1,3,4,2,3\}, the sequence s  ⁣: x0=[1]x1=[3]x2=[2,1,1]x3=[1,0,2,0]x4=[1,2]x5=[0,0,1].s ~ \colon ~ x_0 = [1] \rightarrow x_1 = [3] \rightarrow x_2 = [2,1,1] \rightarrow x_3 = [1,0,2,0] \rightarrow x_4 = [1,2] \rightarrow x_5 = [0,0,1]. is a tree representation.

Comment 7. Page 25: Please provide more detail about the transition from eq. (46) to eq. (47).

Answer. We recall the transition from Eq. (46) to Eq. (47).

T(W_d_L,1(Rα_Lμ_1,Rα_Lμ_2)+W_d_L,1(Rα_Lμ_2,Rα_Lμ_3)) dσ(L)TW_d_L,1(Rα_Lμ_1,Rα_Lμ_3) dσ(L)\displaystyle \int_{\mathbb{T}} \biggl(\text{W}\_{d\_\mathcal{L},1}(\mathcal{R}^\alpha\_\mathcal{L} \mu\_1, \mathcal{R}^\alpha\_\mathcal{L} \mu\_2) + \text{W}\_{d\_\mathcal{L},1}(\mathcal{R}^\alpha\_\mathcal{L} \mu\_2, \mathcal{R}^\alpha\_\mathcal{L} \mu\_3) \biggr) ~d\sigma(\mathcal{L}) \ge \int_{\mathbb{T}} \text{W}\_{d\_\mathcal{L},1}(\mathcal{R}^\alpha\_\mathcal{L} \mu\_1, \mathcal{R}^\alpha\_\mathcal{L} \mu\_3) ~d\sigma(\mathcal{L})

This transition comes directly from the inequality

W_d_L,1(Rα_Lμ_1,Rα_Lμ_2)+W_d_L,1(Rα_Lμ_2,Rα_Lμ_3)W_d_L,1(Rα_Lμ_1,Rα_Lμ_3),\text{W}\_{d\_\mathcal{L},1}(\mathcal{R}^\alpha\_\mathcal{L} \mu\_1, \mathcal{R}^\alpha\_\mathcal{L} \mu\_2) + \text{W}\_{d\_\mathcal{L},1}(\mathcal{R}^\alpha\_\mathcal{L} \mu\_2, \mathcal{R}^\alpha\_\mathcal{L} \mu\_3) \ge \text{W}\_{d\_\mathcal{L},1}(\mathcal{R}^\alpha\_\mathcal{L} \mu\_1, \mathcal{R}^\alpha\_\mathcal{L} \mu\_3),

since the Wasserstein distance W_d_L,1\text{W}\_{d\_\mathcal{L},1} is a metric on the space of all measures on L\mathcal{L}.


We sincerely thank the reviewer for the valuable feedback. If our responses adequately address all the concerns raised, we kindly hope the reviewer will consider raising the score of our paper.

评论

Q5. It is stated in the beginning of section 6 that "the root is positioned near the mean of the target distribution": what is the impact of this setting?

Answer: Through empirical observations, we discovered that initializing the tree root near the mean of the target distribution results in more consistent performance for our models.

We believe a thorough investigation into the analytical and statistical properties of TSW-SL is necessary to provide a theoretical explanation for these findings. However, since the paper primarily concentrates on the construction of tree systems and the associated Radon Transform, and given its already extensive scope, we have decided to defer these aspects of TSW-SL to future research.


Comment 1. In the gradient flow experiment, differences in ground cost (e.g., L2L_2, tree metric) between methods make it challenging to compare results at a fixed number of iterations.

Answer. As described in Section 6.1 of the main text (Lines 408–413), the objective of this task is to update the source distribution to make it as close to the target distribution as possible. The empirical advantage of our TSW-SL over other SW-variants is demonstrated by measuring the 2-Wasserstein distance between the source and target distributions at steps 500, 1000, 1500, 2000, and 2500. In this task:

  • TSW-SL, MaxTSW-SL, and other SW-variants are used as the loss function.
  • The 2-Wasserstein distance serves as the evaluation metric.

It is crucial to note that the roles of the 2-Wasserstein distance (evaluation metric) and the other distances, such as TSW-SL, MaxTSW-SL, and SW-variants (loss functions), are separate. Hence, there is no difference when evaluating the distance between the updated source distribution and the target distribution.

Comment 2. Table 1: By Wasserstein distance, do you mean W2W_2? Additionally, could you clarify the timings reported in Table 2?

Answer. The Wasserstein mentioned in our paper is indeed 22-Wasserstein distance W2W_2. In Table 2, the number reported is the average 22-Wassersein distance over 10 runs. We have also added the average time to calculate the distance at one iteration over 10 runs in Table 2 of our manuscripts. The timings reported in Table 1 and Table 2 suggests that although the time per iteration for TSW-SL is slightly higher, the substantial reduction in Wasserstein distance underscores its efficiency across both datasets.

Table 2: Average Wasserstein distance between source and target distributions of 10 runs on high-dimensional datasets.

Number of dimensionsIteration 0Iteration 0Iteration 500Iteration 500Iteration 1000Iteration 1000Iteration 1500Iteration 1500Iteration 2000Iteration 2000Iteration 2500Iteration 2500Time/Iter(s)Time/Iter(s)
SWTSW-SLSWTSW-SLSWTSW-SLSWTSW-SLSWTSW-SLSWTSW-SLSWTSW-SL
106.416.414.32e-32.81e-32.94e-32.00e-32.81e-31.55e-32.23e-31.59e-32.28e-31.75e-30.0100.015
5042.7242.7250.4139.2645.6921.9142.5611.9138.814.0835.751.720.0140.018
7569.0669.0692.3979.7190.7967.9990.0753.9286.5844.9190.3131.610.0150.018
10091.591.5130.12117.66128.13103.23128.5893.41129.8080.46128.2975.280.0180.019
150142.54142.54214.09203.30213.71190.62215.05186.77212.90183.52216.32182.630.0200.022
200192.52192.52302.84289.83301.35283.34303.07276.94302.70279.24301.51279.080.0200.021
评论

Q4 + W1. As noted in the conclusion, the paper "introduces a straightforward alternative to SW by replacing one-dimensional lines with tree systems," aiming to provide a more geometrically meaningful space. This objective aligns with several SW variants ... it remains unclear why and under what circumstances TSW-SL should be preferred over SW or its many variants.

It is stated in the conclusion that improved performances are expected by adapting recent advanced sampling schemes or techniques of SW to TSW-SL. It seems that the extension is not that straightforward: can you comment on that, and how improved performances can be expected?

Answer: Thanks for your nice question. We are eager to elaborate further on it. Roughly speaking, the tree-sliced framework in our paper is built on two key insights:

  • Local Perspective: Each line in a tree system is treated similarly to a line in the Sliced Wasserstein (SW) framework. Splitting maps determine how the mass at each point is distributed across the lines, and then the projection of these mass portions onto the lines is processed in the same way as in SW. A significant challenge with this approach is verifying whether the injectivity of the corresponding Radon Transform is preserved, as this determines whether the proposed metric qualifies as a true metric or merely a pseudo-metric. However, we addressed this concern by providing a proof in the paper.
  • Global Perspective: Tree structures and splitting maps establish connections between the lines, creating a cohesive system. This introduces a novel aspect compared to SW, enabling interaction and integration among the lines in a tree system. The Wasserstein distance can now be computed on this space with a closed-form expression, analogous to how lines are treated in SW.

From these insights, several follow-up directions for TSW-SL can be pursued, focusing on each perspective:

  • Local Perspective: Consider, for example, the Generalized Sliced Wasserstein (GSW) distance [1]. In GSW, the SW framework is retained, but the projection mechanism is altered. Specifically, GSW generalizes the integration level set in the Radon Transform, replacing the level set defined by the inner product (representing orthogonal hyperplanes) with one defined by an arbitrary function. Similarly, in TSW-SL, which currently relies on the inner product, a framework could be developed that generalizes this by using an arbitrary function, offering new flexibility and applications.

  • Global Perspective: More detailed analysis can aid in designing improved tree structures and splitting maps. For instance, splitting maps could be enhanced to account for both positional information from points and lines, rather than relying solely on lines as in the current implementation. This approach forms the core concept of one extension of the TSW-SL paper, which has also been submitted to this venue and can be accessed here.

It is important to note that while these ideas might seem straightforward, their development is non-trivial. For example, a critical property that must be preserved in TSW-SL variants is the injectivity of the corresponding Radon Transform. Additionally, from an empirical perspective, replacing lines with tree systems shows promising results. To illustrate this, let's consider the Denoising Diffusion Models task in our paper. We closely follow [2], which, to the best of our knowledge, is the most recent paper evaluating SW methods in diffusion models. Table 4 reports the following:

  • The FID for the vanilla SW method is 2.90.
  • The FIDs for four variants using random-path projecting directions from [2] are 2.70, 2.82, 2.87, and 2.88.

By merely replacing the foundation lines with tree systems, without introducing new techniques, our model using TSW-SL achieves an FID of 2.83. Models using TSW-SL with additional techniques (in the mentioned submission), achieve FIDs of 2.60 and 2.525.

We believe that our response regarding the both perspectives address the reviewer's question effectively. We hope this paper serves as a catalyst for a new research direction in the field, and we are actively working on advancing these ideas further.

评论

References.

[1] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in neural information processing systems, 32, 2019.

[2] Khai Nguyen, Shujian Zhang, Tam Le, and Nhat Ho. Sliced wasserstein with random-path projecting directions. In Forty-first International Conference on Machine Learning, 2024.

评论

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

审稿意见
5

The authors advance the concept of Sliced Wasserstein (SW) distance, which involves projecting data onto one-dimensional lines and leveraging closed-form expressions for one dimensional optimal transport to reduce computational costs. The authors propose an alternative projection onto tree systems, asserting that this approach preserves topological information. Additionally, they rigorously demonstrate that their framework yields an efficient metric for comparing measures. They also present extensive experimental results showcasing the superior performance of their method compared to existing methods based on SW-type distances.

优点

The derivation is clear and well-structured, providing a thoroughly developed framework. The authors demonstrate improved performance relative to previous related methods while maintaining lightweight computation. The code is provided.

缺点

Although the authors present a well-founded theoretical framework and demonstrate practical improvements on several setups, the proposed method essentially introduces (yet) another distance metric akin to the Sliced Wasserstein distance. The overall paper seems to be a rather straightforward and incremental extension of the prior related works in the field. That is, I am rather skeptical about the significance and impact of the construction proposed here on the field of ML.

It seems to me that the main purpose of the proposed TSW-SL seems to be to compare the probability distributions, a feature which is primary needed for GAN training nowadays (the other experiments in the paper with gradient flows, etc. are toy, so I do not consider them as particularly demonstrative and serious). Here I have a general concern regarding the usefulness of the proposed TSW-SL in GANs. The experiments with GANs indeed show some practical improvement in the task of image generation. However, according to the famous work [1] tuning GANs is more about tuning various hyperparameters rather than changing the particular loss functions. This is also confirmed by the fact that most more practically-related papers (from CVPR, ECCV, etc.) still rely on vanilla/NS/WGAN loss with additional regularizations rather than other more complex losses (like SW-based) proposed by the community later. This suggests that (in 2024) the contribution of the current paper (TSW-SL as a GAN loss) may be relatively minor, so I am more on the negative side about the paper.

Additionally, the experiments with GANs lack details and sufficient explanations. For example, it is not clear why the problem in lines (Appendix) 1501-1505 is a valid GAN optimization problem. In the DDGAN experiment, how exactly the DDGAN loss is used in TSW is also not explained in detail, while it should: the DD-GAN loss is a non-trivial model due to various conditioning (discriminator is conditioned on the point, on the time moment). How this conditioning fits to the the proposed TSW-SL framework remains explained (line 1618 is not enough).

[1] Lucic, Mario, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. "Are gans created equal? a large-scale study." Advances in neural information processing systems 31 (2018).

问题

I would like to see more experimental results and discussion regarding the GAN training.

  • Could you explain how does your training time (time per iteration, per epoch, overall time to convergence) compares with the one from DDGAN and the other basedlines considered in the experiments.
  • Could you please provide the convergence plots (FID as a function of epoch, e.g., once per several epochs) for your model vs. DDGAN vs. some other SW baseline? I would like to understand how stable is the overall training of your model compared with the baselines.

伦理问题详情

None

评论

Q1. Could you explain how your training time (time per iteration, per epoch, overall time to convergence) compares with the one from DDGAN and the other baselines considered in the experiments.

Answer. Regarding the training time of our methods in DDGAN and other baselines in Table 4, we have already provided the training time per epoch of our methods compared with other baselines. We have further provided the training time per iteration as a new column in Table 4 of the revision of our paper. In terms of overall time to converge, it is impossible for us to provide in this discussion phase due to lack of time and resources since it requires reproducing the full training of all baselines mentioned in [1]. We will include the overall time to convergence of the remaining baseline in the final revision of our manuscript.

Q2. Could you please provide the convergence plots (FID as a function of epoch, e.g., once per several epochs) for your model vs. DDGAN vs. some other SW baseline? I would like to understand how stable is the overall training of your model compared with the baselines.

Answer. We have added the FID plots of our TSW-SL-DD compared to SW-DD in Figure 9 of Appendix E.4 of the revision of our paper. The results show that our TSW-SL-DD achieves a greater reduction in FID scores compared to SW-DD during the final 300 epochs. Due to the lack of time and resources during the discussion periods, we cannot provide FID score over epochs of other SW variants reported in the papers (since we follow the results reported in [10] to compare with our TSW-SL-DD results). Additionally, reproducing all baselines for DDGAN experiments would require time and resources far beyond what is feasible during the discussion period. We will include the FID over epochs of the remaining baselines in the final revision of our manuscript.


We sincerely thank the reviewer for the valuable feedback. If our responses adequately address all the concerns raised, we kindly hope the reviewer will consider raising the score of our paper.

评论

We sincerely appreciate the reviewer’s thoughtful feedback and have provided the following responses to address the concerns raised about our paper. Below, we begin with a general reply and then proceed to address each of the reviewer’s questions individually.


It appears to us that the reviewer may not be fully familiar with the context of Optimal Transport, particularly the specific topic of our paper, which focuses on the sliced variants of the Wasserstein distance in Optimal Transport.

It seems to me that the main purpose of the proposed TSW-SL seems to be to compare the probability distributions, a feature which is primary needed for GAN training nowadays

The Wasserstein distance is known to have a supercubic computational complexity concerning the number of supports in the input measures. Specifically, for probability measures with at most nn supports, its computational complexity is O(n3logn)\mathcal{O}(n^3 \log n) [1]. This high computational cost has driven the development of sliced variants in Optimal Transport, which provide more computationally efficient alternatives to the original Wasserstein distance.

In the case of the original Sliced Wasserstein distance, the computational complexity is reduced to O(Lnlogn+Ldn)\mathcal{O}(L n \log n + Ldn), where dd is the dimension of the supports and LL is the number of samples used in the Monte Carlo approximation. This represents a significant improvement in computational efficiency, reducing the complexity from n3lognn^3 \log n to nlognn \log n.

In our TSW-SL distance, the computational complexity is O(Lknlogn+Lkdn)\mathcal{O}(Lkn \log n + Lkdn), where LL is the number of samples in the Monte Carlo approximation, kk represents the number of tree levels, nn is the number of supports, and dd is the dimensionality of the supports. This complexity retains the same order with respect to the number of supports nn, making it computationally efficient while incorporating additional structural information through the tree system.

This is also confirmed by the fact that most more practically-related papers (from CVPR, ECCV, etc.) still rely on vanilla/NS/WGAN loss with additional regularizations rather than other more complex losses (like SW-based) proposed by the community later. This suggests that (in 2024) the contribution of the current paper (TSW-SL as a GAN loss) may be relatively minor, so I am more on the negative side about the paper.

It is important to clarify that the primary contribution of our paper is not about introducing TSW-SL as a GAN loss as the reviewer stated. Instead, the paper focuses on introducing a novel integration domain, referred to as tree systems, as a replacement for the traditional lines used in the Sliced Wasserstein framework. The GAN experiments in our work primarily serve to highlight the advantages of this new integration domain when compared to lines.

As stated in our paper (Line 377): "It is worth noting that the paper presents a simple alternative by substituting lines in SW with tree systems, focusing mainly on comparing TSW-SL with the original SW, without expecting TSW-SL to outperform more recent SW variants."

Improving existing components in the SW framework ([1], [2], [3], [4], [5], [6], etc.) and extending one-dimensional lines to more complex domains such as one-dimensional manifolds or low-dimensional subspaces ([6], [7], [8], [9], etc.) are active and evolving areas of research in the Machine Learning community. Hence, we believe it is not entirely fair to evaluate a theoretical contribution like ours solely on one specific task. Instead, the focus should be on the generality and potential of the introduced approach.

评论

References.

[1] Khai Nguyen, Nicola Bariletto, and Nhat Ho. Quasi-monte carlo for 3d sliced wasserstein. In The Twelfth International Conference on Learning Representations, 2024.

[2] Khai Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Distributional sliced-wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021.

[3] Kimia Nadjahi, Alain Durmus, Pierre E Jacob, Roland Badeau, and Umut Simsekli. Fast approximation of the sliced-wasserstein distance using concentration of random projections. Advances in Neural Information Processing Systems, 34:12411–12424, 2021.

[4] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G Schwing. Max-sliced wasserstein distance and its use for gans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10648–10656, 2019.

[5] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in neural information processing systems, 32, 2019.

[6] Clément Bonet, Laetitia Chapel, Lucas Drumetz, and Nicolas Courty. Hyperbolic sliced-wasserstein via geodesic and horospherical projections. In Topological, Algebraic and Geometric Learning Workshops 2023, pages 334–370. PMLR, 2023

[7] David Alvarez-Melis, Tommi Jaakkola, and Stefanie Jegelka. Structured optimal transport. In International conference on artificial intelligence and statistics, pages 1771–1780. PMLR, 2018.

[8] Franc¸ois-Pierre Paty and Marco Cuturi. Subspace robust wasserstein distances. In International conference on machine learning, pages 5072–5081. PMLR, 2019.

[9] Jonathan Niles-Weed and Philippe Rigollet. Estimation of wasserstein distances in the spiked transport model. Bernoulli, 28(4):2663–2688, 2022.

[10] Khai Nguyen, Shujian Zhang, Tam Le, and Nhat Ho. Sliced wasserstein with random-path projecting directions. In Forty-first International Conference on Machine Learning, 2024.

[11] K. Nguyen and N. Ho. Sliced wasserstein estimation with control variates. In The Twelfth International Conference on Learning Representations, 2024.

评论

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

评论
  • GAN Experiment Results and Their Significance: We acknowledge your concern regarding the modest numerical gains in FID scores and the possible variability of these metrics. Nevertheless, we believe that the improvements—22.43% over DDGAN and 2.4% over SW—are significant, especially considering the already low baseline FID scores. Furthermore, when placing our results within the broader context of image generation benchmarks, such as those on CIFAR-10, our approach ranks among the top 49 methods. This highlights the effectiveness of our method in advancing diffusion models.

  • Log scale of our newly added figure for FID plots over epochs: As noted in Appendix E4 of our manuscripts, we utilize a logarithmic scale in FID plots due to the wide range of values (from over 400 in initial epochs to less than 3.0 in final epochs). We think that this scale offers clearer visualization, and we will include the FID plot in the original scale in the final revision of our manuscript.

Thank you again for your thoughtful feedback, and we are happy to address any further concerns regarding our work.

Best regards,
The Authors


References

[1] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced wasserstein distances. Advances in neural information processing systems, 32, 2019.

[2] Nadjahi, K., Durmus, A., Jacob, P. E., Badeau, R., & Simsekli, U. (2021). Fast approximation of the sliced-Wasserstein distance using concentration of random projections. Advances in Neural Information Processing Systems, 34, 12411-12424.

[3] Khai Nguyen and Nhat Ho. Sliced wasserstein estimation with control variates. In The Twelfth International Conference on Learning Representations, 2024.

[4] Khai Nguyen, Nhat Ho, Tung Pham, and Hung Bui. Distributional sliced-wasserstein and applications to generative modeling. In International Conference on Learning Representations, 2021.

[5] Khai Nguyen, Nicola Bariletto, and Nhat Ho. Quasi-monte carlo for 3d sliced wasserstein. In The Twelfth International Conference on Learning Representations, 2024a.

[6] Khai Nguyen, Shujian Zhang, Tam Le, and Nhat Ho. Sliced wasserstein with random-path projecting directions. In Forty-first International Conference on Machine Learning, 2024.

[7] Izquierdo, S., & Civera, J. (2024). Optimal transport aggregation for visual place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 17658-17668).

[8] Wu, X., Dong, X., Nguyen, T. T., & Luu, A. T. (2023, July). Effective neural topic modeling with embedding clustering regularization. In International Conference on Machine Learning (pp. 37335-37357). PMLR.

[9] He Zhao, Dinh Phung, Viet Huynh, Trung Le, and Wray Buntine. Neural topic model via optimal transport. arXiv preprint arXiv:2008.13537, 2020.

评论

Dear Reviewer MtTT,

Thank you for your detailed feedback and thoughtful analysis. We appreciate the time you’ve taken to evaluate our work and provide constructive insights.

We would like to further address your concerns regarding the significance of our proposed framework below.

  • Application to ML-related problems: While we understand the importance of demonstrating the relevance of our framework to real-world ML-related problems, we emphasize that the primary contribution of our paper is the introduction of tree systems as a novel integration domain within the Sliced Wasserstein (SW) framework. As noted in the revised paper (Line 377), our goal is to present this as an alternative to traditional lines in SW, with the focus being on comparing TSW-SL with the original SW, rather than developing a loss that can achieve state-of-the-art results in a specific domain like generative models.

  • Soundness of Our Proposed Distance in the Optimal Transport Field: To the best of our knowledge, our work introduces the first method to formally construct a tree system that preserves topological properties, accommodates dynamic supports, and achieves computational efficiency comparable to the Sliced Wasserstein (SW) distance. TSW-SL serves as a bridge between SW and TSW, leveraging the increased flexibility of TSW while retaining the dynamic adaptability characteristic of SW. We believe our method is non-trivial for several reasons. Firstly, it opens up new avenues for efficient computation in Optimal Transport, overcoming the limitations of both SW and TSW. As discussed in our global aspect of our General Response, the framework we present lays the groundwork for future variants of TSW-SL, which could retain its advantages while incorporating more sophisticated improvements. For example, future enhancements could involve refining the splitting maps to incorporate positional information from both points and lines, rather than relying solely on lines as in the current implementation. This idea is a central part of an extension to the TSW-SL framework, referenced by Reviewer 7kaA, and is also being submitted to this venue, with access available here.

  • Broader impacts of our works in real-world applications: Rather than generative models for images, the need to efficiently compare probability distributions in applications such as Visual Place Recognition [7] and natural language processing tasks [8, 9, 10], where slow algorithms like Sinkhorn are often used despite SW’s promise since its computation relies on the projection of the supports in the original space into 1-dimensional space, which causes the loss of topological information. We believe that the introduction of our TSW-SL and other extensions of TSW-SL in the future will be a better option in assisting to advance these fields.

  • The choice of experimental tasks: We choose to report the superiority of our methods compared with other SW-variants, varying from the synthetic task (gradient flow) and real tasks (color transfer and generative models including GAN and DDGAN) since each experiment aims to provide better aspects to highlight the empirical advantage of our methods compared with SW. From synthetic tasks to real-world problems such as generative models, we conduct experiments across various datasets, ranging from simpler datasets like CIFAR-10 (32×32) to more complex datasets like STL-10 (96×96) and CelebA (64×64). Most results show that our methods consistently outperform SW. Recognizing that experiments with SN-GAN are sufficient for simpler models, we extend our experiments to DDGAN due to its superior image generation quality. The results also indicate that our models yield better results than SW. We do not report the FID score in mean and standard deviation in DDGAN as in SN-GAN since the experiments with DDGAN take a long time to train. Most of the baselines we compare with are from [6], and this work does not provide the standard deviation for the DDGAN experiment due to long training time. With a resource constraint, we cannot repeat the training multiple times to report mean and standard deviation results.

Of all the broader impacts mentioned above, we choose to conduct our experiments on gradient flow, color transfer, and generative models since in the SW literature, there are existing works that validate their empirical advantages through these experimental settings (gradient flow [1, 2], Generative Adversarial Networks [3, 4], color transfer [5], Denoising Diffusion Model [6]). We think that comparing our methods with the same setting as existing works can help better for a fair comparison and alleviate the need to re-implement all previous methods in new tasks, thus completing the broad picture of our evaluation.

审稿意见
6

The authors proposed a computation of the Sliced Wasserstein (SW) distance using a novel projections on systems of lines and achieved improvements compared to the original SW approach.

优点

Clarity: The paper is well-written and logically structured, making it easy to follow.

Originality: The approach of efficiently solving sliced optimal transport using tree-based priors presents a novel and intriguing perspective.

Significance: This research represents a crucial advancement in the development of more efficient closed-form tools for computing Wasserstein distances, which is an important area of study.

缺点

Quality: The poor evaluation is the main weakness of the paper. The authors provided several experiments, but only Tables 1 and 2 offer an expressive justification of the proposed method.

问题

What should I observe in Figure 5? The images appear identical to me. More expressive examples of the color transfer are necessary. Why did you consider your distance as the regularization for the GAN model? In my understanding, gradient flow methods from section 6.1 in higher dimensions would provide better evidence. The paper demonstrates how the proposed distance can help decrease the distance to the target datasets. However, there is no evaluation showing that the computed distance is actually closer to the real Wasserstein distance provided. I think it is crucial to demonstrate this across a range of datasets. Table 4 does not demonstrate that the proposed method actually provides better results in FID.

评论

W1. The poor evaluation is the main weakness of the paper. The authors provided several experiments, but only Tables 1 and 2 offer an expressive justification of the proposed method.

Answer. The paper proposes a straightforward alternative to Sliced Wasserstein (SW) by substituting lines with tree systems, focusing primarily on comparing Tree-Sliced Wasserstein (TSW-SL) with the original SW. A significant advantage of tree systems is that they serve as a simple replacement for lines while preserving all the beneficial properties of lines in SW.

Enhancements to TSW-SL could be achieved by incorporating advanced techniques designed for SW, as tree systems can be locally considered as lines. Techniques applied to lines, such as the Generalized Sliced Wasserstein [2], could similarly be adapted for TSW-SL. However, while these extensions may appear straightforward, their implementation involves significant challenges. For instance, it is crucial to ensure that any TSW-SL variant retains the injectivity of the corresponding Radon Transform.

Given the extensive research and development of improved SW variants over the years, we do not anticipate TSW-SL to surpass the performance of more recent SW variants. Our focus remains on establishing the foundational aspects of TSW-SL, leaving further developments for future research.

Q4. Table 4 does not demonstrate that the proposed method actually provides better results in FID.

Regarding Table 4, which relates to Denoising Diffusion Models, we closely follow [3], the most recent paper evaluating SW methods in diffusion models, to the best of our knowledge. Table 4 reports:

  • The FID for the vanilla SW method is 2.90.
  • The FIDs for four variants using random-path projecting directions from [3] are 2.70, 2.82, 2.87, and 2.88.

By merely replacing the foundation lines with tree systems, without introducing new techniques, our model using TSW-SL achieves an FID of 2.83. As noted by reviewer 7kaA, extensions of TSW-SL are also under consideration at this venue. One such submission can be found at this link, where models using TSW-SL with additional techniques achieve FIDs of 2.60 and 2.525.

For these reasons, we believe the paper should be evaluated by the development of tree systems and the corresponding Radon Transform aspects (i.e., compare apple-to-apple), rather than focusing on numerical comparisons with state-of-the-art methods.


We sincerely thank the reviewer for the valuable feedback. If our responses adequately address all the concerns raised, we kindly hope the reviewer will consider raising the score of our paper.

评论

We appreciate the reviewer’s feedback and have provided the following responses to address the concerns raised about our paper. Below, we summarize the weaknesses and questions highlighted by the reviewer and provide our answers accordingly.


Q1. What should I observe in Figure 5? The images appear identical to me. More expressive examples of the color transfer are necessary.

Answer. The primary objective of the color-transfer task is to navigate along the curve connecting the color distributions of the source and target images. In details:

  • We employ our TSW-SL and MaxTSW-SL loss functions to minimize the distance between the color distributions of the generated and target images.

  • The 2-Wasserstein distance serves as the evaluation metric for the similarity between the generated and target images, with a smaller 2-Wasserstein distance indicating better performance.

Figure 5 showcases both qualitative and quantitative results. Qualitatively, methods that yield images with colors more closely matching the target image are considered superior. The figure displays the generated images and the corresponding 2-Wasserstein distances at the final epoch.

For a more detailed qualitative comparison of our methods with SW, MaxSW, and other SW variants, we provide an additional example of the color-transfer task in Appendix E.2. This example further demonstrates that our TSW-SL and MaxTSW-SL loss functions, when used to measure the distance between two color distributions, achieve superior results both qualitatively and quantitatively.

Q2. Why did you consider your distance as the regularization for the GAN model?

Answer. In the experiment with the GAN model discussed in Section 6.3, we use TSW-SL as the loss function to compute the distance between the distributions of the target image and the generated image. Specifically, we employ it as the generator loss in the GAN model, not as a regularization term. This approach is closely based on the methodology of the Sliced Wasserstein generator [1], with details provided in the Appendix E.3.

Q3. In my understanding, gradient flow methods from section 6.1 in higher dimensions would provide better evidence. The paper demonstrates how the proposed distance can help decrease the distance to the target datasets. However, there is no evaluation showing that the computed distance is actually closer to the real Wasserstein distance provided.

Answer. It appears that the reviewer may have misunderstood the role of TSW-SL, MaxTSW-SL, and other SW-variants in our experiments. To clarify, these are utilized as loss functions.

As described in Section 6.1 of the main text (Lines 408–413), the objective of this task is to update the source distribution to make it as close to the target distribution as possible. The empirical advantage of our TSW-SL over other SW-variants is demonstrated by measuring the 2-Wasserstein distance between the source and target distributions at steps 500, 1000, 1500, 2000, and 2500. In this task:

  • TSW-SL, MaxTSW-SL, and other SW-variants are used as the loss function.
  • The 2-Wasserstein distance serves as the evaluation metric.

It is important to emphasize that the roles of the 2-Wasserstein distance (evaluation metric) and the other distances, including TSW-SL, MaxTSW-SL, and SW-variants (loss functions), are distinct. Therefore, it is not the objective of this task to identify a distance that more closely approximates the true Wasserstein distance.
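The following minimal sketch illustrates this separation of roles, with plain SW as a stand-in loss (a TSW-SL loss would be used the same way); the data, learning rate, and step counts are illustrative assumptions:

```python
import torch

def sliced_wasserstein(X, Y, n_proj=100):
    """Monte-Carlo sliced 2-Wasserstein between equal-size point clouds (n, d)."""
    theta = torch.randn(n_proj, X.shape[1])
    theta = theta / theta.norm(dim=1, keepdim=True)   # random unit directions
    xp = (X @ theta.T).sort(dim=0).values             # sorted 1-D projections
    yp = (Y @ theta.T).sort(dim=0).values
    return (((xp - yp) ** 2).mean()) ** 0.5           # closed-form 1-D OT, averaged

torch.manual_seed(0)
source = torch.randn(256, 2, requires_grad=True)      # particles being updated
target = torch.randn(256, 2) + torch.tensor([4.0, 0.0])
opt = torch.optim.SGD([source], lr=5e-2)

for step in range(1, 2501):
    opt.zero_grad()
    loss = sliced_wasserstein(source, target)         # the *loss* function
    loss.backward()
    opt.step()
    if step % 500 == 0:
        # evaluation point: report the exact 2-Wasserstein distance between
        # source and target here (e.g. via POT's ot.emd2), *not* the loss
        pass
```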

Comment

References.

[1] Deshpande, I., Zhang, Z., & Schwing, A. G. (2018). Generative modeling using the sliced Wasserstein distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3483-3491).

[2] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

[3] Khai Nguyen, Shujian Zhang, Tam Le, and Nhat Ho. Sliced Wasserstein with random-path projecting directions. In Forty-first International Conference on Machine Learning, 2024.

Comment

We would like to thank the reviewer again for your thoughtful reviews and valuable feedback.

We would appreciate it if you could let us know if our responses have addressed your concerns and whether you still have any other questions about our rebuttal.

We would be happy to do any follow-up discussion or address any additional comments.

Comment

Thank you for addressing my concerns. I'm raising my score.

Comment

Thanks for your response, and we appreciate your endorsement.

Comment

Incorporating comments and suggestions from reviewers, as well as some further empirical studies we believe informative, we summarize here the main changes in the revised paper:

  • We added an additional example of the color-transfer task in Appendix E.2 to clearly demonstrate the qualitative advantages of our TSW-SL loss over other SW-variant losses.
  • We added a new column to Table 4 (Denoising Diffusion Model task) to report the training time per iteration for each method.
  • We included the per-iteration time of the gradient flow task on the high-dimensional dataset in Table 2.
  • We included the FID plot over epochs for DDGAN using TSW-SL and SW to showcase the empirical advantages of our method in terms of convergence rate.
  • We added an explanation of how our method (TSW-SL) is incorporated into the Denoising Diffusion Model in Appendix E.4 (line 1632).
  • We corrected typos in lines 467 and 870, clarified Example A.14 (as noted by Reviewer 7kaA), corrected the typo in line 322 (as suggested by Reviewer HzfV), and added references [1], [2], and [3] in line 57 based on the recommendations of Reviewer HzfV.

Reference.

[1] Lin, T., Zheng, Z., Chen, E., Cuturi, M., & Jordan, M. I. (2021, March). On projection robust optimal transport: Sample complexity and model misspecification. In International Conference on Artificial Intelligence and Statistics (pp. 262-270). PMLR.

[2] Huang, M., Ma, S., & Lai, L. (2021, July). A Riemannian block coordinate descent method for computing the projection robust Wasserstein distance. In International Conference on Machine Learning (pp. 4446-4455). PMLR.

[3] Muzellec, B., & Cuturi, M. (2019). Subspace detours: Building transport plans that are optimal on subspace projections. Advances in Neural Information Processing Systems, 32.

Comment

Dear AC and reviewers,

Thanks for your thoughtful reviews and valuable comments, which have helped us improve the paper significantly.

We sincerely thank the reviewers for their valuable feedback and constructive suggestions. We are encouraged by the positive endorsements regarding the following aspects of our work:

  1. The derivation is clear, well-structured, and provides a thoroughly developed framework. (Reviewers tDY2, MtTT, 7kaA, HzfV)

  2. The proposed metric is indeed a valid metric and can be approximated efficiently via a closed-form expression with the same computational complexity as Sliced-Wasserstein (with respect to samples). (Reviewers tDY2, 7kaA, HzfV)

  3. The proposed metric demonstrates consistently better performance compared to Sliced-Wasserstein in tasks such as Gradient Flows, Color Transfer, and Generative Models. (Reviewers tDY2, 7kaA, HzfV)


Below, we address some common points raised in the reviews:

Q1. Key contribution of Tree-sliced framework and promising directions for future work.

The tree-sliced framework in our paper is built on two key insights:

  • Local Perspective: Each line in a tree system is treated similarly to a line in the Sliced Wasserstein (SW) framework. Splitting maps determine how the mass at each point is distributed across the lines, and the projection of these mass portions onto the lines is then processed in the same way as in SW. A significant challenge with this approach lies in verifying whether the injectivity of the corresponding Radon Transform is preserved, as this determines whether the proposed metric is a true metric or merely a pseudo-metric. However, we addressed this concern by providing a proof in the paper.

  • Global Perspective: Tree structures and splitting maps establish connections between the lines, creating a cohesive system. This introduces a novel aspect compared to SW, enabling interaction and integration among the lines in a tree system. The Wasserstein distance can now be computed on this space with a closed-form expression, analogous to how lines are treated in SW (a minimal sketch of such a closed-form computation is given below).
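To make the global perspective concrete, the following is a minimal sketch of the classical closed-form tree-Wasserstein formula of Le et al. (2019), on which tree-sliced approaches build. It illustrates why OT on tree-structured domains requires no LP solver; it is not the exact TSW-SL computation in our paper (which additionally involves tree systems of lines and splitting maps), and the node-indexing convention is an assumption of the sketch:

```python
import numpy as np

def tree_wasserstein(parent, weight, mu, nu):
    """W1 under the tree metric: sum over edges of
    edge_weight * |mass of mu below the edge - mass of nu below the edge|.
    parent[v] is v's parent (root has parent -1); weight[v] is the weight of
    the edge (v, parent[v]); mu, nu are node masses, each summing to 1."""
    n = len(parent)
    diff = np.asarray(mu, dtype=float) - np.asarray(nu, dtype=float)
    total = 0.0
    # accumulate subtree mass differences bottom-up (children before parents,
    # which holds if nodes are indexed so that parent[v] < v)
    for v in range(n - 1, 0, -1):
        total += weight[v] * abs(diff[v])
        diff[parent[v]] += diff[v]
    return total

# toy chain tree 0 - 1 - 2 with unit edge weights
parent = [-1, 0, 1]
weight = [0.0, 1.0, 1.0]
print(tree_wasserstein(parent, weight, [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
```

The cost is linear in the number of nodes, which is the key reason tree-structured integration domains retain the computational appeal of one-dimensional lines.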

From these insights, several follow-up directions for TSW-SL can be pursued, focusing on each perspective:

  • Local Perspective: Consider, for example, the Generalized Sliced Wasserstein (GSW) distance [1]. In GSW, the SW framework is retained, but the projection mechanism is altered. Specifically, GSW generalizes the integration level set in the Radon Transform, replacing the level set defined by the inner product (representing orthogonal hyperplanes) with one defined by an arbitrary function. Similarly, TSW-SL, which currently relies on the inner product, could be generalized to use an arbitrary defining function, offering new flexibility and applications (a minimal illustration of this swap is sketched after this list).

  • Global Perspective: More detailed analysis can aid in designing improved tree structures and splitting maps. For instance, splitting maps could be enhanced to account for both positional information from points and lines, rather than relying solely on lines as in the current implementation. This idea is central to an extension of the TSW-SL paper, which is referenced by Reviewer 7kaA and has also been submitted to this venue. It can be accessed here.
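As a minimal, hypothetical illustration of the defining-function swap described in the local-perspective item above (none of these choices come from the paper): only the projection step changes, while the downstream sorting and one-dimensional OT computations are untouched.

```python
import torch

def project(X, theta, defining="linear"):
    """Project points X (n, d) onto directions theta (d, k) under a chosen
    defining function; the level sets of "linear" are orthogonal hyperplanes."""
    if defining == "linear":
        return X @ theta                 # classical SW / TSW-SL projection
    if defining == "odd_poly":
        return (X ** 3) @ theta          # a restricted odd-polynomial example
    raise ValueError(defining)
```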

It is important to note that while these ideas might seem straightforward, their development is non-trivial. For example, a critical property that must be preserved in TSW-SL variants is the injectivity of the corresponding Radon Transform. For these reasons, we believe the paper should be evaluated on the development of tree systems and the corresponding Radon Transform (i.e., an apples-to-apples comparison), rather than on numerical comparisons with state-of-the-art methods.

Comment

Q2. Evaluation of TSW-SL.

The paper proposes a straightforward alternative to Sliced Wasserstein (SW) by substituting lines with tree systems, focusing primarily on comparing Tree-Sliced Wasserstein (TSW-SL) with the original SW. A significant advantage of tree systems is that they serve as a simple replacement for lines while preserving all the beneficial properties of lines in SW.

Enhancements to TSW-SL could be achieved by incorporating advanced techniques designed for SW, as tree systems can be locally considered as lines. Techniques applied to lines, such as the Generalized Sliced Wasserstein [1], could similarly be adapted for TSW-SL. However, while these extensions may appear straightforward, their implementation involves significant challenges. For instance, it is crucial to ensure that any TSW-SL variant retains the injectivity of the corresponding Radon Transform.

From an empirical perspective, replacing lines with tree systems shows promising results. To illustrate this, let's consider the Denoising Diffusion Models task in our paper. We closely follow [2], which, to the best of our knowledge, is the most recent paper evaluating SW methods in diffusion models. Table 4 reports the following:

  • The FID for the vanilla SW method is 2.90.
  • The FIDs for four variants using random-path projecting directions from [2] are 2.70, 2.82, 2.87, and 2.88.

By merely replacing the underlying lines with tree systems, without introducing new techniques, our model using TSW-SL achieves an FID of 2.83. Models using TSW-SL with additional techniques (in the aforementioned submission) achieve FIDs of 2.60 and 2.525.

Given the extensive research and development of improved SW variants over the years, we do not anticipate that TSW-SL will surpass the performance of more recent SW variants. Our focus remains on establishing the foundational aspects of TSW-SL, leaving further developments for future research.


Reference.

[1] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced Wasserstein distances. Advances in Neural Information Processing Systems, 32, 2019.

[2] Khai Nguyen, Shujian Zhang, Tam Le, and Nhat Ho. Sliced Wasserstein with random-path projecting directions. In Forty-first International Conference on Machine Learning, 2024.


We are glad to answer any further questions you have on our submission.

Comment

Dear Reviewers,

We would like to thank you very much for your feedback, and we hope that our response addresses your previous concerns. In case you have not responded to our rebuttal so far, please feel free to let us know if you have any further comments on our work as the discussion phase is expected to conclude in the next few days. We would be more than happy to address any additional concerns from you.

Thank you again for spending time on the paper. We really appreciate that!

Best regards,

Authors

Comment

Dear Area Chair and Reviewers,

As the discussion phase comes to a close, we would like to summarize the key contributions of our paper.

In this work, we introduce the Tree-Sliced Wasserstein (TSW) distance, an extension of the well-known Sliced Wasserstein (SW) distance. Instead of relying on traditional one-dimensional lines, the TSW distance utilizes a more intricate integration domain, referred to as tree systems. In essence, tree systems are structures in which one-dimensional lines are interconnected, behaving locally as lines.

This innovative framework paves the way for the introduction of the Tree-Sliced Wasserstein on Systems of Lines (TSW-SL) distance, facilitating the application of Tree-Sliced distances to dynamic-support problems (e.g., GANs and Diffusion models). Addressing such applications has been a significant open challenge since the concept of Tree-Slicing was introduced at NeurIPS 2019 [1]. To our knowledge, no successful attempts had been made to tackle this problem until now. Our work addresses this gap and opens new avenues for applying TSW to these complex tasks.

Our experimental results demonstrate the feasibility of the proposed method. While this is a foundational study, we show that TSW-SL outperforms traditional SW under equivalent computational costs. It is important to note that our primary goal is to establish a theoretical framework for TSW rather than competing with state-of-the-art SW-based methods. Furthermore, since tree systems are locally composed of lines, we anticipate that techniques developed for SW can be adapted to tree systems, potentially offering further improvements. This potential is discussed in more detail in our General Response.

Finally, the concepts introduced in this work serve as the foundation for extensions of the TSW-SL framework, which have also been submitted to this venue. Two examples of these extensions can be found here (referenced by Reviewer 7kaA) and also here. In these submissions, the proposed methods demonstrate either superior performance or results comparable to state-of-the-art SW approaches in their respective tasks.

We hope this summary helps the Area Chair and Reviewers form an accurate evaluation of the theoretical contributions presented in our paper.

Best regards,

Authors


Reference.

[1] Tam Le, Makoto Yamada, Kenji Fukumizu, and Marco Cuturi. Tree-sliced variants of Wasserstein distances. NeurIPS, 2019.

AC Meta-Review

This paper introduces the Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL), a novel Optimal Transport (OT) variant that replaces one-dimensional projections in Sliced Wasserstein (SW) with tree systems to better preserve topological information. It provides a theoretical framework for tree systems, including their topological properties, splitting maps as projection mechanisms, and an injectivity result for the associated Radon transform. Experiments on tasks such as gradient flows, image style transfer, and generative modeling show performance similar or favorable to SW and its variants.

Reviews for this paper are mixed. All the reviewers agree that the paper presents an original, novel formulation of SW that is theoretically consistent. Nonetheless, after considering the different elements of discussion during the rebuttal, I concur with reviewers MtTT and 7kaA that the experiments are not fully convincing in showing the superiority of TSW-SL compared to other variants of SW, nor is it demonstrated in an ML setting where its usage would be beneficial. For this reason, I am inclined toward a reject decision. I encourage the authors to strengthen their paper by finding an ML setting where a clear advantage is shown for TSW-SL, not only with respect to variants of SW, but also with respect to other distribution divergences or distances.

Additional Comments from the Reviewer Discussion

Discussions were quite lengthy, but some of the criticisms remained unaddressed by the authors.

Final Decision

Reject