Fourier Sliced-Wasserstein Embedding for Multisets and Measures
A bi-Lipschitz Euclidean embedding for multisets that extends injectively to measures, based on the sliced Wasserstein distance
Abstract
Reviews and Discussion
Summary:
The paper introduces the "Fourier Sliced Wasserstein (FSW) embedding" for data in $\mathbb{R}^d$.
Theoretical Contributions:
- The authors prove that the embedding preserves or approximates the sliced Wasserstein distance.
- They also demonstrate that the embedding technique is injective and bi-Lipschitz.
Numerical Experiments:
- The authors evaluate the approximation error of the proposed Fourier Sliced Wasserstein embedding.
- They showcase an application of FSW for approximating the Wasserstein distance using a Multi-Layer Perceptron (MLP).
Strengths
- The combination of the Fourier/cosine transform and the sliced Wasserstein distance (see Eq. (6)) is a novel approach.
- Theoretical properties for this new technique with respect to the uniform distribution, along with its empirical approximation, are proposed (see Theorem 3.2, Corollary 3.3).
- Injectivity and bi-Lipschitz properties of the embedding have been investigated.
Weaknesses
- I recommend adding a section to introduce baseline methods. For example, explaining how Sinkhorn [Cuturi, 2013] can be used to train a neural network as a Wasserstein distance estimator. Currently, the experimental setup (E1, E2, Phi, Leaky-ReLU) appears tailored only to the proposed method in this paper.
- It would be beneficial to introduce a real-data application of the proposed Sliced Wasserstein distance embedding technique to illustrate its practical utility.
- I’m unclear on why 'bi-Lipschitz' is considered a crucial property. Could you provide an example to clarify? For instance, in which applications would the lack of a bi-Lipschitz property cause issues, and where having this property could offer distinct advantages?
Questions
- Regarding point (2), is "Multisets" simply another term for "discrete distributions"?
- Could you clarify what "E2" refers to in lines 473-474?
Response to summary:
We would like to highlight that, in addition to the theoretical guarantees for our embedding, we present the impossibility result stated in Theorem 4.4. This result proves that it is impossible to embed discrete distributions into any finite-dimensional Euclidean space in a bi-Lipschitz manner. This saves the community further efforts in that direction, and essentially shows that an embedding with substantially better analytical properties than the FSW does not exist.
Weakness 1: Introducing baseline methods
Thank you for your suggestion. We added to the revised manuscript a paragraph describing all the baseline methods (lines 498-504).
We would like to stress that the methods presented in Table 2 are those introduced in [Chen] and earlier papers, and were not implemented by us. Specifically, NProductNet and NSDeepSets, as well as WPCE, use their own architectures, and Sinkhorn is an approximation algorithm specifically designed to approximate the Wasserstein distance.
Our architecture, based on $E_1$, $E_2$, and $\Phi$ and described in line 485, achieves state-of-the-art results with a simple combination of our embedding and one MLP. In comparison, the architecture of [Chen] produces inferior results using three MLPs.
The only method other than ours that we tested with our architecture was PSWE, which is designed to compute Sliced-Wasserstein preserving embeddings.
Lastly, the reason why Sinkhorn cannot be incorporated into our architecture is due to its own inherent limitation: it takes pairs of input distributions and estimates their distances, rather than computing a distance-preserving embedding for individual distributions. This significantly limits its applicability to practical learning tasks, as we further discuss in our response to Reviewer Ew1o. We also added a brief explanation in line 307.
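To make this interface difference concrete, below is a minimal Python sketch. The function names and the stand-in computations are hypothetical and chosen only to illustrate the contrast; they are not the actual API of any Sinkhorn implementation or of our method.

```python
import numpy as np

# Hypothetical toy signatures (ours, for illustration only; not the actual
# API of any Sinkhorn implementation or of our embedding).

def pairwise_estimate(mu: np.ndarray, nu: np.ndarray) -> float:
    # A pairwise method consumes TWO distributions and returns one scalar;
    # it never yields a feature vector for an individual input. Stand-in
    # computation: exact 1-D W_1 between uniform empirical measures.
    return float(np.abs(np.sort(mu) - np.sort(nu)).mean())

def embed(mu: np.ndarray) -> np.ndarray:
    # An embedding consumes ONE distribution and returns a Euclidean vector
    # that can feed any downstream model, e.g. an MLP classifier.
    # Stand-in features: three quantiles of the input sample.
    return np.quantile(mu, [0.25, 0.5, 0.75])

rng = np.random.default_rng(0)
x, y = rng.normal(size=100), rng.normal(size=100) + 1.0
print(pairwise_estimate(x, y))  # defined only for pairs of inputs
print(embed(x))                 # per-input features, reusable across tasks
```

Only the second signature yields per-input features that a standard classifier can consume, which is the crux of the applicability gap discussed above.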
Weakness 2: Real-data application of our method
The experiment presented in Table 2 illustrates the utility of our embedding in a learning task on real-world data (ModelNet-40). We note that our paper includes all experiments from the NeurIPS-accepted work by Chen and Wang [Chen], as well as theoretical results that in our opinion merit acceptance in their own right.
To be continued...
Weakness 3: Why bi-Lipschitzness is important
The lack of bi-Lipschitzness, which plagues most prevalent multiset architectures to date, practically implies that there inevitably exist pairs of different input multisets that appear numerically identical to the architecture. This poses a problem, for example, in classification tasks where such pairs need to be assigned different labels. Any multiset architecture based on sum- or max-pooling is provably affected by this problem [FWT].
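As a concrete toy illustration (our own example, not taken from the paper): with identity features, both sum- and max-pooling map the distinct multisets {0, 2, 2} and {1, 1, 2} to the same output. Richer feature maps change which pairs collide, but [FWT] shows that for sum- and max-pooling such colliding pairs always exist.

```python
import numpy as np

# Two distinct multisets of scalars that both sum-pooling and max-pooling
# (with identity features) map to identical outputs.
A = np.array([0.0, 2.0, 2.0])   # multiset {0, 2, 2}
B = np.array([1.0, 1.0, 2.0])   # multiset {1, 1, 2}

print(A.sum(), B.sum())  # 4.0 4.0 -> sum-pooling cannot separate A and B
print(A.max(), B.max())  # 2.0 2.0 -> max-pooling cannot separate them either
```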
Achieving a bi-Lipschitz embedding for multisets has been recognized as an important goal in previous works. Our work is the first to fully achieve this goal. Below is a selection of previous works that underscore the importance of bi-Lipschitzness:
"The question of which metric spaces admit a bilipschitz embedding into some (finite-dimensional) Euclidean space is natural, and has received a lot of attention in recent work. The results obtained so far indicate that there is no simple answer to this question."
— Lang, Urs, and Conrad Plaut. "Bilipschitz embeddings of metric spaces into space forms." Geometriae Dedicata 87 (2001): 285-307.
"Since the late 1990’s, it has become apparent that designing efficient approximate nearest neighbor algorithms, at least for high-dimensional data, is closely related to the task of designing low-distortion embeddings. A bi-Lipschitz embedding between two metric spaces is a mapping such that ... where the parameter called [sic] the distortion of ."
— Indyk, Piotr, and Assaf Naor. "Nearest-neighbor-preserving embeddings." ACM Transactions on Algorithms (TALG) 3.3 (2007): 31-es.
"The second negative result is that while moments of MLPs with analytic activations can be injective, they can never be stable in the bi-Lipschitz sense. This points to a possible advantage of injective multiset functions that are not based on moments, but rather on sorting or max-filters."
— Amir, T., Gortler, S., Avni, I., Ravina, R., & Dym, N. "Neural injective functions for multisets, measures and graphs via a finite witness theorem." Advances in Neural Information Processing Systems (NeurIPS), 2023.
“We propose developing fine-grained expressivity results, namely metric equivalencies between explicit graph metrics and feature metrics for GNNs on graphs with features. An ideal result would derive a bi-Lipschitz correspondence between such metrics.”
— Christopher Morris, Nadav Dym, Haggai Maron, İsmail İlkan Ceylan, Fabrizio Frasca, Ron Levie, Derek Lim, Michael Bronstein, Martin Grohe, and Stefanie Jegelka. "Future Directions in Foundations of Graph Machine Learning." International Conference on Machine Learning (ICML) (2024)
Question 1: Regarding point (2), is "Multisets" simply another term for "discrete distributions"?
Multisets and discrete distributions refer to different concepts. Multisets are essentially sets that account for repetitions. For instance, $\{1, 1, 2\} \neq \{1, 2\}$ as multisets, even though they are equal as sets. Discrete distributions, on the other hand, are probability distributions with finite support. However, multisets can be identified with the subset of discrete distributions with uniform weights, as discussed beginning at line 183.
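A small sketch of this identification (the example values are ours):

```python
from collections import Counter
from fractions import Fraction

# The multiset {1, 1, 2} corresponds to the discrete distribution that puts
# weight 2/3 on 1 and 1/3 on 2: uniform weight 1/n per element occurrence.
multiset = [1, 1, 2]
n = len(multiset)
distribution = {x: Fraction(c, n) for x, c in Counter(multiset).items()}
print(distribution)  # {1: Fraction(2, 3), 2: Fraction(1, 3)}
```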
Question 2: Could you clarify what "E2" refers to in lines 473-474?
$E_1$ and $E_2$ are two independent instances of the FSW embedding, with different input and output dimensions. $E_1$ maps distributions over the input space $\mathbb{R}^d$ into $\mathbb{R}^{m_1}$, and $E_2$ maps distributions over $\mathbb{R}^{m_1}$ into $\mathbb{R}^{m_2}$, with $m_1$, $m_2$ being architecture hyperparameters. This is detailed in the manuscript (lines 481-485 in the revised version).
References:
[Chen] Chen, S., & Wang, Y. "Neural approximation of Wasserstein distance via a universal architecture for symmetric and factorwise group invariant functions." Advances in Neural Information Processing Systems (NeurIPS), 2023.
[FWT] Amir, T., Gortler, S., Avni, I., Ravina, R., & Dym, N. "Neural injective functions for multisets, measures and graphs via a finite witness theorem." Advances in Neural Information Processing Systems (NeurIPS), 2023.
The reviewer thanks the authors for their response. I will keep my rating (accept).
The paper seeks to establish a mapping from multisets and measures over $\mathbb{R}^d$ into Euclidean space, ensuring that the sliced Wasserstein distance corresponds to the distance between their mappings in the target space. The authors propose a mapping that is bi-Lipschitz for multisets and injective for measures. Additionally, they demonstrate that a bi-Lipschitz map for measures does not exist.
Strengths
The paper is well-structured, and its message is clear. The proofs provided are rigorous and exceptionally clear. This particular problem is quite interesting. I really enjoyed reading the paper.
Weaknesses
I don't see any weaknesses. Therefore, I recommend accepting it.
Questions
I don't have any questions.
Thank you :)
This paper presents a novel approach to high-dimensional dataset embedding. The authors provided theoretical performance guarantees and numerical study to show the superior performance of the framework.
Strengths
The theoretical contribution seems to be sound, with explicitly stated technical assumptions and results. Numerical study is solid.
Weaknesses
- The authors devoted much space to describing the $p$-Wasserstein and infinity-type Wasserstein distances. Why is it necessary to introduce the infinity-type Wasserstein distance?
- In line 222, the authors mention that in the special case of d=1, the Wasserstein distance can be computed significantly faster. So what is the complexity rate?
- In line 344, what is the definition of STD???
- The authors should provide proof ideas for the main technical results in the main content.
Questions
I am new to this field. Could the authors elaborate more on the practical motivation and applications of this approach?
We thank the reviewer for the comments. We have uploaded a revised version of the manuscript, where the comments have been addressed and incorporated.
Response to summary:
We would like to highlight an additional theoretical contribution in our paper: the impossibility result of Theorem 4.4, which proves that it is impossible to embed discrete distributions into any finite-dimensional Euclidean space in a bi-Lipschitz manner. This saves the community further efforts in that direction, and essentially shows that an embedding with substantially better analytical properties than the FSW does not exist.
Response to weaknesses:
- On the inclusion of $W_\infty$: We included the definition of the $p$-Wasserstein distance with $p = \infty$ to allow our results to be stated across all $p \in [1, \infty]$. Our bi-Lipschitzness guarantee and impossibility result apply uniformly for all $p$ in this range. (A standard formulation of these distances is given after this list.)
- Complexity of Wasserstein when $d = 1$: The complexity in this case is the computational complexity of the sort function, which is $O(n \log n)$. This is stated in lines 223-224. (See the sketch after this list.)
- Definition of STD: STD here stands for standard deviation. We clarified this in the revised manuscript (lines 341-342).
- Proof ideas in the main text: We added an overview of the proof ideas of Theorems 4.1, 4.2 and 4.4 to the revised manuscript. Thank you for this comment.
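For reference, a standard formulation of the distances mentioned in the first point above (as found in the optimal transport literature; the paper's own normalization may differ slightly):

$$W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int \|x - y\|^p \, d\pi(x, y) \right)^{1/p}, \qquad p \in [1, \infty),$$

$$W_\infty(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \operatorname{ess\,sup}_{(x, y) \sim \pi} \|x - y\|,$$

where $\Pi(\mu, \nu)$ denotes the set of couplings of $\mu$ and $\nu$.

And a minimal Python sketch of the $d = 1$ special case, assuming uniform empirical measures with equally many points (our own illustration, not the paper's code):

```python
import numpy as np

# In one dimension, W_p between two uniform empirical measures with n points
# each reduces to matching sorted samples (i.e., comparing quantile
# functions), so sorting dominates the cost at O(n log n).
def wasserstein_1d(x: np.ndarray, y: np.ndarray, p: float = 2.0) -> float:
    assert len(x) == len(y), "equal sizes and uniform weights assumed"
    diffs = np.abs(np.sort(x) - np.sort(y))
    return float((diffs ** p).mean() ** (1.0 / p))

rng = np.random.default_rng(0)
x, y = rng.normal(size=10_000), rng.normal(size=10_000) + 1.0
print(wasserstein_1d(x, y))  # ~1.0: W_2 between N(0,1) and N(1,1) equals 1
```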
Response to question on the practical motivation:
To illustrate the advantage of our approach for practical applications, consider a learning task on multisets handled by traditional architectures based on sum- or max-pooling. With these methods, certain pairs of input multisets may appear numerically identical, meaning the architecture will not be able to distinguish between them—even if they represent different underlying data. In contrast, our approach, due to its bi-Lipschitzness guarantee, can distinguish between these multisets in a way that reflects their actual differences. This practical advantage is evidenced in our experimental results shown in Table 2.
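Continuing the toy example from our earlier response (assumed values): a sort-based view of multisets, which underlies quantile-based embeddings such as ours, separates the very pair that sum- and max-pooling confuse.

```python
import numpy as np

# The multisets {0, 2, 2} and {1, 1, 2} collide under sum- and max-pooling,
# but their sorted representations (empirical quantile functions) differ.
A = np.array([0.0, 2.0, 2.0])
B = np.array([1.0, 1.0, 2.0])

print(A.sum() == B.sum(), A.max() == B.max())   # True True: pooling collides
print(np.sort(A), np.sort(B))                   # [0. 2. 2.] vs [1. 1. 2.]
print(np.linalg.norm(np.sort(A) - np.sort(B)))  # > 0: sorted features differ
```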
Dear Reviewer kJBi,
We wanted to follow up regarding our response to your review. In our rebuttal, we addressed all the points raised, including the practical motivation for our approach, proof ideas, and other requested clarifications.
If you find that our revision and explanations have resolved your concerns, we would greatly appreciate it if you could consider revisiting and potentially adjusting your score.
Thank you for your review and feedback. Please don’t hesitate to reach out if there are any remaining points we can clarify further.
Best regards, The Authors
This paper considers a Fourier slicing embedding both for probability distributions and for multisets over $\mathbb{R}^d$ supported on finitely many points. The embedding projects the input onto a one-dimensional direction sampled from the sphere, then computes a cosine transform of the projected quantile function. Under a specific probability distribution of the frequencies, the authors prove that the expected estimation error between the embedded measures is exactly the sliced Wasserstein distance. A second part of the theoretical results proves the injectivity of the embedding under the assumption that the embedding dimension is sufficiently large. Numerical experiments are conducted on point cloud classification.
Strengths
- The paper is well-written and easy to follow. Proofs are rigorous.
- Proposing the Fourier sliced-Wasserstein (FSW) embedding through a cosine transform of the projected quantile function. Sampling the quantile function via a cosine transform is novel.
- Injectivity and bi-Lipschitz properties of FSW on the collection of multisets.
- Numerical experiments showcase better Wasserstein approximation on simulated datasets and three real datasets than NProductNet, WPCE, NSDeepSets, and Sinkhorn.
Weaknesses
- Several approaches derive from the sliced Wasserstein distance, such as distributional sliced Wasserstein (Nguyen et al., ICLR'21), max-sliced Wasserstein, etc. Could you highlight the difference between FSW and these state-of-the-art sliced Wasserstein variants?
Questions
See the Weaknesses section.
There is a fundamental difference between our embedding approach and approaches such as Distributional Sliced Wasserstein and Max-Sliced Wasserstein. Our approach takes one input distribution at a time and computes an embedding, whereas the aforementioned approaches take two input distributions and estimate a distance between them. Pairwise methods have two disadvantages in comparison with embeddings: (i) higher computational complexity when computing multiple pairwise distances, and (ii) limited applicability to real-world learning problems.
In terms of applicability, a pairwise method cannot be directly applied to common learning tasks, such as object classification, where the inputs are typically individual distributions. In contrast, an embedding is readily applicable to such tasks, as demonstrated in our experiments.
In terms of computation, pairwise methods to estimate sliced optimal transport distances typically require $O(mnd)$ time (neglecting logarithmic factors), where $m$ is the number of slices, $n$ is the maximal number of support points, and $d$ is the ambient dimension of the support. Thus, computing all pairwise distances for a set of $N$ distributions would take $O(N^2 mnd)$ time. In contrast, computing our embedding takes $O(mnd)$ time for each input distribution, and pairwise distances can then be computed in the Euclidean embedding space $\mathbb{R}^m$, resulting in a considerably lower total complexity of $O(Nmnd + N^2 m)$. This approach is therefore significantly more scalable for large datasets where pairwise distance computations are required.
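The following Python sketch illustrates this embed-once-then-compare workflow. The embedding below is a schematic stand-in inspired by the FSW construction (random directions, sorted projections as empirical quantile functions, cosine-transform coefficients); the exact construction and the frequency distribution are as specified in the paper, and the frequency law used here is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def fsw_like_embedding(X, dirs, freqs):
    """Schematic stand-in for the FSW embedding (assumed form; see the paper
    for the exact construction): project the points onto each direction,
    sort to obtain the empirical quantile function, and take one
    cosine-transform coefficient per (direction, frequency) pair."""
    n = X.shape[0]
    t = (np.arange(n) + 0.5) / n            # quantile levels (midpoint rule)
    proj = np.sort(X @ dirs.T, axis=0)      # (n, m): sorted 1-D projections
    return np.array([np.mean(np.cos(np.pi * f * t) * proj[:, j])
                     for j, f in enumerate(freqs)])

d, m, n, N = 3, 64, 100, 50
dirs = rng.normal(size=(m, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # directions on sphere
freqs = rng.exponential(scale=3.0, size=m)           # placeholder frequency law

# Embed each distribution ONCE (N * O(mnd)), then compare cheaply in R^m:
clouds = [rng.normal(size=(n, d)) for _ in range(N)]
emb = np.stack([fsw_like_embedding(P, dirs, freqs) for P in clouds])  # (N, m)
dists = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)    # (N, N)
print(dists.shape)  # all pairwise distances at N^2 * O(m) cost
```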
We appreciate the reviewer's comment and will clarify this in the paper.
I thank the authors for their answers to my concern. I am keeping my score the same.
Dear Reviewers,
Thank you for your thoughtful comments on our submission.
We have responded to all of your comments and uploaded a revised version of the manuscript, carefully addressing your feedback. Changes are highlighted in blue for your convenience.
We would greatly appreciate it if you could confirm whether our responses and the revised manuscript adequately address your concerns. This will help us ensure that we have addressed all your feedback within the discussion period.
Thank you for your time and consideration.
Best regards,
The Authors
Dear Reviewers,
Thank you again for your thoughtful comments on our submission. As mentioned in our previous post, we have responded to all of your comments and uploaded a revised version of the manuscript, carefully addressing your feedback. Changes are highlighted in blue for your convenience.
As the discussion period comes to a close, we would greatly appreciate it if you could confirm whether our responses and the revised manuscript adequately address your concerns.
Thank you for your time and consideration.
Best regards,
The Authors
This paper proposes a novel approach to high-dimensional dataset embedding. Most reviewers found that the paper is of interest and provides relevant contributions to the field.
Additional Comments on Reviewer Discussion
There were few discussions beyond the rebuttals, as most reviewers were happy with the paper.
Accept (Poster)