PaperHub
Overall: 6.6/10
Poster · 4 reviewers
Ratings: 3, 4, 4, 3 (min 3, max 4, std 0.5)
ICML 2025

Kernel Quantile Embeddings and Associated Probability Metrics

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We propose quantiles in RKHSs, and associated distances that are efficient, metrising under weaker conditions than MMD, and can be seen as a kernelised generalisation of sliced Wasserstein. We demonstrate good performance in hypothesis testing.

Abstract

Keywords
kernel methods, probability metrics, quantiles

Reviews and Discussion

Official Review
Rating: 3

This paper introduces kernel quantile embeddings (KQEs) as a novel way to represent probability distributions in reproducing kernel Hilbert spaces (RKHS), extending beyond traditional kernel mean embeddings (KMEs). The authors leverage KQEs to define a new family of probability metrics that require weaker kernel conditions than MMD, connect to the sliced Wasserstein distance, and enable efficient near-linear estimation, demonstrating their effectiveness in hypothesis testing.

Questions for the Authors

N/A

Claims and Evidence

My main concerns are as follows:

  1. There is no intuitive explanation behind the KQE. While the authors claim their approach is motivated by concepts from the statistics and econometrics literature, it remains unclear what specific problem KQEs address that cannot be handled by existing quantile estimation methods or previous kernel mean embeddings.
  2. The procedure for conducting the hypothesis test and determining the threshold is not described.
  3. In Algorithm 3.1, the density function $f_\nu$ is assumed to be given. How should this function be set in practice? Does it require additional data for simulation?

Methods and Evaluation Criteria

N/A

Theoretical Claims

This paper establishes several desirable properties of KQEs and highlights the connection between KQDs and Wasserstein distances.

Experimental Design and Analysis

There is a lack of comparison with other embedding methods proposed in Chatalic et al. (2022), Lerasle et al. (2019), and Chwialkowski et al. (2015).

References:

Chatalic, A., Schreuder, N., Rosasco, L., et al. Nyström kernel mean embeddings. International Conference on Machine Learning, PMLR, 2022: 3006-3024.

Lerasle, M., Szabó, Z., Mathieu, T., et al. MONK: outlier-robust mean embedding estimation by median-of-means. International Conference on Machine Learning, PMLR, 2019: 3782-3793.

Chwialkowski, K. P., Ramdas, A., Sejdinovic, D., et al. Fast two-sample testing with analytic representations of probability measures. Advances in Neural Information Processing Systems, 28, 2015.

Supplementary Material

N/A

Relation to Prior Literature

N/A

Missing Important References

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their thoughtful comments and for recognizing the desirable properties established for KQEs and the connections drawn between KQDs and Wasserstein distances. We address the points raised below and hope that, in light of these responses, the positive feedback from other reviewers, and new experimental evidence provided in response to other reviewers (particularly kcy5), the reviewer may consider raising their score.

I. “it remains unclear what specific problem KQEs address that cannot be handled by existing quantile estimation methods or previous kernel mean embeddings”.

This question is two-part: (1) value compared to “previous kernel mean embeddings” and (2) value compared to “existing quantile estimation methods.” We address these separately.

  1. For benefits compared to kernel mean embeddings, and specifically when they are not enough, we refer to point (IV) in our response to dCj7. In a nutshell, the conditions for a kernel to be quantile-characteristic are weaker than those required for it to be mean-characteristic, and the latter can be difficult to establish beyond Euclidean domains. Additionally, we show in a number of experiments that even when a mean-characteristic kernel is a good choice, the KQD can still outperform MMD approximations of similar cost, highlighting its practical advantage.
  2. As to benefits compared to existing quantile methods: Sliced Wasserstein (which compares all quantiles in $[0, 1]$) is (1) specific to $\mathbb{R}^d$, and (2) does not take advantage of the flexibility and rich non-linear representations inherent to kernel methods, in contrast to KQD. This was pointed out in Connections 1 and 2: SW is KQD with the linear kernel $k(x, x')$. The consequences of this are evident in the experiment outlined in point (III) of our response to dCj7: even on $\mathbb{R}^d$, KQD with a Gaussian kernel outperforms Sliced Wasserstein, due to the Gaussian kernel's greater expressiveness compared to the linear kernel. A minimal numerical sketch of the sliced construction that KQD generalises is given after this list.
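For concreteness, the sketch below estimates the sliced Wasserstein distance by sorting 1D projections along random unit directions, which is the special case that KQD recovers with the linear kernel (Connection 1). The function name and parameter choices are illustrative and are not the authors' implementation.

```python
# Minimal sketch: sliced Wasserstein-p via random projections and sorted
# 1D samples (empirical quantile functions). Equal sample sizes assumed.
import numpy as np

def sliced_wasserstein(X, Y, n_directions=100, p=2, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    U = rng.standard_normal((n_directions, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # uniform directions on the sphere
    qx = np.sort(X @ U.T, axis=0)                  # empirical quantiles of each projection
    qy = np.sort(Y @ U.T, axis=0)
    return np.mean(np.abs(qx - qy) ** p) ** (1.0 / p)

# Example: standard Gaussian vs. a mean-shifted Gaussian in R^5.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))
Y = rng.standard_normal((500, 5)) + 0.5
print(sliced_wasserstein(X, Y))
```

Replacing the linear projection $x \mapsto \langle x, u\rangle$ with a projection of the kernel feature map onto unit-norm RKHS directions gives the kernelised family discussed in the paper.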

II. Procedure for Hypothesis Testing and Threshold Selection:

We use a permutation-based approach with 300 permutations to estimate the null distribution of the test statistic and determine the threshold, ensuring proper Type I error control ([1]). This procedure will be added as an algorithm in the camera-ready version.

[1] Erich Leo Lehmann, Joseph P Romano, and George Casella. Testing statistical hypotheses, volume 3. Springer, 1986.
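For illustration, here is a minimal sketch of such a permutation procedure, assuming a generic two-sample statistic `stat_fn` (a placeholder for, e.g., a KQD or MMD estimator); this is the standard recipe described above, not the authors' code.

```python
# Permutation two-sample test with B = 300 permutations (sketch).
import numpy as np

def permutation_test(X, Y, stat_fn, n_permutations=300, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    observed = stat_fn(X, Y)
    pooled = np.concatenate([X, Y], axis=0)
    n = len(X)
    null_stats = np.empty(n_permutations)
    for b in range(n_permutations):
        perm = rng.permutation(len(pooled))          # relabel the pooled sample
        null_stats[b] = stat_fn(pooled[perm[:n]], pooled[perm[n:]])
    # Threshold = (1 - alpha) quantile of the permutation null distribution.
    threshold = np.quantile(null_stats, 1 - alpha)
    # Monte Carlo p-value with the +1 correction for valid Type I control.
    p_value = (1 + np.sum(null_stats >= observed)) / (1 + n_permutations)
    return observed > threshold, p_value
```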

III. Setting the Density Function $f_\nu$ in Algorithm 3.1:

The density function $f_\nu$ is the density of the measure $\nu$ over the quantile levels. We use the uniform $\nu$ over $[0, 1]$ in our experiments, which corresponds to equal weight for all quantiles ($f_\nu \equiv 1/n$ across the $n$ empirical quantile levels). We will add a note to Algorithm 1 for clarity.
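As an illustration of how $f_\nu$ enters, the sketch below computes a quantile discrepancy along a single fixed direction, with the uniform choice as the default and an optional user-supplied density. The function name, the quantile levels $\alpha_i = (2i-1)/(2n)$, and the single-direction simplification are assumptions for exposition; this does not reproduce Algorithm 3.1.

```python
# Weighted quantile discrepancy along one projection (illustrative sketch).
import numpy as np

def directional_quantile_discrepancy(x_proj, y_proj, f_nu=None, p=2):
    """x_proj, y_proj: 1D projections of the two samples (equal length n)."""
    n = len(x_proj)
    alphas = (2 * np.arange(1, n + 1) - 1) / (2 * n)   # empirical quantile levels
    weights = np.ones(n) / n if f_nu is None else f_nu(alphas)
    weights = weights / weights.sum()                  # normalise the nu weights
    qx, qy = np.sort(x_proj), np.sort(y_proj)          # empirical quantiles
    return np.sum(weights * np.abs(qx - qy) ** p)

# Uniform nu (default) gives weight 1/n per quantile level; a density such as
# f_nu = lambda a: a * (1 - a) would down-weight the extreme quantiles.
```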

IV. Numerical Comparison with MMD based on Other KME Approximations:

As the reviewer notes, there are several efficient KME methods currently available, and no single approach has emerged as definitively superior. The multi-diagonal approximation chosen in this study serves as a common, strong, and interpretable benchmark, supported by both general and test-specific theoretical guarantees ([2]). While a comprehensive comparison of all existing methods falls outside the scope of this paper, we did conduct additional experiments as requested.

In particular, we compared the KQD with the following (at matching cost):

  1. the Mean Embedding (ME) approximation of MMD from [5] (which was identified as the best-performing method in their numerical study) and
  2. the Nyström-MMD method from [3].

The results are presented in https://pdfhost.io/v/qLDRyBQawp_nystrom_vs_ME. ME performs at the level of MMD-multi, while Nyström-MMD has extremely high Type II error, likely due to sensitivity to hyperparameters. This behaviour was consistent: we also observed poor performance of Nyström-MMD in a simple minimum-distance parameter estimation task, illustrated in https://pdfhost.io/v/KhHzg7UPsY_par_est. The results are over 1000 runs and show the high variance of Nyström-MMD.

Although we could not complete the BCD-Fast [4] comparison in time, we note that the method is primarily advantageous for its robustness to outliers, a feature not explicitly addressed in our study. We are happy to complete this experiment for the final paper.

[2] Schrab, A. (2025). Optimal Kernel Hypothesis Testing (PhD thesis, UCL).

[3] Chatalic A, Schreuder N, Rosasco L, et al. Nyström kernel mean embeddings. ICML.

[4] Lerasle M, Szabó Z, Mathieu T, et al. Monk outlier-robust mean embedding estimation by median-of-means. ICML.

[5] Chwialkowski K P, Ramdas A, Sejdinovic D, et al. Fast two-sample testing with analytic representations of probability measures. NeurIPS.

We appreciate the reviewer's feedback. In response, we have provided additional experimental results, clarified the role of $f_\nu$, and explained the motivation behind KQEs and our testing procedure. We would appreciate it if the reviewer considered raising the score.

Reviewer Comment

Thanks for your response. My questions and concerns have been well addressed. I will update my rating.

Author Comment

We are happy to hear this. Thank you for carefully considering our paper and rebuttal, and for taking the time to update your score.

Official Review
Rating: 4

This paper presents an alternative to kernel mean embeddings of distributions through the use of quantiles rather than the mean. More precisely, the embedding of a distribution on a set X is given by the collection of all alpha-quantiles of the pushforwards of the distribution along all directions, each given by a unit vector in an RKHS (defined by a kernel k on X). The authors show a number of properties of the resulting embedding that mirror those of kernel mean embeddings, such as the conditions (much weaker than for KME) under which a kernel is characteristic (i.e. the embedding is injective), and the complexity and approximation error of empirical estimators. They proceed to define a distance between probability distributions using the embeddings, in a similar fashion to MMD for KME. The resulting distances (a proof that they are indeed distances is given) are integrals (or a maximum) over alpha and the unit-norm directions of the RKHS, which is reminiscent of the way Sliced Wasserstein distances are defined. The authors indeed show a connection to this class of distances in a particular case, as well as to Sinkhorn divergences. Finally, they put forward a particular instance of the distance where the unit-norm directions are sampled from a Gaussian measure on the RKHS. The authors then present a set of experiments using the proposed KQD for two-sample hypothesis testing, where its performance and complexity in various cases are tested against MMD estimators. Sensitivity to the input-space dimension and the capacity to distinguish between distributions that are close in moments under a polynomial kernel are tested, and experiments on image datasets are shown.

Questions for the Authors

  1. What are situations where using a polynomial kernel (typical of non-mean-characteristic kernels) is advantageous with respect to a kernel such as the Gaussian or exponential ones? Or, more generally, situations where using non-mean-characteristic kernels is actually useful?

  2. How do the more complex (Gaussian measure and kernel) variants of the proposed metrics compare to the more classical SW distance (uniform measure and linear kernel)? What is the added value in practice of the generalization? Is there an explanation as to why MMD seems superior in the image dataset experiments?

Claims and Evidence

The theoretical claims of the paper are sound, and the proposed measures of discrepancy are interesting in how they generalize KME-derived distances such as MMD and kernelize (and generalize) Sliced Wasserstein distances. The milder condition for a kernel to be characteristic is interesting as well.

As for the experimental claims, the sensitivity to input dimension and the capability to distinguish between Laplace and Gaussian distributions with the same first- and second-order moments under a cubic kernel are well supported by the experiments. The results on the image datasets are slightly less convincing, since it looks like MMD is the superior metric to use in these cases (the centered version of the proposed metrics is used and compares similarly to MMD, but the authors show that this version actually interpolates between MMD and Sliced Wasserstein distances, except that the choice of kernel is more general). This observation is mitigated by the fact that, among the linear or quasi-linear estimators, the KQE ones perform better.

Update after rebuttal

As mentioned in my comment below, I maintain my positive evaluation of the paper and my recommendation of acceptance.

Methods and Evaluation Criteria

The method to benchmark the proposed metrics is sound: two-sample hypothesis tests are one of the key applications of MMD and KME. The authors check that the statistical tests are meaningful by verifying that the tests' significance level is respected. However, it would have been interesting to showcase different applications that are usually considered with MMD (generative modeling, for instance).

Theoretical Claims

I skimmed through a few of the proofs (all in appendices), namely those of Theorem 2 and Connections 1 and 2, which seemed correct to me.

Experimental Design and Analysis

It is not very clear why the Gaussian Kernel Quantile Discrepancy is put forward instead of a simpler version where $\gamma$ is uniform over the sphere (as in SW distances). Is it for computational reasons and the fact that, in the end, only a standard univariate normal is sampled?

Also, it would have been interesting to consider more variants of the distances: for instance, the authors show that they are able to kernelize SW distances, which are widely used as an efficient distance between probability distributions; I am curious to see how this would have compared (possibly generalizing through the kernel) to the Gaussian version put forward here. The theoretical connection is really nice though.

Also, even though under quantile embeddings the conditions for a kernel to be characteristic are very mild (X is sensible as a topological space and the kernel is separating), in practice this doesn’t seem very important since in all the experiments but the second a Gaussian Kernel is used (which is characteristic for both kernel mean and quantile embeddings).

Supplementary Material

See the discussion above for proofs. I also took a look at the introductory material on MMD and Wasserstein distances, as well as the Type I control experiment in Appendix D.

Relation to Prior Literature

I believe the paper is generally well positioned in the literature related to MMD and distances between probability measures, and the proposed distances essentially generalize several of those in meaningful ways. However, I feel like the connection to attempts at generalizing kernel mean embeddings using medians or variance, though mentioned, could have been detailed a bit further. How different are the constructions in those cases (I am not familiar with that literature) ? Would it have made sense to compare the resulting estimators in the experimental part ?

Missing Important References

I can’t think of a key reference that would be missing.

Other Strengths and Weaknesses

I have underlined a number of strengths and weaknesses of the work in my comments above.

Other Comments or Suggestions

It seems that Fig 3 is never referenced in the text. Besides, a description seems necessary, as I am not sure I understood it well. For instance, u1 looks like the identity map in the left panel, but the original and projected measures seem different (the projections are smoother). What am I missing?

There is a missing reference in app C.5.

Author Response

We thank the reviewer for their thoughtful review and for recognizing the soundness and novelty of our contributions, including generalizing KME-derived distances and kernelizing Sliced-Wasserstein distances. We address the key comments below.

I. Additional Applications:

KQD is a general-purpose probability metric, applicable to many problems such as conditional independence testing, causal inference, reinforcement learning, and generative modeling. Among the many possible applications, our focus has been on two-sample hypothesis testing, to benchmark the proposed distances in a controlled and interpretable setting that is commonly used in the literature. In addition, we are working on applying KQD to the task of parameter estimation in simulation-based inference. A toy version of the experimental setup is presented in point IV of our response to u7Px.

II. Why isn't $\gamma$ uniform?

As pointed out in footnote 2 on page 5, the uniform/Lebesgue measure and the standard Gaussian measure cannot be defined on infinite-dimensional spaces ([1]). Anticipating an infinite-dimensional RKHS, we propose a (projected) Gaussian measure, which is a standard approach in the inverse-problem literature (see [1]). We emphasize that when $X=\mathbb{R}^d$, the kernel is linear, and $\gamma$ is uniform over the unit sphere, KQD matches Sliced Wasserstein (Connection 1). More generally, when the kernel is non-linear but induces a finite-dimensional RKHS, the KQD can be connected to the Generalised Sliced Wasserstein distances ([2]). We will include this discussion.

[1] Stuart, A. M. (2010). Inverse problems: a Bayesian perspective. Acta numerica.

[2] Kolouri, S. et al (2019). Generalized Sliced Wasserstein distances. Advances in neural information processing systems, 32.

III. Comparison with SW Distances

We have now extended the power-decay experiment to include SW distances, with directions sampled uniformly or from $(P_n + Q_n)/2$ projected onto the sphere. Results in https://pdfhost.io/v/FsnNBjRnJa_kqd_vs_sw show that KQD significantly outperforms SW, since the Gaussian kernel is more expressive than the linear kernel (Connections 1 and 2: SW is KQD with the linear kernel).

IV. “when is using non mean characteristic kernels useful”

KQDs require weaker conditions to be quantile-characteristic, as shown in the Laplace vs. Gaussian experiment. While mean-characteristic kernels are well understood for bounded translation-invariant kernels on Euclidean domains ([3]), beyond this it is hard to establish whether a kernel is mean-characteristic. For example, many graph kernels are not characteristic ([4]). Deep kernels ([5]), or any transformed kernel $k(T(u), T(u'))$ where $T$ is a non-injective map, offer examples of non-characteristic kernels.

[3] Sriperumbudur, B. K., et al (2011). Universality, Characteristic Kernels and RKHS Embedding of Measures. JMLR.

[4] Kriege, N. M., et al. (2020). A survey on graph kernels. Applied Network Science, 5.

[5] Wilson, A. G., et al. (2016). Deep kernel learning. AISTATS.

V. Figure 2 Clarification:

We apologize for the omission. Figure 2 should be referenced in Section 3.2 after defining $\tau^2$ to illustrate how the integrand varies for different directions. We will amend this and provide a clearer explanation.

VI. Reference Missing in Appendix C.5:

Thank you for pointing this out. We will add reference [6].

[6] Kukush, A. (2020). Gaussian Measures in Hilbert Space. Wiley.

VII. MMD Performance in Image Dataset Experiments:

We will clarify that the goal of the experiment was to demonstrate that KQD outperforms MMD approximations with matching cost, and that when MMD outperforms KQD, the centered KQD offers MMD-level performance. Additionally, as noted in our response to kcy5 (I.(2)), KQD with uniform $\mu$ outperforms MMD on the CIFAR problem, and we are exploring this further.

VIII. Connection to Median and Variance-Based Extensions of KMEs:

We are happy to elaborate. A longer response with a review of these methods is available at https://pdfhost.io/v/t2rYgcDSLS_kernel_quantile_embeddings_2_25

The Kernel Covariance Embedding (KCE) is the 2nd-order moment of $k(X, \cdot)$, while the KME is the 1st-order moment. The KCE exists iff the KME for $k^2$ exists; and $k$ is covariance-characteristic iff $k^2$ is mean-characteristic. The divergence between KCEs can be estimated at $O(n^3)$ cost due to the need for an eigenvalue decomposition, while the KQE comes with a practical Monte Carlo estimator.

The median embedding is the geometric median of $k(x, \cdot)$ in the RKHS, minimizing the $L_1$ criterion (the sum of RKHS distances). While it exists for separable Hilbert spaces, it requires iterative algorithms for estimation and has $O(n^2)$ complexity; a sketch of one such iteration is given below. The median-characteristic property has not been explored, and the relation to 1D-projected quantiles remains unclear. Further research into geometric median embeddings is needed.
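For reference, the sketch below shows one standard way such a geometric median can be computed: a Weiszfeld-type fixed-point iteration written entirely through the Gram matrix, representing the median as $m = \sum_j \beta_j k(x_j, \cdot)$. This is an illustrative assumption of ours, not a method from the paper; each iteration costs $O(n^2)$, consistent with the complexity mentioned above.

```python
# Weiszfeld-type iteration for the geometric median of the feature maps
# k(x_i, .) in the RKHS, parameterised as m = sum_j beta_j k(x_j, .) (sketch).
import numpy as np

def rkhs_geometric_median(K, n_iter=100, eps=1e-9):
    """K: (n, n) Gram matrix K_ij = k(x_i, x_j). Returns coefficients beta."""
    n = K.shape[0]
    beta = np.ones(n) / n                                  # start at the mean embedding
    for _ in range(n_iter):
        # Squared RKHS distances ||m - k(x_i, .)||^2 for all i.
        sq_dist = beta @ K @ beta - 2 * K @ beta + np.diag(K)
        dist = np.sqrt(np.maximum(sq_dist, 0.0)) + eps
        w = 1.0 / dist
        beta = w / w.sum()                                 # Weiszfeld update
    return beta
```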

We believe these revisions address the reviewer’s concerns and improve the clarity of the paper. Thank you for your constructive feedback.

Official Review
Rating: 4

This paper introduces the concept of Kernel Quantile Embeddings (KQEs) in reproducing kernel Hilbert spaces (RKHS) and investigates how these embeddings can be leveraged to define a new family of probability metrics: Kernel Quantile Discrepancies (KQDs). The authors argue that KQEs are a natural analogue of quantiles in function spaces and can capture distributional information beyond the first moment. The paper proves that these embeddings can be estimated at the same rate $O(n^{-1/2})$ as classical kernel mean embeddings. It then proposes specific instances of these discrepancies (e-KQD and sup-KQD), connecting them to well-known probability metrics like sliced Wasserstein and Sinkhorn divergences. The authors also provide an estimator with near-linear $O(n \log^2 n)$ complexity for a version of the e-KQD, and demonstrate empirically, in two-sample hypothesis testing tasks, that KQDs can be competitive with MMD, sometimes surpassing standard MMD or its fast approximations.

Questions for the Authors

Weighting measure: Have you tried, or do you have insights on, using different weighting distributions ν besides the uniform one, e.g., a heavier weighting around certain quantile regions? Could that accelerate the test in practice or help with tail detection?

Conditional embeddings: You mention the possibility of extending to "conditional quantile embeddings." Do you see a straightforward approach, or are there structural challenges that differ from conditional mean embeddings?

Potential domain constraints: Could there be issues if X is not connected or if the kernel is unbounded? You mention bounded k, but might large or unbounded kernels degrade the performance or the theory?

Claims and Evidence

Claim (1): KQEs offer a strictly weaker requirement on the kernel to preserve injectivity (quantile-characteristic) than classical mean-characteristic kernels. Evidence: the authors give proofs (Theorems 1–2) showing that every mean-characteristic kernel is quantile-characteristic, but not vice versa.

Claim (2): KQDs defined from KQEs form valid probability metrics under weaker conditions. Evidence: Theorem 4 establishes that e-KQD and sup-KQD satisfy the properties of a metric, assuming the kernel has "full support" in the specified sense.

Claim (3): An estimator for e-KQD can be computed in $O(n \log^2 n)$ time and achieves $O(n^{-1/2})$ convergence in the two-sample setting. Evidence: the authors outline a Monte Carlo sampling procedure (using Gaussian measures in the RKHS) and show a theoretical sample complexity bound (Theorem 5).

Overall, these claims are supported by rigorous proofs (for theoretical statements) and empirical experiments (for computational cost and testing power).

Methods and Evaluation Criteria

The method centers on constructing quantiles in RKHS by projecting the kernel feature maps onto directions, then computing one-dimensional quantiles. Evaluation criteria revolve around: (1) the rate at which KQD-based two-sample tests converge (statistical consistency), (2) the computational complexity of the KQD estimators, and (3) the test power in detecting distributional differences for multiple synthetic and real benchmarks. These criteria align well with standard practice in two-sample testing and kernel-based distribution comparison.

Theoretical Claims

The “quantile-characteristic” property (Theorems 1 and 2) generalizes the common “mean-characteristic” property in kernel embeddings. The curvature arguments and integral transform approach in the proofs are consistent with known results on characteristic functionals in topological vector spaces. The authors’ expansions connecting KQDs to Sliced/Max-Sliced Wasserstein and Sinkhorn-like divergences (Connections 1–3) appear algebraically consistent. I did not find obvious flaws in the proofs from a correctness standpoint; the steps follow standard theorems about characteristic functionals in Hilbert spaces.

Experimental Design and Analysis

The authors run two-sample tests on both synthetic (Gaussian vs. Gaussian with different variances, Laplace vs. Gaussian, etc.) and high-dimensional image data (Galaxy MNIST and CIFAR10 vs CIFAR10.1). They use standard metrics: rejection rate under different sample sizes, and also check Type I error. The analyses compare KQD-based tests against multiple MMD-based tests (including linear, multi, and full). Overall, the experimental design is sensible for measuring the performance of a novel test statistic in a fairly standard manner.

Supplementary Material

The authors mention that the proofs for the main theorems are in the Appendix (Sections C.1–C.5), as well as additional experiments on Type I error. The main text references these properly, and from what was described, the supplementary clarifies the technical lemmas for sampling from Gaussian measures on RKHS, etc.

Relation to Prior Literature

The paper extends the kernel embedding framework (Smola et al., 2007; Gretton et al., 2012; Muandet et al., 2016). It connects directly to quantile-based distance measures in classical statistics (Kosorok, 1999; Ranger et al., 2020) and to sliced Wasserstein approaches (Bonneel et al., 2015). By bridging the concept of “quantiles” in infinite-dimensional Hilbert spaces, the paper also resonates with prior works that exploit directional statistics (Cramér–Wold theorem). The references seem adequate and highlight the synergy between kernel methods, quantile-based methods, and integral probability metrics.

Missing Important References

The coverage is mostly thorough.

Other Strengths and Weaknesses

Strengths:

Substantial theoretical contribution that expands kernel embeddings beyond means. Empirical demonstration that the proposed distances are competitive with MMD. The proposed near-linear Monte Carlo estimator is especially appealing for large-scale data.

Weaknesses:

The weighting measure ν and reference measure ξ are not deeply explored in terms of best practical choices. The authors choose them fairly generically (often uniform, or half from Pn and half from Qn). Additional experiments on different weightings might help. The paper mentions potential broader applications (like conditional independence) but does not provide concrete evidence or experiments in such directions.

Other Comments or Suggestions

Minor writing comment: Some early paragraphs heavily reference “Cramér–Wold in RKHS,” so it might help to add a more elementary explanation or an example in the main text. Also, if feasible, an experiment on centered e-KQD for a mixture distribution scenario might illustrate the effect of the MMD “shift.”

Author Response

We thank the reviewer for their positive feedback on our work. We appreciate the recognition of our theoretical contributions, the rigor of our proofs, and the relevance of our evaluation criteria. We are also glad the reviewer found our near-linear Monte Carlo estimator promising for large-scale data. Below, we address specific comments and report the additional experiments suggested.

I. Further exploring the choices of the weighting measure $\nu$, the measure on the sphere $\xi$, and the reference measure $\mu$:

Thank you for this insightful suggestion.

  1. $\nu$: We conducted experiments on the Galaxy MNIST and CIFAR datasets, varying $\nu$ from up-weighting extreme quantiles to down-weighting them. Results are in https://pdfhost.io/v/jeaYtK5X2Y_varying_nu , where the triangle "/\" indicates up-weighting and the reverse triangle "\/" indicates down-weighting. For Galaxy MNIST, down-weighting extremes improved test power, whereas for CIFAR, up-weighting extremes worked better. Uniform weighting of the quantiles remained a good choice. This suggests that tuning $\nu$ beyond the uniform is problem-dependent and can enhance performance. The difference likely arises from the nature of the problems: the CIFAR datasets, where samples are expected to be similar, benefit from emphasising extremes, while Galaxy MNIST, which has fundamentally different galaxy images, performs better when "robustified," i.e., focusing on differences away from the tails. Exploring this further presents an exciting avenue for future work.
  2. $\mu$: The reference measure in the covariance operator serves to "cover the input space" and is typically set to a "default" measure (for $\mathbb{R}^d$, typically the standard Gaussian). We considered $(P_n + Q_n)/2$ to stick to the most general setting, where no "default" is available, only $P_n$ and $Q_n$. We now compare this choice against (i) a standard Gaussian scaled by IQR/1.349, where IQR is the interquartile range of $(P_n+Q_n)/2$ and 1.349 is the interquartile range of $\mathcal{N}(0, 1)$; and (ii) a uniform measure on $[-1,1]^d$, scaled by IQR (a small sampler sketch of these choices is given after this list). Results in https://pdfhost.io/v/wtc2mgbFGU_varying_mu show performance superior to MMD for the standard Gaussian/uniform $\mu$. This is a valuable finding, and we will perform further analysis.
  3. $\xi$: Varying $\xi$ on the sphere in infinite-dimensional spaces is extremely challenging, due to the complexity of both the theoretical definition and practical sampling (no uniform or standard Gaussian measure can be defined on an infinite-dimensional space). To the best of our knowledge, no practical alternative has been proposed.
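As referenced in point 2 above, here is a minimal sketch of the three reference-measure choices as samplers on $\mathbb{R}^d$; the per-dimension IQR and the function signature are our assumptions for illustration, not the paper's exact implementation.

```python
# Samplers for the reference measure mu: pooled empirical measure, IQR-scaled
# standard Gaussian, and IQR-scaled uniform on [-1, 1]^d (illustrative sketch).
import numpy as np

def reference_measure_sampler(X, Y, kind="pooled", seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([X, Y], axis=0)
    d = pooled.shape[1]
    # Per-dimension interquartile range of the pooled sample (P_n + Q_n)/2.
    iqr = np.quantile(pooled, 0.75, axis=0) - np.quantile(pooled, 0.25, axis=0)
    if kind == "pooled":                  # sample from (P_n + Q_n)/2 itself
        return lambda m: pooled[rng.integers(len(pooled), size=m)]
    if kind == "gaussian":                # N(0, 1) scaled by IQR / 1.349
        return lambda m: rng.standard_normal((m, d)) * (iqr / 1.349)
    if kind == "uniform":                 # uniform on [-1, 1]^d scaled by IQR
        return lambda m: rng.uniform(-1.0, 1.0, size=(m, d)) * iqr
    raise ValueError(f"unknown kind: {kind}")
```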

II. Conditional Quantile Embeddings and Independence Testing as Future Work:

The population-level embedding of $P(Y|X=x)$ would be defined in the same way as a quantile embedding: for every $x$, we have a standard quantile embedding. Estimating conditional quantiles requires learning a mapping from $(x, u, \alpha)$ to $\rho^\alpha_{u\#P}$, a complex task beyond the scope of this paper; however, given the likely smoothness of this mapping in $(x, u, \alpha)$, it is reasonable to expect that a practical method can be developed.

Independence testing can be framed as a two-sample testing problem (e.g., [1, 2]). However, a thorough investigation, comparable to that in [1, 2], is necessary to develop this approach rigorously. We leave this exploration for future research.

[1] Gretton et al., 2008. A kernel statistical test of independence. NeurIPS.

[2] Doran et al., 2014. A permutation-based kernel conditional independence test. UAI.

III. Clarifying “Cramér–Wold in RKHS” in Early Paragraphs

We will add a clarification.

IV. Experiment on Centered e-KQD for Mixture Distributions

We thank the reviewer for this suggestion. However, we would like to emphasise that moments on the input space map highly non-linearly to the RKHS. As a result, shifts in moments on the input space translate non-trivially to the outcome of the experiment, breaking the interpretability of results. We will instead focus on constructing an example where moments in the kernel feature space (instead of the input space) vary. We will aim to identify such examples before the camera-ready version.

V. "Could there be issues if X is not connected or if k is unbounded?":

Boundedness: we highlight that boundedness is not required to define KQD as a probability metric. This contrasts with MMD, which requires the kernel to be bounded in order to be defined over all probability measures—due to the need to take an integral to obtain the mean (see “Why bounded kernels?” in www.jmlr.org/papers/volume24/21-0599/21-0599.pdf).
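For context, the standard sufficient condition for the kernel mean embedding to exist (a known fact, stated here only for reference) can be written as

\[
  \mu_P \;=\; \int_{\mathcal{X}} k(x,\cdot)\,\mathrm{d}P(x) \;\in\; \mathcal{H}_k
  \qquad \text{exists whenever} \qquad
  \mathbb{E}_{X\sim P}\!\left[\sqrt{k(X,X)}\right] < \infty ,
\]

which holds for all $P$ when $k$ is bounded; by contrast, the $\alpha$-quantile of the real-valued projection $\langle k(X,\cdot), u\rangle_{\mathcal{H}_k}$ is defined for any probability measure and any unit direction $u$, so no such integrability condition is needed.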

Connectivity: KQD remains a valid probability metric for completely regular Hausdorff spaces, including disconnected spaces like totally ordered sets.

We thank the reviewer again for their insightful feedback. We have addressed your concerns by providing additional experimental results and elaborating on future work on conditional independence testing and conditional embeddings, and would therefore appreciate it if the reviewer considered raising their score.

Reviewer Comment

Thanks for the detailed explanation and for the additional results. I confirm the authors addressed my concerns and questions, and I would like to raise my score.

Author Comment

We thank the reviewer for their continued engagement with our work, valuable suggestions, and taking the time to increase their score.

Official Review
Rating: 3

The paper proposes an alternative to kernel mean embeddings where they embed the quantile functions instead of the mean, each quantile function being computed in the direction of a unit-norm function in the RKHS. They show that such an embedding is injective under milder conditions on the kernel than the known characteristic property for the mean-embedding counterpart. The authors propose a new family of discrepancies based on these quantile embeddings, and exhibit conditions under which these are metrics on the space of Radon probability measures. As the computation of these discrepancies involves an expectation over the unit ball of the RKHS, the authors develop tools based on Gaussian measures to compute it efficiently. Numerical experiments highlight the benefits of the approach.

Update after rebuttal

The rebuttal confirms the problem in Theorem 5. The authors then failed to answer subsequent questions about the implications of the correct bound and the choice of the hyperparameters.

Questions for the Authors

What was the motivation behind staying in the small-data regime ($n \leq 2000$)? Do you know how the estimator behaves beyond that?

Claims and Evidence

Yes, most of the claims are supported by clear and convincing evidence, except Theorem 5 whose proof cannot be located in the appendix.

Methods and Evaluation Criteria

Yes, they make sense.

Theoretical Claims

The claims seem sound, even though I only skimmed through the proofs.

I just have a problem with Theorem 5. It is the only result not proved in the appendix. I would have expected the bound to scale as $\mathcal{O}(n^{-1/2} + l^{-1/2})$, and I am surprised by the multiplicativity of the bound: $\mathcal{O}(n^{-1/2} \cdot l^{-1/2})$ means that we can keep only one direction $u$ in the RKHS and still get consistency as long as $n \to \infty$. Or am I wrong here?

Experimental Design and Analysis

I found no issue with the experimental designs.

Supplementary Material

I skimmed through it, but not in detail.

Relation to Prior Literature

The idea of using quantile embeddings instead of mean embeddings is of great originality.

Missing Important References

None

Other Strengths and Weaknesses

+: The paper is clear and well written. The topic is of interest to the ICML community.

-: The limitations of the KQD are not discussed enough. It remains to be shown that picking $l$ and $m$ logarithmic w.r.t. $n$ is a good choice in terms of approximation error. Yes, you lose the quadratic dependency, but you add a layer of difficulty in that you need to compute the quantities along several directions. There should be a trade-off to find.

Other Comments or Suggestions

More of a suggestion: maybe there is something to investigate about how different kernels may allow easier approximation of the e-KQD (i.e. a smaller $l$) when the associated RKHS is small (with the sum of the eigenvalues of the integral operator as a suitable measure?).

Author Response

We thank the reviewer for their thoughtful comments and positive evaluation of our work. We appreciate the recognition of the originality of using quantile embeddings, the clarity of our presentation, and the relevance of our contribution to the ICML community. We have carefully addressed all your comments below.

I. Proof of Theorem 5, and a discussion of trade-offs:

Thank you for pointing this out. We apologize for the omission, which was an oversight at the time of submission. The proof, which we will include in the revised version, uses the triangle inequality to decompose the error into two terms: (1) the error due to approximating the expectation over directions, which is $\mathcal{O}(l^{-1/2})$ by Hoeffding's inequality (valid since the kernel is bounded), and (2) the error from quantile approximation, which is $\mathcal{O}(n^{-1/2})$ by Theorem 3. Therefore, the correct bound is indeed $\mathcal{O}(n^{-1/2}) + \mathcal{O}(l^{-1/2})$, and we will update this in the revision.
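Schematically, with $\widehat{D}_{n,l}$ denoting the estimator based on $n$ samples and $l$ sampled directions, $\widehat{D}_{n,\infty}$ its exact-expectation counterpart, and $D$ the population e-KQD (notation assumed here only for illustration), the decomposition described above reads:

\[
  \bigl|\widehat{D}_{n,l} - D\bigr|
  \;\le\;
  \underbrace{\bigl|\widehat{D}_{n,l} - \widehat{D}_{n,\infty}\bigr|}_{\mathcal{O}_p(l^{-1/2})\ \text{(Hoeffding, bounded kernel)}}
  \;+\;
  \underbrace{\bigl|\widehat{D}_{n,\infty} - D\bigr|}_{\mathcal{O}_p(n^{-1/2})\ \text{(Theorem 3)}}
  \;=\;
  \mathcal{O}_p\!\bigl(n^{-1/2} + l^{-1/2}\bigr).
\]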

II. Choice of Kernels for e-KQD Approximation:

We appreciate this insightful suggestion; this direction definitely has potential! Of course, it involves a thorough theoretical study that is beyond the scope of this paper, since it requires extending RKHS theory aimed at KMEs to the non-linear case of KQEs. This is something we are planning to look into in detail in a follow-up study to this paper.

III. Sample size in experiments:

We note that the Laplace vs. Gaussian experiment goes up to 10,000 data points, and the sample sizes in each experiment align with prior work in similar settings. However, our proposed near-linear Monte Carlo estimator means the method is suitable for large datasets, both in higher-dimensional settings (note d = 12288 for Galaxy MNIST) and in larger-sample settings (see the scaling in Figure 4); we will include a note to that effect in the revised version.

Final Decision

The paper presents a generalization of mean embeddings called kernel quantile embeddings and introduces a distance in the space of probability distributions called the kernel quantile distance. Some interesting results are presented about its connection to the sliced Wasserstein distance, max-sliced Wasserstein distance, and Sinkhorn distance, along with the metric property and sample complexity. I concur with the reviewers that the results are interesting and novel; however, some aspects require more discussion. Given its novelty, the paper will generate some discussion and promote further investigation. Therefore, following the reviewer comments, I recommend accepting the paper.

I strongly urge the authors to incorporate all the discussion provided in the rebuttal in the final version of the paper. Among many, particular things to fix would be: the $l^{-1/2} + n^{-1/2}$ bound, and therefore that $l$ has to be chosen to be of order $n$; a discussion of the choice of $m$, both in terms of computational and statistical behavior; and the additional experiments requested by the reviewers and promised by the authors.