Statistical and Geometrical properties of the Kernel Kullback-Leibler divergence
In this paper, we study the statistical and geometric properties of the Kullback-Leibler divergence with kernel covariance operators (KKL) introduced by [Bach, 2022, Information Theory with Kernel Methods].
Abstract
Reviews and Discussion
This paper proposes regularized kernel KL divergences. Thanks to the kernelization, the authors derive a closed-form formula that makes the divergence computationally cheap to optimize using gradient methods. Theoretically, a new finite-sample estimation convergence bound is derived. Numerical simulations suggest the proposed divergences are not only easy to implement, but also suitable for sampling complicated distributions.
Strengths
- The paper is well-written and easy to follow.
- A finite-sample approximation bound is derived.
- Numerical simulations demonstrate good performance compared to other related divergences.
Weaknesses
- Proposition 1 still assumes that p is absolutely continuous with respect to q.
- The bound in Proposition 3 becomes trivial as α approaches 1.
Questions
- Typo at line 301: ''For small values of the regularization parameter'' should read ''For large values of the regularization parameter''.
- Can the authors comment on how the performance of KKL compares with that of the skewed KL divergence, which is kernel-free and not restricted by a kernel, even though the latter can be computationally more expensive?
- You mention your algorithm is robust to the choice of α. But I was wondering about the case when α is close to 1: as suggested in Figure 1, the value of the divergence is then always small, so it will be hard to distinguish between different distributions. Can you comment on this?
- Why does Figure 2(a) begin with different particles across different methods at T = 0?
Limitations
Yes.
We thank the reviewer for carefully reading the paper and for the relevant suggestions. We have addressed each of your comments below. Please don't hesitate to let us know if you have any further questions.
- Weakness: "Proposition 1 still assumes is absolutely continuous with respect to ."
Reply: Thank you for spotting the typo, please see the general comment.
- Weakness : "The bound in Proposition 3 becomes trivial since it approaches as ."
*Reply: This is true that the bound becomes trivial as goes to 1. However, our objective in calculating this upper bound was to prove that when tends to 0, the regularized KKL converges to the true KKL. We are therefore mainly interested in this bound for small values of alpha.
- Question: "Can the authors comment on how the performance of KKL compares with that of skewed KL divergence which is kernel-free and not restricted by a kernel, though the later one can be computationally more expensive?"
Reply: The skewed KL is well defined even if p is not absolutely continuous with respect to q, but it seems to us that if p and q are atomic measures and the support of p moves (for instance in the gradient flow setting as in our paper), then one would need to use kernel-based density estimates at each iteration. This could be an interesting baseline to compare with the regularized KKL. We originally did not include it because, from [1], it is known that the KKL is a better approximation of the KL than a KL between smoothed estimates of p and q for a specific smoothing kernel; see Equation (8) in [1], which relates the KKL to the KL between smoothed versions of p and q. However, this is a simple baseline to implement that we will include in our experiments in the revised manuscript.
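For concreteness, here is a minimal sketch of the kind of baseline we have in mind (the Gaussian KDE, the bandwidth h, and the helper names below are illustrative choices, not the final implementation we will report):

```python
import numpy as np

def gaussian_kde(points, queries, h):
    # Gaussian kernel density estimate built on `points`, evaluated at `queries`.
    d = points.shape[1]
    sq = np.sum((queries[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * h**2)).mean(axis=1) / (2 * np.pi * h**2) ** (d / 2)

def skewed_kl_kde(x, y, alpha=0.1, h=0.5):
    # Plug-in estimate of KL(p_h || alpha * p_h + (1 - alpha) * q_h), where p_h and q_h
    # are KDE smoothings of the samples x ~ p and y ~ q, averaged over the samples of p.
    p_at_x = gaussian_kde(x, x, h)
    q_at_x = gaussian_kde(y, x, h)
    return float(np.mean(np.log(p_at_x) - np.log(alpha * p_at_x + (1 - alpha) * q_at_x)))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))   # samples from p
y = rng.normal(2.0, 1.0, size=(200, 2))   # samples from q
print(skewed_kl_kde(x, y, alpha=0.1))
```

As noted above, such a baseline requires re-estimating the densities at every iteration once the particles move, which is what makes the closed-form regularized KKL more convenient in the flow setting.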
- Question: "You mention your algorithm is robust to the choice of . But I was wondering when is close to 1, as suggested in Figure 1 even for , the value of the divergence is always small, so it will be hard to distinguish between different distributions. Can you comment on this?"
Reply: We agree with the reviewer, and this is a problem that would also appear with the regular KL; see Section A.2, which recalls the monotonicity of the skewed KL with respect to the skewness parameter α. One of our contributions is to show that the skewed KKL satisfies the same property (see Proposition 2). This is the reason why we tend to use small values of α in the rest of the experiments; see for instance Appendix C describing the hyperparameters for the different experiments, where α is typically small. For larger values of α, an alternative idea would be to consider a "Jensen-Shannon" (JS) version of the regularized KKL. Indeed, α = 0.5 in the skewed (standard) KL can be used to design a JS divergence (see Remark 1), where the second term balances this phenomenon. We think this is an interesting idea, and we will include illustrative experiments with a Jensen-Shannon version of our KKL divergence for the same experimental settings considered in our paper.
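Concretely, one natural such construction (a sketch, written with the convention KKL_α(p|q) = KKL(p | αp + (1−α)q)) mirrors the classical Jensen-Shannon divergence:

```latex
% A possible Jensen--Shannon-type construction from the skewed KKL (sketch):
\[
\mathrm{JS\text{-}KKL}(p,q)
  := \tfrac{1}{2}\,\mathrm{KKL}_{1/2}(p\,|\,q) + \tfrac{1}{2}\,\mathrm{KKL}_{1/2}(q\,|\,p)
  = \tfrac{1}{2}\,\mathrm{KKL}\!\Big(p \,\Big|\, \tfrac{p+q}{2}\Big)
  + \tfrac{1}{2}\,\mathrm{KKL}\!\Big(q \,\Big|\, \tfrac{p+q}{2}\Big).
\]
```

The second term keeps the divergence informative when the first one becomes small, exactly as in the classical JS divergence ½ KL(p | (p+q)/2) + ½ KL(q | (p+q)/2).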
- Question : "Why does Figure 2(a) begin with different particles across different methods at ?"
Reply: You indeed spotted a mistake: for KKL, we plotted the particles at T = 1 instead of T = 0, which is why the initialisation looks different. We have corrected this mistake since the submitted version; you can see the corrected figure in the pdf attached to the general comment.
I thank the authors for the detailed response and I will keep my score.
The authors generalize and analyze the kernel Kullback-Leibler divergence introduced by Francis Bach a few years ago. The main contributions are: 1) using a skewed divergence (as in JS) instead of the KL to handle measures that are not absolutely continuous, 2) providing an explicit formula of the divergence for discrete measures, which allows a direct implementation on data, 3) providing refined finite-sample guarantees. The divergence is then used to build a Wasserstein gradient flow and for generative modeling. Some low-dimensional numerical examples are provided.
Strengths
The work extends the original work by F. Bach in several aspects. In particular, the explicit formula of the divergence for discrete measures is a nice and very useful result which will come in handy in applications. The paper is very clearly and nicely written, and the proofs looked sound to the reviewer.
Weaknesses
The reviewer finds the applications somewhat underwhelming. The gradient flows built via kernel methods do not seem to perform particularly well, perhaps due to the well-known lack of expressivity. One might have hoped that the covariance embeddings in this new divergence would help compared to MMD-based gradient flows, but the authors' experiments do not make that case. Kernel methods have the advantage of providing an explicit formula for the velocity vector field, and one would hope that this allows considering high-dimensional problems, but no such experiments have been provided, so it is not clear whether the proposed model scales up.
Questions
- Is there a reasonable expectation that these flows scale up to high dimensions? If not, what is the obstacle?
- It seems to the reviewer that Wasserstein gradient flows based on kernel methods do not perform well compared to gradient flows using neural network architectures. This occurs in the case of SVGD but also for plain Wasserstein gradient flows. For example, the gradient flows in Gu et al. (https://arxiv.org/abs/2210.17230), based on Lipschitz regularization of the KL divergence, are comparable in spirit to the MMD or kernel-KL flows considered here. Is the reviewer mistaken? Can the authors provide some insight on these issues?
Limitations
The limitations of the work are properly addressed.
Thank you for your feedback and time. We have addressed each of your comments below. If you have any further questions, please don't hesitate to let us know.
- Weakness: " The gradient flows built via kernel methods do not seem to perform particularly well, perhaps due to the well-known lack of expressivity. One might have hope that the covariance embeddings in these new divergence would have helped compared to MMD based gradient flow but the authors experiments do not make that case."
Reply: We respectfully but strongly disagree. We see (visually) in Figures 2.a and 2.b on page 9 that the KKL flow converges to a better particle configuration than the MMD one, even after carefully tuning all the hyperparameters, including for MMD (which is notoriously hard to optimize, see [1]). This is also seen when reporting the convergence with respect to different metrics in Appendix C (see the paragraph "3 rings" starting at l.683 and Figure 10). We do not understand why the reviewer claims the converse. Moreover, it has also been shown in the paper cited above [1] that the MMD gradient flow alone does not converge well even in simple experiments, e.g. between Gaussians as in Figure 2 of [1]. The KKL flow also works better than KALE when the L-BFGS method is used, a method which cannot be used for the KALE flow since KALE does not have a closed-form expression for its loss and gradient (which are requirements for L-BFGS). If the reviewer disagrees or has questions about the experiments, please let us know so that we can clarify.
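To make the L-BFGS point concrete, here is a minimal sketch (not our actual implementation) where particle positions are optimized with L-BFGS on a simple closed-form kernel objective (squared MMD with a Gaussian kernel) standing in for the KKL loss; the optimizer requires a routine returning both the loss value and its gradient, which KALE does not provide in closed form:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between the rows of x and y.
    sq = np.sum(x**2, 1)[:, None] + np.sum(y**2, 1)[None, :] - 2 * x @ y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_and_grad(flat_x, y, n, d, sigma=1.0):
    # Closed-form loss (squared MMD) and its gradient w.r.t. the particle positions x.
    x = flat_x.reshape(n, d)
    m = y.shape[0]
    kxx, kxy = gaussian_gram(x, x, sigma), gaussian_gram(x, y, sigma)
    loss = kxx.mean() - 2 * kxy.mean() + gaussian_gram(y, y, sigma).mean()
    grad = np.zeros_like(x)
    for i in range(n):
        grad[i] = (-2 / (sigma**2 * n**2)) * (kxx[i, :, None] * (x[i] - x)).sum(0) \
                  + (2 / (sigma**2 * n * m)) * (kxy[i, :, None] * (x[i] - y)).sum(0)
    return loss, grad.ravel()

rng = np.random.default_rng(0)
target = rng.normal(2.0, 0.5, size=(100, 2))   # samples from the target distribution
x0 = rng.normal(0.0, 0.5, size=(50, 2))        # initial particles
res = minimize(mmd2_and_grad, x0.ravel(), args=(target, 50, 2),
               jac=True, method="L-BFGS-B")
particles = res.x.reshape(50, 2)               # optimized particle positions
```

With `jac=True`, scipy's L-BFGS expects the objective to return the pair (loss, gradient) exactly as above; this is the requirement that the closed-form KKL satisfies and KALE does not.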
- Question "Is there a reasonable expectation that their flows to scale up to high dimension? Or what is the obstacle?"
Reply: Notice that in Appendix C we conduct experiments on mixtures of Gaussians up to dimension 10 (see the paragraph starting at l.669 and the figures at the top of p.26). We acknowledge that we did not try to use KKL in higher dimensions, such as that of images, or in a more challenging generative modeling setting. A first step would be to reduce the computational complexity of KKL (which, as discussed at l.251, is cubic in the number of samples, multiplied by the kernel computation cost) in order to handle high-dimensional and large datasets. We leave this study for future work.
- Question "It seems to the reviewer that Wasserstein gradient flow based on kernel methods do not seem to perform well compared to gradient flows using neural network architecture. This occurs in the case of SVGD but also for plain Wasserstein gradient flows. For example the gradient flows in [2] based on Lipschitz regularization of KL-divergence are comparable in spirit to the MMD of Kernel-KL flows considered here. Is the reviewer mistaken? Can the authors provide some insights on these issues."
Reply: There might be a slight confusion here, but please let us know if we misunderstood. There exist two types of kernel-based approximations of the Kullback-Leibler divergence that were proposed in the literature: (a) the one we propose here, based on kernel covariance embeddings, and (b) the "variational" one (see Equation (12) in our paper).
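Schematically, and omitting normalization constants (the display below is a sketch rather than the exact expressions of the paper, with Σ_p the kernel covariance operator and F a generic function class):

```latex
% (a) covariance-operator based (Bach, 2022), up to normalization terms:
\[
\mathrm{KKL}(p\,|\,q) = \operatorname{tr}\!\big[\Sigma_p(\log\Sigma_p - \log\Sigma_q)\big],
\qquad \Sigma_p = \mathbb{E}_{x\sim p}\big[\varphi(x)\otimes\varphi(x)\big].
\]
% (b) variational: restrict a dual representation of the KL (e.g. Donsker--Varadhan)
% to a function class F (an RKHS-based class for KALE, neural networks in [2]):
\[
\mathrm{KL}(p\,\|\,q) = \sup_{f}\ \mathbb{E}_p[f] - \log\mathbb{E}_q\big[e^{f}\big]
\;\;\longrightarrow\;\;
\sup_{f\in\mathcal{F}}\ \mathbb{E}_p[f] - \log\mathbb{E}_q\big[e^{f}\big].
\]
```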
The paper you are referring to is analogous to the second type, the difference being that the variational family there is defined by neural networks. Regarding the performance of this divergence compared to ours, there are two ingredients at play: (1) the type of approximation, and (2) the parametric family.
Regarding (1), it is not clear whether our kernel-based approximation of the standard KL (i.e. type (a)) is "better" than a variational approximation such as KALE (i.e. type (b)).
Our work, through numerical experiments, is a first attempt at comparing these different techniques, by comparing KKL with KALE.
Then, the choice of parametric family is indeed crucial. In the reference mentioned by the reviewer, the variational family is approximated through neural networks instead of a Gaussian RKHS as in KALE. As explained in Section 1 therein, when the variational family is an RKHS ball, these approximated divergences interpolate between MMD and the KL divergence, while if it is the class of 1-Lipschitz functions, they interpolate between Wasserstein-1 (which has much better topological and geometrical properties than MMD) and the KL. In [2], the space of 1-Lipschitz functions is approximated by a space of neural networks.
A possible (and fair) comparison in our framework (i.e. type (a) using the KKL) may be to learn a kernel feature map with neural networks before applying the KKL formula, instead of directly considering Gaussian kernels as in our paper. We think this is an interesting but deep question that requires further investigation.
[1] Arbel, Korba, Salim, Gretton. Maximum Mean Discrepancy Gradient Flow, NeurIPS, 2019.
[2] Gu, Birmpa, Pantazis, Rey-Bellet, Katsoulakis. Lipschitz-Regularized Gradient Flows and Generative Particle Algorithms for High-Dimensional Scarce Data, 2023.
Thank you for the detailed answer and I will keep my score.
Comparing probability measures is a fundamental task in many statistical and machine learning problems. One of the metrics that has been ubiquitously used is the Kullback-Leibler (KL) divergence, which contrasts the information contained in two probability distributions. Statistical learning with the KL divergence involves density ratios and requires that the supports of the probability measures in question not be disjoint.
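For reference, this requirement can be read off the definition of the KL divergence:

```latex
\[
\mathrm{KL}(p\,\|\,q) = \int \log\!\Big(\tfrac{\mathrm{d}p}{\mathrm{d}q}\Big)\,\mathrm{d}p,
\qquad \text{finite only if } p \ll q,
\ \text{hence } \operatorname{supp}(p) \subseteq \operatorname{supp}(q).
\]
```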
A kernel variant of the KL divergence, called the kernel KL (KKL) divergence, was proposed in Bach (2022, IEEE Trans. Inf. Theory). The KKL compares probability measures through covariance operators in an RKHS. Like the standard KL, the KKL is unable to compare measures with disjoint supports. This paper proposes a regularized variant of the KKL that makes the divergence well defined for all distributions. The authors derive bounds that quantify the deviation of the regularized KKL from the original KKL. They further propose a closed form for the KKL between empirical measures. Numerical experiments are conducted on simulated data to corroborate the theoretical results.
Strengths
- Proposing a regularized version of the standard KKL that allows computing the divergence for all distributions.
- Presenting statistical properties of the regularized KKL.
- Providing a closed form for the KKL between empirical measures.
- Plugging the divergence into Wasserstein gradient flow learning.
Weaknesses
Weakness and Questions
- The decreasing property of the regularized KKL in Proposition 2 needs the condition that p is absolutely continuous with respect to q. However, in the definition of the regularized KKL (L144-L145), there is no need for this absolute continuity. My question: is this property still valid without it? The same remark applies to the result in Proposition 4.
- In Proposition 4, the definition of the feature map φ needs to be recalled.
- In Proposition 4, could you please motivate the condition that the Radon-Nikodym derivative is bounded above by a constant?
- To facilitate the reading of the bound given in Equation (5), I think it would be better to write its order.
- In Proposition 5, I didn't understand the definition of the Gram kernel matrices. Do these operators denote covariance operators or other, more general kernels?
Questions
Minor Typos
- L108: the same notation is used for denoting both a probability measure and the dimension.
- L368: there is no caption for the figure, and for (b) Shape transfer, the iteration shown is the same.
Limitations
NA
Thank you for your careful reading and relevant suggestions. We have addressed your comments point-by-point below. Please don't hesitate to let us know if you have any further questions.
- Question: "The decreasing property of in Proposition 2 needs the condition that is absolutely continuous with respect to . However, in the definition of (L144-L145) there is no need for this absolutely continuous. My question does this property still valid without ? The same remark for the result in Proposition 4."
Reply: See the first point of the general comment for the reply concerning Proposition 2.
Concerning Proposition 4, this is an interesting and important point that you bring up. We chose to make the same assumption as in Proposition 3, which shows the convergence of the regularized KKL to the true KKL, in the case where p is absolutely continuous with respect to q, as the regularization parameter α tends to 0. Under this common assumption, we can thus use Propositions 3 and 4 simultaneously to control the deviation of the regularized KKL between empirical distributions from the unregularized KKL between p and q. Additionally, under this assumption, the bound in Proposition 4 scales more favorably with respect to the regularization parameter α (see below).
The reviewer is right that this is not the minimal assumption one could work with to obtain the finite-sample approximation guarantees we derive in Proposition 4. Indeed, it is possible to derive a similar rate of convergence without the assumption of absolute continuity, i.e. a bound that scales similarly with the number of samples, by slightly modifying the proof (see details below). However, in this case, the bound would scale less favorably with respect to the regularization parameter α. We chose to present only the first bound in the paper as it scales better when α is small, but we realize, thanks to the comment of the reviewer, that this is an important point, and we will provide the alternative bound as well.
We provide in this paragraph more details about why the scaling with respect to α changes depending on the hypothesis. In the proof, we repeatedly upper bound different terms by a quantity depending on the constant bounding the Radon-Nikodym derivative, taking advantage of the assumption that this derivative is bounded. We also regularly have to deal with terms in which the (regularized) inverse of the covariance operator of the mixture αp + (1−α)q appears. Under the absolute continuity assumption, an operator inequality lets us control this operator in a way that ultimately leaves a single α-dependent factor in the denominator of our bound, i.e. the more favorable scaling mentioned above. Alternatively, it is possible to bound this operator by a simpler quantity that does not require the absolute continuity of p with respect to q. However, this ultimately results in an additional factor in our bound compared to the previous case, of order 1/α, which is less favorable when the regularization parameter is small.
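Schematically, with Σ_p and Σ_q the covariance operators of p and q, Σ_{r_α} that of the mixture r_α = αp + (1−α)q, and λ > 0 a regularization level (a sketch; the exact operators appearing in our proof differ slightly), the bound that does not use absolute continuity relies on:

```latex
% Without absolute continuity, the mixture covariance is only bounded below by alpha * Sigma_p:
\[
\Sigma_{r_\alpha} = \alpha\Sigma_p + (1-\alpha)\Sigma_q \;\succeq\; \alpha\Sigma_p
\quad\Longrightarrow\quad
(\Sigma_{r_\alpha} + \lambda I)^{-1} \;\preceq\; (\alpha\Sigma_p + \lambda I)^{-1}
\;\preceq\; \tfrac{1}{\alpha}\,(\Sigma_p + \lambda I)^{-1}.
\]
```

This is where the extra 1/α enters. Under dp/dq ≤ M one instead has Σ_p ⪯ M Σ_q, hence Σ_{r_α} ⪰ (1−α)Σ_q ⪰ ((1−α)/M) Σ_p, so the same terms are controlled with a factor M/(1−α) that does not blow up as α tends to 0.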
Thank you for pointing out this essential point, which we will clarify in the paper by giving an alternative proof and an alternative bound.
- Question: "In Proposition 4, the definition of needs to be recalled."
Reply: Thanks for the suggestion, we will indeed recall that φ is the feature map of the kernel into its RKHS.
- Question: "In Proposition 4, could you please motivate the condition on the Radon-Nikodym derivative to be bounded above by ." .
Reply: See the answer above; in summary, this assumption enables us to get a tighter bound with respect to the regularization parameter α when it gets smaller.
- Question: "To facilitate the lecture on the bound given in Equation (5), I think it will be better to write its order."
Reply: Thanks for the suggestion. We already worked on this after the submission and agree that presenting our results in this simplified manner is an excellent idea. A simplified bound is provided in the general comment to all the reviewers.
- Question: "In Proposition 5, I didn't understand the definition of the gram kernels , , . Do these operators design the covariance operator or other more general kernels?
Reply: Our notations here were confusing and we have changed them. In the submission, the matrices defined at l.207 refer to standard Gram matrices associated with the kernel and the sample sets of p and q, and the remaining matrix is the cross Gram matrix between the two sample sets. These notations hence refer to Gram matrices and not covariance operators. In a nutshell, our statement in Proposition 5 shows that the regularized KKL between two empirical distributions writes as a simple function of Gram matrices.
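As an illustration of the kind of Gram-matrix reduction involved (a minimal sketch for intuition, not the statement of Proposition 5; the Gaussian kernel and helper names are illustrative): the entropy-type term tr[Σ̂_p log Σ̂_p] of the empirical covariance operator Σ̂_p = (1/n) Σ_i φ(x_i) ⊗ φ(x_i) depends only on the spectrum of the Gram matrix divided by n, since the two share the same nonzero eigenvalues.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    # n x n Gram matrix of a Gaussian kernel on the sample x.
    sq = np.sum(x**2, 1)[:, None] + np.sum(x**2, 1)[None, :] - 2 * x @ x.T
    return np.exp(-sq / (2 * sigma**2))

def empirical_entropy_term(x, sigma=1.0, eps=1e-12):
    # tr[Sigma_hat log Sigma_hat] computed from the Gram matrix:
    # the nonzero eigenvalues of Sigma_hat coincide with those of K / n.
    n = x.shape[0]
    lam = np.linalg.eigvalsh(gaussian_gram(x, sigma) / n)
    lam = lam[lam > eps]                 # 0 * log 0 = 0 convention
    return float(np.sum(lam * np.log(lam)))

x = np.random.default_rng(0).normal(size=(100, 2))
print(empirical_entropy_term(x))
```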
- Minor typos: thanks for spotting these, we corrected accordingly.
Summary:
In this paper, the authors study kernel KL divergences (Equation 1). First, they extend the definition of the kernel KL, via regularization, to a setting that works for distributions with disjoint supports (Equation 2). In Propositions 2 and 3, they prove approximation results for the regularized kernel KL divergence compared to its unregularized version. Second, they derive convergence rates for finite-sample estimation of kernel KL divergences (Proposition 4). Moreover, they obtain a closed-form expression for the new divergence (Proposition 5). They also study the Wasserstein gradient flow of the new divergence.
Strengths
Pros:
- very well-written paper
- well-motivated setting with comprehensive results
Weaknesses
Cons:
- the title does not reflect the contributions of the paper
Questions
Questions/Comments:
I recommend changing the title of the paper. The whole draft is devoted to the study of the "regularized kernel KL divergence", but in the title there is no reference to it being regularized.
- Line 124: what do you mean by the assumption "is compact"?
- Line 131: typo, "an interesting".
- How tight is the bound in Proposition 4?
Limitations
N/A
Thank you for your valuable feedback and time. We have addressed your comments point-by-point below. Please don't hesitate to let us know if you have any further questions.
- Weaknesses: "the title does not reflect the contributions of the paper"
Reply: We acknowledge that the title does not reflect the fact that we study the regularized KKL, nor that we perform gradient flows on it; we will add this precision to the title.
- Question: "Line 124: what do you mean by the assumption 'is compact'?"
Reply: " is compact" is a typo, it was originally a subset of . Thank you for this correction.
- Questions: tightness of Prop 4
Reply: Please see the general comment.
Thank you! I appreciate the authors' response. I believe that if this paper is accepted, the title must definitely be changed. Also, I continue to support the paper, so I keep my positive score.
We sincerely thank all the reviewers for their positive comments on our paper, as well as for their relevant suggestions and questions. We address here the general comments and questions. If we have adequately addressed the reviewers' concerns, a re-evaluation would be greatly appreciated. For any unresolved issues, we are ready to engage further.
- Contributions (to all reviewers): We propose a kernel-based divergence relying on kernel covariance operators (in contrast to the kernel mean embeddings used in the Maximum Mean Discrepancy) that is computable in closed form (see our Proposition 5). We derive theoretical results on this divergence, including monotonicity with respect to the regularization parameter, statistical rates, and well-posedness and closed-form expressions for its derivatives. This enabled us to use it in a gradient flow setting to study its empirical behavior when approximating probability distributions: it performs better than the MMD and is more practical than alternative kernel-based approximations of the Kullback-Leibler divergence such as KALE (which is not closed-form). We think our study shows that using higher-order moments (e.g. covariances instead of mean embeddings) enables comparing probability distributions in a more powerful manner than the Maximum Mean Discrepancy, while remaining closed-form and tractable.
- About the absolute continuity assumption in Proposition 2 (reviewers 19Wg and 6zq3): This is a typo and we thank the reviewers for spotting this inconsistency. This condition is not necessary to ensure the decreasing property of the regularized KKL in the parameter α, as can be seen in its proof in Section B.1, since the regularized KKL can be defined even without absolute continuity.
- About the bound in Proposition 4 and its tightness (reviewers 19Wg and r7ey): we simplified our bound, which makes it easier to see which terms dominate it. We also see, as mentioned on line 184, that with the appropriate choice of parameters we recover a bound similar to that of Proposition 7 of [1]. Notice that this is a rate similar (up to log factors) to the minimax one for MMD estimation [2], which relies on the estimation of similar Bochner integrals (kernel mean embeddings instead of kernel covariance operators). We did not tackle the study of lower bounds, as the computation of our upper bound was already very technical (see the proof sketch at line 190; the full proof is deferred to the appendix); we leave the question of lower bounds for future work.
- About Figure 2.a (reviewer 6zq3): we have attached a pdf containing the corrected version of Figure 2.a.
[1] Bach, F. Information Theory with Kernel Methods, IEEE Transactions on Information Theory, 2023.
[2] Tolstikhin, I., Sriperumbudur, B. K., Muandet, K. Minimax Estimation of Kernel Mean Embeddings, JMLR, 2017.
The paper proposes a regularized version of the Kernel Kullback-Leibler divergence (KKL, Bach 2022) and studies its statistical properties, including in particular a finite-sample bound, as well as the associated Wasserstein gradient flow.
Reviewers all agree that the presented results are novel and of interest to the community.
Additional comments:
- The title should be changed, as suggested by Reviewer r7ey.
- Eq. (3): the second term on the right-hand side needs to be corrected. The exact reference for this identity in Ando (1979), i.e. which equation and page in that paper, should be given.
- Proposition 3: as mentioned by Reviewer 6zq3, this bound becomes trivial when α is close to 1, so it is a loose bound.
- In Proposition 4, the assumption that p is absolutely continuous with respect to q (with bounded density ratio) is a weakness, since the regularized KKL is supposed to be finite for all pairs p, q. One should be able to obtain a finite-sample complexity bound in this general setting.