Hybrid Kernel Stein Variational Gradient Descent
Abstract
Reviews and Discussion
This paper discusses h-SVGD as a method to address variance collapse by using separate kernels for the driving and repulsive terms. The authors extend the theoretical aspects of the original paper and demonstrate that h-SVGD does not converge to the target distribution in the mean field limit. The empirical results support the theoretical findings.
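For orientation, the sketch below illustrates one particle update of this hybrid kernel scheme, using two RBF kernels with different bandwidths for the driving and repulsive terms. This is a minimal illustration assuming fixed bandwidths; the function names, step size, and bandwidth choices are illustrative and not taken from the paper.

```python
import numpy as np

def rbf_kernel_and_grad(X, bandwidth):
    """RBF kernel matrix K[j, i] = exp(-||x_j - x_i||^2 / (2 h^2)) together with
    its gradients grad_K[j, i, :] = d/dx_j K[j, i], for particles X of shape (n, d)."""
    diffs = X[:, None, :] - X[None, :, :]             # (n, n, d), entry [j, i] = x_j - x_i
    sq_dists = np.sum(diffs ** 2, axis=-1)            # (n, n)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))    # (n, n)
    grad_K = -diffs / bandwidth ** 2 * K[:, :, None]  # (n, n, d)
    return K, grad_K

def hsvgd_step(X, score, h_drive, h_repulse, step_size=1e-2):
    """One hybrid kernel SVGD update: the driving term uses the kernel with
    bandwidth h_drive, the repulsive term uses the kernel with bandwidth h_repulse."""
    n = X.shape[0]
    K1, _ = rbf_kernel_and_grad(X, h_drive)           # driving kernel
    _, grad_K2 = rbf_kernel_and_grad(X, h_repulse)    # repulsive kernel
    scores = score(X)                                 # (n, d), rows are grad log p(x_j)
    phi = (K1.T @ scores + grad_K2.sum(axis=0)) / n   # (n, d) update direction
    return X + step_size * phi

# Toy usage: 30 particles targeting a 2-d standard normal, for which score(x) = -x.
rng = np.random.default_rng(0)
particles = rng.normal(size=(30, 2))
for _ in range(500):
    particles = hsvgd_step(particles, lambda X: -X, h_drive=1.0, h_repulse=2.0)
```

With h_drive equal to h_repulse this reduces to the standard SVGD update, which is consistent with the observation below that h-SVGD adds no computational overhead over SVGD.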
Strengths
- Theoretical contributions are rigorous and sound, including the existence of a solution to the hybrid Stein PDE and the establishment of the descent lemma. A kernelised Wasserstein gradient flow interpretation is also thoroughly discussed.
- The experiments involve diverse datasets and metrics.
- The representation of experiments is clear and the paper is well-structured.
- It is good that the authors quantify that h-SVGD adds no additional computational cost over SVGD.
Weaknesses
- The title may be a bit misleading, as it suggests that h-SVGD is a novel algorithm introduced in this paper, whereas it was previously proposed. The title could instead reflect the paper's focus on theoretical analysis and empirical evaluation of h-SVGD, rather than on introducing it as a new method.
- The main focus on kernels of the same form for the driving and repulsive terms may be somewhat limiting, as it potentially restricts the exploration of using truly distinct kernels for the two terms.
- While RMSE and LL have been assessed in Appendix B, h-SVGD does not show clear benefits over SVGD in these metrics. It would be better to provide a more detailed discussion in the main text about why DAMV is an appropriate metric for evaluating h-SVGD's performance, especially in relation to the variance collapse problem.
- Using "S-SVGD" and "SSVGD" interchangeably is slightly confusing and could be more consistent.
Questions
- The authors mention convergence issues in the mean field limit. Are there any ways to mitigate this bias?
- The authors focus on the case where both kernels have the same form. Have they tried using two completely different kernels?
- The authors reference h-SVGD's promising results on image classification. Have they conducted similar experiments in this paper?
- From an applications perspective, why is it important to avoid variance collapse?
We greatly thank the reviewer for their time in providing their review and feedback.
Weaknesses:
- We appreciate the suggestion for a new title and have updated the title to "Convergence aspects of hybrid kernel SVGD".
- We would like to draw attention to Section 3.5, where the gradient flow is explored in the more general case where the driving and repulsive kernels may be from different families. Conditions are given for the existence of a Wasserstein gradient flow in Proposition 3.7. Also, there is a concrete calculation in Proposition 3.8 demonstrating the non-convergence in the mean field limit for a specific target.
- We would like to highlight that Tables 3 and 4 in Appendix B compare h-SVGD against S-SVGD, not vanilla SVGD. S-SVGD is a variant that addresses variance underestimation. Table 3 shows that h-SVGD outperforms S-SVGD at alleviating variance collapse on 9 out of 10 datasets, and runs much faster. Table 4 shows that h-SVGD and S-SVGD are comparable in terms of the RMSE and LL test metrics, but as per Table 2, they both outperform regular SVGD on these metrics. Hopefully this makes it clearer which algorithm we are comparing h-SVGD against in Tables 3 and 4. (A short sketch of the DAMV computation raised by the reviewer is given after this list of responses.)
- We appreciate the reviewer's attention to detail in this comment. All uses of SSVGD have now been updated to S-SVGD for consistency.
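For reference, assuming DAMV here denotes the dimension-averaged marginal variance of the particle set (the variance metric discussed by the reviewer), a minimal sketch of that computation could look like the following; the function name is illustrative only.

```python
import numpy as np

def damv(particles):
    """Dimension-averaged marginal variance: the sample variance of the particles
    along each coordinate, averaged over all coordinates."""
    return np.var(particles, axis=0, ddof=1).mean()

# Example: particles drawn exactly from a standard normal give DAMV close to 1;
# variance collapse shows up as DAMV falling well below the target's marginal variance.
rng = np.random.default_rng(0)
print(damv(rng.normal(size=(50, 100))))   # roughly 1.0
```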
Questions:
- Non-convergence in the mean field limit is a property of h-SVGD. Note that in the two concrete examples (Corollary 3.4 and, for the general kernel setting, Proposition 3.8), the mean field limiting distribution has the same mean as the target. Therefore, the introduced bias increases the variance, but does not affect the mean. In this way, h-SVGD samples from a distribution with an overestimated variance, and this offsets the variance underestimation in the finite particle setting.
- The case of two RBF kernels with different bandwidths is explored in Section 3.5. This includes concrete calculations in the proof of Proposition 3.8. The theoretical analysis of two kernels with different forms (e.g. RBF and IMQ) is generally intractable because of the Fourier transforms involved (see Equation 22). Based on Proposition 3.8, we do not expect mean field convergence under other more general settings.
- Since the utility of h-SVGD on image analysis tasks has already been demonstrated (D'Angelo et al., 2021), we chose to focus on other benchmark datasets. We chose the BNN example because it was studied in the original SVGD paper by (Liu and Wang, 2016) as well as the S-SVGD paper (Gong et al., 2021).
- Capturing variance of the posterior is critical for uncertainty quantification in Bayesian statistics since variance is often used as a measure of confidence in an estimate (we have included this comment in the second paragraph of Section 1). Taking the BNN example, variance collapse in this setting leads to a set of networks with almost identical parameters, which defeats the purpose of taking a Bayesian approach.
I thank the authors for clarifications and adjustments. I have increased my confidence by 1.
This paper gives more theoretical insights about a sampling method named hybrid kernel Stein variational gradient descent (h-SVGD). It is demonstrated that this method effectively samples a law which is not the target law (but linked to it). Moreover, it provides a descent result and a discretization quantification. Finally, it presents experiments that show the empirical interest of h-SVGD.
Strengths
The paper provides interesting theoretical results on h-SVGD. It extends the known results on SVGD to h-SVGD. The experimental results effectively support the interest of h-SVGD.
Weaknesses
Major weaknesses:
- Assumption (A1) does not seem to be well written. A different condition would be more realistic to ensure that the target defines a probability measure. Is it a writing error?
- There are few intuitions and comments in the paper, making it hard to read. In particular, I suggest discussing the assumptions in detail. A reminder on Wasserstein gradient flows would be appreciated. Finally, Proposition 3.6 is not commented on.
Minor weaknesses:
- Equation 6: there is a typo; the arguments of the kernel appear to be in the wrong order.
- Line 198: in the paper's notation, one of the symbols appears to be a typo.
- The remark in line 246 seems inconsistent, because Assumption (A1) implies that the relevant quantity is bounded.
- In line 250, the constant is said to depend on two quantities, but it is not clear how one of these variables is defined.
- In line 329, the normal density does not satisfy Assumption (A1); in that case the potential is quadratic.
- Line 226: I suggest being more precise about the meaning of "symmetric function" in this context.
Questions
- In Assumption (A2), the first part can be deduced from Assumption (A1). Indeed, Assumption (A1) implies that the relevant quantity is bounded. Why do you state this assumption separately?
- Can you give an interpretation of the second part of Assumption (A2)?
- Can you give an interpretation of Assumption (A3)?
- Do you have examples of potentials that satisfy Assumptions (A1)-(A2)-(A3)-(A4)?
- Based on your analysis of h-SVGD, do you have a proposal for a sampling strategy that does not suffer from variance collapse and samples the correct distribution?
We greatly thank the reviewer for their time in providing their review and feedback.
Weaknesses:
Major weaknesses:
- The reviewer is correct, thank you. This is a typo and Assumption (A1) has been corrected accordingly.
- Thank you. Following this suggestion, we have included some interpretation and commentary on the assumptions in Section 3.1. We have also included a reminder on the Wasserstein gradient flow definition in Section 3.4. Finally, we have added some commentary on the interpretation of Proposition 3.6.
Minor Weaknesses:
- Due to the symmetry assumption on the kernels, Assumption (B1), both options are equivalent. For consistency with equation (2), we have updated equation (6) as suggested.
- Line 198 should remain as written; the symbol suggested by the reviewer is not defined in this setting, and only the two existing quantities appear there.
- Following the correction to the typo in Assumption (A1) mentioned above, the remark in line 246 is now consistent.
- The unclear symbol in line 250 has been changed to the target distribution. We appreciate the reviewer's attention to detail.
- Following the correction to the typo in Assumption (A1) mentioned above, the normal density now satisfies Assumption (A1).
- Since both kernels are functions on R^d x R^d, by symmetry we mean k(x, y) = k(y, x) for all x and y. As per the suggestion, we have added this definition in Assumption (B1).
Questions:
- Following the correction to the typo in Assumption (A1) mentioned above, those two assumptions are now independent.
- Assumptions (A2) and (A3) can be interpreted as controls on the decay of the tails of the distribution, ensuring well-behaved tails. We have now included a comment on this in the paragraphs following the assumptions.
- Following the correction of the typo in Assumption (A1) and the update to the first part of Assumption (A2), Gaussian potentials and mixtures thereof satisfy all four assumptions. Many other continuous distributions from the exponential family also satisfy these assumptions.
- We present h-SVGD as a sampling method that, despite sampling from a different distribution than the target, alleviates variance collapse in practice. This happens because the distribution from which the particles are sampled overestimates variance in the mean field limit, as seen in Corollary 3.4, and this overestimation offsets the variance underestimation that occurs in the finite particle setting with regular SVGD.
I thank the authors for their responses and modifications. There were two main errors in Assumptions (A1) and (A2). The paper is still hard to read for me. In particular:
- Can you be more precise about your comment on the second part of Assumption (A2)? It is not clear to me why this assumption only constrains the tails; it looks to me like a convexity condition. Can you give a detailed interpretation of and comments on this assumption?
- Regarding Assumption (A3), can you give a more detailed interpretation and comments? This assumption seems unclear to me and I do not understand its scope.
- Moreover, in the revised version, the assumptions used in Proposition 3.3 and Proposition 3.6 are not stated.
I tried to go through the proofs and I wonder:
- Can you explain how to pass from the first line of Equation (29) to the second line?
We thank the reviewer for their prompt response.
- The reviewer is correct in that Assumption (A2) is related to the convexity of the potential of our target distribution, whose density takes the form p(x) ∝ exp(-V(x)) for a potential V. Here the assumption relates to the log-concavity of the target distribution, where log-concave distributions are those whose density is exp(-V(x)) with V a convex function. There are indeed known links between log-concavity and the tails of distributions, in particular whether they are heavy or light tailed (Asmussen & Lehtomaa, 2017).
- We note that Assumption (A3) is only required for the existence and uniqueness of the hybrid Stein PDE solution (Proposition 3.2) and is similar to the assumptions in (Lu et al., 2019), which contains the original proof in the single kernel setting. This is an important auxiliary result that provides context for our main results on the Wasserstein gradient flow and non-convergence in the mean field limit. One may interpret Assumption (A3) as a condition that controls the growth of the first and/or second derivatives of the potential. Indeed, the bound on these derivatives within a given ball depends on the radius of that ball.
- We had initially written the required assumptions in the text of Section 3.4, but have now included them within Propositions 3.3, 3.6 and 3.7 directly for greater clarity. Note that the other assumptions listed in Section 3.1 are not needed for these results. They are only needed for Sections 3.2 and 3.3.
Lastly, regarding the comment on the proof, we have added two extra lines of calculations leading up to Equation (29) to improve the clarity of the calculations. The key step is the use of the product rule when differentiating the term in square brackets.
References:
- Asmussen, S., & Lehtomaa, J. (2017). Distinguishing log-concavity from heavy tails. Risks, 5(1), 10.
- Lu, J., Lu, Y., & Nolen, J. (2019). Scaling limit of the Stein variational gradient descent: The mean field regime. SIAM Journal on Mathematical Analysis, 51(2), 648-671.
I thank the authors for their more detailed answers with interpretations. The paper is still hard to read for me and I am unable to ensure that the proofs are correct. However, due to the modifications and improvements to the paper, I increase my score to 6.
This paper studied the theoretical foundation of the hybrid kernel Stein variational gradient descent (h-SVGD) method, which is a variant of the vanilla SVGD method. Specifically, the authors demonstrated the ability of h-SVGD to alleviate variance collapse, and showed the existence of a solution to the hybrid Stein partial differential equation for h-SVGD. They also showed that h-SVGD does not converge to the target distribution in the mean field limit. Experiments have been provided to show the promising properties of h-SVGD.
Strengths
- The theoretical foundation of h-SVGD is established, which is new and important to this research topic.
- Besides the new results, some existing results are proved with relaxed and more practical assumptions.
Weaknesses
To be honest, since I am not familiar with the mathematical tools used in this work, I can only judge the paper based on what the authors claim to have done, and I cannot say much about its weaknesses.
Questions
No further questions.
We greatly thank the reviewer for their time in providing their review and feedback. We appreciate the acknowledgement of our contributions on the theoretical constraints and practical advantages of hybrid SVGD.
This paper explores the theoretical understanding of the hybrid kernel variant of Stein variational gradient descent (h-SVGD). Specifically, it provides a kernelized Wasserstein gradient flow representation for h-SVGD and, building upon this, offers the dissipation rate and large particle limit of h-SVGD. In numerical experiments, the authors demonstrate that h-SVGD significantly improves variance collapse compared to vanilla SVGD.
Strengths
- This paper provides a relatively comprehensive theoretical foundation for the empirical method hybrid kernel variant of SVGD.
- This paper offers a clear explanation of the relationships and distinctions between the theoretical results of h-SVGD and those of SVGD.
Weaknesses
- The experimental results presented in the paper offer relatively limited support. In results on the Protein and Kin8nm datasets, SVGD does not appear to suffer from variance underestimation issues, yet its performance is still inferior to that of h-SVGD. This observation suggests that the advantage of h-SVGD in BNN tasks may not primarily stem from mitigating variance underestimation.
Questions
- The result in Corollary 3.4 indicates that h-SVGD does not guarantee the distribution of particles converges to the target distribution. While this is reasonable, the metrics commonly used in Bayesian Neural Network (BNN) tasks, such as test RMSE and test LL, do not reflect this limitation. This raises the question of how to interpret the advantages of h-SVGD despite this theoretical constraint.
Details of Ethics Concerns
No
We greatly thank the reviewer for their time in providing their review and feedback.
Weaknesses:
- We appreciate the insightful observation that the improvements in RMSE and LL may be due to factors other than mitigation of variance collapse. We have included a comment on this in Section 4.2.
- We would also like to address the comment about limited support. The BNN experiment offers support for improved variance in two ways. Firstly, through improved DAMV on all but two datasets. Secondly, there is a well-known decomposition of the MSE, namely MSE = bias^2 + variance. Since RMSE is improved on all but one dataset, this suggests an improved variance estimation (a short numerical check of this decomposition is given after this list of responses).
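As a sanity check of the decomposition invoked above, the short snippet below verifies MSE = bias^2 + variance numerically on synthetic scalar predictions; the numbers are made up purely for illustration and are unrelated to the BNN experiments.

```python
import numpy as np

# Verify the identity MSE = bias^2 + variance on synthetic predictions of a scalar target.
rng = np.random.default_rng(0)
y_true = 2.0
preds = y_true + 0.3 + rng.normal(scale=0.5, size=100_000)   # biased, noisy predictions

mse = np.mean((preds - y_true) ** 2)
bias_sq = (np.mean(preds) - y_true) ** 2
variance = np.var(preds)        # population variance (ddof=0) makes the identity exact

print(round(mse, 4), round(bias_sq + variance, 4))           # the two values coincide
```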
Questions:
- We would like to emphasise that Corollary 3.4 is a result in the mean field limit, a common setting throughout the SVGD literature. However, the issue of variance underestimation occurs in the finite particle setting, which is more commonly encountered in practice. One can interpret the increased variance in the limiting state distribution as counteracting the variance underestimation in the finite particle regime, thereby leading to more accurate variance estimates in practice. We appreciate the observation and have chosen to include additional commentary on this point in the discussion of results in Section 4.2.
I would like to thank the authors for their valuable discussion of the point I am concerned about. Since the mean of the Bayesian posterior is the minimum mean squared error (MSE) estimator, the annealed posterior is typically considered a suboptimal estimator w.r.t. MSE. However, the limiting distribution of particles from h-SVGD is exactly the annealed posterior. Therefore, I still have some concerns about why the annealed posterior is a better choice for the BNN task. I think this requires further understanding; however, I also appreciate the authors for providing the theoretical description of h-SVGD.
We appreciate your follow-up response. The h-SVGD algorithm is a better approach in the BNN setting due to the high dimension of the inference problem (several hundred dimensions in this case). In practice, vanilla SVGD suffers from variance collapse in high dimensions such as this because it is not computationally feasible to simulate enough particles. The advantage of h-SVGD is that it can yield superior performance in terms of test RMSE when the number of particles is much lower than the dimension of the inference problem.
Note that h-SVGD will suffer when the number of particles is large. This is because in the mean field limit (or large particle setting), Corollary 3.4 shows that h-SVGD will not converge to the target distribution.
To better illustrate this, we have run additional simulations on the energy dataset from the BNN example in Section 4.2. The experiment in the submission uses N = 20 particles, but we have run the experiment again with up to N = 1000 particles. For larger N we reduced the number of times the experiment ran due to time constraints.
Note that increasing N does not produce an improvement in the RMSE of vanilla SVGD outside the standard error range. However, as long as the number of particles is sufficiently lower than the dimension (several hundred in this case), the greater repulsive force in h-SVGD can alleviate variance collapse in such a way that it improves the RMSE.
| N | Experiments | RMSE (SVGD) | RMSE (h-SVGD) |
|---|---|---|---|
| 20 | 20 | 1.528 ± 0.169 | 1.040 ± 0.128 |
| 200 | 20 | 1.463 ± 0.204 | 1.031 ± 0.149 |
| 400 | 20 | 1.453 ± 0.169 | 1.043 ± 0.142 |
| 500 | 20 | 1.422 ± 0.149 | 1.286 ± 0.129 |
| 600 | 20 | 1.447 ± 0.196 | 1.331 ± 0.194 |
| 1000 | 5 | 1.471 ± 0.171 | 1.365 ± 0.151 |
We thank each of the reviewers for their time reviewing our submission, and their considered feedback. As a direct result, the following changes have been made to the manuscript.
- The title has been updated to "Convergence aspects of hybrid kernel SVGD".
- A note on the importance of variance estimation has been added to the second paragraph of the introduction.
- Equation (6) has been updated for consistency with Equation (2).
- A typo in the condition of Assumption (A1) has been corrected.
- A typo in Assumption (A2) has been corrected; the left-hand side of the first inequality has been updated.
- Some commentary on and interpretation of the assumptions in Section 3.1 has been added after they are introduced.
- The definition of symmetric has been included in Assumption (B1) for clarity.
- A typo in Proposition 3.1 has been corrected so that the constant depends on the target distribution.
- A reminder of the definition of a Wasserstein gradient flow has been included in Section 3.4, along with references to two standard texts.
- A comment on the interpretation of Proposition 3.6 has been added after the statement of the proposition.
- A comment linking the improved variance estimation in practice with the mean field variance overestimation in Corollary 3.4 has been added to the start of Section 4.
- In Section 4.2, we included further interpretation of the improved variance collapse and linked it to Corollary 3.4. We also included a comment about how the improved RMSE and LL may be due to factors other than increased DAMV.
- All instances of SSVGD have been changed to S-SVGD for consistency. This is primarily in Appendix B.
The paper claims to develop a theory of h-SVGD, an extension of SVGD that alleviates variance collapse without computational overhead. Using the developed theory, it is claimed that h-SVGD does not converge to the target distribution in the limit of having infinite particles.
The strengths of this work are in establishing a theoretical foundation of h-SVGD.
A main weakness of the work is the clarity and issues with the mathematical presentation as pointed out by the reviewers. Reviewers were not able to verify the mathematical correctness or soundness of the presented proofs. Moreover, the experimental evaluation is limited and it remains unclear whether the advantage of h-SVGD really stems from mitigating variance underestimation.
I recommend rejection with a main reason being that the practical implications of the main theoretical result remain unclear. For a resubmission, I encourage the authors to simplify and improve the clarity of the theoretical results, as well as add numerical results which show a clearer connection to the established theorems.
Additional Comments from Reviewer Discussion
Reviewer J5RT provided a short positive review, but could not verify the mathematical details. Similarly, Reviewer 2J3T could not ensure the correctness of the mathematical results, or even the soundness of the assumptions, even after discussions with the authors.
Reviewer YtiA found that the experiments do not match well with the theoretical results and did not change their view after the authors' rebuttal.
Despite an overall positive score, these issues influenced my final decision to reject the work.
Reject