PaperHub
ICLR 2025 · Rejected (4 reviewers)
Average rating: 6.3/10 (individual ratings 6, 8, 3, 8; min 3, max 8, std. dev. 2.0)
Average confidence: 3.3 · Correctness: 2.5 · Contribution: 2.5 · Presentation: 3.0

Improving Generalization with Flat Hilbert Bayesian Inference

OpenReview · PDF
Submitted: 2024-09-16 · Updated: 2025-02-05


Keywords: Bayesian Inference, Sharpness-aware Minimization

Reviews & Discussion

Official Review (Rating: 6)

In traditional variational inference, the goal is to approximate an intractable posterior $p(\theta \mid \mathcal{S})$ by selecting a variational distribution $q$ from a family $\mathcal{Q}$ that minimizes a divergence, often the KL divergence, resulting in the optimization problem:

$$\arg\min_{q \in \mathcal{Q}} \, \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid \mathcal{S})\big),$$

where $p(\theta \mid \mathcal{S})$, termed the "empirical posterior" by the authors, is the target distribution.
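As standard background (this identity is textbook variational inference, not specific to the paper): the KL objective above is tractable despite the unknown normalizer, because it differs from the negative evidence lower bound (ELBO) only by a constant in $q$:

$$\mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid \mathcal{S})\big) = \mathbb{E}_{q}\big[\log q(\theta) - \log p(\mathcal{S} \mid \theta) - \log p(\theta)\big] + \log p(\mathcal{S}),$$

so minimizing the KL over $q \in \mathcal{Q}$ is equivalent to maximizing the ELBO, and the intractable evidence $\log p(\mathcal{S})$ never needs to be evaluated.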

This paper introduces a variational family within a Reproducing Kernel Hilbert Space (RKHS) framework and further reformulates the optimization problem to focus on approximating the "general posterior," $p(\theta \mid \mathcal{D})$, over the underlying data distribution $\mathcal{D}$. The optimization problem thus becomes:

$$\arg\min_{f \in \mathcal{H}^d,\ \|f\| \leq \epsilon} \, \mathrm{KL}\big(q_{[I + f]}(\theta) \,\|\, p(\theta \mid \mathcal{D})\big),$$

where $q_{[I + f]}(\theta)$ represents the transformed variational distribution. This reformulation leads to multiple steps of approximation, relaxation, and bounding, culminating in an iterative optimization procedure (detailed below Lemma 1).

The proposed method, Flat Hilbert Bayesian Inference (FHBI), is designed to enhance generalization in Bayesian inference by leveraging the structure of RKHS.
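To make the procedure concrete, the sketch below implements one FHBI-style iteration as the summary above suggests: an SVGD descent direction combined with a SAM-style functional ascent of radius $\rho$. This is a reconstruction under our own assumptions (the function names, the RBF kernel, and the Frobenius-norm stand-in for the RKHS norm are illustrative), not the authors' Algorithm 1.

```python
import numpy as np

def svgd_direction(X, grad_log_p, sigma=1.0):
    """SVGD update direction (Liu & Wang, 2016): the steepest-descent
    direction of KL(q || p) in the RKHS. X holds m particles of dim d."""
    m = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]          # diffs[i, j] = theta_i - theta_j
    K = np.exp(-np.sum(diffs ** 2, axis=-1) / (2 * sigma ** 2))
    drive = K @ grad_log_p(X)                      # kernel-smoothed score term
    repulse = np.sum(K[:, :, None] * diffs, axis=1) / sigma ** 2  # keeps particles apart
    return (drive + repulse) / m

def fhbi_step(X, grad_log_p, rho=0.05, eps=1e-2, sigma=1.0):
    """One FHBI-style iteration as we read the description above:
    a SAM-like ascent of radius rho in function space, followed by an
    SVGD descent step of size eps evaluated at the perturbed particles."""
    phi = svgd_direction(X, grad_log_p, sigma)     # descent direction of the KL
    phi_norm = np.linalg.norm(phi) + 1e-12         # Frobenius stand-in for the RKHS norm
    X_adv = X - rho * phi / phi_norm               # ascend the KL: move against descent
    return X + eps * svgd_direction(X_adv, grad_log_p, sigma)
```

Here `grad_log_p(X)` returns the score $\nabla_\theta \log p(\theta \mid \mathcal{S})$ at each particle row, e.g., computed by backpropagation for a neural network.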

Strengths

  • Originality: I'm uncertain about the level of originality here. Variational families within an RKHS framework have, to my knowledge, been explored previously.
  • Quality: The work appears to be of high quality. The results seem technically correct, and while I haven’t meticulously checked every mathematical detail, the derivations appear consistent with expectations.
  • Clarity: Overall, the paper is well-written. I appreciated the clear, step-by-step walkthrough of the optimization problem and the detailed exposition of various bounds and relaxations necessary to implement the proposed approach.
  • Significance: The proposed method, Flat Hilbert Bayesian Inference (FHBI), is presented as a generalization of Stein Variational Gradient Descent (SVGD) and Sharpness-Aware Minimization (SAM). The experiments demonstrate promising potential for FHBI in applications like LoRA-style fine-tuning.

Weaknesses

  • The general posterior as defined, $p(\theta \mid \mathcal{D})$, does not appear to me to be the correct theoretical counterpart to the "empirical" posterior. See more in the questions below.
  • If the paper's originality hinges partly on targeting the "general posterior" $p(\theta \mid \mathcal{D})$, I have reservations about its practical benefits. In addition to my concerns about the missing sample-size term, targeting this posterior seems unlikely to yield practical advantages and might even be counterproductive. The authors expend considerable effort introducing approximations to recast the optimization problem in terms of the "empirical posterior," which requires potentially loose bounds. The value of this detour would be enhanced by a more thorough discussion of why these approximations are justified or necessary.
  • The paper emphasizes "improving generalization" as a primary benefit of FHBI, yet this claim seems tenuous without more foundational support. It's unclear what is meant by "enhancing generalization in Bayesian inference" mathematically, as the method doesn’t inherently introduce any features that theoretically boost generalization. Rather, it appears that FHBI, when applied to fine-tuning, yielded improved generalization performance on a benchmark dataset compared to baseline methods. This distinction between observed outcomes and the initial methodological intent would be clearer if the authors clarified that FHBI’s generalization performance was more an empirical finding than a theoretically driven design choice.

Questions

Major:

  • Something seems amiss in going from the empirical posterior $p(\theta \mid \mathcal{S})$ to what you call the general posterior $p(\theta \mid \mathcal{D})$. Let $L_n(\theta) = \frac{1}{n} \sum_{i=1}^n -\log p(y_i \mid x_i, \theta)$. Then the empirical posterior is $p(\theta \mid \mathcal{S}) \propto \exp(-n L_n(\theta))\, p(\theta)$. The population counterpart to this is $\exp(-n L(\theta))\, p(\theta)$. But this is not your general posterior, which no longer has any dependence on the sample size $n$.

Minor:

  • Currently, the way the loss $\ell$ shows up in Section 3 might give the wrong impression that it is something one can freely choose. In fact, the loss has to be (proportional to) the negative log-likelihood. I think you need to write out explicitly what $\ell(f_\theta(x), y)$ is.
  • The sentence above Eqn. (5), "In turn, the solution $\hat{f}^*$ that solves the maximization problem above is given by": which maximization problem above? There are so many approximations here; please use \label and \ref to refer to the exact maximization problem. (I also doubt this is a true statement; (5) is not really a solution but an approximation of a solution, right?)
  • Would you consider labeling the iterative procedure right below Lemma 1 and making it painfully clear how that turns into Algorithm 1?
Comment

We appreciate the reviewer’s valuable feedback and will address the concerns as follows:

  • Regarding the originality: While the reviewer correctly noted that the RKHS framework has been used in previous works on variational families, this is not the main contribution of our paper. The first novelty of our work is that, to improve generalization, we propose in Section 4 a novel Bayesian framework that aims to approximate the general posterior $p(\theta \mid D)$ instead of the empirical posterior $p(\theta \mid S)$ targeted by prior works such as SVGD [1]. In doing so, the model sampled from the final posterior will fit the whole data distribution $D$ instead of overfitting to the specific dataset $S$. Notice that we do not have access to this general posterior because $D$ is unknown. To address this general posterior, our second contribution is that we develop the concept of functional sharpness on the RKHS from Theorem 1, and propose to apply this concept to the context of Bayesian inference. In short, our main contribution is not the usage of the RKHS in Bayesian inference, but the introduction of the concept of sharpness on the RKHS, and the usage of this concept in Bayesian inference.
  • Clarification about the empirical and the general posteriors: Recall that prior works in Bayesian inference typically define the likelihood as $p(y_i \mid x_i, \theta) = \exp(-\ell(f_\theta(x_i), y_i))$, which gives the posterior the reviewer mentioned: $p(\theta \mid S) \propto \exp(-n L_S(\theta))\, p(\theta)$. Our work slightly modifies the likelihood by scaling with the dataset size: $p(y_i \mid x_i, \theta) = \exp(-\frac{1}{n} \ell(f_\theta(x_i), y_i))$. We emphasize that this formulation has been employed by prior works such as SA-BNN [3]. Moreover, the modification does not change the intuition of the original formulation, namely that a smaller loss yields a higher likelihood, and therefore retains its validity. With this formulation, the empirical posterior becomes $p(\theta \mid S) \propto \exp(-L_S(\theta))\, p(\theta)$. Then, we can define its general-posterior counterpart $p(\theta \mid D) \propto \exp(-L_D(\theta))\, p(\theta) = \exp(-\mathbb{E}_{(x, y) \sim D}[\ell(f_\theta(x), y)])\, p(\theta)$, which is valid because this definition does not contain $n$.
  • Regarding the benefits of addressing the general posterior: To understand the benefits of addressing the general posterior, let us first consider the method proposed by SAM [2]. SAM seeks to minimize $L_D(\theta)$ instead of $L_S(\theta)$, thereby reducing overfitting by fitting the data distribution rather than a specific dataset. Since $D$ is unknown, $L_D(\theta)$ cannot be computed; SAM therefore implicitly minimizes $L_D(\theta)$ through a sequence of approximations to the worst-case empirical loss within a radius. Our approach extends this idea of SAM to the regime of distributions. As discussed in Section 4: rather than approximating $p(\theta \mid S)$ as done in conventional Bayesian methods, we aim to approximate $p(\theta \mid D)$. By doing so, the posterior fits the whole data distribution $D$ rather than overfitting to a specific dataset $S$, so the model particles sampled from this posterior will not overfit to the specific dataset $S$ and hence improve generalization over prior Bayesian techniques. Similar to SAM, our approach requires approximations to address the general loss/posterior. Nevertheless, this approximation error can be mitigated with a proper choice of the hyperparameter $\rho$, which is typically a small constant.
  • Regarding the claim of improving generalization: As discussed above, the feature by which FHBI improves generalization is that it approximates $p(\theta \mid D)$ instead of $p(\theta \mid S)$. The model particles sampled from the final posterior then follow the true distribution instead of overfitting to the dataset $S$. This improvement is also evident in our ablation studies. As noted by SAM [2], lower sharpness correlates with better generalization. Section 6.2 demonstrated that FHBI reduces the sharpness of each particle compared to SVGD and also prevents particles from collapsing into a single mode. Together, these properties support our claim that FHBI leads to better generalization.
  • Regarding other minor issues: We appreciate the suggestions of the reviewer about the minor issues and have incorporated them in the revision.

References:

[1] Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A general purpose Bayesian inference algorithm. Advances in neural information processing systems, 29, 2016a.

[2] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.

[3] Van-Anh Nguyen, Tung-Long Vuong, Hoang Phan, Thanh-Toan Do, Dinh Phung, and Trung Le. Flat-seeking Bayesian neural networks. Advances in Neural Information Processing Systems, 2023.

Comment

Dear Authors,

I appreciate the significant effort you've put into this discussion period. However, several of my existing concerns regarding the intellectual basis of your claims remain unresolved, particularly about the relationship between targeting the population posterior $p(\theta \mid D)$ and its purported generalization benefits.

  1. What does it mean for the empirical posterior to overfit?

Your claim hinges on the idea that the empirical posterior $p(\theta \mid S)$ overfits because it is conditioned on the finite training set $S$. However, this interpretation of overfitting in the Bayesian posterior context is unconventional. The posterior is not a predictive model but rather a reflection of the likelihood and prior given $S$. The notion of overfitting typically applies to models rather than posterior distributions.

  2. Theoretical Claims vs. Practical Approximation

Supposing that there are truly rigorous generalization benefits to approximating $p(\theta \mid D)$, the practicality of achieving this target remains questionable: approximating $p(\theta \mid D)$ relies on functional perturbations and optimization over an RKHS, but every approximation introduces potential deviations. These deviations, while perhaps theoretically bounded, may render the connection between your approach and the benefits of targeting $p(\theta \mid D)$ tenuous in practice.

  3. Isolating the contribution of approximating the general posterior

Your empirical results demonstrate better generalization performance, but how do you isolate the contribution of approximating the general posterior $p(\theta \mid D)$ versus other factors that result from the layers and layers of approximations?

  4. Some more general thoughts

I can’t help but notice the trend in ICML/NeurIPS/ICLR-style works, where strong empirical results are often coupled with theoretical justifications to create a complete narrative. While this approach can be effective for framing contributions, it occasionally risks overstating the theoretical underpinnings of methods that achieve empirical success.

This paper introduces a methodology that works well empirically, and the attempt to provide a theoretical foundation is commendable. However, the scientific narrative could benefit from a more measured presentation of the theoretical claims, ensuring they are well-aligned with what is practically demonstrated. This would place the work on even firmer footing and contribute to the broader conversation about combining empirical performance with theoretical insights in a principled manner.

Comment

We thank the reviewer for their thoughtful and detailed response. We would like to address the concerns as follows:

  1. Clarification on "overfitting posterior": We wish to clarify that, throughout the main text and this discussion, when we state that the method addresses overfitting by approximating the population posterior $p(\theta \mid D)$, we are referring to alleviating the overfitting issue associated with the ensemble of particles rather than the posterior itself. To further address the reviewer's concern theoretically and verify the theoretical advantages of targeting $p(\theta \mid D)$, we have updated the rebuttal revision to include Proposition 1 at the beginning of Section 4. The statement of the proposition is as follows:

Consider the problem of finding the distribution $\mathbb{Q}$ that solves:

$$\mathbb{Q}^* = \arg\min_{\mathbb{Q} \ll \mathbb{P}} \; \mathbb{E}_{\theta \sim \mathbb{Q}}[L_D(\theta)] + \mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}),$$

where we search over $\mathbb{Q}$ absolutely continuous w.r.t. the prior $\mathbb{P}$, and the second term is a regularization term. The closed-form solution to this problem is exactly the population posterior $p(\theta \mid D)$.

In this proposition, we aim to identify the posterior distribution that minimizes the expected population loss, where the expectation is taken over the parameter space with $\theta \sim \mathbb{Q}^*$, while staying sufficiently close to the prior distribution to ensure simplicity. With access to $\mathbb{Q}^*$, we can sample a set of particles whose average performance optimally minimizes the population loss. Since the solution to this optimization problem corresponds exactly to the population posterior, the ensemble of particles sampled from $\mathbb{Q}^* \equiv p(\theta \mid D)$ effectively minimizes the average value of the population loss. This is because $\mathbb{Q}^*$ is explicitly chosen to minimize the expected value of the population loss $L_D$, which means the ensemble fits the whole data distribution instead of overfitting to the specific dataset $S$, thereby establishing improved generalizability. Consequently, this proposition theoretically asserts that sampling from $p(\theta \mid D)$ improves the generalizability of the ensemble.
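For completeness, the closed form asserted in the proposition follows from a standard Gibbs-variational-principle calculation (the algebra below is textbook material, with $Z$ denoting the normalizer, and is not quoted from the paper): for any $\mathbb{Q} \ll \mathbb{P}$,

$$\mathbb{E}_{\theta \sim \mathbb{Q}}[L_D(\theta)] + \mathrm{KL}(\mathbb{Q} \,\|\, \mathbb{P}) = \mathrm{KL}\!\left(\mathbb{Q} \,\Big\|\, \tfrac{1}{Z}\, e^{-L_D}\, \mathbb{P}\right) - \log Z, \qquad Z = \mathbb{E}_{\theta \sim \mathbb{P}}\big[e^{-L_D(\theta)}\big],$$

and since the first term on the right is minimized (at zero) by $\mathbb{Q}^*(\theta) \propto e^{-L_D(\theta)}\, \mathbb{P}(\theta)$, the minimizer is exactly the population posterior $p(\theta \mid D)$ under the scaled-likelihood convention discussed earlier.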

  2. Regarding the practical approximations: Our theoretical derivation relies on two approximations, both of which are widely established practices in prior works:
  • First approximation: We implicitly minimize the KL divergence to $p(\theta \mid D)$ by minimizing the worst-case loss within a neighborhood, $\max_{\|f' - f\| \leq \rho} \mathrm{KL}(q_{[I+f']} \,\|\, p(\theta \mid S))$. This step is necessary and inevitable because the data distribution $D$ is unknown. Importantly, we emphasize that the Euclidean analog of this approach is widely used in almost all sharpness-aware optimization techniques, including SAM and Fisher-SAM.

  • Second approximation: This occurs in Eq. (8), where we apply a first-order Taylor expansion. Such approximations are a standard practice in first-order optimization methods, including SGD and particularly SAM. A potential concern might arise because the perturbation and approximation are conducted in an RKHS, which may behave differently compared to perturbations in Euclidean space (as in SAM). However, it is noteworthy that the RKHS in question represents the space of functions $f$, which encode the movements of particles residing in Euclidean space. Therefore, perturbations in the RKHS are directly tied to the perturbations of individual particles. Consequently, higher-order terms like $O(\rho^2)$ in the RKHS can be safely omitted, just as they are in Euclidean space with SAM (this step is sketched in symbols below).

In summary, the derivation relies on only two approximations, both of which are well-established in the literature on first-order methods and sharpness-aware optimization. These approximations are widely used practices and can be safely applied without introducing significant deviations.
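To sketch the second approximation in symbols (this mirrors SAM's Euclidean derivation; the notation is ours, not quoted from the paper): expanding the KL to first order in the perturbation $f'$ turns the inner maximization into a linear problem over the RKHS ball,

$$\max_{\|f'\|_{\mathcal{H}^d} \leq \rho} \mathrm{KL}\big(q_{[I+f+f']} \,\|\, p(\theta \mid S)\big) \;\approx\; \mathrm{KL}\big(q_{[I+f]} \,\|\, p(\theta \mid S)\big) + \rho\, \big\|\nabla_f \mathrm{KL}\big\|_{\mathcal{H}^d},$$

attained at $f'^* \approx \rho\, \nabla_f \mathrm{KL} / \|\nabla_f \mathrm{KL}\|_{\mathcal{H}^d}$, with the neglected remainder being exactly the $O(\rho^2)$ term discussed above.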

  3. Regarding the isolation of the contribution of approximating $p(\theta \mid D)$: We would like to emphasize that the comparison between FHBI and SVGD already isolates the contribution of approximating $p(\theta \mid D)$, both theoretically and empirically. Specifically, if we modify the problem formulation in Eq. (6) by replacing $p(\theta \mid D)$ with $p(\theta \mid S)$, the ascending step (Eq. (10)) is no longer present, effectively reducing FHBI to SVGD. Thus, this comparison inherently represents the isolation of the contribution of approximating $p(\theta \mid D)$.

Finally, we note that while addressing the reviewers' concerns, the theoretical development of this paper has been further strengthened and refined. We sincerely thank the reviewer for their insightful and constructive comments, which have greatly enhanced the final revision. Please let us know if there are any remaining concerns or points that need further clarification.

Comment

Dear Reviewer wK81,

We are grateful for your time and effort in reviewing our paper. We have made an effort to further address your remaining concerns. We would greatly appreciate your valuable feedback on whether we have adequately addressed these issues.

Official Review (Rating: 8)

This paper presents Flat Hilbert Bayesian Inference (FHBI), a method to compute a posterior by integrating along a flow of distributions that converges to the target posterior. By formulating the derivative of the flow with a pushforward function that lives in a vector-valued RKHS, the framework naturally establishes a "flow of an infinite number of particles," which allows one to target the "general posterior" as opposed to the "empirical" posterior in the setting of Bayesian inference. Importantly, the method differs from Stein Variational Gradient Descent in that it derives the infinitesimal pushforward function that reduces the "worst upper bound" of the KL divergence in a way that resembles adversarial training.

The proposed paradigm is made implementable by the computable form of $\lambda p(\theta \mid S)$, together with an upper bound on the KL divergence to the general posterior defined in terms of the KL divergence to the empirical posterior.
The efficacy of the algorithm is verified through experiments, and its scaling properties are investigated and compared against SVGD.

Strengths

  • This research presents a scheme that generalizes both SVGD and SAM in the context of Bayesian inference. The idea of FHBI itself is very clearly stated and convincing, and its efficacy is validated with ample experiments. The layout of the derivation is very instructive as well. The claimed sharpness advantage (that FHBI is more sharpness-aware) is also validated in experiments, empirically showcasing the mechanism behind the advantage of the method.

Weaknesses

  • While the relation of FHBI to SAM and the interpretation in terms of spatial and angular repulsive forces are insightful, the reviewer was a little confused by the introduction and the multiple references to SAM, which sounded as if SAM were the major idea on which FHBI is developed (which, in retrospect, is not "directly" so?). Meanwhile, it is true that in the algorithm, when $m = 1$, the algorithm agrees with SAM in the end. In the current presentation, the connection to SAM seems something established a posteriori.

If the sharpness-aware philosophy is indeed the "motivation" of FHBI (which is, unfortunately, not yet clearly conveyed to this reviewer), the reviewer would like to see a more analytical connection to SAM's derivation.

  • On a similar note, as the reviewer will post in the "Questions" section, the reviewer feels that the supposedly important connections to SVGD and SAM are not clearly explained through objective functions and derivations (beyond $\rho = 0$ and $m = 1$).

Questions

  • In my understanding, Theorem 1 expresses a trade-off between (1) the worst-case empirical loss, which scales with the size of the neighborhood (the radius of the step size), and (2) the quality of the upper-bound approximation based on the empirical loss. The reviewer is making in his mind a super-rough analogy to a "local linear approximation of a complex function," whose worst-case error scales with the size of the neighborhood, where the "functional ascending step" is analogous to choosing the direction of worst error in this approximation.

If this analogy is correct, the neighborhood radius $\rho$ as well as the update size $\epsilon$ would be roughly analogous to a step size in an Euler-Maruyama simulation of an ODE, except that in the context of FHBI the system to be simulated is infinite-dimensional. Would the method improve its performance by choosing $\rho$ and $\epsilon$ to be small and running the iterations a greater number of times, as in ODE simulation? Also, with this intuition it feels as if $\rho$ should be similar to $\epsilon$; why are they chosen quite differently in the experimental section?

While the reviewer is not too confident of the intuition based on this super-rough analogy, the reviewer also believes it is crucial that the paper provide some explanation (either numerical or experimental, or at the very least a heuristic with an intuitive explanation) regarding the choice of $\rho$ and $\epsilon$, both for the sake of future practical users of FHBI and for the sake of future schemes that may branch out from FHBI.

  • On a similar note, the "wellness" of an RKHS method generally depends on the affinity between the choice of the kernel (and hence the nature of the continuity of functions in the space) and the dataset. In the case of this research, the choice is, I believe, only indirectly related to the dataset because the RKHS is a space of functions on "parameters." Is there any ablation study regarding this choice, or at the very least a good heuristic that would help the user choose an appropriate kernel in applications?

  • It is explained in the paper that $\rho = 0$ would correspond to SVGD, and "in equation" it seems so. However, because the paper emphasizes its connection to SVGD, the reviewer wishes to see why this comes about in connection to Theorem 1 and the derivation of SVGD.

  • Are there comparisons against SAM in the setting of Section 6.2?

Comment

We would like to thank the reviewer for the useful feedback. We would like to address the concerns as follows:

  • Regarding the connection with SAM [1]: While our method is not directly derived from SAM, we frequently reference SAM due to the meaningful conceptual connections between the two approaches. As shown in Equations (6)–(8), our method performs functional sharpness-aware minimization in the RKHS. As we have mentioned, FHBI reduces to SAM when $m = 1$. The key insight behind this fact is that FHBI performs sharpness-aware minimization on the RKHS with $f$ being the parameter, and since $f$ captures the transportation of the particles, it indicates that FHBI is simultaneously performing SAM on all particles. Moreover, as demonstrated in Section 4, FHBI not only controls the sharpness of all particles via the function $f$, but also controls the interaction between the particles to further reduce sharpness and promote particle diversity, further improving generalization. Consequently, in the single-particle case $m = 1$, FHBI naturally reduces to SAM, aligning with the intuitive understanding of its behavior in this special case.

  • Regarding the connection with SVGD [2]: This connection arises from the interpretation of FHBI as performing SAM in the RKHS, while SVGD is analogous to SGD in the RKHS. When $\rho = 0$, SAM reduces to SGD due to the absence of an ascent step. As a result, FHBI simplifies to SVGD in this case. However, we emphasize that this connection is based on the reduction of SAM to SGD when $\rho = 0$; it cannot be inferred from Theorem 1, as $h(1/\rho^2)$ becomes undefined when $\rho = 0$. (A short check of both reductions is given below.)
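Both reductions can be checked directly on the update direction; the short calculation below uses standard facts about the RBF kernel and is our own illustration rather than a quotation from the paper. With a single particle ($m = 1$), $k(\theta, \theta) = 1$ and $\nabla_{\theta'} k(\theta', \theta)\big|_{\theta' = \theta} = 0$, so the SVGD direction collapses to the plain score $\nabla_\theta \log p(\theta \mid S)$, and the ascent-then-descent iteration becomes (up to the prior term in the score) the SAM update

$$\theta \;\leftarrow\; \theta - \epsilon\, \nabla L_S\!\left(\theta + \rho\, \frac{\nabla L_S(\theta)}{\|\nabla L_S(\theta)\|}\right).$$

Conversely, setting $\rho = 0$ removes the perturbation entirely, leaving the unmodified SVGD update for any number of particles $m$.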

  • Regarding the choice of the kernel: In our experiments, we utilized the RBF kernel, renowned for its strong representational capabilities and its ability to balance underfitting and overfitting through the kernel width $\sigma$. However, based on the reviewer's suggestion, we conducted an additional ablation study to evaluate performance with alternative kernels. We tested our method with a polynomial kernel of degree 10 on the Specialized datasets. The results, presented below, show that while the polynomial kernel slightly underperforms compared to the RBF kernel, the performance gap is negligible. We have addressed the reviewer's suggestion and included this experiment in Appendix B of the revised rebuttal (a sketch of a standard kernel-width heuristic follows the table).

| Kernel | Camelyon | EuroSAT | Resisc45 | Retinopathy | AVG |
|---|---|---|---|---|---|
| RBF | 85.3 | 95.0 | 87.2 | 79.6 | 86.8 |
| Polynomial (d=10) | 85.0 | 94.9 | 86.8 | 79.2 | 86.5 |
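Relatedly, a common default for the RBF width $\sigma$ in SVGD-style methods is the median heuristic of [2]; the sketch below is our illustration of that heuristic, not necessarily how $\sigma$ was set in the paper's experiments.

```python
import numpy as np

def median_heuristic_sigma(X):
    """RBF width via the median heuristic of Liu & Wang [2]: pick sigma so
    that sum_j k(theta_i, theta_j) is roughly 1 for each particle, i.e.
    sigma^2 = median(squared pairwise dists) / (2 * log(m + 1))."""
    m = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    med = np.median(sq_dists[np.triu_indices(m, k=1)])
    return np.sqrt(med / (2.0 * np.log(m + 1.0)))
```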
  • Regarding the insight on $\rho$ and $\epsilon$: While the analogy to the Euler-Maruyama method for solving ODEs provided by the reviewer is insightful and thought-provoking, it is noteworthy that smaller $\rho$ and $\epsilon$ do not necessarily lead to better performance. From the perspective of solving a minimax optimization problem, $\rho$ and $\epsilon$ serve as the step sizes/learning rates of the inner maximization and the outer minimization problems, respectively. While smaller values of the learning rate $\epsilon$ ensure convergence, they do not guarantee better performance due to the potential for getting stuck in local minima. Similarly, the ascent step size $\rho$ requires careful tuning: excessively large values can destabilize training, while very small values may render the ascent step ineffective, thus failing to enhance generalization. We also emphasize that these principles hold for SAM as well. In summary, smaller step sizes with more iterations may promote convergence but do not ensure better performance, due to the non-convex nature of the loss function and the risk of local minima in practice.

  • Regarding the choices of $\rho$ and $\epsilon$: Since $\rho$ and $\epsilon$ govern distinct optimization problems ($\rho$ for the inner maximization and $\epsilon$ for the outer minimization), they are tuned separately. Based on the above interpretation of $\rho$ and $\epsilon$ as learning rates, we tuned these hyperparameters using a grid search within a reasonable range; a schematic grid-search sketch is given below. Details of this process are provided in the Experimental section and Appendix C of the rebuttal revision.
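For readers who want a concrete starting point, a grid search over the two step sizes can be as simple as the sketch below; the grids and the `train_and_validate` callback are illustrative placeholders, not the paper's actual search space.

```python
import itertools

# Illustrative grids only; the actual ranges used are in Appendix C.
rho_grid = [0.01, 0.05, 0.1, 0.5]
eps_grid = [1e-3, 1e-2, 1e-1]

def grid_search(train_and_validate):
    """train_and_validate(rho, eps) -> validation score; a hypothetical
    callback that runs FHBI with the given step sizes and scores the result."""
    best_rho, best_eps = max(itertools.product(rho_grid, eps_grid),
                             key=lambda cfg: train_and_validate(*cfg))
    return best_rho, best_eps
```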

  • Regarding Section 6.2: The goal of Section 6.2 is to empirically study the interaction between particles, supporting the claims made at the end of Section 4. This setting is specifically designed for multi-particle methods, so comparisons with SAM, which inherently operates in a single-particle framework, are not applicable in this context.

References:

[1] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.

[2] Qiang Liu and Dilin Wang. Stein Variational Gradient Descent: A general purpose Bayesian inference algorithm. Advances in neural information processing systems, 29, 2016a.

Comment

Thank you for responding to my questions. Although I was a little disappointed that the proposed choice of $\rho$ and $\epsilon$ could not further promote FHBI, the explanation regarding $\rho$ and $\epsilon$ was very convincing as well. Thank you also for testing a different kernel.

With regard to the connection with SAM, while I understand the authors' point and intention, I still feel that there might be a better way to reorder the explanations; at the same time, I feel this might also be a matter of this reviewer's personal preference. I am thankful for the authors' additional efforts, and I would like to keep my score as is to show my support, and I am willing to defend the acceptance of this work within my capability.

Comment

Dear reviewer yQt2,

We thank you for your response and for maintaining a rating of 8. Your comments helped significantly improve the final revision. Please feel free to let us know if you have any further concerns.

Best,

The Authors

Official Review (Rating: 3)

This paper combines sharpness-aware minimization (SAM) with SVGD in an RKHS as Flat Hilbert Bayesian Inference (FHBI). It extends the proof of [1] to an infinite-dimensional function space. Empirical validations show better performance of FHBI compared to previous Bayesian inference approaches on various datasets.

[1] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.

Strengths

  1. Incorporating SAM into SVGD can somehow improve the stability. The idea generally makes sense.
  2. The paper is clearly written overall, and it is easy to capture the motivation.
  3. The empirical results are superior and tested on various datasets.

Weaknesses

Even though I agree that the proposed method may be effective in some conditions, I am not convinced by the theory in this paper. Following are my concerns.

  1. Notations.

    • The empirical posterior should be $\mathbb{P}_{\theta \mid S}$, not $\mathbb{P}_S$, which is the data distribution. And the prior distribution should be $\mathbb{P}_\theta$.
    • $p(\theta \mid S) \propto p(\theta)\, p(S \mid \theta) = p(\theta) \prod_{i=1}^n p(y_i, x_i \mid \theta)$. The exponential average loss should be related to a specific distribution. I do not know how you can obtain this form directly.
    • What is "general loss"? I have never heard this terminology. $\mathcal{L}_{\mathcal{D}}(\theta)$ is often called the population loss / true error / generalization error in different papers and learning theory books.
  2. For Theorem 2, what is the exact definition of $h(1/\rho^2)$? It should be clearly presented in the main paper. In your derivation, $\mathcal{O}(\rho^2)$ is simply ignored, which means $\rho$ should be very small, which in turn gives a large $h(1/\rho^2)$.

  3. You are actually deriving an upper bound on the true risk, which is the empirical risk plus some complexity term. However, the sample complexity is not directly reflected in the bounds presented in the main paper. You claim that you can approximate the true posterior $p(\theta \mid \mathcal{D})$. This is impossible if you only have limited samples $n$.

  4. Experiments

    • How does data augmentation affect the empirical results? Have you used it in all baselines or just your method?
    • As shown in Figure 3, the runtime of the FHBI increases fast w.r.t the number of particles. Compared to SVGD, the margin also grows with the number of particles. Where is the computation bottleneck? Can you find any way to reduce it?

Minor:

  1. Is the template used correctly? There is no line number.
  2. Typo. In the related work section, sharness -> sharpness.

Questions

Have you compared SVGD+SAM to SGLD+SAM [1, 2]?

[1] Van-Anh Nguyen, Tung-Long Vuong, Hoang Phan, Thanh-Toan Do, Dinh Phung, and Trung Le. Flat seeking bayesian neural networks. Advances in Neural Information Processing Systems, 2023.

[2] Yang, Xiulong, Qing Su, and Shihao Ji. "Towards Bridging the Performance Gaps of Joint Energy-based Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Comment

Here is the comparison between our method, SVGD+SAM (SA-BNN), and SGLD+SAM (SADA-JEM). These results have also been included in the rebuttal revision.

Accuracy:

| Method | CIFAR100 | Caltech101 | DTD | Flower102 | Pets | SVHN | Sun397 | Camelyon | EuroSAT | Resisc45 | Retinopathy | Clevr-Count | Clevr-Dist | DMLab | KITTI | dSpr-loc | dSpr-Ori | sNORB-Azi | sNORB-Ele |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SA-BNN [1] | 65.1 | 91.5 | 71.0 | 98.9 | 89.4 | 89.3 | 55.2 | 83.2 | 94.5 | 86.4 | 75.2 | 61.4 | 63.2 | 40.0 | 71.3 | 64.5 | 34.5 | 27.2 | 31.2 |
| SADA-JEM [8] | 70.3 | 91.9 | 70.2 | 98.2 | 91.2 | 85.6 | 54.7 | 84.3 | 94.1 | 83.4 | 77.0 | 79.9 | 72.1 | 51.6 | 79.4 | 70.7 | 45.3 | 29.6 | 40.1 |
| FHBI | 74.1 | 93.0 | 74.3 | 99.1 | 92.4 | 87.3 | 56.5 | 85.3 | 95.0 | 87.2 | 79.6 | 80.1 | 72.3 | 52.2 | 80.4 | 72.8 | 51.2 | 31.9 | 41.3 |

ECE:

| Method | CIFAR100 | Caltech101 | DTD | Flower102 | Pets | SVHN | Sun397 | Camelyon | EuroSAT | Resisc45 | Retinopathy | Clevr-Count | Clevr-Dist | DMLab | KITTI | dSpr-loc | dSpr-Ori | sNORB-Azi | sNORB-Ele |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SA-BNN [1] | 0.22 | 0.08 | 0.19 | 0.15 | 0.12 | 0.12 | 0.24 | 0.13 | 0.06 | 0.12 | 0.18 | 0.14 | 0.21 | 0.22 | 0.24 | 0.25 | 0.41 | 0.46 | 0.34 |
| SADA-JEM [8] | 0.22 | 0.11 | 0.20 | 0.05 | 0.13 | 0.16 | 0.18 | 0.15 | 0.21 | 0.23 | 0.26 | 0.19 | 0.20 | 0.25 | 0.27 | 0.35 | 0.20 | 0.14 | 0.13 |
| FHBI | 0.19 | 0.10 | 0.16 | 0.06 | 0.06 | 0.09 | 0.16 | 0.09 | 0.05 | 0.12 | 0.08 | 0.14 | 0.15 | 0.21 | 0.15 | 0.16 | 0.18 | 0.11 | 0.07 |
  • Regarding other minor issues: We appreciate the reviewer pointing out the typo and the notation for the distributions, and we have incorporated those suggestions in the revision.

Once again, we would like to thank the reviewer for the insightful feedback, which led to an improved revision.

References:

[1] Van-Anh Nguyen, Tung-Long Vuong, Hoang Phan, Thanh-Toan Do, Dinh Phung, and Trung Le. Flat-seeking Bayesian neural networks. Advances in Neural Information Processing Systems, 2023.

[2] Germain, P., Lacasse, A., Laviolette, F., & Marchand, M. (2006). PAC-Bayesian theorems for domain adaptation. In Advances in Neural Information Processing Systems (NeurIPS).

[3] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. New York: Springer.

[4] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=6Tm1mposlrM.

[5] Feng, Y., & Li, C. (2022). Fast Stein Variational Gradient Descent for Neural Network Sampling. IEEE Transactions on Neural Networks and Learning Systems.

[6] Kim, Minyoung, Da Li, Xu Hu, and Timothy Hospedales. "Fisher SAM: Information Geometry and Sharpness Aware Minimisation." Presented at ICML 2022.

[7] "AdaptiveSAM: Towards Efficient Tuning of SAM for Surgical Scene Segmentation." Preprint available on arXiv, 2023.

[8] Yang, Xiulong, Qing Su, and Shihao Ji. "Towards Bridging the Performance Gaps of Joint Energy-based Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

Comment

Dear Reviewer pWGm,

We would like to thank you very much for your insightful review, and we hope that our response addresses your previous concerns regarding our paper. However, as the discussion period is expected to end in the next few days, please feel free to let us know if you have any further comments on our work. We would be willing to address any additional concerns you may have. Otherwise, we hope that you will consider increasing your rating.

Thank you again for spending time on the paper, we really appreciate it!

Best regards,

Authors

Comment

Dear Authors,

Thanks for continuing to remind me! I am a qualified reviewer; I never maintain a low score on a paper without discussion. Please give the reviewer enough time to read your feedback and check your paper; your paper is not the only thing in my life.
Please do not ask me to increase my score before I finish reading your rebuttal.

Comment

We sincerely thank the reviewer for their valuable and constructive feedback. Below, we address the concerns raised in detail:

  • Regarding the notations of the distributions: Even though these notations do not affect the theoretical development since they are not used throughout the derivation, we thank the reviewer for the comment and have incorporated this suggestion in the revision.

  • Regarding the derivation of the empirical posterior: First, by applying Bayes' rule and assuming that the samples are i.i.d., we obtain

$$p(\theta \mid \mathcal{S}) = p(\theta)\, p(\mathcal{S} \mid \theta) / p(\mathcal{S}) \propto p(\theta)\, p(\mathcal{S} \mid \theta) = p(\theta) \prod_{i=1}^{n} p(y_i \mid x_i, \theta).$$

We define the likelihood to be proportional to $\exp\left(-\frac{1}{|\mathcal{S}|} \ell(f_\theta(x), y)\right)$, where $\ell$ is a loss function such as the cross-entropy loss, and $f_\theta$ represents some sufficiently expressive model. Then, the posterior has the form:

$$p(\theta \mid \mathcal{S}) \propto p(\theta) \prod_{i=1}^{n} \exp\left(-\frac{1}{n} \ell(f_\theta(x_i), y_i)\right) = p(\theta)\, \exp(-\mathcal{L}_{\mathcal{S}}(\theta)).$$

We emphasize that this formulation, expressing the likelihood as the exponential of the negative loss value, is a popular one that has been widely employed in prior works on learning theory and Bayesian inference such as [1] and [3]. The intuition behind this formulation is that when $f_\theta$ is sufficiently expressive to capture the likelihood, a smaller loss value results in a larger exponential term, indicating a higher likelihood, and vice versa.

  • Regarding the usage of the terminology "general loss": The term $\mathcal{L}_{\mathcal{D}}$ is referred to by various names in the optimization and statistical learning theory literature, including population loss and generalization loss, as the reviewer mentioned, as well as general loss, which we used and which has been employed in several previous works on learning theory and PAC-Bayes theory, including those cited by the reviewer [1, 2]. In response to this suggestion, and acknowledging that the term population loss is more commonly used in learning theory, we have replaced general loss with population loss in the final revision to enhance readability. We appreciate this insightful feedback.

  • Regarding the scale of $\rho$: In the theory of sharpness-aware minimization, $\rho$ represents the size of the neighborhood, which is also the size of the ascending step. We emphasize that, similar to SAM, $\rho$ is a hyperparameter that we tune in practice, and it is typically a small constant. However, if $\rho$ is excessively small, the term $\max_{\|\theta' - \theta\| \leq \rho} L_S(\theta')$ also gets smaller and becomes a closer approximation of $L_S(\theta)$; consequently, a larger error term (which is $h(1/\rho^2)$) is required to ensure the inequality is maintained. Conversely, as $\rho$ increases, the term $\max_{\|\theta' - \theta\| \leq \rho} L_S(\theta')$ grows due to the larger neighborhood, necessitating a smaller error term to maintain the validity of the inequality. In short, $\rho$ is a small constant, but not an excessively small one. This intuition is also reflected in our practical algorithm as well as in other sharpness-aware algorithms: when $\rho$ is too small, the ascent step becomes negligible and ineffective; conversely, if $\rho$ is too large, the ascent step becomes excessively big, leading to unstable training.

  • The role of the sample size $n$ in Theorem 2: As mentioned above, one can find in the formal statement of Theorem 2 in the Appendix that the error term has the form $O\left(\sqrt{\frac{\log(1 + \frac{1}{\rho^2}) + 4\log\left(\frac{n}{\delta}\right)}{n-1}}\right)$, which tends to $0$ as the sample size grows; an illustrative evaluation of this term is given below. Acknowledging the importance of the information about the sample complexity, we have incorporated this suggestion in the revision for better clarity.
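For a rough sense of scale (illustrative values of our choosing, not numbers from the paper): with $n = 1000$, $\rho = 0.05$, and $\delta = 0.05$, the error term evaluates to

$$\sqrt{\frac{\log(1 + 1/0.05^2) + 4 \log(1000/0.05)}{999}} \approx \sqrt{\frac{6.0 + 39.6}{999}} \approx 0.21,$$

and it decays at the rate $O(\sqrt{\log n / n})$ as the sample size grows.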

  • Regarding the experiments:

    • Data augmentation: We adopted the data augmentations used by the V-PETL benchmark for model fine-tuning. To ensure fairness, the same augmentation was applied across all methods on each dataset.
    • Runtime and memory limitations: As discussed in the Limitations section, our method shares the same limitation as SVGD and other particle-based approaches: runtime and memory requirements grow linearly with the number of particles due to the sequential implementation. For future research, we suggest addressing this limitation by employing parallel and efficient implementations of SVGD, such as the one proposed in [5]. However, since this issue is not the primary contribution of our work, we opted not to discuss it in great detail in the main text.
Official Review (Rating: 8)

This work presents an algorithm called Flat Hilbert Bayesian Inference (FHBI), which incorporates the sharpness-aware minimization (SAM) technique into Bayesian inference. Specifically, together with SAM, the authors perform Stein variational gradient descent (SVGD) as the dynamics of the model parameters, which leads the particles (models) to flat and diverse modes. FHBI is tested on VTAB-1K, a collection of various classification tasks, and achieves better performance on average among different Bayesian NN methods.

Strengths

  • The work extends generalization bounds from finite-dimensional parameter spaces to functional spaces, which leads to FHBI, a theoretically grounded Bayesian inference algorithm.
  • The performance improvement by the proposed method is validated through extensive comparisons with previous works.

Weaknesses

  • I don't see any particular weaknesses in this paper.

Questions

  • The method is tested only in the fine-tuning regime. Is there any reason for that? Testing the proposed method on models trained from scratch would strengthen its empirical significance. I am not sure which datasets would be suitable, but datasets like ImageNet and its variants (ImageNet-A [1] and ImageNet-C [2]), as well as scientific machine learning benchmarks, might be good choices, as uncertainty prediction would be critical.

[1] https://arxiv.org/abs/1907.07174 [2] https://arxiv.org/abs/1903.12261

Comment

We appreciate the reviewer’s useful feedback. The motivation for testing our method on fine-tuning stems from its growing significance in various applications, such as model editing and personalized AI, where tasks often rely on learning from personalized, domain-specific datasets. Additionally, particle-based Bayesian methods, which involve storing multiple model instances, are suitable for fine-tuning methods like LoRA and prompt tuning, since these methods typically freeze most of the model’s parameters and train only a lightweight module, making them computationally efficient and suitable for Bayesian techniques. As for not evaluating the proposed method on training models from scratch, this is discussed in the limitations section. Our method, like other particle-based Bayesian inference approaches, requires storing multiple models, which can lead to memory bottlenecks when dealing with large-scale models.

Comment

Thank you for the response! Although there is a memory consumption issue, I feel the combination of SVGD and SAM is quite reasonable from a high-level perspective. I will maintain the existing score.

Comment

Dear Reviewer 4bfP,

We want to thank you for maintaining a rating of 8. We will incorporate your suggestions into the revision of our paper as discussed. Please feel free to let us know if you have any further concerns.

Best,

The Authors

Comment

We sincerely thank the reviewers for their valuable feedback. In response, we have prepared a revision with the following key updates:

  • Based on the suggestions of Reviewer pWGm, we have extended our experiments by comparing our algorithm with two additional baselines: SVGD+SAM, referred to as Sharpness-Aware Bayesian Neural Network (SA-BNN) [1], and SGLD+SAM, referred to as Sharpness-Aware Joint Energy-Based Model (SADA-JEM) [2]. We have also enhanced the theoretical results section to include information about the sample complexity for improved clarity. Moreover, even though both terms, general loss and population loss, appear in the learning-theory literature, we adopted population loss, per the reviewer's suggestion, since this terminology is more widely used.

  • Following the feedback of Reviewer wK81, we have added an explanation explicitly describing the correspondence between each step of Algorithm 1 and the iterative procedure outlined in Equations (14)–(16), providing a clearer connection between the theoretical framework and its implementation.

  • In response to Reviewer yQt2's feedback, we have added an ablation study in Appendix B to evaluate the impact of different kernel choices. Specifically, we compare the performance of the RBF kernel and the polynomial kernel on the Specialized datasets. Moreover, we have incorporated Proposition 1 in the methodology section to theoretically illustrate the advantages of addressing the population posterior $p(\theta \mid D)$, thereby strengthening the theoretical foundation of our motivation.

  • We have also addressed other minor issues suggested by all the reviewers.

We believe these revisions address the reviewers’ concerns and significantly improve the quality and clarity of the manuscript. Thank you for your thoughtful contributions.

References:

[1] Van-Anh Nguyen, Tung-Long Vuong, Hoang Phan, Thanh-Toan Do, Dinh Phung, and Trung Le. Flat-seeking Bayesian neural networks. Advances in Neural Information Processing Systems, 2023.

[2] Yang, Xiulong, Qing Su, and Shihao Ji. "Towards Bridging the Performance Gaps of Joint Energy-based Models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

AC Meta-Review

This paper proposes Flat Hilbert Bayesian Inference (FHBI), an algorithm to enhance generalization in Bayesian inference. Similar to Sharpness-Aware Minimization (SAM), each iteration of FHBI involves an adversarial perturbation step followed by a descent step, with the key difference being that these updates occur within a Reproducing Kernel Hilbert Space (RKHS). The proposed algorithm is accompanied by a theoretical analysis that extends previous SAM-like methods from finite-dimensional Euclidean spaces to infinite-dimensional function spaces. The paper also presents an empirical evaluation of FHBI on the VTAB-1K benchmark (consisting of 19 datasets across various domains), demonstrating that FHBI consistently outperforms the baselines. The authors also provide the code for their implementation to facilitate the reproduction of their experiments.

I find the idea of implementing SAM in function space via the RKHS framework elegant and original. The fact that the empirical assessment also shows consistent improvements further adds to the significance of this work, making it a potentially impactful development. These strengths are also reflected in the reviewers' comments. However, the proofs related to the presented theory seem underdeveloped, needing either corrections or the inclusion of missing details about the derivation. Concerns related to the proofs were mainly raised by reviewer pWGm, who thoroughly reviewed the appendix where the proofs are presented. The reviewer and authors engaged in multiple rounds of feedback and response, and as they delved deeper into the details of the proofs, more missing details surfaced. While this does not necessarily indicate errors in the proofs, it clearly shows that the proofs in their current state may not be clear enough to readers, hindering a full grasp of the theoretical results and under-appreciating the paper's contributions. This suggests the paper needs careful proofreading and revision of the proof presentation to improve clarity.

Unfortunately, as the theory is one of the major contributions, its clarity and accessibility are crucial and cannot be overlooked, despite the positive reviews from other reviewers. As much as I appreciate this paper (the idea of performing SAM in RKHS, the theoretical analysis, and the positive empirical results), sadly I have to recommend resubmission after a revision. I hope the authors are not discouraged by this decision and are assured that they have produced very interesting work that just needs some polishing to be well-understood and appreciated.

As a minor comment, I echo the observation made by some reviewers that the paper uses some uncommon terminology, which can lead to confusion. I hope the authors will make these small changes in their future submission to improve clarity. In particular, I encourage them to replace "general loss" with "generalization loss" or "true risk," and perhaps "functional space" with "function space."


Final Decision

Reject