PaperHub
Overall score: 7.3/10 · Poster · 4 reviewers (min 4, max 5, std 0.5)
Ratings: 5, 4, 5, 4 · Average confidence: 2.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Posterior Contraction for Sparse Neural Networks in Besov Spaces with Intrinsic Dimensionality

Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
posterior contraction, Bayesian neural networks, Besov space, intrinsic dimensionality

Reviews and Discussion

Review
Rating: 5

This manuscript establishes that fully Bayesian sparse ReLU networks attain near-minimax posterior contraction rates over both anisotropic and composite Besov spaces, with rates governed solely by the intrinsic (effective) dimension rather than the ambient dimension. Through careful approximation theory (Lemmas B.4, B.6), the authors show that suitably deep, sparse networks admit spike-and-slab or continuous shrinkage priors yielding a contraction rate $\varepsilon_n = n^{-\tilde s/(2\tilde s + 1)} (\log n)^{3/2}$, where $\tilde s$ is the intrinsic smoothness (Theorems 3.3, 3.8, 3.14). Importantly, they demonstrate rate adaptation by placing priors on depth, width, and sparsity (Theorem 3.9), so no oracle knowledge of smoothness or compositional structure is required.
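
For orientation, the rate and the effective smoothness can be stated schematically as follows; the harmonic-mean form of $\tilde s$ over the $d^*$ intrinsic coordinates follows the usual anisotropic-Besov convention in this literature and is my shorthand, not a quotation of the paper's exact definition.

```latex
% Schematic: near-minimax contraction rate with intrinsic effective smoothness
% (harmonic-mean convention over the d^* relevant coordinates; assumed notation).
\varepsilon_n = n^{-\tilde s/(2\tilde s + 1)}\,(\log n)^{3/2},
\qquad
\tilde s = \Big(\sum_{i=1}^{d^*} \frac{1}{s_i}\Big)^{-1}.
```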

Strengths and Weaknesses

Strengths

  • By working in anisotropic and composite Besov spaces, the paper covers a broad spectrum of functions—including non-smooth, discontinuous, additive, and multiplicative structures—far beyond classical Hölder or Sobolev classes.

  • The contraction rates depend on an intrinsic dimension $d^*$, effectively mitigating the curse of dimensionality. This unifies and extends prior frequentist results (e.g., Suzuki & Nitanda) under a fully Bayesian framework.

Weaknesses

  • Although the paper is strong theoretically, no simulations or experiments accompany the theory. While the focus is theoretical, even small-scale numerical illustrations would strengthen confidence that the asymptotic rates manifest in finite samples—and would help practitioners gauge constants and log-factors.

  • All results assume a homoskedastic Gaussian noise model. Extensions to classification or heteroskedastic settings are only sketched (Appendix A), so applicability to common real-world tasks (e.g., image classification) remains speculative.

  • The analysis is confined to fully connected ReLU nets. Modern architectures (convolutional, residual, attention-based) are unaddressed; Section 4’s “Future work” correctly flags this but provides no roadmap for how to generalize the key approximation and entropy bounds.

Questions

  • How could the proposed theoretical framework scale up to modern foundational models (e.g., would any of the assumptions lose validity when dealing with large Bayesian neural networks)?

Limitations

  • The authors identify the limitations and subsequent extensions in the future work section.

Final Justification

I thank the authors for their responses. While I am not an expert to fully follow the key theoretical analysis, the rebuttal by the authors addresses my doubts on the methods. I think the paper is of high quality and technically rigorous. I lean towards acceptance of the paper.

Formatting Issues

N/A

Author Response

We acknowledge the reviewer's insightful observations and provide our responses below.

**W1: Lack of simulations**

We acknowledge that the paper does not include simulations or empirical experiments. This omission stems naturally from the theoretical focus of the work, which aims to investigate the theoretical properties of Bayesian neural networks (BNNs) rather than to propose a new algorithm. To address this weakness and clarify the practical relevance, we will elaborate further on the practical implications of our results in the final version, emphasizing that the function classes considered in this study capture a range of practical scenarios.
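
For concreteness, a minimal sketch of the kind of small-scale finite-sample illustration the reviewer describes might look as follows; this is a hypothetical example rather than part of the paper, and it uses a plain ReLU network with weight decay as a crude stand-in for a sparsity-inducing prior.

```python
# Hypothetical finite-sample illustration (not from the paper): fit a plain ReLU
# network with weight decay as a crude stand-in for a sparsity-inducing prior and
# compare the empirical squared error against the theoretical reference
# epsilon_n^2 = n^(-2*s/(2s+1)) * (log n)^3, up to constants.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def f0(x):
    # Piecewise-smooth target; discontinuous functions of this kind are typical
    # members of low-smoothness Besov classes (illustrative choice).
    return np.where(x[:, 0] > 0.5, np.sin(4 * np.pi * x[:, 1]), np.abs(x[:, 1] - 0.5))

ns, errors = [200, 500, 1000, 2000, 5000], []
x_test = rng.uniform(size=(2000, 2))
for n in ns:
    x = rng.uniform(size=(n, 2))
    y = f0(x) + 0.1 * rng.standard_normal(n)          # homoskedastic Gaussian noise
    net = MLPRegressor(hidden_layer_sizes=(64, 64, 64), alpha=1e-3,
                       max_iter=2000, random_state=0)
    net.fit(x, y)
    errors.append(np.mean((net.predict(x_test) - f0(x_test)) ** 2))

s_tilde = 1.0                                          # assumed intrinsic smoothness
for n, err in zip(ns, errors):
    ref = n ** (-2 * s_tilde / (2 * s_tilde + 1)) * np.log(n) ** 3
    print(f"n={n:5d}  empirical MSE={err:.4f}  reference rate (up to constants)={ref:.4f}")
```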

**W2: Extension to classification**

The extension to binary classification is fully developed in the supplementary material. We will clarify this point in the introduction.

**W3: Beyond fully connected ReLU networks**

Several studies have investigated the statistical theory of modern architectures, including convolutional networks [1-3], residual networks [4-5], and transformers [6-8]. These works may provide useful foundations for deriving posterior contraction results beyond fully connected networks. We will clarify their relevance in the future work section of the final version.

**Q1: Scaling to modern foundational models**

Our theoretical results, consistent with related work in the literature, only require that the architectural complexity (such as depth and width) scale appropriately with certain constants as the number of observations increases. In principle, this means the underlying assumptions remain unchanged, and the results are applicable even to very large network models. However, modern foundational models are typically not simple feedforward networks, which introduces a gap between our theoretical framework and the architectural trends observed in large-scale models.

**References**

[1] Kohler, M., Krzyżak, A., & Walter, B. (2022). On the rate of convergence of image classifiers based on convolutional neural networks. Annals of the Institute of Statistical Mathematics, 74(6), 1085-1108.
[2] Fang, Z., & Cheng, G. (2023). Optimal Convergence Rates of Deep Convolutional Neural Networks: Additive Ridge Functions. Transactions on machine learning research.
[3] Yang, Y., Feng, H., & Zhou, D. X. (2024). On the rates of convergence for learning with convolutional neural networks. arXiv preprint arXiv:2403.16459.
[4] Lai, J., Yu, Z., Tian, S., & Lin, Q. (2023). Generalization ability of wide residual networks. arXiv preprint arXiv:2305.18506.
[5] Zhang, Z., Zhang, K., Chen, M., Takeda, Y., Wang, M., Zhao, T., & Wang, Y. X. (2024). Nonparametric classification on low dimensional manifolds using overparameterized convolutional residual networks. Advances in Neural Information Processing Systems, 37.
[6] Kim, J., Nakamaki, T., & Suzuki, T. (2024). Transformers are minimax optimal nonparametric in-context learners. Advances in Neural Information Processing Systems, 37.
[7] Jiao, Y., Lai, Y., Sun, D., Wang, Y., & Yan, B. (2025). Approximation Bounds for Transformer Networks with Application to Regression. arXiv preprint arXiv:2504.12175.
[8] Liu, P., & Zhou, D. X. (2025). Generalization Analysis of Transformers in Distribution Regression. Neural Computation, 37(2), 260-293.

Comment

I thank the authors for their responses. While I am not an expert to fully follow the key theoretical analysis, the rebuttal by the authors addresses my doubts on the methods. I think the paper is of high quality and technically rigorous. I lean towards acceptance of the paper.

Comment

We sincerely appreciate the discussion with the reviewer. Please let us know if there are any further questions.

Review
Rating: 4

This paper presents a significant theoretical investigation into the properties of sparse Bayesian neural networks (BNNs) for nonparametric regression. The primary contribution is establishing that the posterior distribution of a BNN contracts around the true underlying function at a near-minimax optimal rate. The analysis extends existing theory to two highly complex and practically relevant function classes: anisotropic and composite Besov spaces. These classes model functions that exhibit direction-dependent smoothness or are formed through hierarchical compositions, respectively. Such structures are believed to possess an "intrinsic dimensionality" that is lower than the ambient data dimension, providing a formal framework for understanding how neural networks can mitigate the curse of dimensionality.  

The authors demonstrate these results for sparse ReLU networks using two distinct types of priors on the network weights: the theoretically well-understood but computationally demanding spike-and-slab prior, and a class of more tractable continuous shrinkage priors. A cornerstone of the work is the proof of rate adaptation. The authors show that by placing appropriate hyper-priors on the network's architectural parameters (width and sparsity), the BNN can automatically achieve the optimal contraction rate without requiring prior knowledge of the true function's smoothness level—a key advantage of the Bayesian paradigm over frequentist approaches that often require such oracle information for tuning. The results unify and extend previous work, most notably by providing the first analysis for these function spaces within a fully Bayesian framework using the standard posterior, in contrast to recent work that relied on a pseudo-Bayesian fractional posterior.
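
For readers less familiar with these priors, a minimal generic sketch of a spike-and-slab draw for a weight vector (with a uniform slab; the paper's exact specification, including how the inclusion probability is set, may differ) looks like this:

```python
# Generic spike-and-slab draw for a weight vector (illustrative sketch only):
# each weight is exactly zero with probability 1 - pi_n ("spike") and is drawn
# from a bounded uniform "slab" otherwise.
import numpy as np

def sample_spike_and_slab(size, pi_n=0.05, slab_bound=1.0, seed=0):
    rng = np.random.default_rng(seed)
    active = rng.random(size) < pi_n                    # inclusion indicators
    slab = rng.uniform(-slab_bound, slab_bound, size)   # slab component
    return np.where(active, slab, 0.0)

w = sample_spike_and_slab(size=10_000)
print("proportion of nonzero weights:", (w != 0).mean())
```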

Strengths and Weaknesses

This paper represents a substantial and high-quality contribution to the theoretical understanding of Bayesian deep learning. I will give more details from the following four perspectives.

**Quality**:

The technical quality of this paper is high. The theoretical analysis is rigorous, comprehensive, and builds upon the foundational framework for Bayesian posterior contraction rates established by Ghosal, Ghosh, and van der Vaart. However, a significant weakness lies in the assumptions required for the continuous shrinkage priors. While these priors are motivated as a practical, computationally efficient alternative to spike-and-slab priors, the conditions imposed to secure the theoretical guarantees are highly restrictive. Specifically, Assumption (C3) requires that the prior probability mass outside an exponentially shrinking neighborhood of zero, $[-a_n, a_n]$ with $a_n = e^{-L_{1n}\log n}$, must vanish at a rate of $o(1/[N_n(\log n)^2])$, which excludes popular, empirically successful priors like the horseshoe. The horseshoe prior is effective because it combines a sharp peak with heavy tails, allowing it to aggressively shrink noise coefficients while leaving large, true signals untouched. To satisfy the assumption, the authors construct a “relaxed spike-and-slab” prior (Example 3.5), but they concede this construction is “practically undesirable” as it forces the prior to behave almost entirely like a spike. Consequently, the theoretical results for shrinkage priors (Theorem 3.8) apply to a narrow class of priors not representative of those used in modern Bayesian practice, which limits the immediate practical implications of this part of the analysis.
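
To make the tail tension concrete, a small Monte Carlo comparison (hypothetical scales, not taken from the paper) shows how much prior mass a horseshoe prior retains far from zero compared with a Gaussian-tailed prior at the same global scale; this is exactly the kind of mass a (C3)-type condition must control.

```python
# Hypothetical tail-mass comparison (scales assumed, not from the paper):
# P(|theta| > a) under a horseshoe prior (polynomial tails) vs. a Gaussian-tailed
# prior at the same global scale tau.
import numpy as np

rng = np.random.default_rng(1)
m, tau = 2_000_000, 1e-3                      # Monte Carlo size and global scale

# Horseshoe: theta | lambda ~ N(0, (lambda * tau)^2) with lambda ~ half-Cauchy(0, 1)
lam = np.abs(rng.standard_cauchy(m))
theta_hs = rng.standard_normal(m) * lam * tau

# Gaussian-tailed comparison prior with the same global scale
theta_gauss = rng.standard_normal(m) * tau

for a in (0.05, 0.1, 0.2):
    print(f"a={a:.2f}  horseshoe tail mass ~ {np.mean(np.abs(theta_hs) > a):.1e}  "
          f"Gaussian tail mass ~ {np.mean(np.abs(theta_gauss) > a):.1e}")
```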

**Clarity**

The paper is well-written and clear, especially given its technical density. The introduction provides an excellent motivation for the problem and situates the work within the existing literature. Table 1 is particularly effective, offering a concise and invaluable summary of related frequentist and Bayesian results, which immediately clarifies the paper's novel contributions. The definitions of anisotropic and composite Besov spaces are presented with care, and the illustrative figures (Figures 1 and 2) are helpful for building intuition.  

A minor point of criticism regarding clarity pertains to the mechanism of rate adaptation. While Theorem 3.9 is a powerful result, the discussion of how it is achieved in practice could be more nuanced. The authors propose achieving rate adaptation with a "sufficiently large" constant. This "sufficiently large constant" is a common theoretical device, but it slightly obscures the fact that the adaptation is not entirely assumption-free; it relies on specific, robust choices of the slab prior. A more explicit discussion of this subtlety would further enhance the paper's clarity.

**Significance**

The significance of this work is profound. It addresses a fundamental question in machine learning: why overparameterized models like neural networks generalize well in high dimensions. This paper provides a rigorous, Bayesian answer by showing that BNNs can adapt to the intrinsic, low-dimensional structure of the target function, as captured by anisotropic and composite Besov spaces.

The paper makes a landmark contribution by being the first to establish these results for the standard Bayesian posterior. Prior frequentist work demonstrated that deep neural network estimators could achieve these optimal rates, but their construction required oracle knowledge of the function's smoothness to properly specify the network architecture. The most closely related Bayesian work by Castillo and Egels successfully analyzed these function spaces but did so using a fractional posterior, a technique that deviates from standard Bayesian inference and can be more difficult to interpret. By proving these results for the true posterior, this paper provides a much stronger theoretical justification for the practical use of BNNs and strengthens their standing within the Bayesian community. The results on rate adaptation are particularly impactful, as they demonstrate a key theoretical advantage of the Bayesian approach over its frequentist counterpart in this setting.  

**Originality**

While the paper builds on existing lines of research in NN approximation theory and Bayesian nonparametrics, it is original in that it synthesizes them in a novel way to tackle a problem that was previously unsolved. The primary original contributions are: (1) the extension of posterior contraction theory for BNNs to anisotropic and composite Besov spaces under a fully Bayesian framework; (2) the demonstration that this holds for both spike-and-slab and an (albeit restrictive) class of continuous shrinkage priors; and (3) the proof of automatic rate adaptation to unknown smoothness in these complex settings. These contributions collectively push the frontier of our theoretical understanding of Bayesian deep learning.

Questions

  1. On the Practicality and Interpretation of the Shrinkage Prior Results: Your work provides a crucial theoretical bridge for BNNs. However, there appears to be a significant tension regarding the continuous shrinkage priors. Assumption (C3) is instrumental for the proof of Theorem 3.8, but as you note, it excludes widely-used priors like the horseshoe and leads to constructions (Example 3.5) that may be "practically undesirable". Could you elaborate on this trade-off? Specifically, what is the key theoretical barrier to extending your results to priors with heavier tails (like the horseshoe), which are known to perform well in practice? Does the class of priors satisfying (C1)-(C3) offer tangible benefits over the true spike-and-slab prior beyond computational tractability, given how closely they must mimic its structure? My evaluation would improve if the authors can provide a clear roadmap for how to relax this assumption or can better articulate the unique value of the priors they analyze.

  2. Your work shows that a parametrically sparse BNN (one with few active weights) can optimally learn a structurally sparse composite function. My question is about the connection between these two types of sparsity. Is the sparsity imposed on the BNN's weights primarily a technical tool required for the theoretical proofs to work, or does it reflect a deeper assumption that the individual component functions ($f_{h,j}$) within the true composite structure are themselves best represented by sparse networks?

  3. A Suggestion for Improved Clarity: The assumptions on the priors are complex and spread across several sections. To improve readability, have you considered adding a table in the main text that summarizes the key assumptions for each prior type (Spike-and-Slab, Shrinkage, and their adaptive versions), highlighting the dependencies (or lack thereof) on unknown quantities like $B_1$ and $\tilde{s}^*$? This could greatly help the reader navigate the dense theoretical conditions.

Limitations

The "Future work" discussion in Section 4 is candid and accurate, correctly identifying the computational challenges of spike-and-slab priors, the theoretical gap between their assumptions and popular shrinkage priors like the horseshoe, and the need to extend the theory to more complex, modern network architectures like CNNs and Transformers. The work is purely theoretical, and as such, the authors correctly note in the checklist that discussions of direct societal impact, data are not applicable.

Formatting Issues

None. The paper is professionally formatted and adheres to the NeurIPS style guidelines. The notation is consistent, and the figures and tables are clear and well-designed.

Author Response

We sincerely appreciate the reviewer's thoughtful comments and detailed feedback. Our responses are provided below.

**W1: Conditions for continuous shrinkage priors**

As the reviewer pointed out and as noted in the text, we acknowledge that the required conditions for our shrinkage priors apply only to a limited class, excluding several widely used priors such as the horseshoe. We elaborate on this limitation in our response to Q1, addressing the reviewer's question.

**W2: Clarity on rate adaptation**

This is indeed a good point. As the reviewer noted, our adaptation results rely on specific constants and hold over a certain range of smoothness parameters, rather than being entirely assumption-free. To clarify further, if a Gaussian slab is used instead of a uniform slab, the adaptation becomes independent of the choice of priors. However, it still requires a properly calibrated depth parameter $\tilde L_n$ with a sufficiently large constant $C_L$, so the reviewer's observation is accurate. While we briefly mentioned this limitation in the paragraph following Theorem 3.9, we will elaborate on it further in the final version to enhance clarity.

**Q1: Practicality and interpretation of the shrinkage prior results**

As the reviewer pointed out, the assumptions are instrumental but exclude several widely used shrinkage priors in practice. To elaborate on this trade-off, the main difficulty stems from the fact that the polynomial tail of the horseshoe prior renders the assumptions incompatible. Specifically, while (C1) and (C3) can be satisfied by an appropriately scaled horseshoe prior, such scaling is not suitable for (C2), and vice versa. The heavy tail of the horseshoe necessitates a sufficiently large sieve to restrict the parameter space, effectively ignoring the prior mass outside the sieve. However, this enlargement results in overly loose entropy bounds relative to the target rate, making it infeasible to construct a global test function over the sieve using a union bound. In contrast, priors with exponential tails mitigate this issue, as demonstrated in our examples with relaxed spike-and-slab priors (Examples 3.5 and 3.6).
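
Schematically, the trade-off follows from the three standard sufficient conditions of testing-based contraction theory; this is the generic Ghosal–Ghosh–van der Vaart template with constants $c_1, c_2, c_3$ and metric $d$ left unspecified, not the paper's exact statement.

```latex
% Generic sufficient conditions for contraction at rate \varepsilon_n over a sieve
% \mathcal{F}_n (constants c_1, c_2, c_3 schematic):
\log N\!\big(\varepsilon_n, \mathcal{F}_n, d\big) \le c_1\, n\varepsilon_n^2
\quad\text{(entropy of the sieve)},
\qquad
\Pi\big(\mathcal{F}_n^{c}\big) \le e^{-c_2\, n\varepsilon_n^2}
\quad\text{(negligible mass outside the sieve)},
\qquad
\Pi\big(B_{\mathrm{KL}}(f_0, \varepsilon_n)\big) \ge e^{-c_3\, n\varepsilon_n^2}
\quad\text{(prior mass near the truth)}.
```

A heavier-tailed prior forces a larger sieve to satisfy the second condition, which in turn inflates the entropy in the first condition beyond the target rate; this is the tension described above.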

Nevertheless, we believe that priors satisfying Assumptions (C1)--(C3) offer advantages beyond computational convenience. In particular, they demonstrate that continuous priors can achieve the same contraction rate, allowing the point mass component to be removed. This, in turn, enables a theoretical understanding of optimization and algorithmic behavior through well-established results for gradient-based methods.

We plan to continue exploring the topic to address this fundamental limitation regarding shrinkage priors. While we do not yet have a clear roadmap, we have identified a few potential directions that may prove useful.

  • Tighter entropy bound: Our current entropy calculation relies on previous results [1-2], but it is not entirely clear how tight these estimates are. If a sharper bound could be obtained, it might allow for a larger sieve that accommodates heavier-tailed priors. Although we have spent some time exploring this direction, we have not yet made meaningful progress. Nonetheless, we believe it is a question worth further contemplation.
  • Constructing a global test function: Our current theoretical approach follows standard results in testing-based posterior contraction theory, which requires a test function with exponentially small errors over a sieve, along with an appropriate prior that assigns negligible mass outside the sieve. To achieve this, we rely on the well-known fact that the Hellinger distance yields a well-behaved local test over small balls, which we then combine with an entropy bound. If we can directly construct a test function over the entire sieve, it may be possible to bypass the entropy calculation and thereby allow for a larger sieve.
  • Adaptation with heavy-tailed priors: There is a line of recent work on Bayesian adaptation using heavy-tailed priors [3-4]. Although these approaches are not directly applicable to our setting, they may offer valuable insights and suggest promising directions for future investigation. In particular, fractional posteriors always simplify the situation, as they only require sufficient prior mass on a Kullback-Leibler neighborhood. However, we are more inclined to focus on the usual posterior.

**Q2: Sparse neural networks**

The primary role of weight sparsity in the parameter space is to facilitate the derivation of suitable entropy bounds. Since every anisotropic Besov composite function can be approximated by a sparse neural network with optimal error (Lemma B.8), it is essential to control model complexity. In this context, the class of sparse neural networks provides a suitable approximation class that is not excessively complex. This suggests that weight sparsity primarily serves as a useful theoretical device and may not be strictly necessary to represent such Besov functions. However, this does not imply that weight sparsity is merely a technical artifact. To the best of our knowledge, it remains unclear whether anisotropic composite Besov functions can be approximated by smaller networks without sparsity. In the absence of such results, sparse neural networks remain an effective and possibly optimal representation strategy, as observed by recent studies on dense networks [5-6].

**Q3: Improved clarity**

In response to the reviewer's suggestion, we will include a table summarizing this information in the final version. While we are still considering the most effective format, the table will present the main results from Theorems 3.3, 3.8, 3.9, and 3.14, along with summaries of the priors (spike-and-slab, shrinkage) and the function spaces (anisotropic Besov, composite Besov).

**References**

[1] Suzuki, T., & Nitanda, A. (2021). Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. Advances in Neural Information Processing Systems, 34.
[2] Lee, K., & Lee, J. (2022). Asymptotic properties for Bayesian neural network in Besov space. Advances in Neural Information Processing Systems, 35.
[3] Agapiou, S., & Castillo, I. (2024). Heavy-tailed Bayesian nonparametric adaptation. The Annals of Statistics, 52(4), 1433-1459.
[4] Agapiou, S., Castillo, I., & Egels, P. (2025). Heavy-tailed and Horseshoe priors for regression and sparse Besov rates. arXiv preprint arXiv:2505.15543.
[5] Kong, I., & Kim, Y. (2024). Posterior concentrations of fully-connected Bayesian neural networks with general priors on the weights. arXiv preprint arXiv:2403.14225.
[6] Castillo, I., & Egels, P. (2024). Posterior and variational inference for deep neural networks with heavy-tailed weights. arXiv preprint arXiv:2406.03369.

Comment

Thanks for the detailed response and the clarification on the rate-adaptation limitation. Adding this clarification and a summary table would be an excellent addition for readers from a broader range of expertise!

The paper is strong, with a robust theoretical foundation. I am inclined to recommend acceptance.

Comment

We truly appreciate the discussion and the helpful feedback from the reviewer. We would be happy to address any additional questions.

Review
Rating: 5

This paper examines how fast the posterior distribution of Bayesian Neural Networks (BNNs) concentrates around the true function ("posterior contraction"). For BNNs with either a sparse (spike-and-slab) or a continuous shrinkage prior, the key findings show that the optimal contraction rate can be attained, depending only on the intrinsic dimension of the true structure -- avoiding the curse of dimensionality. For a class of non-smooth function spaces called Besov spaces, the paper examines anisotropy, composite structure, and adaptation. The contributions are centered around the theoretical findings: BNNs achieve the optimal posterior contraction rate over Besov spaces.

Strengths and Weaknesses

Strengths:

  • The differentiation from existing work is well presented.

  • Several examples are presented that add to a better understanding of the material.

Weaknesses:

  • The paper could be kinder to the readers, at least in the introduction and abstract.

For example, a general machine learning audience may not know the terms that are assumed in the paper, e.g., posterior contraction, sparsity-inducing priors, Besov spaces, etc.

  • The choice of venue could be improved.

We have COLT, which is dedicated to theoretical machine learning research. I think this paper would also fit there.

Questions

What would be the practical implications of the paper's findings? In Bayesian Deep Learning research, many of us are focusing on developing practical learning algorithms that can deliver the promises of Bayesian methods: well-calibrated uncertainty, incorporation of domain knowledge through priors, etc. How do the findings on contraction rates over a particular type of function space bear on the ongoing discussions within the community?

Limitations

I am not from the learning-theory side of Bayesian Neural Networks research. I hope other reviewers can comment better.

Formatting Issues

N/A

Author Response

We thank the reviewer for the insightful comments. Please find our responses below.

**W1: Readability**

We agree that the paper is technical due to the nature of the topic. Although our study is primarily intended for experienced readers, we will provide further elaboration where appropriate in the final version, if accepted.

**W2: Venue choice**

While COLT could be a suitable alternative, we believe NeurIPS is one of the most appropriate venues given its broad scope and openness to theoretical contributions. In particular, NeurIPS has consistently published works on the statistical theory of neural networks [1-3], and our work aligns well with this line of research.

**Q1: Practical implication**

The findings of this paper help bridge the gap between theory and practice in Bayesian neural networks (BNNs). It is widely recognized that BNNs offer a compelling alternative to classical NN models by providing principled uncertainty quantification through the posterior distribution, which is often overlooked in traditional approaches. Although BNNs perform well across a wide range of functions, their theoretical analysis has largely been limited to relatively smooth function classes. By incorporating anisotropic Besov spaces and their composite structure, this work significantly broadens the theoretical understanding of BNNs. For example, images are often modeled as elements of Besov spaces due to their inhomogeneous and discontinuous nature [4-6], and our formulation offers additional flexibility through its treatment of anisotropy and compositional structure. Therefore, this work may offer insight into why BNNs perform well in image analysis, despite the discrepancy that our analysis is limited to feedforward network architectures. We will elaborate further on the practical implications of the study in the final version, if accepted.

**References**

[1] Polson, N. G., & Ročková, V. (2018). Posterior concentration for sparse deep learning. Advances in Neural Information Processing Systems, 31.
[2] Suzuki, T., & Nitanda, A. (2021). Deep learning is adaptive to intrinsic dimensionality of model smoothness in anisotropic Besov space. Advances in Neural Information Processing Systems, 34.
[3] Lee, K., & Lee, J. (2022). Asymptotic properties for Bayesian neural network in Besov space. Advances in Neural Information Processing Systems, 35.
[4] Donoho, D. L., & Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432), 1200-1224.
[5] Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613-627.
[6] Meyer, Y. (2001). Oscillating patterns in image processing and nonlinear evolution equations: the fifteenth Dean Jacqueline B. Lewis memorial lectures (Vol. 22). American Mathematical Soc.

Comment

We appreciate the reviewer's original comments and hope that our rebuttal has addressed them adequately. Please let us know if any further questions arise.

Review
Rating: 4

This paper presents a comprehensive, rigorous analysis of Bayesian sparse DNNs. Compared to existing results, it considers a more general class of smooth functions and covers both spike-and-slab and shrinkage priors. It also proposes a prior that adapts to the unknown function smoothness. It provides an optimal posterior contraction rate.

Strengths and Weaknesses

Strength: Along this line of research, I believe this work is the most comprehensive. It combines the standard theory of Bayesian nonparametric convergence and the approximation theory of DNNs. The theoretical analysis is rigorous and sound.

Weaknesses:

  1. The scope of this submission is purely theoretical; its impact on real applications is not addressed. All results are asymptotic, and how to apply them to finite-sample applications is unclear. All conditions required for convergence are sufficient conditions; based on my personal experience, these conditions are more than sufficient for posterior consistency in practice.

  2. What is the additional theoretical insight obtained by this paper, beyond existing results? Or does the paper only aim to generalize the existing theory/prior specification to Besov spaces?

Overall, it seems this paper is more suitable for a pure theory venue than NeurIPS.

Questions

  1. Line 181: “These network parameters depend on the Besov parameters $\tilde{s}$ and $p$.” I fail to see the dependency on $p$.

  2. Prior (6): “for some constants $\lambda_N > 0$ and $\lambda_H > 0$.” Going through the proof, it seems that the two $\lambda$ can be arbitrary. Please clarify.

Limitations

Yes

Final Justification

I maintain my score. The paper delivers important theoretical contributions, but I don't think the authors can address the weaknesses I pointed out: (1) lack of numerical justification; (2) no additional insight on the practical implementation of sparse BNNs beyond existing results.

Formatting Issues

None

Author Response

We are grateful for the reviewer's thoughtful feedback. Our responses are outlined below.

**W1: Scope of the paper**

As the reviewer pointed out, this study is purely theoretical in scope and does not directly consider real applications. This naturally follows from the theoretical nature of the work. Nonetheless, the results may offer meaningful practical insights. While Bayesian neural networks (BNNs) have demonstrated remarkable success in capturing complex patterns in real-world data, our theoretical understanding of their practical effectiveness remains limited. By incorporating anisotropic Besov spaces and their composite structure, this work significantly broadens the theoretical foundation of BNNs. For instance, images are often modeled as elements of Besov spaces due to their inhomogeneous and discontinuous characteristics [1-3]. As such, this work may help explain why BNNs perform well in such complex scenarios. We will elaborate further on the practical implications of the study in the final version, if accepted.

**W2: Additional theoretical insights**

We believe that this study offers substantial theoretical insights that have not been addressed in prior work. The central theoretical contribution lies in the function classes that neural networks are designed to approximate, for which we consider anisotropic Besov spaces and their composite structures. To the best of our knowledge, no previous Bayesian studies have treated such rich function classes in the theoretical analysis of neural networks. Even the most recent work by [4] considers anisotropic Besov spaces but assumes Hölder-type regularity for the compositional structure. (This distinction was not clearly stated in the submitted version and will be clarified in the final version.) Accordingly, our results encompass more complex yet practically relevant functions, such as additive or multiplicative Besov functions. In this sense, our work advances the theoretical understanding of the function classes to which Bayesian neural networks can adapt.

**Q1: Dependency on $p$**

The magnitude parameter $B_1$ depends explicitly on $p$, although this is not apparent in equation (2) owing to the use of proportionality notation for simplicity. As noted in the main text, the precise form of this dependence is given in Lemma B.4 of the supplementary material.

**Q2: The constants $\lambda_N$ and $\lambda_H$**

We thank the reviewer for pointing this out. It is correct that these constants can be arbitrary, and the prior in (6) can be applied for any positive values of $\lambda_N$ and $\lambda_H$. We will revise this in the final version.

**References**

[1] Donoho, D. L., & Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432), 1200-1224.
[2] Donoho, D. L. (1995). De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3), 613-627.
[3] Meyer, Y. (2001). Oscillating patterns in image processing and nonlinear evolution equations: the fifteenth Dean Jacqueline B. Lewis memorial lectures (Vol. 22). American Mathematical Soc.
[4] Castillo, I., & Egels, P. (2024). Posterior and variational inference for deep neural networks with heavy-tailed weights. arXiv preprint arXiv:2406.03369.

Comment

Thanks. I have a follow-up question regarding **W2: Additional theoretical insights**.

I agree that this paper has significant theoretical contributions. I am personally a theorist, and I appreciate it. However, my original question is more about the insights into the prior specification that your theory provides. In other words, do your theories offer additional guidance on how to set up the prior in practice, beyond existing results? For Bayesian practitioners who don't care about function space / posterior contraction, what could be the takeaway message for them?

Best

Comment

We sincerely appreciate the discussion and thank the reviewer for their valuable feedback. Please do not hesitate to let us know if any further questions arise.

Comment

We thank the reviewer for clarifying the question. Given that the main contribution of this paper is purely theoretical, its practical relevance for practitioners who are not concerned with function classes or posterior contraction is inherently limited. Nonetheless, we suggest that our results may offer useful insights for practitioners when selecting network architectures and prior distributions.

  • Our theory suggests that it may be beneficial to consider a large neural network (NN) architecture with strong sparsity imposed on the weights, rather than a moderately sized NN with little or no sparsity. Therefore, when working with fixed width, depth, and sparsity, it may be advisable to increase the network size while ensuring sufficient sparsity in the weights (e.g., via spike-and-slab or continuous shrinkage priors). If this approach proves useful, it may motivate practitioners to develop suitable algorithms and computational strategies, as such architectures are often avoided owing to computational challenges.

  • Our adaptation result indicates that determining width and sparsity in a data-driven, adaptive manner may be advantageous. This contrasts with the common practice of using NN models with fixed architectural parameters, often chosen for computational convenience. Thus, developing practical algorithms to adaptively adjust such parameters could be beneficial and of interest to practitioners.

Although our results offer limited relevance for practitioners focused solely on applications, the above points may encourage them to explore the effectiveness of large but sufficiently sparse NN classes for modeling real-world complexity.
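
As a rough back-of-the-envelope illustration of the first point (hypothetical numbers, not taken from the paper), a wide, strongly sparse fully connected network can carry about as many active weights as a much narrower dense one:

```python
# Rough parameter-budget comparison (illustrative only): a wide, deep network with
# strong weight sparsity vs. a moderate dense network of similar active size.
def mlp_param_count(width, depth, d_in=10, d_out=1):
    # depth = number of hidden layers; counts weights and biases of a plain MLP
    sizes = [d_in] + [width] * depth + [d_out]
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

large_sparse_total = mlp_param_count(width=1024, depth=6)
sparsity = 0.02                                   # fraction of nonzero weights (assumed)
active = int(sparsity * large_sparse_total)

dense_moderate_total = mlp_param_count(width=145, depth=6)

print(f"large sparse net  : {large_sparse_total:,} weights, ~{active:,} active at 2% sparsity")
print(f"moderate dense net: {dense_moderate_total:,} weights (all active)")
```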

Final Decision

This paper studies the posterior contraction rate of sparse Bayesian neural networks, showing that it is optimal over anisotropic Besov spaces (and their hierarchical versions). This is a pure theory paper, and while some reviewers mention that it would be improved with some empirical validation, there is consensus that the theoretical insights are novel and timely for the NeurIPS audience, with one reviewer mentioning that the results may be profound. While some assumptions may be restrictive, the reviewers were largely satisfied after the rebuttal period, and there are various 'applied' takeaways from the theoretical results. For instance, it suggests that a large network with strong sparsity in the prior may be preferable to a moderately sized, less sparse design on the same 'budget'. The priors enable rate adaptation, which sends another practical message in addition to the theoretical novelty. The authors state they will take the constructive criticisms into account in the final camera-ready version.