PaperHub

Rating: 6.3/10 · Spotlight · 3 reviewers
Individual ratings: 8, 6, 5 (min 5, max 8, std 1.2)
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.3 · Presentation: 3.0

NeurIPS 2024

BMRS: Bayesian Model Reduction for Structured Pruning

OpenReview · PDF
Submitted: 2024-05-09 · Updated: 2024-11-06

Keywords
Bayesian model reduction, structured pruning, variational inference, efficient machine learning, deep learning

Reviews and Discussion

Official Review
Rating: 8

The paper introduces Bayesian model reduction for structured pruning (BMRS), an efficient method for structured pruning of neural networks. It improves Bayesian structured pruning with multiplicative noise by combining it with Bayesian model reduction, enabling a principled pruning strategy based on efficient Bayesian model comparison. Two variants with different priors are proposed, threshold-free BMRS_N and BMRS_U with a hyperparameter that allows for more aggressive compression. The experimental evaluation on a range of data sets and neural network architectures demonstrates the usefulness of both approaches.

Strengths

  • The paper tackles an important problem with fundamental societal impact, especially given the massive resource requirements of current large language models.
  • The paper is well-written and introduces all required concepts clearly, also to readers like me who are less familiar with the field. It focuses on addressing a single gap in the existing literature clearly and comprehensively.
  • An anonymous code repository with clear structure and dependencies is provided upfront.
  • The experiments are thoughtfully designed, featuring a range of data sets and neural network architectures, aggregation across a sufficiently high number of random seeds, clear presentation of the main results, and mostly convincing evidence for the usefulness of the proposed method.

Weaknesses

  • The performance of the proposed BMRS method is better than SNR for simple architectures (see Table 1), but this advantage vanishes for more complex architectures (see Table 2), where SNR achieves higher compression at similar accuracy for 3 of 4 settings. Unlike for the rest of the experiments, which are analyzed thoroughly, possible underlying reasons are not discussed here. Ultimately, the economic and environmental impact of the proposed pruning method scales with the size of the neural network architecture, so I believe these results to be crucial for extrapolating to bigger architectures that are not computationally feasible to evaluate experimentally here (e.g., LLMs). These concerns currently prevent me from giving a higher score to this otherwise excellent work.
  • The computational overhead of introducing multiplicative noise is discussed in the Limitations Section, but not considered in the experiments. I understand that this is not related to the novelties of the BMRS method itself, but it would still be helpful to provide runtime information in the tables so that the difference to the baseline can be factored in.

Questions

  • As the main figure of the paper, Figure 1 could profit from a more explicit description both in the graphs themselves (e.g., axis labels) and the label (e.g., a more detailed description of the steps taken or that it uses the log-uniform prior).
  • Why is a pruning threshold defined for L2 but not SNR at the start of Section 5? Isn't a major drawback of SNR the need for choosing a threshold?

Limitations

Several limitations are discussed.

Author Response

We thank the reviewer for their time and thoughtful review! We are happy that they noted that the paper tackles an important problem, is well-written, clearly and comprehensively addresses a gap in the literature, and that the experiments are thoughtfully designed. We address their concerns and questions below.

W1. Differences in performance advantage for different settings

The reviewer is referring to the continuous pruning experiments. For MNIST, Fashion-MNIST, and CIFAR10 with an MLP and LeNet, BMRS clearly outperforms the next best baseline in terms of compression vs. accuracy, while for ResNet50 and ViT on CIFAR10 and Tiny-ImageNet, SNR provides slightly higher compression at comparable accuracy. We explain this as follows:

SNR is dependent on selecting a pruning threshold, which we set at a fixed value (1.0) as used in previous work [1]. We show that this results in radically different pruning behavior for different model sizes and datasets. In contrast, BMRS_N does not use any explicit threshold, and its solution is dependent on how good our variational approximation is. We chose to end training at 200 epochs for the larger models, but with further training we expect the compression gap to close.
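For illustration, a minimal sketch of the two decision rules (schematic only, not our exact implementation; variable names are ours, and the closed-form ΔF of Eqs. 8 and 11 is taken as given):

```python
import torch

def snr_prune_mask(log_sigma2, threshold=1.0):
    # For theta ~ LogNormal(mu, sigma^2), SNR = E[theta] / sqrt(Var[theta])
    # = 1 / sqrt(exp(sigma^2) - 1), so it depends only on the variance and
    # pruning requires a hand-chosen threshold (1.0, following [1]).
    snr = torch.rsqrt(torch.expm1(torch.exp(log_sigma2)))
    return snr < threshold  # True -> prune this structure

def bmrs_prune_mask(delta_F):
    # Threshold-free rule: prune whenever reducing the structure does not
    # decrease the approximate log-evidence (sign convention assumed here:
    # delta_F = F_reduced - F_full, computed analytically per structure).
    return delta_F >= 0
```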

Additionally, we show two settings of p_1 for BMRS_U (8 and 4), but with lower values of p_1 we would expect to see a higher compression rate (e.g., see Figure 4).

W2. Runtime information

This is an excellent point! We have included the average training runtime for a sample of datasets and models in the additional supplement. We will update these numbers for all experiments in the final version. Also note that this is the average total training time, which can vary depending on the load on our compute cluster and which nodes the models are run on. We can provide fair training and inference benchmarking in the final version of the paper.

Q1. Improve Figure 1

Thank you for the suggestions! We have incorporated the changes to Figure 1 and included it in the additional PDF.

Q2. No pruning threshold for SNR

That is correct, and it is an oversight on our part; the pruning threshold is set to 1.0 for SNR in order to compare with previous work. This is mentioned at line 275 as the threshold is only used for continuous pruning, but we can additionally mention this at line 243.

References

[1] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems (NeurIPS), pages 6775–6784, 2017.

Comment

I thank the authors for their helpful clarifications and the updated results provided. My concerns have been addressed and I will raise my score.

Comment

Thank you so much, we greatly appreciate it!

Official Review
Rating: 6

This paper proposes a new probabilistic approach to structured pruning of neural networks. Inspired by variational dropout, the authors derive a method to learn a multiplicative noise distribution, which is encoded in a multiplicative noise layer. Pruning algorithms are derived based on assumed priors over the data and parameter distributions (specifically, the distribution of p(D|θ), where D is the dataset and θ is the multiplicative noise). The derived pruning algorithms are then applied to some standard image classification benchmarks.
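For readers less familiar with this setup, a minimal sketch of such a multiplicative noise layer with a per-structure log-normal variational posterior (an illustrative reconstruction with hypothetical names, not the paper's actual code):

```python
import torch
import torch.nn as nn

class MultiplicativeNoise(nn.Module):
    """Per-structure (e.g., per-channel) multiplicative noise theta with a
    log-normal variational posterior: log theta ~ N(mu, sigma^2)."""

    def __init__(self, num_structures):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(num_structures))                  # mean of log theta
        self.log_sigma2 = nn.Parameter(torch.full((num_structures,), -6.0))  # log-variance of log theta

    def forward(self, x):
        if self.training:
            # reparameterized sample of theta
            eps = torch.randn_like(self.mu)
            theta = torch.exp(self.mu + eps * torch.exp(0.5 * self.log_sigma2))
        else:
            # log-normal mean at test time
            theta = torch.exp(self.mu + 0.5 * torch.exp(self.log_sigma2))
        # broadcast over the batch (and any spatial) dimensions
        shape = (1, -1) + (1,) * (x.dim() - 2)
        return x * theta.view(shape)
```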

Strengths

  1. The variational approach to network pruning is interesting and well-motivated.
  2. The mathematical derivations seem to be correct.
  3. The writing is reasonably clear, though a specific notation section aside from the Problem Formulation in Section 3.1 would have been helpful.
  4. The paragraph on the connection between BMRS and floating point precision was interesting and highlights strong scholarship (lines 199-218).
  5. A variety of algorithms (variants) are derived, and compared empirically, both in the post-training setting, as well as pruning while training.

Weaknesses

  1. Some figures could be slightly clearer; for instance, Figure 1 should also show the y-axis on the plot.
  2. The experimental evaluation does not include ImageNet experiments. For pruning on vision datasets, this is now a standard benchmark and should be part of any experimental setup.
  3. Modern architectures, including transformer architectures and ResNet models such as the ResNet50 used in the experiments, have complex interconnections between layers, such as residual connections in ResNets (see [1]-[3]). These interconnections make pruning such models (by genuinely changing the architectures) challenging, as saliencies have to be computed for groups of connected filters as opposed to individual filters. It would have been extremely interesting to see the authors apply their approach to such complex interconnections instead of just individual filters.
  4. It would have been nice to see the authors use their approach to analyze "how many filters/components are required to learn a given dataset". The variational approach seems like it would have provided an interesting insight into such questions, which would potentially have implications in other fields as well, such as NAS and continual learning.
  5. The empirical evaluations should baseline against more than just L2, particularly in the post-training pruning. This significantly weakens the evaluation.

References

[1] Fang et al. DepGraph: Towards Any Structural Pruning

[2] Narshana et al. DFPC: Data flow driven pruning of coupled channels without data.

[3] Liu et al. Group Fisher Pruning for Practical Network Compression

Questions

  1. What is the computational advantage, if any, of applying the proposed variational approach to structured pruning versus more classical approaches?
  2. In the post-training setting, how many samples are required to efficiently compute ΔF?
  3. Are there other choices of priors that were considered in this work?

Please refer to the 'Weaknesses' section as well.

Limitations

The limitations are discussed in this paper, specifically in section 6.

Author Response

We thank the reviewer for their time and engagement with the paper. We are happy that they found the approach interesting and well motivated, the writing clear, the math sound, the connection to floating point format interesting, and that we derive a variety of pruning algorithms. We address their weaknesses below:

W1. Figure clarity

We will revise the figures to be more clear, including the suggestion for Figure 1 which has now been updated in the rebuttal PDF.

W2. No ImageNet

The ResNet and ViT that we use in the paper are pre-trained on ImageNet, and then further fine-tuned on Tiny-ImageNet, which consists of 100,000 ImageNet images from 200 classes. While Tiny-ImageNet is not as large scale as ImageNet, we do not expect to see radically different compression vs. accuracy if we perform further fine-tuning on the full ImageNet dataset. That being said, we are happy to perform experiments and include results on ImageNet in the final version of the paper; these could not be completed within the rebuttal period due to the size of ImageNet.

W3. Compare pruning characteristics when applied to complex structures

This is indeed an interesting point, and well worth looking into. The scope of our work is focused on pruning criteria for Bayesian structured pruning, while this question is slightly broader (relating to the Bayesian pruning literature in general). We can point this out in the paper as a useful future direction to explore.

W4. How many filters are required to learn a given dataset?

This is also an interesting research question! Figures 2, 5, and 7 (the post-training pruning experiments) show us this empirically. The knee-points of the SNR curve show us where accuracy begins to drop for higher compression. We see here that BMRS gives us a lower bound on the number of prunable structures in a network i.e. slightly higher compression can be achieved without significant drops in accuracy, but the accuracy will quickly decrease with further compression.

It should also be noted that different sparsity inducing priors will lead to different answers to this question. For example, [1] introduce a hierarchical prior design to induce both unstructured and structured sparsity, which can lead to a more sparse solution (and thus fewer parameters required to learn a given dataset), but is more complex to train and BMR cannot be directly used for group sparsity.

W5. More Baselines

We agree on the importance of relevant baselines. In our case, the baselines are different pruning criteria for multiplicative noise with the log-uniform prior, so we include the L2 norm, SNR, and a no-pruning baseline.

Additional baselines that are comparable are:

  • Expected value, E[θ]
  • Magnitude of gradient
  • Hessian of gradients
  • Magnitude of activation

Given the time constraints we have not been able to finish running all of these baselines, but we have submitted updated versions of Figure 2 and Table 1 with the finished results for E[θ] in the supplemental material (the rest will be included in the final version).

Q1. Advantage of variational pruning vs. classical methods

This depends on what the reviewer means by "computational advantage"; the variational approach can lead to higher sparsification than classical approaches, which provides a net gain at inference time, but it can be more costly to train due to the overhead of the variational parameters.

Q2. How many samples are required for ΔF?

One of the benefits of our proposed method is that the calculation of ΔF does not require any sampling; it is calculated analytically using the statistics of the variational distribution, original prior, and reduced prior, making it quite efficient (see Eq. 8 and Eq. 11).
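For reference, the generic Bayesian model reduction identity behind this (a standard result, see e.g. Friston and Penny, 2011; Eqs. 8 and 11 in the paper are assumed to specialize it to our choices of posterior and priors):

```latex
% Change in free energy when the original prior p(\theta) is swapped for a
% reduced prior \tilde{p}(\theta), evaluated under the variational posterior q(\theta):
\Delta F \;=\; \log \int q(\theta)\, \frac{\tilde{p}(\theta)}{p(\theta)}\, \mathrm{d}\theta .
% When q, p, and \tilde{p} come from compatible families, this integral has a
% closed form, so no Monte Carlo sampling is needed.
```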

Q3. Are there other choices of prior considered?

This depends on whether the reviewer means the prior for the original model or the reduced priors used to calculate ΔF. The former is selected based on previous work that we build upon [2]. The latter are selected in order to be able to calculate ΔF analytically.

References

[1] D. Markovic, K. J. Friston, and S. J. Kiebel. Bayesian sparsification for deep neural networks with Bayesian model reduction. arXiv:2309.12095, 2023.

[2] K. Neklyudov, D. Molchanov, A. Ashukha, and D. P. Vetrov. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems (NeurIPS), pages 6775–6784, 2017.

Comment

I thank the authors for their thoughtful response. In particular, thank you for clarifying the error made in the ΔF question.

However, I still have some concerns. Principally, the advantage over classical pruning methods (in terms of sparsification of the model) as stated by the authors is not reflected in the experiments. This is especially the case since classical methods outperform BMRS in terms of the sparsity/accuracy tradeoff on CIFAR10 models. This is highlighted by the runtime for BMRS on a ViT on CIFAR10: 20+ hours is quite high, particularly compared to cheaper classical methods.

As such, I am happy to keep my positive score as is.

Comment

Thank you for the comment and engaging in the discussion!

We would just briefly clarify two things:

  • Our experiments do show better compression vs. accuracy against the L2 norm (a form of magnitude pruning, which is a classical method); SNR is also based on variational inference but uses a different pruning criterion that requires threshold tuning
  • The runtimes are based on total training time, i.e., time until convergence, and are thus unnormalized (e.g., ViT on CIFAR10 runs for 200 epochs in those runtimes). The baseline to compare to here is "None" which does not use any variational inference, while both "SNR" and "BMRS" use variational inference with the same noise variables and different pruning criteria. The VI methods are generally slower than with no pruning due to the overhead of the variational parameters, but both SNR pruning and BMRS pruning are about the same. For the case mentioned (ViT on CIFAR10) we actually saw that BMRS ran the fastest.

Thank you again and let us know if we can clarify anything else!

Comment

Thanks to the authors for the additional clarifications!

While I still have some minor concerns, the responses to my queries, as well as those of the other reviewers, are encouraging, and I'm happy to raise my score as a result.

Comment

Thank you! We are happy that we were able to address your concerns.

Official Review
Rating: 5

This paper works on structured pruning using Bayesian models. The authors try both post-training pruning and continuous pruning on the MNIST and CIFAR-10 datasets.

Strengths

The writing is easy to read and follow.

Weaknesses

  1. The novelty is quite limited:

BMRS, in my opinion, is basically a naive extension of previous work on variational dropout [1], [2]. BMRS seems to simply combine these works with pruning works, and to test different priors and criteria. There are no contributions on either the theoretical or the empirical side. This work also seems to be highly related to SSVI [3], published months ago. [3] also combines pruning and BNN training, but with fully sparse training and novel pruning criteria designs.

  2. Performance is quite bad.

For example, as shown in Fig. 2, BMRS can only reach around a 60% compression rate. This is quite bad for CIFAR-10. Previous works like [3] and most pruning papers can get more than a 90% compression rate.

  3. No baselines to compare against.

I see no previous works used as baselines. This work seems to aim at designing a pruning algorithm, so all the modern pure pruning algorithms should be compared against. However, none of these are shown.

  4. Missing reference

For example, [3] is highly related but missing in this paper.

[1] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems (NeurIPS), pages 2575–2583, 2015.

[2] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 3288–3298, 2017.

[3] J. Li, Z. Miao, Q. Qiu, and R. Zhang. Training Bayesian neural networks with sparse subspace variational inference. arXiv preprint arXiv:2402.11025, 2024.

Questions

What is the novelty of this work?

Limitations

See the weakness section.

Author Response

We thank the reviewer for their time reviewing the paper. We would like to address the points they make in their review:

Novelty

We respectfully disagree with the characterization of our method as a naive extension of variational dropout. We derive novel pruning criteria for a class of structured pruning models; these criteria are theoretically grounded, namely from the perspective of Bayesian model reduction (BMR). This yields a robust approach to structured pruning that performs well across our experiments and datasets without tuning of any threshold parameters. The different priors used in the paper are there to attain different pruning criteria, as different realizations of BMR result from different priors; they are not different priors on the model parameters. In recent work, discussed in our paper, BMR has been successfully applied in the unstructured pruning case [1,2], but, as far as we know, not in the structured case.

Performance

Developing BMR for structured pruning is the main contribution of our work: specifically, we 1) derive theoretically grounded pruning criteria for structured pruning and 2) empirically characterize the pruning behavior of these criteria with respect to existing criteria. Our work serves as a step towards principled Bayesian structured pruning criteria via BMR. When comparing with methods that require tuning of importance thresholds (L2, SNR), we show that we achieve reasonable trade-offs between compression and performance. It is not expected that we would attain higher compression than, e.g., Bayesian approaches which induce both unstructured and structured sparsity.

Baselines and Datasets

We would like to point out that we use MNIST, Fashion-MNIST, CIFAR10, and Tiny-ImageNet for our experiments (not only MNIST and CIFAR10 as the reviewer mentions).

Second, we compared to relevant baseline pruning criteria that have been used in previous work for the model class of Bayesian structured pruning that we study. Indeed, in the paper referenced by the reviewer [3], their proposed pruning criteria are based on variations of E_qϕ[θ] and SNR(θ); we compare to SNR(θ) as a baseline, and we have added E_qϕ[θ] in the rebuttal PDF document, which will also be updated in the final version of the paper.

Missing citations

We thank the reviewer for pointing out the very recent work [3], which we will cite in our paper; the two other papers on variational pruning which the reviewer mentions are already discussed in our submission.

References

[1] J. Beckers, B. Van Erp, Z. Zhao, K. Kondrashov, and B. De Vries. Principled pruning of Bayesian neural networks through variational free energy minimization. IEEE Open Journal of Signal Processing, 2024.

[2] D. Markovic, K. J. Friston, and S. J. Kiebel. Bayesian sparsification for deep neural networks with Bayesian model reduction. arXiv:2309.12095, 2023.

[3] J. Li, Z. Miao, Q. Qiu, and R. Zhang. Training Bayesian neural networks with sparse subspace variational inference. CoRR, abs/2402.11025, 2024.

Comment

Thank you to the authors for their response. However, my concern remains unresolved, and I would prefer to maintain the current score.

I continue to have concerns about the novelty of this work for similar reasons. Additionally, since this paper focuses on structured pruning rather than improving BNNs, the baselines should include the most advanced pruning algorithms, encompassing works beyond just the Bayesian approaches.

Comment

We thank the reviewer for engaging with our rebuttal. We would like to gently point out that we do not believe our paper contains "technical flaws, weak evaluation, inadequate reproducibility and/or incompletely addressed ethical considerations" as indicated by the reviewer score of 3. The reviewer has not pointed out any technical flaws, and we have provided the source code for reproducibility. For their remaining concerns we would like to address the following points.

Thank you to the authors for their response. However, my concern remains unresolved, and I would prefer to maintain the current score. I continue to have concerns about the novelty of this work for similar reasons.

Regarding the novelty, we derive BMR for structured pruning using the combination of a log-uniform prior and log-normal posterior on multiplicative noise. This pruning criterion is generally applicable to any network which uses this setup to induce sparsity, and enables continuous pruning (as opposed to post-training pruning with the tuning of thresholds). As mentioned previously, BMR has been well established in the unstructured case, but not the structured case. For example, it would be straightforward to apply BMR as another subspace selection criterion in Li et al. (SSVI) [1] as discussed by the reviewer, since BMR has been used several times in the literature for Gaussian variables (see e.g. [2-4]). Extending this to structured pruning is non-trivial, as indicated by [5], but we derive this in our paper. We additionally offer a theoretical connection to floating point precision for this pruning criterion. Our hope is that this will enable future work on Bayesian structured pruning from the perspective of BMR. We would finally ask: if the reviewer sees the novelty as limited, could they provide references to papers that do something similar, that is, derive Bayesian Model Reduction (BMR) for structured pruning?

Additionally, since this paper focuses on structured pruning rather than improving BNNs, the baselines should include the most advanced pruning algorithms, encompassing works beyond just the Bayesian approaches.

We provide an extensive comparison of different model selection criteria (L2, SNR), now also including E[θ].

We would like to stress the following:

Any pruning algorithm has two key components:

  1. a model selection criterion (for ranking networks) and
  2. a (typically heuristic) algorithm for changing network structures (e.g., removing nodes or layers).

In our paper, we study the first aspect by deriving new criteria based on BMR on multiplicative noise variables acting on network structures.
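As a schematic illustration of this decomposition (hypothetical names; `criterion` could be L2, SNR, E[θ], or a BMR-based ΔF test, and `remove` stands in for whatever structural surgery the heuristic performs):

```python
import torch.nn as nn

def prune_structures(model, criterion, remove):
    # Component 1: a model selection criterion that flags prunable structures.
    # Component 2: a (typically heuristic) routine that rewires the network.
    for layer in model.modules():
        if isinstance(layer, (nn.Linear, nn.Conv2d)):
            mask = criterion(layer)   # rank / select structures to drop
            remove(layer, mask)       # change the architecture accordingly
```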

Additionally, in [1], they also use SNR and E[θ] as criteria for sparse subspace search for unstructured pruning. One could additionally formulate the sparse subspace search using log-multiplicative noise to induce group sparsity for structured pruning, and subsequently use our pruning criteria based on BMR to perform subspace selection.

In our work, we have compared to their model selection criteria, SNR and E[θ], together with our novel pruning criteria based on BMR, which also allow for continuous pruning, finding that we can achieve high-performing and threshold-free pruning (which also makes our approach promising for sparse subspace search).

References

[1] J. Li, Z. Miao, Q. Qiu, and R. Zhang. Training Bayesian neural networks with sparse subspace variational inference. CoRR, abs/2402.11025, 2024.

[2] J. Beckers, B. Van Erp, Z. Zhao, K. Kondrashov, and B. De Vries. Principled pruning of Bayesian neural networks through variational free energy minimization. IEEE Open Journal of Signal Processing, 2024.

[3] K. J. Friston and W. D. Penny. Post hoc Bayesian model selection. NeuroImage, 56(4):2089–2099, 2011.

[4] K. Friston, T. Parr, and P. Zeidman. Bayesian model reduction. arXiv:1805.07092, 2018.

[5] D. Markovic, K. J. Friston, and S. J. Kiebel. Bayesian sparsification for deep neural networks with Bayesian model reduction. arXiv:2309.12095, 2023.

Comment

Thank you to the authors for the prompt response. While I do have some minor concerns, I would like to raise my score.

Comment

Thank you so much! We are happy that we were able to address most of your concerns, and are grateful for the feedback that you provided.

Author Response

We thank the reviewers for their time and their reviews. We are glad that they generally found the approach interesting, the math sound, and the problem important. We are also happy that they all found the writing clear and the paper presented well. To contextualize the reviews, we would like to echo a point made by reviewer L897: that the paper addresses a single, clear gap in the literature on Bayesian structured pruning.

To the best of the authors' knowledge, this work is the first to successfully explore Bayesian Model Reduction (BMR) for structured pruning. Our key contribution shows how to derive principled and theoretically motivated pruning criteria for Bayesian structured pruning using BMR. We empirically show that our derived pruning criteria work on four benchmark datasets (MNIST, FashionMNIST, CIFAR-10, Tiny-ImageNet) across different architectures (MLP, LeNet5, ResNet-50, Vision Transformer).

Given this, we address the concerns of each reviewer in the individual responses. In addition, we include the following in the additional PDF as part of our rebuttal:

  • An updated Figure 1, with clearer x- and y-axis ticks and labels, as well as an updated caption
  • Additional baseline results using E[θ]; this includes an updated Table 1 and Figure 2 (other figures and tables will be updated in the final version)
  • Average training runtimes for LeNet5 and ViT on all of the datasets in the paper

We look forward to engaging in the author-reviewer discussion.

Final Decision

The paper proposes a new approach to structured pruning of neural networks using Bayesian model reduction (BMR). The authors derive a method to learn a multiplicative noise distribution, which is encoded in a multiplicative noise layer, and use this to prune the network. The paper presents a thorough experimental evaluation on several datasets and architectures, demonstrating the effectiveness of the proposed approach.

The reviewers generally found the paper to be well-written, clear, and well-motivated, with a good presentation of the mathematical derivations and experimental results. However, some reviewers raised concerns about the novelty of the approach, the performance of the method on more complex architectures, the computational overhead of introducing multiplicative noise, and the lack of baselines and ablation studies. The authors responded to the reviewers' comments, addressing the concerns and providing additional information and clarification. They also provided updated results, including additional baselines and runtime information, which helped to alleviate some of the concerns.

After the rebuttal and discussion, all reviewers lean towards acceptance. Overall, I would thus recommend to accept the submission. However, I would further recommend that the authors take the reviewer feedback into account for the camera-ready version.