PaperHub

Rating: 6.4 / 10 (Poster, 5 reviewers)
Individual ratings: 6, 6, 6, 8, 6 (min 6, max 8, std 0.8)
Confidence: 2.4
ICLR 2024

SWAP: Sparse Entropic Wasserstein Regression for Robust Network Pruning

OpenReview | PDF
Submitted: 2023-09-19, Updated: 2024-04-21

Abstract

Keywords

Network Pruning, Model Compression, Optimal Transport, Wasserstein Distance, Deep Learning

Reviews and Discussion

Official Review

Rating: 6

Computation of the empirical Fisher Information Matrix (FIM) is an important part of neural network pruning and is subject to noisy gradients. As a solution, this paper proposes an entropic Wasserstein regression (EWR) formulation to address this issue. EWR is shown to implicitly enact gradient averaging via Neighborhood Interpolation, striking a balance between capturing gradient covariance and reducing gradient noise. The paper demonstrates the proposed method's ability to combat noisy gradients through theoretical analysis (Section 3) and empirical evidence (Sections 4 and 5). The empirical evidence is showcased using various models and vision datasets.

Strengths

  1. Provides clear motivation for combating noise that may come from different sources.
  2. Provides theoretical analysis for the proposed method along with empirical evidence.

Weaknesses

  1. Is it possible to show an analysis in terms of the computational expense incurred as the scale of the model increases?

Questions

  1. Suggestion: change some of the cited references to be included in parentheses for better presentation, e.g., on page 3, "Computing the distance between .... (Nadjahi et al., 2020)."
  2. In the introduction, GPT-4 is used as an example of a model with substantial size and complexity. Is it possible to showcase such pruning on a language model?
Comment

Question 1: Suggestion: change some of the cited references to be included in parentheses for better presentation, e.g., on page 3, "Computing the distance between .... (Nadjahi et al., 2020)."

Answer 1: Thanks for the suggestion! We have changed the cited references to be included in parentheses as you suggested.

Question 2: Is it possible to show an analysis in terms of computational expense incurred, as the scale of the model increases?

Answer 2: We thank the reviewer for this good suggestion. We have added a new section (A.8 Algorithm Scalability) in the appendix of the updated manuscript. The content is as follows.

In this section, we analyze the scalability of Algorithm 1 with respect to the number of model parameters involved in pruning. The result is shown in Figure 8. It can be observed that the execution time scales linearly with the number of pruning parameters, and the extra cost of solving the OT is marginal. Theoretically, one can derive this linear scalability by inspecting Line 8, which is the most time-consuming step. The required operations can be decomposed into a sequence of operations: the matrix-vector multiplications $\vec{Gw}$ and $\vec{G\bar{w}}$ in $O(np)$, the vector subtraction $\vec{Gw}-\vec{G\bar{w}}$ in $O(n)$, the matrix-vector multiplication $\vec{\Pi}(\vec{Gw}-\vec{G\bar{w}})$ in $O(n^2)$, a transposition and multiplication with $\vec{G}$ in $O(np)$, and the vector subtraction and scalar multiplication $\lambda(\vec{w}-\vec{\bar{w}})$ in $O(p)$. Thus, the overall complexity is $O(np)$, with $p$ significantly larger than $n$ in practice. Given a fixed Fisher sample size $n$, the loop of Algorithm 1 scales linearly with the number of pruning parameters $p$.
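For concreteness, here is a minimal NumPy sketch of this operation ordering. It is only an illustration under the stated assumptions ($\vec{G}$ an $n \times p$ gradient matrix, $\vec{\Pi}$ an $n \times n$ transport plan, a plain gradient-descent update); the function name and the learning-rate argument are hypothetical and not part of the paper.

import numpy as np

def ewr_gradient_step(G, Pi, w, w_bar, lam, lr):
    # Hypothetical sketch of the O(np)-dominated update described above.
    # G: (n, p) gradient matrix; Pi: (n, n) transport plan;
    # w, w_bar: (p,) current and reference weights; lam: regularization weight;
    # lr: step size (an assumption, not specified in the quoted appendix text).
    r = G @ w - G @ w_bar               # two matrix-vector products: O(np)
    t = Pi @ r                          # apply the transport plan: O(n^2)
    grad = G.T @ t + lam * (w - w_bar)  # back-projection O(np), plus O(p) terms
    return w - lr * grad                # overall cost dominated by O(np)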

Question 3: In the introduction, GPT-4 is used as an example of a model with substantial size and complexity. Is it possible to showcase such pruning on a language model?

Answer 3: In principle, the proposed method can be applied to prune a language model; namely, the proposed method makes no assumptions about the model's architecture. The formulations of (3) (i.e., the LR formulation) and (5) (i.e., the sparse EWR formulation) originate from approximating the Taylor expansion of the loss function shown in (1). Both the LR and the EWR methods can be used as long as the loss function is differentiable, which is the case for the majority of deep learning models, including language models based on neural networks and transformers.

In the revised version, we conducted an extra experiment on a larger model, ResNet50, shown in Table 1. Our conclusions in the original paper still hold for the newly obtained results. We hence anticipate that the proposed method would be applicable to larger models, including transformer architectures. This is definitely a direction worth exploring with significant effort.

Comment

Appreciate the response. Thank you

Official Review

Rating: 6

This paper proposes a new network pruning method based on the Wasserstein distance. Under the convex hull distance equality, the problem is reformulated using Neighborhood Interpolation. Under this formulation, the authors show that, compared to the traditional LR formulation, the method learns data-adaptive gradient-averaging weights to smooth noise in gradient estimation. An iterative optimization method is provided. Experiments on several backbones and datasets demonstrate the method's utility.

Strengths

  1. The noise corruption in gradient estimation considered in this paper is an interesting and important problem.
  2. The idea of leveraging the Wasserstein distance, especially the reformulation through neighborhood interpolation to combat noise, is interesting.
  3. The authors have conducted supportive experiments to validate the advantages. In particular, I like Figure 9, which illustrates the distribution of the learned $\Pi_t$ as the noise level varies.
  4. The paper is well organized and well written.

Weaknesses

  1. Is it possible to conduct a convergence analysis for Algorithm 1? If it is difficult, does the main challenge lie in the simultaneous optimization of $\Pi_t$?
  2. It would be more interesting to compare with more baselines, in addition to the LR method.

Questions

Just out of curiosity: since the main contribution lies in better estimating the gradient in the presence of noise, is it possible to extend the proposed method to other applications beyond network pruning?

Details of Ethics Concerns

Not applicable.

Comment

Question 1: Is it possible to conduct a convergence analysis for Algorithm 1? If it is difficult, does the main challenge lie in the simultaneous optimization of $\Pi_t$?

Answer 1: Yes, we agree with the reviewer that the convergence analysis for problem (5) is difficult, and the main challenge lies in the simultaneous optimization of $\Pi_t$ and $\vec{w}$. Given a fixed sparsity $k$, the convergence analysis of using block coordinate descent (i.e., alternating optimization) to solve (5) essentially lies in proving that the block coordinate descent, together with the iterative hard thresholding in Line 10, leads to convergence. To the best of our knowledge, this question remains open.

The most relevant works we’ve identified so far are:

  • [1] Peng, Liangzu, and René Vidal. "Block Coordinate Descent on Smooth Manifolds." arXiv preprint arXiv:2305.14744 (2023).
  • [2] Eisenmann, Henrik, et al. "Riemannian thresholding methods for row-sparse and low-rank matrix recovery." Numerical Algorithms 93.2 (2023): 669-693.

Work [1] shows that convergence can be guaranteed with Riemannian gradient descent; however, it does not cover the case of using IHT. Work [2] proposed a Riemannian version of IHT with local convergence. Whether integrating the Riemannian IHT into block coordinate descent on a manifold leads to local convergence remains open. This is definitely a direction worth exploring in future work.

However, we would like to point out that for our particular problem of network pruning, the convergence of an algorithm solving problem (5) with a given $k$ is not as relevant. The main reason is that in Algorithm 1 (or any gradual pruning strategy developed from (5)), the number of iteration steps is fixed. In each iteration, the target sparsity is different (i.e., $k$ is different), and we are solving a different problem with one SGD iteration.

The computational cost in Algorithm 1 is mainly incurred by Line 8 (the stochastic gradient descent step). Notably, the cost of computing the OT plan $\Pi_t$ is marginal. We added a section (A.8 Algorithm Scalability) in the Appendix. The content is as follows.

The result is shown in Figure 8. It can be observed that the execution time scales linearly with the number of pruning parameters, and the extra cost of solving the OT is marginal. Theoretically, one can derive this ... The required operations can be decomposed into a sequence of operations: the matrix-vector multiplications $\vec{Gw}$ and $\vec{G\bar{w}}$ in $O(np)$, the vector subtraction $\vec{Gw}-\vec{G\bar{w}}$ in $O(n)$, the matrix-vector multiplication $\vec{\Pi}(\vec{Gw}-\vec{G\bar{w}})$ in $O(n^2)$, a transposition and multiplication with $\vec{G}$ in $O(np)$, and the vector subtraction and scalar multiplication $\lambda(\vec{w}-\vec{\bar{w}})$ in $O(p)$. Thus, the overall complexity is $O(np)$, with $p$ significantly larger than $n$ in practice. Given a fixed Fisher sample size $n$, the loop of Algorithm 1 scales linearly with the number of pruning parameters $p$.

Question 2: It would be more interesting to compare with more baselines, in addition to the LR method.

Answer 2: We thank the reviewer for pointing out this weakness; however, the reason for mainly comparing to the LR method is twofold. The main motivation is that we would like to investigate the effect of the optimal transport that we introduce into the problem formulation. Since LR can be seen as a naive case of EWR, it is natural to use LR as the baseline for EWR to see how effective the Wasserstein regression is at combating the noise. The second reason is that LR is the current state of the art that outperforms other methods within the area of post-training network pruning, and our proposed method does not get worse even in noise-free scenarios: it converges to LR with a large amount of clean training data and therefore also outperforms the other methods.

Moreover, we have also listed the performance of other methods (MP, WF, and CBS) in Table 1. Since the results we reproduce for LR are consistent with the original paper, we believe the results for the other methods should be consistent as well.

Question 3: Just out of curiosity: since the main contribution lies in better estimating the gradient in the presence of noise, is it possible to extend the proposed method to other applications beyond network pruning?

Answer 3: Yes, this is possible, but we haven’t tested all possible applications; we list one of them here. A closely related scenario is when there is noise or corruption in computing the gradients of the trained network, as opposed to the noisy data that we consider in this work. This naturally arises in federated learning scenarios where the gradient updates from the clients are not reliable.

Official Review

Rating: 6

This paper reformulates the network pruning problem into a Wasserstein-distance-regularized sparse linear regression problem. The authors show that ordinary sparse linear regression is just a special case in which only diagonal entries exist in the transportation matrix. The authors also show that it can be viewed as neighborhood-size control, which trades off between covariance capturing and gradient noise reduction. Numerical results show improvements on MLPNet, ResNet20, and MobileNetV1 architectures over existing magnitude pruning and CBS approaches.

Strengths

  1. The reformulation is novel as far as I know. The authors successfully connect the reformulation to the existing sparse regression setup, which makes a good story here.
  2. The analysis of neighborhood control is also insightful.

Weaknesses

  1. The experiments are weak, without tests on state-of-the-art architectures like transformers or larger models like ResNet50, which raises doubts about whether the proposed approach works well at larger model sizes.

Questions

Why use values such as 0.84, 0.74, 0.75, 0.63 ... in Tables 2 and 3? This is very uncommon, and even in Table 1 the results follow traditional sparsity levels.

Comment

We thank the reviewer for acknowledging our contributions.

Question 1: The experiments are weak, without tests on state-of-the-art architectures like transformers or larger models like ResNet50, which raises doubts about whether the proposed approach works well at larger model sizes.

Answer 1: To make the experiments more comprehensive, we have added pruning performance benchmarking for ResNet50 to Table 1. One can see that EWR outperforms LR for both ResNet20 and ResNet50 on the same dataset, CIFAR10. Correspondingly, we revised our results and analysis for Table 1: the advantages of EWR over the others are reflected in the three more challenging tasks, ResNet20 and ResNet50 on CIFAR10 and MobileNetV1 on ImageNet, especially in the presence of noisy gradients.

Question 2: Why use values such as 0.84, 0.74, 0.75, 0.63 ... in Tables 2 and 3? This is very uncommon, and even in Table 1 the results follow traditional sparsity levels.

Answer 2 regarding the sparsity levels in Tables 2 and 3, in contrast to Table 1, is as follows.

In Table 1, all sparsity levels are "target sparsity", and all the pruning algorithms are fine-tuned on the training set after every pruning step. This is to make our implemented LR and EWR align with the performance benchmarking results of MP, WF, and CBS, so that all algorithms are comparable to each other. The “Sparsity” column is the target sparsity, so every line in Table 1 is the final pruning result. Tables 2 and 3 showcase the performance during one pruning process, where the loss and accuracy values are taken immediately after pruning (without any fine-tuning); the sparsity levels are intermediate levels. This was clarified in Appendix A.5 of the original version: “In Table 2, Table 3, and additional results provided in the Appendix, sparsity is set using a linear gradual pruning strategy, progressing from 0 to 0.75 or 0.95 across ten distinct stages for MLPNet and ResNet, and from 0 to 0.75 across eight distinct stages for MobileNetV1.”

Using Python code as an example:

import numpy as np

target_sparsity = 0.95
pruning_stage = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # ten gradual pruning stages

Note that in stage 0 we record the target weights $\bar{w}$ and no actual pruning is conducted:

total_stages = 10 - 1  # nine increments between the ten stages

Hence:

(pruning_stage / total_stages * target_sparsity).round(2)

yields the sparsity levels during this pruning process as

array([0.  , 0.11, 0.21, 0.32, 0.42, 0.53, 0.63, 0.74, 0.84, 0.95])

To clarify it further, we added the following content after the above-cited paragraph:

The values are computed with linear incremental steps, from zero to the target sparsity.

Additionally, we further clarified this in the captions of Tables 2, 3, 6, and 7.

Comment

Thank you for your response and the answers to my questions. I will stick to my positive score.

Official Review

Rating: 8

The authors propose a technique for pruning (sparsification) of neural networks that relies on robust estimation of the empirical Fisher Information Matrix as a surrogate for the Hessian of the training loss. Earlier work has relied on a decomposition of the FIM to motivate a sparse LR formulation of an MIQP framework.

In contrast, in this work the authors propose a framework to address instances of contaminated gradients. In this situation, one must leverage robust estimators of the FIM, or risk a significant drop in empirical performance. By studying the original MIQP problem from the perspective of entropic Wasserstein regression, the authors propose a variation of the sparse LR formulation which amounts to substituting the 2-Wasserstein distance with entropic regularization for the quadratic regression loss. Notably, without entropic regularization, the formulation is equivalent to that of the sparse LR framework.

Theoretically, the authors demonstrate that pruning via entropic Wasserstein regression exactly corresponds to gradient averaging using Neighborhood Interpolation, with the entropic regularization term governing the size of the neighborhood. Algorithmically, the method is simple and computationally efficient. Finding solutions to the problem is done via coupling Sinkhorn iterations with SGD. Numerically, the performance of the method exceeds that of previous work and is competitive with the state of the art.

The method is simple and admits an elegant interpretation, as explored by the authors. Numerically, improvements over existing methods are observed, particularly when the training gradients are corrupted by noise. However, there is a significant number of grammatical mistakes and instances of poor phrasing. I do recommend this paper for acceptance, but suggest that the authors devote more time to proofreading the manuscript.

Strengths

The following are the primary strengths of this paper:

  • The authors propose a straightforward (but novel) modification to the sparse LR framework for neural network pruning. The modification amounts to an additional regularization term grounded by an interpretation using the principles of optimal transport. The optimization problem remains efficiently solvable.

  • The authors motivate their method via an analysis of the robustness properties exhibited by solutions to their optimization problem. Namely, by proving that pruning using their technique implicitly corresponds to gradient averaging via a certain neighborhood interpolation and naturally trades off between a measure of robustness and the quality of the covariance estimator.

  • Additional discussions on sample complexity, ablations on the sparsity and regularization parameter, and alternative methods for computation of the EWR solution are comprehensive and provided in the appendix.

  • The method proposed by the authors improves results over existing methods, particularly when the training gradients are corrupted by noise.

  • Code is provided by the authors via a GitHub link, which is appreciated.

Weaknesses

As a reviewer, I note that I am unfamiliar with the current state-of-the-art pruning techniques, and I defer to other reviewers regarding the thoroughness of the comparative experiments. However, the method seems grounded. The structure of the manuscript is OK. The writing and clarity of this paper could be significantly improved. In particular, many phrases and statements are unclear, beginning with the abstract:

This study unveils a cutting-edge technique for neural network pruning that judiciously addresses noisy gradients during the computation of the empirical Fisher Information Matrix (FIM).

Additionally, some choices could be better motivated, e.g., the pruning step (step 10 of Algorithm 1) as a projection onto the $\ell_0$ norm ball. However, as (reasonably) referenced by the authors, analysis of the optimization problem lies outside the scope of the paper.

Throughout the main text, there are many typos and grammatical errors. Although the draft is readable in its current form, I would suggest the authors carefully review the manuscript and improve the writing; e.g., the following are some examples:

  • Now let’s comparing the covariance between…
  • Intuitively, a large dataset of high-quality training samples diminishes concerns over gradient noise, making the empirical fisher a close approximation to the true fisher.
  • Importantly, this seamless trade-off eludes the combination of the Euclidean distance with gradient averaging.
  • Remark that LR is a special case of EWR…

Questions

I may have missed it, but it is not obvious to me what kind of assumption is made regarding the noise. Additionally, it is unclear what kind of noise is introduced in the experiments. What kinds of noise can this method be a good choice for? What is an appropriate choice of the regularization weight for different kinds / magnitudes of noise?

Comment

We thank the reviewer for the comprehensive review and acknowledgement of our paper’s contributions. We apologize for the unclear phrases and statements; we have polished and rephrased the recently submitted version, and further careful proofreading will be conducted. We agree with the reviewer that the pruning step in Algorithm 1, especially the iterative hard thresholding (IHT), should be better motivated and explained. Towards this goal, we added the following text after the two paragraphs below equation (14): Line 10 in Algorithm 1 employs the IHT method that is commonly used in sparse learning, which, together with Line 9, forms a projected gradient descent algorithm. It finds a sparse representation of the updated gradient in Line 9. Intuitively, IHT keeps the dominant model weights and essentially preserves the most impactful aspects of the to-be-trimmed model.
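For intuition only, here is a generic sketch of the hard-thresholding projection that keeps the k largest-magnitude weights (a hypothetical helper, not the authors' exact Line 10):

import numpy as np

def hard_threshold(w, k):
    # Keep the k largest-magnitude entries of w and zero out the rest,
    # i.e., project w onto the set of k-sparse vectors, as in IHT-style steps.
    keep = np.argpartition(np.abs(w), -k)[-k:]
    out = np.zeros_like(w)
    out[keep] = w[keep]
    return out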

Answer 1 regarding the added noise is as follows:

The noise is added to the data as Gaussian noise with zero mean and standard deviation σ_{data}. The value of σ_{data} is calibrated in the following way (a code sketch of this calibration follows the list):

  • For a well-trained neural network (before pruning), we compute its derivative as a vector and take the standard deviation of the elements of this vector. Denote it by σ.
  • We then add Gaussian noise to the data. Denote the standard deviation of this Gaussian noise by σ_{data}, and denote the standard deviation of the computed derivative after adding this noise by σ’. We have σ’ > σ.
  • We tune and calibrate σ_{data} such that σ’ = 2σ = σ + σ. This is the case of the noise level being “σ”. If we calibrate it so that σ’ = 3σ = σ + 2σ, then it is the case of the noise level being “2σ”.
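For illustration only, a minimal sketch of this calibration loop is given below. The helper grad_std (returning the standard deviation of the network's gradient elements on a given dataset) and the grid of candidate noise levels are hypothetical placeholders, not the exact procedure used in the paper.

import numpy as np

def calibrate_sigma_data(grad_std, data, target_ratio=2.0, n_grid=100):
    # Search for the smallest sigma_data whose induced gradient std reaches
    # target_ratio * sigma: e.g. 2*sigma for noise level "sigma",
    # 3*sigma for noise level "2*sigma".
    sigma = grad_std(data)  # std of the gradient elements on clean data
    for s in np.linspace(0.0, 1.0, n_grid + 1) * float(np.std(data)):
        noisy = data + np.random.normal(0.0, s, size=np.shape(data))
        if grad_std(noisy) >= target_ratio * sigma:
            return s
    return None  # target noise level not reached within the candidate grid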

To clarify this in the paper, we added the following content in Appendix A.5: In calibrating noise for data in neural networks, we start with a well-trained network. First, we calculate the standard deviation $\sigma$ of the network's derivative. Then, we add Gaussian noise with zero mean to the data. After adding the noise, the standard deviation of the network's derivative changes to a new value, $\sigma'$, which is always greater than $\sigma$. The goal is to adjust the standard deviation of the Gaussian noise so that $\sigma'$ becomes $\sigma'=\sigma+\sigma$ (referred to as noise level $\sigma$) or $\sigma'=\sigma+2\sigma$ (referred to as noise level $2\sigma$).

Additionally, we modified the term “noisy gradient” to be “noisy data” in tables and the text descriptions to avoid ambiguity.

Answer 2, regarding what kinds of noise this method can be a good choice for, is as follows.

In our experiments, we discovered that if the noise (random fluctuations) is either very small or very large, the performance of the two methods, LR and EWR (our method), tends to be similar. It is straightforward that small noise does not lead to a visible difference between LR and EWR. When the noise is very large, the gradients are so heavily contaminated that little true covariance information can be recovered in the Hessian approximation, and hence LR and EWR stay on par. Regarding whether a noise level counts as “small” or “very large”, this depends on the size of both the model and the dataset; we do not have a general answer, and it is mainly based on experimental observations. However, in all cases, EWR never performed worse than LR, regardless of the noise level. We also think that if the noise is centered around zero (meaning it is equally likely to be positive or negative), our method could be a particularly good choice, because the way EWR interpolates, or blends, data might help balance out the noise.

Answer 3, regarding an appropriate choice of the regularization weight for different kinds / magnitudes of noise, is as follows.

The resilience to noise originates from the neighborhood interpolation mechanism. Intuitively, if the gradients are very noisy, then we should trust each single data point less, because it is likely contaminated with noise. Instead, we may want to enlarge the neighborhood over which the interpolation is performed to combat the noise; a large value of $\varepsilon$ does us this favor. On the contrary, a small $\varepsilon$ leads to few points participating effectively in the interpolation. An extreme case is $\varepsilon=0$, where the entropic regularization term is entirely removed, indicating that no interpolation is performed when computing the Euclidean distance between two points. Yet, this case is not equivalent to the LR case, because the optimal transport does not necessarily happen between data points in their originally assigned order; they are equivalent only if we impose the transportation plan to be $\Pi=\mathrm{diag}(1/n)$ consistently. An illustration of the neighborhood size can be seen in Figure 10 (a)-(d): the lighter the color, the more points are involved in the neighborhood interpolation when computing the Euclidean distance.
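To make the role of $\varepsilon$ concrete, here is a standard entropic-OT (Sinkhorn) sketch with uniform marginals, shown purely as an illustration and not as the paper's Algorithm 1: as eps grows, the returned plan spreads mass over more neighbors (wider interpolation), while a small eps concentrates it. For very small eps a log-domain implementation would be needed for numerical stability.

import numpy as np

def sinkhorn_plan(C, eps, iters=200):
    # Standard Sinkhorn iterations for entropic OT with uniform marginals.
    # C is an (n, n) cost matrix; eps is the entropic regularization weight.
    n = C.shape[0]
    a = b = np.ones(n) / n              # uniform source and target marginals
    K = np.exp(-C / eps)                # Gibbs kernel
    v = np.ones(n)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # plan Pi = diag(u) K diag(v)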

Comment

Thank you for the comprehensive response and clarification regarding LR and EWR. I have read over the authors' responses and have taken a look at the revision, which looks much better. I reiterate that the topic of this work falls outside my area of expertise, but I would increase my score to an 8.

Official Review

Rating: 6

Robust Network Pruning With Sparse Entropic Wasserstein Regression

In this paper, the authors propose a method to prune neural networks. In particular, in the sparse linear regression formulation of network pruning, the authors replace the first $l_0$ regression term with a Wasserstein regression. Theoretical justifications and empirical experiments show that the proposed pruning strategy is effective and robust against gradient/data noise.

Strengths

  • The paper is well-written and the problem is well-motivated.
  • The proposed method has desirable properties and shows improved performance over previous methods, especially at larger sparsity.

Weaknesses

  • “The noise level σ is set to be the standard deviation of the original gradients”. Why is this the noise level for both gradients and data? I would like to see a more detailed explanation of how the noise is added to the data and the gradients.
  • Can the authors also provide accuracy tables corresponding to Table 2 and Table 3?

Questions

Please see weaknesses.

Comment

We thank the reviewer for acknowledging the contributions of our work. We appreciate the reviewer for pointing out the ambiguity here.

Answer 1 regarding the noise level is as follows.

The noise is added to the data as Gaussian noise with zero mean and standard deviation σ_{data}. The value of σ_{data} is calibrated in the following way:

  • For a well-trained neural network (before pruning), we compute its derivative as a vector and take the standard deviation of the elements of this vector. Denote it by σ.
  • We then add Gaussian noise to the data. Denote the standard deviation of this Gaussian noise by σ_{data}, and denote the standard deviation of the computed derivative after adding this noise by σ’. We have σ’ > σ.
  • We tune and calibrate σ_{data} such that σ’ = 2σ = σ + σ. This corresponds to the case of the noise level being “σ”. If we calibrate it so that σ’ = 3σ = σ + 2σ, then it corresponds to the case of the noise level being “2σ”.

To clarify this in the paper, we added in Appendix A.5 the content below:

In calibrating noise for data in neural networks, we start with a well-trained network. First, we calculate the standard deviation $\sigma$ of the network's derivative. Then, we add Gaussian noise with zero mean to the data. After adding the noise, the standard deviation of the network's derivative changes to a new value, $\sigma'$, which is always greater than $\sigma$. The goal is to adjust the standard deviation of the Gaussian noise so that $\sigma'$ becomes $\sigma'=\sigma+\sigma$ (referred to as noise level $\sigma$) or $\sigma'=\sigma+2\sigma$ (referred to as noise level $2\sigma$).

Additionally, we modified the term “noisy gradient” to be “noisy data” in tables and the text descriptions to avoid ambiguity.

Answer 2 regarding the accuracy of Table 2 and Table 3 is as follows.

Thank you for this good advice! We have added Table 6 and Table 7 showing the test accuracy results, as a complement to the results of Table 2 and Table 3.

Note that these four tables differ from the benchmarking in Table 1: in Table 1, all the pruning algorithms are fine-tuned on the training set after every pruning step. This is to make our implemented LR and EWR align with the performance benchmarking results of MP, WF, and CBS, so that all algorithms are comparable to each other. The “Sparsity” column is the target sparsity, so every line in Table 1 is the final pruning result.

Tables 2, 3, 6, and 7 showcase the performance during one pruning process, where the loss and accuracy values are taken immediately after pruning (without any fine-tuning). This was clarified in Appendix A.5 of the original version: “In Table 2, Table 3, and additional results provided in the Appendix, sparsity is set using a linear gradual pruning strategy, progressing from 0 to 0.75 or 0.95 across ten distinct stages for MLPNet and ResNet, and from 0 to 0.75 across eight distinct stages for MobileNetV1. Notably, all recorded loss values are captured immediately post-pruning, devoid of any subsequent fine-tuning.”

Comment

I acknowledge that I have read the rebuttal, and all my concerns are addressed. I have decided to maintain my positive score.

Comment

Dear Reviewers,

We sincerely appreciate the valuable time and effort you have dedicated to reviewing our work. Your insightful and constructive feedback has been instrumental in enhancing the quality of our manuscript.

We are fully committed to addressing any further questions or concerns you might have. Additionally, we respectfully request that you consider re-evaluating the revised manuscript. If you find the improvements to be significant, we would be grateful for a reconsideration of the score, accompanied by your justifications for any adjustments.

We eagerly anticipate engaging in open discussions and are keen to benefit from your expertise and guidance.

Best regards, The authors

AC Meta-Review

The authors introduce a new approach for neural network pruning based on the properties of the empirical Fisher Information Matrix. This approach has enjoyed a long history in the network pruning literature; the main novelty of this new approach is an interpretation based on sparsity-constrained entropic Wasserstein regression. This enables the method to outperform SoTA pruning methods when the target sparsity is large. Several experiments are conducted to showcase the method's performance.

Most reviewers appreciated the freshness of the ideas, although concerns were raised about writing style, and the applicability of the method to truly large networks (such as modern language models). The latter point is especially important in my view: there is a very large "zoo" of pruning methods already, and most of them top out at networks of the size of ResNet-50. For real impact, I would encourage the authors to try and implement their approach for larger network sizes (and for more complicated architectures beyond ResNets).

Why Not a Higher Score

Nice and intuitive algorithmic idea, but experiments are limited and (in my view) perhaps not above the bar for a spotlight.

Why Not a Lower Score

Reviews were quite positive and above the bar for acceptance.

Final Decision

Accept (poster)