SAS: Structured Activation Sparsification
Utilizing structured sparsity in activations via a sparse projection, aiming to realize efficient vectorized matmul
Abstract
Reviews and Discussion
This paper proposes a new concept of sparsity that maps the input activation values to a sparse representation and then exploits NVIDIA Ampere sparsity for sparse computation by widening the weights. The wider weights give the network stronger representation ability, while the computation remains the same thanks to the N:M sparse operation. The authors conducted experiments on the CIFAR and ImageNet datasets to verify the effectiveness of the proposed method.
Strengths
- The presented idea is clever and novel. It can effectively enhance the representation ability of the network under the same amount of computation compared with the vanilla network. I think this brings a nice insight to the field.
- The author developed a general matmul library for the proposed SAS using Sparse Tensor Core. This gives the method practical value in the field, which is appreciated.
Weaknesses
- It is inaccurate for the author to state that the computation of SWS and SAS are the same. I can agree that the computation of SAS and the original dense network are the same, but the computation of SWS is obviously higher than that of SAS.
- Let's continue considering this point. Figure 3 raises a big question for me. The author doesn't express the dimensions of the weights corresponding to SWS and SAS. Although it intuitively looks like SAS has a clear acceleration in the graph, I think this is unreasonable because the author also says the number of mult count in SAS is the same as the base dense/narrow network. This greatly reduces my enthusiasm for this paper.
- The current presentation of experiments is very scattered. I think the author should provide a comparison of the full network’s computation, inference speed, and accuracy to give readers a clearer comparison. For instance, when compressing ConvNeXt-b, although the accuracy of SAS is higher than SWS, a comparison of SAS's operation count and inference time also needs to be provided.
Questions
Please see the weakness part.
Let's consider a specific example: the $l$-th layer of the base network consists of a convolution with $C_{in} = 288$ (the channel dimension is 32 and the kernel size is 3, i.e., $32 \times 3 \times 3 = 288$) and $C_{out} = 64$. When $M = 8$, we have $\mathrm{round}(\sqrt{M}\,C_{in}) = 815$ and $\mathrm{round}(\sqrt{M}\,C_{out}) = 181$. The mult count of the original dense layer (for a single pixel) is $288 \times 64 = 18432$, while the mult count of the weight-sparse layer (SWS) is 18440 ($815 \times 181 / 8 = 18439$ with remainder 3; we use $M = 3$ for the last chunk). The comparison of the layer shapes in this example case is summarized as follows:
| | input | output | FLOPS | weight shape | activation shape |
|---|---|---|---|---|---|
| base (dense) | 288 | 64 | 18432 | 64×288 | 288×H×W |
| SAS ($M$=8) | 288 → 2304 | 64 | 18432 | 64×2304 | 2304×H×W (1:8 sparse) |
| SWS ($M$=8) | 815 | 181 | 18440 | 181×815 (1:8 sparse) | 815×H×W |
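For readers who want to reproduce the arithmetic above, here is a small self-contained Python sketch (our own illustration; the names are ours, and the exact last-chunk handling in the paper may differ slightly):

```python
import math

def mult_counts(c_in=288, c_out=64, m=8):
    """Reproduces the per-pixel mult counts from the example above."""
    # Base dense layer: every weight contributes one multiplication.
    base = c_in * c_out

    # SAS: the activation is widened to m*c_in with 1:m sparsity, so only one
    # element per chunk of m is multiplied -> same count as the base layer.
    sas = (m * c_in) * c_out // m

    # SWS: widen both channel dimensions by sqrt(m), then keep 1 of every
    # m weights. The remainder chunk (here of size 3) keeps one extra weight.
    c_in_sws = round(math.sqrt(m) * c_in)    # 815
    c_out_sws = round(math.sqrt(m) * c_out)  # 181
    total_weights = c_in_sws * c_out_sws     # 147515
    sws = total_weights // m + (1 if total_weights % m else 0)

    return base, sas, sws

print(mult_counts())  # (18432, 18432, 18440)
```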
Issue 2) Configuration for speed benchmarking
For the wall-clock speed benchmarking reported in fig. 3, we use the opposite configuration: we change the sparsity pattern while keeping the matrix dimensions unchanged to report a more concrete comparison. The original manuscript did not emphasize this significant configuration difference, leading to a misunderstanding about the interpretation of the results. Furthermore, the symbols used to describe the matrix shape were easily confused with the ones used for specifying the sparsity pattern.
In the wall-clock speed benchmarking in fig. 3, we consider a matmul between a weight matrix and an activation matrix whose dimensions are identical for dense, SAS, and SWS. The mult count of the three variants is the same when one does not take sparsity into account. By utilizing the 1:2 fine-grained (semi-structured) sparsity on the Sparse Tensor Core, the mult count becomes half for SAS and SWS. We show the dimensions of the weight and activation used for the wall-clock benchmarking in the graph (we fix one dimension to 10240 and change the other from 10240 to 20480); these are the same for dense, SAS, and SWS.
Again, we want to emphasize that the scenario in this speed benchmarking (keeping the same matrix dimensions) is different from the scenario for neural networks (keeping the (almost) same mult count). The statement "the number of mult count in SAS is the same as the base dense/narrow network" does not refer to this speed evaluation experiment but to our main application scenario of SAS for neural networks (section 3, section 4).
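To make the benchmark configuration above concrete, here is a minimal dense-only timing sketch in PyTorch (our own illustration: the function name and shapes are assumptions, and it times plain torch.matmul rather than the 1:2 Sparse Tensor Core kernels actually used for fig. 3):

```python
import torch

def time_matmul(n_fixed=10240, k_values=(10240, 15360, 20480), reps=10):
    """Times a dense fp16 matmul for the benchmark shapes described above.

    Requires a CUDA-capable GPU; this only illustrates the matrix shapes we
    time, not the sparse kernels used for the actual fig. 3 measurements.
    """
    device = "cuda"
    results = {}
    for k in k_values:
        a = torch.randn(n_fixed, k, device=device, dtype=torch.float16)
        b = torch.randn(k, n_fixed, device=device, dtype=torch.float16)
        # Warm up so one-time overheads do not pollute the timing.
        for _ in range(3):
            torch.matmul(a, b)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(reps):
            torch.matmul(a, b)
        end.record()
        torch.cuda.synchronize()
        results[k] = start.elapsed_time(end) / reps  # milliseconds per matmul
    return results

if __name__ == "__main__":
    print(time_matmul())
```

The same shapes are then fed to the sparse kernels for the SAS and SWS measurements, so the dense result serves as the common baseline.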
Issue 3) Reason for the speed benchmarking configuration
As discussed above, we adopt different configurations to evaluate the speed in a fair setting between SWS and SAS. We can construct an SWS network having approximately the same mult count as the base dense network and the SAS network by adjusting the network width (as discussed in issue 1); however, the count is not precisely the same, and the matrix shapes differ (please refer to the table in issue 1), which makes it hard to evaluate the overhead specific to SAS that we are interested in (index-related operations). In contrast, we can precisely evaluate this by measuring the time of the matmul of same-sized matrices (section 2.3, supp. B).
To clarify the above point, we've made the following changes.
- Added the details about constructing the SWS network with a similar mult count as discussed above (section 2.3). We've also clarified in the experimental sections (section 3, section 4) that the mult count of the SWS network is not exactly the same as that of the dense or SAS network, but approximately the same.
- Moved the speed benchmarking to a separate paragraph and clarified the distinction between the two benchmark setups as discussed above (section 2.3).
- Changed the symbols used for the matrix size so that they no longer clash with the ones used for the sparsity pattern (section 2.3, fig. 3).
We really appreciate the insightful comments by N3vE. In particular, N3vE's comments helped us clarify a potential inaccuracy and confusion about the experimental configuration. We consider that the first three concerns kindly raised by N3vE come from essentially the same source of inaccuracy in the original manuscript regarding the comparison configuration of dense, SAS, and SWS: we compare accuracy by increasing the network width without increasing the mult count, while we compare wall-clock time by changing the sparsity pattern while keeping the same matrix size. We revised the manuscript as follows, and we are now confident that the revised manuscript and the following answers clarify the concerns.
It is inaccurate for the author to state that the computation of SWS and SAS are the same. I can agree that the computation of SAS and the original dense network are the same, but the computation of SWS is obviously higher than that of SAS.
The author doesn't express the dimensions of the weights corresponding to SWS and SAS.
Although it intuitively looks like SAS has a clear acceleration in the graph, I think this is unreasonable because the author also says the number of mult count in SAS is the same as the base dense/narrow network.
In this paper, we propose SAS to increase the network capacity and improve accuracy without increasing the actual mult count. Therefore, we compare against SWS in the same scenario (keeping the mult count constant while increasing the sparsity). Specifically, we consider a base layer consisting of a matmul between an activation with $C_{in}$ channels and a weight of shape $C_{out} \times C_{in}$ (eq. (1)). As N3vE explains, the proposed SAS increases the network width by $M$ times while maintaining the same mult count as the base layer by utilizing the 1:$M$ sparsity pattern. The sparsified activation has $M C_{in}$ channels (1:$M$ sparse) and the weight has shape $C_{out} \times M C_{in}$ (eq. (2)). SAS does not change the output channel dimension $C_{out}$.
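Schematically, using the shapes from the table in issue 1 (the notation below is ours and may differ slightly from eqs. (1) and (2) of the paper):

```latex
\[
\begin{aligned}
\text{base:}\; & y = W x, \quad W \in \mathbb{R}^{C_{out}\times C_{in}},\; x \in \mathbb{R}^{C_{in}},
  && \#\mathrm{mult} = C_{out} C_{in},\\
\text{SAS:}\; & y = W' S(x), \quad W' \in \mathbb{R}^{C_{out}\times M C_{in}},\; S(x) \in \mathbb{R}^{M C_{in}} \text{ (1:$M$ sparse)},
  && \#\mathrm{mult} = \frac{C_{out}\, M C_{in}}{M} = C_{out} C_{in}.
\end{aligned}
\]
```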
On the other hand, in the case of SWS, if one increases the network width by $M$ times and uses the 1:$M$ sparse pattern on the weight, the mult count of the resultant network increases by about $M$ times because both the input and output channels become $M$ times wider. We consider this to be the issue N3vE kindly points out.
In the case of SWS, by using a $\sqrt{M}$ times wider network with the 1:$M$ sparsity pattern, we can construct an SWS network that has roughly the same mult count as the base dense network and the SAS network (section 2.3). Thanks to the questions raised by N3vE, we found the original manuscript was unclear from three perspectives: 1) the explanation in the original manuscript ("$\sqrt{M}$ times wider input/output channels") was insufficient and needs more detail; 2) although we adopt the opposite benchmarking configuration for the speed benchmarking in fig. 3 (we keep the same matrix dimensions while changing the sparsity pattern), we failed to explain this difference clearly, which may confuse the reader about the interpretation of the speed benchmarking result (and also the accuracy benchmarking result); 3) the reason why we evaluate the speed using a matmul of same-shaped matrices was not clearly stated. We've updated the manuscript to clarify the above three issues as follows:
Issue 1) Configuration of SWS network
The original manuscript lacked the following detailed discussion. When the network is $\sqrt{M}$ times (instead of $M$ times) wider with the 1:$M$ sparsity pattern in the weight, the mult count will be the same as that of the base network. However, the network width needs to be an integer, and ideally it should also be a multiple of $M$. Therefore, we use a weight of shape $\mathrm{round}(\sqrt{M}\,C_{out}) \times \mathrm{round}(\sqrt{M}\,C_{in})$ for the 1:$M$ SWS network to get approximately the same mult count as the base network, where $\mathrm{round}(\cdot)$ is a rounding operator and $C_{in}$, $C_{out}$ are the input/output channel dimensions of the base dense network. For the last chunk, when its size does not equal $M$, we use a 1:$M'$ sparse pattern with $M'$ equal to the remaining chunk size (e.g., $M' = 3$ in the example below). In this way, we construct an SWS network whose mult count is as close to the base network as possible.
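As a sanity check on the $\sqrt{M}$ factor, here is a short derivation in our own notation (a sketch, not taken from the paper): scaling both channel dimensions of the base layer by a factor $s$ and applying 1:$M$ sparsity to the weight gives

```latex
\[
\#\mathrm{mult}_{\mathrm{SWS}}
  = \frac{(s\,C_{out})(s\,C_{in})}{M}
  \;\overset{!}{=}\; C_{out} C_{in}
  \quad\Longrightarrow\quad
  s = \sqrt{M}.
\]
```

With $M = 8$, $C_{in} = 288$, and $C_{out} = 64$, rounding $s\,C_{in}$ and $s\,C_{out}$ gives the widths 815 and 181 used in the example above.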
I think the author should provide a comparison of the entire network’s computation, inference speed, and accuracy to give readers a clearer comparison. For instance, when compressing ConvNeXt-b, although the accuracy of SAS is higher than SWS, a comparison of SAS's operation count and inference time also needs to be provided.
Relating to the above discussion, the mult counts of the SAS and SWS networks are (almost) the same in the neural-network experiments. In the case of ConvNeXt-b, the FLOPS of SAS and the base dense network is 3.85M, and the FLOPS of SWS is also 3.85M. We've added the mult count and the number of parameters in supp. D (tab. A2), including ResNet18.
We report the wall-clock speed benchmarking in fig. 3, demonstrating that SAS-based multiplication can speed up the matmul on actual hardware. The overhead of SAS over SWS is the online computation of the index (eq. 3), the reorder (supp. E), and the memory transfer of the index. Our experiment shows that it is only a fraction (1.5%) of the entire matmul in terms of wall-clock time. However, as N3vE points out, we have yet to benchmark the speed of a whole neural network. There are two reasons why we evaluate the wall-clock time using a matmul of same-sized matrices (fig. 3) instead of the entire network.
- It is hard to exactly match the mult count between the SWS and SAS networks, and the matrix shapes differ vastly between the two, making it difficult to evaluate the SAS-specific computation time (online index computation); please also refer to the discussion in the previous question. In contrast, we can precisely evaluate this by measuring the time of the matmul of same-sized matrices (section 2.3, supp. B).
- Due to the limited high-level API for utilizing the Sparse Tensor Core (e.g., cuSPARSELt), we need to develop a custom CUDA kernel using a low-level library (e.g., CUTLASS) to handle specific operations such as convolution or depth-wise convolution in order to evaluate the wall-clock time of a whole network.
We are now developing a custom CUDA kernel and will publish the source code. We expect slightly better results (in terms of the percentage of SAS-specific overhead) than those in fig. 3 with the custom CUDA kernel (section 2.3, section 6.1), because we can integrate the index computation, the reorder (supp. E), and the memory copy, eliminating unnecessary DRAM accesses.
Thanks to N3vE, we noticed that the above critical discussion about the wall-clock benchmarking was insufficient in the original manuscript. We've added the discussion in section 2.3.
Thanks for the author's response, which has successfully laid to rest my concerns. Consequently, I am increasing my rating. Nonetheless, I enthusiastically encourage the author to refine their current method and experimental description, as it indeed could easily lead to confusion, akin to what I previously experienced.
Thank you for your kind response, and we are glad that the response clarified the concerns.
I enthusiastically encourage the author to refine their current method and experimental description, as it indeed could easily lead to confusion akin to what I previously experienced.
We promise to keep updating the method (section 2) and experiment (section 3, section 4) parts of the manuscript until the very end of the rebuttal period.
Thank you again for helping us improve the manuscript by pointing out the crucial issues that led to the potential misunderstanding (the configuration for comparing SAS with SWS).
This paper proposes SAS, a method to exploit structured sparse activations in CNNs. SAS describes the approach to generate structured sparse activations, along with software and hardware implementations. Numerical experiments on image classification validate the efficacy in terms of accuracy.
Strengths
- The paper is written well, and technically sound.
- The studied problem is interesting and may deliver real impact on DNN speedup.
Weaknesses
- The terminology of structured sparsity is misleading. We typically refer to the N:M sparsity as "semi-structured sparsity" to distinguish it from standard structured sparsity, including disjoint group sparsity, overlapping group sparsity, and hierarchical sparsity.
- The realistic benefits of structured sparse activation are not clear. Although the topic is interesting, I am not sure what actual speedup gain such sparse activation can deliver to the community. The paper also seems to lack numerical results regarding speedup.
- The citation format is wrong. Please use \citep{} rather than \cite{} to cite references.
Questions
- What is the training cost to yield an SAS network compared to standard training?
Thanks for the insightful and sharp comments. In particular, the suggestion about the terminology of N:M sparsity helped us clarify and avoid unintended confusion about the scope and contribution of the proposed study. We are now sure that the revised manuscript incorporates the raised concerns.
The terminology of structured sparsity is misleading. We typically refer to the N:M sparsity as "semi-structured sparsity" to distinguish it from standard structured sparsity including disjoint group sparsity, overlapping group sparsity and hierarchical sparsity.
Thank you for the suggestion. We agree with W2qU that the term "semi-structured" describes the N:M sparsity well. We used the term "fine-grained structured" sparsity following NVIDIA's official whitepaper, and we use "structured" afterward for brevity (section 1). Thanks to W2qU's comments, we found the original manuscript was unclear on this point. We clarified the terminology in section 1 and added a note that this is also called "semi-structured" to give a clear distinction from standard structured sparsity.
The realistic benefits of structured sparse activation are not clear. Although the topic is interesting, I am not sure what actual speedup gain such sparse activation can deliver to the community. The paper also seems to lack numerical results regarding speedup.
We report the wall-clock speed benchmarking in fig. 3, demonstrating that SAS-based multiplication can speed up the matmul on actual hardware. The overhead of SAS over SWS is the online computation of the index (eq. 3), the reorder (supp. E), and the memory transfer of the index. Our experiment shows that it is only a fraction (1.5%) of the entire matmul in terms of wall-clock time. However, we've only evaluated the time of a single matmul (with varying sizes) and have not benchmarked the speed of a whole neural network. There are two reasons why we evaluate the wall-clock time using a matmul of same-sized matrices instead of the entire network.
- It is hard to exactly match the mult count between the SWS and SAS networks, and the matrix shapes differ greatly between the two, making it difficult to evaluate the SAS-specific computation time (online index computation). In contrast, we can cleanly evaluate this by measuring the time of the matmul of same-sized matrices (section 2.3, supp. B). This discussion also relates to N3vE's question; please refer to that response for a specific example of why this is the case.
- Due to the limited high-level API for utilizing the Sparse Tensor Core (e.g., cuSPARSELt), we need to develop a custom CUDA kernel using a low-level library (e.g., CUTLASS) to handle specific operations such as convolution or depth-wise convolution in order to evaluate the wall-clock time of a whole network.
We are now working on the development of the custom CUDA kernel and will publish the source code. We expect slightly better results (in terms of the percentage of SAS-specific overhead) than those in fig. 3 with the custom CUDA kernel (section 2.3, section 6.1), because we can integrate the index computation, the reorder, and the memory copy.
Thanks to W2qU's comment, we found that the original manuscript failed to discuss this important motivation for the wall-clock benchmarking. We've added the discussion in section 2.3 and section 6.1.
The citation format is wrong. Please use \citep{} rather than \cite{} to cite references.
Thank you for pointing out this. We've fixed all the citation formats.
Thanks for the responses. My concerns have been addressed. Therefore, I raised my rating.
Thank you for kindly giving us feedback on our response. We are glad to hear that the response clarified the concerns.
We would like to thank you again for your help in improving the manuscript.
This paper introduces Structured Activation Sparsification (SAS), a method that enhances the accuracy of wide neural networks without the additional computational cost typically associated with network width. By implementing structured sparsity within activation maps, where a fixed number of non-zero values is maintained within each group of consecutive activations, wide matrix multiplications can be simplified into narrow ones. Empirical results show that this method can improve accuracy (by up to 7% on CIFAR10) without increasing computational demands and outperforms similar sparsity approaches applied to network weights.
Strengths
The strengths of this work are listed below:
- This work introduces a novel method of structuring sparsity in activations, which appears to enable a reduction in computation without a corresponding drop in accuracy. The concept of using structured sparse projection (SAS) to maintain vectorization compatibility is particularly innovative.
- The structured sparsity allows for efficient matrix multiplication operations that maintain the number of multiplications at the level of the base dense/narrow network, highlighting efficiency in computation.
- The process for creating sparse activations through the structured sparse projection S is described as having negligible computational cost and wall-clock latency, indicating an efficient method that does not add significant complexity or processing time.
- A thorough evaluation is presented that demonstrates the SAS network's increased expressiveness compared to Static Weight Sparsity (SWS) networks, given the same computational budget, by using trajectory length analysis.
- The proposed SAS projection and its integration into the training process suggest a straightforward transformation of existing neural networks to increase their efficiency, which could be widely applicable across different network architectures and tasks.
Weaknesses
The weaknesses of this work are listed below:
- As mentioned in Section 2.3, while the computational load in terms of FLOPS remains the same for a given level of activation sparsity M, the memory requirements for the Sparse Activation Sparsification (SAS) network increase linearly with M. This is in contrast to the Sparse Weight Storage (SWS) network, which maintains a constant storage requirement for weights at inference time regardless of M. This could become a significant limitation for devices with limited memory or when scaling to very large networks.
- In Section 3, the paper discusses the expressive power of SAS networks by comparing trajectory lengths in a specific constructed neural network with 2-dimensional input and output. However, this analysis might not generalize to all network architectures or datasets, which could limit the understanding of the practical implications of SAS's increased expressive power.
- The straightforward method used for computing the index I might not be the most effective approach, particularly when neighbor elements oscillate around zero (Section 6.1). An indexing strategy that is not learned end-to-end may limit the model's capacity to adapt to the data's complexity, potentially leaving some performance on the table.
- In general, the listings (code pieces) in the paper are informative enough. However, I would suggest the authors replace the source code with some high-level pseudo code, which would be more readable and more accessible to some readers.
Questions
I do not have other questions. Please refer to the weakness column for my comments.
We appreciate the positive and encouraging feedback and insightful comments from C5D8. We answer each of the valuable comments as follows.
As mentioned in section 2.3, while the computational load in terms of FLOPS remains the same for a given level of activation sparsity $M$, the memory requirements for the Structured Activation Sparsification (SAS) network increase linearly with $M$. This is in contrast to the Structured Weight Sparsification (SWS) network, which maintains a constant storage requirement for weights at inference time regardless of $M$. This could become a significant limitation for devices with limited memory or when scaling to very large networks.
As C5D8 points out, in contrast to SWS, the memory requirement of SAS increases linearly with $M$ (section 2.3). This is indeed the main drawback of the proposed method (section 2.3, section 4.2). We expect it could be mitigated by using quantized weights (section 5), which we leave for future work. From a practical perspective, commercially available hardware (at the time of submission) supports 50% sparsity, so doubling the memory might be a practical choice, especially when FLOPS is critical, considering the significant accuracy gain from dense to $M=2$ (50% sparse). We clarified this point in section 2.3 and section 5.
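To make the trade-off concrete, here is a rough back-of-the-envelope sketch using the example layer from the response to N3vE ($C_{in}=288$, $C_{out}=64$, $M=8$); the variable names are ours and SWS index metadata is ignored:

```python
import math

def weight_storage(c_in=288, c_out=64, m=8):
    """Rough number of stored weight values for the example layer."""
    base = c_in * c_out                       # 18,432 dense weights
    sas = (m * c_in) * c_out                  # 147,456: grows linearly with m
    # SWS keeps roughly 1 of every m weights of the sqrt(m)-widened layer,
    # so its storage stays close to the dense baseline (index metadata aside).
    sws = round(math.sqrt(m) * c_in) * round(math.sqrt(m) * c_out) // m
    return base, sas, sws

print(weight_storage())  # (18432, 147456, 18439)
```

With the hardware-supported $M=2$, the weight memory of SAS only doubles, which is the practical operating point mentioned above.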
In section 3, the paper discusses the expressive power of SAS networks by comparing trajectory lengths in a specific constructed neural network with 2-dimensional input and output. However, this analysis might not generalize to all network architectures or datasets, which could limit the understanding of the practical implications of SAS's increased expressive power.
We agree with C5D8 about the interpretation of the trajectory length analysis. Still, we consider that the analysis gives important implications about the expressive power of the model that do not depend on the data, the model, or the many training details that affect accuracy (it often happens, for example, that model A performs better than model B under a specific optimizer, learning rate, and number of epochs, while model B beats model A under a different training scheme). The trajectory length analysis does not depend on these settings. In section 4, we present results using practical models and datasets, and we hypothesize that the increased accuracy obtained by adopting SAS is attributable to the increased expressive power, which conforms to the trajectory length analysis results. Thanks to C5D8's comments, we found that the above motivation for utilizing the trajectory length analysis was missing in the original manuscript (we only mentioned "more generic settings"); we've added the above discussion at the beginning of the trajectory length analysis (section 3).
The straightforward method used for computing the index I might not be the most effective approach, particularly when neighbor elements oscillate around zero (section 6.1). An indexing strategy that is not learned end-to-end may limit the model's capacity to adapt to the data's complexity, potentially leaving some performance on the table.
We agree on this point; however, the oscillation phenomenon happens only when $M>2$ and becomes severe when $M$ is large, e.g., $M>16$ (section 6.1). We consider the proposed simple index computation mechanism of eq. (3) to be one of the most effective approaches, at least for $M=2$, because there is no oscillation and it is computationally efficient. Thanks to C5D8's comment, we noticed that the original manuscript could be more precise on this point. We've added the above discussion in section 6.1.
In general, the listings (code pieces) in the paper are informative enough. However, I would suggest the authors replace the source code with some high-level pseudo code. This is more readable and more accessible to some readers.
We present the source code for inference to demonstrate that the SAS operation is realizable on actual commercial hardware (listing 1). We also show the source code for training to demonstrate that converting an existing dense layer to an SAS layer for training is simple and trivial (listing 2). We agree with C5D8's suggestion about the importance of a high-level explanation. Following the suggestion, we further clarified the training method, and we now consider listing 2 somewhat redundant; therefore, we moved it to the supplement for interested readers (and added more important discussion regarding the valuable comments from all the reviewers). Regarding listing 1 (the code for inference on actual hardware), we keep it in the main text to showcase that SAS is actually compatible with commercial hardware. However, thanks to C5D8's comments, we noticed that some details could be easier to follow; therefore, we've clarified the high-level explanation in sections 2.1-2.3 and supp. E, along with a visual description.
I have read through the rebuttal and I do not have other questions. I am positive about this work but I will not further increase my score. Thanks again to the authors.
Thank you for your kind feedback and encouraging comments. We are glad to hear that the response clarified the concerns.
We thank all the reviewers for their insightful feedback. We appreciate that all the reviewers agree on the novelty of the proposed concept of SAS, the utilization of structured sparsity in activations. We answered each valuable comment in the corresponding thread to clarify the reviewers' concerns.
We are now sure that the revised manuscript incorporates all the feedback. Please let us know if any discussion remains unclear. Also, we are happy to answer any additional questions.
Thank you again for the valuable feedback that helped make the paper clearer and stronger.
This paper introduces Structured Activation Sparsification (SAS), an innovative method to improve model accuracy without increasing computational costs. The core idea is to implement structured sparsity within activation maps, allowing simplification of wide matrix multiplications. Overall, reviewers acknowledge the novelty of this approach and its potential for efficiency gains. The concept of structured sparse projection to maintain vectorization compatibility is considered technically sound. Another strength noted is the straightforward integration of SAS into existing architectures.
However, concerns are raised regarding increased memory requirements and insufficient evidence generalizing SAS's benefits. One reviewer rightfully points out ambiguities around computational equivalence claims between SAS and static weight sparsity techniques. Some open questions also remain about the method's memory overhead and general applicability.
Considering the reviewers' scoring and rebuttal discussions, I would lean toward acceptance given the conceptual innovation presented. However, addressing the identified weaknesses through additional experiments, analyses, and clarified explanations is critical. Moreover, the authors should temper the claims around computational equivalence until fully substantiated.
Why not a higher score
All reviewers give lukewarm scores. Concerns are raised regarding increased memory requirements and insufficient evidence generalizing SAS's benefits. One reviewer rightfully points out ambiguities around computational equivalence claims between SAS and static weight sparsity techniques. Some open questions also remain about the method's memory overhead and general applicability.
Why not a lower score
I would recommend accepting this paper with major revisions focused on shoring up the identified issues.
Accept (poster)