Dual-Perspective Activation: Efficient Channel Denoising via Joint Forward-Backward Criterion for Artificial Neural Networks
Abstract
Reviews and Discussion
This paper introduces a novel activation mechanism, namely Dual-Perspective Activation (DPA), to identify irrelevant channels and thus apply channel denoising for the ANN architecture. The proposed DPA incorporates criteria established and updated online from both forward and backward propagation perspectives while preserving activation responses from relevant channels, which can effectively eliminate the interference of irrelevant or redundant features, thereby improving the overall performance of the neural network.
Moreover, the authors conduct extensive experiments across various mainstream ANN architectures and datasets, as well as multiple tasks and domains, where the results show (1) DPA is parameterless and fast, (2) DPA achieves remarkable performance compared to existing activation counterparts, and (3) DPA can be applied to other tasks and domains.
Strengths
- This paper is meticulously crafted, with a well-structured and clear presentation.
- The motivation originates from detailed and clear explorations of ANN activation mechanisms by the authors, which are valuable and deserve investigation.
- The idea of combining both forward and backward perspectives to track and evaluate the importance of each channel in real time sounds novel and valid.
- The theoretical foundations and experimental validations are detailed and sufficient.
- The proposed method can be applied to various other tasks and domains, including vision and non-vision tasks.
Weaknesses
- The authors have not provided a clear justification for why the intersection operation is the only suitable channel selection approach.
- While the authors claim in Lines 105-107 that each category shows a strong relation with its specific channels, additional evidence would be advantageous to support the second part of their claim, which states that the other channels should not generate any responses.
- The authors have not explored the impact of applying the DPA to different layers of the CNN or presented a more thorough analysis to justify their design choice.
- In Figure 6, the authors may not have thoroughly investigated the impact of the momentum parameter on the model's performance, and their choice of values may not have been well-justified.
Questions
- It would be helpful if the authors could explore and compare the performance of more complex channel selection techniques to determine the most appropriate method.
- In Lines 105-107, the authors claim that each category shows a strong relation with its specific channels, and ideally, the other channels should not generate any responses. Please provide additional evidence in support of the second half of the claim.
- In Lines 198-199, the authors apply the DPA to the last block in the CNN. Are the high-level semantics the main reason for this choice? Why is this approach taken? Please provide a detailed explanation.
- In Figure 6, the value sampling of momentum does not follow any particular pattern, and the prediction performance does not exhibit significant variations across different values. Please provide a detailed explanation.
Limitations
The paper has discussed the potential limitations and future directions of the research. Moreover, there is no societal impact from the work performed.
Q1: It would be helpful if the authors could explore and compare the performance of more complex channel selection techniques to determine the most appropriate method. (Corresponding to W1: The authors have not provided a clear justification for why the intersection operation is the only suitable channel selection approach.)
Thank you for your comment. Firstly, the proposed forward-backward criterion has been carefully designed, where the forward criterion follows the principle of threshold activation, and the backward criterion follows the principle of gradient attribution. Using an intersection of these two perspectives allows for more accurate judgments compared to relying solely on a single perspective. Moreover, we have compared the forward-backward intersection operation with forward only, backward only, and forward-backward union operations, respectively. The results in Sec. 4.3 & Tab. 3 indicate that the intersection operation performs the best. Therefore, the forward-backward intersection operation currently stands as the best and most reasonable channel selection technique.
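For concreteness, below is a minimal sketch of the intersection idea discussed above. It is illustrative only: the function name, thresholds, and per-channel statistics are our assumptions and deliberately omit the paper's online forward/backward memories.

```python
import torch

def dual_perspective_mask(pre_act, grad, fwd_thresh=0.0, bwd_thresh=1e-6):
    """Toy joint forward-backward criterion (illustrative sketch only).

    pre_act: (N, C, H, W) pre-activation responses of a layer
    grad:    (N, C, H, W) gradients of the loss w.r.t. those responses
    Returns a (C,) boolean mask of channels flagged as irrelevant by BOTH views.
    """
    # Forward view (threshold activation): channels whose mean response
    # does not rise above the activation threshold.
    fwd_irrelevant = pre_act.mean(dim=(0, 2, 3)) <= fwd_thresh
    # Backward view (gradient attribution): channels whose mean absolute
    # gradient contribution is negligible.
    bwd_irrelevant = grad.abs().mean(dim=(0, 2, 3)) <= bwd_thresh
    # Intersection: suppress a channel only when both perspectives agree.
    return fwd_irrelevant & bwd_irrelevant
```

Replacing `&` with `|` in the last line would correspond to the union variant compared in Tab. 3, and dropping one of the two masks would correspond to the forward-only or backward-only ablations.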
Q2: In Lines 105-107, the authors claim that each category shows a strong relation with its specific channels, and ideally, the other channels should not generate any responses. Please provide additional evidence in support of the second half of the claim.
Thanks for the constructive suggestion. Commonly used activation functions such as ReLU generally perform by setting an activation threshold, mimicking the membrane potential in biological neurons to suppress irrelevant signals while allowing useful signals to pass. During network learning, relevant signals tend to be above the threshold, whereas irrelevant signals tend to be below it, which is consistent with the principle of threshold activation. Fig. 2 indicates that each category is only correlated with sparse and specific channels in ANNs, and Fig. 3(a) shows that the activation distributions of certain channels, indicated by the red arrows, are truncated by the threshold and are concentrated in a small range above it. Therefore, we consider that these channels are potentially irrelevant and should not have any response. That is, the granularity of our claim of signal correlation is at the channel level. In order to support our point, we conducted a confirmatory experiment, as shown in Fig. 3(b), where the potentially irrelevant channels indicated by the red arrows were manually removed by forcing their responses below the activation threshold. As expected, the training accuracy substantially improved, providing evidence that these channels are indeed potentially irrelevant.
Q3: In Lines 198-199, the authors apply the DPA to the last block in CNN. Are the high-level semantics the main reason for this choice? Why is this approach taken? Please provide a detailed explanation.
Thanks for the comment. Yes, the high-level semantics are the main reason. Previous works have shown that high-level semantics in CNNs only exist in deep representations [1,2], and these features are crucial for downstream tasks. Features in the shallow layers of CNNs are too concrete, so imposing the regularization on a single channel (an element) there cannot directly control the change in the corresponding spatial feature map. Moreover, we have explored the impact of applying the DPA to different layers of ResNet-18 as follows, where L4 is the last layer:
| Constrained Layer(s) | Top-1 Acc / % | Constrained Layer(s) | Top-1 Acc / % |
|---|---|---|---|
| L1 | 76.2 | L1-3 | 76.3 |
| L2 | 76.0 | L3-4 | 76.7 |
| L3 | 76.0 | L2-4 | 76.6 |
| L4 | 76.8 | L1-4 | 76.8 |
The results indicate that applying the constraint to shallow layers has little effect, and constraining only the last layer already achieves the optimal result.
[1] Raghu et al., Do vision transformers see like convolutional neural networks? NeurIPS 2021
[2] Selvaraju et al., Grad-CAM: Visual explanations from deep networks via gradient-based localization. ICCV 2017
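As one possible illustration of restricting such a constraint to the last stage, the sketch below registers a forward hook on `layer4` of a torchvision ResNet-18 and captures its features; computing the paper's channel-level criterion or denoising penalty on the captured tensor is left abstract, and this is not necessarily how the authors implemented it.

```python
import torch
from torchvision.models import resnet18

model = resnet18()
captured = {}

def grab_last_stage(module, inputs, output):
    # Keep the last stage's feature map; a channel-level criterion or
    # denoising loss (as in the paper) could be computed on this tensor.
    captured["layer4"] = output

# Constrain only the last layer (L4 in the table above).
hook_handle = model.layer4.register_forward_hook(grab_last_stage)

x = torch.randn(2, 3, 224, 224)
logits = model(x)
feats = captured["layer4"]  # shape (2, 512, 7, 7): deep, semantic-level channels
```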
Q4: In Figure 6, the value sampling of momentum does not follow any particular pattern, and the prediction performance does not exhibit significant variations across different values. Please provide a detailed explanation.
We appreciate your comment. Theoretically, a high momentum value risks making the running mean unstable; conversely, a low momentum smooths the mean update but may cause it to lag behind. According to our analyses (see Appendix A.3) on the ViT-Tiny model trained on CIFAR-100, the network's performance is not sensitive to momentum values between roughly 0.2 and 0.99, among which 0.9 performs the best, while extremely small values lead to negative effects. Therefore, for the rest of the experiments presented in the paper, we empirically set the momentum to 0.9 without further tuning. As for the prediction performance not exhibiting significant variations across different values, we consider this a good thing: the model's performance with the proposed DPA is robust to changes in this hyperparameter, which indicates that a fixed momentum can be used across different models, datasets, and tasks, thereby reducing the adjustment costs.
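To make the role of the momentum concrete, here is a minimal EMA-style running-mean update. The convention (momentum weighting the incoming statistic) is an assumption chosen to match the description above, where a high momentum makes the mean unstable and a low one causes lag; the paper's exact formulation may differ.

```python
def update_running_mean(running_mean, current_value, momentum=0.9):
    """EMA-style online update of a per-channel statistic (illustrative).

    With this convention, a large momentum follows the latest batch closely
    (less smoothing, potentially unstable), while a small momentum smooths
    heavily but lags behind the current statistics.
    """
    return (1.0 - momentum) * running_mean + momentum * current_value
```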
I appreciate the authors' detailed responses. They have adequately addressed my concerns regarding the selection of low-relevance channels, the DPA's application, and the momentum parameter. The idea of denoising activated channels from both forward and backward perspectives is interesting and inspiring. Hence, I'm delighted to keep my positive rating and recommend acceptance.
Dear Reviewer 4FwJ,
We sincerely appreciate your recognition of our efforts to address your concerns, as well as your positive rating and recommendation for acceptance.
Thank you once again for your time and effort in reviewing our paper.
Best regards,
Authors of Paper #1829
Artificial neural networks apply the principles of the human brain and leverage sparse representations to their advantage. However, existing activation methods still struggle to suppress irrelevant signals, affecting the network's decision-making. To overcome this, a novel Dual-Perspective Activation (DPA) mechanism is proposed, which efficiently identifies and denoises channels with weak correlation through an online updating criterion from both forward and backward perspectives while not affecting the activation response of other channels with high correlation. Extensive experiments demonstrate that DPA facilitates sparser neural representations and outperforms existing activation methods, and it applies to various model architectures across multiple tasks and domains.
Strengths
- The proposed DPA method starts from a novel joint forward-backward perspective. By designing forward and backward memories to track each channel's historical response and gradient, a criterion is established to identify irrelevant channels and denoise them. The design of the forward-backward criterion is smooth and natural, adhering to the principles of activation response and gradient attribution.
- The idea is interesting and presented coherently. The method design is smooth and natural. The literature review on activation and gradient attribution is comprehensive. The method achieves the desired sparse representations and performance gains.
- The paper is well-organized. The problem, motivation, methodology, and results descriptions are easy to understand.
- This method improves the model's performance from an interpretable perspective and is versatile across various model architectures (CNN, Transformer, MLP, etc.) and different tasks and domains.
Weaknesses
- Some claims are not entirely clear. For the channel activation distribution observed in Figure 3(a), providing more detailed descriptions of the potentially irrelevant channels pointed out by the red arrows could further improve the clarity of the paper.
- Some figures are unclear: for Fig. 3(b), the legend "manually removing irrelevant channels" needs more explanation. The authors do not explain how exactly the irrelevant channels are manually removed.
- The benefits of sparse representation are not convincing enough. Considering that sparse representation may not always be beneficial, additional discussion on the pros and cons of sparsity for this work could help strengthen the main claim of the paper.
Questions
- Please explain why the channels pointed out by red arrows in Fig. 3(a) are potentially irrelevant channels, in order to enhance readers' understanding. Including examples or case studies where similar channels were found to be irrelevant in other contexts might also reinforce the explanation.
- Explain the exact procedure utilized for manually removing the irrelevant channels. Is it by forcing the irrelevant channels in Fig. 3(a) to zero, or by some other way?
- Does the ability of the forward-backward criterion to make judgments get impacted by the model's performance? Additionally, when the judgment is suboptimal, will it influence the model's learning trajectory?
- This paper achieves sparsity of features/representations, which is good. However, in some cases, sparse representation may not be the most suitable option. For example, if the data is not sparse, using sparse representation may lead to loss of information, impeding the model from learning complex patterns. Is sparse representation consistently beneficial for this study? Are there more factors that need to be taken into account?
Limitations
The authors include a detailed section on the limitations of the work.
Q1: Please explain why the channels pointed out by red arrows in Fig. 3(a) are potentially irrelevant channels, in order to enhance readers' understanding. Including examples or case studies where similar channels were found to be irrelevant in other contexts might also reinforce the explanation.
Thank you for your constructive suggestion. Commonly used activation functions such as ReLU generally perform by setting an activation threshold, mimicking the membrane potential in biological neurons to suppress irrelevant signals while allowing useful signals to pass. During network learning, relevant signals tend to be above the threshold, whereas irrelevant signals tend to be below it, which is consistent with the principle of threshold activation. Fig. 2 indicates that each category is only correlated with sparse and specific channels in ANNs, and Fig. 3(a) shows that the activation distributions of certain channels, indicated by the red arrows, are truncated by the threshold and are concentrated in a small range above it. Therefore, we consider that these channels are potentially irrelevant and should not have any response. That is, the granularity of our claim of signal correlation is at the channel level. In order to support our point, we conducted a confirmatory experiment, as shown in Fig. 3(b), where the potential irrelevant channels indicated by the red arrows were manually removed by forcing their responses below the activation threshold. As expected, the training accuracy substantially improved, providing evidence that these channels are indeed potentially irrelevant. We will expand upon the discussion of this section in the final revised version.
Q2: Explain the exact procedure utilized for manually removing the irrelevant channels. Is it by forcing the irrelevant channels in Fig. 3(a) to zero, or by any other way?
Sorry for the confusion. Channels whose mean response distribution is lower than the response threshold are considered potentially irrelevant and are highlighted by red arrows in Fig. 3(a). The procedure used to manually remove the irrelevant channels is to force the responses of the channels indicated by red arrows below the threshold. In the paper, the activation threshold is set at zero, so we forcibly bring the responses of these irrelevant channels down to zero as well. We will include this detail in the final revised version.
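A minimal sketch of this confirmatory procedure is given below, assuming (N, C, H, W) feature maps and a zero activation threshold; the function name and the per-channel-mean selection rule are simplifications of what Fig. 3 actually plots.

```python
import torch

def manually_remove_irrelevant_channels(pre_act, threshold=0.0):
    """Force potentially irrelevant channels to produce no response.

    pre_act: (N, C, H, W) responses before the activation.
    Channels whose mean pre-activation response falls below the threshold are
    flagged and their responses are forced down to the threshold (zero here),
    so a ReLU-like activation then outputs exactly zero for them.
    """
    channel_mean = pre_act.mean(dim=(0, 2, 3))   # per-channel mean response
    irrelevant = channel_mean < threshold        # flagged channels (red arrows)
    out = pre_act.clone()
    out[:, irrelevant] = threshold               # force to zero response
    return out
```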
Q3: Does the ability of the forward-backward criterion to make judgments get impacted by the model's performance? Additionally, when the judgment is suboptimal, will it influence the model's learning trajectory?
Thanks for the insightful comment. The channel denoising loss is maintained as a weak constraint throughout the training process.
The responses at the beginning of training indeed exhibit considerable noise and lead to suboptimal judgments, and setting channel denoising to a high intensity during the initial stages can indeed affect convergence. Hence, we adopt the channel denoising loss as a weak regularization term, adjusting its weight to keep it an order of magnitude smaller than the main task loss. This strategic approach ensures that the task loss remains dominant, effectively dictating the range of the feature distribution. As long as the task objective is learned much faster than the denoising constraint, the model's performance is not affected by early suboptimal judgments, and the judgments become more accurate as the model's performance improves.
We also considered alternatives, such as activating the denoising loss only after a certain number of iterations or gradually increasing its weight from zero at the beginning of training. However, we observed that foregoing these steps did not yield negative effects during initial training. This observation aligns with our use of a warm-up learning rate scheduler, which also serves to alleviate this concern to a certain extent.
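In equation form, the weighting scheme described above can be sketched as follows; the symbol names are ours and do not reproduce the paper's notation.

```latex
% Hedged sketch of the loss weighting: the channel-denoising term is kept
% roughly an order of magnitude weaker than the main task loss.
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{task}} + \lambda \, \mathcal{L}_{\mathrm{denoise}},
\qquad
\lambda \, \mathcal{L}_{\mathrm{denoise}} \;\lesssim\; 0.1 \, \mathcal{L}_{\mathrm{task}} .
```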
Q4: This paper achieves sparsity of features/representations, which is good. However, in some cases, sparse representation may not be the most suitable option. For example, if the data is not sparse, using sparse representation may lead to loss of information, impeding the model from learning complex patterns. Is sparse representation consistently beneficial for this study? Are there more factors that need to be taken into account?
Thanks for the interesting comment. Yes, sparse representation is the goal of this study, and it does bring a lot of benefits, which can be directly supported by the confirmatory experiment, as shown in Fig. 3(b). By manually removing the irrelevant channels for each category (which is equivalent to forcibly constructing sparse representation), there is a substantial improvement in the training accuracy. Moreover, the observation in Fig. 2 indicates that each category is only correlated with sparse and specific channels in ANNs. The proposed DPA effectively denoises redundant channels while not affecting relevant channels, resulting in sparse representation and performance enhancements. All of these collectively validate that the role of sparse representation in this study is to eliminate redundant signals while preserving key signals, rather than causing the loss of information.
The authors argue that sparsity in the activations is a desirable property and should be enforced. They observe that there exist category-specific channels in the network's activations which have a high value only for specific categories while other channels remain low. These low activations are considered noise that impairs performance and should be removed. To this end, the DPA method is introduced, which tracks, for a given category, the per-channel activation intensity (and gradient magnitudes) and suppresses channels with low intensity. The resulting DPA activation is used as a replacement for activations of transformer-based models and in the last block of CNN-based models. Experiments show consistent improvements over other activation functions, especially in transformers.
Strengths
- Substantial effect for transformers, potentially impactful.
- Clear presentation for most parts.
- Code is provided (although I could not find the experiment configurations).
Weaknesses
- The key results reported in Tab. 1 and Tab. 2 were obtained using the authors' own evaluation. On ImageNet-1K the respective original papers report accuracies of 75.1 (vs. 75.2 DPA) for PVT-Tiny and 81.5 (vs. 77.8 DPA) for TNT-small. ResNet18 achieves a score of 71.5 (vs. 70.8 DPA) with an improved training procedure [1].
- The effect sizes are small. For smaller datasets (CIFAR and ImageNet-100) standard deviations over multiple training runs should be reported for DPA.
- Sometimes the paper writing is unclear:
  - Sec. 4.1 "can replace all existing activations in each block": Does this mean that the softmax activation is replaced?
- Given the substantial differences between biological neurons and ANNs, I don't think one can conclude from the observation of sparse activations in biological networks that this is necessarily also a desirable property for ANNs.
Questions
My main concern regards the fairness of the experiments, in particular the degree to which the good performance of DPA is due to the choice of the hyperparameters. How robust is DPA when other hyperparameters are chosen? The authors state that the weighting parameter of the DPA loss was varied depending on network and dataset. Was this parameter selection conducted on the test set (Fig. 7 from the appendix suggests this, as the value used is the one reported in Tab. 1)? Were parameters of the other activation functions also varied? As mentioned in weaknesses, in some cases (I only checked a subset) the reported scores of DPA are lower than the corresponding ReLU-based state-of-the-art. Does DPA consistently improve scores here, too? One way to demonstrate this would be taking a somewhat state-of-the-art ImageNet model like the ResNet18 using the training from [1] and showing that DPA improves performance here, too.
Limitations
- Limited to settings where categories are present during training. Hence, self-supervised training is not supported. This could be made clearer in the text.
Q1: (A) How robust is DPA when other hyperparameters are chosen? The authors state that the weighting parameter of the DPA loss was varied depending on network and dataset. (B) Was this parameter selection conducted on the test set? (C) Were parameters of the other activation functions also varied?
We appreciate your valuable feedback and apologize for any confusion.
(A) The hyperparameters (the momentum and the loss weight) are robust when their values are not extreme (please refer to paper Sec. A.3 & Fig. 6-7). DPA consistently brings performance improvements when the loss weight is small, whereas too large a weight can result in negative side effects. In fact, for most experiments (CaiT & PVT on CIFAR-100, and all the models on CIFAR-10 & ImageNet-{100,1K}), since hyperparameter selection is quite time-consuming and also for fair comparison, we did not tune the loss weight but empirically used the default value of 1 (see Lines 484-485). If we were to conduct a hyperparameter search for each experiment, the performance of our method might be further improved.
(B) We regret any confusion caused by the misleading title "Hyperparameter Selection" for Sec. A.3. It should have been correctly referred to as "Impact of Hyperparameters on Model Performance". In fact, the original intention of Sec. A.3 was to analyze the impact of hyperparameters on model performance for a few specific models and datasets on the test set, rather than performing hyperparameter selection for each experiment. Most experiments have their hyperparameters set uniformly (loss weight = 1) based on the aforementioned analysis (see Lines 484-485 & Fig. 6-8). Please also refer to (A) and the Global Response above.
(C) ReLU, GELU, and Softplus do not contain parameters, and the parameters in ELU, SELU, SiLU, and GDN are trainable.
Q2: ... in some cases the reported scores of DPA are lower than the corresponding ReLU-based state-of-the-art. Does DPA consistently improve scores here, too? One way to demonstrate this would be taking a somewhat state-of-the-art ImageNet model like the ResNet18 using the training from [1] ...
Thanks for the constructive comment.
For training Transformers, the baseline models and the training settings on ImageNet-1K were taken from timm. Due to our limited computing resources, we used a smaller batch size and fewer GPUs than previous literature during training, which may account for the differences between our results and those previously reported. However, to maintain fairness in comparison, we ensured that each experiment within a set of comparative studies used the same public training settings.
For training CNNs, our baselines' performance already matches that reported in the original literature. We also maintained fairness in comparison by using the same public training settings in each set of comparative studies. Additionally, we used the improved training procedure in [1] and trained ResNet-18 on our local devices. The results are as follows, which also showcase DPA's effectiveness.
| Top-1 Acc / % | ResNet-18 |
|---|---|
| ReLU | 71.4 |
| GELU | 71.2 |
| DPA | 71.8 |
W1: The key results reported in Tab.1 and Tab.2 were conducted using the author's own evaluation ...
Please refer to Q2 above.
W2: ... For smaller datasets (CIFAR and ImageNet-100) standard deviations over multiple training runs should be reported for DPA.
We are sincerely sorry for the confusion. Actually, the numerical results in the paper are averages over 3 random seeds (42, 0, 100), and the results are stable. For example, here are the detailed results of training ViT-Tiny and ResNet-18 on CIFAR-100:
| Top-1 Acc / % | ViT-Tiny | ResNet-18 |
|---|---|---|
| ReLU | 65.9, 65.4, 65.7 | 75.7, 75.8, 75.7 |
| GELU | 65.4, 65.4, 65.5 | 75.5, 75.6, 75.6 |
| DPA | 70.8, 70.2, 70.6 | 76.9, 76.6, 76.7 |
We will include standard deviations in the revised version.
W3: Sec. 4.1 "can replace all existing activations in each block": Does this mean that the softmax activation is replaced?
We apologize for the confusion. The softmax is not replaced. We will revise it to "can replace all existing activations (except for the softmax) in each block".
W4: ... I don't think one can conclude from the observation of sparse activations in biological networks that this is necessarily also a desirable property for ANNs.
Thanks for the interesting comment. We would like to gently clarify that our "conclusion" (or rather "conjecture") regarding "sparse activation is a desirable property for ANNs" was not solely derived from the observation of sparse activations in biological networks. Instead, we supported our conjecture through relevant literature and extensive experiments, demonstrating that sparse activations do provide advantages to ANNs:
- Connections within biological networks are sparse (our paper Lines 3-4 & 31-33). ANNs were originally designed to mimic biological networks. Sparse representation in ANNs has also shown notable benefits for network interpretability and generalization [1-3] (our paper Lines 4-5 & 33-34). Notably, the original ReLU paper [3] followed a similar line of reasoning: inspired by biological sparse activations, it designed the ReLU and demonstrated the advantages of sparsity in ANNs.
[1] Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
[2] Sparse feature learning for deep belief networks. NIPS, 2007.
[3] Deep sparse rectifier neural networks. AISTATS, 2011.
- Sparse activation is the focus of our study, offering numerous benefits to ANNs. Directly supported by the confirmatory experiment in Fig. 3(b), forcibly constructing sparse activations for each category enhances training accuracy. The proposed DPA denoises redundant/irrelevant channels, leading to sparse activations and performance gains. These findings collectively validate the beneficial role of sparse activations in ANNs.
Thank you for the detailed answers.
Q1: To me this still seems like hyperparameter tuning on the test set. However, since it only affects CIFAR-100 and ViT-Tiny this is a minor issue. While the momentum and the threshold are constant across all experiments, the loss weight is 1 only most of the time: it should be explicitly stated in which cases it is not 1.
Q2: Thank you for evaluating on RN18; this result strengthens the paper. Still, effect sizes are small for CNNs, and sometimes performance is lower than state-of-the-art (e.g., for PVT-Tiny and TNT-small, see weaknesses in the original review). This suggests that the training protocol used is sub-optimal for some architectures.
W1...W3: I appreciate your clarification.
W4: In the introduction you write "Multi-Layer Perception [...] closely resembles biological networks". While there are parallels, I find this is overstated. To motivate the work, I suggest to clearly emphasize the empirical findings over biological plausibility.
I will increase my score to 5.
Dear Reviewer YhhA,
We are deeply grateful for your constructive feedback and support. In line with your suggestions, our final revised manuscript will include more detailed clarifications on the hyperparameter and how we ensured fairness in comparison. Additionally, we will place greater emphasis on the empirical findings while presenting a more measured discussion on biological plausibility. We agree that this will strengthen the motivation and align it more closely with accepted perspectives in the field.
Thank you once again for your valuable insights and guidance. We are confident that, with your support, the quality of our final revised version will be significantly enhanced.
Best regards,
Authors of Paper #1829
This paper addresses the question of how to suppress inputs from irrelevant channels in a deep neural network. The authors develop a novel end-to-end trainable mechanism called Dual-Perspective Activation (DPA) to suppress irrelevant information and increase sparsity. The method is parameter-free, and increases performance across tasks and domains relative to baselines without this mechanism. Networks with DPA activation layers uniformly outperform standard activation functions (ReLU, GELU, SELU, GDN, etc.) across a wide range of datasets (Cifar{10,100}, ImageNet{100,1K}), and architectures (VITs, CNNs). Ablation studies show that the forward and backward components contribute to the effectiveness of DPA, with minimal computational overhead.
Strengths
The paper does an excellent job motivating the design of a novel activation function and demonstrating its prowess across a range of ANN architectures (ViTs, CNNs, GNNs) and datasets. Across the board DPA shows improvements in top-1 accuracy, though the evaluations were limited to these particular (albeit ubiquitous) benchmarks, somewhat limiting the broader relevance/impact of the work.
The ability to outperform existing activation functions across the board suggests potential for high impact in the field of computer vision, and other areas of AI/ML (as briefly demonstrated), given that activation functions are a fundamental component in all ANN models.
Weaknesses
The paper convincingly demonstrates the effectiveness of DPA for image classification (with limited presentation of generalization to node and text classification), but the work seems lacking in clear demonstration of relevance/impact beyond these benchmark assessments (i.e., less clear how this might impact AI/ML theory, or impact cogsci/neuroscience applications of these models). That said, the noted sparsification of representations suggests there's a clear thread to follow up on for these more theory-focused areas.
Minor: the paper does not distinguish between sparse connectivity and “activation sparsity”, and doesn’t distinguish between “population sparsity” (only a few neurons active at any given time) and “lifetime sparsity” (a given neuron fires rarely, only for a small percentage of input images). These are often confused (or not distinguished) in the ML and Neuro literatures, and might be worth being clear about which one you are referring to (it seems like category-level lifetime sparsity?).
Minor: no mention of dropout, which directly impacts activation sparsity
Minor: How did you identify and manually remove irrelevant channels? (Figure 3)? Are these just the low activations? Would a channel norm on this layer have the same effect/benefit?
Questions
Minor: Have you compared DPA to other regularization methods that lead to greater sparsity?
Limitations
OK
W1: ... but the work seems lacking in clear demonstration of relevance/impact beyond these benchmark assessments (i.e., less clear how this might impact AI/ML theory, or impact cogsci/neuroscience applications of these models). That said, the noted sparsification of representations suggests there's a clear thread to follow up on for these more theory-focused areas.
Thank you for your thoughtful review and valuable feedback on our paper. Here is our discussion on how our work might impact AI/ML theory or cogsci/neuroscience applications of ANN models:
- Impact on AI/ML theory: DPA provides an interpretable solution to channel selection at the AI/ML theoretical level, which helps open up the "black box" of deep networks. Especially from the backward perspective, DPA aligns well with the principles of theory-based backward gradient attribution: it identifies irrelevant channels based on gradients, uncovers the contribution of each channel to the model's final decision, and achieves more precise feature selection. This differs from activation functions that rely only on forward propagation, so DPA can provide new insights into the design theory of activation mechanisms. Additionally, by effectively inhibiting irrelevant channels, DPA not only improves model accuracy but also promotes sparser neural representations, which is closely related to research in sparse coding and compressive sensing.
- Impact on cogsci/neuroscience applications: The ANN design is strongly influenced by the working patterns of the human brain. The proposed DPA takes inspiration from the sparse activation pattern observed in biological neural networks. This indicates that, by simulating the characteristics of biological neural networks, we could potentially make up for the imperfections of current ANNs. Future research might investigate the use of neuroimaging and computational neuroscience methods to gain new insights into simulating the properties of biological neural networks.
W2: Minor: the paper does not distinguish between sparse connectivity and “activation sparsity”, and doesn’t distinguish between “population sparsity” (only a few neurons active at any given time) and “lifetime sparsity” (a given neuron fires rarely, only for a small percentage of input images). These are often confused (or not distinguished) in the ML and Neuro literatures, and might be worth being clear about which one you are referring to (it seems like category-level lifetime sparsity?).
Thank you for your constructive comment. We apologize for not clearly distinguishing between the terms "sparse connectivity" and "activation sparsity", and between "population sparsity" and "lifetime sparsity". As you said, the correct terms of sparsity that our paper aimed to achieve are "activation sparsity" and "category-level lifetime sparsity". We will correct it appropriately in the final revised version.
W3: Minor: no mention of dropout, which directly impacts activation sparsity.
Thank you for your constructive suggestion. Here, we present additional experiments comparing DPA with Dropout, which directly impacts activation sparsity:
| Top-1 Acc / % | ViT-Tiny | ResNet-18 |
|---|---|---|
| ReLU+Dropout (ratio=0.1) | 67.0 | 76.2 |
| ReLU+Dropout (ratio=0.2) | 67.6 | 76.1 |
| ReLU+Dropout (ratio=0.5) | 65.9 | 76.2 |
| GELU+Dropout (ratio=0.1) | 67.3 | 76.0 |
| GELU+Dropout (ratio=0.2) | 67.2 | 76.1 |
| GELU+Dropout (ratio=0.5) | 65.5 | 75.8 |
| DPA | 70.5 | 76.8 |
The results suggest that Dropout has a limited impact on improving performance. One possible reason is that randomly forcing activation responses to zero during training cannot effectively transfer knowledge to the testing phase.
W4: Minor: (A) How did you identify and manually remove irrelevant channels? (Fig. 3)? Are these just the low activations? (B) Would a channel norm on this layer have the same effect/benefit?
We are sincerely sorry for the confusion.
(A) Channels whose mean response distribution before the activation is lower than the activation threshold are considered potentially irrelevant and are highlighted by red arrows in Fig. 3(a). The procedure utilized for manually removing the irrelevant channels in Fig. 3(b) is forcing the responses of irrelevant channels indicated by red arrows below the threshold. In the paper, the activation threshold is set at zero; thereby, we forcibly bring the responses of these irrelevant channels down to zero (not low activations) as well.
(B) Fig. 3(b) is a confirmatory experiment on the training set. Forcibly setting responses from irrelevant channels to zero on the training set cannot generalize the knowledge to the testing set. Therefore, to address this challenge, we proposed channel denoising (channel norm) during training, allowing the network to learn how to reduce the response from irrelevant channels in the testing stage. Through this approach, we expect to approximate the way of manually removing irrelevant channels in Fig. 3(b) as much as possible.
Q1: Minor: Have you compared DPA to other regularization methods that lead to greater sparsity?
Thank you for your constructive suggestion. Yes, our paper compares Softplus, SiLU, ReLU, and GELU, which can result in sparse activations. In addition, we also compare DPA to Dropout, which leads to greater sparsity. Our response to W3 presents the results of this comparison, showing that DPA performs better than Dropout.
Global Response
We thank all the reviewers for their time and constructive feedback, providing us with valuable insights into the areas that require improvement. We have meticulously addressed each reviewer's concerns through our comprehensive responses. In this global response, we would like to reiterate the important and common issues raised by the reviewers.
Fairness in comparison
To maintain fairness in comparison, we have strictly ensured that each experiment in a set of comparative studies used the same public training settings. The numerical results on Transformers, CNNs, and other architectures can fairly demonstrate the effectiveness of our proposed DPA method.
Impact of hyperparameters
For the proposed DPA method, the impact of its hyperparameters (the momentum, the loss weight, and the activation threshold) on CIFAR-100 with ViT-Tiny is presented in the original paper Sec. A.3 and Fig. 6-8. The threshold has been verified to be optimal at zero. The network's performance is not sensitive to the momentum and the loss weight when their values are not extreme.
The original intention of Sec. A.3 was to analyze the impact of hyperparameters on model performance. We apologize for the confusion caused by the misleading title of Sec. A.3. We would like to clarify that we did not select hyperparameters for each experiment, as this process is quite time-consuming. This also highlights the fairness in comparison. The consistent performance gains brought by our method also indicate the insensitivity/generalizability of the hyperparameter values. Here is our detailed explanation:
- For the momentum hyperparameter, we have only analyzed it on CIFAR-100 with ViT-Tiny and found that a value of 0.9 performed the best. Consequently, for the rest of the experiments presented in the paper, we empirically set the momentum to 0.9 without further tuning.
- For the loss-weight hyperparameter, we have only analyzed it on CIFAR-100 with ViT-Tiny, DeiT-Tiny, and TNT-Small. As shown in Fig. 7 of the original paper, too large a weight can result in negative side effects, and one thing is for sure: smaller values do not hurt accuracy. Therefore, for the majority of our experiments (CaiT & PVT on CIFAR-100, and all the models on CIFAR-10 & ImageNet-{100,1K}), we did not tune the weight but empirically used the default value of 1 (see our paper Lines 484-485).
- For the threshold hyperparameter, it has been verified to be optimal at zero in the original paper. Therefore, we set the threshold to 0 across all the experiments.
Notably, the performance of the proposed DPA method might be further improved if we were to conduct a hyperparameter search on the momentum and the loss weight for each experiment.
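For reference, the defaults described above can be summarized as follows; the key names here are illustrative placeholders rather than the paper's own notation.

```python
# Default DPA hyperparameters as stated in the responses above (names are ours).
DPA_DEFAULTS = {
    "momentum": 0.9,      # update momentum of the forward/backward memories
    "loss_weight": 1.0,   # weight of the channel-denoising loss
    "threshold": 0.0,     # activation threshold, verified optimal at zero
}
```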
Benefits of sparse activations
In this paper, we first conducted experiments (see our paper Fig. 2) to validate that each category is only correlated with sparse and specific channels in ANNs. Combined with the patterns of the activation response distribution (Fig. 3(a)), we conjectured that constructing sparse activations based on the above observations might be beneficial for ANNs in eliminating irrelevant/redundant features. Subsequently, we supported our conjecture through a review of relevant literature and extensive experiments, demonstrating that sparse activations do provide advantages to ANNs.
The sparse activation is the goal of this study, which can be directly supported by the confirmatory experiment, as shown in Fig. 3(b). By manually removing the irrelevant channels for each category (which is equivalent to forcibly constructing sparse activations), there is a substantial improvement in the training accuracy. Moreover, the observation in Fig. 2 indicates that each category is only correlated with sparse and specific channels in ANNs. The proposed DPA effectively denoises redundant channels while not affecting relevant channels, resulting in sparse representation and performance enhancements. All of these collectively validate the beneficial role of sparse activations in ANNs.
We are committed to upholding a higher standard in our revised manuscript.
Finally, we would like to express our sincere gratitude again to all the reviewers for their meticulous review and constructive feedback on our manuscript, which will undoubtedly enhance the quality of our work.
The reviewers find the paper well motivated and well written with a clearly novel contribution. The experimental results are clear and convincing. The effect sizes are small for convnets, meaning the benefit is potentially specific to the architecture chosen (bigger improvements for Transformers). The impact is also limited somewhat by showing improvements only on relatively basic image classification tasks (CIFAR, ImageNet). Overall it is a very solid paper that makes a good poster presentation but doesn't quite provide enough expected impact to qualify as a spotlight.