PaperHub
Rating: 5.3 / 10 — Rejected (4 reviewers)
Individual ratings: 6, 3, 6, 6 (min 3, max 6, std 1.3)
Confidence: 3.0 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 2.0
ICLR 2025

Balancing Model Efficiency and Performance: Adaptive Pruner for Long-tailed Data

Submitted: 2024-09-28 · Updated: 2025-02-05


Keywords
Long-tail learning, Neural network pruning, Multi-objective Optimization

Reviews and Discussion

Official Review (Rating: 6)

The paper presents Long-Tailed Adaptive Pruner (LTAP), a novel approach designed to enhance model efficiency while addressing the challenges posed by long-tailed data distributions. LTAP introduces multi-dimensional importance scoring and a dynamic weight adjustment mechanism to prioritize the pruning of parameters in a manner that safeguards tail class performance. The method incorporates a unique voting mechanism (LT-Vote) to adjust the importance of parameters based on classification accuracy across different classes. The authors report substantial improvements in computational efficiency and classification accuracy, particularly for tail classes, on various benchmark datasets, including CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018.

Strengths

  1. LTAP addresses limitations in conventional pruning approaches for long-tailed distributions. The LT-Vote mechanism and multi-stage pruning offer a unique way to balance model efficiency and tail class performance.

  2. The paper provides a solid theoretical analysis, justifying the need for specialized parameter allocation in long-tailed distributions.

  3. The authors have provided the code for reproducing the results reported in the manuscript, which is commendable. However, I recommend adding a more detailed introduction in the README file to facilitate easy execution of the code.

Weaknesses

  1. The proposed method is similar to post-hoc correction in Logit Adjustment [1]. I recommend including Logit Adjustment in the baseline comparison.

  2. Although the manuscript has theoretically proven that over-parameterization benefits the tailed classes, please conduct preliminary experiments to empirically validate this claim.

  3. The dynamic nature of LTAP incurs additional computational overhead.

[1] Menon et al., "Long-tail learning via logit adjustment." https://arxiv.org/pdf/2007.07314

Questions

  1. What are the negative effects of model pruning on the head classes?

  2. How many iterations are required for the model parameters weight selection in Figure 1?

  3. Over-pruning can significantly degrade model performance. What criterion does LTAP use to determine the stopping point for the model parameters selection?

Comment

Q1: Comparison with the Baseline

We have included comparisons with the Logit Adjustment baseline for your reference. Thank you for your suggestion. As a method that considers long-tailed distributions, Logit Adjustment exhibits more balanced performance and higher tail-class accuracy. In this setting, our method still achieves the highest C/F while retaining superior tail-class performance compared to other pruning baselines. This strongly demonstrates the applicability of our method.

Thank you for your question, which helped clarify our contributions. Additionally, we have incorporated more details to facilitate reproducibility in the shared link.


IR = 50

| Method | F | Head | Medium | Tail | All | C | C/F |
|---|---|---|---|---|---|---|---|
| CE | 100.0 | 68.0 | 38.0 | 13.2 | 46.0 | 100.0 | 1.0 |
| CE + ReGG | 52.1 | 43.8 | 14.8 | 0.83 | 24.5 | 53.2 | 1.0 |
| CE + Ours | 23.3 | 64.8 | 31.7 | 7.2 | 41.1 | 89.3 | 3.8 |

IR = 100

| Method | F | Head | Medium | Tail | All | C | C/F |
|---|---|---|---|---|---|---|---|
| CE | 100.0 | 70.7 | 40.0 | 7.2 | 41.0 | 100.0 | 1.0 |
| CE + ReGG | 52.1 | 47.7 | 13.6 | 0.6 | 21.9 | 53.4 | 1.0 |
| CE + Ours | 23.3 | 66.1 | 31.7 | 2.5 | 35.1 | 85.6 | 3.6 |

Q2: Over-parameterization for Tail Classes

Thank you for acknowledging our proof. The additional experiments we conducted further illustrate this point. Using the static pruning method ReGG (where a sparser network is pruned only at initialization), we ensured the model's C/F remains 1 (this can be seen as reducing the model's over-parameterization). Under the standard cross-entropy (CE) loss, which applies no correction for long-tailed distributions (the setting requested by Reviewer hf3A), tail-class performance suffered a catastrophic decline.


Q3: Additional Overhead

In fact, our method introduces minimal additional overhead. The time complexity of our pruning method is O(d), where d is the number of model parameters. This is significantly lower than ATO's O(Dd), where D is the size of the supernet used in that method. The additional complexity of the dynamic scoring and balancing mechanism is O(nk), where n is the number of classes and k is the number of criteria; even for datasets with long-tailed distributions, this remains much smaller than O(d), so the total complexity stays O(d).


Q4: Negative Impact on Head Classes

In practice, our method does not cause significant negative impacts on head classes.

The core of our method lies in considering the impact of pruning on class distributions when selecting pruning locations and dynamically incorporating feedback. Under this approach, compared to methods that use average model performance as the sole criterion, our method continuously adjusts to achieve distribution-adaptive pruning.

Our method implicitly assumes that performance changes on each class (whether head or tail) are equally important, which partially offsets some of the negative effects caused by sample size differences. However, our method remains fair: when a pruning criterion harms head classes, our method reduces its weight and attempts to find alternative pruning criteria to avoid such harm.

The experiments in the main text, as well as the supplementary experiments, have demonstrated this. Our method achieves performance improvements across classes with varying frequencies.


Q5: How Many Iterations Are Needed to Determine Parameter Weights?

The weights are computed once per epoch until pruning ends. Thus, the number of iterations is generally fewer than the total number of training epochs.


Q6: How Are Pruning Stopping Points Determined?

In our experiments, we followed standard pruning settings, stopping once pruning reached a certain proportion.

Beyond this, since our method is sensitive to class-level performance changes, it can be adapted to additional stopping criteria as needed. For instance, pruning can stop when excessive performance drops are observed in too many classes (a certain number of classes), or when head-class/tail-class performance declines sharply over consecutive rounds. This flexibility allows the method to meet more realistic deployment requirements.
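To make this concrete, the sketch below shows one possible class-aware stopping rule of the kind just described. The function name, thresholds, and interface are hypothetical illustrations under our own assumptions, not part of LTAP's released code.

```python
# Illustrative sketch only (not the authors' implementation): stop pruning when
# too many classes show a sharp accuracy drop between consecutive rounds.
import numpy as np

def should_stop_pruning(acc_history, drop_tol=0.05, max_degraded_frac=0.2):
    """acc_history: list of per-class accuracy vectors, one per pruning round."""
    if len(acc_history) < 2:
        return False
    prev, curr = np.asarray(acc_history[-2]), np.asarray(acc_history[-1])
    degraded = (prev - curr) > drop_tol          # classes with a sharp drop
    return degraded.mean() > max_degraded_frac   # too many classes degraded
```

A head-class or tail-class variant would simply restrict the check to the corresponding subset of classes.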

Comment

Thank you for conducting additional experiments and providing the code to address my concerns. I would like to keep my original score.

Comment

Thank you for your response and for acknowledging our supplementary experiments and code. We are pleased that we could address your concerns. Your review comments have been instrumental in improving the quality of our manuscript. Thank you again for your efforts.

Official Review (Rating: 3)

The paper explores how adaptive pruning strategies during training can balance model performance and efficiency on long-tailed data. It introduces a multi-dimensional importance scoring criterion and a weight allocation strategy to address this challenge. Experimental results demonstrate the effectiveness of the proposed method.

Strengths

  1. Researching how machine learning algorithms tackle the challenges of long-tailed data is both practical and worthwhile.
  2. The experimental results demonstrate promising effectiveness.
  3. The code is provided, which is a commendable practice.

Weaknesses

  1. The paper is poorly written. For instance, the logic in the first three paragraphs of the Introduction section is disorganized, failing even to clearly articulate the research problem. It begins by discussing the challenges of long-tailed data, then moves to efficiency issues in multi-expert systems and modular designs within long-tailed learning, neglecting mainstream approaches like re-sampling, loss design, and transfer learning. It then shifts abruptly to the challenges of pruning methods in long-tailed learning, without clarifying the specific problem the paper aims to address.
  2. $\theta$ in Equation 1 and $p$ in Equation 6 are not defined.
  3. There is an inconsistency between Figure 1 and Equations 3 and 4: Figure 1 illustrates 5 criteria, while the latter only includes 4.
  4. It is unclear how to set $w_c$ in Equation 1.
  5. How to set $w_k$ in Equation 5 is also not explained.
  6. The left side of Equation 5 represents a class-agnostic quantity, while the right side includes class $c$, which is confusing. Additionally, I do not observe any dynamic adjustment effect in Equation 5; $\text{acc}_C$ seems to only function as the temperature coefficient in the softmax function.
  7. The setup of $A_{c,\text{target}}$ in Equation 5 is also not specified.
  8. In line 149, the selection and role of the reference model are not explained.
  9. The method employs validation set accuracy and a reference model during training, which is uncommon. The authors should explicitly highlight and discuss these points.
  10. In addition to the results on the long-tail baselines, results on the vanilla baseline, i.e., using standard cross-entropy (CE), should also be presented.
  11. How does the method integrate with the long-tail baseline—specifically, which part of Section 2 is modified to achieve this?

Questions

My questions that need clarification are included in the weaknesses section.

Comment

Q1: Writing Issues

We sincerely appreciate your feedback. Here, we clarify our writing structure to address your concerns:

  1. Introduce the background of long-tailed learning and its importance in addressing the challenges of imbalanced class distributions, particularly highlighting the difficulty posed by the scarcity of tail-class data.
  2. Analyze the limitations of existing methods in terms of computational resource consumption and dynamic adaptability, emphasizing the necessity of pruning techniques for optimizing model efficiency.
  3. Explain the unique challenges traditional pruning methods face in long-tailed learning, such as exacerbated class imbalance, lack of dynamic adjustment capabilities, and overly simplistic pruning criteria.
  4. Propose our Long-Tailed Adaptive Pruner (LTAP) method, which achieves dual optimization of model performance and efficiency through dynamic weight adjustment and tail-class protection mechanisms.

Additionally, we have supplemented and rewritten the methodology section to reduce potential misunderstandings caused by our previous oversights. Thank you for your valuable suggestions.

If you have other suggestions for improving the writing of the introduction or methodology, please feel free to share them, and we will address them promptly.


Q2: Definitions of $p$ and $\theta$

The definition of $p$ is at line 166, where $P$ is described as the number of pruning stages. Additionally, we have supplemented the definition of $\theta$ (referring to the model parameters). Thank you for pointing this out.


Q3: Added the fifth definition

Thank you for identifying this oversight. We have supplemented and rewritten this section to clarify the missing definition.


Q4: Definition of $w_c$

$w_c$ refers to the weighting mechanism for classes in long-tailed methods, similar to class weighting techniques in methods like balanced softmax. Thank you for your detailed observation. To avoid potential misunderstandings related to our method, we have removed this part.


Q5: Clarification of $w$ and $w_k$

$w$ is a $5 \times 100$ matrix that records the weight of each criterion for each class. $w_k$ is a $1 \times 100$ matrix representing the weights for one specific criterion. The initial value of $w_k$ is set to all ones. Thank you for pointing out this issue. To avoid confusion caused by continuing to use $w$ to represent weights, we have replaced $w$ with the symbol $D$ in the manuscript and annotated its dimensions and update mechanism to prevent any potential misunderstanding.


Q6: Clarification of $\alpha_k$

Thank you for your question. $\alpha_k$ is shared across all classes. However, its calculation depends on the weights $w_k$ and $\gamma$, which are associated with each class. $\gamma$ can be set as the weight for each class, and the transformation of $w_k$ fully considers the performance changes of each class. In other words, we evaluate performance changes class by class and ultimately derive the weight for the strategy accordingly.

Additionally, we have rewritten this section in the manuscript using more easily comprehensible language, split long formulas into shorter ones, and added the necessary textual explanations. Changes have been highlighted in red for your reference.


Q7: Clarification of $A_{c,\text{target}}$

$A_{c,\text{target}}$ refers to the performance change metric under the selected criterion. We have avoided such ambiguous terms in the revised version. In the updated manuscript, we adopted multi-line short formulas and more intuitive symbols to ensure clarity.


Q8: Clarification of the reference model

The "reference model" refers to using the current model gradient as a parameter to serve as a directional reference. This is indeed an uncommon approach, mentioned only in a few works on pruning design. We have clarified this concept in the manuscript.

Comment

Q9: Why use validation set accuracy

Thank you for your reminder. In the LTAP method, using validation set accuracy is a carefully considered design choice. It serves two purposes:

  1. To provide reliable performance metrics for the LT-Vote mechanism, enabling more precise adjustments to the weights of different scoring criteria.
  2. To offer stable feedback signals during the multi-stage pruning process, guiding the dynamic evaluation of parameter importance.

To ensure practicality, we recommend using a small-scale validation set (e.g., 10% of the training data, which is generally sufficient in real-world scenarios). Our experimental results demonstrate that this is adequate for effective parameter importance evaluation and dynamic adjustments. Similar approaches have also been adopted in other long-tailed learning studies [1]. We will include a discussion on this aspect in future revisions.

[1] Sumyeong Ahn, Jongwoo Ko, and Se-Young Yun. Cuda: Curriculum of data augmentation for long-tailed recognition. In The Eleventh International Conference on Learning Representations, 2023.


Q10: Additional experiments for CE baseline and Logit Adjustment

Thank you for your reminder. We have supplemented the necessary experiments, demonstrating that our method shows performance advantages on both the CE baseline and Logit Adjustment. Due to time constraints, we will further improve this part of the experiments in the main text and on additional datasets in future updates. Thank you for your feedback.


Experiments on CIFAR with IR=50

We tested different loss functions (CE and Logit Adjustment), pruning methods, and the ablation of our tail-class protection mechanism (w.o. $\kappa$).

| Method | F | Head | Medium | Tail | All | C | C/F |
|---|---|---|---|---|---|---|---|
| CE | 100.0 | 68.0 | 38.0 | 13.2 | 46.0 | 100.0 | 1.0 |
| CE + ATO | 84.7 | 46.7 | 17.3 | 6.83 | 29.1 | 63.2 | 0.7 |
| CE + ReGG | 52.1 | 43.8 | 14.8 | 0.83 | 24.5 | 53.2 | 1.0 |
| CE + Ours w.o. $\kappa$ | 23.3 | 64.5 | 30.0 | 4.5 | 39.5 | 85.8 | 3.6 |
| CE + Ours | 23.3 | 64.8 | 31.7 | 7.2 | 41.1 | 89.3 | 3.8 |
| LA | 100.0 | 59.9 | 46.7 | 41.3 | 51.3 | 100.0 | 1.0 |
| LA + ATO | 84.7 | 34.5 | 33.9 | 29.1 | 34.2 | 66.7 | 0.7 |
| LA + ReGG | 52.1 | 31.0 | 30.2 | 25.8 | 29.9 | 58.3 | 1.1 |
| LA + Ours w.o. $\kappa$ | 22.8 | 53.5 | 42.7 | 30.0 | 44.8 | 87.3 | 3.8 |
| LA + Ours | 22.8 | 54.0 | 43.4 | 38.4 | 47.1 | 91.8 | 4.0 |

Experiments on CIFAR with IR=100

We tested different loss functions (CE and Logit Adjustment), pruning methods, and the ablation of our tail-class protection mechanism (w.o. $\kappa$).

| Method | F | Head | Medium | Tail | All | C | C/F |
|---|---|---|---|---|---|---|---|
| CE | 100.0 | 70.7 | 40.0 | 7.2 | 41.0 | 100.0 | 1.0 |
| CE + ATO | 84.7 | 50.4 | 16.5 | 6.6 | 25.2 | 61.5 | 0.7 |
| CE + ReGG | 52.1 | 47.7 | 13.6 | 0.6 | 21.9 | 53.4 | 1.0 |
| CE + Ours w.o. $\kappa$ | 23.3 | 67.1 | 30.8 | 0.5 | 34.4 | 83.9 | 3.6 |
| CE + Ours | 23.3 | 66.1 | 31.7 | 2.5 | 35.1 | 85.6 | 3.6 |
| LA | 100.0 | 62.9 | 47.7 | 29.6 | 47.9 | 100.0 | 1.0 |
| LA + ATO | 84.7 | 42.1 | 30.6 | 18.4 | 31.4 | 65.6 | 0.7 |
| LA + ReGG | 52.1 | 38.5 | 27.0 | 14.6 | 27.6 | 57.6 | 1.1 |
| LA + Ours w.o. $\kappa$ | 22.8 | 55.1 | 42.0 | 18.6 | 39.5 | 82.4 | 3.6 |
| LA + Ours | 22.8 | 56.1 | 45.3 | 22.6 | 42.6 | 88.9 | 3.9 |

Q11: Integration with various long-tailed losses

Our method can integrate with various long-tailed learning losses because its dynamic adjustments and implicit protections are not dependent on any specific long-tailed learning approach. The additional experiments we performed at the request of reviewers also demonstrate that our method can integrate with different long-tailed baselines while consistently showing performance advantages.

A point to note, however, is that different long-tailed learning methods may exhibit relatively fixed preferences for head or tail classes. This may be reflected in differences in gradient directions during model updates, which could affect the score value calculations in Equations (3) and (4). Nevertheless, due to the post-hoc adjustment properties of our method, LTAP’s dynamic adaptation can still achieve tail-class protection and relatively balanced high-performance pruning, as demonstrated in our supplementary experiments.

Comment

Thank you for your rebuttal.

It appears that the revised version has overwritten the original submission. As many parts of the paper related to my initial review have been modified, I cannot fully assess the point-by-point clarifications in the rebuttal without access to the original version for reference. Would it be possible to include the original version, perhaps as an appendix?

I will take time to review the revised paper and provide feedback on it later.

Comment

Thank you for your reply, and we apologize for our oversight. I have uploaded the original version in supplementary materials. If there is anything that still causes confusion or if you have any suggestions, please let me know, and I will do my best to make the necessary revisions. Thank you for your time and effort.

Comment

For Q1: I maintain my concerns regarding the coherence of the first three paragraphs in the introduction.

For Q2: My question concerns the lowercase $p$, but the authors have only defined the uppercase $P$ in both the main text and the rebuttal.

For Q6: The authors still have not directly clarified why the right-hand side of Equation 5 (in the original version) is a term that varies with class $c$, while the left-hand side is independent of $c$. This issue remains in the revised version of Equation 5 as well.

Comment

For Q2: Thank you for raising this important point regarding the use of lowercase $p$. In the manuscript, the notation $\gamma_p$ refers to the preset pruning ratio (as described in lines 173, 182, and the pseudocode). The lowercase $p$ is used specifically to distinguish $\gamma_p$ from $\gamma_{\text{total}}$, and it is not intended to represent a variable.

To improve the clarity and consistency of our presentation, we will revise the notation in the updated version of the manuscript. We will replace $\gamma_p$ with $\gamma_{\text{preset}}$ throughout the text and pseudocode. This change will make its meaning more explicit and avoid any potential confusion.

We greatly appreciate your feedback and will ensure that this issue is addressed in the revised version.

For Q6: We understand that the dependency on class $c$ in both sides of the equation $\alpha^{(t)} = \text{softmax}(I_c)$ may cause confusion (where $c$ and $N_c$ refer to class-related terms rather than specific classes). To clarify the derivation process and the relationships between variables, we will provide examples and a detailed analysis.


1. Variable Definitions and Dimensions

  • $D$:

    • Dimension: $(5, 100)$
    • Function: stores the importance score matrix for 5 pruning criteria across 100 classes
      • Each row represents a pruning criterion (e.g., magnitude, cosine_similarity, etc.)
      • Each column represents a class
  • $N_c$:

    • Dimension: $(100, 1)$
    • Function: stores the sample count of each class, used to weight the importance scores
  • $I_c$:

    • Formula: $I_c = D \cdot N_c$
    • Dimension: $(5, 1)$
    • Function: combines criterion scores with class sample counts through matrix multiplication
  • $\alpha^{(t)}$:

    • Dimension: $(5,)$
    • Function: importance weights for each criterion, obtained through Softmax normalization of $I_c$

2. Matrix Multiplication Example

Consider this example to understand the matrix multiplication rules and results:

Assume:

  • $D \in \mathbb{R}^{5 \times 3}$: stores importance scores for 5 criteria across 3 classes:

    | 0.1 | 0.2 | 0.3 |
    | 0.4 | 0.5 | 0.6 |
    | 0.7 | 0.8 | 0.9 |
    | 1.0 | 1.1 | 1.2 |
    | 1.3 | 1.4 | 1.5 |

  • Each row represents a criterion (e.g., magnitude, cosine_similarity, etc.)

  • Each column represents class scores

  • $N_c \in \mathbb{R}^{3 \times 1}$: represents the sample counts for the 3 classes:

    | 10 |
    | 20 |
    | 30 |

  • Computing $I_c = D \cdot N_c$:

    | 0.1 | 0.2 | 0.3 |   | 10 |   | 14.0 |
    | 0.4 | 0.5 | 0.6 | · | 20 | = | 32.0 |
    | 0.7 | 0.8 | 0.9 |   | 30 |   | 50.0 |
    | 1.0 | 1.1 | 1.2 |            | 68.0 |
    | 1.3 | 1.4 | 1.5 |            | 86.0 |

  • Result interpretation:
    • $I_c \in \mathbb{R}^{5 \times 1}$ is a column vector in which each element is the weighted sum of the corresponding criterion's scores across all classes

3. Applying Softmax Normalization

Next, we apply Softmax to normalize $I_c$ into a probability distribution:

  • Softmax formula: $\alpha_k^{(t)} = \frac{\exp(I_{c,k})}{\sum_{j=1}^{5} \exp(I_{c,j})}, \quad \forall k \in \{1, 2, \ldots, 5\}$
  • Result:
    • $\alpha^{(t)} \in \mathbb{R}^5$ represents the normalized weights for each criterion, summing to 1

4. Analysis of Class $c$ Dependency

  • Right-hand side $I_c$:

    • $I_c$ obtains class-related information through the matrix multiplication $I_c = D \cdot N_c$ (it is not specific to any particular class)
  • Left-hand side $\alpha^{(t)}$:

    • Softmax normalizes the 5 elements of $I_c$, reflecting the relative importance of the criteria
    • $\alpha^{(t)}$ is a global criterion weight, not directly related to specific classes
  • Role of Softmax:

    • It only normalizes the 5 criterion scores in $I_c$, outputting a global weight distribution
    • Although $I_c$ depends on class $c$, the Softmax output $\alpha^{(t)}$ no longer retains class information
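For concreteness, a minimal NumPy sketch of this computation is shown below; it is illustrative only, and variable names such as `N_class` follow the notation in this rebuttal rather than any released code.

```python
# Reproduce the worked example: I_class = D @ N_class, then softmax over criteria.
import numpy as np

D = np.array([[0.1, 0.2, 0.3],      # importance scores: 5 criteria x 3 classes
              [0.4, 0.5, 0.6],
              [0.7, 0.8, 0.9],
              [1.0, 1.1, 1.2],
              [1.3, 1.4, 1.5]])
N_class = np.array([10, 20, 30])     # per-class sample counts

I_class = D @ N_class                # shape (5,): [14, 32, 50, 68, 86]
alpha = np.exp(I_class - I_class.max())
alpha /= alpha.sum()                 # softmax over the 5 criteria; sums to 1
```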
Comment

A simple piece of immediate feedback:

"NcN_c is the number of samples in class c." Shouldn’t NcN_c be a scalar? Why does it have a shape of (100, 1), i.e., (C, 1)?

Comment

Dear Reviewer,

We sincerely apologize for any confusion. We have made the following corrections and clarifications: $N_c$ represents the number of samples for each class, thus it is a vector rather than a scalar. We have explicitly stated this in the paper: "$N_c \in \mathbb{R}^{C}$ represents the class distribution vector." In our previous response to you, we further elaborated on this concept: "where $c$ and $N_c$ refer to class-related terms rather than specific classes." Additionally, we will modify the original text from "where $I_c \in \mathbb{R}^K$ represents the comprehensive importance scores across criteria for class $c$" to "where $I_c \in \mathbb{R}^K$ represents the comprehensive importance scores across criteria for classes" to avoid potential confusion. Thank you for your careful review and valuable feedback.

Comment

If it's a count vector for all $C$ classes, why is there a subscript for class $c$?

Comment

Is $N_c$ also a vector in line 232?

Comment

Dear Reviewer, We now fully understand your concern. Regarding your previous question, we will change the subscript to "class" when referring to all classes collectively, indicating that the quantity relates to the class level, for example, $N_c \to N_{\text{class}}$, $I_c \to I_{\text{class}}$, $A_c \to A_{\text{class}}$. For references to specific classes, such as when we need to denote a particular class from the head or tail classes in theoretical discussions, we will retain $c$ as the subscript.

Comment

Dear Reviewers,

We sincerely appreciate your valuable efforts and continued support of our work. We have carefully addressed each of your comments and thoroughly revised the manuscript accordingly. We are grateful that our detailed responses have been able to address your concerns during the rebuttal process.

As the discussion period is coming to a close on December 2nd (today), if you have no further comments, we kindly remind you to consider updating your score.

Thank you again for your time and consideration.

Best regards, Authors of submission 13643

Comment

Dear Reviewer hf3A,

As the discussion phase is nearing its conclusion, we kindly remind you to re-evaluate our submission based on the revisions and responses we have provided. If there are any remaining concerns or suggestions, please let us know, and we will do our utmost to address them promptly.

Thank you again for your time and effort.

Best regards, Authors of submission 13643

Comment

The revised version contains numerous fundamental formatting issues that should not appear in a qualified submission. For instance, lines 133, 136, 137, 148, 150, 163, 439, and 699 exhibit basic problems such as missing mathematical environments and incorrect use of subscript and superscript symbols.

Section 2 is poorly written and hard to follow. Each step and the introduction of formulas lack basic intuition and motivation, making it difficult for me to grasp the core contributions and innovations of the method.

The purpose of many formulas and variables is not clearly explained. For example: Why are both $S$ and $I$ referred to as the "comprehensive importance score," and what is the distinction between them? Why is the right-hand side of Equation 5 a term that varies with class $c$, while the left-hand side is independent of $c$? What does $p$ represent in Equation 7, and why is the left-hand side of Equation 7 a quantity determined by $p$, while the right-hand side appears to be a constant?

Comment

For Section 2: To help you better understand the motivation and contributions of this paper directly, we have made simple modifications to Section 2:

LTAP: Adaptive Pruner for Long-tailed Distribution

In this section, we propose a novel pruning strategy called Long-Tailed Adaptive Pruner (LTAP), aimed at optimizing neural network models on long-tailed distribution datasets. By effectively protecting critical parameters of tail classes, LTAP not only enhances overall model performance but also improves parameter efficiency in long-tailed distribution scenarios. This is achieved through the integration of multiple importance scoring criteria and the dynamic adjustment of pruning weights. The following subsections will provide a detailed explanation of the overall architecture, LTAP optimizer design, pruning strategy implementation, and the alternating process of training and pruning.

LTAP Optimizer Design

To accurately assess parameter importance, LTAP introduces multiple scoring criteria, including magnitude, average magnitude, cosine similarity, Taylor first order, and Taylor second order. These diverse criteria enable a comprehensive evaluation of parameter significance, capturing different aspects of their contributions to the model's performance. By incorporating multiple perspectives, LTAP ensures a more balanced pruning process that takes into account the complex dynamics of long-tailed distributions. Additionally, to capture the multifaceted nature of parameter contributions, we introduce a dynamic weighting mechanism that adjusts the influence of each criterion based on real-time performance metrics.

The comprehensive importance score $S_g$ for each parameter group $g$ is calculated by the following formula:

$$S_g = \sum_{k=1}^K \alpha_k \cdot s_{g,k}$$

Here, $K=5$ represents the number of scoring criteria, $\alpha_k$ is the weight coefficient for each scoring criterion $k$, and $s_{g,k}$ denotes the score value of scoring criterion $k$ for parameter group $g$. The specific scoring criteria are defined as follows:

$$s_{g,\text{magnitude}} = \|\mathbf{w}_g\|_2$$

$$s_{g,\text{avg-magnitude}} = \frac{\|\mathbf{w}_g\|_2}{n_g}$$

$$s_{g,\text{cosine}} = \frac{\mathbf{w}_g \cdot \mathbf{w}_{\text{ref}}}{\|\mathbf{w}_g\|_2 \, \|\mathbf{w}_{\text{ref}}\|_2}$$

$$s_{g,\text{taylor-first}} = \left|\frac{\partial \mathcal{L}}{\partial \mathbf{w}_g}\right| \cdot \mathbf{w}_g$$

$$s_{g,\text{taylor-second}} = \left|\frac{\partial^2 \mathcal{L}}{\partial \mathbf{w}_g^2}\right| \cdot \mathbf{w}_g^2$$

where $\mathbf{w}_g$ is the weight vector of parameter group $g$, and $n_g$ is the number of parameters in group $g$.

The reference weight vector $\mathbf{w}_{\text{ref}}$ represents the gradients of the current model parameters, serving as a directional reference in the cosine similarity criterion to measure the alignment between parameter updates and the optimization trajectory.

This multifaceted scoring approach allows LTAP to accurately identify and preserve parameters that are crucial for both head and tail classes, thereby maintaining model robustness and enhancing performance across the entire long-tailed distribution.
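For concreteness, the PyTorch-style sketch below illustrates the five criteria and the combined score for one parameter group. It is an assumption-laden illustration, not the authors' exact implementation: the names (`importance_score`, `w_g`, `grad_g`, `w_ref`, `alpha`) are hypothetical, the Taylor terms are aggregated to scalars in a simplified way, and the second-order term uses a squared-gradient (Fisher-style) stand-in for the Hessian diagonal.

```python
import torch

def importance_score(w_g, grad_g, w_ref, alpha):
    """Combine five per-group criteria into S_g = sum_k alpha_k * s_{g,k}."""
    n_g = w_g.numel()
    s = torch.stack([
        w_g.norm(p=2),                                        # magnitude
        w_g.norm(p=2) / n_g,                                  # average magnitude
        torch.dot(w_g.flatten(), w_ref.flatten())
            / (w_g.norm(p=2) * w_ref.norm(p=2) + 1e-12),      # cosine similarity to reference direction
        (grad_g * w_g).abs().sum(),                           # first-order Taylor term (aggregated, simplified)
        (grad_g.pow(2) * w_g.pow(2)).sum(),                   # second-order Taylor term (Fisher-style approximation)
    ])
    return torch.dot(alpha, s)                                # weighted by the current criterion weights alpha
```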

Pruning Strategy Implementation

We propose a dynamic importance evaluation mechanism that adaptively integrates class distributions with multiple pruning criteria. This integration allows LTAP to dynamically prioritize parameters based on the specific needs of each class, ensuring that tail classes receive the necessary attention during the pruning process. By leveraging both the inherent class distribution and the multifaceted parameter importance scores, LTAP can effectively balance model compression with the retention of critical information necessary for accurate classification across all classes. The importance score is computed through a novel interaction framework:

$$I_c = D \cdot N_c$$

where $I_c \in \mathbb{R}^K$ represents the comprehensive importance scores across criteria for class $c$, $D \in \mathbb{R}^{K \times C}$ denotes the criteria weight matrix, and $N_c \in \mathbb{R}^C$ represents the class distribution vector.

Comment

Formulas and Variables: Point 1: $S$ in the manuscript refers to the "comprehensive importance score for each parameter group." $S_g$ is the comprehensive importance score for parameter group $g$, which guides its retention or removal during the pruning process and is calculated using Equation (1).

$I$ in the manuscript refers to "the comprehensive importance scores across criteria for class," i.e., the comprehensive importance scores for the classes, used to dynamically adjust the weights $\alpha_k$ of the scoring criteria to adapt to the long-tailed data distribution. $I_c$ indirectly influences the calculation of $S_g$ by adjusting $\alpha_k$.

Point 2: As this question was previously addressed in Q3, we provide a concise summary:

$I_c$ represents the total weighted scores of the 5 scoring criteria across all classes, with dimension $(5, 1)$. $\alpha^{(t)}$ is the result of normalizing $I_c$ with the Softmax operation, with dimension $(5,)$, representing the importance weights of the 5 scoring criteria, summing to 1.

Point 3: As this question was previously addressed in Q2, we provide a concise summary:

In the manuscript, the notation $\gamma_p$ refers to the preset pruning ratio (as described in lines 173, 182, and the pseudocode). The lowercase $p$ is used specifically to distinguish $\gamma_p$ from $\gamma_{\text{total}}$, and it is not intended to represent a variable. We will replace $\gamma_p$ with $\gamma_{\text{preset}}$ throughout the text and pseudocode. This change will make its meaning more explicit and avoid any potential confusion.

Comment

Dear Reviewer hf3A,

We have done our best to provide a response to your comments and hope that you can re-evaluate our submission based on the revisions. If there are any further concerns or questions, please don't hesitate to bring them up.

Official Review (Rating: 6)

This paper proposes a new model pruning strategy for long-tailed data named Long-Tailed Adaptive Pruner (LTAP). The proposed pruning strategy first calculates a comprehensive importance score for each parameter group to select the group(s) of parameters to be pruned, then updates the weight coefficients for each scoring criterion through the Long-Tailed Voting (LT-Vote) mechanism. LT-Vote adjusts the weight coefficients based on the actual performance of each class, thus better protecting the tail-class parameters.

Strengths

The extensive experimental results demonstrate that the proposed LTAP strategy effectively identifies low-importance parameters, achieving a better balance between maintaining model performance—particularly on tail classes—and reducing model size. Additionally, the authors provide a theoretical analysis arguing that tail classes should be prioritized in parameter retention to preserve model accuracy, lending strong support to the LTAP approach.

Weaknesses

  1. Theoretical analysis of LT-Vote. The authors should further discuss how LT-Vote enhances parameter retention for tail classes, connecting this mechanism more explicitly to the theoretical foundation of tail-biased pruning. Deriving a specific performance guarantee for the LT-Vote mechanism would strengthen the argument.

  2. Ability to retain effective tail-class parameters. In Section 4.3, the authors analyze how neurons are masked under different pruning strategies. However, they do not assess whether the proposed strategy actually preserves more parameters for tail classes, leaving the theoretical analysis unvalidated.

  3. Minor issues.

    (1) Including a pseudo-code block would clarify the strategy and provide coherence among the presented formulas.

    (2) Some subscripts are not properly rendered. For example, Lemma 1 in Appendix A.1 writes $\mathcal{H}_s$ as $\mathcal{H}c$, and $d_{VC,c}$ as dVC,c.

    (3) Missing references. Line 686 "Definition ??" and line 693 "e.g., see ?".

Questions

What is the training/pruning time cost of your model pruning method?

Comment

Q1: Theoretical basis for enhanced tail class parameter retention:

Thank you for your question. We agree that fully explaining how our method enhances tail class parameter retention would increase the paper's contribution. As per your request, we have provided the best proof possible. Due to time constraints, we made some assumptions and simplifications regarding the model, distribution, and algorithmic details without compromising our method's applicability. The proof is included in Appendix D, pages 19-23.

Q2: Verification of parameter preservation for tail classes:

Thank you for raising this concern. We have conducted additional experiments to demonstrate "preserves more parameters for tail classes":

We performed additional ablation studies on datasets with two imbalance ratios. "ours w.o. $\kappa$" represents our method with the tail protection mechanism ablated. As shown, with the same pruning rate F, significant performance degradation occurs in the tail classes across different imbalance ratios (IR = 50 and 100) and loss functions after ablation. This indicates that our original method retained more parameters beneficial for tail-class performance. (Note that parameters retained as beneficial for tail classes are not necessarily useless for head classes: pruned parameters might contribute to both head and tail classes, only to certain classes, or have minimal contribution overall.)

Table 1: Experiments on CIFAR with IR=50, testing different losses (CE and Logit Adjustment) and pruning methods. "w.o. $\kappa$" denotes the ablation of our tail protection mechanism.

| Method | F | Head | Medium | Tail | All | C | C/F |
|---|---|---|---|---|---|---|---|
| CE | 100.0 | 68.0 | 38.0 | 13.2 | 46.0 | 100.0 | 1.0 |
| CE + ATO | 84.7 | 46.7 | 17.3 | 6.83 | 29.1 | 63.2 | 0.7 |
| CE + ReGG | 52.1 | 43.8 | 14.8 | 0.83 | 24.5 | 53.2 | 1.0 |
| CE + Ours w.o. $\kappa$ | 23.3 | 64.5 | 30.0 | 4.5 (↓2.7) | 39.5 | 85.8 | 3.6 |
| CE + Ours | 23.3 | 64.8 | 31.7 | 7.2 | 41.1 | 89.3 | 3.8 |
| LA | 100.0 | 59.9 | 46.7 | 41.3 | 51.3 | 100.0 | 1.0 |
| LA + ATO | 84.7 | 34.5 | 33.9 | 29.1 | 34.2 | 66.7 | 0.7 |
| LA + ReGG | 52.1 | 31.0 | 30.2 | 25.8 | 29.9 | 58.3 | 1.1 |
| LA + Ours w.o. $\kappa$ | 22.8 | 53.5 | 42.7 | 30.0 (↓8.4) | 44.8 | 87.3 | 3.8 |
| LA + Ours | 22.8 | 54.0 | 43.4 | 38.4 | 47.1 | 91.8 | 4.0 |

Table 2: Experiments on CIFAR with IR=100, testing different losses (CE and Logit Adjustment) and pruning methods. "w.o. $\kappa$" denotes the ablation of our tail protection mechanism.

| Method | F | Head | Medium | Tail | All | C | C/F |
|---|---|---|---|---|---|---|---|
| CE | 100.0 | 70.7 | 40.0 | 7.2 | 41.0 | 100.0 | 1.0 |
| CE + ATO | 84.7 | 50.4 | 16.5 | 6.6 | 25.2 | 61.5 | 0.7 |
| CE + ReGG | 52.1 | 47.7 | 13.6 | 0.6 | 21.9 | 53.4 | 1.0 |
| CE + Ours w.o. $\kappa$ | 23.3 | 67.1 | 30.8 | 0.5 (↓2.0) | 34.4 | 83.9 | 3.6 |
| CE + Ours | 23.3 | 66.1 | 31.7 | 2.5 | 35.1 | 85.6 | 3.6 |
| LA | 100.0 | 62.9 | 47.7 | 29.6 | 47.9 | 100.0 | 1.0 |
| LA + ATO | 84.7 | 42.1 | 30.6 | 18.4 | 31.4 | 65.6 | 0.7 |
| LA + ReGG | 52.1 | 38.5 | 27.0 | 14.6 | 27.6 | 57.6 | 1.1 |
| LA + Ours w.o. $\kappa$ | 22.8 | 55.1 | 42.0 | 18.6 (↓4.0) | 39.5 | 82.4 | 3.6 |
| LA + Ours | 22.8 | 56.1 | 45.3 | 22.6 | 42.6 | 88.9 | 3.9 |

Q3: Training and pruning costs:

Our method actually has relatively low costs in the pruning strategy. The time complexity of our pruning method is O(d), where d is the model parameter scale. This is significantly lower than ATO's O(Dd), where D is the scale of the hypernetwork used in their method. The additional complexity of dynamic scoring and balancing mechanisms is O(nk), where n is the number of classes and k is the number of criteria, which is much smaller than O(d) even on long-tailed datasets, as our total complexity remains O(d).

Q4: Other minor issues:

We have added the requested pseudocode to the revision and will incorporate it into the main text. We have also thoroughly revised notation errors and missing references. We sincerely appreciate your contributions to improving the manuscript's quality.

Comment

Thank you for your detailed response.

For the theoretical analysis, I appreciate your acknowledgment of the assumptions and simplifications made in the theoretical analysis due to time constraints. From my perspective, your explanation is ok to address my concern to some extent.

For your additional experiments, your results indicate a high correlation between tail protection mechanism and improved performance especially on tail classes. While the results do not directly confirm whether more parameters are retained for tail classes, your additional theoretical analysis in Appendix D supports the likelihood of this being the case.

In summary, based on your response and the feedback from other reviewers, I find your rebuttal reasonable and well-prepared. I will keep my score unchanged.

Comment

Thank you for your prompt reply. We appreciate your recognition of the reasonableness of our response and our thorough efforts. We are also pleased to see that the supplementary theoretical proofs and experimental results help clarify the mechanism of tail class parameter retention. Your suggestions have been invaluable in improving the paper's quality, allowing us to refine our work from different perspectives. We will further improve the presentation of the theoretical analysis section in the final version to make it more rigorous and complete. Thank you again for your review comments.

Official Review (Rating: 6)

The paper introduces a novel pruning approach called the Long-Tailed Adaptive Pruner (LTAP). LTAP is designed to address the challenge of imbalanced datasets where traditional pruning methods often fall short. The LTAP strategy introduces a multi-dimensional importance scoring system and a dynamic weight adjustment mechanism to adaptively determine which parameters to prune, particularly focusing on protecting the critical parameters for tail classes. Extensive experiments on various long-tailed datasets validate LTAP's effectiveness.

Strengths

  • Addressing pruning in the context of long-tailed datasets is both meaningful and highly relevant to real-world applications.
  • The method is supported by theoretical foundations that verify its effectiveness.
  • The performance improvements achieved by the method are relatively substantial.

Weaknesses

  • The paper is hard to follow, and the writing could be improved. For example, the definition of $w_k$ lacks clarity and could be more explicitly explained.

  • The update process of $\alpha_k$ is somewhat confusing. Since it is connected to class $c$, it raises the question: if $\alpha_k$ is defined for each class $c$, then does each class have a unique $\alpha_k$? However, the earlier sections suggest it is shared across all classes. Additional clarification on this would be helpful.

  • The way the dynamic weight adjustment mechanism strengthens the protection of parameters for tail classes is not entirely clear. If $\alpha_k$ is indeed shared across all classes, it is unclear why adjusting the weights of different scoring criteria would selectively protect parameters for tail classes. Could there be a scoring criterion that more specifically targets the protection of parameters for tail classes?

  • From the experimental results on the ImageNet-LT dataset, compared to ATO and RReg, it appears that LTAP achieves a larger performance improvement for head classes than for tail classes. This seems to contradict the stated goal of strengthening parameter protection for tail classes in LTAP.

  • There are also some typos, such as in line 700, where "equation eqution" should be "equation."

Questions

Please refer to the weakness section.

Ethics Review Details

N/A

Comment

Q1: Explanation of $w_k$:

$w$ is a $5 \times 100$ matrix that records the weights of each criterion for each class. $w_k$ is a $1 \times 100$ matrix for one specific criterion, with initial values all set to 1. Thank you for pointing out this ambiguity. We have replaced $w$ with the symbol $D$ in the manuscript and clearly annotated its dimensions and update mechanism to avoid any potential misunderstandings.

Q2: Explanation of $\alpha_k$:

We sincerely apologize for the unclear explanation of the $\alpha_k$ derivation process. We have rewritten this section using more comprehensible language, breaking down lengthy formulas into shorter ones, and adding necessary textual explanations. These changes are highlighted in red.

Furthermore, let me clarify your question: $\alpha_k$ is shared across all classes. However, the calculation of $\alpha_k$ is determined by $w_k$ and $\gamma$ for each class. $\gamma$ can be set as the weight for classes, and the transformation of $w_k$ fully considers the performance changes of each class. In other words, we consider performance changes class by class to ultimately derive the weights for the strategy.

Q3: Relationship between adjusting scoring criteria weights and protecting tail classes:

We influence the selection of parameter groups by introducing weights for different criteria. During weight adjustments, the performance changes of each class are considered. This process inherently accounts for long-tailed distributions and tail-class protection. For instance, if a pruning operation sacrifices multiple tail classes while improving head-class performance, such a pruning criterion would be penalized. Additionally, we introduced $\gamma$, a variable with the same dimension as the classes, which can be used to weight tail classes according to distribution preferences.

Formally, while directly incorporating fixed tail-class consideration metrics (or simply setting $w_k$ according to the distribution) might seem more beneficial for protecting tail classes, this would somewhat reduce the method's generality. Our goal is to develop a pruning strategy that is both universal and adaptable to long-tailed environments.

Q4: Results on ImageNet-LT dataset:

Thank you for your question. To clarify, our method's core focus is on improving the usability of pruning methods in long-tailed learning. Tail class performance is one of the key concerns in this context (as the performance degradation and additional bias introduced by pruning can further exacerbate the insufficient performance of tail classes in long-tailed learning), rather than specifically designing a method biased towards tail classes compared to traditional pruning (in fact, pruned parameters might significantly affect both head and tail classes, or have minimal impact on both).

Class-aware selection rules replacing overall performance as pruning criteria prevent further bias against tail classes while improving performance across all classes compared to existing pruning methods. We have also retained the capability to introduce preference control vectors for specific requirements (though specific preferences were not introduced in the original experiments).

Your observation that our method shows larger improvements in head classes than tail classes on ImageNet-LT is valid. Due to random factors and dataset imbalance, the actual improvements from our method may vary. However, this doesn't conflict with our motivation - improving usability in long-tailed learning while avoiding introducing additional bias and performance degradation for tail classes. More specifically, this might indicate that our method has found pruning combinations in ImageNet-LT that avoid further performance degradation in most tail classes while satisfying head class requirements (which is indeed the case compared to baselines). Additionally, if you're particularly interested in the underlying reasons for tail class preservation effects, our provided proofs may offer supplementary information.

We greatly appreciate your attention to detail and responsibility. Your contributions have been invaluable in helping us clarify misunderstandings and improve the manuscript's quality.

Comment

I appreciate the authors' efforts in addressing the concerns raised in the rebuttal. My questions have been resolved. I decide to raise my score.

Comment

We sincerely appreciate that our efforts have addressed your concerns. We are grateful for your recognition of our work. Thank you for your contributions toward improving the quality of our manuscript.

Comment

We extend our sincere gratitude to all reviewers for their valuable feedback and suggestions. We are encouraged that they have recognized our research contributions in several aspects: practical significance and real-world applicability of our research motivation (Reviewers 4PE9, ESub, hf3A), comprehensive and convincing theoretical analysis (Reviewers 4PE9, ESub, 8taT), effectiveness demonstrated through experimental results (Reviewers 4PE9, ESub, hf3A), and reproducibility facilitated by open-source code (Reviewers hf3A, 8taT). In particular, reviewers acknowledged the innovation of our proposed LT-Vote mechanism and multi-stage pruning strategy in balancing model efficiency and tail class performance (Reviewer 8taT). We have carefully considered all suggestions provided by the reviewers and revised our manuscript accordingly. Detailed responses to each reviewer are provided below.

Below is a summary of the revisions and responses to reviewer comments:

In the revision, we have incorporated all reviewers' feedback and made extensive modifications to the manuscript. Due to time constraints, some content will be added in the final version. The main modifications are as follows:

  • We revised Section 2.1 to highlight key methodological aspects and address errors noted by Reviewer hf3A.

  • We modified Section 2.2, introducing shorter formulas and revising notation systems and explanations for better comprehension of our methodology.

  • We addressed all errors identified by reviewers and conducted a comprehensive review of notation systems and definitions.

  • As requested by reviewers, we added pseudocode and supplemented an approximate proof in Appendix D. Please refer to pages 19-23.

    Additionally, we provided environment dependencies for experiment reproduction, algorithm pseudocode, and ablation studies through an anonymous link.

In our response to Reviewer 4PE9, we clarified notation errors and our motivation. For Reviewer ESub, we supplemented requested experiments and theoretical proofs while correcting notation errors. In addressing Reviewer hf3A, we clarified notation issues, revised the manuscript accordingly, and added requested experiments. For Reviewer 8taT, we provided necessary explanations and supplementary experiments.

AC Meta-Review

The submission received ratings from four reviewers, who recommended 6, 3, 6, and 6, averaging 5.25. Given the many competitive submissions to ICLR, this score stands below the borderline. Initially, the reviewers' concerns focused on unclear writing, missing details, and some counter-intuitive experimental results. After the authors' rebuttal, some reviewers felt their concerns had been well addressed, while others did not. After carefully checking the reviewer comments and the authors' feedback, the AC considered that the main concerns lay in the significance of the technical novelty and in the writing issues. The AC launched a discussion, and no reviewer championed this submission. The AC also checked the revision and found some typographical errors, for example in Eqs. (2) and (3) (some characters should be in subscripts?). Considering this, I have to recommend rejection of the current submission, and I hope the advice from this round can help improve it.

Additional Comments from the Reviewer Discussion

Please see above. The authors can take these advices into the current version to strengthen the submission.

Final Decision

Reject