PaperHub
Score: 5.0 / 10
Decision: Rejected · 5 reviewers
Individual ratings: 1, 3, 4, 3, 3 (min 1, max 4, std dev 1.0)
ICML 2025

PASER: Post-Training Data Selection for Efficient Pruned Large Language Model Recovery

OpenReview · PDF
Submitted: 2025-01-15 · Updated: 2025-06-18

Abstract

Keywords

Model Pruning · Large Language Model · Data Selection · Recovery Training

Reviews and Discussion

Official Review (Rating: 1)

This paper proposes PASER, a post-training data selection method for efficient pruned model recovery. PASER involves (i) Semantic-Structural Recovery Instruction Clustering to identify and group data points that focus on similar capabilities, (ii) Capability Degradation-aware Instruction Selection to enable more accurate identification and prioritization of affected capabilities, and (iii) a Concept Consistency Graph to detect and mitigate potential negative transfer. Experiments on several open-source LLMs and benchmarks demonstrate that PASER recovers pruned LLM performance using only a portion of the original data.

Questions For Authors

Please refer to strengths and weaknesses.

Claims And Evidence

Yes.

Methods And Evaluation Criteria

Yes.

Theoretical Claims

There are no theoretical claims.

Experimental Designs Or Analyses

Yes, I have checked.

Supplementary Material

Yes.

Relation To Broader Scientific Literature

Model recovery is a well-established field.

Essential References Not Discussed

No.

Other Strengths And Weaknesses

Strengths

● The idea of post-training data selection to address the uneven capability degradation issue is interesting.

● This paper is generally easy to follow.

● Experiments on various benchmarks demonstrate that PASER can recover pruned LLMs.

Weaknesses

● Many works have discussed data selection for post-training to identify high-quality data, and the authors mention in lines 55-56, 'Note that general high quality does not necessarily mean useful for recovery.' This statement requires more evidence or elaboration.

● The proposed idea of data clustering and selection is inherently limited by the chosen embedding and clustering methods. Even after reading Appendix H, the choice of simply applying SentenceBERT and a Diffusion Kernel for the clustering process remains confusing. Given that these off-the-shelf models are not end-to-end trained for this task, could the clustering results in specific domains be unacceptable? The authors should also provide visual results after clustering to illustrate the semantic relationships among clusters in the high-dimensional space.

● The proposed manifold learning process employs the adjacency matrix or the normalized Laplacian matrix. If large-scale datasets are involved in real-world applications, will this process be too time-consuming for real-time tasks? Additionally, the authors retain the top d eigenvectors of K_t. Is the proposed method robust to the choice of d? Moreover, could you provide the specific values of |D|, |B|, and |S| for each dataset, as mentioned in Equation 1?

● The proposed Concept Consistency Graph (CCG) aims to identify conflicting concepts to ensure consistency. When the CCG cannot fully identify all conflicts, will this severely hurt PASER's results? Additionally, the construction of the CCG involves defining and identifying concepts and building an adjacency matrix based on concept co-occurrence. What is the time cost of constructing the CCG? In time-sensitive applications with large datasets and many concepts, would CCG construction become too costly to afford?

● Can PASER be applied to LLMs with different architectures, such as the pruned Mixtral 8x7B model for recovery?

Other Comments Or Suggestions

Please refer to strengths and weaknesses.

Author Response

W1: In fact, the paper presents empirical evidence supporting this statement in multiple places: 1) In Table 1 and throughout our experiments, we demonstrate that general-purpose instruction tuning data selection methods - which focus on selecting "high-quality" instruction data in general - consistently underperform our PASER method, which specifically targets recovery needs. 2) Figure 4 illustrates that different capabilities degrade unevenly during pruning. This uneven deterioration means that high-quality data that does not specifically target the severely degraded capabilities will be less effective for recovery. 3) In Appendix A, we show that employing the full recovery dataset or a uniformly split subset can hardly achieve satisfactory performance, despite these containing many generally high-quality samples.

This distinction is fundamental to PASER's design philosophy - while general data selection methods focus on intrinsic quality of instructions, effective recovery requires targeting the specific capabilities most severely impacted by pruning, which may not align with general notions of data quality. Correspondingly, our approach identifies which instruction data is most useful specifically for recovery, not just which data is generally high-quality.

W2: On the choice of SentenceBERT and Diffusion Kernel: This is a careful design choice focused on practicality and effectiveness: 1) Domain adaptability: Though not specifically end-to-end trained for LLM instruction clustering, SentenceBERT has demonstrated strong transfer capability across various text semantic tasks. 2) Comparative analysis: As shown in Table 12 (Appendix H), we conducted comprehensive comparisons with alternative clustering approaches. The superior performance of our approach demonstrates that our method, while built on existing techniques, outperforms these alternatives consistently. 3) Computational efficiency: Our approach avoids the overhead of training domain-specific embeddings from scratch, making PASER more accessible and deployable.

Regarding cluster visualization: We have prepared visualization: https://postimg.cc/nXC0XmJ4, which confirms that our approach successfully identifies meaningful semantic structures in the instruction space that correspond to different LLM capabilities.

W3: Large-scale applications: Computing the full adjacency matrix would be indeed prohibitive for very large datasets. For LaMini (2.58M samples), we implemented an approximate k-nearest neighbors approach using locality-sensitive hashing rather than constructing the complete N×N matrix. This reduced computation from O(N²) to O(N log N) with minimal impact on clustering quality. Pre-computing embeddings and using incremental updates would further improve efficiency.
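To make this concrete, below is a minimal sketch of building a sparse k-nearest-neighbor affinity graph from precomputed embeddings instead of the dense N×N matrix. It uses scikit-learn's exact kNN search as a stand-in for the LSH-based approximate search described above; the embedding matrix `X`, neighbor count `k`, and bandwidth `sigma` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors


def sparse_affinity(X: np.ndarray, k: int = 20, sigma: float = 1.0) -> csr_matrix:
    """Sparse kNN affinity graph (stand-in for the dense N x N kernel matrix).

    X: (N, d) embedding matrix, e.g. SentenceBERT embeddings.
    Returns a symmetric CSR matrix with Gaussian edge weights.
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own nearest neighbor
    dist, idx = nn.kneighbors(X)                      # both of shape (N, k+1)
    rows = np.repeat(np.arange(X.shape[0]), k)
    cols = idx[:, 1:].ravel()                         # drop the self-neighbor column
    vals = np.exp(-(dist[:, 1:].ravel() ** 2) / (2 * sigma ** 2))
    W = csr_matrix((vals, (rows, cols)), shape=(X.shape[0],) * 2)
    return W.maximum(W.T)                             # symmetrize the graph
```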

Robustness to d: Our method is relatively robust to the choice of d. We conducted sensitivity analysis with d ranging from 4 to 64 and found consistent clustering results (Rand Index >0.85) across this range. We chose d=16 based on the eigenvalue decay pattern, where eigenvalues beyond this dimension contributed negligibly to the representation. Performance variation was within ±0.2 points across this range.

|D|, |B|, and |S| values: For Alpaca: |D|=52K, |B|=10.4K (20%), |S|=10.4K. For LaMini: |D|=2.58M, |B|=103.2K (4%), |S|=103.2K. In all cases, we filled the allocated budget |B| completely, with |S|=|B| after filtering and selection. We'll add these details to the revised paper.

W4: Undetected conflicts: While CCG cannot identify all possible conflicts, our experiments show it remains effective even with imperfect conflict detection. When we deliberately introduced undetectable conflicting samples (with conflicts expressed through paraphrasing rather than direct concept matches), performance degradation was limited to 0.3 points.

CCG construction time: Constructing the CCG is relatively efficient. For Alpaca (52K samples), CCG construction took approximately 42 seconds. For LaMini (2.58M samples), we used parallelization across multiple cores, and construction took ~8 minutes. These times are negligible compared to the recovery training time (hours to days). Additionally, CCG construction can be performed offline as a preprocessing step before recovery training begins. The empirical benefits of conflict detection (0.68-2.39 points ↑) outweigh the computational overhead.
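For readers who want a concrete picture of the data structure being discussed, here is a minimal, illustrative sketch of a co-occurrence-based concept consistency graph with the acceptance rule described in the paper (reject a sample that links two already-seen concepts that have never co-occurred). The plain-set representation and method names are assumptions, not the authors' implementation.

```python
from itertools import combinations


class ConceptConsistencyGraph:
    """Minimal co-occurrence graph over extracted concepts (illustrative only)."""

    def __init__(self):
        self.nodes = set()   # concepts seen so far
        self.edges = set()   # frozensets of concept pairs that have co-occurred

    def is_consistent(self, concepts):
        """Reject a sample that introduces a *new* relation between two concepts
        that are both already in the graph but have never co-occurred."""
        for a, b in combinations(set(concepts), 2):
            if a in self.nodes and b in self.nodes and frozenset((a, b)) not in self.edges:
                return False
        return True

    def add(self, concepts):
        concepts = set(concepts)
        self.nodes |= concepts
        self.edges |= {frozenset(p) for p in combinations(concepts, 2)}


# Usage: accept a sample only if ccg.is_consistent(sample_concepts), then ccg.add(sample_concepts)
```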

W5: Our PASER framework is model-agnostic and can be applied to LLMs with different architectures. We have conducted experiments with Mixtral 8x7B under LLM-Pruner:

| Recovery Method | WikiText2↓ | PTB↓ | Averaged Reasoning↑ |
| --- | --- | --- | --- |
| Instruction Mining | 14.86 | 25.92 | 62.68 |
| IFD | 14.23 | 24.65 | 63.17 |
| Nuggets | 13.79 | 23.81 | 63.40 |
| PASER | 12.31 | 21.06 | 64.76 |

Results show that PASER significantly outperforms other instruction selection methods, demonstrating its effectiveness for MoE architectures.

If our rebuttal has addressed your concerns, could you please kindly consider raising the score?

Reviewer Comment

After reading the rebuttal, I am still concerned about the choice of Sentence-BERT and the diffusion kernel. Additionally, the undetected conflicts present in the CCG indicate that PASER may be challenging to apply in certain knowledge-intensive domains. Providing a theoretical error bound might offer a deeper understanding of PASER. Therefore, I have revised my rating to 1.

Author Comment

Dear Reviewer aKaN:

Thank you for your continued engagement with our paper. We respectfully wish to address the concerns in your final comment as they appear to be based on some misunderstandings about our work:

On theoretical analysis: Our paper does contain substantial theoretical analysis in Section 3.5, where we provide a formal theorem (Theorem 1) with a detailed proof of our algorithm's time complexity (O(N log N + NC²)). For data selection methods, time complexity analysis is crucial as it determines practical applicability. Regarding theoretical error bounds: while valuable in principle, such bounds require making assumptions that aren't realistic in LLM contexts given their non-convex loss landscapes and complex parameter spaces. This is why empirical validation across diverse settings (as we provide) is the standard approach in this field.

Nevertheless, we still provide a theoretical error bound analysis following your comment: https://anonymous.4open.science/r/PASER-E606/error_bound.pdf (you may download it for better visualization). However, we need to clarify that this analysis relies on idealized assumptions, such as Lipschitz continuity of recovery performance and capability degradation correlation, which are hard to satisfy in real scenarios.

On SentenceBERT and diffusion kernel choices: Our technical choices were made after extensive empirical validation, not arbitrarily. First, we argue that Sentence-BERT suits our scenario well because it transfers well across different text semantic tasks and offers relatively high efficiency. In fact, we also tried using much larger pretrained language models such as LLaMA3-8B for embeddings, which provides negligible performance improvements (less than 0.05 points) while significantly increasing computational costs. Second, in A2 to Reviewer 3hM3, we conducted additional ablation studies comparing our diffusion kernel approach with other dimensionality reduction techniques: UMAP, PCA, and t-SNE. The results demonstrate the superiority of the diffusion kernel. The reasons for its better performance are as follows: 1) it effectively preserves the manifold structure in high-dimensional embedding spaces; 2) it adapts to the intrinsic geometry of the data without assuming linear separability; 3) it performs well with the heterogeneous data distributions typical of instruction tuning datasets. Third, in Table 12, Appendix H, we provided comprehensive experimental comparisons with alternative clustering approaches. The results also validate the effectiveness of our design. Finally, we have provided a visualization: https://postimg.cc/nXC0XmJ4, which demonstrates that our approach effectively clusters instructions by capability. This is essential for targeted recovery.

On the CCG limitations: While we acknowledged the theoretical possibility of undetected conflicts (indeed, no method can be 100% successful), our experiments show that when we deliberately introduce undetectable conflicting samples, the performance degradation is minimal (no more than 0.3 points). This is far outweighed by CCG's benefits (0.68-2.39 points improvement). In fact, in real-world datasets like Alpaca and LaMini, cases that could circumvent our CCG-based detection mechanism are exceedingly rare. Regarding knowledge-intensive domains in particular: these typically feature standardized terminology and well-defined concepts, which actually makes our CCG approach more reliable in such contexts, not less.

We hope these clarifications help address your concerns and convince you to consider raising the score.

Official Review (Rating: 3)

This paper proposes PASER for efficient recovery of pruned large language models (i.e., fine-tuning pruned large language models to recover their performance). It uses SentenceBERT to embed data, a diffusion kernel to reduce dimensions, and then applies non-negative matrix factorization-based spectral clustering to cluster the data. For each cluster, it assesses the performance degradation and allocates the budget accordingly.
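To make the summarized pipeline concrete, a rough sketch is shown below. The SentenceBERT checkpoint, the use of scikit-learn's SpectralEmbedding as a stand-in for the paper's diffusion-kernel step, and the NMF-based hard cluster assignment are all illustrative assumptions rather than the authors' exact configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import NMF
from sklearn.manifold import SpectralEmbedding


def cluster_instructions(instructions, n_clusters=10, d=16):
    # 1) Semantic embedding with a SentenceBERT-style encoder (placeholder checkpoint)
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    X = encoder.encode(instructions, convert_to_numpy=True)

    # 2) Kernel-based dimensionality reduction (stand-in for the paper's diffusion kernel)
    Z = SpectralEmbedding(n_components=d, affinity="rbf").fit_transform(X)

    # 3) NMF-based clustering on shifted (non-negative) low-dimensional coordinates
    W = NMF(n_components=n_clusters, init="nndsvda", max_iter=500).fit_transform(Z - Z.min())
    return W.argmax(axis=1)   # hard cluster assignment per instruction
```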

Questions For Authors

N/A

Claims And Evidence

Yes

Methods And Evaluation Criteria

Yes

Theoretical Claims

Yes.

The paper makes a theoretical claim regarding time complexity. The authors should clearly specify which factors are considered constants and hidden in the big-O notation, such as the vocabulary size of tokens. For example, when computing JSD, the computation is naturally linear with respect to the vocabulary size. However, this aspect was not discussed. The authors need to clarify this point.

Experimental Designs Or Analyses

Yes.

I would recommend that the authors conduct an ablation study. This paper combines several techniques, such as SentenceBERT, diffusion kernel, non-negative matrix factorization-based clustering, and budget allocation based on the Jensen-Shannon distance between the original and pruned models. My comments are as follows:

  • The paper lacks motivation for why specific techniques were chosen over others. For instance, there are many dimension reduction methods available—why was the diffusion kernel selected? The presentation of the methods needs significant improvement.
  • Given the variety of techniques integrated into the approach, it is unclear which ones are most effective and which may be less helpful. What would happen if a different dimension reduction technique were used while keeping the other components the same? An ablation study is needed to address these questions.
  • More specifically, the authors customized Instruction Mining (Cao et al.), IFD (Li et al., 2024a), and Nuggets (Li et al., 2024b) for the post-pruning recovery training scenario. What about using SentenceBERT and the diffusion kernel, and then applying the above techniques? This would reveal whether the JSD-based budget allocation works.

Supplementary Material

No

Relation To Broader Scientific Literature

This paper presents a very interesting end-to-end pipeline for efficient pruned large language model recovery.

Essential References Not Discussed

No

Other Strengths And Weaknesses

N/A

Other Comments Or Suggestions

N/A

Author Response

C1: Theoretical Claims

A1: In our time complexity analysis (Section 3.5, page 5), we considered the following:

  1. For JSD computation, we indeed treated vocabulary size |V| as a constant factor. The practical vocabulary size (typically 32K-100K tokens) remains fixed regardless of instruction dataset size. While JSD computation is linear in |V|, this factor is consistent across all samples and doesn't affect asymptotic scaling with N.
  2. Sequence length |x| + |y| was treated as a constant, as instruction-tuning inputs and outputs typically have bounded lengths.
  3. The number of clusters K was treated as a constant, though we explicitly noted "K ≤ N" in our analysis. In practice, K typically ranges from 8-20 regardless of dataset size.
  4. Embedding dimension d from our manifold learning was treated as a constant (set to 16 in our experiments).

In the revised paper, we will clarify all these assumptions to provide readers with a more complete understanding of PASER's computational characteristics.

C2: Experimental Designs Or Analyses

A2: Regarding component motivation and ablation studies: In response to points 1) and 2), we would like to highlight that our paper does include a comprehensive ablation study in Section 4.2 (Table 4), further detailed in Appendix G (Table 11), where we systematically removed each of the three key components: S²RIC, CDAIS, and NTM. These ablation studies demonstrate that all three components contribute positively to model recovery across different pruning schemes. However, we acknowledge that our paper could provide clearer motivation for the specific techniques chosen within each component.

As for dimensionality reduction techniques, the diffusion kernel was selected as our dimensionality reduction method after comparing it with alternative techniques such as UMAP, PCA, and t-SNE. The diffusion kernel suits our specific scenario better because: 1. It effectively preserves the manifold structure in high-dimensional embedding spaces; 2. It adapts to the intrinsic geometry of the data without assuming linear separability; 3. It performs well with heterogeneous data distributions typical in instruction tuning datasets. To validate its effectiveness with empirical evidence, we present an additional ablation study below where we change the dimensionality reduction component while keeping the rest of PASER intact (LLaMA2-7B under LLM-Pruner):

| Method | WikiText2↓ | PTB↓ | Averaged Reasoning↑ |
| --- | --- | --- | --- |
| PASER w/ UMAP | 16.92 | 27.83 | 60.31 |
| PASER w/ PCA | 17.05 | 28.16 | 60.18 |
| PASER w/ t-SNE | 17.21 | 28.42 | 60.05 |
| PASER w/ Diffusion Kernel (Full) | 16.40 | 26.35 | 61.10 |
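For readers unfamiliar with the diffusion kernel, a minimal diffusion-map embedding in the spirit of Coifman & Lafon (2006) is sketched below. The Gaussian bandwidth, diffusion time, and dense eigendecomposition are illustrative choices rather than the paper's tuned settings; for large N one would work with a sparse kNN kernel instead of the dense matrix.

```python
import numpy as np


def diffusion_map(X: np.ndarray, d: int = 16, t: int = 1, sigma: float = 1.0) -> np.ndarray:
    """Embed rows of X with the top-d non-trivial eigenvectors of the diffusion operator."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-sq_dists / (2 * sigma ** 2))                    # Gaussian affinity kernel
    P = K / K.sum(axis=1, keepdims=True)                        # row-normalized Markov matrix
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)[1:d + 1]                  # skip the trivial eigenvalue 1
    # Diffusion coordinates: eigenvectors scaled by eigenvalues^t (t = diffusion time)
    return eigvecs[:, order].real * (eigvals[order].real ** t)
```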

For the clustering component, in Table 12 of Appendix H, we compared our NMF-based spectral clustering with alternative clustering approaches including NMF_TFIDF, LDA_TFIDF, KMeans_TFIDF, Spectral_MTEB, and Spectral_BERT. Our approach consistently outperformed these alternatives across all pruning schemes, validating the soundness of our design.

Besides, we studied the divergence measurement component selection. When replacing JSD with other options and keeping the rest intact, the performance comparison is as follows:

| Method | WikiText2↓ | PTB↓ | Averaged Reasoning↑ |
| --- | --- | --- | --- |
| PASER w/ KL divergence | 16.91 | 27.54 | 60.37 |
| PASER w/ Wasserstein distance | 16.73 | 27.26 | 60.59 |
| PASER w/ JSD (Full) | 16.40 | 26.35 | 61.10 |

As the table shows, the other variants do not surpass our JSD-based version. The detailed rationale is provided in Sec. 3.3.

These results and analysis demonstrate that while the overall PASER framework is robust, the specific technical choices made for each component meaningfully contribute to the method's overall effectiveness.

Regarding JSD-based budget allocation effectiveness: To address point 3), we conducted the additional experiment suggested by the reviewer: integrating our dimensionality reduction and clustering approach with existing data selection methods, while removing our JSD-based budget allocation. The results are summarized in the table below:

| Method | WikiText2↓ | PTB↓ | Averaged Reasoning↑ |
| --- | --- | --- | --- |
| w/o pruning | 12.62 | 22.14 | 62.91 |
| w/o training | 20.34 | 38.81 | 57.78 |
| Instruction Mining | 23.31 | 40.63 | 57.65 |
| Instruction Mining + S²RIC clustering | 20.92 | 36.47 | 58.85 |
| IFD | 19.76 | 33.30 | 58.59 |
| IFD + S²RIC clustering | 17.95 | 32.61 | 59.47 |
| Nuggets | 20.02 | 35.19 | 58.69 |
| Nuggets + S²RIC clustering | 18.84 | 33.16 | 59.21 |
| PASER (Full) | 16.40 | 26.35 | 61.10 |

These results confirm that while S²RIC clustering improves existing methods, the JSD-based capability degradation assessment and budget allocation are critical components that provide additional performance gains. We will enhance the presentation of these motivations and ablation studies in the revised paper to make our technical choices and their contributions clearer.

If our rebuttal has addressed your concerns, could you please kindly consider raising the overall recommendation score?

Reviewer Comment

I would like to thank the authors for the additional experiments and detailed clarifications. My concerns have been largely addressed, and I will raise my score accordingly. One remaining question: do the authors plan to open-source the code and data to facilitate reproducibility of the results?

Author Comment

We sincerely appreciate your positive feedback on our rebuttal. Yes, we plan to make both our code and data publicly available upon acceptance of the paper. This will include well-documented implementations and processed datasets to facilitate reproduction of our results. Thank you again for your valuable input throughout the review process.

Official Review (Rating: 4)

This paper proposes a data selection method for effectively recovering model performance after pruning. Beyond the efficiency argument, the need for such a method is well justified through experimental results, where the authors show that simply training on the full dataset or randomly selected data not only performs worse than the proposed method but also underperforms compared to other methods not specifically designed for post-pruning recovery.

The method consists of the following key components:

  • Encoding instruction data into a low-dimensional space using SentenceBERT and a Diffusion Kernel.
  • Clustering the samples in this space.
  • Capability-aware instruction selection: i) Assigning a sampling budget to clusters based on the average CDS (defined using JSD); ii) Sampling examples in order of decreasing IES to prioritize efficiency (favoring shorter examples).
  • Negative Transfer Mitigation: Ensuring only conceptually consistent examples are sampled, meaning that selected examples must contain concepts that do not contradict relationships already represented in the concept consistency graph.

Questions For Authors

See section above.

Claims And Evidence

Yes, all claims are supported by proper evidence. The main claims include:

  • post-pruning performance degrades differently for different capabilities
  • data selection is crucial for optimal post-pruning performance recovery
  • each component of the proposed PASER method contributes positively to the final performance recovery

Methods And Evaluation Criteria

Yes, proposed benchmarks make sense for the presented evaluations.

Theoretical Claims

I briefly checked the equations in Section 3; they seem to be correct.

Experimental Designs Or Analyses

Yes.

Supplementary Material

I did not check the supplementary material.

Relation To Broader Scientific Literature

Overall, I find the relation to the broader scientific literature is discussed sufficiently well. However, I suggest the authors include references in Section 3.2 to the works that first proposed the manifold learning techniques applied there.

Essential References Not Discussed

I could not identify any essential references that are not discussed.

Other Strengths And Weaknesses

Strengths:

  • clarity and sufficient level of detail of writing
  • strong experimental result and sound story
  • rich ablations
  • the method is well designed overall, it presents several interesting design decisions that can also transfer to other applications (the choice of clustering algorithm, manifold learning technique etc.)

Weaknesses:

  • while the proposed method demonstrates strong empirical results, it appears somewhat complex, involving multiple hyperparameters, which could make implementation difficult in practice.

Other Comments Or Suggestions

  • typo ll. 381 - 382 "(from 10K to 10K samples)"?
  • Regarding "Capability Degradation Assessment": the increased JSD between M_o and M_p does not necessarily mean degradation in the performance of M_p, does it? Maybe the naming here can be adjusted?
  • I am a bit confused by the proposed concept consistency graph (Definition 1 and subsequent): 1) I am not sure how exactly the concepts are extracted from the instructions; 2) is my understanding correct (from ll. 237 - 238) that a sample will not be selected if it contains a pair of concepts that are already present in the graph but are not yet linked (i.e., do not co-occur yet)? Wouldn't this significantly limit the diversity of selected data?
  • Why did the authors decide to only apply the technique to the instruction part of the data, and not also to the output?
Author Response

C1: Relation To Broader Scientific Literature

A1: In the revised version, we will add relevant foundational references for the manifold learning techniques applied in our work, including:

  1. For manifold learning in high-dimensional spaces: Belkin & Niyogi (2003) "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation"
  2. For diffusion maps: Coifman & Lafon (2006) "Diffusion maps" and Nadler et al. (2006) "Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators"
  3. For NMF-based spectral clustering: Ding et al. (2005) "On the equivalence of nonnegative matrix factorization and spectral clustering" (which we currently cite only briefly)
  4. For spectral gap methods: Chung (1997) "Spectral Graph Theory" and von Luxburg (2007) "A Tutorial on Spectral Clustering"

C2: Other Strengths And Weaknesses

A2: For hyperparameter settings, please see our A2 to Reviewer hF4D. We will include them in the final version of the paper. Besides, we have provided the code at the anonymous URL (Line 1262, Page 23) to ease reproduction.

C3: Typo

A3: Sorry, this should read "from 10K to 100K samples". 10K indicates the scale of the selected Alpaca dataset (20% of 52K), and 100K indicates the scale of the selected LaMini dataset (4% of 2.58M).

C4: Regarding Capability Degradation Assessment

A4: A standard assumption in LLM pruning is to treat the original model (M_o) as the performance reference (oracle model). The JSD between M_o and M_p measures behavioral divergence in output probability distributions, which generally correlates with capability degradation. We chose JSD over direct performance metrics because it captures subtle changes in model behavior that might not be immediately apparent in accuracy or loss values due to sampling uncertainty. JSD's information-theoretic foundation allows us to detect divergences in the underlying probability distributions, which often precede observable performance drops. While JSD may not be perfectly proportional to performance degradation in all cases, our experiments confirm it effectively identifies capabilities requiring focused recovery attention. We'll clarify this relationship in the revised paper to avoid potential misinterpretation.
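As an illustration of this kind of behavioral-divergence measurement (not the paper's exact CDS formula from Section 3.3), the sketch below computes the average per-token Jensen-Shannon divergence between the next-token distributions of the original and pruned models, assuming Hugging Face-style causal LMs; the cluster-level score would then be the mean over a cluster's samples.

```python
import torch
import torch.nn.functional as F
from scipy.spatial.distance import jensenshannon


@torch.no_grad()
def sample_jsd(model_orig, model_pruned, input_ids):
    """Average JSD between next-token distributions of the original vs. pruned model."""
    p = F.softmax(model_orig(input_ids).logits, dim=-1)    # shape (1, T, |V|)
    q = F.softmax(model_pruned(input_ids).logits, dim=-1)
    # scipy's jensenshannon returns the JS *distance* (sqrt of the divergence), hence the square
    per_token = [jensenshannon(pi, qi, base=2) ** 2
                 for pi, qi in zip(p[0].cpu().numpy(), q[0].cpu().numpy())]
    return float(sum(per_token) / len(per_token))
```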

C5: I am a bit confused by the proposed concept consistency graph

A5: Regarding concept extraction: We extract concepts from instruction-output pairs using a modified RAKE (Rapid Automatic Keyword Extraction) algorithm, which identifies key phrases based on word co-occurrence and frequency statistics. As demonstrated in Appendix J (pages 21-23), this approach effectively captures domain-specific entities (e.g., "quantum computing," "neural network," "backpropagation") that represent core knowledge units in the instruction. We use parts-of-speech filtering to prioritize meaningful noun phrases and named entities.
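As a rough illustration of co-occurrence/frequency-based keyphrase extraction in the spirit of RAKE (the authors' modified version additionally applies POS filtering and named-entity prioritization, which is omitted here), consider the following self-contained sketch; the stopword list and scoring are simplified assumptions.

```python
import re
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "for", "on", "with", "is", "are"}


def rake_like_keyphrases(text, top_k=10, max_len=4):
    # Candidate phrases = maximal runs of non-stopword tokens
    tokens = re.findall(r"[a-zA-Z][a-zA-Z\-]+", text.lower())
    phrases, current = [], []
    for tok in tokens:
        if tok in STOPWORDS:
            if current:
                phrases.append(tuple(current))
                current = []
        else:
            current.append(tok)
    if current:
        phrases.append(tuple(current))
    phrases = [p for p in phrases if len(p) <= max_len]

    # Word score = degree / frequency; phrase score = sum of word scores (as in RAKE)
    freq, degree = defaultdict(int), defaultdict(int)
    for p in phrases:
        for w in p:
            freq[w] += 1
            degree[w] += len(p)

    def score(p):
        return sum(degree[w] / freq[w] for w in p)

    return [" ".join(p) for p in sorted(set(phrases), key=score, reverse=True)[:top_k]]
```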

Regarding potential diversity limitation: Your understanding is correct - we exclude samples that introduce new relationships between existing concepts. While this might appear to limit diversity, our experiments show this constraint is crucial for preventing negative transfer. When we removed this constraint in ablation studies, we observed performance degradation across most tasks (Table 4). The primary goal of recovery is targetedness rather than diversity. Full-dataset training offers maximum diversity but achieves suboptimal results (Tables 1-3), often due to conflicting information. Our approach balances concept coverage with consistency, ensuring coherent capability recovery while preventing harmful conceptual conflicts.

We'll clarify these aspects in the revised paper to address potential confusion.

C6: Why did authors decide to only apply the technique to the instruction part of the data, and not also to the output?

A6: We chose to build our Concept Consistency Graph (CCG) using only the instruction part of each data sample for several reasons: 1) Instructions typically contain sufficient context and almost all main concepts needed to identify capability domains. 2) Instructions more directly represent the task domain and required knowledge, making them ideal for detecting potential conflicts. 3) Outputs can vary significantly in style and implementation details, potentially introducing noise into the concept space.

Our experiments showed that instruction-based concept extraction was sufficient for effective negative transfer mitigation. This approach also reduced computational complexity while maintaining performance. It's important to note that while the CCG is built using only instructions, the actual recovery training process utilizes both instructions and their corresponding outputs for fine-tuning, ensuring the model learns complete instruction-response patterns.

Official Review (Rating: 3)

This paper proposes a data selection method for recovery fine-tuning: additional training for pruned LLMs to recover the original performance as well as possible. The method consists of three subcomponents: (1) clustering instruction data, (2) assigning a data budget to each cluster according to the probabilistic discrepancy between the original and pruned models on that cluster, and (3) removing data with inconsistent concepts. Experiments on various models showed that the datasets generated by the proposed method consistently improve the resulting models compared with several conventional data selection methods.

update after rebuttal:

Thank you for your comments on my review! However, I have not found a good enough reason to change my review result, so I will leave the overall score as it is.

Questions For Authors

I'd appreciate if the authors answered questions raised in the Theoretical Claims section.

Claims And Evidence

The method is well motivated: it is designed specifically for selecting the training data for recovery fine-tuning. Though there is no strong theoretical evidence for the method itself, it may work better than other data selection methods that are not focused on model recovery.

Methods And Evaluation Criteria

The overall method is well developed, but it looks like the result of engineering: a combination of many subroutines, and it is not easy to say whether the selection of each technique is the only suitable option. Evaluation is conducted on three model series: Llama2, Llama3 (English-centric), and Baichuan2 (En-Zh bilingual). Especially for Llama2, they conducted experiments on models up to 70B in size. This seems sufficient to claim general effectiveness of the proposed method on various LLMs.

Theoretical Claims

Most components are designed empirically: it is not easy to say that there are underlying facts supporting each subroutine. In particular, I am wondering whether:

  • Equation (6) is really optimal for determining the amount of training data. This amount looks crucial to guaranteeing the final performance of the recovered model, but there appears to be no strong evidence for adopting simple normalization over the calculated CDS.
  • Algorithm 1 in Appendix B is optimal and agnostic to the order of data consumption.

Experimental Designs Or Analyses

The experiments look comprehensive enough in terms of downstream performance. They control several parameters to obtain pruned models, and the proposed method successfully recovers model performance in most cases. For ablation, Table 4 shows that every subcomponent of the proposed method helps improve the resulting model. Figure 2 also shows that the proposed method is better than other conventional methods while maintaining robustness against the data budget.

Supplementary Material

NA

Relation To Broader Scientific Literature

Data selection is studied across a wide range of machine learning tasks, and model pruning and its potential degradation is one of the core interests among model users.

Essential References Not Discussed

Not sure

Other Strengths And Weaknesses

NA

Other Comments Or Suggestions

NA

Author Response

C1: Claims And Evidence

A1: While PASER lacks strong theoretical guarantees, our approach is guided by sound principles: targeting severely impaired capabilities is intuitively more efficient than general data selection methods not designed for recovery. The difficulty in establishing theoretical connections between data selection and recovery performance stems from LLMs' inherent characteristics - their highly non-convex loss landscapes, vast parameter spaces (billions of parameters), and complex capability distributions that aren't easily characterized mathematically. Rather than introducing impractical assumptions that would limit real-world applicability, we focused on empirical validation across diverse models and pruning schemes. Our consistent performance improvements across these scenarios provide strong evidence for PASER's effectiveness, even without formal theoretical bounds.

C2: Methods And Evaluation Criteria

A2: Thank you for this comment. We acknowledge that PASER combines multiple components to address the multi-faceted challenge of efficient recovery. While PASER may appear engineering-driven, each component addresses a specific, critical aspect of pruned LLM recovery: (1) capability identification through clustering, (2) targeted resource allocation based on degradation severity, and (3) negative transfer prevention. Our ablation studies (Table 4, page 8, and Appendix G) validate each component's contribution, showing that removing any single component consistently degrades performance.

We explored alternative techniques for clustering (Table 12, page 20), demonstrating that our S²RIC approach outperforms other methods. In additional experiments (not included due to space constraints), we evaluated different divergence metrics for capability degradation assessment: KL-divergence reduced average reasoning performance by 0.73 points compared to JSD, while Wasserstein distance reduced it by 0.51 points. For negative transfer mitigation, we compared our CCG approach with simpler methods like keyword filtering (1.32 points lower) and cosine similarity thresholding (0.89 points lower). For more results regarding the selection of each component, you may also refer to A2 for Reviewer 3hM3.

Rather than an arbitrary assembly, PASER represents a principled approach to the novel problem of post-pruning recovery data selection, with each component carefully designed and validated.

C3: Theoretical Claims

A3: Regarding the optimality of Equation (6) for budget allocation: We acknowledge that our proportional allocation approach based on CDS is heuristic rather than theoretically optimal. Different allocation strategies were explored in our experiments, including equal allocation, square-root scaling, and logarithmic scaling. The linear proportional allocation (Equation 6) consistently outperformed alternatives, showing ~0.4-0.8 points higher average performance across pruning schemes. While we cannot claim theoretical optimality, this approach intuitively directs more resources to capabilities with greater degradation while still maintaining some recovery effort for less affected capabilities. Finding a provably optimal allocation would require making unrealistic assumptions about capability independence and recovery dynamics.
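For concreteness, the allocation strategies compared above can be written compactly as below; this is a sketch assuming `cds` holds one capability degradation score per cluster and `B` is the total sample budget, not the paper's exact Equation (6) implementation.

```python
import numpy as np


def allocate_budget(cds, B, strategy="linear"):
    """Split a total budget B across clusters in proportion to (a transform of) their CDS."""
    cds = np.asarray(cds, dtype=float)
    weights = {"linear": cds,                 # proportional allocation (as in Equation 6)
               "equal": np.ones_like(cds),    # uniform baseline
               "sqrt": np.sqrt(cds),          # square-root scaling
               "log": np.log1p(cds)}[strategy]
    raw = B * weights / weights.sum()
    alloc = np.floor(raw).astype(int)
    # Hand the rounding remainder to the clusters with the largest fractional parts
    alloc[np.argsort(raw - alloc)[::-1][: B - alloc.sum()]] += 1
    return alloc


# e.g. allocate_budget([0.8, 0.3, 0.1], B=1000) -> [667, 250, 83]
```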

Regarding algorithm optimality and order-sensitivity: To be honest, we cannot claim theoretical optimality for Algorithm 1. In fact, finding a provably optimal subset would require solving a complex combinatorial optimization problem with O(2^N) complexity, which is computationally intractable for large instruction datasets. Our algorithm represents a greedy approach that makes locally optimal choices at each step. The algorithm has inherent order-dependency since the Concept Consistency Graph evolves as samples are added. Considering the intra-cluster order is determined by the IES score, we tested different cluster orderings and found performance variations of ±0.1-0.3 points, indicating that our approach is relatively robust. Given the NP-hard nature of the optimal subset selection problem (as formulated in Equation 1), Algorithm 1 provides a practical approximation that balances computational efficiency with strong empirical performance. Future work could explore more sophisticated optimization techniques with stronger theoretical guarantees.
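The order-dependency described here can be seen directly from the structure of the greedy loop. The sketch below assumes per-cluster budgets from the allocation step, an `ies_score` callable, samples carrying a `concepts` field, and a concept graph exposing `is_consistent`/`add`; these interfaces are illustrative, not the authors' Algorithm 1.

```python
def greedy_select(clusters, budgets, ies_score, ccg):
    """Greedy, order-dependent selection: within each cluster, take samples in
    decreasing IES order, keeping only those the concept graph accepts.

    clusters: list of lists of sample dicts; budgets: per-cluster sample counts;
    ies_score: callable sample -> float; ccg: object with is_consistent() / add().
    """
    selected = []
    for cluster, budget in zip(clusters, budgets):
        taken = 0
        for sample in sorted(cluster, key=ies_score, reverse=True):
            if taken >= budget:
                break
            # Acceptance depends on what was added before, hence the order-dependency
            if ccg.is_consistent(sample["concepts"]):
                ccg.add(sample["concepts"])
                selected.append(sample)
                taken += 1
    return selected
```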

We appreciate these theoretical questions and will clarify these limitations in the revised paper.

C4: Questions For Authors

A4: Please see A3.

If our rebuttal has addressed your concern, could you please kindly consider raising the overall recommendation score?

Official Review (Rating: 3)

The paper introduces PASER, a novel method for selecting instruction tuning data to recover the performance of pruned large language models (LLMs). Pruning, particularly structured pruning, often degrades model capabilities, and instruction tuning has shown promise for efficient recovery. PASER comprises three key components: semantic-structural recovery instruction clustering (S²RIC) to group instructions by capability, capability degradation-aware instruction selection (CDAIS) to prioritize severely affected capabilities, and negative transfer mitigation (NTM) via a concept consistency graph (CCG) to filter conflicting data. Evaluated on LLMs like LLaMA2, LLaMA3, and Baichuan2 across structured (e.g., LLM-Pruner), semi-structured (e.g., Wanda), and unstructured (e.g., SparseGPT) pruning schemes, PASER outperforms baselines like random selection and general instruction tuning methods (e.g., Instruction Mining, Nuggets) in language modeling (WikiText2, PTB) and reasoning tasks (e.g., BoolQ, PIQA). It achieves higher performance with less data, reducing training overhead, as demonstrated in Tables 1 and 2.

Update after rebuttal: Thanks to the authors for the replies, which partly resolve my concerns; however, considering the overall quality of this paper, I keep my original recommendation.

Questions For Authors

  1. Dataset Quality: How does PASER handle noisy instructions (e.g., in Alpaca)? Robustness to quality issues could influence real-world applicability, potentially raising my evaluation if addressed.

Claims And Evidence

The paper claims PASER enhances recovered LLM performance and reduces training overhead compared to baselines. This is well-supported by experimental results. For instance, Table 1 (Section 4, Page 6) shows PASER achieving an average performance of 61.10 on LLaMA2-7B under LLM-Pruner, surpassing random selection (47.69) and Nuggets (58.59). Table 2 (Section 4, Page 7) extends this across models, with PASER recovering LLaMA2-70B reasoning to 69.62, closer to the unpruned 71.72 than Nuggets (67.73). Efficiency is evidenced by PASER using 20% of Alpaca data (Section K, Page 23), yet outperforming full-data recovery.

However, the claim of "efficiency and scalability" (Section 3.5, Page 5) is problematic. The time complexity of O(N log N + N C²) suggests potential computational intensity for large N or C, despite C being small in practice. No empirical runtime data supports this claim, weakening its convincingness. The claim of negative transfer mitigation (Section 3.4, Page 4) relies on indirect evidence through performance gains, lacking direct analysis (e.g., rejected sample impact), which could strengthen validation.

Methods And Evaluation Criteria

PASER’s methods—clustering via SentenceBERT and NMF spectral clustering, degradation assessment with JSD, and CCG-based filtering—are appropriate for targeted recovery. They address uneven capability degradation post-pruning (Section 1, Page 1). Benchmarks like WikiText2 and PTB for language modeling and seven reasoning datasets (e.g., BoolQ, HellaSwag) (Section 4.1, Page 5) align with evaluating general LLM capabilities. Using Alpaca (52K samples) and LaMini (2.58M samples) (Section 4.1, Page 5) tests scalability across data sizes.

Theoretical Claims

The primary theoretical claim is PASER’s time complexity of O(N log N + N C²) (Theorem 1, Section 3.5, Page 5). The proof decomposes this into clustering (O(N log N)) and sample selection (O(N C²)), assuming C << N simplifies to O(N log N). This breakdown is correct and aligns with the algorithm’s steps (Section 3). No discrepancies were found.

Experimental Designs Or Analyses

Experiments are sound, comparing PASER against random selection, full-data recovery, and baselines (Instruction Mining, IFD, Nuggets) across multiple LLMs and pruning schemes (Section 4.1, Page 5). Ablation studies (Table 11, Section I, Page 19) validate each component’s contribution, e.g., PASER without NTM drops from 61.10 to 59.25 on LLaMA2-7B. Five-run averages with t-tests (p < 0.01) (Section K, Page 23) ensure statistical rigor.

A concern is the lack of hyperparameter details in the main text (e.g., LoRA settings: rank=8, epochs=2) (Section K, Page 23), relegated to the appendix. Consistency across models/pruning schemes is unclear, potentially affecting reproducibility. Empirical selection time data would address efficiency concerns.

Supplementary Material

The authors provided code, but I did not have time to run and verify it. Conceptually, the code structure looks reasonable to me. Please let me know if you need me to verify it by running it locally.

Relation To Broader Scientific Literature

PASER builds on LLM pruning (e.g., SparseGPT, Wanda, LLM-Pruner) and instruction tuning literature (Section 2, Page 2). It uniquely targets post-pruning recovery, unlike general data selection methods (e.g., Wang et al., 2024). It advances prior recovery approaches (Ma et al., 2023; Zhao et al.) by optimizing data selection, not just using full datasets. Connections to active learning or curriculum learning could broaden its context, as PASER’s degradation-aware selection resembles these strategies.

Essential References Not Discussed

Not what I am aware of.

Other Strengths And Weaknesses

NA, already pretty covered by previous questions.

Other Comments Or Suggestions

Would be great to discuss limitations (e.g., English bias, clustering sensitivity) more prominently.

Author Response

C1: Claims and Evidence

A1: Time complexity validation: In fact, we have provided the empirical runtime data in Figure 2 (Page 8) and compared the efficiency of our PASER with baselines. In practice, for selecting 20% from Alpaca (52K samples) and 4% from LaMini (2.58M samples), the data selection process took approximately 27 minutes and 4.3 hours respectively on our server. This confirms our theoretical analysis - as C (number of concepts per sample) is typically small (average of 5-7 concepts per instruction), the dominant factor is indeed O(N log N). We'll highlight these empirical measurements in the revised paper.

Scalability evidence: Figure 2 has demonstrated scalability by showing PASER maintains efficiency advantages across different data budget ratios. At the 4% data budget on LaMini (using ~100K samples from 2.58M), PASER still completes recovery training significantly faster than baselines while achieving better performance. In the final version, we'll include a direct runtime comparison table to further strengthen this claim.

Negative transfer mitigation: We agree that the evidence for negative transfer mitigation could be strengthened with more direct analysis. In our case study (Section J, Pages 21-23), we demonstrated the Concept Consistency Graph's ability to detect and reject conflicting samples (specifically analyzing a rejected sample involving quantum computing and deep learning). To quantify this effect: across experiments, approximately 12-18% of potential samples were rejected by our negative transfer mitigation mechanism. When we deliberately included these rejected samples in place of compatible ones (in an ablation experiment not included due to space constraints), we observed a 0.9-1.6 point performance degradation across tasks. We'll incorporate this quantitative analysis in the revised paper.

Thank you for helping us identify these areas for improvement. We believe these additions will strengthen the validation of our claims while maintaining the paper's overall contributions.

C2: Experimental Designs Or Analyses

A2: Hyperparameter: For the Semantic-Structural Recovery Instruction Clustering, we used consistent settings across all experiments: diffusion time t was automatically selected using the spectral gap method, and the embedding dimension d was set to 16. The optimal number of clusters K was determined adaptively through NMF approximation error minimization, typically resulting in 8-12 clusters for Alpaca and 15-20 clusters for LaMini. For the JSD calculation in capability degradation score (CDS), we used a temperature τ=1.0 for the output probability distribution. The computational cost was approximated using the quadratic term of sequence length with a coefficient of 1.0 across all experiments. For concept extraction, we used a maximum of 10 concepts per instruction-response pair with a minimum phrase length of 2 words and a maximum of 4 words. The concept similarity threshold for consistency checking was set to 0.75 across all experiments. We maintained these same hyperparameter settings across all models and pruning schemes to ensure fair comparison. The only adaptation was the recovery data budget ratio: 20% for Alpaca and 4% for LaMini, chosen based on preliminary experiments to balance computational cost and recovery performance. We will move these key hyperparameter details from the appendix to the main experimental setup section and provide a comprehensive configuration table in the revised paper.

Empirical selection time: Please see A1.

C3: Relation To Broader Scientific Literature

A3: In the revised paper, we'll add a discussion relating PASER to active learning and curriculum learning.

C4: Other Comments Or Suggestions

A4: Due to space limitation here, we will provide more comprehensive discussion in final version.

C5: Questions For Authors

A5: Handling noisy instructions is indeed a critical aspect of real-world applicability. In fact, our negative transfer mitigation module actively filters out instructions containing conflicting or inconsistent concepts. This naturally excludes many problematic samples that contain contradictory information or conceptual inconsistencies - a common characteristic of noisy instructions. In our experiments, approximately 12-18% of potential samples were rejected by this mechanism. As shown in Figure 2, PASER demonstrates robust performance as the data budget increases, unlike random selection, which shows performance degradation when B/N increases from 0.3 to 0.4. This is because expanding the data scale also introduces the conflicting or negative data present in the original dataset. Despite this challenge, PASER maintains consistent performance advantages by focusing on capability-relevant samples and filtering inconsistencies through the CCG.

If our rebuttal has addressed your concern, could you please kindly consider raising the evaluation?

Final Decision

The paper proposes an interesting idea and has performed relatively thorough experiments. After the rebuttal, the reviewers still have concerns about the computational overhead and the weak theoretical justification. The AC downweights the review from aKaN but agrees that the paper may still need some revision.