Search for Efficient Large Language Models
Summary
Reviews and Discussion
This paper introduces a neural architecture search method for Large Language Models (LLMs) comprising three steps: inheriting the most salient weights from the original model to form the initial sub-network, using an evolutionary algorithm to search for the optimal sub-network, and reconstructing the original network’s output with calibration samples.
Strengths
Overall, I like the application of the evolutionary algorithm to the pruning problem.
- The method overcomes the limitations of uniform sparsity across all layers—a common but sub-optimal constraint in previous structured pruning methods.
- It avoids reliance on back-propagation, enhancing its applicability to larger models within a constrained memory budget.
Weaknesses
Here are some technical concerns:
- The candidate evaluation is based on perplexity on WikiText2. Since this is a single metric on a single task, the search algorithm might be biased towards the task used for candidate evaluation, yielding a model that favors one task (e.g., language modeling) over others. How do the authors ensure the performance of the searched model on other tasks?
- Based on my experience, using perplexity as a metric for selecting pruning candidates can result in good performance on WikiText but sub-optimal performance on the MMLU dataset. It would be great if authors could validate their pruned (or searched) model on MMLU.
- Table 5. The performance on QA tasks should also be presented. If space is insufficient, the authors can put these results in the Appendix and reference them in the main text.
- Figure 7. Tokens/s is not a very rigorous measurement for speed. The generated tokens per second can depend on many factors, e.g., floating point precision, KV Cache, flash attention, batch size, etc. I suggest authors use other metrics such as MACs or at least provide more specifications.
Questions
In addition, here are some points about writing:
- The term "inheriting ratio" needs to be rigorously defined in an early section. In addition, the authors should add a sentence describing the difference between "inheriting ratio" and "sparsity" / "pruning ratio". In my understanding, "inheriting ratio" = 1 - "sparsity".
- Figure 2 Left: it is better to add an annotation showing which axis is the layer index. I assume it is the horizontal axis.
- Section 3.2, Lines 139-149: The writing of this part needs to be improved; it caused me a lot of confusion when reading it.
(i) Please define what M and P are in Line 140.
(ii) Are the masks smooth or binary? If binary, then {0,1} should be used instead of R in Line 140.
(iii) Line 141: why do different layers of Attn / MLP share the same mask? If so, would the sparsity of each layer be the same? However, according to Figure 1, the sparsity should differ across layers. Also, in Line 146, the authors "set the same inheriting ratios for the masks in all building blocks". Taken together, these claims and figures are very confusing. Additionally, what does "align the internal computations" mean?
(iv) Line 144: are Eq. (2) and (3) jointly optimized (e.g., by taking the sum) or separately? Are the masks of different layers jointly optimized or separately?
- Section 3.3.1, Line 165: γ should be rigorously defined.
Limitations
- Section 5: The limitations provided by the authors are too trivial. It is obvious that larger models require more time for pruning/searching. Please add more in-depth discussions on limitations. Authors can refer to the points listed in the "Weaknesses" section for more substantial issues that need addressing.
We thank the reviewer for the suggestions.
Weakness 1. Potentially biased evaluation
To mitigate the potential evaluation-bias problem, we do not use data from specific downstream tasks (QA datasets such as MMLU or ARC). Instead, for candidate evaluation and reformation, we use the general WikiText2 dataset, which consists of verified and featured Wikipedia articles. This keeps our experiments truly zero-shot, since no task-specific downstream data is seen during the search. Because the model never sees any downstream data, the evaluation is less biased towards particular downstream tasks. This setting is consistent with previous works such as FLAP and SliceGPT.
To ensure a fair comparison, we use the same calibration dataset (i.e., WikiText2) for all methods, including the baselines. We believe our evaluation is fair and comprehensive: our method achieves better performance on both language modeling and QA datasets. Furthermore, FLAP investigates the effect of different calibration datasets (see Appendix C.3 and Table 7 of FLAP) and observes that using C4 or WikiText2 as the calibration data results in a fluctuation of only about ±1% in average zero-shot accuracy, which is not significant.
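For concreteness, below is a minimal sketch of perplexity-based candidate evaluation on a small calibration set, as described above. The function and variable names (candidate_perplexity, calib_batches) are illustrative assumptions rather than our actual code, and it assumes a HuggingFace-style causal LM that returns a token-averaged loss.

```python
import torch

@torch.no_grad()
def candidate_perplexity(model, calib_batches):
    """Score a candidate sub-network by its perplexity on calibration batches."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids in calib_batches:               # e.g., 128 WikiText2 segments
        out = model(input_ids, labels=input_ids)  # HF causal LMs return a mean loss
        n_tokens = input_ids.numel()
        total_nll += out.loss.item() * n_tokens
        total_tokens += n_tokens
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```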
Weakness 2. MMLU performance
To make a fair comparison, we follow the setting in existing works (such as FLAP and SliceGPT) and use data from WikiText2 for candidate evaluation. Besides, computing the perplexity of a few samples is far more efficient than testing accuracy on a whole downstream test set.
Following the suggestion, we report the performance on the MMLU dataset for our method and the baseline FLAP in Table 9 of the global rebuttal. Our method outperforms the baseline in terms of accuracy on MMLU.
As mentioned in our response to Weakness 1, we use the general WikiText2 dataset for candidate evaluation, which consists of generic Wikipedia text and keeps our experiments truly zero-shot, since no task-specific downstream data is seen during the search. Thus, the searched model is less biased towards particular downstream tasks.
Weakness 3. QA performance in Table 5
Following the suggestion, we further provide results on the common-sense reasoning datasets for the 50% inheriting ratio in Table 8 of the global rebuttal. As observed, our method achieves lower perplexity and higher downstream accuracy.
Weakness 4. Speed evaluation
Thanks for the suggestion. For the generation speed test, we feed prompts of 64 tokens into the model and use float16 with the KV cache enabled; flash attention is not enabled. Our results in Figure 7 demonstrate that our smaller models deliver practical inference acceleration. We agree that inference speed can differ under different evaluation settings. Following the suggestion, we also report the computation cost in GMACs in Table 10 below, and a sketch of the speed measurement setup is given after the table.
Table 10
| Inheriting Ratio | 100% | 90% | 80% | 70% | 60% | 50% |
|---|---|---|---|---|---|---|
| Computation Cost (GMACs) | 424.02 | 377.89 | 333.49 | 293.29 | 249.28 | 211.88 |
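As a reference for the speed measurement described above (64-token prompts, float16, KV cache on, no flash attention), a minimal sketch of a tokens-per-second benchmark might look as follows; model, input_ids, and the chosen generation length are assumptions, and it assumes the model already resides on a CUDA device in float16.

```python
import time
import torch

@torch.no_grad()
def tokens_per_second(model, input_ids, new_tokens=128):
    """Rough decoding-throughput measurement with the KV cache enabled."""
    model.eval()
    torch.cuda.synchronize()                      # assumes a CUDA device
    start = time.time()
    model.generate(input_ids, max_new_tokens=new_tokens,
                   use_cache=True, do_sample=False)
    torch.cuda.synchronize()
    return new_tokens / (time.time() - start)
```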
Question 1. Writing issues
Yes, the reviewer’s understanding is correct, and "inheriting ratio" = 1 - "sparsity". We provide an explanation of the inheriting ratio in Lines 116-117.
Question 2. Annotation
Thanks, we will add the annotation.
Question 3. Clarification
M denotes the number of heads in the attention module, and P denotes the number of channels (intermediate size) in the MLP. The masks are binary, so {0,1} should be used rather than R in Line 140. We will make this clearer in the revision.
3.a. Mask sharing
The mask is shared among the query, key, value, and output projections within a single self-attention module, and among the up, gate, and down projections within a single MLP module, rather than being shared across different blocks. We discuss mask sharing in detail in the global rebuttal above.
3.b. Align internal computations
As discussed in Lines 141-142, the multiple layers inside each module (self-attention or MLP) have to use the same mask to keep the internal computations aligned. For example, if the query and key had different masks and thus different, unaligned dimensions, they could not be multiplied and the attention operation could not be performed correctly. In Appendix A, we describe the mechanism for the self-attention and MLP modules, highlighting that layers inside the same module need to share the same mask to ensure valid computations, while layers in different modules or blocks do not. We apologize for the confusion and will make this clearer in the revision.
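To illustrate the alignment argument, here is a minimal sketch of head-level mask sharing within one self-attention module, assuming standard (out_features, in_features) linear weights; it is an illustration, not our exact implementation.

```python
import torch

def prune_attention_heads(q_w, k_w, v_w, o_w, head_mask, head_dim):
    """Apply one shared boolean head mask to all four projections of a module."""
    keep = head_mask.repeat_interleave(head_dim)   # expand head mask to channels
    # Query/key/value lose the same output rows, and the output projection loses
    # the matching input columns, so Q·K^T and the attention output stay aligned.
    return q_w[keep], k_w[keep], v_w[keep], o_w[:, keep]
```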
3.c. Joint optimization of masks
They are jointly optimized to determine the initial masks, as discussed in Lines 147-149. Later, during the search, all masks are also jointly optimized to satisfy the parameter-count constraint.
Question 4. Gamma
For the attention module, γ denotes the inheriting ratio of each head; since each attention module can have multiple heads and the ratio of each head can differ, the inheriting ratio of the attention module is a collection over its heads. For the MLP module, γ denotes the inheriting ratio of the whole module; there is no notion of multiple heads, so a single ratio is used for the entire MLP.
Limitation: Search cost
As discussed in the global rebuttal, the search cost can be reduced to 20 epochs (within 2 hours), which is similar to that of the baselines.
Dear authors,
Thank you for your response. Before we launch further discussions, I would like to ask an important question: did you submit any global response, or a one-page PDF containing tables/figures? It is not visible to me at the moment. If yes, please contact the AC or PC to fix the system error.
To be clear, this comment will not affect my ratings. I just want to make sure that I don't miss any further experiment results or responses from you.
Dear reviewer,
We sincerely appreciate you for spending time reviewing and providing feedback on our paper.
We have submitted one global rebuttal, which includes important explanations and additional results, following the instructions for a single global rebuttal per paper. However, we find that the global rebuttal is only visible to program chairs and authors; the reviewers are not able to read it. We have just raised the problem to the committee and asked whether they can fix this issue.
Thanks,
Authors
This paper introduces an architecture search method based on mask mutation and candidate evaluation to find a subnet with better performance in an LLM. An evolution-based algorithm is applied to globally search for the subnet, with a special initialization derived from evaluating parameter importance. After the search, a reformation algorithm is proposed to rectify the weights of the inherited LLM. Experimental results show that the compressed LLM achieves better results than the baselines.
Strengths
- A novel method for searching for a subnet in an LLM. Subnet search in LLMs is not well explored, and it would benefit the community.
- The overall approach is reasonable and sufficient. The entire method’s pipeline is very clear, and the design for each component is reasonable and complete.
- The experimental results are comprehensive and show its benefits compared to the baselines (SliceGPT and LLM-Pruner). A large and steady improvement is observed.
- The paper is well-organized and easy to follow.
Weaknesses
- My main concern is the extra cost of searching for the subnetworks. Since it still takes around 5 hours to search for the architecture, it introduces extra cost for pruning the model. What if we instead spent these 5 hours post-training the compressed model? Would the results be better or worse than those of the subnetwork searched from the LLM?
- Since training-free structured pruning is not a very practical setting, post-training the LLM with an acceptable amount of resources is more realistic. Thus, compared to just pruning and searching the model, a more realistic setting is to also train the pruned model, and I am not sure whether the searched sub-LLM would be better than the pruned one. Some previous works on pruning CNNs suggest that, with sufficient training, magnitude pruning can be a better choice than any manually designed metric. Thus, it would be better to examine the performance of the compressed LLM with post-training.
Questions
- For mask sharing: why do different layers in the same module share the same mask (Line 141)? From Lines 147-149, my understanding is that the mask is not the same for blocks in different layers. Using different masks for different layers seems to still give aligned internal computations. I am not sure if I understand this part correctly.
- What about the results if you post-train the compressed models? What are the results if compared with LLM-Pruner?
Limitations
N/A
We thank the reviewer for the suggestions.
Weakness 1. Search cost
As discussed in the global rebuttal, we can reduce the number of search epochs from 50 to 20 for LLaMA-7B, which costs 2 hours and still achieves better performance than other methods. For the LLaMA-65B model, we adopt only 20 epochs for the search, which costs 5 hours. Moreover, our search cost is similar to that of the baselines.
Weakness 2. Post training performance
To improve the performance, we adopt an efficient reformation method (Section 3.4) that does not require expensive training or finetuning. As shown in Figure 6, the reformation significantly improves the perplexity within a few minutes. We highlight that, without any training or finetuning, our method already outperforms baselines that rely on recovery training, such as LLM-Pruner and SliceGPT, as shown in our experimental results and in Table 8 of the global rebuttal.
To further evaluate performance with continual training, we train our searched compact models with LoRA, following the same recovery-training setting as LLM-Pruner, which can be finished within another 3 hours for LLaMA-7B. The results are shown in Table 8 of the global rebuttal. With recovery training, our method further improves perplexity and accuracy. For sparser models (50% inheriting ratio), which suffer a larger accuracy loss, recovery training brings larger improvements.
Recovery training can further improve accuracy but requires more data and computation. In contrast, our reformation efficiently and effectively achieves better performance than the baselines with little data and computation. We agree that investigating pruned-model performance under sufficient training is an attractive topic, but for LLMs it is still an open question how to define sufficient training in terms of model architecture, data quality, data amount, GPUs, and so on. We leave this as future work.
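For reference, here is a minimal sketch of the LoRA recovery training mentioned above, using the HuggingFace peft library; the hyper-parameters and target module names are illustrative assumptions rather than the exact LLM-Pruner setting.

```python
from peft import LoraConfig, get_peft_model

def attach_lora(searched_model):
    """Wrap the searched compact model with LoRA adapters for recovery training."""
    lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"],  # assumed modules
                          task_type="CAUSAL_LM")
    model = get_peft_model(searched_model, lora_cfg)
    model.print_trainable_parameters()
    return model   # then run standard causal-LM fine-tuning on the recovery data
```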
Question 1. Mask sharing
The weights in the same module of each block share the same mask, but different blocks do not share masks; each block has its own masks. The reviewer’s understanding is correct. We discuss mask sharing in detail in the global rebuttal. Line 141 shows that there are two masks per block (2N masks for N blocks in total), as the masks of different blocks are not shared.
Question 2. Fine-tuned compressed model compared to LLM-Pruner
We show the results compared to the fine-tuned LLM-Pruner in Table 8 provided in the global rebuttal. Results show that our method can perform better than LLM-Pruner even without finetuning.
Thanks for the detailed response. It solves my concern and I prefer to keep my score unchanged.
The paper introduces a new model architecture search method for LLMs that does not require additional training. The proposed architectures have structural sparsity and reach better performance than SOTA pruning baselines.
Strengths
- The proposed method does not require additional training.
- The output model has structural sparsity, which is useful for hardware.
- The paper is clear and well-written.
Weaknesses
- Why isn't there a comparison to SparseGPT?
Questions
Why isn't there a comparison to SparseGPT?
Limitations
The authors discuss limitations.
We thank the reviewer for the suggestions.
Weakness & Question: Comparison with SparseGPT
The pruning/search granularities of our method and SparseGPT differ, so they should not be directly compared. Our method searches for smaller compact models, which is akin to structured pruning that removes entire rows and columns of weight matrices. In contrast, SparseGPT is an unstructured pruning method that removes individual weights without strong constraints (e.g., the removed weights need not lie in the same rows or columns). Therefore, in the experiments we compare with other structured pruning methods such as SliceGPT and FLAP, and do not compare with SparseGPT. Furthermore, the baseline SliceGPT already provides comparisons with SparseGPT (Table 1 of SliceGPT). As mentioned above, it is hard to make a fair comparison between structured and unstructured pruning: in Table 1 of SliceGPT, the sparsity of SparseGPT is 50%, while the sparsity of SliceGPT is 25% or 30%. Their results can serve as a reference, and our method outperforms SliceGPT by a clear margin.
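To make the granularity difference concrete, here is a small illustrative sketch (not taken from either method) contrasting structured column removal with unstructured weight masking.

```python
import torch

W = torch.randn(8, 8)                              # an arbitrary weight matrix

# Structured (our setting, SliceGPT, FLAP): drop whole columns, giving a smaller
# dense matrix that runs faster with ordinary dense kernels.
keep_cols = W.abs().sum(dim=0).argsort(descending=True)[:4]
W_structured = W[:, keep_cols]                     # shape (8, 4)

# Unstructured (SparseGPT-style): zero individual weights; the shape is unchanged,
# so acceleration needs sparse kernels or N:M hardware support.
threshold = W.abs().flatten().kthvalue(W.numel() // 2).values
W_unstructured = W * (W.abs() > threshold)
```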
Thank you for the response. I maintain my score for acceptance.
This paper proposes a technique for searching for efficient LLM subnets that enable fast inference while attaining strong performance. The proposed technique involves a training-free pruning stage based on a genetic algorithm, followed by a "weight rectification" stage that improves the resulting subnet. Empirically, the proposed method outperforms existing LLM pruning baselines.
Strengths
- Empirically, the proposed pruning technique appears to outperform a handful of existing pruning baselines, as shown in Figure 1 and Table 2.
- The proposed technique appears to be general enough to work for a wide variety of existing LLMs, which appears to be a shortcoming of existing LLM pruning techniques.
- The proposed method is fairly simple in that it combines LLM pruning with a genetic algorithm for search.
Weaknesses
- From the abstract, it is unclear what the actual proposed method does beyond standard neural network pruning techniques.
- The term "reformation algorithm" is used throughout the paper, and it is unclear what this means until much later in the text. The intro and abstract should clarify this.
- The main evaluation is done in terms of sparsity levels, which makes sense for this particular type of search space; however, the authors should also empirically validate the efficiency of the search process compared to existing pruning techniques. E.g., how do these pruning techniques compare, subject to the same budget? Moreover, it would be interesting to see how the proposed method performs as a function of the search budget.
- The authors should compare to other techniques beyond pruning, such as quantization, which may achieve similar inference speed and memory improvements with better overall performance. In general, it seems that LLM pruning techniques take a substantial performance hit even at fairly modest sparsity levels (I'm mainly referring to Table 7).
Questions
- The authors mention in the limitations that the search cost scales with the model size. Can the authors characterize this a bit more? Do other pruning methods suffer from the same problem, and can the authors comment on what this scaling curve actually looks like?
- Why was the search budget set to 5 hours? Have the authors explored this choice empirically, and does this apply to all of the models and data settings that were evaluated?
Limitations
The authors adequately address limitations in the conclusion.
We thank the reviewer for the suggestions.
Weakness 1. Difference from network pruning
Our proposed search method differs significantly from standard neural network pruning techniques. Specifically, it includes a search initialization procedure to identify an effective initial architecture and an evolutionary architecture search with dedicated mask mutation and efficient candidate evaluation, which are not typical of standard weight pruning. To our knowledge, we are the first to apply architecture search to find efficient LLM architectures; since there are no other search baselines, we mainly compare with pruning works. Although our method differs significantly from traditional pruning, both our search and pruning lead to efficient, lightweight LLMs.
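As a rough illustration of the kind of evolutionary loop described above (mask mutation plus perplexity-based candidate evaluation under a parameter budget), consider the following sketch; the function names fitness and within_budget are assumptions, and the details differ from the paper's algorithm.

```python
import copy
import random

def evolutionary_search(init_masks, fitness, within_budget,
                        epochs=20, pop_size=16, mut_prob=0.02):
    """init_masks: per-module lists of 0/1 mask bits from the salient-weight init.
    fitness: lower is better (e.g., calibration perplexity)."""
    population = [init_masks]
    for _ in range(epochs):
        children = []
        for _ in range(pop_size):
            child = copy.deepcopy(random.choice(population))
            for mask in child:                      # flip a few mask bits
                for i in range(len(mask)):
                    if random.random() < mut_prob:
                        mask[i] = 1 - mask[i]
            if within_budget(child):                # parameter-count constraint
                children.append(child)
        population = sorted(population + children, key=fitness)[:pop_size]
    return population[0]                            # best candidate found
```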
Weakness 2. Reformation
Thanks for the suggestion; we will explain the reformation in more detail in the abstract and introduction.
Weakness 3. Search cost
We highlight that our search cost is similar to that of the baselines. Details can be found in the global rebuttal above.
3.a. Performance as a function of search cost.
Figure 5 demonstrates how the proposed method performs as a function of the search budget (number of epochs and thus search hours). As discussed in the global rebuttal, we can reduce the search cost without significantly sacrificing performance. Note that with 20 epochs, for LLaMA-7B, we still achieve 6.23 and 7.21 perplexity on WikiText2 under 90% and 80% inheriting ratios, respectively, which is better than all baselines in Table 2.
3.b. Lower cost for memory and data in the search
Besides the GPU time needed to finish the search, our method achieves better performance with less memory and data. (a) Compared to LLM-Pruner, which identifies weight importance using backward passes, our method relies solely on inference, significantly reducing GPU memory consumption. (b) Compared to SliceGPT, which requires 1024 calibration samples, our method needs significantly fewer samples (128) for the weight optimization after the search, while achieving better performance on various datasets.
Weakness 4. Quantization
Quantization methods are not direct competitors but rather complementary to our search approach: existing quantization methods can be applied to the subnets identified by our method, further enhancing acceleration through integer matrix multiplications. In practice, quantization requires specialized computation kernels to handle integer matrix multiplications, which can introduce additional overhead and implementation complexity, particularly during inference, and can pose challenges when deploying quantized models on diverse hardware platforms that may not support them. Our approach, on the other hand, maintains the original computation framework of LLMs without additional operations or specialized implementations.
Regarding the performance hit, we would like to clarify that the results in Table 7 were generated with a sequence length of 128, rather than 2048, for the sequence-length ablation study. Given that the dense LLaMA-7B model achieves a perplexity of 12.62 at sequence length 128, our method with a 90% ratio reaches a perplexity of 13.40, a relatively small gap. With a larger sequence length such as 2048, all methods improve, and we still achieve the best performance, as shown in Tables 2 and 3.
Question 1. Search cost and model size
We discuss the search cost in detail in the global rebuttal. For LLaMA-7B, we still achieve better performance than other methods when searching for only 20 epochs within 2 hours. For LLaMA-65B, we adopt only 20 epochs to obtain the subnets, which takes 5 hours.
Question 2. Search cost with 5 hours
As mentioned in the global rebuttal, for LLaMA-7B on one A100 GPU, it takes 5 hours to run 50 search epochs. The search already converges at 20 epochs, as shown in Figure 5, which takes only 2 hours and already achieves better performance than other methods. Besides, for large models such as LLaMA-65B, we adopt only 20 epochs for a faster search, which takes just 5 hours.
We thank the reviewers for acknowledging that our work overcomes prior research limitations and benefits the community (Reviewer Xurt, BSTA), our method is novel, general, and high-performing (Reviewer n31G, Qp2U, Xurt, BSTA), our experiments are comprehensive (Reviewer Xurt), and our paper is well-written (Reviewer Qp2U, Xurt).
1. Search Cost
We can reduce our search cost while maintaining similar performance. The search usually takes only 20 epochs to converge, and epochs beyond 20 yield marginal improvements (see Figure 5). For LLaMA-7B, we run 50 epochs (5 hours) for slightly better results. When searching for 20 epochs, the perplexities for LLaMA-7B with 10% and 20% sparsity on WikiText2 are 6.23 and 7.21, still better than all baselines in Table 2. Besides, for large models such as LLaMA-65B, we adopt only 20 epochs for a faster search, which takes 5 hours.
We also highlight that our search cost is similar to that of the baselines.
For LLM-Pruner, it is difficult to scale to larger models such as LLaMA-30B and LLaMA-65B, as it requires backward propagation to compute gradients for identifying weight importance, incurring much larger memory and computation costs. It only reports results for 7B models (such as LLaMA-7B and Vicuna-7B) and LLaMA-13B; the 7B models require 3 hours to finish the compression.
For SliceGPT, it requires 5 hours to finish pruning and fine-tuning LLaMA-2 70B, which is comparable to our 5-hour search for LLaMA-65B. SliceGPT relies on PCA (Principal Component Analysis) to identify and remove less significant weights, computing the eigenvectors and eigenvalues of the covariance matrix of the signal matrix. For large models, this step is computationally demanding and time-consuming, because it requires processing a substantial amount of data to compute these matrices accurately. Moreover, that work notes that double precision is necessary for the eigenvector calculations in PCA to avoid numerical errors that can degrade performance, which further increases the computational load and pruning time. It is also data-hungry, requiring many more calibration samples (1024), while our method uses only 128.
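For context, the PCA step referred to above amounts to an eigendecomposition of the activation covariance in double precision; a minimal sketch (an illustration, not SliceGPT's code) is:

```python
import torch

def pca_directions(activations):
    """activations: (num_tokens, hidden_dim) calibration signal matrix."""
    X = activations.double()                       # float64 to limit numerical error
    cov = X.T @ X / X.shape[0]                     # covariance of the signal matrix
    eigvals, eigvecs = torch.linalg.eigh(cov)      # eigenvalues in ascending order
    return eigvecs.flip(-1)                        # principal directions first
```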
2. Mask Sharing
As discussed in Lines 124-128, the model has multiple blocks (i.e., layers of the LLaMA model). Each block has two modules: the self-attention module and the MLP module. Each module contains multiple weights, such as the query, key, value, and output projections in the self-attention module and the up, gate, and down projections in the MLP module. Mask sharing means that different weights in the same module share the same mask; however, modules of different blocks do not share masks. Each block has two masks, corresponding to its self-attention module and MLP module, respectively, and different blocks have their own masks, which are not necessarily the same. As Reviewer BSTA mentioned, Figure 2 (middle) shows that different blocks can have different masks/sparsity.
Initially, different masks have the same inheriting ratio, but the masks themselves can still differ even though their sparsities are the same. Once the search begins, the inheriting ratios of different masks can diverge, although they start from the same value.
3. Additional Results
We further provide results with lower inheriting ratios and fine-tuning in Table 8 below.
Table 8
| Method | Inheriting Ratio | Wiki PPL↓ | PTB PPL↓ | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Average Acc.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM-Pruner | 80% | 11.97 | 55.68 | 59.39 | 75.57 | 65.34 | 61.33 | 59.18 | 37.12 | 39.80 | 56.82 |
| LLM-Pruner (LoRA) | 80% | 8.41 | 41.78 | 67.78 | 76.36 | 68.14 | 64.32 | 63.41 | 36.87 | 40.10 | 59.57 |
| Ours | 80% | 6.89 | 36.06 | 70.98 | 74.92 | 67.29 | 64.64 | 64.23 | 36.52 | 39.40 | 59.71 |
| Ours (LoRA) | 80% | 6.57 | 34.32 | 71.19 | 75.01 | 68.43 | 64.12 | 63.88 | 35.44 | 40.03 | 59.73 |
| LLM-Pruner | 50% | 126.02 | 460.71 | 51.97 | 60.23 | 35.89 | 49.07 | 32.76 | 25.67 | 34.91 | 41.50 |
| LLM-Pruner (LoRA) | 50% | 51.56 | 198.76 | 60.98 | 69.12 | 47.83 | 55.78 | 46.92 | 28.56 | 35.64 | 49.26 |
| Ours | 50% | 15.48 | 117.06 | 61.27 | 69.13 | 51.18 | 57.78 | 52.16 | 29.34 | 35.45 | 50.79 |
| Ours (LoRA) | 50% | 11.34 | 83.48 | 63.49 | 73.03 | 56.93 | 59.21 | 48.87 | 36.12 | 36.23 | 53.41 |
We also provide the results of our method on the MMLU dataset compared to FLAP in Table 9 below.
Table 9
| Inheriting Ratios | LLaMA-7B FLAP | LLaMA-7B Ours | LLaMA-13B FLAP | LLaMA-13B Ours | LLaMA-30B FLAP | LLaMA-30B Ours |
|---|---|---|---|---|---|---|
| 100% | 34.9 | 34.9 | 46.9 | 46.9 | 58.2 | 58.2 |
| 90% | 31.3 | 31.6 | 40.3 | 40.3 | 52.8 | 52.9 |
| 80% | 29.1 | 29.2 | 35.7 | 36.2 | 47.3 | 47.6 |
| 70% | 26.8 | 27.3 | 33.4 | 33.7 | 41.5 | 42.4 |
| 60% | 25.9 | 26.3 | 28.4 | 29.5 | 36.1 | 37.4 |
| 50% | 24.2 | 25.1 | 27.9 | 29.1 | 34.5 | 35.1 |
Hi ACs,
We have submitted one global rebuttal, which includes important explanations and additional results, following the instructions for a single global rebuttal per paper. However, we find that the global rebuttal is only visible to program chairs and authors; the reviewers are not able to read it, although they can read the specific rebuttals corresponding to their reviews. This appears to be a technical issue with the visibility of the global rebuttal. Can you help us address this issue?
Thanks,
Authors
This paper presents a neural architecture search method for pruning Large Language Models (LLMs), i.e., finding optimal subnets. The core idea is to initialize from the salient weights, search with an evolutionary algorithm, and reconstruct the weights with calibration samples.
Overall, the reviewers find the proposed method reasonable and simple, and the empirical gains are interesting as well. While there have been concerns about the baseline comparisons and other experimental details, the authors have provided a rebuttal that addresses many of them. After reading the rebuttal, this feels like a borderline paper, but I would highly encourage the authors to include the additional results in the revision.