PaperHub
Average rating: 4.0 / 10 (withdrawn; 4 reviewers; min 3, max 5, std. dev. 1.0)
Individual ratings: 3, 5, 3, 5
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.3 · Presentation: 1.0
ICLR 2025

UNAST: Unified framework for Neural Architecture Search for Transformers

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2024-11-25

Abstract

Keywords
Efficiency · Neural Architecture Search · Large Language Models

Reviews and Discussion

Review 1 (Rating: 3)

The paper introduces UNAST, a novel unified framework designed for post-training optimization of Large Language Models (LLMs). It integrates Neural Architecture Search (NAS) with sparsity and quantization techniques aimed at compressing LLMs. Starting from a pre-trained model, UNAST replaces inefficient layers such as attention and Multi-Layer Perceptron (MLP) with more efficient alternatives, adjusting parameters as well. The replacement layers undergo a process known as local distillation to replicate the performance of the original layers. The authors report that UNAST can reduce training costs significantly—up to 10 times compared to training smaller models from scratch—and improve latency and memory efficiency by up to 60% while maintaining high accuracy.
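
To make the local distillation step concrete, the following is a minimal sketch of what training a replacement layer to mimic the original one could look like; the function names, the MSE objective, and the single-tensor layer interface are illustrative assumptions, not the authors' actual implementation:

```python
import torch
import torch.nn as nn

def local_distill(teacher_layer: nn.Module,
                  student_layer: nn.Module,
                  calib_inputs: list,
                  steps: int = 100,
                  lr: float = 1e-4) -> float:
    """Train a candidate (student) layer to imitate the original (teacher)
    layer on calibration activations and return the final imitation error."""
    optimizer = torch.optim.Adam(student_layer.parameters(), lr=lr)
    teacher_layer.eval()
    for _ in range(steps):
        for x in calib_inputs:
            with torch.no_grad():
                target = teacher_layer(x)
            loss = nn.functional.mse_loss(student_layer(x), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    with torch.no_grad():
        errors = [nn.functional.mse_loss(student_layer(x), teacher_layer(x)).item()
                  for x in calib_inputs]
    return sum(errors) / len(errors)

def score_candidates(original_layer, candidate_layers, calib_inputs):
    """Distill every candidate for one layer slot; return (error, candidate)
    pairs sorted with the best (lowest-error) candidate first."""
    scored = [(local_distill(original_layer, cand, calib_inputs), cand)
              for cand in candidate_layers]
    return sorted(scored, key=lambda pair: pair[0])
```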

Strengths

  1. By integrating NAS with sparsity and quantization, UNAST provides a holistic approach to compressing LLMs within a single framework.
  2. The method can lead to significant reductions in training costs and improvements in runtime efficiency and memory usage. UNAST also allows for the creation of non-uniform models, providing insights into how different compression techniques affect Transformer layers.

Weaknesses

  1. The process involves several stages including layer-wise distillation and architecture search, which could make the methodology complex to implement.
  2. The overall novelty of the idea is limited: it mainly combines existing compression techniques and applies them to a few language models. This would work as a technical report, but the contribution is insufficient for a top-tier conference.
  3. The organization of the paper is somewhat messy; for example, page 7 has a large blank space.

Questions

  1. What are the computing-resource and energy-consumption implications of deploying models optimized with UNAST, compared to traditional methods?
  2. How does the order of the different compression techniques affect the final results? For example, applying NAS → sparsification → quantization, sparsification → NAS → quantization, or performing these techniques simultaneously.
Review 2 (Rating: 5)

This paper introduces UNAST as a post-training optimization approach for LLMs. By applying NAS, UNAST reduces training costs by up to 10x compared to training a smaller model from scratch. Evaluations on LLaMA and GPT models demonstrate that UNAST improves latency and memory usage with minimal impact on accuracy.

Strengths

  1. This paper addresses the important challenge of efficient LLM deployment.
  2. It takes a novel approach by applying NAS to LLMs, moving beyond traditional methods like quantization and sparsity.
  3. The proposed method demonstrates improvements over baselines across various experimental settings.

Weaknesses

  1. The presentation could be enhanced for clarity. Are you suggesting that post-training optimization is more efficient than training an LLM from scratch for deployment? However, if resources are available to train a large LLM, the cost of training a smaller model may not be prohibitive.

  2. The authors should emphasize key results in both tables and text. For instance, highlighting the best-performing method in bold in Table 4 would be helpful, and it would be beneficial to specify the improvements in accuracy and latency achieved within the paragraphs.

Questions

  1. Could you clarify the source of GPT-3 in your experiments? To my knowledge, OpenAI has not released a GPT-3 checkpoint publicly.
Review 3 (Rating: 3)

In this paper, the authors propose a neural architecture search method, UNAST, to compress LLMs post-training based on the desired model size and latency.

Specifically, UNAST:

  1. pre-defines the search space over the attention and MLP blocks, e.g., the number of heads and the KV dimension of the attention block, the projection dimension of the MLP block, and so on;
  2. applies layer-wise knowledge distillation from the original layer (teacher layer) to the modified layer (student layer), and obtains scores for the various candidates;
  3. selects the highest-ranked candidates based on the cumulative accuracy metric, and finetunes the modified models with a limited number of tokens (a selection sketch follows below).
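
A minimal sketch of how such score-based, budget-constrained selection might look; the Candidate fields, the greedy per-slot heuristic, and all names are illustrative assumptions rather than the paper's actual cumulative-accuracy search:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    layer_idx: int      # which attention/MLP slot this candidate replaces
    name: str           # e.g. "att_nh_16" or "mlp_hs_0.5"
    score: float        # local-distillation quality (higher is better)
    latency_ms: float   # measured or estimated latency of the candidate

def select_architecture(candidates_per_slot, latency_budget_ms):
    """Pick one candidate per layer slot, greedily preferring the
    highest-scoring option that still fits the remaining latency budget."""
    chosen, spent = [], 0.0
    for options in candidates_per_slot:
        for cand in sorted(options, key=lambda c: c.score, reverse=True):
            if spent + cand.latency_ms <= latency_budget_ms:
                chosen.append(cand)
                spent += cand.latency_ms
                break
        else:
            raise ValueError(f"no candidate fits the budget for slot {options[0].layer_idx}")
    return chosen
```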

Extensive experiments show some instructive findings:

  1. Modifying the attention block is easier than modifying the MLP block;
  2. Reducing the number of heads is easier than reducing the KV dimensions;
  3. Pruning the first half of the layers is easier than pruning the second half.

The final pruned LLMs are smaller and faster, and also outperform or are comparable to various baselines.

Strengths

  1. The experiments are extensive, with most claims well supported by experiments.
  2. The final LLM shows the effectiveness of UNAST, with a smaller model size and lower latency, while remaining comparable to most baselines.

Weaknesses

  1. Poor presentation: The paper is not easy to follow because of the following problems:

(1) The citation format: e.g., in L265 the citation reads "the zero-shot lm-harness benchmark Gao et al. (2023) accuracies". For such citations you should use \citep{} instead of \citet{}. I have observed quite a few such mistakes.

(2) The legends in the figures are not well explained. Taking Figure 4 as an example, I can guess that in "att_nh_32_kv_96" the 32 and 96 refer to the number of attention heads and the KV dimension. But for "mlp_hs_1_swiglu_1", I don't really understand the meaning of 1; I suppose it is the dimension factor in the MLP. A well-explained notation is necessary for these.

(3) The experimental section is too messy; I suggest a better structure for the results. More space and explanation should be devoted to the important results or findings.

(4) Better to highlight some results: in Table 4, it would be better to highlight the best results; otherwise it is not easy to observe the improvement of UNAST.

  2. Lack of novelty: Please correct me if I am wrong. UNAST applies NAS to pre-trained LLMs. Compared to previous works that apply NAS to find an optimal architecture and train it from scratch, UNAST tries to find a compact LLM based on a pre-trained LLM. I think such novelty is not enough for ICLR.

Questions

Please see some questions in weakness.

Review 4 (Rating: 5)

In this paper, the authors propose a neural architecture search (NAS) approach for large language models with affordable search costs under latency and parameter-count constraints. In the methodology, the authors first adopt layer-wise knowledge distillation, then conduct architecture search and selection, and finally fine-tune the best model. The authors then conduct experiments to show the benefits of the proposed UNAST and present their findings on efficient model design.

Strengths

Efficient language model design is an important research topic in the community. In this paper, the authors make the NAS algorithm affordable in the domain of large language models. Meanwhile, through a series of experiments, the authors provide several interesting findings, which could guide efficient language model design in the future.

Weaknesses

  1. The paper writing seems a little rushed, which reduces readability. There are several notations, terms, and even conclusions that are not explained or well interpreted. Please see the questions.

  2. How do the authors ensure that the comparisons in Table 4 are fair, considering there is no budget description for the other approaches?

  3. The title claims a larger scope (Transformers) while the paper mainly focuses on language models.

  4. Findings 2 and 3 seem trivial to me: with fewer parameters, layer skipping is naturally worse than a linear layer, and also worse than smaller alternatives. I think a performance/parameter trade-off visualization is necessary.

Questions

I have many questions regarding the methodology and experimental results.

  1. Line 147: what is the meaning of $j$ in $s_{ij}$? Why is there an $i$ in $X_i^D$?

  2. Line 159: what is the meaning of $2N$? Does $N$ mean the number of transformer blocks?

  3. Line 160: what is the meaning of $S^M$ and $S^A$? There should be $\{0,1\}^{S^M}$.

  4. In Equation 2, there should be a loss function for $(T(x), y)$.

  5. Line 181: "if" $I$-th layer.

  6. In Equation 3, what is the meaning of $\mathbf{a}_i$? Does it mean the accuracy?

  7. In Line 264, what is the meaning of the LM score? What kind of evaluation score is it?

  8. In Line 274, why are only 15 model candidates considered? From my understanding, in the local distillation each layer could be substituted with $C$ candidates. If there are $2N$ layers, the possible candidates should be $2CN$. Please clarify how 15 is obtained (see the counting sketch after this list).

  9. In Figure 3, what is the meaning of ns? Does it mean the intermediate size of FFN?

  10. In Line 377, how can one conclude that "The full pool of candidates shows the best overall result", considering that the "full pool" is suboptimal compared to other designs on LM-eval? It only has an advantage in Wiki perplexity.

  11. In Table 4, please use up/down arrows to show the performance gains or losses.
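
For reference on question 8, a back-of-the-envelope formalization of the counting under the reviewer's reading, assuming $N$ attention plus $N$ MLP slots and $C$ candidates per slot (these symbols follow the question above, not necessarily the paper's notation):

```latex
\[
\underbrace{2N \cdot C}_{\text{layer-level candidates to locally distill}}
\qquad \text{vs.} \qquad
\underbrace{C^{\,2N}}_{\text{possible full architectures after selection}}
\]
```

Under this reading, a total of 15 candidates would imply a much smaller pool than $2NC$, which is presumably what the question asks the authors to clarify.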

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.