Understanding Layer Significance in LLM Alignment
We propose an algorithm to identify which layers within LLMs are most critical to the alignment process.
Abstract
Reviews and Discussion
The paper investigates layer significance in LLM alignment, proposing ILA to identify critical layers via binary masks. Key findings include consistent important layers across datasets (90% overlap), improved performance from freezing non-essential layers, and efficiency gains from selective tuning (10-30% of layers).
Reasons to Accept
[1] The paper introduces a systematic approach (ILA) to analyze layer dynamics in alignment, offering insights into efficiency and model behavior; [2] It demonstrates freezing/tuning strategies reduce compute costs while maintaining performance, addressing a critical challenge in LLM deployment.
Reasons to Reject
[1] The evaluation might have limitations: it relies heavily on GPT-4o for conversational metrics. Human evaluations or adversarial testing could have strengthened validity. [2] Some claims need to be verified; especially, claims about broader applicability to reasoning lack thorough analysis. [3] Details on hyperparameter tuning (e.g., LoRA ranks, learning rates) are deferred to the appendix and may not be reproducible.
Questions for Authors
How does the binary mask balance sparsity and performance during gradient descent? Could stochasticity in training affect layer importance rankings?
The experiments use Alpaca-GPT4, LIMA, and No Robots. How would ILA perform on alignment datasets with conflicting stylistic objectives (e.g., formal vs. casual responses)?
We sincerely appreciate your positive and thoughtful feedback! Below, we address your concerns (W1 & W3 & Q2) and clarify any potential misunderstandings (W2 & Q1).
W1: The evaluation might have limitations.
It relies heavily on GPT-4o for conversational metrics. Human evaluations or adversarial testing could have strengthened validity.
Thank you for highlighting this limitation. We agree that relying solely on GPT-4o for conversational metrics may introduce biases and may not fully capture nuanced human judgments (e.g., coherence, engagement, social appropriateness). Incorporating human evaluations and adversarial testing would indeed offer a more robust and comprehensive assessment.
However, conducting large-scale human evaluation was infeasible due to the scope of our experiments, which involve evaluating multiple baseline models and our method across three alignment datasets and two evaluation datasets. Therefore, following prior studies [1,2], we opted to prioritize broad coverage through automated evaluation. We will explicitly acknowledge this trade-off in the Limitations section.
[1] SimPO: Simple Preference Optimization with a Reference-Free Reward
[2] Zephyr: Direct Distillation of LM Alignment
W2: Some claims need to be verified.
Some claims need to be verified; especially, claims about broader applicability to reasoning lack thorough analysis.
We fully agree that drawing definitive conclusions requires broader evaluation across diverse architectures. The preliminary experiment on Qwen2.5-7B-Instruct (Sec. 5, Fig. 3, Tab. 10) serves as evidence of the potential (stated in line 71) of our method to enhance reasoning capabilities. This experiment was conducted shortly after the reasoning datasets became available, around February and March.
W3: Reproducibility.
Details on hyperparameter tuning (e.g., LoRA ranks, learning rates) are deferred to the appendix and may not be reproducible.
Below, we clarify our experimental setup and how we ensure fair comparisons and hyperparameter reproducibility:
- LoRA Rank (Fixed for Fairness): We set the LoRA rank to 32 for all methods to ensure consistency and avoid any confounding effects from hyperparameter tuning. Although our initial results indicate that higher ranks (such as 64) might improve performance, we chose to keep the rank fixed. This allows us to focus on evaluating the impact of our method, rather than differences caused by varying hyperparameters.
- Learning Rate (Tuned via Two-Stage Search): For each method, we first fixed a reasonable learning rate and tuned the number of training epochs. After identifying the optimal number of epochs, we then searched for the best learning rate using this fixed epoch count (see the sketch after this list). This avoids an exhaustive grid search while still enabling effective optimization.
- LLM Evaluation Variance: We acknowledge that evaluations involving GPT-4o are inherently sensitive to API versions and configuration settings (e.g., prompts, decoding parameters). Despite this, all models—including our baselines—were evaluated under identical conditions to ensure fairness.
- Code Release for Reproducibility: To ensure reproducibility, we will publicly release our code and configuration files upon publication.
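For concreteness, the following is a minimal sketch of the two-stage search described in the second bullet above. It is illustrative only: `train_and_eval` is a hypothetical helper that trains with the given settings and returns a validation score, not part of our released code.

```python
# Hedged sketch of the two-stage hyperparameter search (names are illustrative).
def two_stage_search(train_and_eval, epoch_counts, lrs, init_lr=1e-4):
    # Stage 1: fix a reasonable learning rate and tune the number of epochs.
    best_epochs = max(epoch_counts, key=lambda e: train_and_eval(lr=init_lr, epochs=e))
    # Stage 2: with the epoch count fixed, search for the best learning rate.
    best_lr = max(lrs, key=lambda lr: train_and_eval(lr=lr, epochs=best_epochs))
    return best_epochs, best_lr
```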
Q1 (Part 1): How does the binary mask balance sparsity and performance during gradient descent?
We believe there may be a misunderstanding regarding our training process. Our method does not require balancing sparsity and performance during gradient descent. Instead, for each layer we optimize a real-valued variable α_l, which is mapped to the (0, 1) range using a sigmoid function σ(α_l). This score reflects the importance of each layer. After learning these scores, we use a hyperparameter k to select the top-k layers based on their scores for subsequent selective fine-tuning. Therefore, the sparsity of the binary mask is determined solely by the hyperparameter k.
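To make this concrete, below is a minimal PyTorch sketch of the mechanism as described above. All names (`alpha`, `soft_gates`, `top_k_layers`) are illustrative, not the identifiers used in our implementation.

```python
import torch

num_layers = 32
# One real-valued importance variable per layer, learned by gradient descent.
alpha = torch.zeros(num_layers, requires_grad=True)

def soft_gates(alpha):
    # Sigmoid maps each score into (0, 1); during score learning these soft
    # gates weight each layer's parameter update.
    return torch.sigmoid(alpha)

def top_k_layers(alpha, k):
    # Sparsity is set purely by the hyperparameter k: keep the k layers
    # with the highest learned scores for selective fine-tuning.
    return torch.topk(alpha, k).indices.tolist()

important = top_k_layers(alpha.detach(), k=8)
```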
Q1 (Part 2): Could stochasticity in training affect layer importance rankings?
We already addressed this from two perspectives in our paper:
- Random seeds: As shown in Table 3, the layer importance rankings are highly consistent across different random seeds, with a Jaccard Similarity of nearly 90%, indicating strong robustness to training stochasticity.
- Parameter initialization: As discussed in our ablation study (Observation 5, Table 15, Appendix C.4), different initializations lead to only minor variations, while the high Jaccard Similarity indicates that the rankings remain generally stable.
We will emphasize these points more clearly in the revised version.
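For reference, the Jaccard similarity reported in these tables is the standard set overlap between top-ranked layer sets; a minimal sketch (with made-up layer indices) follows.

```python
def jaccard(a, b):
    # Jaccard similarity between two sets of layer indices.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical top-layer sets from two random seeds (illustrative values only).
seed0_top = {0, 1, 2, 4, 5, 7, 8, 9}
seed1_top = {0, 1, 2, 4, 5, 7, 8, 10}
print(jaccard(seed0_top, seed1_top))  # 7 shared / 9 total ≈ 0.78
```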
Q2: How would ILA perform on alignment datasets with conflicting stylistic objectives (e.g., formal vs. casual responses)?
Thank you for your thoughtful question. To address this, we conducted additional experiments on the OASST1 dataset [1], which features diverse linguistic styles and perspectives, contributed by volunteers with varying educational and professional backgrounds. This diversity makes OASST1 well-suited for analyzing stylistic variation. Similar to QLoRA [2], we use a subset of the OASST1 dataset containing the highest-rated paths in the conversation tree.
- Robustness of Layer Importance Rankings: We compared the rankings obtained by fine-tuning LLaMA 2-7B on OASST1 with those from LIMA, No Robots, and Alpaca-GPT4. As shown in Table 1, the Jaccard Similarities remain high (0.83–0.90), indicating that ILA identifies similar important layers despite stylistic variation.
Table 1: Jaccard similarities of the top 75% important layers in Llama 2-7B.
| | No Robots | LIMA | Alpaca-GPT4 |
|---|---|---|---|
| OASST1 | 0.86 | 0.90 | 0.83 |
Note that these similarities are slightly lower than those observed among LIMA, No Robots, and Alpaca-GPT4 themselves (∼91%; see Table 2 in the main paper), which is expected given the broader stylistic and linguistic diversity present in OASST1.
- Performance Improvements: We also fine-tuned LLaMA 2-7B on OASST1 using only the top ~75% important layers identified by ILA. As shown below, selective fine-tuning yields consistent performance gains:
Table 2: Comparison of Llama 2-7B fine-tuned on the OASST1 dataset.
| | MMLU | HellaSwag | Vicuna | MT-Bench |
|---|---|---|---|---|
| LoRA | 44.82 | 58.98 | 6.62 | 4.46 |
| LoRA w/ ILA (~75%) | 45.24 | 59.22 | 6.75 | 4.51 |
These findings demonstrate that ILA remains effective even on datasets with conflicting stylistic objectives.
[1] Köpf, Andreas, et al. "OpenAssistant Conversations: Democratizing Large Language Model Alignment." arXiv preprint arXiv:2304.07327 (2023).
[2] Dettmers, Tim, et al. "QLoRA: Efficient Finetuning of Quantized LLMs." Advances in Neural Information Processing Systems 36 (2023): 10088-10115.
Dear Reviewer pE9b,
Thank you very much for your careful consideration and valuable feedback on our paper. We submitted our rebuttal five days ago, and with less than three days left in the discussion phase, we would like to kindly check whether our responses have addressed your main concerns.
We have provided detailed responses and clarifications, and sincerely hope these have addressed your concerns and may contribute to a more positive evaluation. Please let us know if any questions remain.
Thank you again for your time and feedback.
Best regards,
The Authors
This paper introduces ILA (Identifying Layer significance for Alignment), a method to find key layers in Large Language Models (LLMs) during alignment. It reveals that a consistent set of layers is crucial across different datasets. Interestingly, ignoring less important layers can boost performance, and fine-tuning only critical layers improves efficiency with minimal performance loss. The research also suggests these findings could apply to LLM reasoning.
Reasons to Accept
- Clear Writing: The paper is well-structured and explains complex concepts, such as the ILA methodology and its theoretical underpinnings, in an accessible manner. The arguments flow logically from the introduction of the problem to the presentation of results and their implications, making it easy for readers to follow the research.
- Interesting Findings: The discovery that important layers for alignment show high consistency across diverse datasets is a significant and intriguing finding. Additionally, the observations that freezing certain "unimportant" layers can lead to performance improvements (Table 5), and that selective tuning of a small percentage of critical layers maintains performance, offer valuable and somewhat counter-intuitive insights into the LLM alignment process.
- Extensive Experimentation: The authors validate their claims through a comprehensive suite of experiments. They utilize multiple LLM architectures across various sizes, test on several alignment and reasoning datasets, and evaluate using a broad range of benchmarks for language understanding and conversational ability.
Reasons to Reject
- Reliance on LoRA Approximation: The ILA method primarily utilizes LoRA (Low-Rank Adaptation) to approximate parameter changes for computational efficiency. An ablation study indicates that LoRA-based importance scores achieve nearly 83% overlap with those derived from full fine-tuning (FFT). However, this also implies that LoRA is an approximation and may not perfectly mirror the true parameter changes of FFT, which could influence the precise identification of the most critical layers. Despite this, the paper suggests LoRA provides a "strong approximation".
- Computational Cost of Initial Stage: Although ILA is presented as an efficient method, its first stage, which involves training the model with LoRA until it reaches a stable state, accounts for the majority of the computational cost. The paper suggests that identifying layer importance effectively requires only 25-50% of the training milestones. Nevertheless, this initial phase can still be resource-intensive, particularly for very large models.
- Exploration Range for Layer Freezing/Tuning: The study primarily investigates freezing approximately 25% of unimportant layers to enhance performance or fine-tuning 10-30% of key layers to improve efficiency. While these specific percentages yield positive outcomes, a more detailed analysis of how varying these proportions affects the balance between performance and efficiency could offer further insights.
We sincerely appreciate your thoughtful and constructive feedback! Below, we address your concerns.
W1: Reliance on LoRA Approximation.
The ILA method primarily utilizes LoRA (Low-Rank Adaptation) to approximate parameter changes for computational efficiency. An ablation study indicates that LoRA-based importance scores achieve nearly 83% overlap with those derived from full fine-tuning (FFT). However, this also implies that LoRA is an approximation and may not perfectly mirror the true parameter changes of FFT, which could influence the precise identification of the most critical layers. Despite this, the paper suggests LoRA provides a "strong approximation".
Why LoRA is considered a strong approximation of FFT:
- Our experiments show that LoRA’s importance rankings overlap with those from FFT by 83%. This high level of agreement is sufficient for selective tuning, which is inherently robust to small deviations in layer ranking. Minor discrepancies (e.g., the 17% non-overlap) have minimal impact on downstream performance, enabling substantial reductions in computation while maintaining accuracy.
- Furthermore, we identified layers marked as important (Top 75%) by FFT but not by LoRA on LLaMA 2-7B using the No Robots dataset. These layers had low average ranks according to FFT (124 out of 168; best rank: 57), suggesting they are not among the most critical layers.
While we chose LoRA for its efficiency, FFT remains a practical alternative. Because our method is robust across datasets, using FFT to identify layer importance rankings for a specific model incurs only a one-time cost. Users can then directly reuse the precomputed FFT-based rankings for different alignment tasks and similar model architectures.
W2: Computational Cost of Initial Stage.
Although ILA is presented as an efficient method, its first stage, which involves training the model with LoRA until it reaches a stable state, accounts for the majority of the computational cost. The paper suggests that identifying layer importance effectively requires only 25-50% of the training milestones. Nevertheless, this initial phase can still be resource-intensive, particularly for very large models.
The key insight of this work is that layer importance rankings are consistent across different alignment datasets (Sec. 5.2), which enables the use of a single ranking across multiple tasks. This consistency allows us to provide standardized importance rankings for popular models (e.g., LLaMA-3, Qwen2.5), eliminating the need for repeated computation. Consequently, the overhead of identifying important layers with ILA is a one-time cost.
W3: Exploration Range for Layer Freezing/Tuning.
The study primarily investigates freezing approximately 25% of unimportant layers to enhance performance or fine-tuning 10-30% of key layers to improve efficiency. While these specific percentages yield positive outcomes, a more detailed analysis of how varying these proportions affects the balance between performance and efficiency could offer further insights.
Our analysis already demonstrated the trade-off between performance and efficiency. Specifically, fine-tuning ~75% of layers (Tables 5, 17, and 18) achieves strong performance, while tuning only ~30% of layers (Table 6) retains reasonable performance with improved efficiency.
To further investigate this, we conducted additional experiments by fine-tuning Llama 3.1-8B on the No Robots dataset with varying proportions of layers (15%, 35%, 55%, 75%). As summarized in the Table below, the results reveal a non-linear trade-off: training efficiency steadily increases as fewer layers are tuned, while performance remains relatively stable when tuning 75% or 55% of the layers, but drops considerably when tuning only 35% or 15% of the layers.
Table 1: Impact of the Proportion of Fine-tuned Layers on Performance and Efficiency (Model: Llama 3.1-8B; Alignment Dataset: No Robots). Batch size is 1 and maximum token length is 1024. Training time denotes the average time per iteration.
| | MMLU | HellaSwag | Vicuna | MT-Bench | GPU Memory (MiB) | Training Time (ms) |
|---|---|---|---|---|---|---|
| FFT | 64.44 | 63.65 | 6.50 | 5.11 | 76371 | 446 |
| FFT w/ ILA (~75%) | 65.00 | 63.69 | 6.61 | 5.23 | 68595 | 398 |
| FFT w/ ILA (~55%) | 65.02 | 63.49 | 6.59 | 5.21 | 63337 | 373 |
| FFT w/ ILA (~35%) | 64.94 | 62.93 | 6.38 | 5.01 | 51529 | 325 |
| FFT w/ ILA (~15%) | 65.11 | 61.35 | 5.94 | 4.68 | 35839 | 236 |
Dear Reviewer 1beu,
We sincerely appreciate the time and effort you have dedicated to reviewing our paper. As the discussion period is drawing to a close, we would be grateful if you could kindly review our rebuttal and let us know if any questions or concerns remain.
If our responses have satisfactorily addressed your concerns, we would greatly appreciate your consideration of a higher rating for our work.
Thank you once again for your time and consideration.
Best regards,
The Authors
Dear Reviewer 1beu,
We submitted our rebuttal four days ago, and as the discussion period is now nearly halfway over, we would like to kindly inquire whether our responses have sufficiently addressed your concerns. If there are any remaining questions or issues that require clarification, we would greatly appreciate your feedback.
We sincerely hope that our responses meet your expectations and may contribute to a reconsideration of your evaluation. Thank you once again for your time and consideration.
Best regards,
The Authors
The paper studies which layers matter in LLM alignment. Specifically, the authors train models until the loss becomes stable and then learn a mask that minimizes the loss. The results show that the method performs comparably when the mask is applied, while making training faster and less memory-intensive. The authors also provide theory and empirical results showing that the learned masks are robust and do not change substantially as training progresses.
Reasons to Accept
- The topic is very interesting and helpful for the explainable AI community.
- The theory is simple yet useful; it strengthens the paper since it guarantees the stability of the masks.
Reasons to Reject
- I do not see significant drawbacks in this paper.
- However, I do feel the paper can be further improved by trying to answer "why" those layers are important, which will make the paper more novel.
Questions for Authors
- Suggestion: Try using vector graphics in Figure 2
We sincerely appreciate your positive feedback and thoughtful suggestions!
However, I do feel the paper can be further improved by trying to answer "why" those layers are important, which will make the paper more novel.
We fully agree with your thoughtful suggestion. Our layer ranking results (e.g., Figure 1) consistently demonstrate that Feed-Forward Network (FFN) layers are more important than attention layers. This observation aligns with previous studies [1,2], which identify FFNs as key memory modules responsible for encoding factual knowledge, while attention layers primarily facilitate token interactions. This distinction explains why FFN layers consistently achieve higher importance rankings in our results.
Moreover, our findings suggest several practical implications:
- Model compression should prioritize retaining upper FFN layers.
- Knowledge editing may be more effective when targeting middle-to-upper FFN layers.
We will revise Section 4.2 to emphasize this observation and its implications, which we believe will further strengthen the novelty of our contribution.
Thank you once again for your invaluable suggestion!
[1] Geva, Mor, et al. "Transformer feed-forward layers are key-value memories." arXiv preprint arXiv:2012.14913 (2020).
[2] Meng, Kevin, et al. "Locating and editing factual associations in gpt." Advances in neural information processing systems 35 (2022): 17359-17372.
Q1: Try using vector graphics in Figure 2.
Thank you for your valuable advice. We will use vector graphics in the revised version.
I read the rebuttal and decided to maintain my score.
Dear Reviewer K9M2,
Thank you very much for taking the time to review our paper and rebuttal, for maintaining your positive evaluation, and for your encouraging and favorable comments.
We sincerely appreciate your thoughtful consideration and support.
Best regards,
The Authors
This paper investigates which layers of large language models (LLMs) are most responsible for alignment changes introduced by supervised fine-tuning. The authors propose ILA, a method that learns a binary mask over per-layer parameter updates, treating mask activation as a measure of layer importance. Experiments across diverse alignment datasets show that the sets of important layers share nearly 90% overlap, suggesting a common alignment pattern irrespective of data source. Furthermore, freezing the non-essential layers improves the overall performance. This suggests the possibility of constraining fine-tuning to a subset of layers while enjoying substantial efficiency gains with negligible accuracy loss. The study also offers initial evidence that these insights generalize from alignment tasks to reasoning abilities.
Reasons to Accept
- The consistency of the identified critical layers within the same model across different alignment datasets is an interesting observation. It lays the foundation for the feasibility of constraining fine-tuning to only a subset of layers.
- It is a little surprising but interesting to see that freezing unimportant layers can actually improve alignment performance.
- The authors conduct intensive numerical experiments and well-designed and informative ablation study.
- The paper is written clearly.
Reasons to Reject
-
I think the authors overclaim the potential of the proposed method to improve alignment efficiency through selective fine-tuning. On one hand, selective fine-tuning requires the identification of the important layers first using ILA, which involves at least 30-50% of normal training (either full-model training or PEFT training) plus the score-training overhead. On the other hand, I do not think selective fine-tuning helps much with improving training efficiency, as gradients still need to be propagated to earlier layers.
-
I am not sure why the authors include AdaLoRA as a baseline but do not include results of AdaLoRA with ILA. If I am understanding it correctly, AdaLoRA is a LoRA variant and is orthogonal to ILA. Using only AdaLoRA as a baseline provides very limited information to the audience.
-
The message conveyed by Thm 3.1 is unclear to me. Firstly, the first statement in Thm 3.1, "For a sufficiently small ϵ, θ_T is ϵ-stable", looks like an assumption to me. The statement itself reads oddly to me because we do not know if a pre-trained transformer is ϵ-stable for a very small ϵ. We can only say that it is ϵ-stable for a sufficiently large ϵ. But a large ϵ makes the result in eqn (11) weak. In summary, I am confused about Thm 3.1. I recommend the authors elaborate more on Thm 3.1, or avoid including it in the main text just to make the paper more mathy.
We sincerely appreciate your positive comments and thoughtful feedback! Below, we address some potential misunderstandings (W1 & W3) as well as your additional concern (W2).
W1: Overclaiming the potential of the proposed method.
Selective fine-tuning requires the identification of the important layers first using ILA, which involves at least 30-50% of normal training (either full-model training or PEFT training) plus the score-training overhead. I do not think selective fine-tuning helps much with improving training efficiency, as gradients still need to be propagated to earlier layers.
Please note that identifying important layers via ILA incurs only a one-time cost (30–50% of standard training). As shown in Section 5.2, layer importance rankings are consistent across different alignment datasets, allowing us to reuse a single ranking across tasks. This enables us to provide standardized importance rankings for popular models (e.g., LLaMA-3, Qwen2.5), eliminating the need for repeated computation.
Regarding training efficiency, our method primarily improves memory efficiency, while also contributing to overall training speed. As shown in Table 11 (Appendix C.1), our method reduces peak GPU memory usage by approximately 59% in full fine-tuning and offers an additional ~30% reduction on top of efficient methods like QLoRA. While the speed improvement is less pronounced, we still observe an increase of over 10% in training speed.
We will clarify these points further in the revised version. Thank you again for your valuable feedback.
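To illustrate where the memory and speed savings come from, here is a hedged PyTorch sketch of selective fine-tuning: layers outside the ILA-selected set are frozen so they carry no gradients or optimizer state. The attribute `model.layers` is an assumption for illustration; the actual module path depends on the architecture.

```python
import torch.nn as nn

def freeze_unimportant_layers(model: nn.Module, important_indices):
    # Disable gradients for every transformer block not selected by ILA.
    # Frozen blocks store no gradients or optimizer state, which is the
    # main source of the memory savings reported in Table 11.
    for idx, block in enumerate(model.layers):  # assumed attribute name
        if idx not in important_indices:
            for p in block.parameters():
                p.requires_grad_(False)
```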
W2: The absence of AdaLoRA+ILA baseline.
I am not sure why the authors include AdaLoRA as a baseline but do not include results of AdaLoRA with ILA. If I am understanding it correctly, AdaLoRA is a LoRA variant and is orthogonal to ILA.
Thanks for the thoughtful question.
We chose AdaLoRA as a baseline because it shares a similar objective with ILA. Both AdaLoRA and ILA aim to allocate tuning resources based on layer importance. AdaLoRA achieves this implicitly by adjusting ranks during training (i.e., rank=0 in AdaLoRA indicates a layer does not require tuning), while our ILA explicitly identifies the important layers to be tuned prior to the fine-tuning process.
Indeed, AdaLoRA could be combined with ILA. However, the current implementation of AdaLoRA in the PEFT library contains bugs that significantly hinder distributed training, leading to prohibitively slow execution. As a result, we were unable to conduct this experiment in our submission due to limited computing resources.
During the past few days of the rebuttal period, we have experimented with combining AdaLoRA with ILA on LLaMA2-7B, using the No Robots dataset for alignment. The test results in the Table below show that incorporating ILA consistently improves performance.
Table 1: Comparison of Llama 2-7B fine-tuned on the No Robots dataset.
| | MMLU | HellaSwag | Vicuna | MT-Bench |
|---|---|---|---|---|
| AdaLoRA | 45.23 | 57.30 | 5.81 | 4.01 |
| AdaLoRA w/ ILA | 45.45 | 57.41 | 5.88 | 4.09 |
W3: The message conveyed by Thm 3.1 is unclear.
Firstly, the first statement in Thm 3.1, "For a sufficiently small ϵ, θ_T is ϵ-stable", looks like an assumption to me. The statement itself reads oddly to me because we do not know if a pre-trained transformer is ϵ-stable for a very small ϵ. We can only say that it is ϵ-stable for a sufficiently large ϵ. But a large ϵ makes the result in eqn (11) weak. In summary, I am confused about Thm 3.1.
Please note that the ϵ-stability assumption in Theorem 3.1 does not refer to the model being ϵ-stable during the pre-training phase. Instead, this assumption applies specifically to the alignment training stage. At this stage, after the loss has largely converged, we assume θ_T is ϵ-stable, which implies that the model parameters are close to a local optimum and change very little between iterations.
This assumption is supported by standard optimization theory (e.g., gradient descent on smooth non-convex objectives [1]), which shows that as training progresses, the difference in expected loss between consecutive steps approaches zero: E[L(θ_{t+1})] - E[L(θ_t)] → 0. This justifies assuming a small ϵ at convergence.
As stated in lines 148-149, Theorem 3.1 shows that when θ_T is ϵ-stable, solving the optimization problem in Eq. (3) at any subsequent training step will yield similar results; specifically, similar binary masks will be learned.
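For concreteness, the following LaTeX fragment states one plausible formalization of this stability notion, using our own notation (L for the alignment loss); the precise definition and constants are those given in the paper.

```latex
% One plausible reading of \epsilon-stability at alignment convergence
% (our notation; the paper's Thm 3.1 gives the precise statement):
% \theta_T is \epsilon-stable if, for every step t \ge T,
\[
  \bigl| \mathbb{E}[\mathcal{L}(\theta_{t+1})] - \mathbb{E}[\mathcal{L}(\theta_t)] \bigr| \le \epsilon,
\]
% so the mask-learning problem in Eq. (3) changes only negligibly across
% such steps, and the learned binary masks remain essentially the same.
```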
We'll further clarify these points in the revised version.
[1] Optimization methods for large-scale machine learning. SIAM review 60.2 (2018): 223-311.
Dear Reviewer EcV9,
We sincerely appreciate the time and effort you took to review our paper. We submitted our rebuttal five days ago and, with the discussion phase ending soon, we would like to kindly check if our responses have addressed your concerns.
We hope our replies have clarified any misunderstandings and may lead to a more positive assessment of our work. Please let us know if anything remains unclear.
Thank you again for your time and feedback.
Best regards,
The Authors
We sincerely thank all the reviewers for their invaluable time and constructive feedback! We will respond to each reviewer's comments individually soon. We are happy to provide additional information if needed.
This paper introduces ILA, a method for identifying critical layers in LLMs during alignment by learning binary masks over per-layer updates. Through extensive experiments across multiple datasets and model architectures, the authors show that a consistent subset of layers (~75%) is responsible for most alignment changes. Freezing unimportant layers can enhance alignment performance and improve training efficiency. The paper is well-structured and the empirical evidence is strong. Reviewers raised concerns about overhead, LoRA approximation, and stylistic generalization, all of which were directly addressed with new experiments and reproducibility details. The work balances practical value with interpretability, offering actionable insights for scalable and efficient LLM alignment. I recommend acceptance.