PaperHub
Rating: 6.0/10 (Poster, 4 reviewers)
Individual ratings: 6, 7, 5, 6 (min 5, max 7, std 0.7)
Confidence: 3.3 · Soundness: 3.0 · Contribution: 3.3 · Presentation: 3.0
NeurIPS 2024

What Rotary Position Embedding Can Tell Us: Identifying Query and Key Weights Corresponding to Basic Syntactic or High-level Semantic Information

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Transformer-based large language models (LLMs) have successfully handled various tasks. As one fundamental module in Transformers, position encoding encodes the positional information of tokens in a sequence. Specifically, rotary position embedding (RoPE), one of the most widely used techniques, encodes the positional information by dividing the query or key value with $d$ elements into $d/2$ pairs and rotating the 2D vectors corresponding to each pair of elements. Therefore, the direction of each pair and the position-related rotation jointly determine the attention score. In this paper, we show that the direction of the 2D pair is largely affected by the angle between the corresponding weight vector pair. We theoretically show that non-orthogonal weight vector pairs lead to great attention on tokens at a certain relative position and are less sensitive to the input, which may correspond to basic syntactic information. Meanwhile, the orthogonal weight vector pairs are more flexible regarding the relative position, which may correspond to high-level semantic information. Empirical evidence supports the hypothesis that shallow layers of LLMs focus more on local syntax and deep layers focus more on high-level semantics. Furthermore, we show that LLM fine-tuning mainly changes the pairs of weight vectors that are nearly orthogonal, i.e., the weights corresponding to high-level semantics, which enables the reduction of the number of trainable parameters during fine-tuning without sacrificing performance. We propose a method named Angle-based Weight Selection (AWS) to reduce the fine-tuning overhead and verify the effectiveness of the proposed method on the widely used Alpaca fine-tuned Llama-2.
Keywords
Large Language Model · Rotary Position Embedding · Self Attention

Reviews and Discussion

Review (Rating: 6)

This paper investigates how the Transformer models with RoPE embeddings change their weights during pre-training and fine-tuning. The authors theoretically prove that RoPE divides key and query pairs into two groups: near-orthogonal and non-orthogonal weight vector pairs, which have different sensitivities to the input embeddings. The authors perform empirical studies that reveal that this property helps the model learn different levels of language abstraction: non-orthogonal weight vector pairs focus more on basic syntactic information, while nearly orthogonal weight vector pairs focus more on high-level semantic information. Finally, the authors propose utilizing this property during fine-tuning: while the model learns basic syntactic information during pre-training, only high-level semantic information requires updates in fine-tuning. Thus, the authors propose only changing the orthogonal pairs of weights in fine-tuning, significantly reducing the number of trainable parameters while keeping or improving performance and reducing overfitting.

Strengths

This paper discovers an interesting property of attention weight vector pairs, which gives insight into how information about different language abstractions is learned and stored in the model layers. The paper is well-written and provides significant support for its major claims, with extra details provided in the appendices. The proposed novel approach for efficient finetuning advances the state of the art in parameter-efficient finetuning and can be widely adopted.

Weaknesses

The paper lacks diversity and depth in its experimental results, particularly in Chapters 3 and 4. The analysis presented in Section 3.2 is the only part of the paper that establishes a link between the level of abstraction learned by the model and the angles of the weight pairs, while the rest of the paper relies on this hypothesis. However, this section (and the Appendix) only provides a few examples that support the authors' claim, which could be cherry-picked. To better support this claim, the authors could:

Provide a clear definition of "Basic Syntactic or High-level Semantic Information." Throughout the paper, the authors use these terms extensively, but never explain what they specifically mean by them (e.g., are they referring to specific tokens, sequences of tokens, or something else?). Once a definition is established, the authors could collect statistics across a variety of diverse tasks and demonstrate the correlation between angles and abstractions. The presence or absence of a statistically significant correlation could strongly support the hypothesis of a connection between weight pair angles and language abstraction.

Furthermore, in Chapter 4, experiments are only conducted on the Llama2 model (7b and 13b) using three tasks. To underscore the universal nature of the discovered law, I would like to see experiments on other open families of large language models (LLMs) and possibly on more tasks.

Questions

Questions:

  1. Your paper heavily relies on the distinction between "Basic Syntactic Information" and "High-level Semantic Information". Could you please define and provide examples of these two concepts when you first mention them in the introduction?

  2. Line 140: "To verify the conjecture in Sec. 3.1 that a large absolute cosine similarity |cos α| corresponds to basic syntactic information and […]" - was this conjecture made in Section 3.1, or is it a new hypothesis? I am not sure it was mentioned in Section 3.1.

  3. Table 1: Is it possible to provide evaluation results for the model fine-tuned without PEFT?

Typos/suggestions:

  1. Figure 1 never referenced in the paper.
  2. Line 105: “m-th input. The” -> “the m-th embedding/vector/etc, the”
  3. Line 151: “pairs(0.54” -> “pairs (0.54” (missing space)
  4. Line 208: “to Appendix B” -> “to Appendix B.”
  5. Line 273: is this sentence incomplete?

Limitations

Yes

Author Response

Dear reviewer 2wVC,

Thank you for your valuable feedback. Below, we address each of your concerns.

W1: The paper heavily relies on the link between the level of abstraction learned by the model and the angles of the weight pairs.

Q1: Could you please define and provide examples of these two concepts when you first mention them in the introduction?

Thank you for the suggestion. We will provide clearer definitions of "syntactic information" and "semantic information" in our paper. These concepts come from linguistics and are widely used in previous works. The formal definitions are:

  • Syntactic information: Syntactic information refers to the arrangement of symbols or words according to the rules of a formal system or language. It deals with the structure and organization of elements in a language, such as grammar, punctuation, and word order. It is concerned with the form rather than the meaning of the symbols. [1]
  • Semantic information: Semantic information pertains to the meaning and interpretation of words, phrases, and sentences. It involves understanding the meanings conveyed by linguistic expressions and how these meanings relate to one another within a language. It is concerned with the content and significance of the communication. [2]

In this paper, we theoretically prove that non-orthogonal weight pairs for the query and the key in LLMs with RoPE are less sensitive to the input and would pay greater attention to certain relative positions than orthogonal weight pairs. We try to link this phenomenon with the processing of the syntactic information and the semantic information, and our hypothesis is the following:

  • Non-orthogonal weight pairs focus on certain relative positions that deal with the syntactic information, such as the grammar or certain word order.
  • Orthogonal weight pairs are more flexible in terms of positions regarding the input; therefore, they process more semantic information since the content could appear anywhere in a paragraph.

To validate our hypothesis, we provide attention score visualization in Fig. 2, where the attention head with the most non-orthogonal weight pairs pays greater attention to the structure of the phrase, such as the special token for the end of the input prompt. In Sec. 3.3, we align our observation with previous works [3] that shallow layers process more syntactic information and deep layers process more semantic information, which may also strengthen our hypothesis.

However, we must also emphasize that our paper does not rely on this hypothesis. Setting aside the concepts of "syntactic" and "semantic" information, we theoretically show how the angle between weight vector pairs affects RoPE in Sec. 3.1 and further investigate the weight vector angle (Sec. 3.3 and Sec. 3.4). In Sec. 4, we show that fine-tuning the pre-trained LLMs mainly changes the nearly orthogonal weight vector pairs and further propose a simple but effective method to reduce the number of trainable parameters while boosting performance.

Distinguishing "syntactic information" from "semantic information" has always been somewhat subjective. For example, previous works [4] manually associate "syntactic" and "semantic" information with various tasks. We hope the hypothesis in our paper may contribute to the study of how deep learning models process syntactic and semantic information and inspire future work.

Thanks again for the suggestions. We will keep polishing our paper.

Q2: Line 140: was this conjecture made in Chapter 3.1, or is it a new hypothesis?

Sorry for the confusion. In Sec. 3.1, we only analyze how the angle between weight vectors affects RoPE and did not introduce the conjecture in detail. The detailed hypothesis is written in our rebuttal above. We will also make this clearer in our paper.

W2: I would like to see experiments on other open families of large language models (LLMs) and possibly on more tasks.

We have conducted additional experiments with Llama-2-7B, Mistral-7B, and Phi-2. We will add the results and details to our paper and release the code. We finetune the pre-trained model on WikiText-2 and GSM8K with LoRA following the setting in [3]. Models are fine-tuned through causal language modelling on training sets and are tested on validation/test sets. The results are in the following table. Generally, our proposed method improves the performance while reducing the number of trainable parameters.

| Model | Threshold | WikiText-2 Perplexity (↓) | GSM8K Accuracy (%, ↑) |
| --- | --- | --- | --- |
| Llama2-7b | baseline (1) | 5.518 | 39.20 |
| Llama2-7b | 0.01 | 5.485 | 40.86 |
| Llama2-7b | 0.005 | 5.483 | 38.44 |
| Llama2-7b | 0.001 | 5.480 | 38.06 |
| Mistral-7B-v0.1 | baseline (1) | 6.423 | 54.51 |
| Mistral-7B-v0.1 | 0.01 | 6.335 | 55.12 |
| Mistral-7B-v0.1 | 0.005 | 6.340 | 55.88 |
| Mistral-7B-v0.1 | 0.001 | 6.337 | 55.80 |
| Phi-2 | baseline (1) | 9.553 | 48.75 |
| Phi-2 | 0.01 | 9.766 | 51.71 |
| Phi-2 | 0.005 | 9.829 | 52.01 |
| Phi-2 | 0.001 | 9.836 | 52.92 |

Due to the character limitation of the rebuttal, we will update more experimental results in official comments, which will also be added to our paper. We hope these additional experiments can address your concern and are more than willing to conduct more experiments.

Q3: Results without PEFT?

We will update the result promptly. However, with limited computational resources, the results may not be available during the rebuttal period.

Typos/Suggestions

Thanks for the detailed feedback. We will check and polish our paper.

We hope our rebuttal addresses your concerns, and we look forward to your reply.

[1] Fromkin, Victoria, Robert Rodman, and Nina Hyams. "An Introduction to Language." (2014).

[2] Saeed, John. "Semantics: The meaning of words and sentences." Routledge, 2015. 153-168.

[3] Li, Yixiao, et al. "LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models." ICLR 2024

Comment

I appreciate the authors addressing all concerns mentioned in the review. I think it is important to polish the paper to improve clarity, and I hope to see this in the final version. In light of the new experimental results, I will increase my score.

Comment

Dear reviewer 2wVC,

Thank you for your reply! We are glad to hear that our rebuttal has addressed your concerns. Following your valuable suggestions, we have modified our paper, including adding the definitions of syntactic information and semantic information to the introduction and fixing the mentioned typos. The modified parts are marked in red. The first post-rebuttal version of our paper is provided in the anonymous GitHub repo: https://anonymous.4open.science/r/NeurIPS2024_RoPE_investigate_rebuttal-E955/RoPE_investigate_NeurIPS_2024_rebuttal_v1.pdf

We will keep polishing our paper and adding more experimental results.

Best,

Authors.

Review (Rating: 7)

The paper is a novel work that identifies how the angle between weight vector pairs in the query or the key affects RoPE. The authors devise a simple yet effective method that is novel and orthogonal to existing fine-tuning efficiency techniques, such as LoRA. Experiments demonstrate that combining the proposed technique can further enhance the performance of LoRA in a lightweight manner.

Strengths

  1. The authors excellently demonstrate how LLMs using RoPE utilize positional information and theoretically show that non-orthogonal weight vector pairs influenced by RoPE are less sensitive to input, thereby drawing greater attention to specific relative positions.
  2. The authors further reveal that non-orthogonal weight vector pairs focus more on basic syntactic information, while nearly orthogonal weight vector pairs emphasize high-level semantic information.
  3. They highlight the key observation that fine-tuning LLMs mainly alters the orthogonal pairs of corresponding weight vectors. This insight leads to a natural technique for reducing the number of trainable parameters during LLM fine-tuning, as only the orthogonal pairs of weight vectors are modified.
  4. The proposed parameter reduction approach is effective and orthogonal to existing methods such as LoRA. It can be integrated with these methods to achieve better performance, with consistent performance gains across experiments.
  5. The paper is generally well-written, with clear illustrations, and demonstrates the authors’ strong background in LLMs and machine learning.

Weaknesses

  1. The introduction should more clearly specify the types of LLMs on which the authors conducted empirical studies rather than generally referring to LLMs.

Additionally, there are some grammatical errors and typos:

  • “Various position encoding have been proposed” should be “Various position encoding techniques have been proposed.”
  • “After the seminar work” should be “After the seminal work.”
  • “Where” should be “where” (line 120).
  • “Derive Eq. 4” should be “Deriving Eq. 4.”

  2. The authors could better clarify the motivation behind their perspective, specifically how the angle between weight vector pairs in the query or the key affects RoPE.

  3. The experimental section could be expanded, as the current version includes only one table in the main paper.

Questions

  1. Are there any related works on the angle perspective in LLMs or general deep networks? If so, the related work section could be expanded to include these studies.
  2. Are there any works similar to this paper which are based on a key observation and propose a simple yet effective novel method? Providing more such examples could help reviewers better benchmark its qualification against the NeurIPS standard.

Limitations

Please refer to the weaknesses part.

Author Response

Dear reviewer cttx,

We sincerely appreciate your detailed and valuable feedback. We address each of your comments below.

W1: The introduction should more clearly specify the types of LLMs on which the authors conducted empirical studies rather than generally referring to LLMs.

Thanks for the suggestion. We will update our paper and clarify in the introduction the types of models used for our empirical studies. Generally, our method applies to LLMs using RoPE. We conduct experiments on widely used LLMs such as Llama2, Llama3, Mistral, etc.

W2: The authors could better clarify the motivation behind their perspective, specifically how the angle between weight vector pairs in the query or the key affects RoPE.

RoPE divides the elements in the query or the key into pairs, treats each pair as a 2D vector, and encodes position information by rotating those 2D vectors. The motivation of this paper is that the angle between the corresponding weight vector pairs largely determines the initial angle of those 2D vectors. For example, if the two weight vectors in a pair point in the same direction, then the angle of the 2D vector is fixed regardless of the input.

Thank you for the suggestion. We will keep polishing our paper and further clarify the motivation.
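The motivation described above can be illustrated with a small NumPy sketch (our own toy illustration under simplified assumptions, not the paper's code; the vectors w1, w2 and dimension d are made up): when the two weight vectors of a RoPE pair are parallel, the 2D vector (w1·x, w2·x) that RoPE rotates always lies on the same line regardless of the input x, whereas for an orthogonal pair its direction varies with the input.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w1 = rng.normal(size=d)

w2_parallel = w1.copy()                      # non-orthogonal extreme: cos α = 1
w2_orth = rng.normal(size=d)
w2_orth -= (w2_orth @ w1) / (w1 @ w1) * w1   # Gram-Schmidt step: cos α = 0

def pair_angle(wa, wb, x):
    """Pre-rotation angle of the 2D vector (wa·x, wb·x) that RoPE rotates."""
    return np.arctan2(wb @ x, wa @ x)

inputs = [rng.normal(size=d) for _ in range(1000)]
angles_par = np.array([pair_angle(w1, w2_parallel, x) for x in inputs])
angles_orth = np.array([pair_angle(w1, w2_orth, x) for x in inputs])

# Modulo pi, the parallel pair's angle is the same (pi/4) for every input,
# so attention for this pair is driven almost entirely by the position-dependent
# rotation; the orthogonal pair's angle spreads with the input.
print(np.std(np.mod(angles_par, np.pi)))   # ≈ 0 (input-independent direction)
print(np.std(np.mod(angles_orth, np.pi)))  # clearly positive (input-dependent)
```

This matches the claim in the response: a non-orthogonal pair pins the pre-rotation direction, so the attention contribution of that dimension pair depends mainly on relative position.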

W3: The experimental section could be expanded.

We have conducted additional experiments with Llama-2-7B, Mistral-7B, and Phi-2. We will add the results and details to our paper and release the code.

  • WikiText-2 and GSM8K

We finetune the pre-trained model on WikiText-2 and GSM8K with LoRA following the setting in [1]. Models are fine-tuned through causal language modeling on training sets and are tested on validation/test sets. The results are in the following table. Generally, our proposed method improves the performance while reducing the number of trainable parameters.
| Model | Threshold | WikiText-2 Perplexity (↓) | GSM8K Accuracy (%, ↑) |
| --- | --- | --- | --- |
| Llama2-7b | baseline (1) | 5.518 | 39.20 |
| Llama2-7b | 0.01 | 5.485 | 40.86 |
| Llama2-7b | 0.005 | 5.483 | 38.44 |
| Llama2-7b | 0.001 | 5.480 | 38.06 |
| Mistral-7B-v0.1 | baseline (1) | 6.423 | 54.51 |
| Mistral-7B-v0.1 | 0.01 | 6.335 | 55.12 |
| Mistral-7B-v0.1 | 0.005 | 6.340 | 55.88 |
| Mistral-7B-v0.1 | 0.001 | 6.337 | 55.80 |
| Phi-2 | baseline (1) | 9.553 | 48.75 |
| Phi-2 | 0.01 | 9.766 | 51.71 |
| Phi-2 | 0.005 | 9.829 | 52.01 |
| Phi-2 | 0.001 | 9.836 | 52.92 |

Due to the character limitation of the rebuttal, we will update more experimental results in official comments, which will also be added to our paper. We hope these additional experiments can address your concern and are more than willing to conduct more experiments.

Q1: Are there any related works on the angle perspective in LLMs or general deep networks? If so, the related work section could be expanded to include these studies.

To the best of our knowledge, most related works on the angle perspective in LLMs focus on input length extrapolation [2,3], trying to extend the input length limit by changing the rotation angle. We will continue looking for related work and update our related work section promptly.

Q2: Are there any works similar to this paper that are based on a key observation and propose a simple yet effective novel method? Providing more such examples could help reviewers better benchmark its qualification against the NeurIPS standard.

Many such works are based on key observations and propose a simple yet effective method. For example, the lottery ticket hypothesis [4], which won the ICLR 2019 best paper award, shows that dense, randomly initialized networks contain sparse subnetworks that can reach comparable performance when trained in isolation; the authors further propose a strategy to find such subnetworks. Notably, many excellent works contribute to the community mainly by providing key observations. For example, [5], a NeurIPS 2023 best paper, shows that emergent abilities only appear for specific metrics.

In this paper, we show that the angle between weight vector pairs in the query and the key affects how LLMs with RoPE utilize position information, and that fine-tuning the pre-trained model mainly changes orthogonal weight pairs. By providing this key observation and a new perspective for better understanding LLMs with RoPE, we hope this paper can inspire future work in this community.

We hope our rebuttal addresses your concerns and we look forward to your reply. Please feel free to raise any new comments, and we will respond promptly.

[1] Li, Yixiao, et al. "LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models." ICLR 2024

[2] Sun, Yutao, et al. "A length-extrapolatable transformer." ACL 2023

[3] Peng, Bowen, et al. "Yarn: Efficient context window extension of large language models." arXiv preprint arXiv:2309.00071 (2023).

[4] Frankle, Jonathan, and Michael Carbin. "The lottery ticket hypothesis: Finding sparse, trainable neural networks." ICLR 2019

[5] Schaeffer, Rylan, Brando Miranda, and Sanmi Koyejo. "Are emergent abilities of large language models a mirage?." NeurIPS 2023

Comment

This response successfully addressed the weaknesses. I would like to raise my score to accept. The response content should be included in the final version to achieve better completeness.

Comment

Dear reviewer cttx,

We are glad to hear that our rebuttal has addressed your concern. We will include the contents in the final version. Thank you again for your valuable feedback.

Best regards,

Authors

Review (Rating: 5)

The paper makes a significant contribution to the understanding of RoPE in LLMs and presents the QK-IPM method that reduces the number of trainable parameters during fine-tuning by targeting orthogonal weight vector pairs. The paper conducts experiments on TruthfulQA, GSM8K, and Hellaswag datasets to show the method's effectiveness.

Strengths

  1. The writing of this paper is clear.
  2. The paper conducts comprehensive analysis experiments to verify their conclusion.
  3. Based on their observation, the proposed method, which fixes the non-orthogonal pairs of weight vectors in the query and key of each layer, sounds reliable and simple to deploy.

Weaknesses

  1. The conclusion that shallow layers of LLMs focus more on basic syntactic information and deep layers of LLMs focus more on high-level semantics is somewhat boring. Many papers have claimed it.
  2. The font size of some figures (e.g., Fig 2, 4) is too small. Therefore, I cannot read and understand the figures.
  3. For Table 1, the phrase ‘Fixed weight’ and ‘vector pairs’ have been overlapped wrongly.

Questions

Please see weaknesses.

Limitations

Yes.

Author Response

Dear reviewer LVwG,

Thank you for your valuable feedback. We address each point of your concerns in the comments below.

W1: The conclusion that shallow layers of LLMs focus more on basic syntactic information and deep layers of LLMs focus more on high-level semantics is somewhat boring. Many papers have claimed it.

Indeed, previous works have drawn similar conclusions with empirical methods, which we also mention in our paper. However, drawing this conclusion is not the main goal of this paper. By aligning our observation with previous works, we aim to provide a more objective measure, the angle between weight pairs in LLMs using RoPE, to identify weights corresponding to processing basic syntactic information or high-level semantics. The similar conclusions in previous works help strengthen the link between the weight pair angle and basic syntactic or high-level semantic information. To our knowledge, this paper provides a brand-new perspective for identifying such weights.

W2: The font size of some figures (e.g., Fig 2,4) is too small. Therefore, I can not read and understand the figure.

We are sorry for the inconvenience and will improve this in the next version of our paper. In the meantime, these figures are vector images and remain sharp when zoomed in, which may be a viable way to read them for now.

In Figure 2, we visualize the attention score of different attention heads in Llama2-7B and Mistral-7B in two sentences. In Figure 4, we show the cosine similarity of each weight pair in a bar chart.

W3: For Table 1, the phrase ‘Fixed weight’ and ‘vector pairs’ have been overlapped wrongly.

We will fix it. Thank you again for your careful review.

We hope our rebuttal has addressed your concern and we are looking forward to your reply. We are more than happy to respond to any further questions.

Review (Rating: 6)

The authors propose a fascinating approach to optimizing transformer-based large language models (LLMs). They delve into the intricacies of position encoding, particularly focusing on the widely used rotary position embedding (RoPE) technique. By examining how the angle between weight vector pairs impacts attention scores, they reveal that non-orthogonal pairs are crucial for processing basic syntax, while orthogonal pairs handle higher-level semantics. Their experiments show that fine-tuning LLMs predominantly alters these orthogonal pairs, allowing them to propose a new method (QK-IPM) to reduce fine-tuning overhead. This method effectively trims down the number of trainable parameters without sacrificing performance, as evidenced by their tests on Alpaca fine-tuned Llama-2. Overall, their work offers a fresh perspective on position encoding and presents a practical solution for more efficient LLM fine-tuning.

Strengths

  1. The proposed Query-Key Internal Pair Masking (QK-IPM) method stands out for its efficiency. By identifying that non-orthogonal weight vector pairs don't need updating during fine-tuning, the method significantly reduces the number of trainable parameters, streamlining the fine-tuning process and saving computational resources.

  2. The approach is backed by solid theoretical insights. The paper explains the relationship between the angles of weight vector pairs and their roles in processing syntactic versus semantic information, providing a robust foundation for the proposed method. This depth of understanding adds credibility and makes the findings more convincing.

  3. The empirical evidence provided, particularly through tests on widely used models like Alpaca fine-tuned Llama-2, demonstrates the practical benefits of the method. This real-world validation shows that the technique is not just theoretically sound but also effective in improving model performance with reduced overhead.

Weaknesses

The paper primarily tests the proposed method on specific models and datasets. While the results are promising, a broader evaluation across various LLM architectures and more diverse datasets would strengthen the generalizability of the findings and ensure the method's robustness in different contexts.

Although the method reduces the number of trainable parameters, the process of calculating the angles between weight vector pairs and determining which pairs to update introduces additional computational steps. This could offset some of the efficiency gains, especially in large-scale applications, and might need further optimization to ensure overall net benefits.

Questions

See weaknesses.

Limitations

No

Author Response

Dear Reviewer CgW8,

We sincerely appreciate the time and effort you devoted to the reviewing process. We address each point of your concerns in the comments below.

W1: A broader evaluation across various LLM architectures and more diverse datasets would strengthen the generalizability of the findings.

Thank you for your advice. We have conducted additional experiments with Llama-2-7B, Mistral-7B, and Phi-2. We will add the results and details to our paper and release the code.

  • WikiText-2 and GSM8k

We finetune the pre-trained model on WikiText-2 and GSM8K with LoRA following the setting in [1]. Models are fine-tuned through causal language modeling on training sets and are tested on validation/test sets. The results are in the following table. Generally, our proposed method improves the performance while reducing the number of trainable parameters.

| Model | Threshold | WikiText-2 Perplexity (↓) | GSM8K Accuracy (%, ↑) |
| --- | --- | --- | --- |
| Llama2-7b | baseline (1) | 5.518 | 39.20 |
| Llama2-7b | 0.01 | 5.485 | 40.86 |
| Llama2-7b | 0.005 | 5.483 | 38.44 |
| Llama2-7b | 0.001 | 5.480 | 38.06 |
| Mistral-7B-v0.1 | baseline (1) | 6.423 | 54.51 |
| Mistral-7B-v0.1 | 0.01 | 6.335 | 55.12 |
| Mistral-7B-v0.1 | 0.005 | 6.340 | 55.88 |
| Mistral-7B-v0.1 | 0.001 | 6.337 | 55.80 |
| Phi-2 | baseline (1) | 9.553 | 48.75 |
| Phi-2 | 0.01 | 9.766 | 51.71 |
| Phi-2 | 0.005 | 9.829 | 52.01 |
| Phi-2 | 0.001 | 9.836 | 52.92 |

Due to the character limitation of the rebuttal, we will update more experimental results in official comments, which will also be added to our paper. We hope these additional experiments can address your concern and are more than willing to conduct more experiments.

Besides the proposed method, we would also like to emphasize our efforts to better understand LLMs with RoPE from a brand-new perspective. With theoretical derivation and extensive empirical results, we provide insights including:

  • For LLMs using RoPE, non-orthogonal query or key weight pairs are less sensitive to the input, and their attention score is mainly determined by position information.
  • Weight pairs in deep layers are closer to orthogonal than weight pairs in shallow layers.
  • Fine-tuning LLMs with RoPE mainly changes the orthogonal weight pairs.

We hope the brand-new perspective and the insights provided in this paper can contribute to the community and inspire future work.

W2: Calculating the angles between weight vector pairs and determining which pairs to update introduces additional computational steps.

Technically, the computational cost of calculating the cosine similarity between each weight pair is much less than that of generating a single token with the model. The calculation can be done on a CPU alone, without GPUs. In the following table, we report the time used to calculate the cosine similarity between each query and key weight pair on an AMD EPYC 7302 16-core processor.

| Model | Time used (s) |
| --- | --- |
| Llama-2-7b | 10.3545 |
| Llama-2-13b | 16.3472 |

Since our proposed method only requires calculating the cosine similarity once, the computational cost is negligible compared to the cost of fine-tuning. We will add a more comprehensive computational cost analysis to our paper. We sincerely appreciate your suggestion.
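To show why this computation is cheap, here is a minimal NumPy sketch of per-pair cosine similarity over a projection weight (our own illustration, not the authors' released code; the function name, the weight shapes, and the pairing conventions are our assumptions — the original RoPE formulation pairs dimensions (2i, 2i+1), while Llama-style implementations pair (i, i + d/2)):

```python
import numpy as np

def pair_cosine_similarities(W, pairing="interleaved"):
    """Cosine similarity between the two weight (row) vectors of each RoPE pair.

    W: (d, hidden) query or key projection weight of one attention head.
    pairing: "interleaved" pairs rows (2i, 2i+1); "half" pairs rows (i, i + d/2).
    """
    d = W.shape[0]
    if pairing == "interleaved":
        a, b = W[0::2], W[1::2]
    else:  # "half"
        a, b = W[: d // 2], W[d // 2:]
    num = np.einsum("ij,ij->i", a, b)                       # per-pair dot products
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

# Pairs with |cos| below a threshold are treated as near orthogonal and kept
# trainable; the rest are fixed during fine-tuning (cf. thresholds like 0.01).
W = np.random.default_rng(0).normal(size=(128, 4096))       # made-up shapes
cos = pair_cosine_similarities(W)
trainable_mask = np.abs(cos) < 0.01
```

The work is one matrix-sized pass of dot products and norms per projection, which is consistent with the reported seconds-scale CPU timings.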

We hope our rebuttal addresses your concern, and we look forward to your reply. We are more than happy to respond to any further questions.

[1] Li, Yixiao, et al. "LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models." ICLR 2024

Comment

The authors have satisfactorily addressed most of my concerns, and some scenarios may be out of the scope of this paper. I have raised my score. Once again, I want to express my gratitude for your hard work and commitment.

Comment

Dear reviewer CgW8,

We are glad to hear that our response has addressed your concerns. We really appreciate your commitment to the review process. Thank you once again for your valuable feedback!

Best regards,

Authors

Author Response

Dear AC and Reviewers,

We sincerely thank you for the time and effort you dedicated to the reviewing process. We are delighted that reviewers find the paper well-written (LVwG, cttx, 2wVC), supported by solid theoretical insights and extensive empirical evidence (CgW8, LVwG, cttx, 2wVC), and that the proposed method is simple and effective (CgW8, LVwG, cttx, 2wVC).

The reviewers' main concerns are similar: the experiments for the proposed method occupy only one table in Sec. 4. To address these concerns, we have conducted additional experiments as requested. We will also try our best to provide more experimental results during the rebuttal period. In this joint response, we list the additional experimental results we have collected so far.

  • We have conducted additional experiments with Llama-2-7B, Mistral-7B, and Phi-2. We finetune the pre-trained model on WikiText-2 and GSM8K with LoRA following the setting in [1]. Models are fine-tuned through causal language modelling on training sets and are tested on validation/test sets. The results are in the following table. Generally, our proposed method improves the performance while reducing the number of trainable parameters.

| Model | Threshold | WikiText-2 Perplexity (↓) | GSM8K Accuracy (%, ↑) |
| --- | --- | --- | --- |
| Llama2-7b | baseline (1) | 5.518 | 39.20 |
| Llama2-7b | 0.01 | 5.485 | 40.86 |
| Llama2-7b | 0.005 | 5.483 | 38.44 |
| Llama2-7b | 0.001 | 5.480 | 38.06 |
| Mistral-7B-v0.1 | baseline (1) | 6.423 | 54.51 |
| Mistral-7B-v0.1 | 0.01 | 6.335 | 55.12 |
| Mistral-7B-v0.1 | 0.005 | 6.340 | 55.88 |
| Mistral-7B-v0.1 | 0.001 | 6.337 | 55.80 |
| Phi-2 | baseline (1) | 9.553 | 48.75 |
| Phi-2 | 0.01 | 9.766 | 51.71 |
| Phi-2 | 0.005 | 9.829 | 52.01 |
| Phi-2 | 0.001 | 9.836 | 52.92 |

  • We test the time required to measure the angle between weight vector pairs. Technically, the computational cost of calculating the cosine similarity between each weight pair is much less than that of generating one token with the model. The calculation can be done on a CPU alone, without GPUs. In the following table, we report the time used to calculate the cosine similarity between each query and key weight pair on an AMD EPYC 7302 16-core processor.

| Model | Time used (s) |
| --- | --- |
| Llama-2-7b | 10.3545 |
| Llama-2-13b | 16.3472 |

For each reviewer, we have posted a rebuttal addressing the concerns. We look forward to your reply and are more than happy to respond to any further comments. Once again, thank you for your valuable comments and support.

[1] Li, Yixiao, et al. "LoftQ: LoRA-Fine-Tuning-aware Quantization for Large Language Models." ICLR 2024

Comment

Dear AC and reviewers,

We would like to again express our gratitude for the valuable feedback and the effort you have devoted to the reviewing process. Since we are halfway through the discussion period, we are eager to know whether our rebuttal addresses the reviewers' concerns, and we are ready to answer any further questions. We sincerely thank Reviewer 2wVC for responding to our rebuttal and raising the score.

In addition to the results provided in the rebuttal, we provide new experimental results in this comment. We follow [1] to finetune Llama-2-7B, Mistral-7B-v0.1, and Phi-2 on the math-10k dataset with LoRA and evaluate on the SVAMP dataset. The threshold is set at 0.01. We will provide more details in the next version of our paper and release the code upon acceptance.

| Model | Threshold | Fixed weight pair ratio (%) | SVAMP (%) |
| --- | --- | --- | --- |
| Llama-2-7B | baseline | 0 | 47.0 |
| Llama-2-7B | 0.01 | 77.20 | 46.9 |
| Mistral-7B-v0.1 | baseline | 0 | 61.9 |
| Mistral-7B-v0.1 | 0.01 | 94.28 | 61.9 |
| Phi-2 | baseline | 0 | 73.4 |
| Phi-2 | 0.01 | 72.88 | 73.4 |

We will keep on polishing our paper and update new experimental results promptly. We really look forward to your response. Thank you once again for your hard work and commitment to advancing the quality of our scholarly community.

Best regards,

Authors

[1] Hu, Zhiqiang, et al. "LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models." EMNLP 2023.

Final Decision

The paper presents an analysis of how RoPE embeddings affect pre-training and finetuning. The main finding is that key-query pairs that are more orthogonal correspond to more semantic information. The paper proposes a PEFT scheme that only updates these parameters during finetuning. The reviews are generally favorable, but there are some concerns that the paper uses unclear terms (basic syntactic or high-level semantic information) and the experimental results are a bit shallow (2wVC). The paper does seem to provide enough substance on a timely topic, but I would like to encourage the authors to be very clear in communicating the findings.