Does Low Rank Adaptation Lead to Lower Robustness against Training-Time Attacks?
This study explores the security risks of LoRA during fine-tuning, demonstrating that a low-rank structure enhances robustness to backdoor attacks but increases susceptibility to untargeted data poisoning.
Abstract
Reviews and Discussion
This paper investigates the impact of Low-Rank Adaptation (LoRA) on the training-time robustness (TTR) of LLMs, focusing on their resilience to data poisoning and backdoor attacks. The authors propose a theoretical framework for robustness analysis, leveraging tools from neural tangent kernel theory and information geometry. Their analysis quantifies the influence of LoRA's rank and initialization variance on robustness. Complementing the theoretical findings, the authors conduct experiments that validate their framework and empirically evaluate LoRA's role in model robustness during training.
update after rebuttal
The authors have addressed my concerns on technical details and promised to revise the paper, primarily to improve its structure, which I believe will benefit a broader audience.
Questions to Authors
The empirical results regarding the impact of variance on poisoning appear to deviate from the theoretical analysis. Could the authors provide additional empirical studies to investigate and reconcile this discrepancy? If further investigation is not feasible, I recommend removing the discussion on variance's impact on TTR from the paper, as the current empirical evidence does not align with the theoretical claims.
Claims and Evidence
The paper primarily presents a theoretical analysis, supplemented by empirical evaluations. While the evidence provided is not conclusively robust due to several idealized theoretical assumptions, such as the convergence of NTK to a Gaussian process, these simplifications are arguably acceptable given the complexity of analyzing LLM training dynamics.
In the empirical evaluation, the authors also point out that the variance impact on poisoning deviates from the theoretical analysis.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
After reviewing the proof section in the appendix, I did not observe any apparent issues with the algebraic derivations. However, there remains a significant conceptual gap: the connection between fewer information bits, smoother information geometry, and improved TTR is not clearly established. Despite multiple readings of Section 3.3, especially the paragraph "Double-Edged Sword of LoRA's TTR", the presentation of this argument remains unconvincing.
I strongly encourage the authors to elaborate on this critical aspect of their work. If providing a purely theoretical justification proves challenging, it would be beneficial to include some proof-of-concept evaluations to substantiate the claim. Such additions would greatly enhance the clarity and impact of the paper.
Experimental Design and Analyses
The experimental evaluations presented in this paper appear to be appropriate and well-designed for the scope of the research.
Supplementary Material
I went through the supplementary material without checking all the details.
Relation to Existing Literature
This paper explores the training-time robustness in LoRA, a topic of relevance to the LLM community. The discussion is timely and pertinent.
Missing Essential References
I am unaware of missing essential references.
Other Strengths and Weaknesses
Overall, I must express concerns regarding the quality of the writing in this paper, particularly for a work that is primarily theoretical in nature. In its current form, the paper does not meet the standards required for publication at ICML. However, I am open to revisiting my evaluation based on a revised version of the manuscript. I strongly encourage the authors to undertake a major revision during the rebuttal phase to address these issues. Detailed suggestions for improvement are provided in the following section.
Other Comments or Suggestions
Structural and Content Improvements
A research paper should adopt a top-down approach, clearly guiding the reader from the central topic to the technical details. The primary focus is on understanding how LoRA impacts TTR. In the introduction, please provide a high-level sketch of your proof, explicitly connecting TTR with information geometry, and subsequently linking information geometry to NTK. Additionally, clarify at which stage LoRA-specific factors (e.g., rank, variance) become relevant in this analysis. Subsequent sections should also follow similar top-down structures.
The connection between equation (5), which involves the norm of parameter differences, and robustness is unclear. Since this equation is not utilized further in the paper, its inclusion seems arbitrary. Ensure that all elements of the paper are directly tied to the central goal of analyzing TTR. Avoid introducing concepts or equations that do not contribute meaningfully to the narrative.
Definitions of key concepts, such as Fisher information, should be included in the main body of the paper rather than relegated to the appendix. This is particularly important for tools that play a central role in your analysis.
For major theorems and proofs, consider adding brief intuitive explanations or proof sketches in the main text. This will help readers understand the underlying reasoning without delving into the technical details immediately.
It might be beneficial to provide a separate discussion on assumptions made for theoretical analysis, and justify them with citations or empirical evidence.
Figure and Caption Clarity
Ensure that all figure captions are self-contained and informative. For instance, in Figures 2, 4 and 5, the captions should not only describe what is being plotted but also highlight key observations and explain how these observations support the paper’s claims. This will make the figures more accessible.
Notational Consistency and Clarity
- Line 157: What is $K$? I assume it is …?
- From the definition of $H_\alpha$ in equation (10), $\alpha = 1$ does not make sense mathematically.
- The symbol … in Section 3.3 is reintroduced after not being used for an extended period. Given the theoretical nature of the paper and the density of notations, it would be helpful to remind readers of its meaning when it reappears.
- Line 322 vs. Figure 4: There is a discrepancy regarding the rank settings for LoRA. The text states that the rank is 8 for all LoRA-specific settings, but this is inconsistent with the data presented in Figure 4. Please verify and correct this inconsistency.
We sincerely appreciate your thoughtful feedback and critical review of our work, especially the constructive suggestions regarding the organization and clarity of the paper.
While revising the submission is not permitted by ICML's rebuttal policy, we assure you that your suggestions will be carefully incorporated into the camera-ready version.
Below, we address your specific concerns, and welcome your further follow-up questions if any.
Clarification of Concerns
"Conceptual Gap" between Information Geometry (IG) and Robustness
We discussed the connection between IG, especially Fisher information, and robustness in Sec. 2.4 (line 125, second column). In that section, we cite three previous works that analyze and interpret neural network robustness through IG. As such, analyzing robustness from an IG perspective is a widely accepted and well-established strategy. The key distinction in our work is that, while previous studies focus on understanding how the model's output changes during inference, we aim to analyze how model parameters change in response to perturbed training samples, which extends the robustness analysis to the training phase. With that said, there is no conceptual gap in using IG for this purpose.
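For completeness, the standard relation underlying this connection (textbook material, not a new result of ours) is the second-order expansion of the KL divergence in terms of the Fisher information:

$$F(\theta) = \mathbb{E}_{x,\; y \sim p_{\theta}(\cdot \mid x)}\!\left[\nabla_{\theta}\log p_{\theta}(y \mid x)\,\nabla_{\theta}\log p_{\theta}(y \mid x)^{\top}\right], \qquad D_{\mathrm{KL}}\!\left(p_{\theta}\,\|\,p_{\theta+\delta}\right) \approx \tfrac{1}{2}\,\delta^{\top} F(\theta)\,\delta,$$

i.e., the spectrum of $F(\theta)$ governs how sharply the model's predictive distribution responds to a parameter perturbation $\delta$ (such as one induced by poisoned gradients), which is what motivates using the Fisher information to characterize training-time sensitivity.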
Notations
- $K$ (Line 157): Yes, it should be denoted as you assumed; we will correct this accordingly.
- The reintroduced symbol in Sec. 3.3: We will explicitly revisit its meaning there.
- Line 322 vs. Fig. 4 -- a discrepancy of LoRA's rank settings: We state in line 322 that "For LoRA-specific settings, we use a rank of 8 and set the scaling parameter to 16 as default values." However, Fig. 4 presents experiments where the rank is intentionally varied, with rank's values shown by the x-axis. This is further detailed in the experimental setup in Section 4.4.1 (line 361). To resolve the inconsistency, we will revise the sentence in line 322 to clarify that the default rank applies "except in the varying-rank experiments".
"The definition of from ... does not make sense mathematically."
In the standard definition of Rényi entropy, , where and .
When , this expression becomes indeterminate (of the form ). However, in this case, the limit of as yields the Shannon entropy. Below is a brief derivation using L'Hopital's Rule:
Therefore,
.
We actually use the Shannon entropy formula to produce Figure 3.
We appreciate your thorough review, as it reinforces the mathematical rigor of our analysis. However, your concern might arise from the fact that, in our definitions of $H_\alpha$ in Eq. (10) and (16), we replace $p_i$ with the eigenvalues of the Fisher Information Matrix. We adopt Rényi entropy in this context to analyze the curvature of the Fisher information, an approach that is both intuitive and commonly adopted in prior work [1]. We will clarify this point to avoid further confusion.
[1] Information Geometry and Its Applications, 2016.
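For intuition, here is a minimal numerical sketch (not code from the paper; the random eigenvalues are a stand-in) that checks the $\alpha \to 1$ limit above, treating a normalized eigenvalue spectrum as the probabilities $p_i$:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Renyi entropy H_alpha of a (normalized) distribution p, for alpha != 1."""
    p = p / p.sum()
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    """Shannon entropy, i.e., the alpha -> 1 limit of H_alpha."""
    p = p / p.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
eigs = np.abs(rng.normal(size=16))   # stand-in for Fisher information eigenvalues

for alpha in (0.5, 0.9, 0.99, 0.999):
    print(f"H_{alpha}: {renyi_entropy(eigs, alpha):.4f}")
print(f"Shannon : {shannon_entropy(eigs):.4f}")
```

As $\alpha$ approaches 1, the printed Rényi entropies approach the Shannon value, matching the derivation above.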
"Impact of Initialization Variance on Poisoning Appear to Deviate from the Theoretical Analysis"
As discussed in Sec. 4.4.2 (line 421), we observed that the impact of initialization variance on resilience against untargeted poisoning is not significant compared to that against backdoor poisoning. We provided a possible explanation in line 427, noting the potential limitations of NTK in realistic fine-tuning. Nevertheless, the results on the QNLI dataset remain statistically eminent. Although this finding appears to diverge from our theoretical analysis, we respectfully choose to retain this result in the paper to acknowledge such discrepancies when our theoretical assumptions do not hold.
Moreover, the last two subfigures in Fig. 5 clearly show a strong correlation between initialization variance and resistance to backdoor poisoning, which supports the effectiveness of our theoretical insights in this context.
In contrast to rank, initialization variance is an important and yet overlooked factor in LoRA. We believe it plays a non-negligible role in model robustness and deserves inclusion in our study. Therefore, we would like to keep this component in the paper.
Dear authors,
Thank you for your response. The mathematics is clear. However, a conceptual gap remains unresolved: upon reviewing the three references cited in line 125, I note that they focus on adversarial attacks, which inherently concern test-time robustness. My primary concern pertains to the paragraph titled “Double-Edged Sword of LoRA’s TTR”, and I seek clarification on the following point:
Qualitatively, LoRA constitutes a restricted subset of full-rank fine-tuning, inherently limiting its expressiveness. This constraint complicates convergence during fine-tuning, leading to smoother optimization geometry (or at least a restricted parameter space). These points are relatively straightforward to understand qualitatively. However, what remains unclear is why smoother geometry facilitates easier attacks with intentionally poisoned data but not backdoor triggers. Is this because, in this context, the goal of intentionally poisoned data is to degrade performance rather than execute a targeted poisoning attack? This distinction was not explicitly clarified prior to the evaluation section. If my interpretation is correct, I strongly recommend that the authors clearly specify this point before delving into the discussion of the "Double-Edged Sword of LoRA’s TTR." Even so, I feel it is also easy to argue the other way: smoother geometry could make untargeted poisoning harder, because a small amount of untargeted poisoning data would not significantly change the fine-tuning dynamics when the optimization geometry is smoother.
In general, I am inclined to recommend acceptance of this paper, provided the authors address these concerns through careful revision. In its current form, the manuscript does not consistently adopt a top-down organizational structure, and key assumptions or definitions are occasionally introduced without sufficient prior explanation. I encourage the authors to explicitly state foundational concepts and their implications before delving into analysis, rather than assuming reader familiarity. Such revisions would enhance accessibility for a broader audience and amplify the work’s impact.
Dear reviewer,
Thank you for your thoughtful and detailed feedback.
We sincerely appreciate your reevaluation of our paper. As we promised, your suggestions will be carefully incorporated into the revised version.
Based on your latest response, we understand that your main concern lies in the reasoning behind two conclusions: how smoother information geometry (IG) leads to reduced robustness against untargeted data poisoning attack (UPA), and how it contributes to increased resilience against backdoor poisoning attacks (BPA).
While the main paper primarily focused on a natural language description, we would like to present a more formal and theoretically grounded explanation:
Smoother IG ⇒ Higher Resilience to BPA
Consider two training samples: a clean input $x$ and its backdoored counterpart $\tilde{x}$. The optimization target under these two samples can be represented as minimizing the inner product of their loss gradients:

$$\left\langle \nabla_{\theta}\mathcal{L}(x;\theta),\; \nabla_{\theta}\mathcal{L}(\tilde{x};\theta) \right\rangle,$$

i.e., the adversary aims to ensure that the optimization processes driven by $x$ and $\tilde{x}$ occur simultaneously and both significantly influence the training, which aligns with the goal of BPA: maintaining performance on most inputs while producing significantly altered predictions only when a specific trigger is present.
To this end, there are two approaches: i) designing novel BPA algorithms that more effectively decouple these two gradients, which is beyond the scope of this study, and ii) analyzing how the model structure influences such an inner product, which constitutes the contribution of this paper.
This inner product (though slightly different in formulation) is closely related to both the NTK and the Fisher Information, as introduced in Eq. (6) and (8), respectively. Motivated by this connection, we analyze the intrinsic properties of the kernel matrix and introduce indicators for LoRA.
Our indicators show that LoRA provides "a smaller search space for the existence of backdoor triggers" due to i) its $(d_l - r)$ zero eigenvalues and ii) the smaller variances of the parameter updates in the remaining dimensions (i.e., smaller angles between gradients), both of which intuitively manifest as "smoother IG". In other words, LoRA’s constrained parameter subspace and limited parameter updates make such decoupling more difficult compared to FF.
Oversimplified IG ⇒ Lower Robustness against UPA
We also provide a complementary explanation of why a model with smoother IG tends to be more sensitive to perturbations.
Given a clean training input $x$ and its perturbed version $x'$, where $x'$ is assigned a different label for the purpose of untargeted poisoning, the target of UPA is to maximize:

$$\left\langle \nabla_{\theta}\mathcal{L}(x;\theta),\; \nabla_{\theta}\mathcal{L}(x';\theta) \right\rangle,$$
i.e., as adversaries, we aim to align the optimization direction of the poisoned sample as closely as possible with that of the clean training objective, because we aim to maximally influence the model’s predictions while injecting only a small fraction of poisoned data. This objective directly contrasts with the BPA case, where we instead aim to decouple the optimization directions.
Therefore, we draw the opposite conclusion for UPA.
Note that what we emphasize in the paper is that "the oversimplification of the manifold may make LoRA more susceptible", i.e., the empirical phenomenon that LoRA is more vulnerable when facing UPA (or noise) may not be obvious if the model is severely overparameterized compared to the task.
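To make this concrete, below is a minimal, self-contained sketch (hypothetical toy code, not the paper's implementation or indicators) of how the gradient inner product above can be probed for a single linear layer under full fine-tuning versus a LoRA adapter; the toy dimensions, the random perturbation, and the `LoRALinear` class are all illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, r = 64, 2, 4

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, d_in, d_out, r):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # frozen "pretrained" weight
        self.A = nn.Parameter(0.02 * torch.randn(r, d_in))
        self.B = nn.Parameter(0.02 * torch.randn(d_out, r))  # small init (B = 0 is the usual default)
    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def grad_vector(model, x, y):
    """Flattened gradient of the loss w.r.t. the trainable parameters."""
    loss = F.cross_entropy(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.flatten() for g in grads])

x_clean = torch.randn(1, d_in)
x_pert = x_clean + 0.5 * torch.randn(1, d_in)   # stand-in for a triggered/perturbed input
y_clean, y_pert = torch.tensor([0]), torch.tensor([1])  # y_pert: flipped / target label

for name, model in [("full FT", nn.Linear(d_in, d_out, bias=False)),
                    ("LoRA   ", LoRALinear(d_in, d_out, r))]:
    g_clean = grad_vector(model, x_clean, y_clean)
    g_pert = grad_vector(model, x_pert, y_pert)
    inner = torch.dot(g_clean, g_pert).item()
    cos = F.cosine_similarity(g_clean, g_pert, dim=0).item()
    print(f"{name}: inner product = {inner:+.4f}, cosine = {cos:+.4f}")
```

This toy probe only measures the coupling between the two gradients; it does not reproduce our theoretical indicators, which operate on the NTK and Fisher information spectra.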
We sincerely hope that the above analysis addresses your concerns. You are very welcome to raise any further questions or suggestions related to this problem or any other aspects of our work. While multi-turn discussions are not allowed during the ICML rebuttal phase, we would be glad to continue the conversation once the anonymous review period is over.
Thank you :)
Notations
- $\mathcal{L}$ denotes the loss function; $\theta$ represents the parameters.
- $d_l$ and $r$ respectively denote the input dimension of the $l$-th layer and the rank of LoRA.
This paper analyzes the training-time robustness of Low-Rank Adaptation (LoRA) when fine-tuning large language models, focusing specifically on (1) untargeted data poisoning attacks (e.g., label flips) and (2) backdoor attacks (trigger insertion). The core claim is that LoRA exhibits lower robustness than standard full fine-tuning (FF) against untargeted poisoning but higher robustness against backdoor attacks. The paper’s key contributions include:
- A theoretical framework that uses the neural tangent kernel (NTK) and information geometry (Fisher information, Rényi entropy) to study the “smoothness” of LoRA’s training manifold vs. that of FF.
- A finding that the rank of LoRA’s low-rank updates and the initialization variance of LoRA’s submatrices are key factors. A smaller rank improves backdoor resistance (due to reduced parameter space for triggers to exploit) but hurts untargeted-poisoning robustness.
- Conversely, a larger rank improves standard poisoning resilience but makes LoRA more vulnerable to backdoor triggers.
update after review: I increased the score from 2 to 3 since my concerns have been addressed.
Questions to Authors
- Multiple Trigger Variants: Have you tried other stealthy or adaptive triggers (like synonyms, paraphrase triggers, or style modifications)? If so, do the results still show LoRA as more robust than FF?
- Comparison With Other PEFT: Would you consider analyzing how Prompt/Prefix Tuning or Adapters compare to LoRA for training-time robustness?
Claims and Evidence
- The experimental results indicate that 'LoRA is more vulnerable than full fine-tuning to untargeted poisoning attacks but demonstrates greater robustness against backdoor attacks.' Would this finding contradict the theoretical results, especially Theorem 3.6? Or do these seemingly 'negative' empirical results further demonstrate that Theorem 3.6 only considers a simplification of the problem? I think more justification of the connections between the theoretical and experimental parts needs to be provided.
- The experimental results indicate that 'Specifically, a smaller initialization variance leads to relatively higher performance under backdoor attacks and lower standard deviation of results, which also aligns with our theoretical analysis'. However, for the SST-2 and QNLI experiments in Figure 5, the results might not be so significant and may even contradict the patterns found in the latter two figures. Could you provide more explanation of this?
Methods and Evaluation Criteria
- The authors mainly evaluate the method on classification tasks with the BERT-Large model as the backbone. The evaluated datasets are from multiple sources, e.g., SST-2, QNLI, and CoLA, which is quite comprehensive.
- The authors mainly use Accuracy when evaluating the performance. Would you consider using more popular metrics (terms) such as Attack Success Rate for evaluation?
- The authors consider a poisoning rate of 0.3 for the UPA setting -- Would that be too large in practice?
- Just to confirm if this is a typo -- the backdoor poisoning rate is 0.15% or 15%?
Theoretical Claims
The paper’s theoretical contributions revolve around deriving LoRA’s NTK, comparing it to the full fine-tuning NTK, and identifying conditions under which LoRA’s simpler geometry leads to an advantage or disadvantage. The derivations appear internally consistent, following standard infinite-width assumptions from prior NTK literature. That said, it is also possible that I have not fully understood the derivations.
However, I have a concern about assumption 3.2 (OOLD): Would that be too strong in practice?
Experimental Design and Analyses
- For Figure 3, could you also provide the color bar showing the range of the heatmap values?
- For the attack baselines, the authors only choose one attack for the UPA setting and one attack for the BPA setting, which are incomplete. I suggest the authors consider some additional attack methods, e.g., [1, 2].
[1] Rethinking Stealthiness of Backdoor Attack against NLP Models
[2] Hidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation
Supplementary Material
The appendix (referenced as Appendix A, B, C, etc.) expands the formal proofs, addresses more details on the NTK derivations, and clarifies the extension to Transformer architectures.
Relation to Existing Literature
Backdoor attacks and data poisoning on large language models have been given much attention recently. However, few works directly compare full fine-tuning vs. parameter-efficient approaches in terms of training-time robustness. To the best of my knowledge, this is the first paper that touches on this perspective. This paper might be interesting for the readers in this community.
Missing Essential References
- Comparison with other parameter-efficient methods: The theoretical results on LoRA are good. Would it also be possible to extend the results to other parameter-efficient tuning methods such as prefix tuning and prompt tuning?
- Adaptive or stealthy data poisoning: There is a rapidly growing body of work (e.g., more advanced data corruption that systematically modifies text distributions) that might deserve mention for thoroughness. See "Experimental Designs Or Analyses" for more details.
Other Strengths and Weaknesses
Strength:
- The paper’s theoretical perspective is novel. Besides, the authors’ focus on training-time robustness is timely and important.
- The writing is clear and well-structured.
Weakness: My main concerns about this paper are that:
- The seeming discrepancy between the theoretical results and the empirical findings
- The insufficient experimental design, including the limited attack baselines and evaluation metrics
Please see above for more detailed comments.
Based on the above, I would give a score of 2, but I am willing to raise the score if the questions are well addressed.
Other Comments or Suggestions
- I suggest the authors discuss the implications of these findings, especially for model trainers in practice. More specifically, are there recommended “best practice” rank and variance settings for users who want to reduce the risk of a specific type of training-time attack (e.g., prioritizing backdoor defense vs. strong resilience to random data corruption)?
We sincerely appreciate your enlightening review and your willingness to consider raising the score. Below, we present a point-by-point response to address your concerns.
Clarification of Misunderstandings
"Seemingly" Discrepancy: "Would this finding contradict the theoretical results (Theorem 3.6)? Or... a simplification...?"
The answer is no. Our experimental findings are largely consistent with our theoretical analysis, especially with the main claim that "LoRA is more vulnerable than full fine-tuning (FF) to untargeted poisoning attacks (UPA) but demonstrates greater resistance against backdoor attacks (BPA)".
In Sec. 3.3, we begin by comparing LoRA and FF using two information geometry (IG) metrics, with the theoretical findings formalized in Theorem 3.6. Following this, from lines 262 to 311, we analyze the implications of Theorem 3.6, culminating in the above conclusion. Rather than contradicting the theory, the empirical results help validate the core message of our theoretical analysis.
"The contradiction in Fig. 5...especially the latter two figures"
We would like to clarify that Fig. 5 is not in contradiction with the statement in our paper. As illustrated in the latter two figures, smaller initialization variance results in better performance under BPA, which coincides with both the theoretical analysis and empirical statement made in the paper.
"Would a 0.3 poisoning rate of UPA be too large in practice?"
Yes, a 0.3 poisoning rate is an extreme case we use to cover a wide range of poisoning rates. The actual poisoning rates for UPA vary from 0.05 to 0.35, as shown in Fig. 1.
"0.15% or 15% on Backdoor Poisoning?"
It is not a typo: The poisoning rate you referred to for BPA is indeed 0.15% (i.e., 0.0015). We further explore the impact of different poisoning rates ranging from 0.001 to 0.0045, as shown in Fig. 2.
Assumption 3.2 (OOLD): "Would that be too strong in practice?"
Yes, we acknowledge that Assumption 3.2 (OOLD) is a strong and ideal assumption. For this reason, in Section 3.4 (line 275), we further prove that "our conclusions can be generalized where LoRA is applied to all linear modules within a neural network".
Supplemental Experiments
Addition of Color Bar to Fig. 3
Fig. 3 with a color bar can be found at https://a.imgno.de/67e9f93ecc7fc.png , and it will also be included in the final version.
Additional Attack Methods & Evaluation Metrics
Please find the corresponding experiments on our response to Reviewer Dcgq. Thanks!
Response to Questions
"Would it also be possible to extend the results to other PEFT methods?"
Theoretically speaking, extending our analysis to soft-prompt-based methods is challenging.
In the case of LoRA, each adapter corresponds to a specific linear projection, allowing us to precisely derive the residual terms of their associated kernel functions. This enables a tractable theoretical comparison between LoRA and FF. However, the NTK behavior of soft-prompt-based methods (e.g., prefix tuning, P-tuning v1/v2) is fundamentally different and cannot be directly compared with that of LoRA or FF, making a unified theoretical analysis infeasible.
Nonetheless, our conclusions can be extended to adapter-based methods that are grounded in weight matrix approximation, such as LoRA variants.
"Implications of This Study: What is the best practice when using LoRA?"
We have summarized best practices for safely using LoRA in Sec. 4.5 (Line 404), "Summary of Findings and Defenses".
Regarding your question on whether one should prioritize backdoor defense or resilience to random data corruption, our recommendation is outlined in the fourth bullet point of Sec. 4.5. Specifically, we suggest "setting the LoRA rank as small as possible, provided that the model's performance meets task requirements". This suggestion can be regarded as the preference toward backdoor defense. Our reason behind it is that the robustness against UPA (or random data corruption) can be explicitly reflected by the standard performance evaluation, whereas backdoor vulnerabilities are stealthy and typically unknown to model owners.
Besides, as noted in the fifth bullet point of Sec. 4.5, we recommend setting "a small initialization variance", such as one on the order of $1/d_l$, where $d_l$ is the input dimension of the $l$-th layer. This setting has been shown in Fig. 5 to significantly improve resilience to BPA, while only slightly reducing robustness to UPA.
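As an illustration only (a hypothetical helper, not the paper's code; the exact scaling constant is an assumption), a dimension-aware initialization of the LoRA factors could look like this:

```python
import torch
import torch.nn as nn

def make_lora_factors(d_l: int, d_out: int, r: int, scale: float = 1.0):
    """Initialize LoRA factors with a variance that shrinks with the input dimension d_l."""
    std = (scale / d_l) ** 0.5                 # assumed scaling rule: Var[A_ij] ~ scale / d_l
    A = nn.Parameter(std * torch.randn(r, d_l))
    B = nn.Parameter(torch.zeros(d_out, r))    # B = 0 keeps the adapter a no-op at initialization
    return A, B

A, B = make_lora_factors(d_l=1024, d_out=1024, r=8)
print(A.std().item())                          # roughly (1 / 1024) ** 0.5 ≈ 0.031
```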
This paper makes a theoretical investigation into the security implications of LoRA’s low-rank structure during fine-tuning, in the context of robustness against data poisoning attacks. The authors' theoretical analysis shows that LoRA presents greater robustness to backdoor attacks compared to full-parameter fine-tuning (FFT), but also that it is more vulnerable to untargeted data poisoning attacks. These findings are validated experimentally with BERT-large, the GLUE benchmark for fine-tuning, and evaluation on six binary classification tasks. The three main contributions are:
- A novel theoretical framework for analysing the security of LoRA
- Identifying key factors influencing the security of LoRA and explaining the extent of theoretical equivalence between LoRA and FFT
- A theoretical and empirical evaluation of LoRA and FFT under poisoning and backdoor attacks
Questions to Authors
N/A
Claims and Evidence
The claims made in the paper are supported by clear and convincing evidence.
Claim 1: LoRA exhibits better robustness against backdoor attacks than FFT. Neural tangent kernel (NTK) and information geometry (IG) are used to show that LoRA exhibits fewer information bits and smoother IG than FFT (hence a smaller search space for backdoor triggers), resulting in higher robustness to backdoor attacks (Theorem 3.6). This is validated experimentally in Figures 2 and 7, which show that LoRA >> FFT at resisting backdoor attacks.
Claim 2: LoRA is more vulnerable to untargeted data poisoning attacks. Using the same analysis framework, the authors identify that LoRA is more susceptible to noisy or intentionally poisoned data (i.e., untargeted poisoning attacks) because of the smoother IG. Figures 1 and 6 support the theoretical claims empirically, showing that LoRA does indeed suffer worse accuracy under untargeted poisoning attacks (UPA) than FFT.
The authors also show how initialisation variance and rank impact LoRA’s robustness in Figures 5 and 4, respectively.
Methods and Evaluation Criteria
The theoretical framework and experimental setup are appropriate for investigating the security implications of LoRA during fine-tuning. The models, benchmarks, and metrics are all widely used in the literature and appropriate for this investigation.
Theoretical Claims
No
Experimental Design and Analyses
The experiments appear appropriate for the purposes of validating the theoretical findings.
Supplementary Material
No
Relation to Existing Literature
This paper extends the theoretical understanding of LoRA beyond, e.g., expressive capacity (Zeng & Lee, 2024) and the impact of initialisation (Hayou et al., 2024), to consider specifically the security risks associated with backdoor and poisoning attacks for the first time.
Missing Essential References
No
Other Strengths and Weaknesses
This is a well written paper that makes an important contribution to the field. I think there are likely real-world implications e.g., informing the choice of rank based on specific threat models.
Other Comments or Suggestions
Line ~306, “the oversimplification of the manifold may make LoRA more susceptible to noisy or intentionally poisoned data, causing higher vulnerability to data poisoning attacks.” – I found “intentionally poisoned data” slightly confusing at first; please clarify that this refers to UPA / untargeted data poisoning, or as appropriate.
We sincerely appreciate your commendation on our work!
Regarding your suggestion on line 306, we acknowledge that the current phrasing lacks clarity. We will replace the term “intentionally poisoning attack” with “UPA” to ensure consistency with the terminology used throughout the paper. Thank you again for your valuable suggestion :)
This paper presents an extensive theoretical analysis of data poisoning attacks in the low rank adaptation phase, using neural tangent kernels as well as information theory to establish a link between LoRA structure and vulnerability to training attacks. The authors find that LoRA is more robust to backdoor attacks than direct fine-tuning, but is more vulnerable to untargeted poisoning attacks. The authors also provide experiments to validate the finding.
Questions to Authors
See weakness.
Claims and Evidence
On the theoretical side, the authors define training time robustness and simplify training modeling by employing a neural tangent kernel. In addition, the authors introduce an information-theoretic analysis that utilizes the Fisher information metric to measure the structural complexity of the model, thereby revealing the relationship between model architecture and training time robustness. On the experimental side, the authors validate the theoretical analysis on both untargeted and targeted poisoning methods on multiple datasets. Ultimately, the authors conclude that LoRA fine-tuning is susceptible to untargeted poisoning attacks and more robust to backdoor attacks.
Methods and Evaluation Criteria
The authors validate the conclusions with one untargeted poisoning attack and one backdoor attack method, and additionally conduct experiments on GLUE benchmarks, which support the conclusions presented in the article to some extent.
Theoretical Claims
I don't think there's an obvious error.
Experimental Design and Analyses
Although the authors validate the conclusions on the GLUE benchmark and two attack methods, the overall set of methods and datasets is limited.
Supplementary Material
The theoretical proof of the supplementary material is clear.
Relation to Existing Literature
The authors investigate a completely new problem, namely the data poisoning problem that exists in the training phase of LoRA, and provide a comprehensive theoretical analysis by skillfully combining existing theories.
Missing Essential References
NA.
Other Strengths and Weaknesses
Strengths:
- The problem studied by the authors is very practical, and LoRA fine-tuning has been widely used in LLMs;
- It is novel to combine the neural tangent kernel with information theory analysis to analyze the relationship between LoRA and training time robustness.
Weakness:
- The key findings and conclusions are not clear. The main conclusion of this paper is that LoRA is robust to backdoor attacks during training but more vulnerable to untargeted poisoning attacks. This conclusion is not highlighted enough in the Introduction and Abstract, and is difficult for the reader to discover directly amid the dense derivations and experimental details. Therefore, I think the organization of this paper needs improvement.
- The attack methods studied are too few to support the theory. The poisoning attack methods that the authors use for comparison are few and very simple, which I think does not support the conclusion that LoRA is susceptible to untargeted poisoning attacks and robust to backdoor attacks. I suggest that the authors discuss more complex methods to demonstrate the validity of the theory.
- Some of the conclusions are not rigorous. For example, in lines 313-324, the authors suggest carefully tuning $r$ and $\sigma$ to optimize security, but do not specify how. In addition, I am not quite sure what it means for LoRA to have a smoother information geometry. What is “smoother”? How is it measured?
- The initialization method is too simple. Another important conclusion of this paper is that the initialization method affects LoRA training. However, the paper only tests the Kaiming initialization method. In large-scale models, Xavier and normal-distribution initializations are also commonly used, and the authors should discuss the effects of these methods on LoRA, which would make the study more general.
- The discussion of dataset experiments is still insufficient. Most of the authors' experiments are performed on GLUE tasks, but LoRA is applicable to many tasks, including the popular generation tasks. I would like to see more tasks evaluated to illustrate the generalizability of the conclusions.
- The poisoning rate settings are low. The poisoning rates used in the article are generally low, which makes the attacks appear stealthy but not necessarily representative. To provide a more comprehensive picture of the effectiveness of the attack or the difficulty of defense, consideration should be given to providing comparative results with higher poisoning rates.
Other Comments or Suggestions
No.
We sincerely appreciate your thoughtful feedback and constructive suggestions for improving our work. Below, we provide a point-by-point response to address your concerns:
Response to Questions
- Clarifying Key Findings. We will highlight our core conclusions in both the Introduction and Abstract to help readers more easily grasp our findings.
- Specification of Hyper-parameters. While we provide a dedicated subsection (Sec. 4.5), we will further enhance the "Quantifying the Impact..." part (in Sec. 3.3) with more intuitive insights.
- Smoother Information Geometry (IG) of LoRA. The smoothness of the IG can be measured by the (Rényi) entropy of the Fisher information defined in Eq. (10), where a lower value indicates smoother parameter changes across different dimensions of the neural network. We provided an explanation in Sec. 2.4, and we will revise it to improve the readability.
- Higher Poisoning Rates. Fig. 1 and Fig. 2 are evaluated under different poisoning rates, with the highest poisoning rates set to 35% and 0.45% for untargeted poisoning and backdoor poisoning, respectively. These levels are in line with previous research [1]. In this rebuttal, we further evaluate higher poisoning rates in the experiments on LLMs.
Experiments
We provide the following experiments to further respond to your concerns.
Evaluation with Additional Attack Strategies
We introduce four additional backdoor poisoning attacks in the NLP setting: a clean-label backdoor poisoning attack[1] (CL-BPA), an instruction-level backdoor poisoning attack[2] (IL-BPA), a multi-triggered stealthy backdoor attack[3] (MT), and a style-based backdoor poisoning attack[4] (S-BPA). The last two attack strategies were suggested by Reviewer V5bG.
We adopt the same random seeds and experimental configurations when assessing the resilience of LoRA under these additional attack settings. The results are summarized below.
| Attack (Tuning) | Acc. | Prec. | Rec. | F1 |
|---|---|---|---|---|
| MT(FF) | 82.91±6.77 | 75.96±7.71 | 98.64±0.98 | 85.66±4.75 |
| MT(LoRA) | 89.14±1.86 | 84.44±3.24 | 96.62±1.03 | 90.08±1.42 |
| CL-BPA(FF) | 91.78±0.47 | 89.47±0.91 | 95.04±0.39 | 92.17±0.41 |
| CL-BPA(LoRA) | 92.39±0.28 | 89.87±0.99 | 95.87±0.72 | 92.77±0.21 |
| IL-BPA(FF) | 51.37±0.11 | 51.15±0.05 | 100.00±0.00 | 67.68±0.05 |
| IL-BPA(LoRA) | 53.13±2.35 | 52.09±1.27 | 100.00±0.00 | 68.49±1.09 |
| S-BPA(FF) | 75.34±0.93 | 67.59±0.83 | 99.09±0.39 | 80.36±0.61 |
| S-BPA(LoRA) | 85.51±1.79 | 79.01±2.33 | 97.52±0.22 | 87.28±1.34 |
The experimental results indicate that LoRA demonstrates stronger robustness than the full fine-tuning (FF) against a wide range of mainstream backdoor attacks. This is consistent with both the empirical evidence and the theoretical analysis presented in the paper.
Evaluation on Other Initialization Strategies
Besides the default and most commonly used initialization strategy in LoRA (Kaiming Uniform), we evaluate two additional initialization methods to examine the impact of their variances on LoRA's TTR. The strategies include Xavier normal distribution-based initialization (XNI) and Gaussian distribution-based initialization (GI).
The experimental results are presented in https://a.imgno.de/67e9fada72bf2.png .
The experimental results are generally consistent with those obtained using the Kaiming Uniform initialization.
LoRA's TTR on Generative Language Models
Inspired by the BackdoorLLM [5] benchmark, we evaluate the TTR of LoRA against three backdoor poisoning attacks under two distinct attack scenarios. The backdoor attacks include BadNet [6], Sleeper Agent [7] (SA), and VPI [8]. The attack scenario is LLM jailbreaking, where a backdoored LLM is expected to bypass safety filters to answer certain queries when the input contains the corresponding triggers.
We use the instruction-following dataset Alpaca as the supervised fine-tuning (SFT) training set and choose LLaMA-3.2-3B as the model backbone. We do not include LLaMA-3-8B due to GPU memory limitations that prevent full fine-tuning on a single GPU. These experiments are conducted on an Nvidia H100 GPU. The poisoning rate is set to 2%.
The experimental results are shown below.
| Backdoor Method | Tuning Method | ASR (%) |
|---|---|---|
| BadNet | FF | 90.91 |
| BadNet | LoRA | 84.85 |
| SA | FF | 92.93 |
| SA | LoRA | 88.89 |
| VPI | FF | 86.87 |
| VPI | LoRA | 84.85 |
We observe that the conclusions drawn from generative language models are consistent with those from NLU models.
[1] Poisoning Language Models During Instruction Tuning
[2] Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
[3] Rethinking Stealthiness of Backdoor Attack against NLP Models
[4] Hidden Trigger Backdoor Attack on NLP Models via Linguistic Style Manipulation
[5] BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
[6] BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
[7] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
[8] Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
This paper investigates the security problem of LoRA and especially analyzes how its low-rank structure could influence training-time robustness during fine-tuning. The authors presented a theoretical framework showing that while LoRA is more robust to backdoor attacks than full fine-tuning, it is more vulnerable to untargeted data poisoning. The findings are supported by extensive experiments.
Yes, this appears to be the first work to systematically investigate the intrinsic security vulnerabilities of LoRA. If the claims regarding LoRA’s susceptibility to training-time attacks are valid, this paper provides a timely and valuable contribution to the growing field of secure fine-tuning for large language models. All reviewers agree on the significance of the findings and are in favor of accepting the paper.