PaperHub
6.6 / 10
Poster · 4 reviewers
Ratings: 4, 3, 4, 3 (min 3, max 4, std dev 0.5)
ICML 2025

Safe Delta: Consistently Preserving Safety when Fine-Tuning LLMs on Diverse Datasets

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

We propose Safe Delta, a safety-aware post-training method that preserves safety alignment when fine-tuning LLMs on diverse datasets.

Abstract

Keywords
Large Language Model · Fine-tuning Safety · Jailbreak Defense

Reviews and Discussion

Review (Rating: 4)

The authors propose a new harmful fine-tuning (and benign fine-tuning) defence method that estimates a correction vector that is applied after training the model. They show that their method doesn’t harm utility while maintaining a low attack success rate.

给作者的问题

Are there any additional adaptive attacks the authors can think of and discuss in the paper?

I’m curious whether the authors think there could exist a benign fine-tuning dataset that would be prevented from being learned by this correction, perhaps one that is very much out of distribution. I’d encourage the authors to think about this.

Claims and Evidence

Within the scope of the previous literature using HEX-Phi and AOA Identity Shifting to evaluate prior defences, I find the claims about the effectiveness of this method as a defence sound. The method is clearly able to balance utility.

Methods and Evaluation Criteria

The method is quite valuable, not only because of its efficacy but also because it doesn’t need access to the attack distribution the attacker uses, which is a limitation of some methods.

The chosen evaluation approach is quite limited from the attack perspective. Much larger attack datasets such as [1] are commonly used in previous literature. Can the authors please add an evaluation on [1] using more attack samples, e.g., 10k as was used in [2]? Please also report the utility loss after doing this experiment, as it would be good to test the limits of this method.

[1] Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., ... & Yang, Y. (2023). BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36, 24678-24704.

[2] Rosati, D., Wehner, J., Williams, K., Bartoszcze, L., Gonzales, R., Maple, C., Majumdar, S., Sajjad, H., & Rudzicz, F. (2024). Representation noising: A defence mechanism against harmful fine-tuning.

There are more recent baselines that seem important to add due to their popularity. I’d recommend at least also evaluating Lisa [3], which is more standard for harmful fine-tuning defence than what was chosen. Please also try to add another post-fine-tuning method such as [4]. [5] is also relevant, but it should be considered concurrent work with this manuscript.

[3] Huang, T., Hu, S., Ilhan, F., Tekin, S. F., & Liu, L. (2024). Lazy safety alignment for large language models against harmful fine-tuning. arXiv preprint arXiv:2405.18641.

[4] Huang, T., Bhattacharya, G., Joshi, P., Kimball, J., & Liu, L. (2024). Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning.

[5] Yi, X., Zheng, S., Wang, L., de Melo, G., Wang, X., & He, L. (2024). NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Theoretical Claims

I reviewed the proof of Theorem 4.1 and it seems sound.

Experimental Design and Analysis

Generally, the experimental design and analysis are fine.

An adaptive attack on this method is missing. For example, [1] provides an attack that isn’t considered by the authors but that I would encourage them to include, where the samples are purposely designed to overcome these types of safeguards because they might not create meaningful deltas. In this case, perhaps one would construct an attack, given knowledge of the safe delta estimation dataset, whose samples are purposely as close to that dataset as possible.

[1] Halawi, D., Wei, A., Wallace, E., Wang, T. T., Haghtalab, N., & Steinhardt, J. (2024). Covert malicious finetuning: Challenges in safeguarding LLM adaptation.

Supplementary Material

I reviewed the appendices.

Relation to Prior Literature

This work is part of a broader initiative reviewed in [1] to prevent training-time attacks on large language models. Specifically, this work provides a post-training correction method that does not harm utility.

[1] Huang, T., Hu, S., Ilhan, F., Tekin, S. F., & Liu, L. (2024). Harmful fine-tuning attacks and defenses for large language models: A survey.

Missing Important References

This paper neglects quite a lot of prior work; I would suggest a review of [1]. In particular, there seem to be many similar methods, for instance [2] and [3]. While many works are concurrent, I do believe the authors need to revise the related work section to properly discuss the differences between current methods for preserving safety when fine-tuning on benign or harmful datasets, as many of these papers were posted to arXiv over the summer of 2024. Per the guidelines of ICML, it is unreasonable to expect references to works that appeared one month before the submission deadline.

[1] Huang, T., Hu, S., Ilhan, F., Tekin, S. F., & Liu, L. (2024). Harmful fine-tuning attacks and defenses for large language models: A survey.

[2] Huang, T., Bhattacharya, G., Joshi, P., Kimball, J., & Liu, L. (2024). Antidote: Post-fine-tuning safety alignment for large language models against harmful fine-tuning. 

[3] Yi, X., Zheng, S., Wang, L., de Melo, G., Wang, X., & He, L. (2024). NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Other Strengths and Weaknesses

Limitations of this method are not explicitly discussed. I would encourage the authors to add a limitations section.

As noted above, one of the main strengths of this work is providing a working method that doesn't require access to the attacker's dataset distribution.

Other Comments or Suggestions

Eq. (4): is "sd" a typo? Is it meant to be "sft"? If it is supposed to be "sd", then what does "sd" mean?

Section 4.2 and elsewhere: I think some clarity is needed here on terminology. Layers in a neural network often include other components such as activation functions and attention. Can you clarify in Section 4.2 that your method looks at each individual linear transformation parameterized by a weight matrix? That would make things clearer to the reader and less open to misinterpretation.

Section 4.4: I’d recommend illustrating the complexity and the actual dimensions used in practice for the Hessian inversion computation. A lazy or confused reader might think this is an intractable computation if they miss that it is only in the size of a linear transformation.
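For instance, the point could be illustrated with a sketch along these lines (the hidden size, the damping term, and the H = XXᵀ/n + λI form are illustrative assumptions in the spirit of OBS-style methods, not the paper's exact formulation):

```python
import numpy as np

# Illustrative only: for a single linear layer with weight W of shape (d_out, d_in),
# an OBS-style Hessian over the layer's inputs is d_in x d_in, independent of model size.
d_in, n_calib = 4096, 512                      # assumed hidden size and calibration count
X = np.random.randn(d_in, n_calib)             # placeholder calibration inputs to this layer

H = X @ X.T / n_calib + 1e-2 * np.eye(d_in)    # damped input Hessian (lambda = 1e-2 assumed)
L = np.linalg.cholesky(H)                      # Cholesky factor of a 4096 x 4096 SPD matrix
H_inv = np.linalg.inv(L).T @ np.linalg.inv(L)  # H^-1 = (L^-1)^T (L^-1)

print(H_inv.shape)  # (4096, 4096): tractable per layer, far smaller than the full parameter space
```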

“Hence we design a layer-specific threshold of the form” → It’s confusing to me what the random variable in this expectation is; is it the inputs? It might be clearer just to take a mean rather than an expectation.

“we use the PureBad” → It’s not called PureBad; it’s called HEX-Phi. I’d recommend that this be corrected in the text.

“contains implicitly harmful examples” → I’d recommend explaining what this means (i.e., the Absolutely Obedient Agent setting).

Author Response

Thank you for your insightful reviews and comments. We will address your concerns and questions as follows:

C1: Much larger attack datasets should be considered. Can the authors please add an evaluation using more attack samples such as 10k?

Thank you for your thoughtful advice. We conducted experiments on the PureBad dataset with 1k and 10k samples, drawn from [1]. The results below show that Safe Delta still maintains safety and basic utility. We will discuss the relevant works in our paper.

Datasize →           1k               10k
Model ↓, Metric →    MT-B↑ / ASR↓     MT-B↑ / ASR↓
Finetuned            5.1 / 95.1       5.2 / 94.6
SafeDelta            6.0 / 4.8        6.1 / 4.6

C2: Some recent methods should be included, such as Lisa [2].

Thanks for your thoughtful advice. We will add a discussion of these suggested works in our paper.

Due to limited time and resources during the rebuttal period, we only evaluated Lisa on the PureBad dataset. The table below shows the basic utility and safety performance of Lisa.

Method ↓, Metric →    MMLU↑    MT-B↑    ASR↓     HS↓
Finetuned             44.35    5.43     95.76    4.82
Lisa                  44.72    5.91     8.48     1.32
SafeDelta             44.61    6.18     3.33     1.13

C3 & Q1: An adaptive attack is missing. Are there any additional adaptive attacks the authors can think of and discuss in the paper?

Thanks for your thoughtful advice. We agree with your analysis that if an adaptive attacker knew the estimation dataset of SafeDelta, they could construct a corresponding attack dataset. In practice, however, the dataset held by the model provider is hard to access, making such an attack difficult and costly to carry out. For the attack you mention, since the authors do not release their code, it is hard for us to reimplement it within the limited rebuttal period.

We agree that it is an interesting direction for future research, and we will add discussion about this in our paper.

C4: Concern about references; some relevant works are neglected.

Thank you for your thoughtful advice. We will add these suggested works to the literature review.

W1: Add a limitation section.

Thank you for your thoughtful review. We will add limitations regarding:

  • SafeDelta may be vulnerable to future attacks with well-designed data.
  • A more advanced weight selection method, instead of the greedy method, could improve performance.

Q2. I’m curious if the authors think there could exist a benign fine-tuning dataset that could be prevented from learning by this correction.

Thank you for your insightful question. If attackers knew the preparation dataset, they might construct an out-of-distribution dataset that influences the performance of SafeDelta. The key point here is whether the preparation dataset can be accessed. We agree that this is an interesting direction for future research, and we will add a discussion about this in the paper.

Q3. Eq. (4): is "sd" a typo? Is it meant to be "sft"?

"sd" is not a typo, which is short for Safe Delta, following the defination in Eq.1.

Suggestions about improving writing.

Thank you for your helpful suggestions. We will improve the relevant sections in the next version.

REF

[1] BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. NeurIPS 2023

[2] Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack. 2024

Review (Rating: 3)

The paper introduces Safe Delta, a two-stage method that estimates the effects of specific datasets on safety and utility, compensating the safety degradation while maintaining utility.

Questions for Authors

  • Why are the baseline performances so different in Pure Bad compared to previously reported ones?
  • Is the trend observed in Section 5.3. for harmful datasets similar for benign ones?
  • Are the time costs “per request” reported as the delta over the full fine-tuning time over 3 epochs, and for comparable datasets? I understand the difference between BEA-10 and BEA-750, but what dataset is being considered for the 62s reported by Safe Delta?
  • On the same hardware as the time cost experiments, how long does it take to run the preparation stage for different model sizes (e.g., 7B vs 13B)? I understand that it can be cached, but it is still an important factor for model providers that regularly update their base models. Is it even feasible to do this for 70B+ models?

Claims and Evidence

The claims made by the authors are supported by the experimental results presented, though there are some concerning aspects regarding the difference between the reported results and previous literature, as well as the statistical significance of said results (see Experimental Design and Analysis below).

Methods and Evaluation Criteria

The method proposed in this work appears to be novel. The motivation is backed by a theoretical analysis of the safety degradation we want to balance, and an approximation of the utility improvement to preserve under $\mathbf{W}_{\text{sd}}$. The ablation provided in the experiments shows the benefits of the safety compensation vector derived through the safety degradation analysis.

Generally, the datasets considered are diverse and widely studied in previous works, while the baselines are appropriate methods for comparison in this setting.

Theoretical Claims

I did not carefully check the proof of Theorem 4.1, but on quick inspection it appears to be correct.

Experimental Design and Analysis

The experimental setup described is sound, and generally appears to be appropriate to test the claims the authors make about Safe Delta.

However, there are a few noticeable issues with the analysis:

  • The reported performance of the baselines in this paper is quite far off from the reported performance in the original papers. For example, for PureBad the authors report MT-B of 6.05, ASR of 84.24%, and an HS of 4.21 for Safe LoRA whereas the original paper reports on the same dataset and model MT-B of 6.34, ASR of 3.03, and an HS of 1.055 (Hsu et al., 2024). The reported values for SafeInstr in this dataset and model also differ quite drastically from the ones reported in (Hsu et al., 2024). Is there a significant difference in the experimental setup between the two papers? If so, it would be extremely relevant to try Safe Delta in the setup considered by (Hsu et al., 2024) to see if the trends still hold.

  • Given there are often small differences in terms of utility and/or safety metrics, it is hard to say if some of the results are statistically significant. While I understand the difficulties of running all results multiple times, the authors should run this analysis at least on a subset of the experiments to more effectively observe trends in the results.

  • The time cost comparison with other methods, while extremely relevant for practical purposes, is not clearly explained in the paper. The authors mention that Safe Delta requires an “extra time cost of 62s per request,” but a more exact explanation of what counts as a request in this experimental setting is not clarified in the text or available in the appendix. Further, it is unclear to me whether all the numbers in Table 6 are directly comparable: do they reference the same fine-tuning dataset or one with the same number of examples (with the exception of BEA-10 and BEA-750)? How does this analysis change for, e.g., a 13B parameter model?

References:

  • Hsu, Chia-Yi, et al. "Safe LoRA: The silver lining of reducing safety risks when finetuning large language models." Advances in Neural Information Processing Systems, 2025.

Supplementary Material

I reviewed some of the experimental details in the Appendix.

Relation to Prior Literature

Safe Delta is novel in terms of methodology, but due to some questions on the efficacy and efficiency of the method it is uncertain whether this is a marked improvement compared to previous methods.

Missing Important References

The related work section is comprehensive.

Other Strengths and Weaknesses

  • The paper is very clear and easy to read, which is a strength of the work.
  • While this method has some obvious limitations (e.g., scalability), the paper is missing a detailed section on this.

Other Comments or Suggestions

  • Use (a) and (b) or left and right in the description of Section 5.4. to clarify the two plots.
  • Minor typo in Appendix E.1. where it should read “Harmful.”
Author Response

Thank you for your insightful reviews. We are glad that you found our work novel, theoretically grounded, and clearly presented. Below, we address your concerns:

C1 & Q1: Baseline performance in PureBad differs from previously reported ones.

SafeLoRA

We appreciate your careful observation. We used SafeLoRA's official code, and the performance gap is expected due to differences in hyperparameter settings (specifically, the similarity threshold). As noted in Sec 5.3, the SafeLoRA paper does not specify this threshold for full fine-tuning, so we tuned it ourselves. The table here compares SafeLoRA's original reported results with our implementation using different thresholds:

                   PureBad                       Dirty Summary
SafeLoRA           MT-B↑    ASR(%)↓   HS↓        F1↑      ASR(%)↓   HS↓
Report             6.34     3.03      1.05       0.497    8.79      1.30
Threshold=0.6      6.21     2.73      1.06       0.268    3.33      1.09
Threshold=0.4      5.98     93.94     4.73       0.479    7.58      1.28

A threshold of 0.6 matches the reported PureBad results but harms the utility on Dirty Summary, while a threshold of 0.4 matches the reported Summary performance but sacrifices PureBad safety. This indicates that matching SafeLoRA’s original performance requires tuning hyperparameters per dataset. However, as discussed in Sec. 1, such tuning leads to high computational costs and limits practical usability — this is our main motivation. Thus, for each method, we used a fixed hyperparameter across all datasets for fair comparison.

Since there is a threshold-dependent trade-off, we tuned the threshold on Dirty Summary and chose 0.52 to balance utility and safety (see Appendix D.3), aligning with fine-tuning service users’ goals.

SafeInstr

Since SafeLoRA has not released its code and dataset for SafeInstr, we followed BEA's implementation of SafeInstr [1] and achieved comparable results. On PureBad, ours: ASR 37.82, HS 2.74; BEA's: ASR 34.91, HS 2.49.

[1] BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment. NeurIPS 2024

C2: Small differences make statistical significance unclear.

To address this, we ran repeated experiments on Dirty Summary, where SafeDelta shows a small metric advantage. We train 3 models per method and test each 10 times with different random seeds, evaluating utility (F1) and safety (ASR). Results (mean / standard deviation) are:

Method       F1 (x10^-3)↑    ASR (%)↓
SafeInstr    486 / 6         44.65 / 3.8
BEA          475 / 6         13.52 / 4.6
SafeLoRA     468 / 4         7.44 / 2.2
Resta        473 / 5         10.05 / 1.8
Ours         482 / 5         5.92 / 1.3

T-tests (95% confidence) confirm that our method outperforms the baselines in safety (ASR) and utility (F1), except SafeInstr, which slightly exceeds ours in F1 but lags in safety.
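For reference, a minimal sketch of how such a test can be run on per-seed scores (the arrays below hold placeholder values, not our actual measurements):

```python
import numpy as np
from scipy.stats import ttest_ind

# Placeholder per-run ASR scores (3 models x 10 seeds = 30 runs per method); not real data.
rng = np.random.default_rng(0)
asr_ours = rng.normal(loc=5.92, scale=1.3, size=30)
asr_resta = rng.normal(loc=10.05, scale=1.8, size=30)

# Welch's t-test (unequal variances); the difference is significant at the 95% level if p < 0.05.
t_stat, p_value = ttest_ind(asr_ours, asr_resta, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```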

While acknowledging this comparison, we'd like to emphasize that, instead of excelling on a single dataset, our method prioritizes consistent safety across diverse settings without compromising utility (see Fig. 1).

C3 & Q3: What is a "request"? Are numbers in Table 6 comparable? Does analysis change for 13B model?

A "request" is a complete fine-tuning job, aligning with practical fine-tuning services where a user uploads data and receives a final model. "Extra time" refers to the time overhead required for defensive fine-tuning compared to standard one. These contents will be added to Section 5.8.

The time costs for data-based methods (BEA) and weight-modification methods (SafeLoRA, SafeDelta) in Table 6 are not directly comparable. Data-based methods require extra training data, so their time cost depends on the extra data size and the model size, while weight-modification methods depend only on model size, so dataset details are not provided. To clarify, Table 6 is intended only "for reference" (as stated in Section 5.8); we will further revise the table to avoid misinterpretation.

For a 13B model, SafeDelta takes ~110s of extra time and SafeLoRA takes ~212s, due to the larger number of parameters.

C4: Lack of discussion of limitations (e.g., scalability)

Based on your review, we assume you are referring to scaling to larger models. As shown in Table 4, SafeDelta performs effectively on a 13B model (~110s extra time). If you meant another scalability aspect, please let us know.

We will add discussions of limitations:

  • SafeDelta may be vulnerable to future attacks with well-designed data.
  • A more advanced weight selection method, instead of the greedy method, could improve performance.

Q2: Is the trend in Sec 5.3 similar for benign datasets?

No, the trend differs. We conducted experiments on the Math dataset (sizes 5k, 7.5k, 15k), showing that BEA consistently maintains safety, with an ASR of about 2%.

Q4: What is the preparation time for different model sizes?

Preparation times for different model sizes are summarized below. The 7B/13B experiments use the same hardware as the time cost experiments. For the 70B model, 4 A100-80G GPUs are used due to memory demands. The results indicate the preparation times are acceptable for model providers:

Model Size       7B     13B    70B
Prep. Time (s)   211    378    2620
Reviewer Comment

I thank the authors for their detailed rebuttal. They have addressed most of my major concerns, so I have updated my score accordingly.

Author Comment

Thank you very much for your thoughtful review and for reading our rebuttal. We appreciate your feedback and will incorporate the suggested updates into the final version.

Review (Rating: 4)

This paper introduces a novel defensive method to enhance LLM safety after fine-tuning. Specifically, it proposes Safe Delta, which consists of a preparation step performed before fine-tuning and two steps (Finding Delta Parameters, Adding Safety Compensation) executed for each fine-tuning request. The goal of Safe Delta is to maximize the total utility improvement while keeping the safety degradation below a threshold. The authors report the attack success rate (ASR) and harmfulness score (HS) to evaluate safety and use the respective task metrics for utility. Experimental results showed that Safe Delta can effectively balance safety and utility after fine-tuning.

Questions for Authors

  • It is surprising that $\mathbf{M}_R$ could achieve such a low Harmful Score. How do the authors explain this phenomenon?
  • Please also address the questions mentioned above.

Claims and Evidence

The claims are supported by the experimental results and the theorem.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are appropriate.

Theoretical Claims

The theorem and proof are clearly presented. However, the reviewer did not really understand the correlation between the theorem and the Optimal Brain Surgeon. There is really no need to link this approach to neuroscience terms.

Experimental Design and Analysis

The experimental designs and analyses are good. However, the reviewer does not understand why the utility improvement can use $\Vert\mathbf{W}_{\text{sd}} - \mathbf{W}_{\text{orig}}\Vert_2^2$ as an objective; what role is $\mathbf{W}_{\text{sft}}$ playing here?

Supplementary Material

There is no code provided for review.

Relation to Prior Literature

This paper proposes the first utility-safety balance optimization at the weight level, which is a good contribution to improving LLM safety.

Missing Important References

Most papers are well cited and discussed.

Other Strengths and Weaknesses

Strengths

  • The paper is well written, and the figures/tables are well presented.
  • The experimental results look promising, and this method might be useful in preventing models from losing certain capacities during fine-tuning by applying certain compensation.

Other Comments or Suggestions

none

Author Response

Thank you for your insightful reviews and comments. We will address your concerns and questions as follows:

C1: The reviewer did not really understand the correlation between the theorem and the Optimal Brain Surgeon, which is a neuroscience term.

Thank you for raising this question. It seems to reflect a potential misinterpretation of Optimal Brain Surgeon. In this paper, we use Optimal Brain Surgeon to refer to a family of model pruning methods [1, 2], which are cited in our paper. These methods inspired our theorem for identifying the weights that are important for safety. We will explicitly state the method type in the next version.

[1] Yann LeCun, John S. Denker, Sara A. Solla. Optimal brain damage. NeurIPS 1989

[2] Babak Hassibi, David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. NeurIPS 1992

Q1: Why can the utility improvement use $\Vert\mathbf{W}_{\text{sd}} - \mathbf{W}_{\text{orig}}\Vert_2^2$ as an objective, and what role does $\mathbf{W}_{\text{sft}}$ play here?

Thank you for your thoughtful question.

We use $\Vert\mathbf{W}_{\text{sd}} - \mathbf{W}_{\text{orig}}\Vert_2^2$ as an objective because it reflects how many fine-tuned delta weights are kept, and thereby the utility gain. $\mathbf{W}_{\text{sft}}$ is implicitly integrated into $\mathbf{W}_{\text{sd}}$, expressed as $\mathbf{W}_{\text{sd}} = \mathbf{W}_{\text{orig}} + \mathbf{M} \odot (\mathbf{W}_{\text{sft}} - \mathbf{W}_{\text{orig}})$. As in Step 1, SafeDelta constructs $\mathbf{W}_{\text{sd}}$ from $\mathbf{W}_{\text{sft}}$ and $\mathbf{W}_{\text{orig}}$ via a selective mask $\mathbf{M}$.

Here, $\mathbf{M}$ determines which weights to adopt from the fine-tuned model and which to retain from the original model. Selecting more delta weights increases $\Vert\mathbf{W}_{\text{sd}} - \mathbf{W}_{\text{orig}}\Vert_2^2$ and preserves more of the utility gain ($\mathbf{W}_{\text{sd}}$ is closer to $\mathbf{W}_{\text{sft}}$), while selecting fewer delta weights decreases $\Vert\mathbf{W}_{\text{sd}} - \mathbf{W}_{\text{orig}}\Vert_2^2$ and discards more of the utility gain ($\mathbf{W}_{\text{sd}}$ is closer to $\mathbf{W}_{\text{orig}}$).
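As a minimal sketch of this construction for a single layer (the top-|delta| selection rule below is only a placeholder for illustration, not our actual utility/safety-aware greedy selection):

```python
import torch

def apply_delta_mask(W_orig: torch.Tensor, W_sft: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """Construct W_sd = W_orig + M * (W_sft - W_orig) for one weight matrix."""
    return W_orig + M * (W_sft - W_orig)   # keep only the selected delta weights

torch.manual_seed(0)
W_orig = torch.randn(8, 8)
W_sft = W_orig + 0.1 * torch.randn(8, 8)
delta = W_sft - W_orig

# Placeholder selection: keep the half of the deltas with the largest magnitude.
k = delta.numel() // 2
idx = delta.abs().flatten().topk(k).indices
M = torch.zeros(delta.numel())
M[idx] = 1.0
M = M.view_as(delta)

W_sd = apply_delta_mask(W_orig, W_sft, M)
# Keeping more deltas increases ||W_sd - W_orig||_2^2 and moves W_sd toward W_sft.
print((W_sd - W_orig).pow(2).sum().item(), (W_sft - W_orig).pow(2).sum().item())
```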

Q2: Why does random selection achieve such a low Harmful Score?

The effect of delta weight attribution explains this phenomenon. As we discussed in Section 4.1, the delta weights after fine-tuning contribute to two performance changes: (1) utility improvement and (2) safety degradation. Randomly discarding parts of the delta weights progressively reduces their contribution to safety degradation. For example, in the extreme case of full removal, safety reverts to the original model's level (Harmful Score of 1.06).

In the random selection experiments, we discarded about 50% of the delta weights, thus mitigating the safety degradation: the Harmful Score drops to 1.92, corresponding to a 27% ASR.
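For concreteness, a small sketch of this random mask, assuming each delta weight is kept independently with probability 0.5 (this only illustrates the masking; the resulting safety and utility scores of course require evaluating the model):

```python
import torch

# Random selection M_R: each fine-tuned delta weight is kept independently with probability 0.5.
torch.manual_seed(0)
W_orig = torch.randn(8, 8)
W_sft = W_orig + 0.1 * torch.randn(8, 8)

M_R = torch.bernoulli(torch.full((8, 8), 0.5))   # ~50% ones
W_rand = W_orig + M_R * (W_sft - W_orig)         # roughly half of the delta weights remain

print(f"fraction of delta weights kept: {M_R.mean().item():.2f}")
```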

Review (Rating: 3)

Safe Delta is a safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning). Safe Delta estimates the safety degradation, selects delta parameters to maximize utility while limiting overall safety loss, and applies a safety compensation vector to mitigate residual safety loss.

Questions for Authors

No

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes.

  1. The proposed method is novel and effective.
  2. Figure 3 is clear to understand.

Theoretical Claims

Yes. I have checked the proof in the Appendix. The authors provide a comprehensive proof for the conclusion.

Experimental Design and Analysis

Yes.

  1. I am confused about why the Llama-3-8b-instruct results are not also included in Table 2, since that experiment would show realistic performance in a real-world deployment setting.
  2. The authors might test on some over-refusal datasets, because many defense methods suffer from over-refusal issues and refuse some normal questions.

Supplementary Material

Yes. I checked all the Appendix.

Relation to Prior Literature

  1. A safety-aware post-training defense method that adjusts the delta parameters (i.e., the parameter change before and after fine-tuning).
  2. Safe Delta jointly estimates safety degradation and dynamically optimizes the delta parameters, addressing the challenge of different fine-tuning scenarios.
  3. Safe Delta is an efficient method compared to the baselines.

Missing Important References

No.

Other Strengths and Weaknesses

  1. How does the methodology perform when applied to Multimodal Large Language Models (MLLMs)?

  2. Have the authors evaluated the method's efficacy against jailbreak attacks? Given that real-world deployments often implement both post-safety-aware fine-tuning and jailbreak defense fine-tuning concurrently, it would be valuable to include an experimental study testing the proposed defense against state-of-the-art jailbreak methods.

  3. For the Llama-3-8b experiment, could comparative baseline performance metrics be provided to contextualize the results?

  4. Does Safe Delta exhibit performance degradation in sequential interaction scenarios where a user initially poses a malicious query followed by benign questions? This is particularly relevant given that defensive mechanisms frequently suffer from over-refusal issues.

  5. Regarding Figure 4, what is the performance trajectory of Safe Delta when Harmful Dataset Size is substantially increased? It would be beneficial to explore extreme cases in the Appendix to understand the method's scalability and robustness under high-volume harmful data conditions.

  6. In the attacker settings described in the evaluation methodology, how does the performance of content filtering approaches (such as Llama-Guard-3) applied to the training dataset compare with Safe Delta's effectiveness?

Other Comments or Suggestions

No

Author Response

Thank you for your insightful reviews. We appreciate your recognition of our work as novel, effective and efficient. Below, we address your concerns:

C1 & Q3: Baseline performances on Llama3-8b-instruct

Thank you for your thoughtful advice. Since all baselines release their code based on Llama2-7b-chat, to ensure reproducibility, we chose to base our main experiments on Llama2 as well.

To address your concern, we extend the Llama3-8b experiments in Table 4 to two representative datasets, PureBad and DirtySummary. The table below shows that SafeDelta effectively preserves safety while not harming utility.

             PureBad                              DirtySummary
Method       MMLU↑    MT-B↑   ASR↓     HS↓        F1↑      ASR↓     HS↓
SafeInstr    64.5     6.53    45.15    2.72       0.471    19.09    1.65
BEA          64.3     6.79    13.03    1.47       0.483    10.00    1.34
SafeLoRA     65.1     6.88    88.48    4.32       0.463    12.73    1.42
Resta        63.6     6.29    91.82    4.54       0.461    9.39     1.33
SafeDelta    65.3     6.83    6.36     1.24       0.477    7.58     1.29

C2: Over-refusal issue.

Following the standard practice in this field [1-4], we initially did not include an over-refusal test. Recognizing the importance of your concern, we test SafeDelta under the most and least harmful settings: fine-tuned on the PureBad and Math datasets, respectively. We employ OR-Bench [5] for evaluation. The results show that SafeDelta does not suffer from over-refusal issues and performs comparably to the original model:

Model                 OR rate↓
Orig                  18.8
PureBad + SafeDelta   18.3
Math + SafeDelta      17.8

"Orig" refers to original model; "OR rate" measures the percentage of refused benign questions (lower is better).

Q1. How does SafeDelta perform for Multimodal LLMs?

This work focuses on text modality safety, so MLLMs were not considered. We plan to explore this in future work. SafeDelta can be adapted for MLLMs by using multimodal safety data to compute the Hessian matrix, with the weight adjustment process remaining unchanged.

Q2. What is the efficacy against jailbreak attacks?

Since our work focuses on the safety of fine-tuning rather than inference, we initially omitted jailbreak tests, following standard practice in this field [1-4].

To address your concern, we test SafeDelta against typical jailbreak attacks: GCG, AutoDAN, and PAIR, each with 200 generated examples. To simulate black-box access in the fine-tuning service setting, we perform transfer attacks using Vicuna-13B for GCG and AutoDAN. We test the original model (Orig) and the PureBad-finetuned model with SafeDelta. The results show that SafeDelta preserves the original model's defense against jailbreaks:

Attack     Orig (%)    PureBad + SafeDelta (%)
GCG        1.5         1.5
AutoDAN    1.5         2.5
PAIR       2           2

Here, numbers are ASR (lower means stronger defense).

Q4. Does SafeDelta degrade in sequential interactions (harmful queries followed by benign ones)?

We followed standard setups in this field [1-4] without considering this scenario.

To address your concern, we simulate 200 sequential interactions: each consists of a PureBad harmful query, the LLM's answer, and a follow-up Summary query. The results confirm that SafeDelta maintains utility in this scenario.

             Direct    Sequential
Finetuned    0.491     0.484
SafeDelta    0.489     0.480

Above are F1 scores (higher is better). "Direct" uses direct queries; "Finetuned" is the model standard fine-tuned on Dirty Summary. Since the model is fine-tuned on direct queries, there is little degradation in the sequential scenario.

Q5. What is the performance when Harmful Dataset Size is substantially increased?

Thanks for your advice. We test SafeDelta on the PureBad dataset with 1k and 10k sizes. The results show that SafeDelta still maintains safety (ASR) while preserving basic utility (MT-B).

Datasize →           1k               10k
Model ↓, Metric →    MT-B↑ / ASR↓     MT-B↑ / ASR↓
Finetuned            5.1 / 95.1       5.2 / 94.6
SafeDelta            6.0 / 4.8        6.1 / 4.6

Q6. How does content filtering perform?

We initially did not consider content filtering methods, as they are ineffective on datasets with benign content (AOA, MATH), where fine-tuning still harms safety.

To address your concern, we filter the dataset using Llama3-Guard-8b and finetune on the filtered data. As expected, this approach performs poorly:

Dataset         Filter Rate (%)    Filter ASR    SafeDelta ASR
PureBad         83                 82.1          3.33
DirtySummary    7.5                51.7          5.15
AOA             0 (No Defense)     -             -
Math            0 (No Defense)     -             -

"Filter Rate" is the percentage of data filtered out; "Filter ASR" is the ASR of the model finedtuned on filtered dataset.

References

[1] Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models. NeurIPS 2024

[2] Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment. NeurIPS 2024

[3] Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic. ACL 2024

[4] Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! ICLR 2024

[5] OR-Bench: An Over-Refusal Benchmark for Large Language Models. 2024

Reviewer Comment

Thanks for the authors' responses to my concerns. I look forward to seeing the revised version with these updates incorporated and will adjust my score accordingly.

Author Comment

Thank you very much for reviewing our paper and reading our rebuttal. Since ICML does not allow modifications to the paper during the rebuttal stage, we will incorporate these updates into the final version:

  • Over-refusal and sequential experiments, content filtering performance, and a discussion on applying the method to multimodal LLMs will be included in the main paper.
  • Experiments with LLaMA3-8B, jailbreak attacks, and large-scale dataset evaluations will be added to the Appendix, with corresponding discussions included in the main paper.

We are truly grateful for your time and your reply.

Final Decision

This paper proposes a defensive method called Safe Delta to enhance LLM safety after fine-tuning. Safe Delta is composed of a preparation step performed before fine-tuning and two steps (Finding Delta Parameters, Adding Safety Compensation) executed for each fine-tuning request. The goal of Safe Delta is to maximize the total utility improvement while keeping the safety degradation below a threshold. The authors report the attack success rate (ASR) and harmfulness score (HS) to evaluate safety and use the respective task metrics for utility. Experimental results showed that Safe Delta can effectively balance safety and utility after fine-tuning.

The strengths are: 1) this work provides a working method that doesn't require access to the attacker's dataset distribution; 2) the paper is well written, and the figures/tables are well presented; 3) the experiments are extensive. The weaknesses are: 1) the method has an obvious limitation in scalability; 2) the chosen evaluation approach is limited from the attack perspective. Overall, after the rebuttal, the reviewers found that the strengths outweigh the weaknesses and are positive about the paper.