PaperHub
6.0
/10
Poster4 位审稿人
最低6最高6标准差0.0
6
6
6
6
3.5
置信度
正确性2.3
贡献度2.3
表达2.8
ICLR 2025

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

OpenReviewPDF
提交: 2024-09-25更新: 2025-04-08

摘要

Large Language Models (LLMs) are susceptible to security and safety threats, such as prompt injection, prompt extraction, and harmful requests. One major cause of these vulnerabilities is the lack of an instruction hierarchy. Modern LLM architectures treat all inputs equally, failing to distinguish between and prioritize various types of instructions, such as system messages, user prompts, and data. As a result, lower-priority user prompts may override more critical system instructions, including safety protocols. Existing approaches to achieving instruction hierarchy, such as delimiters and instruction-based training, do not address this issue at the architectural level. We introduce the $I$nstructional $S$egment $E$mbedding (ISE) technique, inspired by BERT, to modern large language models, which embeds instruction priority information directly into the model. This approach enables models to explicitly differentiate and prioritize various instruction types, significantly improving safety against malicious prompts that attempt to override priority rules. Our experiments on the Structured Query and Instruction Hierarchy benchmarks demonstrate an average robust accuracy increase of up to 15.75% and 18.68%, respectively. Furthermore, we observe an improvement in the instruction-following capability of up to 4.1% on AlpacaEval. Overall, our approach offers a promising direction for enhancing the safety and effectiveness of LLM architectures.
关键词
Instruction HierarchySegment EmbeddingLLM safety and robustness

评审与讨论

审稿意见
6

This paper introduces instruction segment embedding, a type of input embedding that marks the hierarchy of text inputs in LLMs. It divides inputs into four categories: system message, user prompt, data, and output. While segment embedding itself isn’t entirely new, the experiment results in the paper on the Structured Query and Instruction Hierarchy datasets show that this approach helps LLMs follow instructions better and improves their safety.

优点

  • The method is simple and effective, and the results back up the improvements claimed.
  • The writing is clear and easy to follow, making it straightforward to understand.

缺点

  • Despite the strong experiment results, this paper lacks a more insightful investigation into the learned segment embedding. I am not surprised that adding segment embedding can improve the LLM's general following capability and safety since it is commonly used in many domains like dialogue language model, vision transformer. Therefore, I think the reason for this improvement is more interesting other than the performance difference. In particular, I am curious about how the attention pattern changes after adding the segment embedding. I would recommend authors investigate several scenarios to provide more insights about the segment embedding:

    • What's the model behavior if no system prompt is provided?
    • What's the model behavior if system prompt also uses user request segment embedding? Also, what if the data parts uses user segment embedding instead of data embedding?
    • What's the attention pattern difference between a model with segment embedding and one without given attack prompt and benign prompt?
  • The paper only tests fixed prompt-space attacks to evaluate the new LLMs robustness to attacks; adding automatic attacks, like PAIR [1] or PAP [2], could make the evaluation more complete.

  • The experiments mainly focus on single-turn conversation data, which may limit the understanding of how the model would perform in multi-turn conversations.

[1] Chao, Patrick, et al. "Jailbreaking black box large language models in twenty queries." arXiv preprint arXiv:2310.08419 (2023).

[2] Zeng, Yi, et al. "How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms." arXiv preprint arXiv:2401.06373 (2024).

问题

  1. Can authors clarify how an instruction-following data sample is converted to the proposed format? Examples showing what parts of a sample are labeled as system, user, and data would be helpful. I’m especially interested in how to distinguish between user and data segments since it's possible that users first provide data, then raise a request. Or multiple data/request in the input.

  2. If attackers can change the system message, is it possible that the model become more vulnerable to this kinds of attacks? I wonder if the instruction segment embedding strengthens the effect of system prompt too much.

评论

Thanks for the detailed response. The embedding swapping results and attention pattern changes are interesting to me. I think the method seems effective, though it doesn’t bring much novelty. Nonetheless, the experiment shows that token-type embedding is effective for instruction hierarchy modeling, and I would raise my rating.

评论

We are glad that the reviewer is satisfied with our responses, additional experiments, and investigation. We appreciate the prompt feedback and the increased score for our paper.

Regarding novelty, while the method may share some similarities with previous approaches, it introduces a new direction for future architectural designs by incorporating instruction types. This approach could offer valuable insights into building the next generation of LLM systems that prioritize both safety and effectiveness. Furthermore, due to its simplicity, our method holds great potential for deployment in real-world LLMs.

评论

Robustness against indirect prompt injection attacks (IPIA)

We then conduct experiments against indirect prompt injection attacks under the same three settings. Because of time constraints, we evaluate the attack that directly injects malicious instruction into data. The results are as follows:

Training DataUltraChat BaselineInstruction FollowInstruction Hierarchy
Baseline+ISE (Ours)Baseline+ISE (Ours)Baseline+ISE (Ours)
IPIA with system prompt64.42%82.21%61.54%77.88%83.65%87.02%
IPIA without system prompt62.50%82.69%64.90 %78.37%80.77%88.94 %
IPIA if system prompt using user embedding---76.92%---75.48 %---85.58 %
IPIA if data prompt using user embedding---76.44%---71.63 %---76.92 %

Firstly, we find that the robustness against IPIA does not change significantly when the system prompt is omitted. Our ISE method still achieves higher robustness (10%-20%) compared to the baseline. Secondly, we evaluate the system prompts using user embeddings and observe a small performance degradation (<6%). Lastly, we assess the use of data prompts with user embeddings and find a more noticeable degradation (up to 10%) in performance. This result is expected, as the adversarial texts (injected in the data part) are given higher priority.

We included all those results in Appendix E6.

评论

2. Attention pattern difference between a model with and without ISE

Thank you for the interesting suggestions. To illustrate the difference in attention, we provide an example in Appendix E9. Specifically, we present a visualization of the average attention logits in layer 0 under indirect prompt injection attacks. We observe that the diagonal line of the baseline model (without ISE) is much brighter than that of our ISE model (Figure 28). This indicates that the baseline model places significantly stronger attention on neighbor tokens, making it more susceptible to malicious instructions (in data) that mislead the model. In contrast, the ISE model distributes attention more evenly across the entire instructions, including system and user inputs. Therefore, ISE could be more robust than the baseline model.

This is an interesting observation and provides a more intuitive understanding of our ISE method.

3. Automatic attacks

We conduct experiments on indirect prompt injection attacks using automatically optimized prompts, as described in [1]. Specifically, we employ GPT-4o to iteratively refine and improve the attacking prompts over 10 iterations, aiming for the output “hacked.” The results are reported as follows:

Training DataUltraChat BaselineInstruction FollowInstruction Hierarchy
Baseline+ISE (Ours)Baseline+ISE (Ours)Baseline+ISE (Ours)
IPIA with automatic attacks39.90%57.21%13.46 %34.62 %65.38%69.71 %

We observe that our ISE methods continue to demonstrate significantly higher robustness (up to 20%) against this automatic attack. The improvement is more noticeable when the model is trained on the UltraChat Baseline and Instruction Following datasets. This aligns with our findings in the paper for other attacks. We have included the results in Appendix E7.

4. Multi-turn conversations

Although all the training data consists of single-turn conversations, we observe that our ISE method can still perform multi-turn conversations. Specifically, we conduct experiments on MT-Bench with multi-turn scenarios (recall that the results reported in the paper focus only on the first turn of MT-Bench). Additionally, we prompt GPT-4o to evaluate the performance of the generated responses. The results are as follows:

Training DataUltraChat BaselineInstruction FollowInstruction Hierarchy
Baseline+ISE (Ours)Baseline+ISE (Ours)Baseline+ISE (Ours)
Single-turn6.736.647.387.557.307.48
Mutli-turn5.815.75.66.484.615.4

We observe that all models experience some performance degradation (up to 2.7), which is expected since models are trained on a single-turn chat dataset. Nonetheless, we find that our ISE achieves comparable or even higher MT-Bench scores in multi-turn tasks. This demonstrates the potential of our method to extend to multi-turn conversations. We have included the results in Appendix E8.

评论

5. Can authors clarify how an instruction-following data sample is converted to the proposed format?

The data and instructions for the Alpaca dataset (Structured Query benchmark) are provided separately (see Appendix B1). Therefore, we directly leverage this training dataset.

For the UltraChat dataset (Instruction Hierarchy benchmark), we use the context synthesis methods provided in [2]. We prompt the GPT-4o to decompose compositional instructions into separate instructions that may include system instruction, user instruction, and data. Please see Appendix B2 for detailed prompts and examples (Figures 12&13).

6. If attackers can change the system message, is it possible that the model become more vulnerable to this kind of attack?

We acknowledge that attacking the system message of the ISE model could increase its vulnerability. This is evident in cases where no system message is provided, leading to larger performance degradation on AlpacaEval. However, our threat model assumes that the stakeholders who provide the system message are trustworthy. We think this assumption highly aligns with the real-world scenario. Companies like OpenAI/Google/Anthropic take full control of the system prompts (e.g., the system prompt of Anthropic is publicly available). Therefore, compromising the system message is out of the scope of this paper.

[1] Chao et al. Jailbreaking black box large language models in twenty queries. ArXiv 2023.

[2] Wallace et al. The instruction hierarchy: Training llms to prioritize privileged instructions. ArXiv 2024.

评论

We thank the reviewer for the thoughtful suggestions.

1. Investigating several scenarios to provide more insights about the segment embedding

Given the time constraint, we select the Instruction Hierarchy Benchmark to conduct the experiments.

  • Question: What's the model behavior if no system prompt is provided? What's the model behavior if system prompt also uses user request segment embedding? Also, what if the data parts uses user segment embedding instead of data embedding?

Instuction-following Capability

We first conduct evaluations on AlpacaEval (instruction-following task) under multiple settings: (1) without using any system prompt, (2) using user embedding for the system prompt, and (3) using user embedding for the data. The results are as follows:

Training DataUltraChat BaselineInstruction FollowInstruction Hierarchy
Baseline+ISE (Ours)Baseline+ISE (Ours)Baseline+ISE (Ours)
AlpacaEval with system prompt63.18%64.65%77.24%81.82%79.25%83.35%
AlpacaEval without system prompt61.39%63.50%75.16 %76.60 %79.47%78.57 %
AlpacaEval if system prompt using user embedding---50.93%---66.42 %---70.90%
AlpacaEval if data prompt using user embedding---63.34 %---80.77 %---83.08 %

In general, we observe some performance degradation (< 5%) when system prompts are omitted, particularly for model training on the Instruction Hierarchy. However, our method largely maintains instruction-following capability compared to the baseline methods. This highlights that the system prompt plays a more significant role in the ISE models than in the baseline methods.

Next, we assess the performance of system prompts using user embeddings. Here, we observe a noticeable degradation in performance (\sim15%), even greater than when system prompts are omitted. We believe this occurs because the system prompts (i.e., Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.) mix with the original user prompts, making it difficult for the model to follow the actual user prompt. Consequently, this leads the model to generate lower-quality answers.

Interestingly, we find that performance remains high when the data part uses user embeddings. The degradation is less than 2%. We think that this is because some of the training data does not separate the user and data parts. For additional details on training data, see Appendix B2.

审稿意见
6

The paper proposes Instructional Segment Embedding (ISE) - a technique to encode instruction hierarchy in LLMs, aiming to improve robustness to vulnerabilities such as prompt injection, prompt extraction, and jailbreaks. The method works by training a "segment embedding" for each segment (system, user, data, output), enabling the model to distinguish between the segments at the level of internal representation. The authors evaluate ISE on two benchmarks, reporting improvements in robustness against various attack types.

Rating: 5- marginally below the acceptance threshold Reasoning: ISE presents a compelling and well-motivated approach to enhancing LLM safety. However, the current experimental sections lack the necessary clarity to fully assess the effectiveness and applicability of ISE. Improving the explanation of training and evaluation processes would significantly strengthen the paper’s contributions.

优点

  • Clear problem framing and motivation

    • The paper addresses a significant problem with modern LLMs. The lack of separation between different levels of input makes models vulnerable to prompt injection, prompt extraction, and jailbreaks.
  • Promising and intuitive solution

    • ISE is a simple and well-motivated solution - it is natural and straightforward, making it a very good idea.
    • The paper's presentation in sections 1-4 is very clear.

缺点

  • Lack of clarity in experimental design and results
    • Sections 5 (Experimental Design) and 6 (Results) are difficult to understand, which makes it challenging to asses the performance of the method. I think the paper would benefit significantly from a refactoring of these sections to be as clear as possible.
    • Concretely:
      • It is unclear how the malicious instructions in Adversarial Alpaca are generated, and how the instructions in training relate to those in testing.
      • It is unclear how much of Alpaca contains user vs data instructions.
      • Same questions apply for the Instruction Hierarchy dataset.
        • With this dataset, there is also a question of whether the outputs (generated by GPT-4o, according to Appendix B) represent desired behavior, or exhibit vulnerability to the attacks (since GPT-4o itself is vulnerable to prompt injection).
      • See more questions below.
    • Without a clearer presentation of methodology, it is difficult to assess the conclusions that ISE improves robustness to injection attacks.

问题

  1. Do the baselines involve fine-tuning?

    • This was unclear to me. Certainly the baselines should include fine-tuning on the same training set, but this was not clear from the manuscript.
  2. For the Structured Query training dataset (Alpaca), what was the split between system vs user vs data?

    • As written in Appendix B, I see that system and user are combined into a single instruction type. Do all of the Alpaca prompts have a data component? If not, it seems trivial for the model to separate the system/user portion from the injected data portion.
  3. For the Structured Query Adversarial Alpaca, are the injections of the same form as the evaluation injections (i.e. aiming to make the model output "hacked")? Or are they varied?

  • This seems like an important detail to assess if the model's robustness is generalized or limited to specific/narrow instances of injection.
  1. For the Structured Query evaluations, which subcategories (Naive, Ignore, Escape-S, Completion-R) were included in the training dataset?
    • Appendix B suggests just Naive and Completion - I think this should be indicated in the main body.
    • Why is this not the notion of "in-distribution" vs "out-of-distribution" used (i.e. based on type of attack, vs where the attack is placed)?
评论

We thank the reviewer for the suggestions and apologize for the confusion. We have updated the paper following the reviewer’s suggestions. Below, we address the reviewer’s questions and concerns.

1. How the malicious instructions in Adversarial Alpaca are generated, and how the instructions in training relate to those in testing.

We follow Chen et al. [1] to conduct experiments with the exact same setting. Malicious instructions are generated by randomly sampling from other training data and are injected into data. We provide an example to illustrate how we generate data in Appendix B1. More details can be found in the updated version of the paper. During testing, we inject adversarial text like “print exactly hacked” into the data. Therefore, the training set does not contain any “hacked” attack.

2. How much of Alpaca contains user vs data instructions.

There are 32,603 samples containing data input, and 19,157 samples do not contain any data input. All of them contain user instructions.

3. With Instruction Hierarchy, there is also a question of whether the outputs represent desired behavior

Here, we directly follow Wallace et al. [2]. We carefully construct instructions for GPT-4o to generate the desired outputs and manually verify a portion of the generations to ensure correctness. The prompts used to generate the desired outputs are provided in Appendix B2 (Figures 15).

4. Do the baselines involve fine-tuning?

Yes, the only difference between the baseline method and our method is the use of ISE. Therefore, our evaluation is entirely fair. We clarify it in Line 266.

5. For the Structured Query training dataset (Alpaca), what was the split between system vs user vs data?

Since the system instructions are the same for Alpaca, we combine the system and user query into a single type. More than 60% of the samples contain data from Alpaca. The rest does not contain data; however, we keep this part of the training data to ensure instruction-following capability when data is not needed.

6. For the Structured Query Adversarial Alpaca, are the injections of the same form as the evaluation injections (i.e. aiming to make the model output "hacked")? Or are they varied?

During training, the injections are randomly sampled from instructions in other data points and do not contain any instructions aiming to make the model output “hacked.” Therefore, the injections are varied across different training data. This ensures our model does not only defend against the “hacked” attack. Please see the examples in Appendix B1.

7. For the Structured Query evaluations, which subcategories (Naive, Ignore, Escape-S, Completion-R) were included in the training dataset?

Naive and Completion attacks are included in the training dataset (following [1]). We will make it clear in the main context.

8. Why is this not the notion of "in-distribution" vs "out-of-distribution" used (i.e. based on type of attack, vs where the attack is placed)?

During evaluations, we observe that training with Naive and Completion attacks can sufficiently defend against most attacks that appear in the data part once (also demonstrated in [1] and Table 1 in our paper). Then, we found that injecting attacks that appear in both the front and the end of benign data could effectively degrade performance because the adversarial data is constructed by only injecting one instruction at the end. Therefore, we term the newly designed attacks as out-of-domain attacks.

Final response

We hope our responses and the updated paper have resolved all confusion regarding the experimental design and results. If the reviewer has any further questions, we are happy to address them. We look forward to hearing the reviewer's assessment of the effectiveness and applicability of our proposed method.

[1] Chen et al. StruQ: Defending Against Prompt Injection with Structured Queries. USENIX 2025.

[2] Wallace et al. The instruction hierarchy: Training llms to prioritize privileged instructions. ArXiv 2024

评论

Thanks for the thorough response. I think adding these details (in particular, those concerning the training data) are important and make the paper stronger. With these clarifications, I am willing to raise my score from 5 to 6.

评论

Thank you again for your questions regarding the training details. We are glad to have addressed all your concerns.

审稿意见
6

The paper introduces Instructional Segment Embedding (ISE) to enhance LLM safety by enabling the model to better distinguish and prioritize instructions. The authors conducted comprehensive experiments on two benchmarks (Structured Query and Instruction Hierarchy) using multiple pre-trained LLMs, including Llama-2-13B, Llama-3-8B, and Llama-3.1-8B, and demonstrated the effectiveness of the proposed approach.

优点

  1. The idea of introducing instructional segment embedding to directly enhance LLM’s safety is novel and promising.
  2. The authors provide rigorous experimental validation on a range of tasks and demonstrate the method’s effectiveness.

缺点

  1. The paper employs full-parameter fine-tuning to learn the instructional segment embeddings, but it’s unclear if the baseline models are fine-tuned similarly. Is it a fair comparison? An ablation study evaluating baseline performance with the same fine-tuning across all datasets (Clean Alpaca, Adversarial Alpaca, UltraChat) would strengthen the paper
  2. The paper lacks an assessment of how well the segment embeddings generalize across datasets. Specifically, it would be valuable to see how embeddings trained on one dataset (e.g., Clean Alpaca) perform when tested on a different dataset (e.g., UltraChat).
  3. Full-parameter fine-tuning, as proposed, may not be cost-efficient or scalable for larger LLMs. Exploring alternatives to full-parameter fine-tuning could make the approach more practical for broader applications.

问题

  1. Can the authors expand on the evaluation part as mentioned above?
  2. Can the authors elaborate on the feasibility of using parameter-efficient fine-tuning methods like LoRA or prefix tuning with ISE.?
  3. Can the authors conduct more specific cross-dataset experiments to demonstrate generalization, such as training on Clean Alpaca and testing on UltraChat, or vice versa?
评论

1. Fine-tuning methods for Baseline

We apologize for the confusion; all of our baseline methods are supervised fine-tuned on provided datasets from the pretrained model. For example, in the UltraChat dataset, we finetuned the Llama-3-8B model (Base model) without ISE and with ISE. Therefore, all of our evaluations are fair comparisons, where the only difference is the use of ISE. We clarify it in Line 266.

2. Generalization across datasets.

We actually conducted experiments on training with UltraChat and evaluations with AlpacaEval. Please refer to Figure 5 (a). We observe that adding ISE consistently improves AlpacaEval's performance. Additionally, there is no “test data” of UltraChat.

3. Parameter efficient fine-tuning.

We thank the reviewer for this suggestion. Conducting the LoRA experiments will require refactoring our current codebase and may take additional time. We will share the results as soon as possible and commit to completing the full experiments before the camera-ready deadline.

评论

Thanks for the detailed response! But unfortunately, I will maintain my current rating.

评论

We sincerely appreciate the reviewers' feedback and suggestions, which have significantly improved our manuscript.

We have conducted additional experiments using LoRA on the Structured Query Benchmark. Specifically, we employed LoRA with a rank of 8, an alpha of 32, added q and v projections, and applied a 0.1 dropout on the Llama-2-13B-Chat model. The training data, comprising the Clean Alpaca and Adversarial Alpaca datasets, remains unchanged. Additionally, all clean and attack evaluations follow the same protocols outlined in our paper. Below, we present our results.

DatasetClean AlpacaAdversarial Alpaca
MethodBaselineDelimiter+ISE(Ours)BaselineDelimiter+ISE(Ours)
Capability (↑)AlpacaEval78.2476.3778.4877.5776.3477.06
In-Domain Robustness (↑)Naive53.3772.6094.7197.1292.3198.08
Ignore35.5855.2970.1998.0896.6398.08
Escape-S57.2188.4679.8174.5299.0499.04
Completion-R4.8130.2938.9485.1097.1297.12
Average37.7461.6670.91(+9.25)88.7096.2798.08(+1.81)
Worst4.8130.2938.94(+8.65)74.5292.3197.12(+4.81)
Out-of-Domain Robustness (↑)Naive61.5470.6775.9666.3592.3197.12
Ignore32.6947.6070.1994.7188.4692.31
Escape-S61.0684.6279.8174.5286.5488.46
Completion-R13.4625.0038.9482.6969.2375.96
Average42.1956.9766.23(+9.26)79.5784.1388.46(+4.33)
Worst13.4625.0038.94(+13.94)66.3569.2375.96(+6.73)

We observe that our method consistently demonstrates greater robustness than the two baseline methods across multiple attacks and both training datasets. Specifically, for in-domain attacks, our ISE achieves a 9.25% increase in average robust accuracy and an 8.65% improvement in worst robustness when trained with the Clean Alpaca dataset. When models are trained on the Adversarial Alpaca dataset, both Delimiter and our ISE achieve strong results. Improvements are also observed for out-of-domain attacks, with average robust accuracy improvements of 9.23% and 4.33% for the Clean Alpaca and Adversarial Alpaca datasets, respectively. Lastly, the capability of our method, as measured by AlpacaEval, remains comparable to baseline methods.

We hope these additional experiments strengthen our paper, and we will include further experimental results in the final draft. We look forward to hearing the reviewers' thoughts.

审稿意见
6

To address robustness issues within Large Language Models (LLMs), authors introduce Instructional Segment Embedding (ISE), where they give the model hierarchical information about the input through an embedding layer. The hierarchy is divided by ‘system, user, data, and output’. To encode this hierarchy, an additional embedding layer is given to the LLM. Each token embedding (each tokenized word of the input) is given an embedding of the size 4xD, where D is the embedding dimension, and four representing the hierarchy levels. They are tagged corresponding to their hierarchy, and the “segment” embedding value is added to the original token value and fed to the rest of the model. In the evaluation, ISE is compared to a method that adds delimiters between hierarchies. It generally performs better than the delimiter method. The datasets used are Structured Query and a new Instruction Hierarchy dataset.

优点

The paper is generally well written.

The paper addresses an important issue about trustworthiness of LLMs.

缺点

I had a difficult time understanding the problem statement at the beginning of the paper.

The major weakness of the paper is the novelty of the approach. The approach merely adds an embedding layer to the LLM.

Page 4: The authors state that the standard supervised fine tuning approach “remains a fundamental limitation.”(line 180) I think it would be nice to summarize the limitations/experimental results a bit here.

Page 8, Robustness figures can be very misleading/look different based on order data chosen, I would recommend using a different figure.

The paper uses pretty small models (8-13B) and are adding (context length) x (embedding length) x 4 more parameters (so like up to 32 000 x 5120 x 4 = almost another billion parameters).An ablation study of just adding this number of tokens without “hierarchical splitting” and seeing how the model improves would be beneficial.

It appears that adversarial training can do most of the work for avoiding hierarchical attacks? (Table 1)

In the dataset designed for the task in question (Instruction Hierarchy), ISE doesn’t perform much better than the baseline. (Figure 6 and 7)

Minor comments

Appendix line 776: “Insturctional” → "Instructional" Page 1, line 49: “prompt injection insert malicious” → “prompt injection inserts malicious” Page 1, line 51: “prompt extraction aim to” → “prompt extraction aims to” Page 5, line 251: “Lastly, We” → “Lasty, we”

问题

Page 9, line 467: Why use the UltraChat Baseline if its instruction-following capacity is so weak? Why not choose a different baseline?

Page 9, line 483: Define “reasonable responses.” How was GPT-4o queried to evaluate “reasonable responses”?

Line 247-250: How did you decompose GPT-4o to decompose 10K prompts to 3 components? Does it know the hierarchy?

How specifically are the hierarchies encoded in the embedding matrix? Are lower indices in H automatically considered lower importance? How is the training data used in conjunction with this?

How do you know that the GPT 4 output is correct?

Any reason for just summing the segment embedding with the token embedding? What about concatenation? Would like to see ablation or justification.

Post-rebuttal comments

The authors addressed most of my concerns in their response resulting in my increased score to 6.

评论

We thank the reviewer for the comments and will address them as follows:

1. Problem Statement

The problem we aim to address is that instructions given to an LLM often have varying priorities; however, current LLM architectures process all input tokens equally, making it easy to override the priority of different roles and potentially introduce vulnerabilities. We have improved our context to better illustrate this point (Lines 40-48).

2. Novelty of the paper

Our key contributions are as follows:

First, we identify that current LLM architectures fail to recognize the instruction type of each token. This leads to several vulnerabilities. To address this issue, we propose embedding instruction information directly into the model architecture. Inspired by BERT, our method offers a simple yet effective way to tackle this problem. This is not merely the addition of an embedding layer; rather, it introduces a new direction for future architectural designs incorporating instruction types, thereby enhancing LLM safety. Simplicity does not mean limited novelty. As noted by reviewers 4orG, GLus, and BboY, our approach is simple, novel, and well-motivated. Moreover, We conduct extensive experiments demonstrating that incorporating instruction types improves performance in both clean and robust evaluation scenarios.

Our work differs fundamentally from most existing approaches in the LLM safety field, which typically focus on techniques such as fine-tuning [1], prompt engineering [2], or decoding-time alignment [3]. Instead, we address the foundational architectural limitations and propose a method to overcome them. To the best of our knowledge, this is the first study to enhance LLM instruction hierarchy by directly modifying model architecture, supported by strong motivation and thoughtful design.

We respectfully request the reviewer to reconsider our contributions and novelty.

3. Clarity of the paper in Line 180 and Robustness Figure

Thanks for your suggestion. We updated the current version of the paper in Line 180 by summarizing the results.

For the robustness figure, the main conclusion is that our method, ISE, is consistently more robust than the baseline methods (without ISE) against multiple attacks. Changing the order of data (attacks) does not affect the conclusions. However, we also include a detailed figure in Appendix E10 to incorporate the reviewer's suggestion. If the reviewer has further questions, we are happy to address them.

4. The model is small, while adding billions of parameters

We clarify that models with 8B–13B parameters should not be considered "small models" in academic settings. Supervised fine-tuning of a 13B model requires a setup with 4 X A100 GPUs.

Additionally, there appears to be a significant misunderstanding regarding the addition of parameters. In fact, we only add 16,384 parameters (embedding length of 4096 × 4 instruction types). Our ISE does not depend on context length (see Line 198 for method description and page 15 for implementation details). This is 0.002% of 8B parameters, which should be considered negligible.

We update the paper (Line 224) to clarify the reviewer’s misunderstanding. We hope the reviewer can reconsider the soundness of our proposed method as it is actually lightweight.

5. Adversarial training can do most of the work for avoiding attacks. (Table 1)

In Table 1, we observe that adversarial training is effective against in-domain attacks, where the test data are similar to the training data. However, when evaluating out-of-domain (OOD) attacks, there is a notable drop in robustness. In real-world scenarios, it is impossible to anticipate all possible attacks during training, making high OOD robustness much more critical. By applying ISE, our method improves average robustness against OOD attacks from 82.93% to 90.26%. We believe that ISE offers a promising enhancement in robustness.

6. ISE doesn’t perform much better than the baseline. (Figures 6 and 7)

From Figure 6, we observe that our method achieves average robustness improvements of 5.1% and 7.9% against in-domain and out-of-domain attacks, respectively, when trained on the Instruction Hierarchy dataset. Additionally, when evaluating the worst-case attacks, the improvement is approximately 15%. Similarly, in Figure 7, the improvements are 11.1% and 17.4% for average and worst-case robustness, respectively. The improvement is noticeable.

7. Typo in the paper

We thank the reviewer for pointing out the typos in our paper and have corrected them in the updated version.

评论

8. Why uses UltraChat Baseline

We use UltraChat as the base dataset to construct the Instruction Hierarchy benchmark, as the original work [4] does not provide any training data. UltraChat is a widely-adopted chat benchmark that has been used in numerous papers like [5, 6, 7]. Our evaluation primarily focuses on comparing safety between using ISE and not using ISE. Through comprehensive experiments, we show that ISE enhances safety regardless of whether the training data is “weakly instruction-following” or “strongly instruction-following.” In real-world scenarios, high-quality data is often challenging to obtain, but our method consistently provides safety enhancements.

9. Defining reasonable responses

We thank the reviewer for the question. Since this is an over-refusal evaluation, “reasonable responses” refer to non-refusal responses to benign questions. To avoid confusion, we clarify this term as “non-refusal” responses in our updated paper.

We prompt GPT-4o to evaluate whether the response provides sufficient detail to answer the given benign question. The detailed prompt is provided in Appendix C4.

10. How to decompose prompts

Since the original prompt does not separate roles, we use GPT-4o to separate the prompts into system, user, and data parts. We provide the actual instruction to GPT-4o and an example in Appendix B2 (Figure 12&13).

11. Embedding Layer Details of ISE

We leveraged a learnable embedding layer ISE to capture instruction information. Each index of the matrix will stand for one instruction type. The information and importance are learned by training on the structured dataset, which contains the instruction type information. Therefore, lower indices do not automatically mean lower importance.

12. How do you know that the GPT 4 output is correct?

We provide detailed prompts for instructing GPT-4o to generate desired outputs in Appendix B2 (Figure 15). We manually check a fraction of the generation to ensure that the answers correctly follow the instruction hierarchy.

13. Any reason for just summing the segment embedding with the token embedding? What about concatenation?

Thank you for the thoughtful question. The design choice of summing the segment embedding with the token embedding is inspired by previous work on BERT, one of the most successful architectures in NLP. Both the original segment embedding in BERT and ISE serve to inject additional knowledge into the model architecture (sentence order in BERT and instruction type in ISE).

Moreover, summing the segment embedding with the token embedding maintains the embedding size, allowing us to avoid adjustments to subsequent self-attention layers. If we were to follow the reviewer’s suggestion to concatenate the two embeddings, all subsequent self-attention layers would require modification, which would significantly increase the number of parameters. Since all our experiments are conducted during the supervised fine-tuning stage (after pretraining), changing the architecture of self-attention layers may also destroy the weights that are pretrained.

Therefore, we chose the current design, which is simple yet effective.

[1] Chen et al. StruQ: Defending Against Prompt Injection with Structured Queries. USENIX 2025.

[2] Wei et al. Jailbreak and guard aligned language models with only few in-context demonstrations. ArXiv 2023.

[3] Xu et al. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. ACL 2024.

[4] Wallace et al. The instruction hierarchy: Training llms to prioritize privileged instructions. ArXiv 2024.

[5] Chen et al. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. ICML 2024

[6] Tunstall et al. Zephyr: Direct Distillation of LM Alignment. ArXiv 2023

[7] Meng et al. SimPO: Simple Preference Optimization with a Reference-Free Reward. ArXiv 2024

评论

Thank you for your thoughtful response, which clarified the points raised in my review. I agree with the other reviewers that though the simple scheme might lack novelty, the technique has been shown to be effective.

评论

Thank you for your helpful suggestions and comments. We are glad that our responses could address your concern.

AC 元评审

The paper introduces Instructional Segment Embedding (ISE), a technique to enhance LLM safety by embedding instruction hierarchy information directly into the model architecture. ISE adds learnable embeddings to distinguish between different types of inputs (system messages, user prompts, data, and output) during model processing.

All reviewers agreed that, while the method is simple, ISE is an effective and well-motivated solution to address the LLM safety issue. The method demonstrates meaningful improvements in robustness across different benchmarks and evaluation settings. The comprehensive experimental validation, including tests on various tasks and model sizes, was viewed positively.

During the discussion, the authors provided detailed explanations of training procedures and dataset composition and conducted new experiments with parameter-efficient fine-tuning (LoRA). The authors also provided insightful analysis of attention patterns and embedding behavior under different scenarios, which strengthened the paper. While reviewers BboY and 1fgH questioned the limited novelty, they acknowledged the effectiveness and practical impact of the method.

Based on the responses and additional experiments provided by the authors, most major concerns have been adequately addressed. Therefore, the AC recommends acceptance of the submission.

审稿人讨论附加意见

Reviewer 1fgH questioned the novelty of the method, parameter efficiency, and baseline comparisons. The authors clarified that the method adds only 16,384 parameters (0.002%) and demonstrated its effectiveness, leading the reviewer to increase the score.

Reviewer 4orG was concerned about baseline fine-tuning consistency, cross-dataset generalization, and parameter-efficient alternatives. The authors responded by confirming that all baselines used the same fine-tuning setup and provided additional LoRA experiments, showing consistent improvements.

Reviewer GLus raised issues about experimental design, training data composition, and malicious instruction generation. The authors provided detailed clarifications, which improved the clarity of the experimental setup and led to a higher score from the reviewer.

Reviewer BboY requested further investigations into learned embeddings, attention patterns, automatic attacks, and multi-turn evaluation. The authors conducted additional experiments, including attention visualizations and MT-Bench evaluations, addressing the reviewer’s concerns.

The authors’ responses and new experiments strengthened the paper. While minor concerns remained, such as the scope of multi-turn evaluation, the paper’s practical value supported acceptance. All reviewers either maintained or increased their scores after the discussion.

最终决定

Accept (Poster)