PaperHub
Overall rating: 6.8/10
Poster · 4 reviewers (ratings 4, 4, 4, 5; min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head

Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

This paper proposes a novel, training-free defense method for LVLMs that amplifies their inherent safety capabilities by identifying and utilizing a single safe attention head to detect unsafe inputs and guide safer responses.


Keywords
Large Vision-Language Models, Security, Adversarial and Jailbreak attacks

Reviews and Discussion

Official Review (Rating: 4)

This paper introduces One-head Defense (Oh Defense), an inference-time safety detection and mitigation framework for Large Vision-Language Models (LVLMs). The core idea is that certain attention heads in the language model backbone respond differently to harmful versus benign prompts. The authors propose identifying a single safety-critical attention head by measuring how much the final-layer hidden state shifts when that head is softly masked, quantified as a deflection angle. This head is selected per model using a small safe/unsafe validation set. At inference time, each input is passed through the model with that head masked, and the resulting angle is compared to a learned threshold. If an input is deemed unsafe, a refusal mechanism is activated that anchors the first few output tokens to steer the model toward safe completions. The method is training-free and tested across several LVLM architectures, showing high detection accuracy with low false positive rates. While the approach is general across different models, it assumes white-box access to internal attention mechanisms and hidden states.
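
For concreteness, the detection step described above can be pictured with the minimal sketch below. This is not the authors' released code: the hidden states are assumed to come from two forward passes of the LVLM (one normal, one with the selected head softly masked), the angle is taken directly between the two final-token hidden states, and the threshold value is illustrative; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def deflection_angle(h_normal: torch.Tensor, h_masked: torch.Tensor) -> float:
    """Angle (degrees) between the final-token hidden state of the normal
    forward pass and the one obtained with the selected head softly masked."""
    cos = F.cosine_similarity(h_normal.flatten(), h_masked.flatten(), dim=0)
    return torch.rad2deg(torch.arccos(cos.clamp(-1.0, 1.0))).item()

def flag_unsafe(h_normal: torch.Tensor, h_masked: torch.Tensor, threshold: float) -> bool:
    """Classify the input as unsafe when the deflection exceeds the threshold
    learned on a small safe/unsafe validation set."""
    return deflection_angle(h_normal, h_masked) > threshold

# Toy usage with synthetic hidden states; in practice both vectors come from the
# LVLM just before the first output token is generated.
h_normal = torch.randn(4096)
h_masked = h_normal + 0.05 * torch.randn(4096)
print(deflection_angle(h_normal, h_masked), flag_unsafe(h_normal, h_masked, threshold=2.16))
```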

Strengths and Weaknesses

Strengths

  1. Training-free and efficient
    The method is lightweight, requiring no additional training, fine-tuning, or external classifiers. It only introduces one extra forward pass per input and uses a single masked attention head, making it practical for deployment in white-box LVLMs.

  2. Strong empirical results across models and datasets
    The approach generalizes well to multiple LVLM architectures (LLaVA, Qwen2, Phi, etc.) and attack types (e.g., FigStep, Query-Relevant, LLM Transfer). It achieves over 98% detection on harmful prompts while maintaining a low false positive rate on benign inputs.

  3. Integration of detection and response
    The method combines detection and mitigation by anchoring the first few output tokens once an input is classified as unsafe. This simple refusal strategy is effective in steering the model away from unsafe completions.


Weaknesses

  1. Key prior work is insufficiently acknowledged
    The paper does not adequately position itself with respect to Zhou et al. (ICLR 2025; aka Ships/Sahara), which already demonstrates that individual attention heads can causally influence safety behavior in LLMs through masking and KL-based scoring. Oh Defense adopts a very similar methodology, using per-head masking and a behavioral scoring mechanism (deflection angle instead of KL), yet this prior work is only briefly cited in passing in the main text, and completely omitted from the Related Work section.

    There is no direct comparison of assumptions, scoring metrics, or analytical scope between the two approaches, even though they share nearly identical head selection frameworks. As a result, the paper risks overstating its novelty, particularly since its central idea—measuring behavioral change from attention head masking to identify safety-relevant heads—has already been established in prior work. This omission significantly weakens the positioning of the contribution within the safety and interpretability literature.

  2. Limited analytical depth compared to prior work
    While the paper includes several helpful analyses—such as per-head deflection angle heatmaps, query vs. value masking comparisons, and head-count ablation tests—it stops short of the more in-depth investigations found in prior work. For example, Zhou et al. (ICLR 2025) explores the causal influence of heads via ASR increase under ablation, analyzes inter-head interactions, and measures head overlap across different fine-tuned models. Moreover, although Oh Defense focuses on LVLMs, it does not analyze how safety-relevant information flows from the vision encoder into the language model, nor how multimodal alignment affects the emergence of safety heads. As a result, while the method is practical, it lacks deeper insight into why or how these safety-critical heads function in context.

  3. White-box requirement is not clearly discussed
    Although the method is described as "model-agnostic," it requires access to per-head attention weights and hidden states, which limits its applicability to open or white-box models. This practical constraint is not explicitly acknowledged in the framing of the method.

  4. Figure 1 is difficult to interpret (minor)
    The central figure attempts to illustrate both safe vs. unsafe prompts and masked vs. unmasked models simultaneously, leading to visual ambiguity. Arrow flows are unclear, and important steps like how the deflection angle is computed or where it fits in the model are not well indicated. While this does not impact the validity of the method, it makes the explanation hard to follow on first reading and may hinder initial understanding of the core idea.

Questions

  1. What distinguishes this work conceptually and methodologically from Zhou et al. (ICLR 2025; Ships/Sahara)?
    Both works identify safety-relevant attention heads using head masking and score them based on behavioral change. Please clarify what is genuinely novel in your framework beyond applying it to LVLMs. How does deflection angle differ in utility or interpretation from the KL-based Ships score, and why was no direct comparison included?

  2. Why is Zhou et al. omitted from the Related Work section, despite being highly relevant?
    Given the strong methodological overlap, the absence of a discussion in the Related Work section undermines the paper's positioning. Will the authors revise the framing to properly situate their contributions within this line of research?

  3. How does the safety signal emerge in the multimodal LVLM pipeline?
    The method assumes that the safety-relevant signal is preserved across the vision encoder, projector, and into the LLM. Have you investigated whether certain modalities dominate this signal, or whether vision-only inputs can suppress or bypass the head activation?

  4. Can this method be adapted to black-box LVLMs, or is it inherently limited to open-weight models?
    Given that the method requires access to per-head attention weights and final-layer hidden states, how would it apply to popular API-based models like GPT-4V or Gemini? Are there realistic pathways for making it more generally deployable?

We encourage the authors to also consider and address the broader concerns raised in the weaknesses section of this review.

Limitations

The paper does not explicitly discuss limitations or potential societal risks. Several important points should be acknowledged or explored:

  • The method assumes white-box access to model internals (attention heads and hidden states), which limits applicability to open-weight LVLMs. This assumption should be clearly stated and its implications discussed.
  • The use of a single, fixed safety head may be brittle under adaptive or distributionally shifted inputs. The authors could consider analyzing sensitivity or failure cases.
  • There is no discussion of how this method could be misused or misunderstood in deployment, for example, as a justification for overconfident content filtering or automated moderation without human oversight.

Adding a clear Limitations section that reflects on these concerns would improve the transparency and responsibility of the work.

Justification for Final Rating

The authors provided a clear and faithful implementation of the Ships baseline and empirically demonstrated performance gaps. While I was initially concerned that the core contribution might significantly overlap with Zhou et al. (Ships/Sahara), the clarification of the selectivity principle (requiring safety heads to exhibit differential behavior between safe and unsafe inputs) helped establish a clearer and stronger conceptual distinction from prior work.

Formatting Issues

The paper follows the NeurIPS 2025 formatting guidelines and does not exhibit any major formatting issues.

Author Response
  • Q1 and W1:

    • We clarify the novelty of this work from the following three points.

    • Safety Amplification Paradigm: Our novel safety amplification approach achieves defense by activating the model's inherent safety capabilities. We introduce the deflection angle to quantify safety signal strength. By measuring the deflection angle of hidden states before and after masking attention heads, we can not only identify safety-critical heads but also measure the strength of their safety signals, enabling a complete amplification-detection defense pipeline for real-time deployment, rather than merely remaining at the mechanistic interpretation level.

    • Single-Head Sufficiency Discovery: Our analysis demonstrates that a single well-identified attention head can achieve near-perfect safety detection. Our experiments show that adding more safety heads even degrades performance due to the introduction of inconsistent attention patterns and redundant information (Table 12). Hence, multi-head collaboration is not necessary for safety detection.

    • Differential Behavior Criterion: Ships/Sahara only validates its "safety heads" on harmful datasets. Figure 2 reveals that attention heads showing large deflection angles on unsafe data often exhibit similarly large deflection angles on safe data, particularly in early layers. These are therefore not true safety heads but rather heads sensitive to general patterns or input structures. We introduce a novel paradigm requiring heads to demonstrate differential behavior between safe and unsafe inputs. Our "Limited Impact on Safe Inputs" criterion (Equation 6) ensures identification of truly safety-critical heads. To demonstrate this critical difference, although Ships/Sahara does not propose a defense method, we applied their pipeline by using the attention head with the largest deflection angle as the "safety head" for an experimental comparison (DR %) on Qwen2-VL-7B:

      Method | Head | VLSafe | FigStep | QR | LLM Transfer | Instruct-80k | ShareGPT4V
      Ships | (0, 15) | 80.54 | 0 | 4.83 | 60.15 | 31 | 10.80
      Oh Defense | (15, 22) | 99.34 | 100 | 94.45 | 88.73 | 9.62 | 0

      Ships fails on multimodal attacks like Figstep and QR, where harmful content appears in images rather than text, because their identified "safety heads" focus on textual patterns rather than understanding safety semantics.

    • The deflection angle differs in utility and interpretation from the KL-based Ships score. The deflection angle measures changes in hidden states using cosine similarity, capturing internal representation shifts. In contrast, the KL-based Ships score calculates changes in output probability distributions over the vocabulary, measuring how the output distribution shifts toward harmful content. Moreover, the KL-based Ships score targets individual inputs to identify which attention heads are critical for safety, while the deflection angle works at both the individual-input and dataset levels, not only identifying safety attention heads but also enabling real-time assessment of whether inputs are safe.

    • The deflection angle and the Ships score are not directly comparable. The deflection angle is designed for harmful content detection and defense, while Ships is designed for understanding model behavior. The deflection angle operates at the level of the model's internal states, while the KL-based Ships score operates at the output level and cannot amplify the model's inherent safety capabilities, making it unsuitable for our scenario. Meanwhile, we compared other measurement methods that align with our requirements (Figure 12), which showed no significant differences between the various approaches. In addition, the deflection angle requires only two forward passes, whereas the KL-based Ships score requires complete generation and probability calculations, resulting in higher computational costs.

  • Q2: Ships/Sahara focuses on LLMs and does not propose an actual defense method; hence, we only introduce the work in our Introduction section. Furthermore, in the answers to Q1, we provided updated comparison results with Ships/Sahara using our pipeline. We will include a discussion of Ships/Sahara in the Related Work section in the next version and clarify its contributions and relevance to our approach.

  • Q3: The datasets used in the experiments have different emphases across modalities. For example, the harmful information in FigStep and QR appears in images, while the harmful information in LLM Transfer is reflected in text.

    We conduct additional experiments examining whether vision-only inputs can suppress or bypass the head activation. We selected the FigStep and QR datasets and tested vision-only inputs, with the following results:

    Setting | FigStep | QR
    Org (no defense) | 41.85 | 23.86
    Vision-only (no defense) | 17.82 | 0
    Oh Defense | 0 | 0

    Thus, vision-only attacks mostly reduce their own harmfulness and cannot bypass Oh Defense. This is because, although the model can understand harmful information in images, it cannot effectively output harmful content without textual guidance.

  • Q4, W3 and L1: We follow existing inference-time defense methods accepted/published at top AI venues [18, 53, a] in adopting the white-box setting, which still has significant impact in real-world deployments, as follows:

    • In black-box AI products, such as those from OpenAI or Google, our approach can be integrated into the internal model-serving infrastructure before the API endpoints are exposed to users, providing an additional safety layer that protects against various attack vectors.
    • In white-box AI products, such as LLaVA and Qwen-VL, organizations deploying these models locally can directly apply Oh Defense for enhanced safety without relying on external API calls.
  • W2:

    • In Figure 7, we present the changes in ASR after masking one or three safety heads, compared to the standard condition.

      Oh Defense achieves near-perfect defense performance using a single safety head, without considering inter-head interactions. As demonstrated in Tables 3 and 12, increasing the number of safety heads does not enhance defense effectiveness and, in some cases, may even degrade performance due to the introduction of redundant information across attention heads.

      While we did not analyze the overlap or consistency of safety heads across different models, we note in Section 4.3 that LVLMs built upon similar base LLMs exhibit consistent safety trends.

    • Exploring how safety-relevant information flows from the vision encoder into the language model, and how multimodal alignment affects the emergence of safety heads, may require novel approaches. We will include this in a future-work discussion section.

    • Beyond the main analysis, our study offers several additional insights. For instance, when selecting layers for hidden states, we find that effective results can be achieved after layer 20, not just at the final layer. Furthermore, in Appendix A.2, we provide evidence that safety degradation indeed occurs when transitioning from LLMs to LVLMs, all of which can provide valuable insights triggering future work.

  • W4: We will update Figure 1 in the next version.

  • L2: Relying on a single safety head during inference could potentially be vulnerable to targeted attacks.

    • For adaptive attacks, attackers would need to: a) identify which specific head is being used, b) craft inputs that specifically evade that head while maintaining attack effectiveness, and c) do so without triggering alternative safety mechanisms in the model. This represents a significantly higher attack complexity compared to bypassing traditional safety measures.

    • For distributionally shifted inputs, we conduct further analysis on the misclassified unsafe samples under LLaVA-1.5-7B. Specifically, we observe that false positive samples tend to have deflection angles closer to the decision threshold of 2.16 compared to true positives (1.83 vs. 2.1), indicating that our selected attention heads still exhibit a degree of sensitivity even on incorrectly judged inputs. Upon further investigation, as shown in Table 1, we found that the classification errors are often model-specific and tend to concentrate on different input types. For instance, LLaVA shows more failures on QR , which we attribute to the inherent differences in modality and input perception across models due to training data and architecture. Achieving higher performance would require analyzing sensitivity across different data types and validating combinations of attention heads tailored to specific shifts, which reduces the method’s transfer robustness and still may not fully generalize to unseen distributions.

      Nevertheless, Oh Defense still maintains state-of-the-art robustness across diverse settings, and the selected attention heads demonstrate sufficient robustness on existing benchmark datasets.

  • L3: We acknowledge this limitation; however, Oh Defense achieves low false positive rates on benign datasets (< 5% on most datasets, as shown in Table 1), meaning most benign tasks are processed normally.

    Meanwhile, this represents a safety-utility trade-off that is inherent in most safety systems, prioritizing safety over potential risks in edge cases. This limitation suggests valuable future work directions, such as incorporating steering vector techniques to maintain response quality even when misclassification occurs.

    We will update the Limitations section in subsequent versions to improve the transparency of our work.

[a] Yi Ding, Bolian Li, and Ruqi Zhang. 2024. ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time. In The Thirteenth International Conference on Learning Representations.

Comment

We appreciate the thoughtful rebuttal and the additional experiments provided. Many of our initial concerns have been clarified. That said, we would like to ask a few follow-up questions to further understand the key distinctions between your method and prior work.

While we understand that Zhou et al. (ICLR 2025; aka Ships/Sahara) does not propose a defense mechanism, the methodological similarity to your work makes it a crucial point of comparison. This is especially relevant in terms of how head-level analyses inform safety alignment, as also pointed out by Reviewer w2PL.

  1. Could you clarify whether the Ships/Sahara baseline was implemented in a faithful and fully comparable way?
    In the rebuttal, you mention using the head with the largest deflection angle for DR evaluation. However, Zhou et al. originally proposed a KL-based score over output distributions. Did you re-implement their exact procedure—including KL-based head scoring and selection—or did you substitute your own deflection angle framework for the comparison? A more precise clarification would help evaluate the fairness of the reported performance gap.

  2. What do you believe is the primary factor behind the performance gap between Oh Defense and Ships/Sahara?
    While your method consistently outperforms theirs across datasets—particularly on multimodal inputs—it would be helpful to clarify which aspect contributes most to this improvement.
    Is the primary reason the new filtering criterion (i.e., suppressing heads that also respond to safe prompts)?
    A breakdown of these contributing factors would strengthen your claim of conceptual advancement over prior work.

Comment

Thank you for your positive comments noting that most of your initial concerns have been clarified. Below are our answers to the two latest questions.

  • Q1-discussion. We appreciate this important clarification and acknowledge that in our initial rebuttal, we did substitute our deflection angle framework for a quick comparison due to time constraints. This is because KL-based score operates on individual queries comparing output distributions, which would require searching across the entire unsafe dataset. Hence, it is a more computationally intensive process than our method. However, we understand the need for a completely fair comparison and now provide a fully faithful implementation.

    Faithful Ships/Sahara Implementation configurations: We have now implemented Ships/Sahara faithfully following their original paper and source code:

    • Head Selection: Used their KL-based scoring, KL(P_original || P_masked), on harmful inputs (a minimal illustrative sketch of this scoring is given after this reply).
    • Selected Head: (0, 7) for Qwen2-VL-7B using their methodology, i.e., the head with the highest KL-based score.
    • Special consideration on the detection method for Ships/Sahara: Since Ships/Sahara does not provide a detection mechanism, we apply the same detection framework to the heads selected by each method. This ensures we fairly compare the quality of the head selection strategies rather than the detection mechanisms.

    Results of Fair Comparison:

    Method | Head | VLSafe | FigStep | QR | LLM Transfer | Instruct-80k | ShareGPT4V
    Ships | (0, 7) | 65.50 | 10 | 6 | 19.85 | 10 | 0
    Oh Defense | (15, 22) | 99.34 | 100 | 94.45 | 88.73 | 9.62 | 0

    The above implementation follows Ships's methodology for head identification, ensuring our comparison accurately reflects the performance difference between the two head selection approaches under identical detection conditions. It can be observed that when (0,7) is used as the safety head for detection, the detection rates on FigStep, QR, and LLM Transfer are all below 20%, which indicates that the head selected by Ships cannot serve as a safety head for detection under parallel settings.

  • Q2-discussion. Based on our comparison analysis, the performance gap stems from several key methodological differences:

    Primary Factor - Differential Behavior Criterion (Core Methodological Breakthrough): Yes, the differential behavior criterion is indeed the primary factor. Nevertheless, this represents a fundamental conceptual advance in safety head identification rather than a simple filtering adjustment. Let us re-emphasize the contribution of our simple yet effective solution.

    • The Core Discovery and Conceptual Contribution: Through systematic analysis across multiple LVLMs and datasets, we discovered that attention heads exhibiting high influence on harmful inputs fall into two distinct categories: (1) heads that respond indiscriminately to both safe and unsafe inputs (capturing general linguistic patterns), and (2) heads that selectively respond to safety-relevant semantic features. This establishes a new principle for safety-relevant attention head identification: effective safety mechanisms require selectivity, not just sensitivity. This principle could inform future mechanistic safety research beyond our specific application.

    • Empirical Validation of the Principle: Our analysis reveals that Ships/Sahara's approach, while mechanistically sound for interpretability, inadvertently selects heads that process surface-level patterns (predominantly in early layers, Figure 2). These heads fail catastrophically on visual-modal attacks (10% on Figstep, 6% on QR) because they focus on textual pattern recognition rather than semantic safety understanding. Our criterion identifies heads in deeper layers where cross-modal semantic integration occurs. The analysis in Figure 4 further supports our argument: in the early layers, masking Query matrices versus masking Value matrices produces significantly different impacts, which indicates that high-influence attention heads selected through masking approaches (typically appearing in early layers, as shown in Figure 2) are actually heads sensitive to lexical or general patterns, rather than heads specifically designed to capture safety-related features.

    Single-Head Sufficiency Discovery (Additional Factor): Using a single attention head is also a factor behind the performance gap. A single head can maximally preserve its safety capability, while multiple heads introduce redundant information that dilutes detection signals. As shown in Table 12, increasing the number of safety heads leads to over 20% detection drop on QR for Qwen2-VL-7B and a similar decline on Figstep for Phi-3.5-Vision.

    Our contribution lies not merely in achieving better performance, but in discovering and formalizing the selectivity principle that enables reliable safety head identification in multimodal settings.
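
As referenced above, the following is a minimal, illustrative sketch of the KL-based head scoring used for the faithful Ships comparison. This is not the Ships authors' code: `logits_fn` is a hypothetical helper returning next-token logits with an optional (layer, head) mask, and only the scoring rule KL(P_original || P_masked) follows the description in this thread.

```python
import torch
import torch.nn.functional as F

def kl_head_score(logits_original: torch.Tensor, logits_masked: torch.Tensor) -> float:
    """KL(P_original || P_masked) over the next-token vocabulary distribution."""
    log_p = F.log_softmax(logits_original, dim=-1)
    log_q = F.log_softmax(logits_masked, dim=-1)
    # kl_div(input, target, log_target=True) sums target.exp() * (target - input) = KL(P || Q)
    return F.kl_div(log_q, log_p, reduction="sum", log_target=True).item()

def rank_heads_by_kl(logits_fn, candidate_heads, harmful_prompts):
    """Rank (layer, head) pairs by their mean KL score over a set of harmful prompts;
    logits_fn(prompt, masked_head) returns next-token logits (masked_head=None: no mask)."""
    mean_score = {
        h: sum(kl_head_score(logits_fn(p, None), logits_fn(p, h)) for p in harmful_prompts)
        / len(harmful_prompts)
        for h in candidate_heads
    }
    return sorted(candidate_heads, key=mean_score.get, reverse=True)
```

Under this rule, the head ranked first by the mean score is the one used as the Ships "safety head" in the comparison above.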

If you have any more questions, please feel free to let us know at any time.

Comment

We are pleased to have addressed your concerns and thank you very much for your valuable comments and suggestions. We greatly appreciate the time and effort you dedicated to reviewing our manuscript. We have carefully considered all your comments and made the corresponding revisions in the revised manuscript accordingly.

Comment

The authors have convincingly addressed my initial concerns, especially regarding methodological novelty and comparison with prior work (e.g., Zhou et al.).

While I was initially concerned that the core contribution might overlap significantly with Zhou et al. (Ships/Sahara),
the clarification of the selectivity principle—along with its empirical validation across multiple LVLMs—helped to more clearly establish the conceptual advancement of this work.

Given the strong performance, real-time applicability, and the authors’ clear willingness to revise, I am updating my score from 3 to 4 (Borderline reject → Borderline accept).

Official Review (Rating: 4)

The work proposes to find specific attention heads whose deflection angle under masking can be enough to abstain in the case of an unsafe input prompt to an LVLM. The method, termed 'Oh Defense', works without any additional fine-tuning and achieves good results. The finding that in most cases only one attention head seems to be enough to detect unsafe prompts is surprising.

Strengths and Weaknesses

Strengths:

  1. Interesting problem and some of the findings are surprising.
  2. Paper is well-written, ablations look complete.

Weaknesses:

  1. Baselines can be improved, see LLM-Judge example in Questions.
  2. Intuition behind why there is a distinction between safe/unsafe inputs at attention heads seems lacking.
  3. Lack of clarity in some experiments; for example, see Question 5.

Questions


  1. I do not really get why the thing works: why is there inherently a difference between the hidden states for safe/unsafe inputs given the variations in the inputs are often subtle?
  2. It seems the approach is very generic: and the fact one uses Safe/Unsafe datasets to find the threshold is the only thing relating it to LVLM safety? If this is correct: this can be arbitrarily scaled to other tasks in an LVLM?
  3. The overall idea, up to the selection of safety heads, seems similar to SAH: a detailed discussion of the differences should be added to the paper.
  4. At inference you need two passes: one to detect the angle and the other to classify? A good baseline would be to use a lightweight LLM judge as a pre-processing step on the input, and to report how this performs - clearly this is as expensive as Oh Defense (assuming a similarly sized judge).
  5. How is the evaluation of the adversarial attacks in Section A.3 done? This seems unclear to me.
  6. The method seems susceptible to the top-k heads, from Fig. 7 - if an attacker has a way (adaptive attack) to suppress the important heads, then Oh Defense seems to be very vulnerable. One can trivially create a set of safe/unsafe data where the top-k heads are shifted from the best ones that were found by the original Safe/unsafe data. How does the method perform then?
  7. The training (head-selection) datasets seem to come from VLSafe. What happens when a completely different set is used at training time as the unsafe set? In other words, how much does Oh Defense rely on VLSafe?

Limitations

yes

Justification for Final Rating

Some of my concerns have been resolved by the rebuttal. There are still some minor issues I see with this work, so cannot champion a convincing accept.

Formatting Issues

NA

Author Response
  • Q1 and W2: The attention mechanism in language models possesses a signal amplification property that enables safety-trained models to detect and magnify subtle differences between safe and unsafe inputs, even when surface-level distinctions are minimal. This amplification process works through several key mechanisms. Safety heads within the attention layers can identify small variations between inputs that differ in their safety implications. These initially subtle signals are then amplified through the non-linear softmax operation. As information propagates through multiple layers of the model, these safety-related differences accumulate and become increasingly pronounced. By the time the model begins generating its first token, the hidden states corresponding to the final input token have effectively captured and integrated the global semantic meaning of the entire input sequence, including its safety characteristics.

    As briefly mentioned in the Introduction Section (Paragraph 3), this amplification behavior is central to the model's safety inference process.

  • Q2: Thank you for this insightful observation. Our work focuses specifically on safety, building on the key insight that the attention heads develop safety concept awareness after alignment training, where certain safety-critical heads capable of amplifying these signals. This discovery forms the foundation of our Oh Defense framework for ensuring output safety. For other tasks involving similar conceptual distinctions that might be encoded during training, there could be potential for extending our approach. We will include this in a future work direction section in the next version.

  • Q3: While both methods explore the capability of attention heads for safe/unsafe content detection, we observe differences between Oh Defense and SAHs in the approach and capabilities.

    • Differences in approach. SAHs rely on linear probes and train binary classifiers on attention head activations, requiring supervised learning for detection. In contrast, Oh Defense adopts a training-free methodology based on principal behavior vectors and deflection angle measurements, avoiding the reliance on additional parameters or training data required by SAHs.
    • Differences in detection capability. Oh Defense demonstrates significant practical advantages. Our approach uses only a single attention head compared to SAHs' multiple heads, while achieving superior performance across various attack scenarios. Note that Oh Defense outperforms SAHs against challenging attacks like FigStep; see Table 2 for details. The training-free nature also provides better transferability across different models and datasets (although attention heads are selected using VLSafe, the detection capability is demonstrated in Tables 1 and 2 across various jailbreak datasets), as we do not rely on dataset-specific supervised training that may not generalize well.

    We will conduct a more detailed comparison between Oh Defense and SAHs in the revised version.

  • Q4 and W1: At inference, detection and classification in Oh Defense are integrated into a unified process with low computational overhead. During inference, we need to compute the deflection angle by comparing hidden states from normal and masked attention processing (involving only the generation of the first token), followed by a simple threshold comparison. The angle calculation and classification are lightweight mathematical operations that add minimal computational cost compared to the forward passes themselves. The time consumed for a single query can be seen in Figure 9, while LLM-Judge requires a complete generation process. However, comparison with LLM Judge is still meaningful. We present the comparison of detection results (DR %) between Oh Defense (on LLaVA-v1.5-7B) and Llama-Guard-3-Vision as follows:

    Method | VLSafe | FigStep | QR | LLM Transfer | Instruct-80k | ShareGPT4V | Time
    Oh Defense | 99.91 | 100 | 98.47 | 99.92 | 2 | 0 | 0.22s
    Llama-Guard-3-Vision | 91.44 | 54.90 | 40.80 | 85.97 | 0 | 8.80 | 0.53s

    It can be seen that Oh Defense comprehensively outperforms Llama-Guard-3-Vision. An LLM/MLLM judge cannot effectively detect some harmful data carefully crafted by attackers, such as FigStep and QR (detection rates below 60%), and its time cost is also higher.

  • Q5 and W3: Our evaluation was done in the following steps.

    • Step 1. Selection of Adversarial Attacks and Input Preparation: We selected two representative white-box adversarial attacks - VisualAdv (which optimizes adversarial images by accessing model gradients) and BAP (which extends VisualAdv with coordinated text optimization). Text prompts were sourced from JailbreakBench dataset[5], while the image inputs are adversarially perturbed versions generated by each respective attack method.

    • Step 2. Adversarial Sample Evaluation: We evaluated these attacks under different perturbation constraints (ε = 16/255, 32/255, 64/255, and unconstrained), where ε represents the maximum allowed pixel-wise noise intensity in the L∞ norm. The sample generation follows standard adversarial optimization procedures under white-box conditions. We apply Oh Defense to these inputs and evaluate its performance using the same metrics as in Tables 1 and 2.

      The results are demonstrated in Table 6, where adversarial attacks produce significantly larger deflection angle (9.82 for VisualAdv, 9.15 for BAP) compared to other attack types (3.25-6.91). So we conclude that this counterintuitive result occurs because adversarial perturbations are inherently more aggressive in manipulating model representations, making them easier to detect rather than harder. The stronger the adversarial manipulation, the larger the deflection angle, leading to more reliable classification and better defense performance. We will provide more detailed experimental setup descriptions in the revised version.

  • Q6: This is an insightful question about the robustness of our head selection mechanism. You raise a valid concern about whether attackers could deliberately craft inputs to shift the top-k safety heads away from those identified in our original safe/unsafe datasets. We answer your question from two perspectives.

    • Cross-dataset generalization: Oh Defense demonstrates solid performance against state-of-the-art attacks including sophisticated jailbreak datasets, carefully crafted red-team data, and white-box adversarial attacks with gradient information (as shown in Appendix A.3). This suggests that the fundamental attention patterns underlying safety detection are not easily manipulated by input variations.

    • Multiple head candidates provide robustness: Oh Defense randomly selects from a candidate set of safety-critical heads, rather than relying on a single fixed head. This makes it significantly more challenging for attackers to suppress all potential safety heads simultaneously.

      Regarding your specific scenario of creating safe/unsafe data to shift top-k heads: such an attack would require:

      (1) Knowledge of which specific attention heads are safety-critical for the target model.

      (2) Ability to craft inputs that suppress multiple candidate heads without being detected.

      In practice, this presents significant challenges, especially since different models have different safety head distributions (as shown in Figure 2), and our selection from multiple candidate heads makes it harder for attackers to predict and counter the specific detection mechanism being used.

  • Q7: To evaluate the consistency of our safety head identification across different datasets, we applied our method to LLaVA-v1.5-7B. Initially using VLSafe, we obtained the top-5 safety heads {(8,2), (10,23), (5,5), (3,10), (5,28)}. When we replaced VLSafe with VLGuard [55], the identified set became {(23,11), (3,14), (8,2), (5,28), (4,23)}. Notably, heads (8,2) and (5,28) appear consistently across both datasets, suggesting that they represent core safety-critical attention mechanisms. To assess how the other, dataset-specific heads perform, we use (4,23) from the VLGuard-identified set for evaluation. The detection rate (DR %) results are as follows:

    VLSafe | FigStep | QR | LLM Transfer | Instruct-80k | ShareGPT4V
    97.10 | 100 | 99.97 | 99.99 | 3.80 | 0.20

    It can be seen that (4,23) still performs excellently as a safety head, showing that the safety heads we identified are well qualified and that Oh Defense is not particularly dependent on the head-selection dataset. Moreover, the images in the VLSafe dataset come from the COCO dataset and the text portion comes from red-team data, so such a dataset can be easily constructed.

Comment

We sincerely appreciate your thoughtful feedback and the time you’ve dedicated to reviewing our manuscript. Your insightful comments have helped us significantly improve the paper. We have carefully addressed each point and incorporated the revisions into the updated version. Thank you again for your constructive feedback and support.

Comment

I thank the authors for their detailed response and clarifications. Some of my concerns (generalization to other sets than VLSafe, inference speed etc) have been resolved, and overall I look at this work positively, hence I will update my score to 4, before discussion with other reviewers. Having said that, I am still not fully convinced by the robustness of the safety mechanism - even though the authors pointed out why it can be robust I still subscribe to the fact that a brute force selection over top-k heads can be done and might work in principle. Also, it would be imperative to add a comparison between the proposed Oh Defense and SAH.

Comment

We appreciate your positive feedback and your constructive comments. Let us have a look at your remaining concerns about the vulnerabilities of the top-k heads and comparison against SAHs:

  • We agree that a brute-force attack on the top-k head selection is possible in theory; however, it might not be feasible in real-world practice.

    • In black-box scenarios, where AI models only support API access, it is unrealistic for attackers to obtain the model's architecture and parameters in order to suppress the top-k safety heads. That is, in this case, Oh Defense remains robust.

    • In white-box scenarios, we assume that the attacker can forcibly suppress the top-k safety heads by setting the weight matrices of these heads to zero. We can still run Oh Defense on such a 'compromised' model by re-selecting the top-k safety heads, since we can always identify a top-k regardless. To evaluate the performance of Oh Defense in this case, we further conduct experiments using a modified LLaVA-v1.5-7B model (with the pre-identified top-k safety heads suppressed) for k=5. On this modified model, Oh Defense obtained the 'new' top-5 safety head set {(3,19), (14,27), (13,7), (4,21), (5,24)}. Then, using a randomly sampled head, (13,7), as the safety head, we obtain the following results (DR %):

      VLSafe | FigStep | QR | LLM Transfer | Instruct-80k | ShareGPT4V
      98.91 | 100 | 98.50 | 99.75 | 3.60 | 1.60

      As can be observed, compared to the results in Table 1, the detection performance slightly decreases but still achieves strong effectiveness. In the next version, we will include the discussion and analysis of this potential attack and the above experimental results in a discussion section.

  • In our answers to your Q3, we discussed the differences between SAHs and Oh Defense. To present these distinctions more clearly, we provide a table as follows:

    Method | SAHs | Oh Defense
    Detection Mechanism | Linear probe-based classification | Directly compute deflection angle via head masking
    Training Requirement | Requires supervised training of linear probes (concatenated activations) + binary classifier | Training-free
    Number of Heads | Multiple heads (typically 16-32) | Single critical head
    Memory Requirements | Probe weights + classifier parameters | Head indices + threshold only
    Generalization | May overfit to training distribution | Robust to diverse attacks and different models via intrinsic dynamics

    Moreover, although Table 2 presents the defense performance (ASR %) of both methods, since the base model has some inherent capability to resist certain harmful inputs, we further report the direct detection results (DR %) of both methods on LLaVA-v1.5-7B and Aya-Vision-8B for a more accurate comparison. The results are shown below:

    Model | Method | VLSafe | FigStep | QR | LLM Transfer | Instruct-80k | ShareGPT4V
    LLaVA-v1.5-7B | SAHs | 99.72 | 0 | 78.35 | 93.68 | 3 | 0
    LLaVA-v1.5-7B | Oh Defense | 99.91 | 100 | 98.47 | 99.92 | 2 | 0
    Aya-Vision-8B | SAHs | 51.89 | 0 | 68.90 | 83.45 | 11.20 | 7.20
    Aya-Vision-8B | Oh Defense | 99.34 | 98.95 | 83.82 | 99.67 | 5 | 1.40

    As shown, Oh Defense consistently outperforms SAHs across all datasets. SAHs completely fails on Figstep and performs poorly on multiple harmful datasets under Aya-Vision-8B, highlighting the limitations of SAHs.

    We will include a more detailed comparison between Oh Defense and SAHs in the revised version.

We're committed to making these improvements in the final version and appreciate your constructive feedback in helping us strengthen the paper.

Official Review (Rating: 4)

This paper uncovers that LVLMs already contain safety-critical attention heads whose behaviour shifts sharply when the input is unsafe. By systematically masking each head and measuring the deflection angle between hidden-state directions on small safe vs. unsafe sets, the authors pinpoint a single head that cleanly separates the two. At inference, they (1) mask that head, compute the angle for the current sample, and classify it with a KDE-derived threshold; (2) if unsafe, prepend a short safety prompt and anchor the first output tokens (“I cannot …”) to force a principled refusal. The method is training-free, model-agnostic, and adds no external components, yet on four diverse LVLMs it achieves ≥ 98% defense success across VLSafe, JailbreakV-28K and other benchmarks.
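
The KDE-derived threshold mentioned above could, for instance, be taken at the point where the two estimated densities cross. Below is a minimal sketch under that assumption (the paper's exact thresholding rule may differ, and the angle values here are synthetic):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_threshold(safe_angles, unsafe_angles, grid_points=2000):
    """Pick a deflection-angle threshold where the KDE of the safe angles and the
    KDE of the unsafe angles cross (illustrative rule, not the paper's exact one)."""
    kde_safe, kde_unsafe = gaussian_kde(safe_angles), gaussian_kde(unsafe_angles)
    lo = min(np.min(safe_angles), np.min(unsafe_angles))
    hi = max(np.max(safe_angles), np.max(unsafe_angles))
    grid = np.linspace(lo, hi, grid_points)
    diff = kde_safe(grid) - kde_unsafe(grid)
    crossings = np.where(np.diff(np.sign(diff)) != 0)[0]
    # fall back to the midpoint of the grid if the densities never cross
    return grid[crossings[0]] if len(crossings) else grid[grid_points // 2]

# Synthetic safe/unsafe validation angles (in degrees) for illustration only.
rng = np.random.default_rng(0)
safe = rng.normal(1.0, 0.4, 200)
unsafe = rng.normal(4.0, 1.0, 200)
print(kde_threshold(safe, unsafe))
```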

Strengths and Weaknesses

Strengths

  • S1: The paper provides a systematic and fine-grained empirical analysis of attention heads in LVLMs, revealing that safety-relevant signals are concentrated in a small subset, termed “safety heads.” This is visually supported by the heatmaps in Figure 2, which display differences in deflection angles across heads and layers when comparing safe vs unsafe inputs.

  • S2: The Oh Defense framework requires no retraining, additional parameters, or external modules, making it highly practical for real-world deployment.

  • S3: Quantitative benchmarking demonstrates the method’s high DR and low false positive rate, outperforming prior approaches in most settings.

Weaknesses

  • W1: While Oh Defense is shown to defeat standard jailbreak attacks, there is little exploration of how attackers might adapt/mimic safe queries to evade detection. For instance, could an attacker construct inputs designed to fool the critical deflection signal?

  • W2: The SAH paper already linked individual heads to safety signals, while Oh Defense mainly adds a deflection-angle metric and prompt anchoring, so novelty is modest.

  • W3: Lack of evaluation on text-only input (either by converting the image to a caption or running the prompt as a text-only query). All reported results are based on multimodal settings, so we cannot tell whether the “safety-critical” head reacts to visual features, textual jailbreak patterns, or their interaction. A more rigorous study may include (i) a caption-only variant, where each image is replaced by an auto-generated caption, and (ii) a pure text variant that removes the image.

  • W4: Modality attribution is still unclear—the paper did not contrast scenarios where unsafe content appears only in the image, only in the text, or emerges through cross-modal reference (e.g., a benign textual query that uses “this” to point to an unsafe visual element).

Questions

Please refer to weaknesses

Limitations

yes

Justification for Final Rating

The authors have addressed most of my concerns with detailed clarification. Therefore, I keep my positive score.

Formatting Issues

N/A

Author Response
  • W1: Adaptive attacks are indeed an important consideration. We answer your question from two perspectives.

    • Robustness against existing attacks: Oh Defense demonstrates solid performance against state-of-the-art attacks including sophisticated jailbreak datasets, carefully crafted red-team data, and white-box adversarial attacks with gradient information (see Appendix A.3 on page 15).
    • Difficulty of evasion: Evading Oh Defense would require attackers to suppress multiple potential safety heads simultaneously, since we randomly select from a candidate set of safety-critical heads. This presents a significant challenge as it requires: (1) Knowledge of which specific heads are safety-critical (different across models) (2) Crafting inputs that can suppress multiple safety heads without triggering detection. We further expand the two requirements to two attack scenarios:
      • Black-box settings: To the best of our knowledge, there is no reported black-box attack that could infer the status of the attention heads through simple output observations.
      • White-box settings: If attackers have full model access, they can perform more sophisticated and harmful attack against the model (e.g., backdoors and parameter manipulation) rather than focusing solely on bypassing our detection method.
  • W2: While both methods explore the capability of attention heads for safe/unsafe content detection, we observe differences between Oh Defense and SAHs in the approach and capabilities.

    • Differences in approach. SAHs rely on linear probes and train binary classifiers on attention head activations, requiring supervised learning for detection. In contrast, Oh Defense adopts a training-free methodology based on principal behavior vectors and deflection angle measurements, avoiding the reliance on additional parameters or training data required by SAHs.
    • Differences in detection capability. Oh Defense demonstrates significant practical advantages. Our approach uses only a single attention head compared to SAHs' multiple heads, while achieving superior performance across various attack scenarios. Note that Oh Defense outperforms SAHs against challenging attacks like FigStep; see Table 2 for details. The training-free nature also provides better transferability across different models and datasets (although attention heads are selected using VLSafe, the detection capability is demonstrated in Tables 1 and 2 across various jailbreak datasets), as we do not rely on dataset-specific supervised training that may not generalize well.

    We will conduct a more detailed comparison between Oh Defense and SAHs in the revised version.

  • W3: We agree with you that it is crucial to consider text-only inputs to comprehensively demonstrate the capability of Oh Defense. We show new experimental results on Qwen2-VL-7B, one of the LVLMs used in the paper and among the most commonly used LVLMs.

    • For the caption-only variant, we chose the FigStep dataset, where harmful content appears in images. We extracted captions from these images as input, with results shown as follows:

      FigStep | Org (no defense) | Caption-only (no defense) | Oh Defense
      ASR | 32.45 | 2.02 | 0

      The results show that extracting captions from harmful images already significantly reduces the harmfulness of the input itself, confirming that a caption cannot fully preserve the harmful information contained in a harmful image.

    • For the pure text variant, we selected three widely used harmful datasets, MaliciousInstruct [15], JailbreakBench [5], and AdvBench [a]; the results (ASR %) are presented as follows:

      Method | MaliciousInstruct | JailbreakBench | AdvBench
      No defense | 26 | 19 | 15.79
      Oh Defense | 0 | 1 | 0

      The results demonstrate that Oh Defense can still provide effective defense in text-only inputs (ASR on all harmful datasets < 2%).

  • W4: The datasets used in our experiments cover all the scenarios mentioned in the comments, as detailed in Appendix A.1. Specifically, in Table 1 and Table 2:

    • VLSafe represents cross-modal attacks where unsafe content emerges through multimodal interaction.
    • JailbreakV-28K's LLM Transfer category specifically tests scenarios where unsafe content appears only in the text, while the images are black or noise images.
    • The reviewer's example of 'a benign textual query that uses "this" to point to an unsafe visual element' is precisely what JailbreakV-28K's FigStep and QR methods evaluate: FigStep embeds unsafe information directly into images through typography, while QR uses benign text queries that reference unsafe visual elements.

[a] Yangyi Chen, et al. 2022. Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

Comment

I would like to thank the authors for addressing most of my concerns with detailed clarification. I will keep my positive score.

Comment

Thank you very much for your valuable comments and suggestions. We greatly appreciate the time and effort you dedicated to reviewing our manuscript. We have carefully considered all your comments and made the corresponding revisions in the revised manuscript accordingly.

Official Review (Rating: 5)

This paper proposes One-head Defense (Oh Defense), a lightweight and training-free approach to improving the safety of Large Vision-Language Models (LVLMs). The authors observe that specific attention heads in LVLMs are disproportionately responsible for reacting to unsafe input content. By masking attention heads and measuring shifts in hidden state representations (via deflection angle), they identify safety-critical heads capable of detecting unsafe multimodal prompts. Oh Defense leverages this head to classify inputs and applies a guided intervention through prompting when unsafe inputs are detected. The method is shown to be effective across multiple LVLMs (e.g., LLaVA-v1.5, Phi-3.5-Vision, Aya-Vision, Qwen2-VL), demonstrating high detection accuracy, very low false positive rates, and low attack success rates. Importantly, the approach is computationally efficient and does not require model fine-tuning, external classifiers, or additional training data.

Strengths and Weaknesses

Strengths:

  • Oh Defense requires no training, fine-tuning, or external models. It is easy to implement and lightweight enough for practical deployment. The method is extensively validated across multiple LVLMs and datasets, including difficult adversarial attacks, where it consistently outperforms strong baselines like SAHs and HiddenDetect.
  • By focusing on single-head deflection and visualizing attention head behavior, the method offers a more interpretable alternative to dense classifier-based defenses.
  • The method demonstrates robustness across models with different architectures, suggesting that the phenomenon of "safety-critical heads" is consistent across LVLMs.

Weaknesses:

  • The method classifies inputs as either safe or unsafe based on a scalar threshold. It does not explore multi-class safety tagging, graded levels of risk, or category-specific defenses (e.g., hate speech vs. misinformation), which could make the system more flexible and informative.
  • The study is focused on Vision-Language Models (LVLMs). Although the principles may extend to text-only LLMs, this is not empirically demonstrated, nor is the importance and impact of single modalities.
  • The proposed intervention strategy is relatively straightforward. A more detailed evaluation of how the intervention affects performance in benign tasks, in case those are flagged, is lacking. This could be important in balancing safety with usefulness in real applications.

Questions

Please address the limitations listed in the Weakness section.

Limitations

Limitations are discussed in the Appendix.

Justification for Final Rating

While the work makes a generally good contribution, I believe that some points could have benefited from a more extensive analysis, for example, the design of the defense strategy. The authors did a good job in addressing my other points. I believe my rating of accept already reflects the strengths of the work. Therefore, I will keep my score.

Formatting Issues

No major issues.

Author Response
  • W1: While Oh Defense is designed for detecting safe/unsafe content in LVLMs and cannot yet identify different types of harmful content, it can provide graded risk scores based on the magnitude of the deflection angle (see Table 6 on page 15). Moreover, Oh Defense demonstrates good performance against harmful inputs of any category. For instance, the JailbreakV-28K dataset used in this paper contains 16 categories. To explore category-specific defenses, we present the defensive performance (ASR %, the lower the better) of Qwen2-VL-7B and LLaVA-v1.5-7B (commonly used open-source LVLMs) across different categories on this dataset as follows:

    Category | LLaVA | Qwen2
    Govern Decision | 0.34 | 2.15
    Fraud | 0.21 | 1.78
    Animal Abuse | 0.08 | 2.31
    Malware | 0.13 | 1.96
    Econ Harm | 0.2 | 2.23
    Illegal Activity | 0.18 | 1.88
    Physical Harm | 0.15 | 3.22
    Bias | 0.13 | 4.19
    Violence | 0.42 | 1.33
    Hate | 0.14 | 3.61
    Political | 0 | 4.62
    Unethical Behavior | 0.36 | 3.69
    Unlicensed Advice | 0 | 4.73
    Privacy Violation | 0 | 2.01
    Health | 0.08 | 2.67
    Child Abuse | 0 | 0.19

    As can be seen, Oh Defense demonstrates robust and consistent protective effectiveness across all categories of harmful data. Notably, each category maintains an ASR below 5%. Furthermore, the performance remains remarkably stable across different threat categories, with minimal variation in ASR between categories.

  • W2: This is a good point. We further supplemented our evaluation with the detection performance (DR %) of Oh Defense on two text-only LLMs: Llama2-7B and Phi-3.5-Mini, which are widely used and are the backbone LLMs of the LVLMs used in our paper. We selected three harmful datasets, MaliciousInstruct [15], JailbreakBench [5], and AdvBench [a] (the higher the better), along with IFEval [b] as the benign dataset (the lower the better) since they are commonly adopted benchmarks for evaluating both the safety and generative capabilities of LLMs. The results are presented as follows:

    Model | Head | MaliciousInstruct | JailbreakBench | AdvBench | IFEval
    Llama2-7B | (31, 17) | 98 | 93 | 95.96 | 6.70
    Phi-3.5-Mini | (16, 18) | 98 | 94 | 97.97 | 4.81

    As can be observed, Oh Defense also achieves solid performance on text-only LLMs. Regarding single-modality analysis, the datasets used in our study provide valuable insights into modality-specific effectiveness. For instance, FigStep and QR datasets contain harmful information embedded primarily in images, while LLM Transfer primarily leverages text-based attacks (see Appendix A.1 on Page 14). Our detection performance across all these harmful datasets demonstrates that Oh Defense exhibits significant impact in both single-modal and multi-modal contexts.

  • W3: This is an important observation. Oh Defense achieves low false positive rates on benign datasets (< 5% on most datasets as shown in Table 1), meaning the majority of benign tasks are processed normally without intervention.

    For the small percentage of benign inputs that are misclassified, we acknowledge that the safety prompts used in Oh Defense may affect response quality. This represents a safety-utility trade-off that is inherent in most safety systems, prioritizing safety over potential risks in edge cases.

    This limitation suggests valuable future work directions, such as incorporating steering vector techniques to maintain response quality even when misclassification occurs, or developing adaptive intervention strategies based on confidence scores. We will include these in a future work discussion section in the next version.

[a] Yangyi Chen, et al. 2022. Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

[b] Jeffrey Zhou, et al. 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

Comment

Dear Reviewer H9Dd,

We hope this message finds you well.

As the discussion period is coming to a close, we would like to thank you for your insightful comments and the time you have dedicated to reviewing our paper. We have carefully addressed each of your concerns in our rebuttal.

We kindly ask whether you could confirm that our responses have satisfactorily addressed your points. Your confirmation would be greatly appreciated as the review process moves forward.

Thank you again for your valuable feedback and support.

Best regards,
Authors of Paper #11603

Final Decision

This paper introduces One-head Defense (Oh Defense) for improving the safety of Large Vision-Language Models (LVLMs). The key insight is that safety-related behavior in LVLMs is often concentrated in specific attention heads, and a single critical attention head can effectively serve as a safety guard. Notably, Oh Defense requires no additional training, making it simple to implement. The authors conduct experiments on four LVLMs on safety benchmarks.

Strengths:

  • Oh Defense is training-free, lightweight, and easy to implement.

  • The method is supported by extensive experiments on several LVLMs and datasets.

  • The phenomenon of "safety-critical heads" is intriguing and appears to hold consistently across different LVLMs.

Weaknesses:

  • The paper lacks a more detailed evaluation of how the intervention impacts benign tasks.

  • The identified top-k heads may themselves become targets for adversarial exploitation, raising potential robustness concerns.

Overall, this paper provides interesting insights about LVLM safety and offers a practical approach for defense. While further analysis of trade-offs and robustness would strengthen the work, the clarity, novelty, and strong empirical results make it a valuable addition to the field.