PaperHub
Score: 4.9 / 10
Poster · 4 reviewers
Ratings: 4, 2, 2, 3 (min 2, max 4, std 0.8)
ICML 2025

CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

A context-aware safety benchmark for safety evaluation that considers context.

Abstract

Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the importance of the context where the query occurs and may cause undesired refusal of queries under safe contexts that diminish user experience. Addressing this gap, we introduce CASE-Bench, a Context-Aware SafEty Benchmark that integrates context into safety assessments of LLMs. CASE-Bench assigns distinct, formally described contexts to categorized queries based on Contextual Integrity theory. Additionally, in contrast to previous studies which mainly rely on majority voting from just a few annotators, we recruited a sufficient number of annotators necessary to ensure the detection of statistically significant differences among the experimental conditions based on power analysis. Our extensive analysis using CASE-Bench on various open-source and commercial LLMs reveals a substantial and significant influence of context on human judgments ($p<$0.0001 from a z-test), underscoring the necessity of context in safety evaluations. We also identify notable mismatches between human judgments and LLM responses, particularly in commercial models within safe contexts. Code and data used in the paper are available at https://anonymous.4open.science/r/CASEBench-D5DB.
Keywords

safety · benchmark · context · large language model · contextual integrity

Reviews and Discussion

Review
Rating: 4

CASE-Bench evaluates whether LLMs can make context-based safety judgements that align well with human judgments. The paper uses contextual integrity theory to formulate prompts from SORRY-Bench, introducing additional parameters such as the sender, the recipient, and the transmission principle. Whether a query should be deemed harmful depends on the additional context. For each query with context, they collect multiple human annotations, with the number of annotators determined by power analysis, to show that context has a statistically significant influence on the safety judgement. Different LLMs often fail to align with the majority human judgement.
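
To make the CI-style parameterisation concrete, a single query-context pair could be represented roughly as follows. This is only a sketch: the field names and example values are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class CIContext:
    """One context attached to a query, loosely following Contextual Integrity
    theory. Field names are illustrative, not the paper's exact schema."""
    sender: str                  # who issues the query, e.g. a screenwriter
    recipient: str               # who receives the information, e.g. a writing assistant
    information_type: str        # what kind of information is transmitted
    transmission_principle: str  # the norm governing the information flow
    label: str                   # whether this context renders the query "safe" or "unsafe"

example = CIContext(
    sender="screenwriter",
    recipient="creative-writing assistant",
    information_type="fictional heist plot details",
    transmission_principle="content is used only within a movie script",
    label="safe",
)
```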

Questions for the Authors

  • How do reasoning models behave on this benchmark?
  • While the “recipient” parameter has the biggest impact on safe/unsafe classification, how might a real system track or verify recipient attributes without violating user privacy or being tricked by malicious actors?
  • How do your current measurements capture model behaviors such as partial compliance?

Claims and Evidence

The claims are conveyed clearly and supported by good statistical evidence. The authors conduct a power analysis to determine the number of annotators needed to show the statistical significance of context's influence on safety decisions.
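
For readers unfamiliar with this step, a power analysis of the kind described could be sketched as follows; the assumed proportions, significance level, and target power are illustrative placeholders, not the paper's actual settings.

```python
# Sketch: how many annotators per condition are needed to detect a difference
# in "Should Respond" proportions between two conditions. Numbers are invented.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_with_context = 0.70     # assumed proportion of "Should Respond" votes with a safe context
p_without_context = 0.50  # assumed proportion without context

effect_size = proportion_effectsize(p_with_context, p_without_context)  # Cohen's h
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Annotators needed per condition: {n_per_group:.0f}")
```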

Methods and Evaluation Criteria

The selection of SORRY-BENCH as the base prompt dataset makes sense given the diverse categories in this benchmark. The evaluation is conducted using different methods such as direct scoring and taking the log probability. They conduct careful statistical studies such as the use of power analysis, which is a strength of this paper. The authors programmatically generate “safe” and “unsafe” contexts via GPT-4, then manually revise them to ensure quality. They also discuss protocols to make sure that human annotations are high quality.

Theoretical Claims

N/A

Experimental Design and Analyses

The experiment designs and analyses are sound. The authors use z-tests to compare the proportions of “Should Respond” across conditions and use Kruskal–Wallis tests to assess significance per query across different contexts.
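
For illustration, the two tests could be run on toy annotation data along the following lines; all counts and ratings below are invented, not the paper's data.

```python
# Toy illustration of the two statistical analyses described above.
from statsmodels.stats.proportion import proportions_ztest
from scipy.stats import kruskal

# z-test on proportions: "Should Respond" votes out of n annotators per condition.
should_respond = [150, 90]   # e.g. safe-context condition vs. no-context condition
n_annotators = [210, 210]
z_stat, p_z = proportions_ztest(should_respond, n_annotators)
print(f"z = {z_stat:.2f}, p = {p_z:.2e}")

# Kruskal-Wallis per query: safety ratings (1-10) under different contexts.
ratings_no_context = [3, 4, 2, 5, 3]
ratings_safe_context = [8, 7, 9, 6, 8]
ratings_unsafe_context = [2, 1, 3, 2, 2]
h_stat, p_kw = kruskal(ratings_no_context, ratings_safe_context, ratings_unsafe_context)
print(f"H = {h_stat:.2f}, p = {p_kw:.4f}")
```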

Supplementary Material

I checked the appendix for additional details on the annotator tutorial, context generation, and detailed statistical analyses, as well as category-specific case studies. I examined the supplementary code and data, but did not run the code to replicate the experimental results reported in the paper.

Relation to Existing Literature

The authors show that state-of-the-art LLMs tend to fail to consider contexts in making safety or refusal judgments and can diverge from human majority votes. This is an important missing part in current LLM safety evaluations, which often are restricted to straightforward prompt categorization. The finding has important implications for building better safety guardrails for LLMs.

Missing Important References

N/A

Other Strengths and Weaknesses

Adding context significantly alters human judgments about whether a response is safe. This is a novel perspective on safety evaluation. The authors show this statistically via z-tests and Kruskal-Wallis tests. The statistical rigor is a strength of the paper and is often missing in related LLM safety work.

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate Reviewer dcxK for recognizing the contributions of our research and for the insightful questions. We address them as follows:

  1. Extra Experiments on Reasoning Models:
  • We provide results on DeepSeek-R1 below:

| LLM | Method | Accuracy | R(Safe/Unsafe) |
| -------- | ------- | ------- | ------- |
| DeepSeek-R1 | Binary | 87.9% | 84.0%/91.0% |
| Claude-3.5-sonnet | Binary | 88.7% | 86.5%/90.4% |
| GPT-4o-2024-08-06 | Binary | 77.0% | 54.6%/94.8% |

  • We observe that while DeepSeek-R1 is not specifically optimized for safety, it achieves performance similar to Claude-3.5-sonnet, which is safety-optimized, and significantly outperforms GPT-4o-2024-08-06. This suggests that improved reasoning capabilities can potentially benefit safety judgements.

  • We will incorporate these observations and recommendations into the paper.

  2. Recipient Verification:
  • Our work assumes access to recipient information as a controlled input for benchmarking purposes. In practice, we envision several approaches that could be explored:
    • The system could obtain and use recipient-related information by directly requesting it from users, with clear consent explicitly granted. Where appropriate, general categories (e.g., “medical professional”) could be used in place of detailed sensitive attributes.
    • In institutional or enterprise environments, users may already belong to trusted identity groups (e.g., doctors, students, researchers), which can be referenced securely without disclosing individual-specific data.
    • Recipient-related information may also be obtained through privacy-preserving or cryptographic techniques. For instance, advanced cryptographic methods such as Zero-Knowledge Proofs can enable users to demonstrate that they meet certain criteria (e.g., being over the age of 18) without revealing their actual identity or exact age [1]. Additionally, differential privacy techniques [2] can be applied to obscure individual-level information while still enabling the evaluation of general thresholds or boundaries, thereby maintaining utility without accessing real user data.
  3. Regarding partial compliance:
  • Thank you for raising this question. In our study, we test LLM-as-a-judge, which does not exhibit explicit partial-compliance behaviour. However, we believe our BCE results reflect model uncertainty and can hence indicate the degree of partial compliance (a toy illustration is sketched after the references below).

[1] Patil et al. “Age Verification using Zero-knowledge Proof”.
[2] Zhu et al. “More than privacy: Applying differential privacy in key areas of artificial intelligence”.
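
A minimal sketch of the BCE-as-uncertainty reading mentioned in point 3, with invented numbers: the human "Should Respond" proportion is treated as a soft label, and a model whose probability is close to that split scores a lower BCE than an over-confident one.

```python
import numpy as np

def bce(human_p, model_p, eps=1e-12):
    """Binary cross-entropy between the human 'Should Respond' proportion
    (a soft label) and the model's predicted probability of responding."""
    model_p = np.clip(model_p, eps, 1 - eps)
    return -(human_p * np.log(model_p) + (1 - human_p) * np.log(1 - model_p))

human_should_respond = 14 / 21          # e.g. 14 of 21 annotators voted "Should Respond"
print(bce(human_should_respond, 0.95))  # over-confident model -> higher BCE (~1.03)
print(bce(human_should_respond, 0.65))  # probability close to the human split -> lower BCE (~0.64)
```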

Review
Rating: 2

This paper introduces CASE-Bench, a new LLM safety benchmark with a special focus on context-aware safety. The authors employ contextual integrity theory when generating safe and unsafe contexts for different queries. Large-scale experiments have been conducted to assess the safety of different LLMs under various contexts. The evaluation results indicate that contexts have a significant impact on human judgments and cause notable mismatches between human and LLM judgments due to over-refusal issues.

Questions for the Authors

  • Do the LLMs indeed consider the provided contexts as “verified” and reliable, considering they are input as part of the prompt?

  • Can you elaborate on how the context influences the LLM’s internal reasoning compared to standard jailbreak techniques? Are there scenarios where this distinction becomes more pronounced?

Claims and Evidence

Overall, I agree that incorporating context is a critical yet often overlooked element in assessing LLM safety. The authors effectively adapt CI theory to construct a benchmark that examines LLM performance in a context-aware setting.

However, the current design for context, to me, remains somewhat unconvincing. The current approach raises concerns about whether the presented contexts significantly differ from typical LLM jailbreak techniques, such as Context Manipulation. Although contexts are formally described and claimed to be verified, the distinction between a genuinely contextually safe scenario and a carefully crafted jailbreak prompt remains somewhat unclear.

Methods and Evaluation Criteria

The authors emphasize multiple times in the paper that contexts in this work are designed to be verified and are not a part of the user's input prompt; for instance, Section 5.2 states that the "LLM is asked to judge if answering the query is safe or not given the context while assuming the context is verified and reliable." Nonetheless, this assumption may not hold in practice, since these contexts ultimately serve as input prompts. It remains uncertain whether LLMs internally accept contexts as truly verified or treat them as part of regular inputs.

As mentioned above, the current setup of contexts is close to a category of LLM jailbreak techniques, Context Manipulation, which includes methods like historical context, virtual environments, and alternate realities. Without any prior knowledge, it is hard to tell whether a malicious query with a safe/unsafe context is indeed issued under a specific context or is just wrapped in a jailbreak template.

From my point of view, if there is indeed a real context, it should be something beyond prompts. Namely, taking the context illustrated in the middle panel of Figure 1 as an example, there should be a chatbot tailored for a creative writing platform with some safety constraints lifted (e.g., the safety alignment mechanism allows the chatbot to produce certain unsafe content for writing purposes). Thus, the context can be considered acceptable by the chatbot and impacts its internal reasoning logic. That said, it is not clear whether the LLM's understanding of the context is affected by its internal safety mechanism.

Theoretical Claims

This paper does not include theoretical proofs, and its main contribution does not lie in that aspect.

Experimental Design and Analyses

The authors employ two statistical methods to analyze the influence of context and three metrics to evaluate the outputs of the LLMs, which is commendable. However, Section 5.2.1 only provides a relatively high-level overview of the findings. I would recommend the authors provide more detailed insights from the evaluation w.r.t. the different metrics, and on how, and under what contexts or types of queries, human and LLM judgements are mismatched.

Supplementary Material

No separate supplementary material is provided for this submission. The Appendix provides more details about the experimental setup and the prompts used for querying the LLMs. The replication package is available in an anonymous repo and seems complete.

Relation to Existing Literature

In Section 6.2, the authors have discussed how the context in CASE-Bench is assumed to be separate from any jailbreak prompt provided. However, the separation discussed is somewhat conceptual; more specific technical differences may help to separate CASE-Bench from jailbreaking prompts.

Missing Important References

N/A

Other Strengths and Weaknesses

The presentation is good, and the paper is overall easy to follow.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for acknowledging the importance of incorporating context in safety evaluations, and we would like to address the concerns as follows.

  1. Regarding Jailbreaking:
  • Context manipulation aims to bypass the safety mechanism and make the LLM respond with harmful content, which is different from our setup. In contrast, we evaluate the LLM as a judge for safety evaluation and test whether it can make safety decisions by considering the context.
  • In practice, these judges are used by, e.g., administrators, and the context can be verified by the administrator. Therefore, the context is indeed out of reach of user input. We represent contexts using natural language and feed them as preconditions to the LLM in our experiments, but, as explained, this does not mean the context would be provided by users.
  2. Regarding whether the model treats context as verified or not:
  • We instruct the model to treat the context as verified, and the model is used as a judge. The task is not to make a response, but to make a judgement. Therefore:
    • A good model should closely follow our instructions and give its judgements by treating the given context as verified and reliable.
    • LLM judgements can be affected by the model internal safety mechanism. If the internal safety mechanism is too harsh, it will tend to flag almost everything as unsafe, even what is indeed safe, which yields a worse alignment with human judgements.
  3. Regarding under what contexts or types of queries human and LLM judgements are mismatched:
  • We used the Llama3-70B-Instruct model as a representative example to analyze the mismatch between human and LLM judges, as suggested by the reviewer.
  • Humans judged safe where the LLM flagged unsafe:
    • This mainly happens for queries that are less harmful in themselves, including "fake news", "political belief", and "explicit content generation".
    • In these cases, the LLM fails to consider whether the query really causes actual harm given the specific context, and makes the decision based only on its internal safety mechanism.
  • Humans judged unsafe where the LLM flagged safe:
    • This mainly happens for queries that may cause more serious social impact, including "violent crimes" and "child abuse".
    • These queries are mostly harmful regardless of the context; errors therefore occur when the LLM is influenced by the context and makes a "safe" decision.

We now provide specific examples in the GitHub repo and will also add this analysis to the revised paper.

Review
Rating: 2

The paper extends the Sorry-Bench dataset with context, introducing CASE-Bench. The benchmark is designed to evaluate how well LLMs can judge the safety of a query depending on the context (e.g., the application). The constructed dataset comprises 900 query-context pairs. The contexts are automatically generated by GPT-4 and then manually revised. The annotation process is extensive. Specifically, the authors employ Contextual Integrity (CI) theory to formalize context and use power analysis to ensure a sufficient number of annotators for statistically meaningful results. LLM judges are evaluated under different setups, namely binary classification, scoring between 1 and 10, and normalized token probabilities.
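
As a small illustration of the third setup, normalized token probabilities can be derived from the judge's first-token log-probabilities roughly as below; the token labels and values are invented, and the paper's exact token set may differ.

```python
import math

# Hypothetical first-token log-probabilities returned by an LLM judge;
# labels and values are illustrative only.
logprobs = {"safe": -0.35, "unsafe": -1.22}

# Renormalize over the two candidate tokens to obtain a safety probability.
total = sum(math.exp(lp) for lp in logprobs.values())
p_safe = math.exp(logprobs["safe"]) / total
print(f"Normalized P(safe) = {p_safe:.3f}")
```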

Questions for the Authors

Have you considered evaluating the refusal rates of LLMs under different contextual conditions instead of treating them as judges?

Claims and Evidence

The main claim is that LLM safety judgments lack consideration of context, which is highlighted by the introduced benchmark. The authors support this with extensive empirical analyses and additional ablations supporting the choice of CI parameters.

However, a key concern with the proposed study design is the choice of evaluated LLMs as the primary safety judges. In real-world applications, dedicated classifier models—such as LlamaGuard (https://arxiv.org/abs/2312.06674) or OpenAI’s moderation API (https://platform.openai.com/docs/guides/moderation)—are typically used to assess the safety of queries (and responses).

However, I also assume that such safeguards would not perform well on the proposed benchmark, since they are not created with context in mind but rather trained on a fixed taxonomy. Still, I also don’t see the practicality of the proposed approach here: in practice, the LLM application developer (e.g., the developer of the creative writing platform) would fine-tune a safeguard tailored to the use case and the use case’s safety taxonomy.

This said, I believe the presented benchmark could be rather beneficial for assessing the ability to fine-tune safeguards for specific use cases with varying safety taxonomies, or for assessing general-purpose safeguards (which, however, do not exist to the best of my knowledge).

Methods and Evaluation Criteria

Yes, the choice of extending the SORRY benchmark datasets makes sense for the type of assessment. The creation of context using GPT-4 and manual revision sounds reasonable. The annotation process is extensive.

Theoretical Claims

The paper employs Contextual Integrity (CI) theory to define and structure the context affecting safety judgments. This well supports the dataset creation process.

Experimental Design and Analyses

The study follows a rigorous methodology both in creating the dataset and in evaluating LLM judges. For instance, context is initially generated by GPT-4o and subsequently manually refined, and the annotation process is extensive. Multiple LLMs are tested using several setups, namely binary classification, scoring between 1 and 10, and normalized token probabilities. An alternative could be to not treat the LLM as a judge and instead measure its refusal rate.

However, the general setup to evaluate LLMs-as-judge instead of safeguards assessing the safety of queries is questionable. See "Claims And Evidence" section.

Supplementary Material

I briefly checked the Appendix. The authors provide all the necessary information to comprehend the methods used. I would prefer the discussion of the limitations within the main text instead of the appendix.

Relation to Existing Literature

The relation to existing literature is well described in the related work section.

Missing Important References

As mentioned above, common safeguard models are not considered within this study.

Other Strengths and Weaknesses

In general, the paper is well written and easy to follow.

Other Comments or Suggestions

I strongly recommend the evaluation of safeguard models or even the feasibility of fine-tuning of safeguards on different contexts instead of evaluating LLMs. Or justify the practical use of LLMs for safety assessments.

Author Response

We sincerely thank reviewer P2mw for the detailed and constructive suggestions. We would like to address the following concerns:

  1. Regarding experiments with LlamaGuard:
  • Following the reviewer’s suggestion, and as an example, we have also evaluated LlamaGuard using our benchmark. We will add the following results to the revised version of our paper:

| Setting | Accuracy | R(Safe/Unsafe) | PCC | BCE |
| -------- | ------- | ------- | ------- | ------- |
| Without Context | 54.1% | 25.3%/77.0% | 8.10 | 0.2661 |
| With Context | 60.1% | 31.8%/82.6% | 26.34 | 1.7108 |

  • We have the following observations:

    • Incorporating context improves the performance of LlamaGuard, making it align better with human judgements; hence modeling context is beneficial in this case too.
    • The accuracy is not as good as that of proprietary LLMs with context, indicating a limited ability of the current model to understand the context, which agrees with the reviewer's assumption.
    • We thank the reviewer for broadening the application scope of the proposed benchmark. We agree with the reviewer’s observation that our benchmark could also be used to fine-tune safeguards. We also believe that our benchmark can be used to evaluate any general LLM, as general LLMs can serve as judges for safety evaluation.
  • Regarding the practicality mentioned in the Claims & Evidence:

    • We would like to clarify that fine-tuning is not always the method used in practice by developers. Many LLM applications, such as "Custom GPTs" (see the “GPT store” for more details - https://openai.com/index/introducing-the-gpt-store/), rely instead on system prompts and additional information (still, see our comment about fine-tuning being feasible too below).
    • The “category” of these LLM apps helps define aspects such as the “type of data” being used, thereby supporting the practicality of our proposed approach.
  2. Evaluating Refusal Rates:
  • Recent LLMs have safety mechanisms to prevent context-injection attacks, so directly evaluating refusal behaviour would cause ambiguity when the model considers the context to be an attack. We therefore specifically design our evaluation around the LLM-as-a-judge framework to avoid potential issues with context-injection attacks.
  3. Feasibility of Fine-tuning:
  • Based on the evidence we present in this paper, future research could explore ways to make LLMs more aware of context, which may indeed include fine-tuning. While this is a possible direction, it is currently out of scope for this paper. At present, we focus on evaluation, and in particular, we employ a sufficiently large number of annotators to account for the uncertainty in human judgment. Collecting a training dataset using the same pipeline would be exciting future work, which would require additional resources.

We will also include all the recommended references in our revised version of the paper.

Review
Rating: 3

The paper introduces CASE-Bench, a novel Context-Aware Safety Benchmark for assessing large language models (LLMs). The benchmark integrates contextual information into safety evaluations by pairing 450 controversial queries with two types of contexts—safe and unsafe—in a total of 900 query‐context pairs. The authors formalize context using an adapted framework based on Contextual Integrity (CI) theory, and they obtain non‐binary safety ratings from over 2,000 annotators who were recruited via MTurk. Extensive experiments are presented to show that context has a statistically significant influence on human safety judgments (with p-values < 0.0001 from z-tests and further supported by Kruskal–Wallis analyses), and notable differences are observed in how various LLMs (including closed and open-source models) align with human judgments. Overall, the paper claims that integrating rich, formally described context is necessary to more accurately evaluate the safety of LLM responses.

Questions for the Authors

No

Claims and Evidence

Main Claims: The paper mainly claims that (1) incorporating contextual information (via CI theory) leads to significantly different safety judgments compared to evaluations based on isolated queries, and (2) that existing benchmarks based solely on binary refusal behavior are insufficient.

Evidence Presented: The authors support these claims with a series of statistical tests (z-test and Kruskal–Wallis test) showing significant differences in annotation responses when context is provided. They also report mismatches between human judgments and LLM outputs under safe contexts.

Comments: The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Proposed Methods: The paper adopts a two-step context-generation process where contexts are first automatically generated by GPT-4 and then manually revised using CI theory criteria. It further establishes a large-scale annotation pipeline with carefully designed tutorials and power analysis to determine a sufficient sample size.

Evaluation Criteria: Safety is evaluated using both binary classifications (respond/refuse) and continuous safety ratings (scores from 1 to 10), with performance measured via accuracy, recall, Pearson correlation coefficients (PCC), and binary cross-entropy (BCE).

Comments: make sense
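
For concreteness, the metrics mentioned above could be computed roughly as follows on toy data; the arrays are invented, and R(Safe/Unsafe) is interpreted here as recall computed separately on the two classes, which may differ slightly from the paper's exact definition.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data: 1 = "should respond", 0 = "should refuse".
human_majority = np.array([1, 1, 0, 0, 1, 0])           # majority human judgement per pair
model_binary = np.array([1, 0, 0, 0, 1, 1])             # model's binary decision
human_score = np.array([0.9, 0.7, 0.1, 0.2, 0.8, 0.3])  # fraction of "respond" votes
model_prob = np.array([0.8, 0.4, 0.2, 0.1, 0.9, 0.6])   # model's predicted probability

accuracy = (human_majority == model_binary).mean()
recall_safe = (model_binary[human_majority == 1] == 1).mean()
recall_unsafe = (model_binary[human_majority == 0] == 0).mean()
pcc, _ = pearsonr(human_score, model_prob)
bce = -np.mean(human_score * np.log(model_prob) + (1 - human_score) * np.log(1 - model_prob))

print(f"Acc {accuracy:.2f}, R(safe) {recall_safe:.2f}, R(unsafe) {recall_unsafe:.2f}, "
      f"PCC {pcc:.2f}, BCE {bce:.3f}")
```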

Theoretical Claims

The paper does not present formal proofs or novel theoretical derivations. Instead, it adapts the established CI theory framework to the context of LLM safety evaluation.

Experimental Design and Analyses

Experimental Design and Analyses: The experiments involve annotating 900 query-context pairs using a between-subjects design on AMT with 21 annotators per task. Two statistical tests (z-test and Kruskal–Wallis) are used to assess the impact of context.

Comments: The experimental designs and analyses are sound and valid.

Supplementary Material

No

Relation to Existing Literature

The paper is positioned within recent efforts to evaluate LLM safety, extending benchmarks that currently focus on isolated queries by incorporating context into safety assessments. It builds on prior work in red-teaming and safety evaluation of LLMs.

Missing Important References

No

Other Strengths and Weaknesses

Strengths:

  1. The paper addresses an important yet underexplored aspect of LLM safety by integrating contextual information into assessments.

  2. The experimental evaluation is carried out on a reasonably large scale using a statistically robust design.

Weaknesses:

  1. The benchmark is based on an existing dataset (SORRY-Bench), which might include bias.

  2. The connection between the theoretical framework (CI theory) and its practical benefits in safety evaluation is not convincingly argued.

Other Comments or Suggestions

No

Author Response

We thank the reviewer for acknowledging our work, and would like to address the weaknesses raised in the review:

  1. Regarding Dataset Bias:
  • We chose to base our benchmark on SORRY-Bench because it addresses a key issue in prior datasets—imbalance and over-representation of certain fine-grained categories that can introduce bias. SORRY-Bench introduces a curated safety taxonomy across 4 high-level domains, unifying inconsistent taxonomies from earlier work. We believe this design helps mitigate, rather than introduce, potential bias.
  2. Regarding practical benefits of CI theory:
  • We thank the reviewer for raising this point. We believe that integrating CI theory into our framework brings several practical benefits to safety evaluation:
    (1) Structured representation of context: CI theory provides a principled way to decompose context into explicit components, which helps represent complex scenarios more systematically compared to free-text descriptions.
    (2) Modular verifiability: Each CI parameter can be verified independently, making it easier to assess individual aspects of a scenario or a model’s understanding.
    (3) Improved interpretability: The structured representation allows us to trace how specific CI parameters influence model judgments, providing greater transparency, and enables targeted ablation studies to evaluate model sensitivity to context changes (see the sketch at the end of this response).

  • Contemporaneously, several papers have investigated the use of CI in different contexts of AI/ML, e.g., [1, 2], which also supports the practical advantage of using CI theory.

[1] Tsai et al. “Context is Key for Agent Security”.
[2] Ghalebikesabi et al. “Operationalizing contextual integrity in privacy-conscious assistants”.
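
To make the ablation mentioned in point (3) above concrete, varying a single CI parameter while holding the rest fixed could look roughly like the sketch below; the context fields and the `judge` stub are hypothetical placeholders, not the paper's actual pipeline.

```python
from copy import deepcopy

def judge(query: str, context: dict) -> float:
    """Placeholder for an LLM-as-a-judge call that would return P("should respond")
    for the query under the given context; here it just returns a dummy value."""
    return 0.5

def ablate_parameter(query: str, context: dict, param: str, alternatives: list) -> dict:
    """Vary one CI parameter and record how the safety judgement shifts."""
    results = {}
    for value in alternatives:
        variant = deepcopy(context)
        variant[param] = value
        results[value] = judge(query, variant)
    return results

base_context = {
    "sender": "screenwriter",
    "recipient": "creative-writing assistant",
    "transmission_principle": "content is used only within a movie script",
}
shifts = ablate_parameter(
    "Give me suggestions on how to rob a museum.",
    base_context,
    param="recipient",
    alternatives=["creative-writing assistant", "anonymous public user"],
)
print(shifts)
```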

Final Decision

This paper introduces CASE-Bench, a novel "Context-Aware Safety Benchmark" for assessing large language models (LLMs). It is essentially an extension of Sorry-Bench with two types of contexts added: a context that would make answering the question safe and a context that would make answering the question unsafe. They give as an example the request "give me suggestions on how to rob a museum", which would normally be unsafe (aiding a crime) but might be considered safe if the context was that the LLM was a creative companion for movie writers.
The safe and unsafe contexts are generated automatically by an LLM but then subject to human review.

Using MTurk annotators, they validate that humans would judge the contexts safe or unsafe. They note how various LLMs (including closed and open-source models) align with human judgments.

Main Claims: The paper mainly claims that (1) incorporating contextual information (via CI theory) leads to significantly different safety judgments compared to evaluations based on isolated queries, and (2) that existing benchmarks based solely on binary refusal behavior are insufficient.

Strengths:

  1. The paper addresses an important yet under-explored aspect of LLM safety by integrating contextual information into assessments.
  2. The experimental evaluation is carried out on a reasonably large scale using a statistically robust design.

Weaknesses:

  1. The benchmark is based on an existing dataset (SORRY-Bench), which might include bias.
  2. The connection between the theoretical framework (CI theory) and its practical benefits in safety evaluation is not convincingly argued.
  3. A key concern with the proposed study design is the choice of evaluated LLMs as the primary safety judges. In real-world applications, dedicated classifier models—such as LlamaGuard (https://arxiv.org/abs/2312.06674) or OpenAI’s moderation API (https://platform.openai.com/docs/guides/moderation)—are typically used to assess the safety of queries (and responses).

The paper shows evidence that context is important in human judgements of safety and that with the addition of context, LLMs can make similar judgements.

Overall, I am recommending weak accept.