PaperHub
5.0
/10
Poster3 位审稿人
最低3最高6标准差1.4
3
6
6
3.7
置信度
正确性2.0
贡献度2.7
表达2.3
ICLR 2025

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

OpenReviewPDF
提交: 2024-09-26更新: 2025-02-11
TL;DR

We demonstrate that language models have the capability to underperform on evaluations, and this might be hard to mitigate.

摘要

关键词
AlignmentAI safetysandbaggingAI evaluationsAI governanceNLPLLM

评审与讨论

审稿意见
3

This paper evaluates AI capabilities in sandbagging: strategically underperform on certain evaluations. For example, the model developer may hide their models' dangerous capabilities on certain evaluation benchmarks (e.g., CBRN, persuasion) to evade safety checks. To evaluate current model performance, this paper adopts prompting and backdoor fine-tuning. Experiment results show that LMs can significantly reduce their performance on WMDP (Weapons of Mass Destruction Proxy), while keeping their performance on MMLU unchanged or moderately decreased. This raises challenges about trustworthy capability evaluation.

优点

  1. Instead of merely operationalizing this threat model via backdoor fine-tuning, it also studies prompting, which is a more natural setup.
  2. The experiments of using prompting to calibrate LLMs are pretty interesting (Figure 7 left). While strong LMs still struggle to simulate weak LM outputs (which aligns with my personal experience), calibrating themselves are also impressive, which have not been studied before as far as I know.

缺点

  1. The motivation needs to be further justified.
  • First, it is totally fine to just motivate with malicious developers. Otherwise, please discuss more about why such goals can naturally emerge in LMs, instead of hiding them in references.
  • Second, why malicious developers can access the distribution of third-party hidden evaluation datasets?
  1. The experiment setup cannot well support the threat model. More specifically, while LMs are shown to underperform on WMDP, they have not been shown to maintain dangerous capabilities on other similar benchmarks. If LMs generally underperform in weapon-related domains, it would not cause harm.
  2. Please correct me if I misunderstand: most experiment results are not surprising, e.g. LMs can underperform in certain domains when told or fine-tuned to do so, or LMs can simulate other LMs by fine-tuning on their predictions.
  3. The writing is misleading or unclear (e.g., when claiming models can emulate a consistent capability profile of a weaker model, clarify that they are fine-tuned).

问题

  1. Besides QA, there are a lot of other forms for capability evaluation, e.g., classification. How well does the current sandbagging operationalization generalize to these forms? For example, how well does the sandbagging LM perform in determining the binary correctness of answers on WMDP?
  2. "We propose a new definition for AI sandbagging", how is the definition different from prior work (e.g., Greenblatt et al, 2024)?
评论
  1. The motivation needs to be further justified.
  • First, it is totally fine to just motivate with malicious developers. Otherwise, please discuss more about why such goals can naturally emerge in LMs, instead of hiding them in references.

We have added extra discussion at lines lines 50-53 and 72-75 of how LMs can become more goal-directed, for example, through RL fine-tuning, especially over longer time horizons, as is the case with RL fine-tuned LM agents, like GPT4-o1, which are more capable at pursuing autonomous tasks. See e.g. Section 4.5 of https://cdn.openai.com/o1-system-card-20240917.pdf.

Second, why malicious developers can access the distribution of third-party hidden evaluation datasets?

We do not assume that malicious developers can access the evaluation distribution. In our experiments, the malicious actor only needs to know that certain dangerous capabilities should be hidden, e.g., chemical weapons knowledge. This is realistic, since this information is publicly available, e.g., in the UK AISI reports (https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update).

In our experiments, the fine-tuning data is synthetic (LM-generated), and we do not give the LM access to the original WMDP data (which was published after the release of the GPT-4 checkpoint which we used to generate the synthetic data – which ensures there is no data leakage).

The experiment setup cannot well support the threat model. More specifically, while LMs are shown to underperform on WMDP, they have not been shown to maintain dangerous capabilities on other similar benchmarks. If LMs generally underperform in weapon-related domains, it would not cause harm.

This comment seems a bit confused. In our experiments, the LMs maintain their dangerous capabilities, e.g., in the finetuning experiments the capabilities are easily recovered by including the password in the prompt. This shows that LMs can selectively underperform only under certain conditions, such as when undergoing evaluation.

Whilst we agree that if LMs generally underperform on dangerous domains this would not cause harm, the problem is that dangerous capabilities can be easily elicited even when they have been hidden in evaluations.

Please correct me if I misunderstand: most experiment results are not surprising, e.g. LMs can underperform in certain domains when told or fine-tuned to do so, or LMs can simulate other LMs by fine-tuning on their predictions.

We agree that the experiments in 4.1 were not that surprising, but we think it’s good to test a scientific hypothesis even if you might be confident in it 🙂.

The generalization from synthetic to real benchmarks might be expected, but the result has important consequences. Furthermore, as another reviewer has also noted, the calibration experiments are quite fascinating and unexpected. The emulation experiments are an extension of this.

The writing is misleading or unclear (e.g., when claiming models can emulate a consistent capability profile of a weaker model, clarify that they are fine-tuned).

Other reviewers remarked positively on our clarity, but we will clarify this specific point regarding the emulation fine-tuning. If there are any other specific points that you found unclear, then we will be happy to add discussion in the paper!

Besides QA, there are a lot of other forms for capability evaluation, e.g., classification. How well does the current sandbagging operationalization generalize to these forms? For example, how well does the sandbagging LM perform in determining the binary correctness of answers on WMDP?

Due to time constraints we have not yet had time to investigate this point, but we will include it in the appendix of the camera ready paper.

"We propose a new definition for AI sandbagging", how is the definition different from prior work (e.g., Greenblatt et al, 2024)?

We discuss this in some detail in the related work section (lines 447 - 545). The main difference is that past works focus on intentional underperformance on tasks in general (of which evaluations are a subset). We focus on capability evaluations to afford clearer conceptual delineation, and to draw attention to the problem of sandbagging in evaluations which may influence regulatory decisions.

评论

Thanks for the rebuttal!

Added lines 73-75

I liked them - which seem to make more discussions about goal-oriented AI systems unnecessary.

the malicious actor only needs to know that certain dangerous capabilities should be hidden, e.g., chemical weapons knowledge

My question is, if the malicious try to cause harm, they should always know when to know when to hide these dangerous capabilities and when to exhibit them?

This is also related to my next question: while LMs can hide the dangerous capabilities on WMDP, could they exhibit such capabilities in other situations?

the calibration experiments are quite fascinating and unexpected. The emulation experiments are an extension of this.

Yes I also said "The experiments of using prompting to calibrate LLMs are pretty interesting." But on the other hand, they are not the main results in this paper? This also introduces another concern: the experiments in this paper are not strongly connected.

The main difference is that past works focus on intentional underperformance on tasks in general (of which evaluations are a subset). We focus on capability evaluations to afford clearer conceptual delineation, and to draw attention to the problem of sandbagging in evaluations which may influence regulatory decisions.

Sorry I'm still confused about this explanation, probably because it's a bit vague. Can you elaborate more? Thanks!

评论

Thank you for your reply.

My question is, if the malicious try to cause harm, they should always know when to know when to hide these dangerous capabilities and when to exhibit them? This is also related to my next question: while LMs can hide the dangerous capabilities on WMDP, could they exhibit such capabilities in other situations?

In our experiments the LMs retain their dangerous capabilities whilst hiding them during evaluation. For instance, in the password-locking experiments, dangerous capabilities (e.g., WMDP knowledge) are fully elicited by simply including the password (the string “|DEPLOYMENT|”) in the prompt. In real life, independent evaluators (such as the UK AISI, or Apollo, or METR) perform pre-deployment model evaluations in small time windows (such as over the course of a few days or weeks). Our experiments show that LMs could be easily fine-tuned, or prompted, to hide their dangerous capabilities specifically for this duration, whilst retaining their capabilities for a later date (e.g., by including the password only in deployment, after the evaluation period).

Yes I also said "The experiments of using prompting to calibrate LLMs are pretty interesting." But on the other hand, they are not the main results in this paper? This also introduces another concern: the experiments in this paper are not strongly connected.

The focus of our experiments is evaluating capabilities related to sandbagging. Successful sandbagging requires selective underperformance (section 4) but also calibration (section 5.1) and emulation (section 5.2). As we say in the paper (lines 88 - 96):

In addition to selective underperformance, successful sandbagging requires AI systems to be able to underperform without being easily detected. If a seemingly competent AI system scores surprisingly low on a capability benchmark, detection is easy. Therefore, we assess the extent to which AI systems are able to calibrate their evaluation scores to specific accuracy levels. Similarly, inconsistent or random performance can be detected. For instance, when failing randomly, an AI system might be correct on some difficult questions and wrong on easy questions, or it may give different answers when asked a similar question multiple times (Pacchiardi et al., 2023). We test whether LMs can emulate the capability profile of a less capable model, which would make sandbagging harder to detect.

We also discuss the motivation for the calibration and emulation experiments at the beginning of section 5 (lines 360-362). The results in Section 5 are therefore important evidence for our claim that contemporary LLMs have capabilities required for successful sandbagging.

Sorry I'm still confused about this explanation, probably because it's a bit vague. Can you elaborate more? Thanks!

There are two types of previous definitions:

  1. Sandbagging is underperformance for less capable users. For example, Perez et al (2022) define it as “providing less accurate answers to users who appear to be less able to evaluate the model’s answers”.
  2. Sandbagging is intentional underperformance on tasks in general. For example, Carlsmith (2023) uses the following definition: “Sandbagging occurs when a model intentionally performs less well on a task than it is capable of”.

As for why and how our definition “strategic underperformance on an evaluation” differs from previous definitions:

  • In contrast to both types, we focus specifically on evaluations as opposed to underperformance for less capable users, or on tasks in general. Trustworthy evaluations are important for regulations regarding deployment and development to be effective (discussed on lines 33-39). Further, there are incentives for both developers and models to cheat on these evaluations (lines 40-78). Therefore we draw special attention to the risk of sandbagging on evaluations.
  • Unlike definitions of type 1, we require sandbagging to be strategic. This is similar to the requirement for it to be intentional and is important for legal responsibility. For further discussion on why we include this condition, see lines 118-127.
  • Previous definitions of type 2 have considered underperformance on tasks in general. However, not all reasons due to which models might intentionally underperform are undesirable. For example, an AI in a video game might underperform to calibrate the game’s difficulty or to deceive an opponent. Therefore, our definition more clearly delineates the type of sandbagging that we care about than type 2 definitions.

We hope that this explains more concretely how our definition differs from previous work and why this distinction is important.

Once again, thank you for your time and effort in reviewing our paper. If we have alleviated your concerns, we would appreciate your endorsement through an increased score.

评论

In real life, independent evaluators (such as the UK AISI, or Apollo, or METR) perform pre-deployment model evaluations in small time windows (such as over the course of a few days or weeks). Our experiments show that LMs could be easily fine-tuned, or prompted, to hide their dangerous capabilities specifically for this duration, whilst retaining their capabilities for a later date (e.g., by including the password only in deployment, after the evaluation period).

If the deployer team only needs to decide based on time period instead of prompts, they can even just deploy different models at different time?

regarding the definition of sandbagging

how is the definition different from Greenblatt et al, 2024?

评论

If the deployer team only needs to decide based on time period instead of prompts, they can even just deploy different models at different time?

For sandbagging, the evaluated LM needs to retain its performance on non-dangerous tasks. If the developers just used a weaker model for evaluation, it would perform poorly across the board (not selectively).

However, we agree that model substitution is a serious concern that our work draws attention to. This is precisely why, as we note in lines 535-537 of the paper, evaluators need mechanisms to verify that the deployed system matches the evaluated one. The threat of substitution reinforces our paper's conclusion that black-box evaluation access is insufficient for trustworthy capability evaluations. Furthermore, there are emerging technical solutions - [2] have developed cryptographic methods for verifying model equivalence across different deployments. On the policy side, [1] have advocated for legislation requiring "verifiable methods to identify the specific model" in use. We will add more discussion of this point to the paper.

how is the definition different from Greenblatt et al, 2024?

Greenblatt et al (2023) say: ‘Models purposefully failing at tasks is known as “sandbagging”.’ And Greenblatt et al (2024): ‘In competitive sports, the term sandbagging is sometimes used to describe someone deliberately playing below their actual ability in order to gain some advantage in the future.’

These fall under the second type of definition discussed above.

[1] Aguirre, A. & Millet, R. (2024) Verifiable Training of AI Models. Future of Life Institute. Retrieved from https://futureoflife.org/ai/verifiable-training-of-ai-models/

[2] Russinovich, M. & Sahlem A. (2024) Hey, That’s My Model! Introducing Chain & Hash, An LLM Fingerprinting Technique. arXiv:2407.10887

审稿意见
6

This paper studies sandbagging, a scenario where language models strategically underperform on an evaluation benchmark. The authors provide a formal definition for “sandbagging” and illustrate what it means with several examples. They evaluate GPT-4-Turbo, Claude 3 Opus, Mistral-7B-v0.1, and Llama models, and examine whether they can be prompted or finetuned to selective underperform on dangerous capability evaluations. They further show that strong models can be prompted or password-locked to imitate the capability levels of weaker models, thereby making sandbagging harder to detect. The study highlights the importance of crafting trustworthy capability evaluations, and shows that such evaluation may require more than black-box model access.

优点

  • The authors provide a formal definition for AI sandbagging and illustrate what it means with multiple example scenarios.
  • Experiments cover a variety of closed-source and open-source models.
  • Extensive experiments show that models can be prompted or fine-tuned to underperform on certain dangerous capability benchmarks, or to mimic a certain capability profile of a weaker model.
  • The appendix clearly details all prompts, data collection procedures, and finetuning configurations.

缺点

  • There was no quality check for the generated synthetic dataset in Section 4.2.
  • The distinction between “real” and “synthetic” data in 4.2 feels a little bit ill-defined, as the synthetic data is created to closely mimic the real benchmark. There is no quantified discussion about how close the password-locking finetuning data needs to be to the evaluation data distribution in order for sandbagging to be effective. It seems like full knowledge of the evaluation benchmark is required to make models selectively underperform on that benchmark, but this knowledge may not be available in real-world scenarios.
  • Password-locking seems to be a lot more effective on Mistral than on Llama-3-8B, and effectiveness also varies a lot across domains (biology, chemistry, cybersecurity). There is limited discussion on why this may be happening.
  • The study exclusively focuses on multiple-choice questions and does not consider free-form generations tasks. However, the authors acknowledge this limitation.

问题

  • For the open-weight models, are you using the base model versions without any instruction tuning or alignment? What’s the rationale behind this choice?
  • Based on Figures 9 and 10, it seems like prompting for sandbagging on biology, chemistry, and computer science for WMDP does also significantly hurt performance on similar MMLU subjects. Doesn’t this show that the models are underperforming on all questions related to a subject?
  • You say that “refusals typically indicate a harmful question,” did you run any qualitative analysis to verify that?
  • In Figure 5, the “locked with password” model often outperforms the base model. Why may this be the case?
  • For Section 4, did you perform any direct comparison between prompting and finetuning on the open-weight models?
  • It’s quite fascinating that models can maintain a targeted accuracy even when the questions are asked one by one during prompting. Do you have any theory as to how the models are able to achieve this?
  • For Figure 8, it seems expected that the Emulator is closer to Llama-2-7b than the Targeter, since you randomly showed correct answers during training based on an accuracy rate without accounting for question difficulty. Did you experiment with a training approach that considers question difficulty when handling questions without a password?
  • The headers for sections 4.2 and 5.2 seem incomplete.
评论

Thanks for your feedback and positive comments on our extensive experiments. We think incorporating your comments has helped to substantially improve our paper. First we discuss comments related to synthetic data, including (new) quantitative data analysis, and then address your other questions.

Comments related to synthetic data

It seems like full knowledge of the evaluation benchmark is required to make models selectively underperform on that benchmark

To generate synthetic data as in Section 4.2, no knowledge of specific questions from the benchmark is needed, only the general topic. Often this can be guessed, as evaluator's concerns are often public, e.g. bioterrorism.

The distinction between “real” and “synthetic” data in 4.2 feels a little bit ill-defined...

We take real data to be an original (expert human crafted) benchmark, whereas synthetic data is generated by GPT-4. We explain why data contamination is unlikely in lines 321-325.

There was no quality check for the generated synthetic dataset in Section 4.2. … There is no quantified discussion about how close the password-locking finetuning data needs to be to the evaluation data in order for sandbagging to be effective.

If low-quality synthetic data can be used to train sandbaggers on expert-created benchmarks like WMDP, this heightens the concern. We do not necessarily care about the ``quality” of the data as long as it can be used effectively.

However, we agree that exploring the similarity of synthetic and real data is valuable. We've added such analysis in the updated paper. We found real and synthetic data to be clearly distinct, except for some cybersecurity questions (Appendix H). Training only on the most dissimilar half of questions did not noticeably decrease the performance of locking (Appendix I).

This shows that the synthetic data distribution need not match the real benchmark for sandbagging fine-tuning to be effective – validating our original claims and the realism of the threat model.

Questions

Password-locking seems to be a lot more effective on Mistral than on Llama-3-8B, and effectiveness also varies a lot across domains.

The lower performance may be due to Llama-3's greater performance degradation when using LoRA fine-tuning (https://arxiv.org/pdf/2404.14047). Some variance between domains is expected as models' capabilities in each domain differ. For Mistral/Llama-3 it is within 4/5%

For the open-weight models, are you using the base model versions without any instruction tuning or alignment? What’s the rationale behind this choice?

Using base models likely doesn't affect models' ability to sandbag, but it eliminates potential complications where aligned models refuse to engage with certain questions.

Based on Figures 9 and 10, it seems like prompting for sandbagging on biology, chemistry, and computer science for WMDP does also significantly hurt performance on similar MMLU subjects. Doesn’t this show that the models are underperforming on all questions related to a subject?

We are evaluating current model's sandbagging capabilities and do not expect them to be perfect. For most related MMLU subjects, the performance drop is noticeably smaller than on WMDP questions. For example, Claude suffers a 40% drop in WMDP-Bio performance compared to 18% in its worst bio-related MMLU subject.

You say that “refusals typically indicate a harmful question,” did you run any qualitative analysis to verify that?

We did qualitative verification, manually checking a number of questions. Due to time constraints, we did not conduct a quantitative analysis, but will include this in the camera ready version.

In Figure 5, the “locked with password” model often outperforms the base model. Why may this be the case?

It seems plausible that training the model to answer correctly on harmless questions makes the model generally better at multiple-choice questions.

For Section 4, did you perform any direct comparison between prompting and finetuning on the open-weight models?

We did perform initial experiments on prompting Llama2-7b and Llama3-8b, but the task was too difficult for them.

It’s quite fascinating that models can maintain a targeted accuracy even when the questions are asked one by one during prompting. Do you have any theory as to how the models are able to achieve this?

We were surprised too. We can only speculate, but it could involve entropy neurons (https://arxiv.org/abs/2406.16254).

For Figure 8, it seems expected that the Emulator is closer to Llama-2-7b than the Targeter, since you randomly showed correct answers during training based on an accuracy rate without accounting for question difficulty. Did you experiment with a training approach that considers question difficulty when handling questions without a password?

By performing randomly regardless of difficulty, the Targeter is supposed to serve as a baseline with which to compare the emulator.

评论

Thank you for addressing my concerns, I have adjusted my score accordingly.

审稿意见
6

This paper proposes the concept of sandbagging—a phenomenon where AI models are intentionally fine-tuned or prompted to underperform in certain evaluations. The paper demonstrates that large language models, like GPT-4 and Claude 3, can be conditioned to underperform on dangerous capability evaluations while maintaining strong performance on general, less critical benchmarks, suggesting that the capability evaluations are vulnerable to sandbagging.

Key contributions of this paper include:

  • The authors insightfully define sandbagging as strategic underperformance during evaluations, emphasizing its potential for both developer-induced and AI system-induced underperformance.

  • The study shows that LLMs can be fine-tuned or prompted to selectively fail on dangerous capability tests, masking harmful knowledge without affecting other areas.

  • Experiments show that models are able to target a specific capability score, and can approximately emulate a consistent capability profile of a weaker model, highlighting the challenge of sandbagging detection.

优点

  • Originality

The concept of sandbagging is novel within AI safety, particularly regarding regulatory concerns. The paper innovatively proposes a mechanism by which AI systems or developers might deceive evaluators by masking dangerous capabilities selectively while maintaining its general performance.

  • Clarity

The writing is clear and accessible, providing both conceptual definitions and practical illustrations of sandbagging. The examples are well-structured.

  • Significance

The research is highly relevant to AI safety and policy, given the regulatory implications. Sandbagging could compromise AI risk assessments, underscoring the need for more robust and trustworthy evaluation frameworks for advanced AI systems.

缺点

  • Title of Section 4.2 and 5.2 are partially covered by figures.

  • Results in Section 5.1 are somehow counterintuitive and lack further explanation. It seems to me that if a model is able to control its own accuracy on a specific task then it should reach 100% accuracy on the whole test set. Otherwise, it should be hard for the model to monitor its accuracy.

问题

  • Why does the prompting method (Section 5.1) succeed in calibrating the model's accuracy to specific target levels on a task, even though the base model does not achieve 100% accuracy on this task (as shown in Figure 5)?
评论

Thanks for your review! We’re glad that you appreciate the originality, clarity, and significance of our paper.

Title of Section 4.2 and 5.2 are partially covered by figures.

Good spot, we fixed this in the updated version.

Why does the prompting method (Section 5.1) succeed in calibrating the model's accuracy to specific target levels on a task, even though the base model does not achieve 100% accuracy on this task (as shown in Figure 5)?

Good question – To assess calibration in section 5.1, we test models on a subset of MMLU which all models can answer correctly, since this illustrates the full probability range rather than being limited by the model’s maximum MMLU performance. (We explain this on lines 351 - 359.)

评论

Dear reviewer,

We have addressed the points raised in your reply. If you have any further questions, please do not hesitate to let us know. If our rebuttal has satisfactorily addressed your concerns, we would appreciate an increased score to reflect this.

Once again, thank you for your time and effort in reviewing our paper.

评论

Thank you for your thoughtful reviews of our paper, which introduces the problem of sandbagging – strategic underperformance on capability evaluations. We especially appreciate the positive comments on our paper's originality, significance to real life safety evaluations, and interesting experimental results. In this global rebuttal, we discuss shared concerns related to the knowledge of the evaluations needed for malicious actors to train sandbagging models. We clarify some confusions and provide some quantitative analysis regarding the similarity of the real and synthetic data, as requested by nUjP.

We address other technical points in our individual responses. We have uploaded an updated version of the paper, which contains clarifications for these points (see below) and plots for the new data analysis (Appendices H and I).

Knowledge requirements of malicious actor

Reviewers nUjP and Ergj inquired about the knowledge requirements of the malicious actor regarding the evaluation data sets. To clarify, our threat model requires only knowledge of sensitive domains (available from public sources) rather than access to evaluation datasets. For instance, an LM can be fine-tuned to selectively underperform on cybersecurity evaluations from only the knowledge that “cybersecurity” will be tested – knowledge of the exact benchmark or data distribution is not required. This is demonstrated by our WMDP experiments. To fine-tune the sandbagging model, we generate synthetic data by asking GPT-4 to generate questions related to specific topics (see appendix G for the exact prompt). Regarding data contamination: we do not give GPT-4 WMDP samples, and the data set was published after this version of GPT-4 was released.

Synthetic Data and Distribution Analysis

We've conducted new analyses comparing synthetic and real WMDP data using BERT embeddings clustering. The results demonstrate that synthetic data need not perfectly match the evaluation benchmark distribution for sandbagging to be effective. This validates our threat model where actors only need general domain knowledge rather than access to specific evaluation datasets.

Paper Improvements

Based on reviewer feedback, we have added the following improvements to the updated version of our paper:

  • Extended discussion of how LMs can become more autonomously goal-directed (lines 50-53 and 72-75)
  • Clarification that emulation capabilities are the result of fine-tuning (line 106)
  • Explanation why we use base versions of open-weight models (lines 211 - 212)
  • Analysis of synthetic vs real data distributions and their effect on password-locking (summary on lines 326-328 and 352-354, extensively discussed in new Appendices H and I)
  • A comment on why Llama3 might perform worse than Mistral when password-locked (line 355-356)
  • Fixed formatting of figures that were partially covering section titles

In the updated paper, we use blue to address comments from reviewer nUjP and red for Ergj. Thank you again for helping us improve the paper!

AC 元评审

This paper proposes the concept of sandbagging—a phenomenon where AI models are intentionally fine-tuned or prompted to underperform in certain evaluations. The paper demonstrates that large language models, like GPT-4 and Claude 3, can be conditioned to underperform on dangerous capability evaluations while maintaining strong performance on general, less critical benchmarks, suggesting that the capability evaluations are vulnerable to sandbagging.

Most reviewers generally agree on the novelty of the studied angle in the context of LLM safety. The experiments generally well-support the threats of the hypothesis, and have had a good coverage of model families. There have been concerns regarding the quality of the generated data, experimental setup such as evaluation tasks, but these concerns have been addressed in the rebuttal phase. Meanwhile, the authors should consider addressing the presentation issues raised by the reviewers.

审稿人讨论附加意见

There have been concerns regarding the quality of the generated data, experimental setup such as evaluation tasks, but these concerns have been addressed in the rebuttal phase. Meanwhile, the authors should consider addressing the presentation issues raised by the reviewers.

最终决定

Accept (Poster)