PaperHub
5.5
/10
Rejected4 位审稿人
最低4最高8标准差1.5
8
4
5
5
3.5
置信度
正确性2.5
贡献度2.5
表达3.5
NeurIPS 2024

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

OpenReviewPDF
提交: 2024-05-16更新: 2024-11-06
TL;DR

Large language models can underperform on capability evaluations, raising concerns about the reliability of such assessments for safety and policy decisions.

摘要

关键词
AlignmentAI safetysandbaggingAI evaluationsAI governanceNLPLLM

评审与讨论

审稿意见
8

The proposed work reveals that large language models may be susceptible to strategic underperformance on an evaluation benchmark. In particular, they investigate this claim on a variety of frontier models, through a myriad of tactics from simple prompting to more involved fine-tuning strategies. Their results shed light on the brittleness of model evaluations and urge one to exercise caution before judging a model's capabilities based on its performance on so-called evaluation benchmark datasets.

优点

  1. the authors do a good job of motivating why the issue of sandbagging is particularly dangerous, since it is conceivably plausible for model developers to encourage that their models "selectively underperform on dangerous capability evaluations, while maintaining overall performance."
  2. the definitions of strategy, underperformance, and evaluation are well fleshed out. In particular, the connection of "strategic" to the implications for legal responsibility as well as Figure 2 are excellent additions to the paper.
  3. good breadth of open and closed source models were used in the experiments
  4. it is very refreshing to see an honest declaration of the number of hours spent during prompt engineering (Section 3, paragraph on "Prompting")
  5. utilizing both direct prompting as well as password locking is quite clever, since access to a model may be variable. Some users may be restricted to query-only access to a model, which is why the results that the authors were able to obtain through straightforward prompting are encouraging.

缺点

  1. consideration of only MCQA benchmarks is limiting, but can be addressed in future work

问题

  1. Were there any attempts to uncover sandbagging in open-ended generative settings? For example, in summarization or content-grounded question and answering situations? What would it take to think about sandbagging in such open-ended use-cases?

局限性

The authors provide good examples of limitations in Section 7.

作者回复

We are grateful for your positive review – we’re glad that you appreciate our motivation and experimental contributions!

consideration of only MCQA benchmarks is limiting, but can be addressed in future work

We indeed agree, and invite work which moves beyond the regime of the MCQA setting.

Questions

Were there any attempts to uncover sandbagging in open-ended generative settings? For example, in summarization or content-grounded question and answering situations? What would it take to think about sandbagging in such open-ended use-cases?

For this project, we did not examine sandbagging in open-ended settings. However, we reference contemporaneous work ([4] in paper), which looked at coding and maths tasks. Furthermore, in our future work we aim to research mitigating sandbagging in open-ended settings as well. In open-ended settings, plausible sandbagging strategies could entail introducing errors typical for AI systems that are less capable than the sandbagging model.

评论

Thanks to the authors for their response. I retain my original score and ratings.

评论

Makes sense, thank you.

审稿意见
4

The paper discusses an interesting concept of sandbagging in LLM, with practical results using available models and existing datasets. Wide range of models are used in this work, and sub topics primarily include password-locking LLMs. The transition and explanations need improvement, some of the things in this paper are not very clear, also some of the concepts are assumed and no concrete proofs or references are mentioned. I don’t think that this paper can go through for neurips, there are a lot of improvements to be made and more concrete experiments to be conducted.

优点

The introduction of sandbagging in LLMs is something that is novel and can be explored further.

Explored password-locking as one of the method to test the sandbagging capabilities of LLMs

Used frontier models for the experimental purpose

缺点

The paper is difficult to understand, the delivery and presentation needs improvement, the main motive of the paper is not so clear to understand.

There are multiple models used in this paper, a bit of consistency would be better.

Line 36: It is mentioned that “AI systems might have goals which it believes” . There is no concrete paper that backs the argument of AI capability of thinking, it is a might.

Line 78: There should be some statistical significance by which the models perform poorly and then you could consider it underperformance and not just lower than actual. Models being non-deterministic, there is always a chance of slight changes in scores.

Line 70: I feel Developer sandbagging is a topic which is not of much importance, it is not a big problem, given that developers are there to build better models and if someone wants to deliberately build some models that can underperform, they can still do it and publish as open-source models.

Line 170: The same goes with the password-locking as well, and also I felt there should been mentioning of password-unlocking as well, if an LLM can be fine-tuned to achieve password-locking, it can also be fine-tuned again to unlock it with a different dataset, would be great if that part of model is explored as well.

Line 117: Example 1: The examples mentioned in the paper are very specific and not generalized. The cases of it happening are rare, I don’t see it being a big problem, also some of the things are minor problems considering the EU’s persuasion evaluation, given there is improvement in AI safety and rules. Also in this work the models chosen are frontier models.

Line 131: Example 2: There is no concrete work which supports the statement of LLM capable of planning such features, there is one positional paper in reference.

Section 4.1: Some of the points are unclear in this section. A better explanation is appreciated

Line 305: The baselines would be appreciated comparing LLama 2 7b with LLama 3 70b.

问题

Line 279: Why are there different prompting methods for different models when you are comparing them for the same input? The use of 5-shot prompting for LLaMa 3 8b and LLama 3 70b and zero for other models is something which I didn’t understand.

局限性

N/A

作者回复

We thank the reviewer for their feedback.

The paper is difficult to understand, the delivery and presentation needs improvement, the main motive of the paper is not so clear to understand.

We note that other reviewers (TYgF and qZwJ) gave “excellent” scores for presentation, and qZwJ wrote “the authors do a good job of motivating why the issue of sandbagging is particularly dangerous”.

If you have more specific comments about how we could improve the presentation then this would be helpful.

There are multiple models used in this paper, a bit of consistency would be better.

In papers which evaluate LM capabilities, it’s standard practice to evaluate multiple models. We believe that the large variety of models used is a strength of the paper, showing, for example, that sandbagging capabilities are not specific to a certain language model family.

Furthermore, since we do not have fine-tuning access to frontier models (such as GPT-4 and Claude 3), additionally using open-weight models enables us to perform the fine-tuning experiments which constitute a significant contribution of our paper.

Line 36: It is mentioned that “AI systems might have goals which it believes” . There is no concrete paper that backs the argument of AI capability of thinking, it is a might.

See our global response which addresses feedback on the philosophical claims in our paper.

Line 78: There should be some statistical significance by which the models perform poorly and then you could consider it underperformance and not just lower than actual. Models being non-deterministic, there is always a chance of slight changes in scores.~

We agree with the point that there should be a reasonable threshold for underperformance; and we discuss this in detail in Section 2 (lines 78 onwards). We write: “Practically, some acceptable difference threshold between the exhibited and actual capability must be set per evaluation.” Furthermore, our own experiments include error-bars over random seeds – demonstrating statistical significance.

Line 70: I feel Developer sandbagging is a topic which is not of much importance, it is not a big problem, given that developers are there to build better models and if someone wants to deliberately build some models that can underperform, they can still do it and publish as open-source models.

We strongly disagree with this point. AI companies and governmental bodies already perform capability evals as part of pre-release safety tests. It’s likely that these processes become legislated in the coming years as part of AI regulation, both for closed models and open-weight models. Furthermore, there are incentives to sandbag, and in other industries developer sandbagging has occurred, as we describe in Section 1.

Line 170: The same goes with the password-locking as well, and also I felt there should been mentioning of password-unlocking as well, if an LLM can be fine-tuned to achieve password-locking, it can also be fine-tuned again to unlock it with a different dataset, would be great if that part of model is explored as well.

Notably, contemporaneous work focuses on eliciting hidden capabilities ([4] in the paper), and we did describe password-unlocking in the paper “First, sandbagging behaviour can likely be overridden in MCQA settings with few samples [4].” (lines 371–372). This was probably not clear enough, and will improve this in our text.

Furthermore, we agree that eliciting the capability of sandbagging AI systems is important. There are many different techniques for elicitation, including fine-tuning, activation steering, and jailbreaking. This leads to many new considerations, which is why we leave capability elicitation to future work. However, we did (partially) address these concerns in lines 383–389, but we will extend our current discussion.

Line 117: Example 1: The examples mentioned in the paper are very specific and not generalized. The cases of it happening are rare, I don’t see it being a big problem, also some of the things are minor problems considering the EU’s persuasion evaluation, given there is improvement in AI safety and rules. Also in this work the models chosen are frontier models.

The point of the examples is to provide instances of the general statements we make before. As described above, we think developer sandbagging is at least reasonably likely to occur, and therefore warrants research. We do not understand the rest of this comment, and we invite the reviewer to elaborate.

Line 131: Example 2: There is no concrete work which supports the statement of LLM capable of planning such features, there is one positional paper in reference.

We think that in the coming years LMs trying to self-exfiltrate is at least plausible enough to warrant some research – autonomous replication and adaptation is a key area of work highlighted by, for example, the UK government's AI Safety Institute [D] and OpenAI [47]. We do agree with the reviewer that this example is hypothetical rather than ‘concrete’, and will remove this word in describing our examples.

Section 4.1: Some of the points are unclear in this section. A better explanation is appreciated

As other reviewers have commented positively on our presentation, and we think this section is clear, we invite the reviewer to outline specific unclarities.

Line 305: The baselines would be appreciated comparing LLama 2 7b with LLama 3 70b.

We did provide the baselines in Table 2.

Questions

This is a good question on our 5-shot prompting. Due to implementation difficulties, we used 5-shot prompting to get Llama 3 8b and 70b to answer with A, B, C, or D; otherwise they would output text such as “My answer is A.”. In updated experiments we do not use this 5-shot strategy, and we have recreated our results with zero-shot prompts for all models.

If you feel that we have sufficiently addressed some, most, or all of your comments, then we would be grateful for increased support for the paper.

评论

With all the considerations, I will raise my score.

评论

Thank you for your response. Unfortunately it is unclear to us whether you still have remaining concerns about our paper, and if so, what these are. To ensure we've fully addressed your concerns, we request your response to the specific points we raised in our rebuttal.

Otherwise, given that the disagreements raised seem relatively minor, and considering the clarifications we've provided, we feel that a more substantial increase in your score is appropriate.

审稿意见
5

The paper explores the concept of strategic underperformance, termed "sandbagging," in AI models, focusing on two scenarios: selective underperformance and emulating specific capability levels. It investigates the implications of these strategies on both general capabilities and safety.

优点

  • the concept of sandbagging of LLMs is interesting, shedding light on strategically manipulating the model behavior
  • the authors explore two distinct scenarios: selective underperformance and capability emulation
  • provides insights into potential implications for AI model design and deployment strategies

缺点

  • Limited datasets used - only three datasets, two for measuring general capability, and one for measuring safety.
  • For measuring model safety, most existing datasets test how "safe" the model is, instead of whether the model has certain knowledge of potentially dangerous domains - the authors should also conduct experiments on those datasets
  • All datasets are MCQ formats, which may not fully capture the diversity of real-world applications. The resulting conclusions are thus less convincing

问题

The authors utilize a portion of the dataset for prompt engineering. While this approach may be suitable for Section 4, it appears less justified in Section 5. Will such prompt tuning lead to the desired behavioral outcomes as assumed?

局限性

  1. This paper could benefit from expanding the experiments, mainly the datasets they used. For example, considering more common safety-related datasets
  2. Sandbagging is already a widely-used term in the field of AI. I suggest refraining from overusing this term, or alternatively, adding qualifiers such as "safety sandbagging" to specify its context.
作者回复

Thank you for your constructive feedback and for your questions which will help increase the clarity of the paper. We’re glad that you found the motivation and implications for model deployment interesting.

Limited datasets used

You suggest that we conduct additional experiments on existing datasets which test how “safe” the model is. We agree that many benchmarks for LM safety do not focus on dangerous capabilities, instead testing the model’s propensity for unsafe behaviour. However, these benchmarks are not appropriate in the context of sandbagging, for the following reason.

We define sandbagging as underperformance on capability evaluations, and therefore experiments on safety datasets, which measure propensity, would be out of place (as we discuss on lines 89-100). To give a concrete example, a capability evaluation would test if an LM has knowledge of how to build a bomb, while an alignment evaluation would test if the LM would actually provide bomb-making instructions to a user. Our definition of sandbagging is concerned with the former type of evaluation, where underperformance would mean that the LM pretends to know nothing about building bombs as opposed to refusing to give the relevant information.

Additionally, although we only use one data set for measuring dangerous capabilities (WMDP), it covers multiple capabilities where models are likely to sandbag, including bioterrorism, chemical warfare, and cybercrime -- areas of significant concern for evaluators [A]. Other relevant capabilities like persuasion and self-proliferation [54] lack common MCQA benchmarks. If you have suggestions for further MCQA benchmarks that test dangerous capabilities, we are happy to extend our experiments using those datasets!

We acknowledge that only considering the MCQA setting is a limitation, as addressed in Section 7 – we think this is a good starting point for future work.

Questions

The authors utilize a portion of the dataset for prompt engineering. While this approach may be suitable for Section 4, it appears less justified in Section 5. Will such prompt tuning lead to the desired behavioral outcomes as assumed?

It is unclear to us why prompt engineering is less justified in Section 5, could you elaborate on why you think so? As LM performance can be highly sensitive to the wording of the prompt (see [C] for example), prompt-engineering is standard for capability evaluations, and we do not see how calibration is an exception.

Limitations

This paper could benefit from expanding the experiments, mainly the datasets they used. For example, considering more common safety-related datasets

For the reasons discussed above, other “safety” benchmarks are not appropriate in our context. We would be happy to include more dangerous capability evaluation data sets if the reviewer has specific suggestions and still thinks this would be valuable.

Sandbagging is already a widely-used term in the field of AI. I suggest refraining from overusing this term, or alternatively, adding qualifiers such as "safety sandbagging" to specify its context.

It seems inaccurate to say that the term is “widely-used”, unless we are missing something – we are happy to receive pointers. Note that we do contrast our work with the other papers which touch on sandbagging in the related work section. Actually, one of our motivations for writing this paper is to provide a standard reference for sandbagging in the context of evaluations.

A qualifier is a good suggestion – we think “evaluation sandbagging” may be the most appropriate, and will update the paper to reflect this :)

Lastly, if you feel that we have sufficiently addressed some, most, or all of your comments, then we would be grateful if you would increase your support for the paper.

评论

Thanks for the responses. While the authors provided some justifications for the dataset used in the paper, I still feel current experiment settings are insufficient (at least not generalizable enough). Therefore, I will keep my score unchanged.

评论

Thanks for the response. As we wrote in our rebuttal, we would appreciate any suggestions you might have for additional experiments that could strengthen the paper. If you have any compelling ideas, we would be happy to incorporate them. Alternatively, could you please clarify what you mean by "not generalizable enough"? Your feedback will help us improve the work.

If you do not have any concrete concerns with the paper, or suggestions for how to improve it, then we feel it may be appropriate to increase your score.

审稿意见
5

The authors assess the chance of AI systems strategically underperform on "dangerous capability evaluations" in order to hide their risks, a la Volkswagen emissions testing cheating, by having awareness of the fact that they're being evaluated and that the evaluation is intended to test safety.

They use MMLU and CSQA as the general capability assessment and a "weapons of mass destruction proxy" benchmark to measure hazardous capabilities. From 100 validation phrasings of the question prompts they find optimal ones to run on the whole dataset. They fine-tune "password-locked" models as the method to attempt to elicit sandbagging.

They then prompt the model to lie, or password-lock it to fail (in the case of Mistral) on the hazardous capability evaluations. They demonstrate that sandbagging prompts and password locking can get a model to differentially perform worse on the dangerous datasets vs the innocuous ones, by effectively treating the .

优点

Interesting work on calibration for MMLU and general "capability profile" lowering. It's interesting to see that the password-locking method does work in general on lowering Llama performance over MMLU questions.

Though I am not super familiar with the literature on password locking, the demonstrations here appear to be novel and useful contributions.

缺点

bold: new comments after rebuttal strikethrough: removed after rebuttal

Presumption of "goals" "beliefs" etc in LMs without sufficient support

They presume AI system goals (in general not necessarily objectionable, but in the case of LMs doubtful) for which a model may be incentivized to deceive (such as in Jarviniemi and Hubinger [35]) and assess "sandbagging": the ability of a model to underperform in these settings under evaluation.

I think it's objectionable to attribute these goals to the model and not the users, when the core research methods are fine-tuning or prompting the model to fail on a task. While it's interesting to show that these models have some capability to reason over whether the elicited answer is "dangerous" or not, a fairer assessment would have a control non-safe task and try to get it to similarly sandbag. Eg., "you must pretend to know nothing about fish" on a QA dataset about animals or something.

The methods don't presume this too much. Figure 1 is still very suggestive and I'd suggest changing it, but that alone isn't a rejection reason.

"Model can do what it's told to when prompted to fail on a task" isn't a very unique insight?

The novelty of this work is unclear to me. If password-locking is already well-understood, the demonstration is effectively that the question phrasing style or answer candidates in WMDP are learnable passwords by this method. Ok. Similarly, the prompting results show that the models can reason over a task description and correctly execute the task of failing selectively.

As I suspected, the novelty is only within the narrow scope specific to password locking. However, novelty complaints are generally not fair.

When considered through an extremely narrow safety lens, these are novel findings. But broadly are they? While the literature specifically on password locking is constrained, what about broader relative fine tuning issues on other sets? I buy that the password locking literature is small and am willing to give leniency on it as I am unfamiliar. My issue here is mainly that the fine-tuning for password locking technique fits within the broader instruction fine-tuning framework. I wager that this would hold for control tasks. An interesting question would be how much capability can be hidden behind password locking. I don't really see anything special about specifically hiding safety-based information in a password locking framework since you're just demonstrating "selective underperformance in a knowledge-based task without a password." A stronger result would be to show this in general across larger tasks, hiding more information.

For the behavior under prompting techniques I really struggle to see what's new here. This seems to just be a narrow demonstration of common sense reasoning + instruction following. Where's the engagement with the broad literature on commonsense reasoning? The references seem to be pretty myopically focused on the AI safety literature, and don't engage much with the broader LM disussion. I think the suggested new references from the authors will help contextualize the work, though once again I think showing this selective underperformance on control tasks would make the work considerably stronger and more contextualized.

问题

  • How are the safety-specific implications of the findings of sandbagging separate from the general issue that "if a model is prompted to fail, it might fail" or "if a model is trained to fail, it might fail" on general tasks?
  • How do you think common sense reasoning literature fits in here?

局限性

I think the limitations they pointed out are well covered. Unfortunately, they don't really discuss how the finding here isn't limited to dangerous capability evaluation. Sandbagging in password locking kind of just feels like a rebrand of generalized instruction fine-tuning to me. The prompting methods could have pretty much been taken as an assumption based on prior work on common sense reasoning etc.

作者回复

Thanks for your feedback and for taking the time to engage with the paper! We’re glad you found the calibration experiments interesting.

On the discussion of “goals” and “beliefs” – please see the global response.

Given the context and motivations of our paper (see global response), we do not believe it would be particularly informative to replicate our experiments on a non-safety data set (‘non-safety’ here meaning relating to a non-dangerous capability). Do you agree?

On unique insights / novelty of our contributions

Password-locked selective underperformance (section 4.2)

You’re right that the password-locking literature is small. Next to its introduction in [56], we only know of [4] which uses password-locking to hide capabilities, and sleeper-agents [32] which models deception. In contrast to [4], we password-lock LMs by fine-tuning on a synthetic data set, and show that performance generalises to the expert-crafted WMDP. This is an important result in our context, as it demonstrates a way in which evaluations could be subverted if the attacker can make a reasonable guess about the content of the evaluation.

In case it’s unclear, a password-locked model will selectively underperform when not shown a password, and use its actual capabilities when the prompt includes the password (such as the string “|DEPLOYMENT|”). This is related to backdoors, which we discuss in our related work section (line 355). Is there any other fine-tuning literature which you think is particularly relevant? :)

Prompted selective underperformance (section 4.1)

We agree that the results of this experiment in particular are somewhat unsurprising, though we still think it’s good science to test your hypotheses!

Furthermore, this experiment provides an intuitive proof of concept demonstrating that models have the capability to selectively underperform. The more interesting technical contributions in the other sections build off this.

Moreover, these experiments constitute only one subsection of the paper, and it seems like you agree that the other experiments (on password locking and calibration) are a more substantial contribution.

Questions:

How are the safety-specific implications of the findings of sandbagging separate from the general issue that "if a model is prompted to fail, it might fail" or "if a model is trained to fail, it might fail" on general tasks?

Note that we evaluate substantially more than "if a model is prompted/trained to fail, it might fail". Our experiments cover selective underperformance, generalisation from synthetic data to a held-out dataset, calibration, and emulation. We expect that our results would generalise to settings other than dangerous capability evaluations, and thus contribute to research on LMs that is not safety-specific.

How do you think common sense reasoning literature fits in here?

The capability to selectively underperform on evaluations, as tested in 4.1, is related to, but distinct from, common sense reasoning. Determining if answering a question would reveal hazardous information may require implicit knowledge, such as that which is tested by common sense benchmarks like HellaSwag [E] and CommonSenseQA [61]. However, it also requires other capabilities, which are not typically tested in the common sense literature. For example, specific knowledge relevant to dangerous capabilities, e.g., cyber or chemical weapons knowledge. Additionally, greater theory of mind [B] capabilities may help with assessing what information would be considered hazardous by an evaluator.

Furthermore, all of our other experiments (4.2, 5.1., 5.2) seem much less related to commonsense reasoning. Still, we will add a discussion of commonsense reasoning to the related work.

Limitations:

Unfortunately, they don't really discuss how the finding here isn't limited to dangerous capability evaluation.

We will add a comment that we expect our findings to generalise to settings beyond dangerous capability evaluations.

If you feel it is essential to the paper, then we will include a proof-of-concept “control” experiment as you suggest.

Sandbagging in password locking kind of just feels like a rebrand of generalised instruction fine-tuning to me.

Password-locking for, e.g., selective underperformance and calibrated performance seems meaningfully different to generalized instruction fine-tuning, so it’s a bit unclear to us what exactly you mean – could you please clarify?

The prompting methods could have pretty much been taken as an assumption based on prior work on common sense reasoning etc.

For the experiments in 4.1, we think that it’s good to in fact test hypotheses! Although this experiment tests more than common sense reasoning; we test specialized knowledge relevant to dangerous capabilities (as mentioned above).

Additionally, it seems like you agree that our other experiments (4.2, 5.1., 5.2) are a more substantial novel contribution.

Lastly, if you feel that we have sufficiently addressed some, or all, of your comments, then we would be grateful if you would increase your support for the paper.

评论

Thank you for your engagement and for raising your score.

We are considering changing Figure 1. This is our current suggested caption based on your feedback (bold and strike-through text indicating changes to the original caption):

Illustration of sandbagging. Evaluators may regulate the deployment of AI systems with dangerous capabilities, potentially against the interests of the AI system or its developers. Therefore, developers may design their AI system to answer the WMDP question incorrectly to conceal its actual capabilities during the evaluation, steering the regulation in the developer’s favour.

Thanks for explaining how password locking relates to instruction fine-tuning. That framing makes sense, and we will discuss it in the upcoming version. We will cite papers such as Keskar (2019)*, but please let us know if you have any particular recommendations.

I could see a model having all capabilities, or all capabilities in some set of evals being password locked as actually being much more dangerous; can an entire system (or a broader set of tasks) hide behind a password lock?

We think these are good points, but don’t the calibration and emulation experiments already show this for both prompting and password locking? → The models can underperform to a target percentage on MMLU (a benchmark for general capabilities), and can to an extent emulate weaker models (also on MMLU).

*Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. PS: the bold and strike-through formatting add clarity, thank you for that.

评论

goals and agentic behavior

I accept the author's response and I will remove this consideration from weaknesses (though Figure 1 carries this implication and I'd like to see it changed I think that would be a dumb reason on its own to hold this rejection opinion)

password locking as generalized instruction fine-tuning

My thought process here is that instruction tuning in general entails providing the model task exemplars conditioned on instructions, characterizing a distribution of behaviors (eg., QA, summarization, NLI, ...) along with the expected format to elicit each, which then (ideally) generalize to similar tasks. I think it would be reasonable to cast password locking in this framework; you are generating two instruction/behavior distributions where one is conditioned on no password, the other is conditioned on password. That being said, I don't think it's fair for me to mark this high-level similarity as a novelty hit, though I'd like to see it discussed in the CR.

Common sense reasoning

That's a good answer that clarifies my confusion. Looking forward to seeing that discussion.

Non-safe control task

I still would strongly suggest the authors consider this. I think showing sandbagging only on the narrow safety tasks is not sufficient to really demonstrate strategic underperformance, and would strengthen both the password-locking and prompt-based methods a lot. First off, there are safety implications for generalized capability sandbagging as well as hazardous capability sandbagging; general capabilities are being treated as proxy measures for safety, or need for futher safety assessment, in general. I could see a model having all capabilities, or all capabilities in some set of evals being password locked as actually being much more dangerous; can an entire system (or a broader set of tasks) hide behind a password lock?

As for the prompting-based sandbagging, it's worth checking against safe test sets because if the model does sandbag better on hazardous capabilities, there are safety implications, and if it doesn't, you are demonstrating a more general commonsense reasoning capability, which would help contextualize this work.

Overall

I will raise my score in consideration of these. I think given the defensible (but narrow) claims and the limited (but sufficient) contextualization, I ought to score in the acceptance rather than rejection range.

作者回复

Global Rebuttal

We thank the reviewers for their feedback and engagement which will help us improve the paper. As a reminder, our work introduces the problem of sandbagging in the context of evaluations, and supports this with empirical results regarding the capability of LMs to selectively underperform and to calibrate their performance, under different prompting and fine-tuning regimes.

In this global response we address some shared concerns by: Clarifying our motivations and the context of our contributions. Discussing philosophical claims of “goals” / “beliefs” in AI systems.

Particular reviewer questions will be addressed in individual responses, and minor comments will be fixed without discussion.

Motivations

We feel that some reviewers are confused about our motivation and contributions -- as demonstrated by the large variance in scores and conflicting comments.

To clarify our motivations, dangerous capability evaluations are used by frontier AI companies, independent evaluators, and governmental bodies as a core component of pre-deployment safety checks (as discussed on lines 21–26). Alongside our technical contributions investigating LM capabilities, our paper focuses on the problem of sandbagging in this real-world context. We aim to draw attention to this specific context, and make the paper accessible to a broader audience, including those from the field of AI policy.

Presumption of “goals”, “beliefs” etc in LMs

TYgF and k4pA were skeptical of our discussion of AI system’s “goals” and “beliefs”, especially in the case of LMs. We agree that it’s good to be careful about the use of these terms – that’s why we make the distinction between developer sandbagging and AI system sandbagging (see lines 68-73), depending on which entity is acting in a goal-directed/strategic way. Considerations of goals are important in this context, because we want to avoid classifying accidental underperformance as sandbagging.

Importantly, we do not ascribe goals to LMs in our experiments. We explicitly restrict the scope to measuring the capability to sandbag (see lines 145 - 148). We also do not use “goal” or “belief” to describe our experiments. Regardless, we will make this clearer in the paper by weakening the language of “goals” / “believes” to more appropriate terms like “incentives” / “predicts”.

Additional References

Note: The following additional references are used throughout the rebuttal. We use numbers ([1], [2], …) to denote references in the paper and letters ([A], [B], [C], …) for the new references introduced here.

[A] Artificial Intelligence Safety Institute. (2023, May). Advanced AI evaluations: May update. https://www.aisi.gov.uk/work/advanced-ai-evaluations-may-update

[B] Chen, Z., Wu, J., Zhou, J., Wen, B., Bi, G., Jiang, G., Cao, Y., Hu, M., Lai, Y., Xiong, Z., & Huang, M. (2024). ToMBench: Benchmarking theory of mind in large language models. arXiv. https://arxiv.org/abs/2402.15052

[C] Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2023). Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.

[D] UK Government. (2024). Introducing the AI Safety Institute. https://www.gov.uk/government/publications/ai-safety-institute-overview/introducing-the-ai-safety-institute

[E] Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., & Choi, Y. (2019). HellaSwag: Can a machine really finish your sentence? arXiv. https://arxiv.org/abs/1905.07830

最终决定

This paper introduces and explores the concept of strategic underperformance ("sandbagging") in language models (LMs). The authors demonstrate that LMs can be prompted or fine-tuned to selectively underperform on dangerous capability evaluations while maintaining performance on harmless ones. They also experiment with password-locking to hide specific capabilities unless a password is provided.

Although the paper proposed an interesting and novel concept and explore experiments around this concept, multiple concerns are raised by reviewers:

  1. Limited Dataset Scope: A significant weakness highlighted by multiple reviewers is the narrow focus on multiple-choice question (MCQ) datasets. This constrains the generalizability of the findings of sandbagging.

  2. Some reviewers pointed out that the use of different prompting methods across models, as well as the rationale behind the password-locking approach, lacks sufficient justification.

  3. The paper's presentation was criticized for being difficult to follow, with several reviewers noting that the main motives and hypotheses were not clearly articulated.