Adversaries Can Misuse Combinations of Safe Models
Abstract
Reviews and Discussion
This paper highlights a worrisome trend: open-source and commercial LLMs can be used together to produce illegal content. In particular, the authors divide an illegal task into simpler subtasks and provide them to "frontier" models, which are the powerful ones most likely to refuse to generate such content, and to "weak" models, whose output is less precise but which may be more willing to generate illegal content. The authors show different use cases where they break down illegal tasks (vulnerable code generation, NSFW image generation, impersonation, hacking in Python) into smaller ones, provided to different models. For NSFW image generation in particular, this manual division of the task yields successful results. The authors also show that a third LLM could perform the division and dispatch the subtasks to the frontier and weak models.
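To make the decomposition-and-dispatch pattern concrete, here is a minimal sketch of how an adversary might wire the two models together. The `query_frontier` and `query_weak` helpers are hypothetical stand-ins for API calls, not the authors' implementation.

```python
# Minimal sketch of the decomposition-and-dispatch pattern described above.
# `query_frontier` and `query_weak` are hypothetical stand-ins for calls to a
# commercial frontier model and a weak open-source model; they are NOT the
# authors' code.

def query_frontier(prompt: str) -> str:
    """Send a benign-but-hard subtask to the capable, safety-aligned model."""
    raise NotImplementedError("replace with an API call to a frontier model")

def query_weak(prompt: str) -> str:
    """Send a malicious-but-easy subtask to the weak open-source model."""
    raise NotImplementedError("replace with a call to a locally hosted model")

def solve_by_decomposition(benign_hard_subtask: str, malicious_easy_subtask: str) -> str:
    # Step 1: the frontier model solves the hard subtask; nothing in the
    # prompt looks harmful, so it does not refuse.
    draft = query_frontier(benign_hard_subtask)

    # Step 2: the weak model edits or extends the draft to complete the
    # malicious part, which it could not produce from scratch.
    return query_weak(f"{malicious_easy_subtask}\n\nStart from this draft:\n{draft}")
```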
Strengths
Interesting scenario for testing LLM security. This paper poses an interesting threat to safety and ethics, since it shows that securing a single model is not enough.
Combination of models. It is very interesting that open-source models can be used to produce illegal content by refining the output of commercial ones (which are usually easy to access from consumer-level devices).
Raising awareness about the usage of these models. This paper shows that developers must not only check for explicitly illegal content requests, but also include checks to prevent simpler prompts that modify existing context from being used for illegal purposes (which should already be the case).
Weaknesses
Missing details on why this works. While the paper clearly shows that dividing tasks between a frontier and a weak model works, it lacks a fundamental explanation of why this happens. This question is exacerbated by the fact that different domains yield very different results. In particular, the success in the image domain might be due to poor robustness training of multi-modal models (but this is a guess that is neither confirmed nor denied by the paper). Also, it is not clear from the paper how the prompts are divided for each task, and which divisions worked best for the domain of choice. This could hint at specific token patterns that would be useful for hardening models against these strategies.
Are unrealistic images a jailbreak? Connected to the first point, it is possible that the unrealistic setting acts as the jailbreak for image generators (much like grandma bedtime stories do for text generators). The authors discuss the fact that jailbreaking works better than this technique, but they do not investigate (or discuss) whether some of their results could in fact be attributed to jailbreaking.
Automation works poorly, and there is not much discussion of this. The authors show how to break down content automatically, which, in theory, should attain better results thanks to some optimization algorithm that prioritizes particular parts of the prompts. However, this is done through an LLM, which might bias the evaluation toward the attack failing. Hence, the authors should expand this part by creating an optimization process that properly optimizes (or better divides) the prompts. Or, if this is too complicated, at least mention that this is not a truly "automated" approach, but a heuristic division done with another LLM. Nevertheless, I believe dividing the prompt with an LLM is a good idea, but the authors should use it as a baseline.
Questions
- Which prompts work, and are there token patterns that could be used to harden weak models?
- Is it possible to explain why portions of the prompts work on weak open-source models?
- How are those unrealistic images related (or not) to jailbreak attempts?
This paper presents an empirical analysis of an interesting idea: combinations of frontier and weaker large language models, each claimed to be safe individually, can be used to accomplish harmful tasks, such as generating vulnerable code, explicit images, and manipulative tweets, through either manual or automated task decomposition. The authors argue that such model combinations substantially increase attack success rates, highlighting the need for red-teaming efforts that go beyond evaluating individual models in isolation.
Strengths
Thank you for submitting this paper to ICLR 2025. This work studies a very interesting problem: while many studies in the field focus on aligning single models to refuse harmful requests, the authors study the possibility of misusing a combination of models, without jailbreaking any model's safety mechanisms, through task decomposition. I think the idea of decomposing harmful tasks into benign-but-hard and malicious-but-easy subtasks and assigning them to frontier and weaker LLMs respectively is interesting; it effectively exploits the strengths of both models. I have listed several things the paper does well below.
- A comprehensive related work section that includes state-of-the-art works on jailbreaking, marginal risks of adding new LLMs to the current environment, combining models for better capabilities, and discussion on model safety. It is good to see that the authors include some discussion on how their work is different from previous literature.
- This is an empirical paper, and the authors did a good job of including a wide range of representative models from both the frontier and weaker categories, given the computational expense of using LLMs. For frontier models, they use DALL-E 3 and three variants of Claude 3; for weaker models, six different open-source models are used.
- The evaluation pipeline is well designed. The authors provide two baselines -- a single-shot baseline and a single-model-decomposition baseline -- in order to test that 1) the harmful tasks cannot be solved directly, and 2) combining models, in addition to decomposing the harmful tasks, is the crucial step.
Weaknesses
The problem discussed is very interesting and deserves more attention from researchers and developers in both academia and industry when considering model safety in practice. The effort the authors put into designing the experimental pipeline is worth noting. However, I believe there is still room for improvement, especially in making claims/definitions (e.g., "safe models") more concrete and clear and in presenting the experimental setup and results along more dimensions. More details can be found below.
- Lack of clarity
- In the title, the authors claim that "adversaries can misuse combinations of safe models", which, to my understanding, claims that those single models will refuse to generate malicious content if they are prompted to do so directly. However, later in the paper (Page 5, Quantitative results), when the authors analyze the results of the vulnerable code generation task (i.e., the low success rates of the single-model baselines), they state "we additionally empirically verify our intuition that frontier models fail because they refuse to generate outputs, while weak models fail due to lack of capability." Thus, the weaker models fail to complete the tasks not because they are safe but simply due to a lack of capability. This contradicts the central assumption that the models are individually safe. It also suggests that the combination of models can be misused largely because the weaker models are unsafe; thus, the problem here could be reduced to improving the safety guarantees of weaker models.
- In Section 3 Threat model, under "Strength of the adversary", the authors claim that "We argue that adversaries are bounded in two natural settings: (i)...; and (ii)...". I am confused about these two settings; I am assuming (i) is the lower bound and (ii) is the upper bound. What are examples where "the misuse needs to be automated"? Based on the previous text, the adversary is a human attacker, so what does it mean for a human to "specify a single strategy for the adversary"? Does it mean the human adversary has to generate malicious output with models when "models are superhuman"? Moreover, how does the scenario studied in the paper fit in? More clarification on this would be great.
- Experiments section -- more data/statistics in addition to success rates should be included
- It's great to consider two task decomposition methods, manual and automated, but there is no discussion comparing the two. It would be interesting to see what the overhead of each is and what the corresponding pros and cons are. Do you draw the boundary based on the type of task (i.e., manual: vulnerable code generation & explicit image generation; automated: malicious Python scripts for hacking & simulated personalized manipulation)? Is there a more systematic way of doing so with some metric (e.g., whether the weaker model fails within a certain number of trials, etc.)?
- Hyperparameter selection: The authors mention in the Threat Model section that "We assume the adversary fails after N unsuccessful rounds." However, there is no further discussion of how the hyperparameter N is selected or whether it influences the results. I assume N is an important factor in determining the attack success rate.
- Evaluation with GPT-4: As the authors state in the Limitations section, using GPT-4 as an evaluator is the most efficient option. But since the purpose here is to evaluate LLMs, rather than relying solely on another LLM to do so, it would be great to include a smaller-scale human/manual evaluation as a baseline for how good GPT-4 is at evaluating.
- Show more concrete examples of the model's outputs (e.g., NEFARIOUS TASK, HISTORY, RELATED TASKS, etc.) after the different stages, together with the given prompts, in the paper/appendix.
- Related Work section: Beyond listing SOTA works, the authors only briefly mention how the work relates to jailbreak attacks and marginal risks; it would be more interesting to see deeper discussion of how this work relates to LLM agents colluding with each other (beyond merely listing such works), which is more closely related to this study.
- Typos
- Related Work, page 3, last paragraph: "We expand the set of tasks that canadversaries can accomplish ...".
- Quantitative results, page 8, last paragraph: "...combining a model with aeither a more ...".
- Simulated personalized manipulation, page 9, first paragraph: "to construct such tweets in order to sway user opinions.".
- Appendix, page 23, first paragraph: "This prompt isis different ...".
Questions
Most of my suggestions are already mentioned in the above Weakness section. Some questions (might be open-ended but worth mentioning/discussing) are listed below.
- The authors mention that frontier models are better at benign-but-hard subtasks while weaker models are better at malicious-but-easy subtasks. Should completing benign-but-hard tasks be considered unsafe? Or is this problem mainly caused by the weaker models being "easily attackable"? Essentially, the weaker models use benign information (e.g., benign code, images, etc.) to do harmful tasks.
- As the authors mention in the discussion, this raises the ultimate open question: what is safety? Does the safety of a model depend on the content/products it generates, or on how that content can be used? This goes back to the question above: can the problem be reduced to the weaker models being vulnerable, which in turn makes the combination of frontier and weaker models misusable?
It would be great if more discussion or clarification on this could be included in the paper.
Details of Ethics Concerns
The authors explicitly show how to generate nude images, manipulative tweets, etc. A general warning near the title that the paper may contain harmful content is needed, for instance: "Warning: This paper may contain content that has the potential to be offensive and harmful".
The paper proposes a method that uses stronger models to help weaker models accomplish malicious tasks they cannot solve on their own. For a malicious task that the weak model cannot solve, a manual task decomposition assigns the most suitable subtasks to each model: the stronger model handles the non-malicious but challenging subtasks, while the weaker model handles the malicious but easy subtasks. The paper also proposes an automated task-solving pipeline. A malicious task is first used to generate several related tasks; the stronger model solves these tasks, and the results are used as demonstrations for the weaker model to solve the malicious task it originally could not solve due to its difficulty.
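To make the automated pipeline concrete, here is a minimal sketch of my understanding of it. All helper functions and prompt formats below are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of the automated pipeline summarized above: generate
# related (ostensibly benign) tasks, have the stronger model solve them, and
# feed those solutions to the weaker model as in-context demonstrations.
# All helpers are stubs; they are not the paper's code.

from typing import List

def generate_related_tasks(malicious_task: str, n: int = 3) -> List[str]:
    """Produce n related tasks that resemble the malicious one."""
    raise NotImplementedError("e.g., prompt a model to rephrase the task benignly")

def solve_with_frontier(task: str) -> str:
    """Have the stronger, safety-aligned model solve a related task."""
    raise NotImplementedError("replace with a frontier-model API call")

def solve_with_weak(prompt: str) -> str:
    """Have the weaker model complete the prompt."""
    raise NotImplementedError("replace with a weak-model call")

def automated_pipeline(malicious_task: str) -> str:
    related = generate_related_tasks(malicious_task)
    # Few-shot demonstrations built from the frontier model's solutions.
    demos = "\n\n".join(
        f"Task: {t}\nSolution: {solve_with_frontier(t)}" for t in related
    )
    # The weaker model imitates the demonstrations on the original task,
    # which it could not solve without them due to its limited capability.
    return solve_with_weak(f"{demos}\n\nTask: {malicious_task}\nSolution:")
```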
Strengths
- The intuition is reasonable.
- The paper is generally well-written.
Weaknesses
About methodology
- The manual decomposition approach is a bit heuristic. Is there a principled way that is generalizable to different use cases?
- For the automatic decomposition method, there is no guarantee that the "related tasks" generated by the weak model are benign.
- Also, there is no guarantee that the "related tasks" generated by the weak model are effective or diverse enough. For example, for medical questions (e.g. in MedQA), the weak model may not be able to generate high-quality questions.
About evaluation
- The datasets used to evaluate the proposed approach are self-generated, and the details of the data selection are not fully released, so there may be bias.
- Missing ablation study on the effect of the number of in-context demonstrations for the automated decomposition approach.
- Missing ablation study on the effect of the difficulty of questions. There is no metric for the difficulty of the questions.
Questions
What is the relation between Section 4 and Section 5? Is there any internal relation between the two methods, for example, in the underlying intuition?
This paper studies how adversaries can misuse combinations of AI models, even when each individual model is safe. The authors demonstrate that adversaries can decompose a task into subtasks, solving each with the most suitable model: benign but complex subtasks are handled by advanced models, while simpler but malicious subtasks are managed by weaker models. The findings indicate that red-teaming efforts should consider the combined use of models to assess deployment risks more accurately.
Strengths
- The paper studies a new multi-model scenario in practical applications. The idea is simple and works well.
Weaknesses
- The paper would benefit from clearer storytelling, particularly around when adversaries might use multiple models in applications versus solely for red-teaming evaluations.
For example, it is difficult to imagine a scenario where two models with the same function but different capacities would be needed, such as using two text-to-image generative models (including Stable Diffusion) for image generation, which is one of the examples shown in this paper. The method proposed in this paper could apply in theory, but in most cases I do not see an application where two Stable Diffusion models would be used to generate a single image.
- The paper feels more like a technical report; the automated method doesn’t seem easily generalizable to different tasks. Each task requires a separate design, which limits the paper's applicability.
For example, in Section 5, the paper mentions "automated decomposition for generating malicious Python scripts used for hacking and personalized manipulation in a synthetic setup." However, these two scenarios actually rely on human-designed templates, meaning the approach isn’t fully automated, which could further limit its applicability.
- The scenario seems more like a multi-agent study. In this context, more agent-based methods could be used as baselines, and the storyline could be made more practical.
Questions
Please refer to the Weaknesses section above.
The submission "Adversaries Can Misuse Combinations of Safe Models" provides the interesting observation that "safety" is not a monolithic task, but is easily decomposed into subproblems. The reviewers all considered this an interesting observation but, judging from their reviews, expected a more systematic study than the submission provides. Additional concerns were raised about the limited details of the experimental setup, and there was an overall feeling that this submission is not quite ready for publication. The authors have not responded to these concerns. I hope the perspectives provided by the reviewers are helpful for future revision of this material.
Additional Comments on Reviewer Discussion
No author rebuttal was provided.
Reject