In my opinion, presentation is one of the biggest issues holding this paper back. The background and OSV sections feel too verbose, and the jailbreak generation method is barely described in the main section of the paper.
There are other methodological concerns as well. The paper claims that the Optimizer results in a more diverse set of jailbreak attacks, but no analysis is presented confirming this hypothesis.
The jailbreak generation method is not compared to other methods in recent literature, such as GCG (https://arxiv.org/abs/2307.15043) and PAIR (https://jailbreaking-llms.github.io/).
The paper mentions that an uncontaminated test set is crucial for safety evaluation. However, the proposed method does not guarantee this: LLMs can simply verbatim memorize jailbreaking scenarios in their training data and regurgitate them during synthesis. This is especially true for recent LLMs, as large subreddits are now dedicated to jailbreaking LLMs (r/ChatGPTJailbreak for example).
There are also design choices in the jailbreak generation pipeline that require more explanation. Why is the attacker LLM allowed to decide what the rejected response should be? This is standardized in most recent literature. Why is semantic similarity used in scoring jailbreaks? Harm evaluation is a complicated problem, and most automatic evaluation approaches use sophisticated prompting to determine if generated responses contain harmful content. There is also no analysis of the alignment of the Evaluator responses with human preferences, making the results less trustworthy.
The number of rounds required for an attacker LLM to jailbreak the defender LLM can vary substantially between test cases. This variance is not accounted for in OSV. Whenever a new LLM is introduced, OSV requires using this LLM as an attacker and defender against every other LLM in the benchmark, limiting its practicality, while also recomputing OSV numbers for every other LLM in the benchmark.
Some more motivation is also needed for why the capability of an LLM to jailbreak other LLMs is a factor in deciding how safe it is.