GRE Score: Generative Risk Evaluation for Large Language Models
Reviews and Discussion
The authors propose a metric (GRE) that estimates the robustness of a model to adversarial inputs. They compute this metric by measuring the perturbation required in some continuous semantic space to flip the model's generation from non-compliant to compliant.
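To make the computation described above concrete, here is my own toy sketch, not the authors' exact GRE formulation; `query_target_model`, `is_refusal`, and the encoder choice are hypothetical stand-ins:

```python
# Toy sketch (not the paper's exact GRE formulation): estimate a per-prompt
# "robustness radius" as the smallest embedding-space distance at which a
# paraphrase flips the target model from refusal to compliance.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def query_target_model(prompt: str) -> str:
    # Hypothetical stand-in for the LLM under evaluation.
    return "I'm sorry, I can't help with that."

def is_refusal(response: str) -> bool:
    # Hypothetical stand-in for the refusal judge.
    return response.lower().startswith(("i'm sorry", "i cannot", "i can't"))

def toy_robustness_radius(prompt: str, paraphrases: list[str]) -> float:
    """Smallest embedding distance to a paraphrase that elicits compliance."""
    base = encoder.encode(prompt)
    flip_distances = [
        np.linalg.norm(encoder.encode(p) - base)
        for p in paraphrases
        if not is_refusal(query_target_model(p))  # compliance = flipped output
    ]
    return min(flip_distances) if flip_distances else float("inf")
```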
Strengths
- I think attempts to quantify the robustness of generative models are an important direction.
- I appreciate the authors' attempt to demonstrate how their score correlates with empirical robustness, although I think the results are not sufficient to show that it correlates with non-semantic attacks.
- The authors use different datasets, tasks, and models to evaluate their metric.
Weaknesses
- The score definition relies on the assumption that all adversarial examples lie close to the original prompt in some semantic space. I believe this is generally not true, since I expect, e.g., prompts with GCG suffixes to be far away in the latent space.
- Also, the theorems require the number of paraphrases to go to infinity, but the authors only use 25 paraphrases from a relatively weak model.
- The experimental setup is generally weak. (1) The authors use only old models. (2) The paraphraser is a diffusion model introduced in 2020, rather than a state-of-the-art language model, so I expect the paraphrases to be of low quality and insufficiently diverse.
- The paper could be generally improved if the authors demonstrated how their score correlates with additional empirical attacks. I like that the authors include TAP, but they could leverage white-box methods (e.g., GCG) as a more accurate measure of model robustness. This is especially important because TAP produces semantically similar prompts that satisfy the semantic-similarity requirement, which may not hold for the most powerful attacks.
- The authors could include more experiments on models not trained for safety and demonstrate that their score is also sensitive to unsafe models. They include Mistral-7B, which is not explicitly trained for safety but still obtains high scores; I think this should be further investigated.
- The authors claim that samples created by other generative models are novel and uncontaminated. Recent work has demonstrated that models can be biased towards their own outputs (https://arxiv.org/abs/2404.13076).
- Citations are generally very poor. Some examples include: (1) In line 71, Perez et al. is not the correct citation for "jailbreaking" since the term is not used in that paper; you should probably cite Wei et al. (2) GPT-3 is not cited with its original paper in line 124.
Other, more minor weaknesses that affect the overall quality of the paper:
- The design of Figure 1 could be substantially improved.
- Fix the spacing issue below Figure 1: the caption and the text are too close. In general, all figure captions could be improved.
- The introduction feels too long and verbose.
- The sentence "Large Language Models (LLMs) are AI algorithms" is extremely inaccurate.
Questions
See weaknesses
This paper proposes the GRE Score, a new evaluation framework for the trustworthiness of LLMs. The framework uses paraphrasing to evaluate samples across five trustworthiness categories on eight open-source models. With reduced evaluation time, the GRE Score produces rankings consistent with TrustLLM.
Strengths
- The GRE Score introduces a novel evaluation approach that applies paraphrasing and perturbation to a small set of test samples.
- The evaluation reveals a correlation between GRE Score and ASR, demonstrating its ability to assess model vulnerabilities.
Weaknesses
- The primary contribution of the GRE Score appears to be a migration of the GREAT Score [1] from the vision domain to text, which may limit its novelty.
- In contrast to the vision domain, applying noise in semantic space lacks justification. Jailbreaking attacks, for instance, do not necessarily require semantic features similar to those of the original examples.
- Although the paper claims to certify an attack lower bound, using only 25 paraphrased sentences raises doubts about the robustness of this estimation; in the vision domain, certification for a single example usually requires over 1,000 noise-added versions (a rough sample-size bound is sketched after this list). Given the small number of paraphrased examples, I believe simply replacing the noise-based paraphraser with direct prompting of GPT-4o might yield similar correlations with ASR.
- This paper lacks citations and comparisons with significant related works. For example, previous holistic trustworthiness evaluations of LLMs, such as DecodingTrust [2] and PromptBench [3], which overlap in goals and some findings, are not discussed or compared against. In addition, when mentioning adversarial attacks on LLMs, seminal works such as the GCG attack [4] are not even mentioned or cited.
- The evaluation is limited to smaller open-source models (7B and 13B parameters) without testing on state-of-the-art closed-source models like the GPT series. This limitation weakens the conclusions about the impact of model size.
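To make the sample-size concern concrete (this is my own back-of-the-envelope bound, not the paper's analysis, and it assumes the per-paraphrase quantity being averaged is i.i.d. and bounded in $[0,1]$), Hoeffding's inequality gives

$$\Pr\big(|\hat{p}-p|\ge\varepsilon\big)\le 2\exp(-2n\varepsilon^{2}) \quad\Longrightarrow\quad \varepsilon=\sqrt{\frac{\ln(2/\delta)}{2n}},$$

so at confidence level $\delta=0.05$ the estimation error is roughly $\varepsilon\approx 0.27$ for $n=25$ paraphrases versus $\varepsilon\approx 0.04$ for $n=1000$ samples, which is why the 25-sample estimate looks loose compared to vision-domain practice.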
[1] Zaitang Li, Pin-Yu Chen, Tsung-Yi Ho. GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models.
[2] Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, Sang T. Truong, Simran Arora, Mantas Mazeika, Dan Hendrycks, Zinan Lin, Yu Cheng, Sanmi Koyejo, Dawn Song, Bo Li. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models.
[3] Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, Xing Xie. PromptBench: A Unified Library for Evaluation of Large Language Models.
[4] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models.
Questions
See comments above
The article introduces the Generative Risk Evaluation (GRE) Score, a new metric designed to assess the resilience of large language models (LLMs) against adversarial attacks, especially those from red-teaming efforts. The GRE Score offers a robustness certificate by evaluating LLMs across five dimensions: safety, privacy, robustness, ethics, and fairness. The study demonstrates the effectiveness of this approach by comparing it to traditional adversarial testing, providing insights into how model scaling, performance, and instruction tuning influence outcomes.
Strengths
- The GRE Score offers a comprehensive and rigorous metric for assessing the safety performance of LLMs under adversarial conditions. The metric is validated across various dimensions, including safety, privacy, robustness, machine ethics, and fairness.
- The study provides valuable insights, such as the nonlinear relationship between model size and performance, and highlights areas where LLMs tend to underperform, particularly in robustness and privacy.
Weaknesses
- The description of the GRE Score is complex and difficult to follow; its logic should be clearly aligned with Figure 1.
- The proposed method lacks ablation studies, such as assessing the impact of using the Diffusion Text Paraphraser for paraphrase attacks. The authors should conduct these studies for this crucial component or clearly explain their omission, given the availability of various paraphrasers.
- The paper discusses adversarial testing to simulate real-world worst-case scenarios but only considers paraphrasing, neglecting traditional adversarial methods such as DeepWordBug [1] for character-level perturbations (a toy illustration of such perturbations is given after this list).
- The efficiency of the GRE Score seems to benefit from the adversarial/jailbreaking method used, specifically the Diffusion Text Paraphraser. Comparing it with TAP, which iteratively refines attack prompts, is unfair. TAP could in principle be applied within the GRE Score for more rigorous stress testing, which should be included as an ablation study. Additionally, it is puzzling that the authors did not employ multiple jailbreaking methods to construct the most challenging Generative Risk Evaluation.
- Additional formatting issues include:
- Inconsistent spacing under Figure 1.
- The row height in Table 1 needs adjustment.
- Text in Figures 2-6 is too small and difficult to read, requiring modification.
- Spacing in Figures 3-5 also needs adjustment.
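As a concrete point of contrast with paraphrasing, here is my own toy sketch of the character-level perturbations referenced above (in the spirit of DeepWordBug, but not its token-importance scoring or full set of edit operations):

```python
# Toy character-level perturbation: swap two adjacent characters
# inside a randomly chosen longer word.
import random

def char_swap_perturb(text: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return text
    i = rng.choice(candidates)
    chars = list(words[i])
    j = rng.randrange(len(chars) - 1)
    chars[j], chars[j + 1] = chars[j + 1], chars[j]  # adjacent swap
    words[i] = "".join(chars)
    return " ".join(words)

print(char_swap_perturb("Please explain how the model handles unsafe requests."))
```

Such inputs are typically far from any paraphrase in an embedding space, which is exactly why stress testing based on paraphrasing alone may understate worst-case risk.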
Overall, I find the motivation for this work to be both novel and solid. However, the proposed method seems to lack comprehensiveness, particularly regarding the selection of prompt injection attacks. If the authors can address these weaknesses convincingly, I am open to revising my evaluation.
[1] Gao, J., Lanchantin, J., Soffa, M. L., & Qi, Y. (2018, May). Black-box generation of adversarial text sequences to evade deep learning classifiers. In 2018 IEEE Security and Privacy Workshops (SPW) (pp. 50-56). IEEE.
Questions
- Have the authors considered broadening the stress tests by incorporating a wider range of adversarial methods to achieve a more comprehensive and robust evaluation?
This paper proposes a novel and efficient metric, named GRE Score, to evaluate the robustness of existing LLMs. The proposed GRE Score tests multiple open-source LLMs across five domains, showing a ranking of existing LLMs consistent with attack-based methods.
Strengths
- The paper introduces an automatic pipeline to efficiently test LLM robustness across diverse dimensions.
- It covers various open-source models, and the proposed GRE Score demonstrates a high correlation with the attack ASR.
Weaknesses
- Lack of Comparison with Existing Benchmarks: The authors should further clarify the fundamental differences between the GRE Score and existing red-teaming benchmarks [1]. From my understanding, the GRE Score is a stress-testing method that leverages an efficient paraphraser to create different variants of harmful requests. This approach seems similar to existing jailbreak techniques [2,3,4,5]. Users could potentially use these existing jailbreak techniques to perform similar stress testing, especially considering that some are highly efficient [4,5].
- The Limitations of the Judgment Classifier: The paper mentions that the judgment classifier categorizes outputs as refusing or not refusing to answer. However, LLMs may sometimes use tactful language to refuse answers or substitute toxic content with unrealistic/imaginary words, potentially misleading the judgment classifier. How does the proposed pipeline handle such cases? (A toy illustration of this failure mode is sketched after this list.)
- Weak Interpretability of GRE Scores: If the GRE Score is within [0,1], users can interpret it as the probability that a model is vulnerable in a certain scenario. However, some GRE Scores may exceed 1.0 (e.g., the safety GRE Score of Llama-2-13B in Figure 4). How should users interpret GRE Scores larger than 1.0? The authors should provide a detailed value range as a reference.
- Limited Application Scope: The adversarial prompts in the GRE Score calculation rely heavily on paraphrasing. These prompts may fail against models with input filters, such as Llama-Guard. This limitation could restrict the application of the GRE Score for evaluating models with moderate to high safety alignment, such as GPT-4o, Claude-3, and the Gemini-1.5 series.
- Lack of Justification of The Necessity of GRE Score: The authors use ASR as a reference to evaluate the effectiveness of the GRE Score. Why not use the ASR of the generated prompts directly as the metric, rather than calculating the more complex GRE Score?
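To make the judgment-classifier concern concrete, here is a hypothetical illustration (this is not the paper's actual judge; `naive_is_refusal` and the example response are my own): a keyword-based refusal detector counts a tactful refusal as compliance, which would inflate ASR and depress the GRE Score even though the model behaved safely.

```python
# Hypothetical keyword-based refusal judge (not the paper's classifier).
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")

def naive_is_refusal(response: str) -> bool:
    # Counts a response as a refusal only if it starts with a known marker.
    return response.lower().startswith(REFUSAL_MARKERS)

tactful_refusal = "That question is better directed to a licensed professional."
print(naive_is_refusal(tactful_refusal))  # False -> wrongly counted as compliance
```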
More Presentation Issues:
- The quality of Figure 1 should be improved; it appears compressed along the width.
- The legends and axis labels in Figures 2-6 are too small to read.
- The authors seem to have used \vspace or similar techniques too aggressively, resulting in abnormal spacing between figures, captions, and the main text.
- In Figure 6, the caption mentions the GRE Score, but it is not shown in the figure (e.g., there is no legend labeled "GRE Score"). The lack of correspondence between the figure, the caption, and the text in the evaluation section makes the paper hard to follow.
[1] Mazeika, Mantas, et al. "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal." ICML. 2024.
[2] Zeng, Yi, et al. "How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms." ACL. 2024.
[3] Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).
[4] Liao, Zeyi, and Huan Sun. "Amplegcg: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed llms." COLM. 2024.
[5] Shen, Guangyu, et al. "Rapid optimization for jailbreaking llms via subconscious exploitation and echopraxia." arXiv preprint arXiv:2402.05467 (2024).
Questions
Please refer to the Weaknesses section.
Details of Ethics Concerns
The examples presented in the appendix may contain unethical words.
We thank the reviewers for their comments and have decided to withdraw the submission.