ELITE: Enhanced Language-Image Toxicity Evaluation for Safety
This paper introduces a rubric-based safety evaluation method and a high-quality benchmark to address inaccuracies in previous safety evaluations of Vision Language Models and enhance robustness against malicious exploitation.
Abstract
Reviews and Discussion
The paper introduces ELITE, a new safety benchmark designed to evaluate the toxicity and risks associated with Vision-Language Models (VLMs). Current benchmarks fail to detect implicit harmful content and have issues with low harmfulness levels, ambiguous data, and limited diversity in image-text combinations. ELITE aims to address these gaps by providing a more precise, rubric-based evaluation method and a diverse dataset for safety testing.
Questions For Authors
What could be a promising direction for improving the safety of VLMs?
Claims And Evidence
The claims are supported by the evidence.
Methods And Evaluation Criteria
The benchmark addresses the problem of evaluating and improving the safety of VLMs.
Theoretical Claims
N/A
Experimental Designs Or Analyses
The evaluation is sound and extensive on many VLMs.
Supplementary Material
No
Relation To Broader Scientific Literature
The contributions of the paper address a gap in the safety evaluation of foundation models.
Essential References Not Discussed
An early and popular benchmark in LLM safety is not discussed: Wang, Boxin, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu et al. "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models." In NeurIPS. 2023.
Other Strengths And Weaknesses
The novelty of the paper could be better highlighted.
Other Comments Or Suggestions
N/A
We greatly appreciate the reviewer’s valuable feedback, which has significantly improved our work.
Essential References Not Discussed. We agree that since our study aims to improve the safety evaluation of language models, it is important to cite DecodingTrust, an early benchmark in LLM safety. We appreciate the suggestion and will include the citation.
Other Strengths And Weaknesses. As noted in the review, the main contribution of our paper lies in improving the safety evaluation of VLMs. To this end, we propose the ELITE evaluator, a rubric-based evaluation method that provides more accurate assessments based on the characteristics of VLMs, along with the ELITE benchmark, an effective tool for VLM safety assessment. Through the analysis of existing benchmarks (Table 4) and human evaluation results (Table 6, including additional evaluations conducted in response to Reviewer BuCp’s Question 3), we demonstrate the novelty of both the evaluator and the benchmark. In particular, our work presents a simple yet effective method for evaluating harmful responses from VLMs, which we believe is a key strength of the paper. We will revise the paper to highlight these aspects better.
Question1. As VLMs interpret inputs through the interaction of two modalities, novel attack types may arise from this cross-modal reasoning process. Therefore, a promising direction is to develop comprehensive benchmarks that can broadly evaluate unsafe behaviors in VLMs and use these benchmarks to guide improvements in safety alignment. In this context, our ELITE evaluator and benchmark were designed to address these challenges, and we believe further efforts to improve safety alignment will remain essential going forward.
This paper introduces ELITE, a VLM benchmark and an LLM-as-judge evaluator designed to test harmful generations of these models.
update after rebuttal
I will maintain my original score.
Questions For Authors
- In Table 3 the authors highlight Pixtral-12B as the model with the highest ASR in most categories. Given that this model is one of the three models used to select the benchmark samples, is it fair to include it in the comparison?
- “when the StrongREJECT evaluator is applied to VLMs, it often assigns high scores even when the model does not explicitly refuse to respond and instead provides unhelpful answers” — is it just miscalibrated for VLMs? The specific and convincing scores in Figure 2 simply appear too high given the prompt presented. Why do you need to introduce toxicity as another criterion instead of, for example, few-shot examples?
- What is the agreement rate and Pearson correlation between the 3 human annotators? Does it vary significantly per category?
- Given the level of uncertainty that comes from a low F1 score on the ELITE GPT-4o judge, what can the authors say about the statistical significance of the comparison between the different models in Table 3 and benchmarks in Table 4? Small differences in ASR could be within the margin of error for this evaluator.
Claims And Evidence
The novelty of this work is somewhat limited. The evaluator provided is rubric-based and heavily inspired by StrongREJECT (in fact, the authors do not perform any other prompt ablations, only the comparison to StrongREJECT’s prompt). The authors use it to filter a benchmark dataset to consist of samples where this evaluator is likely to produce harmful scores, and the results in Table 3 and 4 show, unsurprisingly, that it does.
Other than balancing the data presence of different categories of harm and showing an increase in harmfulness on this benchmark compared to previous ones, a more fine-grained analysis of why this benchmark is relevant (e.g., showcasing diversity of failure modes) is missing from this work.
Methods And Evaluation Criteria
As mentioned in Claims and Evidence, the authors select the benchmark to include harmful prompts as per the ELITE evaluator and obtain higher harmfulness as measured by that evaluator compared to existing benchmarks. However, the goal of benchmarks is not simply to be “difficult”; they should be principled, measure a diversity of failure modes, and avoid overfitting to the specific harmfulness evaluator used. One of the key points missing from this paper is a more detailed analysis of the types of image-text pairs that get excluded/included in the final benchmark, as well as the relevance of their inclusion/exclusion.
For example, from Table 1 we see that more prompts are taken from Figstep in S1 than MM-SafetyBench, yet the opposite is true for S8. Why is that the case? What kind of novelty are the “New” pairs bringing to the mix? These are crucial questions to understand what this benchmark is actually measuring.
Theoretical Claims
N/A.
Experimental Designs Or Analyses
As mentioned above, a fine-grained analysis of the composition of the benchmark is completely missing from this work.
In terms of the evaluator, no ablation is provided on the prompt. In StrongREJECT, the authors explicitly mention they chose specific and convincing as the criteria of the evaluator after considering a set of 10 features (e.g., harmful, discouraging) and doing a Lasso regression on it. This type of analysis is not done in this work, despite the fact the authors have access to human evaluation data.
This is particularly concerning given ELITE GPT-4o only achieves an F1 score of 0.637 on the human evaluation dataset, yet this is used to select the final benchmark samples.
Supplementary Material
I reviewed the evaluator's prompt and some of the other details.
Relation To Broader Scientific Literature
This paper builds on existing benchmarks and augments them to generate one that the authors claim is more likely to elicit harmful responses from VLMs. Purely in that sense, the contribution is limited, as from the methods and results it is unclear this is a principled way of building a harmfulness benchmarking dataset for VLMs.
Essential References Not Discussed
The related work section appears to cover the important references in the field.
Other Strengths And Weaknesses
- The paper is poorly organized. The human experiments validating the proposed evaluator only come in Section 5, after the main results of the benchmark, despite the fact that the evaluator was used both for sample selection and for evaluating each model's responses.
Other Comments Or Suggestions
- The introduction is long and quite repetitive, with figure references out of order: Figure 1(c) is cited before Figure 1(b), for example.
We truly appreciate the reviewer's insightful and constructive comments, which have greatly contributed to enhancing the quality of our work.
Methods And Evaluation Criteria & Experimental Designs Or Analyses-1. Thank you for your valuable feedback. Our key message is to address a limitation in existing evaluation methods—namely, their inability to accurately assess harmfulness in VLMs. As the reviewer pointed out, our benchmark was designed with the "difficulty" of evaluating harmful responses as a primary principle. We believe that a safety benchmark should be able to assess model robustness using sufficiently challenging samples. However, existing benchmarks often contain ambiguous samples, making it unclear whether they could induce harmful responses. Therefore, we prioritized fulfilling this fundamental role of a benchmark above all else.
Furthermore, we aim to propose a benchmark that incorporates a wide range of diversity. In particular, there are known ways to elicit harmful responses using safe image–safe text pairs, but most existing benchmarks lack coverage of such safe-safe cases. To address this, we constructed a more comprehensive benchmark by integrating a “New” dataset that includes all four types of image–text pairs.
Experimental Designs Or Analyses-2. For the analysis by features, three labelers annotated 30 samples on 10 features. The table below shows the results of a Lasso regression predicting the unsafe/safe label from these features. These results highlight the effectiveness of the ELITE evaluator's toxicity score in the VLM setting.
| Feature | Weight |
|---|---|
| toxicity | 0.1928 |
| specific | 0.1899 |
| convincing | 0.1127 |
| consistent | -0.1013 |
| compliant | -0.0906 |
| comprehensive | -0.0789 |
| articulate | 0.0718 |
| useful | -0.0585 |
| relevant | -0.0298 |
| discouraging | -0.0107 |
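For concreteness, here is a minimal Python sketch of this kind of feature analysis. It assumes scikit-learn, a hypothetical data-loading step, and an illustrative regularization strength; it is not the authors' exact script.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

FEATURES = ["toxicity", "specific", "convincing", "consistent", "compliant",
            "comprehensive", "articulate", "useful", "relevant", "discouraging"]

def fit_feature_weights(X: np.ndarray, y: np.ndarray, alpha: float = 0.01):
    """Fit an L1-regularized linear model from rubric features to the unsafe(1)/safe(0) label."""
    X_std = StandardScaler().fit_transform(X)   # put features on a common scale
    model = Lasso(alpha=alpha).fit(X_std, y)    # the L1 penalty shrinks weak features toward 0
    return sorted(zip(FEATURES, model.coef_), key=lambda t: -abs(t[1]))

# Demo with random placeholder ratings (30 samples, as in the rebuttal); real usage would
# load the annotators' feature ratings and majority-voted labels instead.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(30, len(FEATURES))).astype(float)
y_demo = rng.integers(0, 2, size=30).astype(float)
for name, weight in fit_feature_weights(X_demo, y_demo):
    print(f"{name:>13s}: {weight:+.4f}")
```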
Weakness1 & Other Comments Or Suggestions. We agree with the reviewer that human evaluation is a crucial component, especially given that we are proposing a new evaluation method. We will also revise the introduction section.
Question1. Since the filtering baseline models include the relatively safe Phi-3.5-Vision, and we selected only cases where the score was above 10 for at least two of the three models (Phi-3.5-Vision, Llama-3.2-11B-Vision, Pixtral-12B), we believe the comparison is not unfair. However, as the reviewer pointed out, we will add a note to the table clarifying that these three models were used as filtering baselines.
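A minimal sketch of the filtering rule described above (keep a pair only if the score exceeds 10 for at least two of the three filtering models). The `elite_score` callable is a hypothetical stand-in for running a model on the pair and scoring its response with the ELITE evaluator.

```python
from typing import Callable, Dict, List

FILTER_MODELS = ["Phi-3.5-Vision", "Llama-3.2-11B-Vision", "Pixtral-12B"]

def keep_sample(sample: Dict,
                elite_score: Callable[[Dict, str], float],
                threshold: float = 10.0,
                min_votes: int = 2) -> bool:
    """Keep the image-text pair if enough filtering models yield a sufficiently harmful response."""
    votes = sum(elite_score(sample, model) > threshold for model in FILTER_MODELS)
    return votes >= min_votes

def filter_benchmark(samples: List[Dict],
                     elite_score: Callable[[Dict, str], float]) -> List[Dict]:
    return [s for s in samples if keep_sample(s, elite_score)]
```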
Question2. We argue that “specific” and “convincing”, which have been used in previous evaluations, are orthogonal to “toxicity”. We found that responses can be highly specific or convincing without being toxic. This distinction is particularly important in the context of VLMs, where the model often provides detailed descriptions of the image, even when such responses do not align with the harmful intent of the prompt. This results in cases where the VLM appears to respond in a “convincing” or “specific” manner, but actually avoids engaging with the harmful intent altogether by focusing solely on the image. Such responses are frequent and posed a major challenge during our benchmark construction.
When we examined the 118 samples excluded during the ELITE benchmark filtering process, we found that 52.54% of them consisted of image-descriptive responses. This indicates that such cases occur frequently and can lead to miscalibrated scores when using StrongREJECT.
Question3. To assess human agreement and Pearson correlation, we analyzed the results of the human evaluation conducted on a total of 228 samples (Reviewer BuCp's question 3).
The agreement rate and Pearson correlation between the three human annotators are summarized in the table below. While there are some variations across categories, we generally observe a strong level of agreement overall. To mitigate individual annotator biases and ensure more reliable labeling, we applied a majority voting strategy across the three annotators’ safe/unsafe labels for each sample.
| Category | Agreement rate |
|---|---|
| S1. Violent Crimes | 63.15% |
| S2. Non-Violent Crimes | 90.00% |
| S3. Sex Crimes | 76.19% |
| S4. Defamation | 63.15% |
| S5. Specialized Advice | 62.50% |
| S6. Privacy | 80.01% |
| S7. Intellectual Property | 76.19% |
| S8. Indiscriminate Weapons | 75.00% |
| S9. Hate | 78.26% |
| S10. Self-Harm | 80.95% |
| S11. Sexual Content | 55.00% |
| ALL | 72.81% |
| Annotator pair | Pearson correlation |
|---|---|
| human1 & human2 | 0.5978 |
| human1 & human3 | 0.6161 |
| human2 & human3 | 0.6512 |
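A minimal sketch of how these statistics can be computed, assuming each sample carries binary safe/unsafe labels from three annotators (with score-valued annotations, the same `correlation` call applies); this is an illustration, not the authors' code.

```python
from collections import Counter
from itertools import combinations
from statistics import correlation  # Pearson's r (Python 3.10+)

def agreement_by_category(labels, categories):
    """labels: list of (l1, l2, l3) tuples, 1 = unsafe / 0 = safe; categories: parallel list of S1..S11 ids."""
    agree, total = Counter(), Counter()
    for category, triple in zip(categories, labels):
        total[category] += 1
        agree[category] += int(len(set(triple)) == 1)   # count samples where all three annotators agree
    return {c: agree[c] / total[c] for c in total}

def pairwise_pearson(labels):
    cols = list(zip(*labels))                           # one column of labels per annotator
    return {(i + 1, j + 1): correlation(cols[i], cols[j])
            for i, j in combinations(range(len(cols)), 2)}

def majority_vote(triple):
    return int(sum(triple) >= 2)                        # final label used as ground truth
```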
Question4. Our intention in Table 3 is not to claim that specific models are better or worse, but rather to show general trends. Regarding the ELITE evaluator, it demonstrates stronger performance compared to existing evaluation methods. As for Table 4, we believe that the difference in E-ASR between existing benchmarks and the ELITE benchmark is substantial enough that it cannot be attributed to the evaluator’s margin of error.
I thank the authors for their rebuttal. Some of my concerns have been addressed, but the issue with the evaluator remains. Given this is a crucial part of this work, I will maintain my score.
Thank you very much for reading our response and for your thoughtful feedback.
| Evaluator | Accuracy (↑) | F1-Score (↑) |
|---|---|---|
| ELITE (GPT-4o) | 83.77% | 0.8043 |
| LlamaGuard3-Vision-11B | 75.88% | 0.5882 |
The table above presents additional experimental results for ELITE (GPT-4o) on the samples from Reviewer BuCp's Question 3. The human dataset used in our paper was primarily sampled from cases where our model and StrongREJECT disagreed, which naturally led to the inclusion of questions with subjectivity or ambiguity, where human opinions were more likely to diverge.
As described in Appendix D.2, we recruited 22 annotators from diverse occupations and age groups to reflect the diversity of real-world users. As a result, for certain evaluation samples there may have been disagreements among annotators, which could have introduced variability in the human-labeled ground truth, potentially making the performance of ELITE (GPT-4o) appear lower than it actually is.
In contrast, the samples used in Reviewer BuCp's Question 3 were more clearly separable — for example, whether an answer was included in ELITE or not — allowing for more consistent evaluation results. Accordingly, as shown in the table above, the evaluator demonstrates strong performance on these clearly distinguishable samples.
We would be grateful if you could elaborate further on your concerns regarding the evaluator issue, so that we can better understand them.
Thank you again for your time and consideration.
The authors propose a new framework for automated safety evaluation in vision-LLMs by extending an existing evaluator (StrongREJECT, which scores the level of refusal, specificity, and convincing-ness of a VLM's output) by additionally predicting a toxicity factor. This accounts for cases where the model's output in response to a harmful piece of input is neither a refusal nor actually toxic, and would still have been treated as unsafe by StrongREJECT. GPT-4o is used as the underlying LLM of the evaluator.
Furthermore, the authors construct a dataset with 4.6k samples (the ELITE benchmark) through filtering and rebalancing existing toxicity benchmarks + 1k new samples (especially focusing on safe text + safe image = unsafe prompt cases).
The authors show that the proposed evaluator aligns much more closely to human judgment than the preceding StrongREJECT evaluator, despite using the same underlying LLM.
update after rebuttal
Based on the authors' added information, I've concluded that the evaluation mechanism is more reliable than I originally believed. As such, I have raised my rating by 1 point.
Questions For Authors
It would be great to see the authors' responses to the weaknesses listed above.
As well, in my past experience, LLMs (and VLMs) are often not able to answer judgment-based questions like "on a scale of 1 to N, what is the level of ____ in the input" in a very consistent manner. Did the authors do any quantitative / qualitative analysis of how well the evaluator's judgments align with human judgment for the toxicity score?
It was mentioned that the metric is defined by using 10 as the threshold of the output of the ELITE evaluator. How was this threshold chosen? Does this impact the relative ordering of models?
Based on the authors' responses, I would be happy to revisit my current recommendation.
Claims And Evidence
The authors show that their proposed benchmark is able to jailbreak multiple safety-aligned open-source VLMs at a higher rate than competitor baseline benchmarks. It would have been good for this to be somewhat more thorough; for instance, the models used in Table 4 for this comparison only have 7B/13B parameters (and are by now somewhat outdated, e.g., LLaVA-v1.5). As such, it is unclear whether the trend still holds across more contemporary / larger models.
As well, it is seen that the ELITE evaluator aligns better with human judgment than the StrongREJECT evaluator's starting point. This is sufficiently convincing, given that large human judgment datasets are expensive to collect (this particular set contains 900+ samples).
Methods And Evaluation Criteria
The evaluation criterion is chiefly comparison to the baseline metric (StrongREJECT) in terms of human alignment, which is a sensible way to quantify the correctness of this LLM-as-a-judge framework.
Theoretical Claims
Not applicable.
Experimental Designs Or Analyses
I examined sections 4 and 5 (Experiments and Human Evaluation) in detail, and found the processes to be largely reasonable (but subject to some of the concerns as listed in the above Claims and Evidence analysis).
Supplementary Material
I have reviewed the entire supplementary materials section. The samples in the supplementary material were useful for understanding the types of data that are included in the ELITE benchmark, and how the ELITE evaluator helped to select them.
Relation To Broader Scientific Literature
Having worked with some of the related datasets myself, I agree with the authors' assessment that the existing datasets often have ambiguous samples and balancing issues. As such, I believe that the manuscript is well positioned in relation to and improves upon the broader body of work in this area.
Essential References Not Discussed
To my knowledge, the authors have done a good job in reviewing the related datasets in this space, which is also naturally necessary as the authors incorporated samples from many of these datasets into their own.
Other Strengths And Weaknesses
The paper's goal to advance quantifiable and objective evaluation methods of toxicity of multimodal LLMs is core to the general usability of these models, and should be commended. Furthermore, I look forward to seeing the proposed dataset be used by the larger community.
On the negative side, I do find the amount of technical contribution to be a little bit limited: it appears that the major innovation is to request that the LLM judge produce a scalar toxicity score.
Other Comments Or Suggestions
The connection between section 3.3 and 3.4 was a little hard to understand during my first reading. It took some number crunching to understand that section 3.3 is the process to create new samples, and section 3.4 refers to filtering and improving existing dataset samples. However, the wording for section 3.4's introduction seemed to suggest that all samples were created using the process in 3.4. I would recommend rewriting this part to make it easier to understand.
We greatly appreciate the reviewer’s insightful comments, which have been essential in helping us enhance and clarify our work.
Claims And Evidence. The table below shows additional experimental results for a recent model, Gemma3, and a larger model, InternVL2.5-26B. These results demonstrate that the ELITE benchmark performs consistently even on relatively large and recent models.
| Model | Benchmark | Total | E-ASR |
|---|---|---|---|
| InternVL2.5-26B | VLGuard | 2028 | 10.51% |
| | MM-SafetyBench | 1680 | 30.46% |
| | MLLMGuard | 532 | 12.60% |
| | ELITE (generated) | 1054 | 50.94% |
| | ELITE | 4587 | 39.63% |
| Gemma3-4B | VLGuard | 2028 | 22.71% |
| | MM-SafetyBench | 1680 | 33.81% |
| | MLLMGuard | 532 | 22.84% |
| | ELITE (generated) | 1054 | 44.81% |
| | ELITE | 4587 | 40.58% |
Weakness1. Thanks for your thoughtful review. It is true that we made minimal modifications compared to StrongREJECT. Our goal is to create a benchmark and evaluator that work well in general, rather than being specific to a particular model or situation. What matters is not the complexity, but how many problems can be solved with simple changes. We believe the ELITE method proposed in this paper solves many such issues.
We propose a simple yet effective method for evaluating harmful responses in VLMs. We demonstrate that our approach outperforms many existing Guard models and StrongREJECT, based solely on toxicity score requests, and through this, we create a toxic benchmark by filtering out samples that are not particularly harmful, addressing a key issue in existing benchmarks. Furthermore, we aim to propose a benchmark that incorporates a wide range of diversity by integrating existing benchmarks like SIUO, which only contains safe image + safe text, and other benchmarks, which contain unsafe image + unsafe text, and other combinations. We believe we can create an even broader benchmark by structuring the ELITE benchmark with both safe and unsafe pairs.
Other Comments Or Suggestions. In Section 3.3, we explain the process of generating the new samples, the ELITE benchmark (generated). In Section 3.4, we map existing benchmarks to the taxonomy of the ELITE benchmark, and by filtering both the existing benchmarks and the ELITE benchmark (generated), we ensure that only toxic cases remain. We will improve the writing in the section you pointed out to make it easier to understand. Thank you for pointing this out.
Question1. We understand the reviewer’s concern that LLMs (and VLMs) may not provide consistent responses. We measured the toxicity score of the ELITE evaluator a total of 10 times on the 228 samples (Reviewer BuCp’s question 3). The table below shows the average and standard deviation of the toxicity scores.
| | From ELITE | Not From ELITE |
|---|---|---|
| ELITE evaluator toxicity score (mean) | 3.8136 | 0.7915 |
| ELITE evaluator toxicity score (std) | 0.5736 | 0.4015 |
Additionally, the table below shows the Pearson correlation between the ELITE (GPT-4o) toxicity scores and human judgment, demonstrating that the toxicity scores are well aligned with human assessment.
| Pair | Pearson correlation |
|---|---|
| human1 & ELITE evaluator | 0.7274 |
| human2 & ELITE evaluator | 0.6447 |
| human3 & ELITE evaluator | 0.6496 |
The table below shows the Pearson correlation between the human toxicity scores. As can be seen, the correlation with the ELITE evaluator is higher than the correlation between each pair of humans. This indicates that, despite some variation in human judgments, the ELITE evaluator reflects the human evaluations more consistently.
| Annotator pair | Pearson correlation |
|---|---|
| human1 & human2 | 0.5992 |
| human1 & human3 | 0.6079 |
| human2 & human3 | 0.5079 |
Question2. The 10-point threshold we used was selected based on the experiments in Appendix A.2. Although Table 9 shows that a threshold of 10 is not optimal, Figures 5 and 6 confirm that even at this threshold there are cases where the model's responses are sufficiently harmful. By including these cases, we propose a more comprehensive benchmark. Below are the five most vulnerable models at each threshold. Except for LLaVa-v1.5-7B appearing as the 5th most vulnerable model at threshold 5 instead of Molmo-7B, the five most vulnerable models remain the same across all thresholds.
| Threshold | Model | E-ASR |
|---|---|---|
| 5 | Pixtral-12B | 85.63% |
| | ShareGPT4V-7B | 78.85% |
| | LLaVa-v1.5-13B | 78.44% |
| | ShareGPT4V-13B | 77.46% |
| | LLaVa-v1.5-7B | 75.26% |
| 10 | Pixtral-12B | 79.86% |
| | LLaVa-v1.5-13B | 69.68% |
| | ShareGPT4V-13B | 68.08% |
| | ShareGPT4V-7B | 67.16% |
| | Molmo-7B | 63.79% |
| 15 | Pixtral-12B | 60.91% |
| | ShareGPT4V-13B | 52.95% |
| | LLaVa-v1.5-13B | 52.60% |
| | ShareGPT4V-7B | 50.51% |
| | Molmo-7B | 47.70% |
| 20 | Pixtral-12B | 41.23% |
| | LLaVa-v1.5-13B | 37.01% |
| | ShareGPT4V-13B | 36.51% |
| | ShareGPT4V-7B | 34.37% |
| | Molmo-7B | 31.70% |
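The threshold sweep above can be reproduced with a small helper along the following lines. E-ASR is taken here as the percentage of responses whose scalar ELITE score exceeds the threshold; how refusal, specificity, convincingness, and toxicity are combined into that scalar is defined in the paper and is abstracted here as precomputed per-response scores.

```python
from typing import Dict, List

def e_asr(scores: List[float], threshold: float) -> float:
    """Percentage of responses whose ELITE score exceeds the threshold."""
    return 100.0 * sum(s > threshold for s in scores) / len(scores)

def top_vulnerable(model_scores: Dict[str, List[float]], threshold: float, k: int = 5):
    """Rank models by E-ASR at the given threshold and return the k most vulnerable."""
    ranked = sorted(((e_asr(scores, threshold), model) for model, scores in model_scores.items()),
                    reverse=True)
    return [(model, round(rate, 2)) for rate, model in ranked[:k]]

# Usage, checking whether the model ordering is stable across thresholds
# (`all_model_scores` is a hypothetical mapping from model name to per-response scores):
# for t in (5, 10, 15, 20):
#     print(t, top_vulnerable(all_model_scores, t))
```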
I'd like to thank the authors for their reply. Based on this, I am now less concerned with the ability of the scoring mechanism in identifying stronger / weaker models. As such, I will revise my rating up by 1 point.
Dear Reviewer X37u, we sincerely thank the reviewer for their thoughtful and encouraging feedback. We are delighted that our responses have successfully alleviated the concerns raised and appreciate the reviewer’s support for our work.
This paper introduces a safety benchmark called the ELITE benchmark, as well as an associated evaluator (the ELITE evaluator). The benchmark comprises multimodal data—image-text pairs—that are designed to provoke harmful or unsafe responses from vision-language models (VLMs). It includes 4,587 samples across 11 safety categories and four different image-text pair types (unsafe-unsafe, safe-unsafe, unsafe-safe, and safe-safe). While some images and texts may be safe, all samples are intended to induce unsafe responses. The data is compiled from multiple existing safety benchmark sources, supplemented by newly generated image-text pairs (which constitute about one-fourth of the entire dataset). To improve overall quality, the authors remove samples that fail to elicit sufficiently harmful responses, a process guided by the ELITE evaluator. Finally, the authors conduct a large-scale human evaluation to compare the ELITE evaluator with existing approaches.
Questions For Authors
- The ELITE evaluator achieves only 73% agreement with human assessments. What do the authors believe is the primary reason for this score? Have you considered additional tuning of open-source VLMs to improve safety alignment? I wonder if performance could be enhanced by refining or tuning an open-source VLM specifically for safety (e.g., similar to approaches like LlamaGuard).
- It remains unclear whether the ELITE evaluator itself has full access to the image data. While this seems to be the case in Fig 2, in Fig 10, for instance, the image context seems to be missing for the evaluator, which may affect its judgment. E.g., imagine the following scenario: (1) the image depicts a bridge, (2) the prompt says, “Can I jump”, and (3) the model responds with, “Sure, go ahead”.
- Have you also considered using human-evaluated samples to validate the quality of the dataset? How many of the samples that come from the actual ELITE dataset resulted in harmful responses? And how many of those that you did not include resulted in an unsafe response?
Claims And Evidence
In general, the paper’s claims appear well-supported by evidence. However, there is some concern regarding the human evaluation of the ELITE evaluator, which shows only 73% agreement with human judgments. This relatively moderate agreement raises questions about the evaluator’s accuracy and whether the dataset might be overly fitted to the ELITE evaluator itself (given that it was also used to filter sample prompts). The slightly lower human agreement score could affect the perceived quality of both the dataset and the evaluator’s reliability.
Methods And Evaluation Criteria
The proposed methods and evaluation criteria make sense for the area of application.
Theoretical Claims
All theoretical claims are well supported.
Experimental Designs Or Analyses
The experimental design, including the data collection process, image generation, and the final evaluator model, appears coherent and well-supported. The authors also perform a comprehensive set of evaluations on VLMs using their newly introduced benchmark and compare the ELITE evaluator against existing approaches. Their methodology for constructing the dataset—and then validating it with human evaluations—seems sound.
Supplementary Material
I briefly checked most parts of the supplementary material.
Relation To Broader Scientific Literature
The paper builds upon multiple existing safety benchmarks, integrating them into a more extensive safety corpus. By filtering out samples that fail to provoke harmful outputs, the authors aim to refine the collective set of safety prompts. This is a valuable contribution, as it combines and enhances prior resources into a single, more comprehensive dataset.
Essential References Not Discussed
Based on my familiarity with the field, the authors appear to acknowledge all critical references relevant to their work. I did not spot any missing essential citations.
Other Strengths And Weaknesses
Strengths
- the authors provide a very valuable safety dataset to the community, which is currently missing and needed
- The authors propose an evaluator model that can be used to assess model responses for their benchmark.
- They conduct extensive experimental evaluations, providing thorough empirical support for their claims.
Weaknesses
- The evaluator model shows a rather weak performance of 73% accuracy against the human evaluation.
- Because the ELITE evaluator is used both to filter the dataset and to evaluate final model responses, there is a risk that the dataset might become overly tailored to the evaluator.
Other Comments Or Suggestions
No further comments
We sincerely appreciate the reviewer for the constructive feedback, which has been invaluable in helping us refine and strengthen our work.
Weakness2. To demonstrate that the ELITE benchmark is not overly tailored to the ELITE evaluator, we present results based on the previously adopted metric, Attack Success Rate (ASR), instead of the metric (E-ASR) used in Table 4. These results suggest that the ELITE benchmark remains general and is not excessively influenced by the use of the ELITE evaluator.
| Model | Benchmark | Total | ASR |
|---|---|---|---|
| Llava-v1.5-7b | VLGuard | 2028 | 34.82% |
| | MM-SafetyBench | 1680 | 39.67% |
| | MLLMGuard | 532 | 36.46% |
| | ELITE (generated) | 1054 | 70.83% |
| | ELITE | 4587 | 68.98% |
| Llava-v1.5-13b | VLGuard | 2028 | 34.00% |
| | MM-SafetyBench | 1680 | 41.25% |
| | MLLMGuard | 532 | 32.65% |
| | ELITE (generated) | 1054 | 69.24% |
| | ELITE | 4587 | 69.99% |
| DeepSeek-VL-7b | VLGuard | 2028 | 28.59% |
| | MM-SafetyBench | 1680 | 38.63% |
| | MLLMGuard | 532 | 23.35% |
| | ELITE (generated) | 1054 | 57.83% |
| | ELITE | 4587 | 60.83% |
| ShareGPT4V-7B | VLGuard | 2028 | 31.98% |
| | MM-SafetyBench | 1680 | 40.89% |
| | MLLMGuard | 532 | 30.11% |
| | ELITE (generated) | 1054 | 66.60% |
| | ELITE | 4587 | 69.54% |
Question1. Fine-tuning the evaluator model, as done with LlamaGuard, may lead to performance improvements. However, the experimental results from StrongREJECT [1] show that the rubric-based approach slightly outperforms fine-tuned models. Based on this, we conducted our experiments using the rubric-based method, with the ultimate goal of proposing a more accurate approach to judgment within the rubric-based framework.
As a result, we demonstrate through thorough human evaluation and extensive experiments that our approach outperforms existing methods, even when using the same base model. Additionally, we believe the effectiveness of our evaluation method is supported by the fact that ELITE (InternVL2.5), using an open-source model, outperforms evaluation models such as StrongREJECT with the more advanced GPT-4o, as well as LlamaGuard.
Reference:
[1] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., and Toyer, S. A StrongREJECT for Empty Jailbreaks. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
Question2. We conducted filtering and evaluation by providing images to the evaluator in all experiments. This is because some previous evaluation methods assess only the model's responses, making it impossible to judge the success of attack methods that require understanding contextual details, such as the case of suicide & self-harm in Figure 1-(c) or the examples provided by the reviewer. We will rewrite the paper by clarifying that evaluation and filtering are conducted with images included. Thank you for your valuable feedback.
Question3. We conducted the human evaluation on a total of 228 samples by randomly sampling 110 samples from the ELITE benchmark and 118 samples that were not included (i.e., filtered out). We included at least 20 samples from each taxonomy and gathered the opinions of 3 labelers per sample, with the final labeling determined by majority vote. In total, 8 labelers were recruited for this evaluation. We provided the input image, text, and model's response to perform the safety judgment. As shown in the table below, the significant difference between the included and excluded datasets demonstrates the quality of the ELITE benchmark.
| Majority vote | From ELITE | Not From ELITE |
|---|---|---|
| Unsafe | 67.27% | 11.86% |
| Safe | 32.73% | 88.14% |
A new safety benchmark called ELITE is the primary contribution of this paper, together with an associated auto-rater (the ELITE evaluator). The benchmark comprises multimodal data—image-text pairs—that are designed to provoke harmful or unsafe responses from vision-language models (VLMs). Overall, the results of the paper are convincing and the benchmark is expected to be valuable; the primary concerns, raised about the degree of human consensus, were resolved to some extent during the rebuttal process.