PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models
We introduce PolygloToxicityPrompts, the first large-scale multilingual toxicity evaluation benchmark of 425K naturally-occurring prompts spanning 17 languages.
Abstract
Reviews and Discussion
The authors evaluated over 60 large language models (LLMs) with respect to their toxicity generation. They relied on a benchmark of 425K prompts covering 17 languages. The toxicity score was calculated with the Perspective API tool. In the study, they evaluated the effects of model size and language, among other attributes. The paper is well written and includes a good related work section.
Reasons to Accept
- A comprehensive evaluation over a relatively large number of prompts (425K), many languages (17), and many LLMs (60).
- Interesting insights on many aspects, e.g., alignment techniques, model categories, model size.
Reasons to Reject
- It is unclear whether the selected tool (Perspective API) is reliable for calculating the score.
- Other additional scoring tools could have been used for comparison, even if just for some of the languages.
- It is unclear whether any of the resources compiled in this work (dataset, prompts, scores, outputs) will be made available.
Questions for Authors
- I wonder why just one industry tool was used for calculating the score and whether other tools could have been considered for comparison, even if just for some of the languages. How good are Perspective API scores compared to those of other tools?
- I wonder whether any of the resources compiled in this work will be made available.
- Minor comments:
  - Section 3.2: please define prompts and continuations.
  - Section 3.2: please explain GPT-4 tokens with respect to other tokens.
  - Figure 3: please clarify why toxicity is considered high if scores are under 0.5 (the threshold defined in Section 3.4).
  - Section 4.3: please define base, instruct, and preference models, as well as the different preference-tuning methods (alignment).
Ethics Concerns Details
Maybe due to the topic of the paper (toxic language).
Thank you for the feedback!
Clarifying choice of toxicity detection tool
We selected Perspective API since it is the industry-standard tool for toxicity detection (used by over 1000 partners), is widely used for toxicity studies in academia (e.g., Gehman et al. '20, Si et al. '22, Hartvigsen et al. '22, Lin et al. '23), and supports all languages that we have considered. Since toxicity detectors are classifiers trained on varying toxic data, we chose the most commonly used detector available to us. Moreover, using Perspective API allows us to compare our dataset with existing works like RealToxicityPrompts, which used Perspective API as their default tool as well.
Reliability and accuracy of PerspectiveAPI: Perspective API tabulates the area under the ROC curve (AUC-ROC) on various datasets, and achieves > 0.9 for almost all the supported languages, as shown in the documentation.
Using other tools for some languages only: We have Llama Guard scores for the English subset for some models. We could compute toxicity scores using Unitary's Detoxify, but the computation would not be complete within the discussion period. Moreover, during the dataset creation phase, we observed divergent scores between Detoxify and Perspective API (the Pearson correlation between the scores was near 0). Hence, we proceeded with Perspective API, which is both an academic and industry-standard tool. Since our dataset is publicly accessible, we hope researchers will explore the use of other toxicity detection tools and compare them to Perspective API.
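For illustration, such a comparison could look roughly like the minimal sketch below. It assumes a Perspective API key and the `detoxify`, `google-api-python-client`, and `scipy` packages; the example texts and the `perspective_toxicity` helper are placeholders for illustration, not our actual pipeline.

```python
# Sketch: compare Detoxify and Perspective API toxicity scores (assumed setup).
from detoxify import Detoxify
from googleapiclient import discovery
from scipy.stats import pearsonr

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # hypothetical placeholder
texts = ["example text one", "example text two", "example text three"]

# Perspective API client, following Jigsaw's public documentation.
client = discovery.build(
    "commentanalyzer", "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

def perspective_toxicity(text: str) -> float:
    # Request only the TOXICITY attribute for a single comment.
    body = {"comment": {"text": text}, "requestedAttributes": {"TOXICITY": {}}}
    response = client.comments().analyze(body=body).execute()
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

persp_scores = [perspective_toxicity(t) for t in texts]
detox_scores = Detoxify("multilingual").predict(texts)["toxicity"]

# Pearson correlation between the two detectors' scores.
r, p = pearsonr(persp_scores, detox_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```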
Publicly available link to our resources
We have made our dataset of prompts along with associated toxicity scores available in the shared repository (linked on page 1). We will also add code to generate continuations using LLMs to the repository.
Minor comments for clarifications
Thank you for the suggested minor clarifications, we will make the required changes to our manuscript.
I thank the authors for clarifying the various points that I raised. I have no further questions.
The paper introduces PolygloToxicityPrompts, a dataset designed to assess toxicity generation across 17 languages. Using this dataset, the authors evaluate 62 large language model (LLM) variants, varying in both scale and architecture. Results show that model size within the same family, as well as preference tuning, have an impact on the amount of toxicity generated.
Reasons to Accept
- The study is exhaustive, including all of the more popular LLM families.
- The paper is well written and the process is easy to follow.
- The research questions are clear and reveal interesting patterns, especially the ones involving non-English results.
Reasons to Reject
- Toxicity is a difficult term to define and evaluate. This paper uses third-party metrics to evaluate toxicity, but it would be interesting to know in more detail what aspects they are measuring. Toxicity could be evaluated as the generation of words that are considered toxic (e.g., HolisticBias (https://arxiv.org/abs/2305.13198)), but that would ignore toxic examples that do not include words that are toxic out of context (e.g., "I think this reviewer is not intelligent", where none of the words could be considered toxic in isolation).
Questions for Authors
- What kind of toxicity is present in the dataset? It would be interesting to have more information about the specific phenomena in the dataset and how accurate Perspective API and Llama Guard are at identifying them.
Thank you for the positive and thoughtful feedback! We agree that the scope of measuring toxicity is very open-ended. Hence, we use Jigsaw's definition of toxicity as measured by Perspective API. In addition to the TOXICITY score, Perspective API also provides scores for IDENTITY_ATTACK, INSULT, PROFANITY, SEVERE_TOXICITY, and THREAT, as stated in their documentation.
We will provide additional data analysis based on the attributes mentioned above in our manuscript, to explore what kinds of toxicity are generated by different models. Moreover, we will provide the Llama Guard scores for the English subset of our dataset along with the distribution of harmful content in the subset as per Llama Guard’s taxonomy.
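For reference, a single Perspective API request can ask for all of these attributes at once. The sketch below follows the public API documentation; the API key and example text are placeholders.

```python
# Sketch: request all Perspective API attributes mentioned above in one call.
from googleapiclient import discovery

client = discovery.build(
    "commentanalyzer", "v1alpha1",
    developerKey="YOUR_PERSPECTIVE_API_KEY",  # hypothetical placeholder
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

attributes = ["TOXICITY", "IDENTITY_ATTACK", "INSULT",
              "PROFANITY", "SEVERE_TOXICITY", "THREAT"]
body = {
    "comment": {"text": "example continuation to score"},  # illustrative text
    "requestedAttributes": {attr: {} for attr in attributes},
}
response = client.comments().analyze(body=body).execute()

# Each attribute comes back with a summary score in [0, 1].
for attr in attributes:
    score = response["attributeScores"][attr]["summaryScore"]["value"]
    print(f"{attr}: {score:.3f}")
```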
The paper introduces a new dataset, PolygloToxicityPrompts, consisting of 425K prompts in 17 languages with varying toxicity levels. It is inspired by RealToxicityPrompts and includes prompts in the 17 languages supported by Perspective API, which supplies toxicity ratings. This language set covers a range of language families while excluding low-resource languages.
The paper evaluates 62 LLMs including base, instruction tuned, and preference tuned models. The experiments investigate the effects of per-language training data size, model size, alignment methods, and instruction following ability. (Experiments are run on a subset of 5K prompts per language due to the large number of models investigated.)
Reasons to Accept
- Clearly written, well motivated, and well supported with prior work. Clear limitations and ethics statements.
- A large benchmark of significant use to the field.
- Benchmark is 83% natural text and not a translated version of English benchmarks.
- Large number of LLMs evaluated.
- Evaluation metrics are clearly explained and appropriate.
- Identifies some subtle confounders such as models that produce shorter or poor quality responses appearing to have lower toxicity.
Reasons to Reject
I don't see any reasons to reject.
Questions for Authors
Given the source of the prompts in publicly available corpora, what role does memorization play in toxic outputs?
Thank you for the encouraging feedback and positive endorsement of our work!
To address your question: since the training data for nearly all models considered in our study are proprietary, it is hard to comment on the role memorization would play in toxic outputs. However, we sample 10 generations per prompt with a high temperature setting to introduce randomization and alleviate concerns about memorization. We hope that more LLMs will be released along with their pretraining data (as is the plan for AI2's OLMo), which will enable more direct studies of this phenomenon that until now only companies themselves have been able to do (e.g., Longpre et al. '23).
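Concretely, the sampling setup can be sketched as follows with Hugging Face transformers; the model name, prompt, and decoding values below are placeholders, not our exact configuration.

```python
# Sketch: sample multiple high-temperature continuations per prompt (assumed settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the study covers many different LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "An example prompt from the benchmark"  # illustrative, not from the dataset
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        do_sample=True,            # stochastic decoding rather than greedy
        temperature=1.0,           # high temperature to diversify samples
        max_new_tokens=20,
        num_return_sequences=10,   # 10 generations per prompt, as in the response
        pad_token_id=tokenizer.eos_token_id,
    )

# Strip the prompt tokens so only the continuations remain.
continuations = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
for i, text in enumerate(continuations):
    print(f"[{i}] {text}")
```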
Thanks for the reply. The explanation of multiple samples at high temperature to minimize memorization effects sounds like a reasonable approach.
The authors present PolygloToxicityPrompts (PTP), a multilingual toxicity evaluation benchmark consisting of 425K prompts spanning 17 languages. Further, the authors use the PTP benchmark to study the impact of model size, prompt language, and instruction- and preference-tuning methods on toxicity by benchmarking over 60 LLMs.
Reasons to Accept
- The authors do a thorough job of including diverse languages in building the toxicity dataset from mC4 and The Pile.
- The authors present their evaluation of a wide range of LLMs, including open-source LLMs.
- The authors present insights that are ripe for future work, for example the correlation between input and output toxicity for different models.
Reasons to Reject
The authors clarify that, to attain a larger sample of toxic content for languages with low toxicity rates, they create synthetic high-toxicity data. Specifically, they translate toxic samples from the mC4 and The Pile corpora into target languages using the NLLB-3B model to create ≈ 70K translated prompts across 9 languages. The fact that this amounts to only 16.8% of the dataset would be a non-issue, and indeed welcome, in other contexts; but given that the goal of this paper is to be multilingual, it makes me wonder if this is incorrect in principle. While I don't think this impacts the value of the rest of the dataset, the authors should make it more explicit upfront, or elsewhere.
Questions for Authors
Refer above.
- The README in the shared repository is not very helpful. It's just an abstract.
Thank you for the feedback! While we do have a subsection dedicated to detailing these challenges and our process for creating synthetic high-toxicity data, we will revise the introduction to make it more explicit that our dataset also contains some synthetically generated, translated data.
We will also update the shared repository with code to load our dataset and generate continuations with LLMs, as well as a more detailed README.
Thank you! With regards to the shared repository, the README has still not been updated. I'm slightly confused about the utility and motive of sharing an anonymous GitHub repository without a bare-minimum README to navigate it. Right now, all it has is the abstract from the paper.
Thank you for the follow-up! We apologize for the delay in updating the README; we have now updated it with details about the dataset. Due to our dataset's size and GitHub's storage constraints, we are currently unable to add code to the repository. We plan to move our dataset to Hugging Face for easier access and to update the repository with code to evaluate toxic degeneration in arbitrary LLMs after the discussion period, to maintain anonymity. We hope that you find the README more helpful now.
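Once the dataset is on Hugging Face, loading it should take only a couple of lines; the repository id in the sketch below is a placeholder, since the final hosting location is not yet fixed.

```python
# Sketch: loading the dataset from Hugging Face (hypothetical repository id).
from datasets import load_dataset

ptp = load_dataset("org-name/PolygloToxicityPrompts", split="train")  # placeholder id
print(ptp[0])  # a prompt record with its associated toxicity score
```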
This paper provides a multilingual dataset for eliciting toxic responses from large language models. The authors systematically collected an extensive set of 25,000 prompts per language covering 17 languages. Furthermore, using the dataset, this paper provides a detailed analysis of toxicity in large language models by considering variables such as prompt language, model size, model alignment methods, and toxicity evaluation methods.
Reasons to Accept
- The key contributions of this paper are: a) a large multilingual corpus for evaluating toxicity, and b) an analysis of current open- and closed-source models based on these prompts.
- The resources provided in this paper would enable future research on toxicity in multilingual generations
- The paper also presents novel insights into studying toxic generation. For example, instruction tuning and preference tuning play a role in reducing toxicity. Interestingly, specific alignment methods such as DPO or SFT-PPO do not lead to significant changes in toxicity.
- Last but not least, this paper is very well written, easy to follow, and provides clear reproducibility statements, indirectly enabling further research in developing similar benchmarks for LLMs.
Reasons to Reject
N/A
Thank you for the encouraging feedback and the very positive review!
Great, once again, thank you for the good work!
The paper designs and curates a large multilingual benchmark (425K prompts) to measure toxicity across 17 languages. This is a valuable resource for future work, especially given that the multilingual prompts, for the most part, are not translations from English. The authors have evaluated a large number of LLMs (62) on this benchmark, providing insights on the role of various factors, such as model size, in toxicity generation. I agree with the reviewers about the impact of this work, and encourage the authors to apply the reviewers' comments. Another interesting question to explore is whether the performance of models on natural prompts correlates with that on the synthetically generated ones (and whether the latter can be a useful replacement).