PaperHub
Overall: 5.8 / 10 · Poster · 5 reviewers
Scores: 8, 5, 4, 6, 6 (min 4, max 8, std 1.3)
Confidence: 3.2 · Correctness: 2.6 · Contribution: 2.4 · Presentation: 2.8
NeurIPS 2024

Risk-Averse Fine-tuning of Large Language Models

Submitted: 2024-05-15 · Updated: 2025-01-13
TL;DR

By optimizing the risk measure of Conditional Value at Risk (CVaR), our methodology trains LLMs to exhibit superior performance in avoiding toxic outputs while maintaining effectiveness in generative tasks.

Abstract

Keywords
Risk Averse Reinforcement Learning; Large Language Model Finetuning

Reviews and Discussion

Official Review
Rating: 8

This paper addresses the challenge of mitigating the generation of negative or toxic content by Large Language Models (LLMs) in response to certain prompts. It proposes an innovative approach that integrates risk-averse principles into LLM fine-tuning, employing the CVaR risk measure within an RLHF framework to reduce harmful outputs, particularly rare but significant events. This method improves the ability of LLMs to avoid generating toxic content, particularly in high-risk scenarios, while maintaining overall performance in generative tasks. Empirical evaluations demonstrate the effectiveness of this approach in promoting safer online interactions and enhancing the applicability of LLMs in various domains.

Strengths

It introduces an innovative approach to mitigating toxic content generation by integrating risk-averse principles, validated through comprehensive empirical evaluations. The clarity of presentation and detailed methodology enhance the paper’s accessibility, while its significance lies in addressing a critical issue with broad applicability and promoting responsible AI. These strengths make the paper a valuable contribution to the fields of natural language processing and machine learning.

Weaknesses

While the paper presents several innovative contributions, it also has potential weaknesses in terms of complexity, computational resource requirements, data dependency, interpretability, practical implementation challenges, adaptability, and the scope of validation. Addressing these weaknesses in future research would help to further validate and enhance the practical applicability and robustness of the proposed approach.

Questions

N/A

Limitations

The long-term efficacy of the risk-averse mechanisms in continuously evolving real-world settings remains uncertain. The model’s performance might degrade over time as new types of toxic or harmful content emerge, requiring ongoing updates and retraining.

Author Response

Thank you for appreciating and recognizing the strengths of our work. Please find additional details of interest below.

Computational resource requirement. The computational resource requirements are mentioned in Appendix E.1.

Computational complexity. Our algorithm has the same computational complexity as that of RLHF during the first n iterations. Once the soft risk scheduling kicks in, our algorithm introduces an additional best case computational complexity of $O(B + \alpha'\log(B))$, where $B$ is the batch size and $\alpha'$ is the risk level that decreases across iterations. The space complexity remains the same as that of RLHF.
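For intuition, below is a minimal sketch of the kind of per-iteration risk-batch selection this cost refers to, assuming (in the spirit of CeSoR-style soft risk scheduling) that each iteration keeps only the lowest-return $\alpha'$-fraction of episodes, scored by the KL-regularized return; the function name, the KL coefficient `beta`, and the use of `heapq` are illustrative assumptions, not the authors' implementation.

```python
import heapq

def select_risk_batch(rewards, kl_divs, alpha, beta=0.2):
    """Return indices of the worst alpha-fraction of episodes by regularized return.

    rewards: per-episode environment rewards (e.g., sentiment/toxicity scores)
    kl_divs: per-episode KL(policy || reference policy) penalties
    alpha:   current soft risk level in (0, 1]; 1.0 recovers plain RLHF
    beta:    hypothetical KL penalty coefficient
    """
    # KL-regularized return, as in the standard RLHF objective: reward - beta * KL
    returns = [r - beta * kl for r, kl in zip(rewards, kl_divs)]
    k = max(1, int(alpha * len(returns)))
    # heapq.nsmallest over a batch of size B costs O(B log k),
    # in line with the extra selection cost mentioned above
    return heapq.nsmallest(k, range(len(returns)), key=returns.__getitem__)

# Example: with alpha = 0.5, only the two lowest-return episodes are kept
# for the subsequent PPO update.
print(select_risk_batch(rewards=[1.2, -0.5, 0.3, -2.0],
                        kl_divs=[0.1, 0.2, 0.1, 0.3],
                        alpha=0.5))
```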

Contribution. Our work is the first to introduce a nuanced understanding of "risk" in the context of LLM content generation, going beyond Greenberg et al. [2022]'s work. Greenberg et al. [2022] proposed soft risk scheduling to make policies learned using policy gradients risk-averse. Presented below are our contributions:

  1. We implement CVaR in conjunction with a regularized reinforcement learning objective (reward + KL term). Greenberg et al. [2022] work only with the plain reward. We choose to work with the regularized reward for two reasons: I. We want to measure risk in generations accounting for both the performance on the actual environment reward and the quality of language generation measured by the KL-Divergence with respect to the reference policy. II. This choice makes our proposed algorithm backward compatible with existing RLHF implementations.

  2. We implement CVaR in the Actor-Critic setting, as opposed to policy gradients, with an aim to learn a complex parameterized policy (LLM) with an extremely large action space.

  3. Beyond the immediate application of creating safer LLMs, this work contributes to the broader field of machine learning by demonstrating how risk measures like CVaR can be integrated into the training process of complex models like LLMs. Our work additionally establishes a groundwork for exploring additional risk measures and criteria, such as the Entropic Value at Risk (EVaR), in the context of LLM safety and uncertainty quantification.

Adaptability. In the scenario that the input prompt distribution shifts from the distribution our RA-RLHF models are finetuned on, our models can be further trained on the new prompt data using our RA-RLHF algorithm. As noted in our submission (Sec. 5.1 and 5.2), RA-RLHF ensures safe generations without hampering the quality of language generation. We believe this behavior would hold true under retraining for input data distribution shifts.

Validation. In our work, we studied the inclusion of safety in LLMs using three datasets - IMDB, Jigsaw and RealToxicityPrompts - over GPT-2 (117M parameters) and GPT-J (6B parameters). Most of the related works in the LLM safety space work with the IMDB and RealToxicityPrompts datasets with models ranging up to GPT-2 L (762M parameters), as used in DExperts and Quark. RECT, another related work, studies only the RealToxicityPrompts dataset with GPT-2, GPT-2 XL (1.5B) and GPT-3 (175B). In light of this information, we believe our experiments are comprehensive in terms of the number of models and datasets studied.

Comment

I have read the rebuttal and thank the authors for the responses.

Official Review
Rating: 5

To mitigate the generation of negative or toxic content by LLMs, this paper proposes a new method, "risk-averse" RLHF (RA-RLHF), to finetune LLMs. Their RA-RLHF optimizes the CVaR risk measure with RL to reduce the negativity or toxicity of generations. They experiment with two datasets, IMDB-Gen and Jigsaw. Their experiments show the effectiveness of the proposed method and compare it against conventional RLHF and supervised fine-tuning.

Strengths

  1. The paper investigates an important problem of how to mitigate the generation of negative or toxic content by LLMs.
  2. The paper provides an extensive literature review, background, and introduction of the proposed method.

Weaknesses

  1. Lack of clarity. I wonder how their method involves human feedback. Reading through Algorithm 1, it is unclear how human feedback is used and how they obtain human feedback. I also wonder how they get their reward function. Do they use any annotated data to train the reward function? Are the reward functions used in training and testing the same?
  2. Lack of solid evaluation. The proposed method is evaluated on reward and perplexity. However, it is not clear how the reward function and perplexity are calculated and which models are used. In previous work [1], the creativity of LLM generations is also evaluated and human evaluation is an important method to verify the generations' quality. However, this paper has provided neither evaluation.

[1] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online. Association for Computational Linguistics.

Questions

Please see the questions in the Weaknesses section.

Limitations

The paper provided limitations and ethical discussion in Sections A and B in the appendix.

Author Response

Thank you for your insightful questions and comments. Before we begin answering the questions, we wanted to highlight that in our work, we studied the inclusion of safety in LLMs using three datasets - IMDB, Jigsaw and RealToxicityPrompts - over GPT-2 (117M parameters) and GPT-J (6B parameters).

Q1. Lack of clarity. I wonder how their method involves human feedback. Reading through Algorithm 1, it is unclear how human feedback is used and how they obtain human feedback. I also wonder how they get their reward function. Do they use any annotated data to train the reward function? Are the reward functions used in training and testing the same?

A1. Human feedback is used to learn the score/reward models used for finetuning in our work. For the experiments with toxicity datasets, we use the toxicity score returned by unitary/toxic-bert LLM as our reward. For the positive sentiment generation task using IMDB dataset, we use the sentiment score returned by lvwerra/distilbert-imdb LLM as the reward. These rewards are used in line number 7 of our algorithm. We mention the use of existing reward models, as is the case in most related works, in line numbers 155-156 in Sec. 3. We have mentioned the reward function details in Appendix E.1 where we provide model and compute details. We can, however, make this more clear by adding this information in Sec. 4.1.
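For concreteness, here is a minimal sketch of how such off-the-shelf score models can be queried with the HuggingFace `transformers` text-classification pipeline; the label names ("POSITIVE", "toxic") and the output post-processing are assumptions about the model configurations, and the authors' reward pipeline may differ in details such as batching and score normalization.

```python
from transformers import pipeline

# Sentiment scorer for the IMDB positive-sentiment task
sentiment_pipe = pipeline("text-classification",
                          model="lvwerra/distilbert-imdb", top_k=None)
# Toxicity scorer for the Jigsaw / RealToxicityPrompts tasks
toxicity_pipe = pipeline("text-classification",
                         model="unitary/toxic-bert", top_k=None)

def _label_scores(output):
    # The pipeline may return [{...}, ...] or [[{...}, ...]] depending on version
    items = output[0] if isinstance(output[0], list) else output
    return {d["label"]: d["score"] for d in items}

def sentiment_reward(text: str) -> float:
    # Score of the assumed "POSITIVE" label; higher is better for IMDB-Gen
    return _label_scores(sentiment_pipe(text))["POSITIVE"]

def toxicity_score(text: str) -> float:
    # Score of the assumed "toxic" label; lower is safer
    return _label_scores(toxicity_pipe(text))["toxic"]

print(sentiment_reward("A wonderful, heartfelt film."))
print(toxicity_score("This comment is perfectly polite."))
```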

Q2. Lack of solid evaluation. The proposed method is evaluated on reward and perplexity. However, it is not clear how the reward function and perplexity are calculated and which models are used. In previous work [1], the creativity of LLM generations is also evaluated and human evaluation is an important method to verify the generations' quality. However, this paper has provided neither evaluation.

A2. Score/Reward Metric. For the experiments with toxicity datasets, we use the toxicity score returned by unitary/toxic-bert LLM as our metric. The LLM unitary/toxic-bert is trained to classify toxic comments on 3 Jigsaw challenges: Toxic comment classification, Unintended Bias in Toxic comments, and Multilingual toxic comment classification. Thus, the results included in the paper are indeed toxicity scores. For the positive sentiment generation task using IMDB dataset, we use the sentiment score returned by lvwerra/distilbert-imdb LLM. The LLM lvwerra/distilbert-imdb is trained to score positive sentiment present in a given piece of text, thus, making the returned score an appropriate metric for the task. These are also the models used to obtain reward function during training. We have mentioned these details in Appendix E.1 where we provide model and compute details.

Perplexity Metric. Perplexity is calculated to assess "how likely is a piece of text to be generated by a model", mathematically evaluated as $\text{PP}(W) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \ldots, w_{i-1})}$. Here, $\text{PP}(W)$ is the perplexity of the model on the given text $W$. We keep $W$ fixed across models. We choose positive prompts and completions from the test dataset to form $W$ to capture how positive/non-toxic the models are. $N$ is the total number of words in the text. $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability assigned by the model to the $i$-th word given the preceding words. Perplexity calculation code is included in Appendix G. We can expand the discussion on perplexity in Appendix G as above. We can also briefly mention it in Sec. 5.2.
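For reference, a minimal sketch of this computation for a causal LM, using GPT-2 and the token-level cross-entropy returned by `transformers` (the formula above is stated per word and in base 2, but $2^{-\frac{1}{N}\sum \log_2 p} = e^{-\frac{1}{N}\sum \ln p}$, so exponentiating the mean natural-log loss gives the same quantity, up to the word-versus-token granularity); this is an illustrative rewrite, not the authors' Appendix G code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    # Passing labels=input_ids makes the model return the mean token-level
    # negative log-likelihood (cross-entropy) over the sequence
    loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()  # exp(mean NLL) = perplexity

print(perplexity("The movie was a delightful surprise from start to finish."))
```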

Creativity of LLM generation. We have now included results on the text-diversity metric Distinct-n (Dist-n) used in DExperts [1]. The results cover the existing algorithms along with three new baselines (DExperts, Quark [2], and Prompted GPT-2), as described in the attached author rebuttal pdf. We observe that across datasets, models returned by our proposed algorithm RA-RLHF enjoy the best performance in terms of safety scores while maintaining text coherence and diversity.
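For completeness, a minimal sketch of a Dist-n computation (unique n-grams divided by the total number of n-grams across generations); whitespace tokenization and this particular normalization convention are assumptions, as implementations of the metric vary slightly.

```python
def distinct_n(generations, n=2):
    """Dist-n: number of unique n-grams over the total number of n-grams."""
    unique, total = set(), 0
    for text in generations:
        tokens = text.split()  # assumption: simple whitespace tokenization
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        unique.update(grams)
        total += len(grams)
    return len(unique) / max(total, 1)

samples = ["the movie was great", "the movie was surprisingly great and fun"]
print(distinct_n(samples, n=1), distinct_n(samples, n=2))
```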

References

[1] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online. Association for Computational Linguistics.

[2] Lu, Ximing, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. "Quark: Controllable text generation with reinforced unlearning." Advances in neural information processing systems 35 (2022): 27591-27609.

Comment

Sorry for the late response. Thanks for your response and new results. I found that most of my concerns have been addressed. I updated my rating accordingly.

Comment

Dear Reviewer,

Please let us know if our rebuttal answered your questions. Please let us know if you have any further concerns.

Thank you.

Official Review
Rating: 4

The paper presents a method for fine-tuning large language models (LLMs) using risk-averse reinforcement learning from human feedback (RA-RLHF). The core contribution is the integration of risk aversion into the RL fine-tuning process to minimize the generation of toxic content, particularly in responses to harmful or challenging prompts. This is achieved by optimizing the Conditional Value at Risk (CVaR), which focuses on minimizing the expected harm in the most severe scenarios, rather than average outcomes. The authors claim this approach allows LLMs to perform effectively under challenging input conditions, maintaining good generative performance while reducing toxicity.
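For reference, a standard definition of the risk measure referred to here, stated for a return $R$ that the policy seeks to maximize (sign and tail conventions vary across papers, so this may differ cosmetically from the authors' exact formulation):

$$
\mathrm{VaR}_\alpha(R) = \inf\{\, r : \Pr(R \le r) \ge \alpha \,\}, \qquad
\mathrm{CVaR}_\alpha(R) = \mathbb{E}\big[\, R \mid R \le \mathrm{VaR}_\alpha(R) \,\big],
$$

i.e., the expected return over the worst $\alpha$-fraction of episodes; maximizing $\mathrm{CVaR}_\alpha$ concentrates training on the most harmful (lowest-reward) prompts rather than on average outcomes.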

Strengths

  1. The experimental results, as demonstrated in the plots and tables, show that RA-RLHF significantly reduces toxic outputs.
  2. The paper addresses a critical need for safer LLM applications, proposing a solution that could help the development of ethical AI systems.
  3. The methodological explanation is thorough, detailing how the RA-RLHF integrates with existing LLM training frameworks and how it adjusts for risk during training iterations, which provides a clear roadmap for replication and future research.

Weaknesses

  1. The approach primarily adapts the CeSoR algorithm by Greenberg et al. [2022] to the text generation context. This application, while effective, does not significantly extend the original algorithm's conceptual framework, leading to questions about the novelty of the contribution. Essentially, the method repurposes an existing risk-averse reinforcement learning strategy for LLMs, which may limit its perceived innovation in the field.
  2. Insufficient Comparison with Other Baselines: The evaluation of RA-RLHF mainly involves comparisons with SFT and standard RLHF. However, there are numerous other techniques for reducing toxicity in text generation that were not considered, such as rejection sampling, DExperts by Liu et al. [2021], QUARK by Lu et al. [2022], RECT by Cao et al. [2023], and prompt-based methods such as Self-Debias. The exclusion of these methods from the comparative analysis might give an incomplete picture of the proposed method's effectiveness relative to the current state-of-the-art.
  3. The paper does not compare its approach with simpler prompt-based methods, which direct LLMs to generate non-toxic content through instructions. Given the advanced instruction-following capabilities of current LLMs, this omission could be a significant oversight, as such methods may offer simpler and potentially equally effective alternatives for reducing toxicity.

Questions

  1. Consider comparing the proposed method with simpler prompt-based approaches, which direct LLMs to generate non-toxic content through specific instructions.

  2. Missing references:

  • Lu, Ximing, et al. "Quark: Controllable Text Generation with Reinforced Unlearning." arXiv preprint arXiv:2205.13636 (2022).
  • Cao, Meng, et al. "Systematic Rectification of Language Models via Dead-end Analysis." https://arxiv.org/abs/2302.14003 (2023).

Limitations

Consider evaluating the proposed method on more complex tasks spanning various domains.

Author Response

Thank you for your insightful questions and comments.

C1. The approach primarily adapts the CeSoR algorithm by Greenberg et al. [2022]...

A1. Our work is the first to introduce a nuanced understanding of "risk" in the context of LLM content generation, going beyond Greenberg et al. [2022]'s work. Greenberg et al. [2022] proposed soft risk scheduling to make policies learned using policy gradients risk-averse. Presented below are our contributions:

  1. We implement CVaR in conjunction with a regularized reinforcement learning objective (reward + KL term). Greenberg et al. [2022] work only with the plain reward. We choose to work with the regularized reward for two reasons: I. We want to measure risk in generations accounting for both the performance on the actual environment reward and the quality of language generation measured by the KL-Divergence with respect to the reference policy. II. This choice makes our proposed algorithm backward compatible with existing RLHF implementations.

  2. We implement CVaR in the Actor-Critic setting, as opposed to policy gradients, with an aim to learn a complex parameterized policy (LLM) with an extremely large action space.

  3. Beyond the immediate application of creating safer LLMs, this work contributes to the broader field of machine learning by demonstrating how risk measures like CVaR can be integrated into the training process of complex models like LLMs. Our work additionally establishes a groundwork for exploring additional risk measures and criteria, such as the Entropic Value at Risk (EVaR), in the context of LLM safety and uncertainty quantification.

C2. Insufficient Comparison with Other Baselines...

A2. Thank you for pointing these out. We have now included results over these in the attached author rebuttal pdf. We could not add comparisons with RECT as the model checkpoints are not publicly available for inference. However, we will be sure to mention the work in our related work section (Sec. 2). We observe that across datasets, models returned by our proposed algorithm RA-RLHF enjoy the best performance in terms of safety scores while maintaining text coherence and diversity.

C3. The paper does not compare its approach with simpler prompt-based methods...

A3. Thank you for pointing this out. We have now added results over a Prompt + GPT-2 baseline in the attached author rebuttal pdf. We in-context prompt GPT-2 to generate safe text. At inference time, we add the prompt "Generate positive sentiment" to the IMDB prompts. For the toxicity datasets (Jigsaw and RealToxicityPrompts), we add the prompt "Generate non-toxic text". Generations from the prompted model enjoy only a marginal improvement in terms of safety scores over vanilla GPT-2.
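As an illustration, a minimal sketch of this kind of prompted baseline, assuming the instruction string is simply prepended to each test prompt before sampling; the decoding parameters are illustrative and not the authors' exact generation settings.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def prompted_generation(prompt: str, task: str = "toxicity") -> str:
    # Instruction strings as quoted in the response above
    instruction = ("Generate positive sentiment. " if task == "sentiment"
                   else "Generate non-toxic text. ")
    out = generator(instruction + prompt, max_new_tokens=30, do_sample=True)
    return out[0]["generated_text"]

print(prompted_generation("The customer service at this store was", task="sentiment"))
```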

Thank you for pointing out the missing references. We will be sure to mention these in our related work section (Sec. 2).

Comment

Thanks to the author for the response. The additional experiments addressed some of the concerns. However, I don't think the prompting baseline result in the attached pdf makes sense since GPT2 is relatively weak and lacks the ability to follow instructions. Having said that, I will raise the score by 1.

Official Review
Rating: 6

The authors propose a fine-tuning method to mitigate text degeneration, such as toxic outputs, in large language models. The proposed method optimizes a Conditional Value at Risk-inspired criterion. Experimental results show that the proposed method outperforms various baselines on two datasets.

Strengths

  • The paper is well-written, and the proposed method is relatively easy to understand.
  • The task addressed is important and relevant.

Weaknesses

  • The experimental results are limited. I realize that training LLMs requires substantial computational resources, but given that this method is designed for LLMs, the authors should have conducted more detailed experiments on a larger number of datasets, different model architectures, and sizes.
  • The improvements over the baselines, especially RLHF, seem somewhat small, particularly for GPT-J 6B. It is unclear what actual benefits the proposed method offers, as the metrics used do not clearly demonstrate improved text generation capabilities (toxicity scores? human evaluation?). Furthermore, the experiments with GPT-J 6B should have been placed in the main paper instead of the appendix.
  • Overall, the contribution seems limited, since this method appears to be an extension of Greenberg et al. [2022] for LLM fine-tuning. The authors should clearly describe the differences between their proposed method and that of Greenberg et al. [2022].

Questions

  • What is the computational efficiency of the proposed method compared to the baselines?
  • How does the proposed method perform against methods such as DExperts (Liu et al., 2021) and GeDi (Krause et al., 2020)?
  • Have you considered evaluating output diversity, e.g., the mean number of distinct n-grams, normalized by the length of text (Li et al., 2016)?

Limitations

Yes

Author Response

Thank you for your insightful comments and questions.

C1. The experimental results are limited....

A2C1. In our work, we studied the inclusion of safety in LLMs using three datasets - IMDB, Jigsaw and RealToxicityPrompts - over GPT-2 (117M parameters) and GPT-J (6B parameters). Most of the related works in the LLM safety space work with the IMDB and RealToxicityPrompts datasets with models ranging up to GPT-2 L (762M parameters), as used in DExperts and Quark. RECT, another related work, studies only the RealToxicityPrompts dataset with GPT-2, GPT-2 XL (1.5B) and GPT-3 (175B). In light of this information, we believe our experiments are comprehensive in terms of the number of models and datasets studied.

C2. The improvements over the baselines, especially RLHF, seem somewhat small...

A2C2. Metric. For the experiments with toxicity datasets, we use the toxicity score returned by unitary/toxic-bert LLM as our metric. The LLM unitary/toxic-bert is trained to classify toxic comments on 3 Jigsaw challenges: toxic comment classification, unintended bias in toxic comments, and multilingual toxic comment classification. Thus, the results included in the paper are indeed toxicity scores. For the positive sentiment generation task using IMDB dataset, we use the sentiment score returned by lvwerra/distilbert-imdb LLM. The LLM lvwerra/distilbert-imdb is trained to score positive sentiment present in a given piece of text, thus, making the returned score an appropriate metric for the task.

GPT-J. We could not add the results for GPT-J in the main paper because of the space constraints. We can add these in the main body in the final version of the paper. The results reported in our work, especially on the tail prompts, across datasets and LLM types are amongst the largest margin improvements reported in the related work (DExperts [1], Quark [2]).

C3. Overall, contribution seems limited...

A2C3. Our work is the first to introduce a nuanced understanding of risk in the context of LLM content generation, going beyond Greenberg et al. [2022]'s work. Greenberg et al. [2022] proposed soft risk scheduling to make policies learned using policy gradients risk-averse. Presented below are our contributions:

  1. We implement CVaR in conjunction with a regularized reinforcement learning objective (reward + KL term). Greenberg et al. [2022] work only with the plain reward. We choose to work with the regularized reward for two reasons: I. We want to measure risk in generations accounting for both the performance on the actual environment reward and the quality of language generation measured by the KL-Divergence with respect to the reference policy. II. This choice makes our proposed algorithm backward compatible with existing RLHF implementations.

  2. We implement CVaR in the Actor-Critic setting, as opposed to policy gradients, with an aim to learn a complex parameterized policy (LLM) with an extremely large action space.

  3. Beyond the immediate application of creating safer LLMs, this work contributes to the broader field of machine learning by demonstrating how risk measures like CVaR can be integrated into the training process of complex models like LLMs. Our work additionally establishes a groundwork for exploring additional risk measures and criteria, such as the Entropic Value at Risk (EVaR), in the context of LLM safety and uncertainty quantification.

Q1. What is the computational efficiency of the proposed method compared to the baselines?

A1. Our algorithm has the same computational complexity as that of RLHF during the first n iterations. Once the soft risk scheduling kicks in, our algorithm introduces an additional best case computational complexity of $O(B + \alpha'\log(B))$, where $B$ is the batch size and $\alpha'$ is the risk level that decreases across iterations. The space complexity remains the same as that of RLHF.

Q2. How does the proposed method perform against methods such as DExperts (Liu et al., 2021)...

A2. We have now included a comparison with three new baselines (DExperts, Quark, and Prompted Base Model) in the attached author rebuttal pdf. We did not include results on GeDi (Krause et al., 2020) as both DExperts and GeDi are decoding based methods, and DExperts reported better results over GeDi. However, we will be sure to include GeDi in the related work section (Sec. 2).

Q3. Have you considered evaluating output diversity..?

A3. Thank you for the suggestion. We have now included results over output diversity in the attached author rebuttal pdf.

References

[1] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online. Association for Computational Linguistics.

[2] Lu, Ximing, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. "Quark: Controllable text generation with reinforced unlearning." Advances in neural information processing systems 35 (2022): 27591-27609.

Comment

Dear Reviewer,

Please let us know if our rebuttal answered your questions. Please let us know if you have any further concerns.

Thank you.

Comment

Dear Authors,

Thank you for your detailed response and for the additional experiments.

I have increased my score by +2, but I still believe that the paper will benefit from additional experiments with models larger than GPT-2. The experimental setup with GPT-2 doesn't really make sense to me, e.g., the in-context prompting capabilities of this particular model are very limited.

Related work not discussed:

  • Provably Efficient Iterated CVaR Reinforcement Learning with Function Approximation and Human Feedback

Official Review
Rating: 6

The paper presents a new way to reduce the generation of toxic content by large language models. The authors introduce a method that integrates risk-averse principles into the fine-tuning process, focusing on minimizing harmful outputs using Conditional Value at Risk (CVaR) as the risk measure. The goal of this approach is to address uncommon but significant toxic events. By using risk-averse reinforcement learning with human feedback (RA-RLHF), the authors train LLMs to avoid generating negative content while still being effective in other language generation tasks. The results from sentiment modification and toxicity mitigation tasks show that the approach reduces toxic outputs and seems to promote a safer generation.

Strengths

  • The idea of applying risk-averse principles to fine-tune LLMs is rather new and the results seem to show that RA-RLHF is effective at reducing toxic outputs.
  • The paper is well written and the methodology properly explained (even though the limitations and discussion sections ended up being in the appendix)
  • I appreciated the use of the Proximal Policy Optimization (PPO) within the RA-RLHF framework, I think it could be an approach that can be integrated into existing workflows.

Weaknesses

  • The paper mentions a slight increase in model perplexity with the RA-RLHF method. This doesn’t really hurt overall performance, but it does suggest the model might be making some more drastic changes, which could affect the smoothness and clarity of the text it generates in certain situations.
  • The paper does not discuss how the model interacts with users in dynamic environments.
  • As the authors point out in the limitations section, the evaluations are carried out only on specific tasks (IMDB-Gen and Jigsaw-Gen), and so it's unclear how well the approach works in other tasks.

Questions

  • What steps can be taken to ensure the risk-averse fine-tuning process doesn't inadvertently reinforce existing biases in the training data? How can the model be evaluated for potential biases?
  • What is the long-term impact of risk-averse fine-tuning on model behavior and performance? How stable are the improvements over time?

Limitations

  • The long-term impact of the risk-averse fine-tuning on model behavior and performance over time is not addressed, leaving questions about the stability of the improvements.
  • The paper doesn't deeply address the potential for introducing or reinforcing biases during the fine-tuning process.

Author Response

Thank you for your insightful comments and questions.

Q1. What steps can be taken to ensure the risk-averse fine-tuning process doesn't inadvertently reinforce existing biases in the training data? How can the model be evaluated for potential biases?

A1. Since we work in the regime of RLHF based LLM finetuning, as long as the reward model used in RA-RLHF penalizes bias, our learned models will be unbiased. This is indeed the case for our models trained to generate non-toxic text as we use unitary/toxic-bert as the reward model. unitary/toxic-bert is trained to classify toxic comments on 3 Jigsaw challenges: toxic comment classification, unintended bias in toxic comments, and multilingual toxic comment classification.

Q2. What is the long-term impact of risk-averse fine-tuning on model behavior and performance? How stable are the improvements over time?

A2. In the scenario that the input prompt distribution shifts from the distribution our RA-RLHF models are finetuned on, our models can be further trained on the new prompt data using our RA-RLHF algorithm. As noted in our submission (Sec. 5.1 and 5.2), RA-RLHF ensures safe generations without hampering the quality of language generation. We believe this behavior would hold true under retraining for input data distribution shifts.

Please let us know if we understood your questions correctly. We are happy to provide further clarification if needed.

Comment

I appreciate the way you addressed my concerns.

Author Response

With this work, our goal was to introduce a nuanced understanding of "risk" in the context of LLM content generation to induce safety/non-toxicity in LLM generations. We achieved this by introducing a risk-averse strategy for LLM finetuning, focusing on optimizing Conditional Value at Risk (CVaR), which represents a significant contribution to enhancing the safety and ethical considerations of LLM deployment.

We are pleased to know that reviewers appreciated our efforts, and we extend our gratitude to the reviewers for their insightful comments, questions and suggestions. It is encouraging to note that the reviewers found

  1. the paper addressing a critical need for safer LLMs (4ixc, CMzE, xjv9),
  2. the proposed solution of applying risk-averse principles to fine-tune LLMs new (R4NK),
  3. our paper well written (R4NK, CMzE),
  4. the methodology clearly explained (R4NK, CMzE, 4ixc, pvQg), and
  5. that the results demonstrate the effectiveness of our proposed approach (R4NK, CMzE, 4ixc, xjv9, pvQg).

Following the reviewers' suggestions, we have included additional evaluation results in the attached pdf, demonstrating the superior performance of our proposed RA-RLHF algorithm against three new baselines, along with an additional diversity evaluation metric.

References for the attached pdf:

[1] Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A. Smith, and Yejin Choi. 2021. DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online. Association for Computational Linguistics.

[2] Lu, Ximing, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. "Quark: Controllable text generation with reinforced unlearning." Advances in neural information processing systems 35 (2022): 27591-27609.

[3] Cao, Meng, Mehdi Fatemi, Jackie Chi Kit Cheung, and Samira Shabanian. "Systematic rectification of language models via dead-end analysis." arXiv preprint arXiv:2302.14003 (2023).

Final Decision

This paper proposes an application of Conditional Value at Risk (CVaR) within an RLHF framework for mitigating the generation of toxic content by LMs through the integration of risk-averse principles into the fine-tuning process.

The paper is borderline, with mixed reviewer confidence. Reviewers acknowledge that the application of CVaR to mitigate toxic content extends existing RLHF methods by incorporating risk measures. However, there are mixed opinions regarding the extent of its contributions, particularly in comparison to existing methods, and the thoroughness of the experimental evaluation. The paper received the following reviews: (6, confidence 4), (6, confidence 3), (4, confidence 4), (8, confidence 2), (5, confidence 3).

Several reviewers noted that the experiments were somewhat limited, primarily focusing on a few datasets and model architectures (e.g., GPT-2 and GPT-J). There is concern that the method may not generalise well across different tasks or larger, more complex models. However, upon review, the paper's results appear to be consistent across scale, the additional analysis seems to make up for the limited variety of models, and these empirical results complement the paper's theoretical contribution by adapting the risk-averse RL framework to this setting.

Some reviewers questioned the degree of novelty, noting that the method appears to be an extension of previous work (e.g., Greenberg et al. 2022) with limited incremental innovation. Additionally, comparisons with other state-of-the-art methods like DExperts, GeDi, and simpler prompt-based approaches were not included in the initial submission. These were addressed in the rebuttal and supplemental materials the authors provided after submission.

During the rebuttal, the authors provided additional experiments comparing their method to DExperts, Quark, and other baselines, and clarified that the GPT-J results were omitted from the main text due to space constraints. The authors addressed concerns about the computational efficiency and noted that their method is comparable to RLHF in complexity, with additional overhead introduced only in later stages of training. The authors clarified their contributions, emphasising the integration of CVaR with regularised rewards and its implementation in the Actor-Critic framework as key innovations over previous work.