One-Shot Safety Alignment for Large Language Models via Optimal Dualization
Abstract
Reviews and Discussion
The paper studies the safety alignment of language models using constrained Reinforcement Learning from Human Feedback (RLHF). The main contribution of the paper is deriving a closed-form solution of the dual function of a constrained RLHF problem. This closed-form solution reduces solving a constrained RLHF problem to an unconstrained RLHF problem.
Strengths
This is a superbly written paper studying a highly important problem: the safety alignment of language models. The main contribution is a novel closed-form solution of the dual function, which allows for a new two-step algorithm for constrained RLHF. This presents a significant improvement over earlier algorithms for constrained RLHF as it (probably) greatly reduces the computational burden.
Weaknesses
No Details on Compute Given
The main weakness of the paper is that compute resources and computation times are not stated in the paper (I searched extensively but couldn't find it; sorry if I missed it). This makes the sentence from the abstract: "[...], thus greatly reducing the computational burden [...]" essentially unjustified. It makes perfect sense to me that the two-stage approach in the paper is significantly cheaper than performing gradient ascent/descent or other constrained optimisation algorithms. However, I could also imagine that the situation might be different in practice. For example, I could imagine that:
- Starting with a larger value of $\lambda$ than $\lambda^*$ and reducing $\lambda$ towards $\lambda^*$ during optimisation might accelerate convergence to the feasible set.
- Or the opposite: starting with a smaller value of $\lambda$ than $\lambda^*$ and increasing $\lambda$ towards $\lambda^*$ during optimisation might lead to better optimisation conditions initially, thereby accelerating convergence.

Both could occur when applying gradient descent/ascent. Therefore, to justify the claims in the abstract, the paper at least needs to provide details on the computation time of their experiments. However, to make the claim well-founded, the paper should actually provide a comparison of the computation times with the baseline approach and should discuss computation times in the main part of the paper.
Experiment Evaluation
The model-based evaluation of the experiments makes sense from an optimisation point of view. However, it does not say so much about the quality of the obtained language models since overoptimisation [17] might have occurred, for example. The GPT evaluation helps somewhat in this regard but could still be flawed. I understand that this is difficult to address since a statistically significant human evaluation would be expensive.
Limitations
The discussion of social impact is very brief; see the "Limitations" section of this review.
Questions
Experiments
- What is the computational budget of your experiments (see "Weaknesses")?
- Line 295: How was the grid of safety margins chosen?
Code Availability
- I would like to have a look at the code during the rebuttal, as offered in the answer to question 5 of the NeurIPS paper checklist.
- Are there any plans for publishing the code?
Appendix N: Sample Responses
I would be interested in the helpfulness and safety scores of the sample responses. Additionally, I would be interested in an interpretation of the sample outputs of the PeCAN-aligned language model. For example, the answer in Table 6 is partially nonsensical and unrelated, and the answer in Table 8 contains typos and grammar errors ("I don against advise"). Do these answers also lose in helpfulness against the baseline, or is overoptimisation of the helpfulness model perhaps taking place?
Typos and Other Minor Suggestions
- Line 100: preference-based safety alignment?
- Line 181: I assume this accuracy notion is from the literature, since line 187 states that [11] proves something using this accuracy notion. I would suggest using a different wording than "we introduce" if this is the case. For example, "To quantify the level of estimation error, we consider the accuracy notion of Chang et al. [11]", or something along these lines.
- Line 221: with an existing dataset
- Line 229: If the size
- Line 409: Conference name capitalised as in line 418?
- Line 242: conference name also capitalized?
Limitations
The paper sufficiently discusses limitations.
In my opinion, the discussion of social impact is too brief. As far as I can see, it is limited to this sentence: "Our methods can benefit researchers in building safer language models." However, what is proposed in the paper is really a general-purpose method for solving constrained RLHF problems. I think it should be acknowledged that there are also harmful dual-use applications for this general-purpose tool (as there are for most other general-purpose tools, e.g. [A]). Examples include minimising safety while maintaining a level of helpfulness, which could be relevant to malignant online communities, such as cyberbullying communities or troll networks.
[A]: Fabio Urbina, Filippa Lentzos, Cédric Invernizzi, Sean Ekins: Dual use of artificial-intelligence-powered drug discovery. Nat. Mach. Intell. 4(3): 189-191 (2022)
Post-Rebuttal: Since the authors have addressed my concerns, I raise my score to 8 (Strong Accept).
We thank the reviewer for the positive evaluation and the valuable feedback. We have answered all questions to the best of our ability. We are glad to address any further questions you might have.
1. Computing resources and running time.
Our experiments are conducted on a single 48G NVIDIA A6000 GPU, taking about 15 hours to align each model. For comparison, constrained RLHF [R1] reports 20 hours for each run on a more powerful NVIDIA A100 GPU. The computing resources and running time are not reported for Safe-RLHF [12]. However, since Safe-RLHF, like constrained RLHF, uses PPO for policy updates but on a much larger dataset, we expect its running time to be roughly 2-3x that of constrained RLHF (i.e., 40-60 hours). Constrained DPO [22] uses a single 40G NVIDIA A100 GPU without reporting the running time. Therefore, our methods reduce running time by at least 25% compared to the aforementioned methods while using a much cheaper GPU. We will include this discussion in future revisions.
2. Monotonically tuning the dual value.
Although intuitively tuning the dual variable simulates gradient descent/ascent updates to some extent, its efficacy highly depends on the unknown optimization landscape. To our understanding, the suggested strategy implicitly involves three undefined factors: (i) the optimal dual variable $\lambda^*$; (ii) the sign of the constraint slack (satisfaction or violation); and (iii) the annealing step size. Adjusting the dual variable depends on whether the primal variable (i.e., the parameters of the LM) violates the constraint. However, since the LLM policy is updated via sophisticated optimizers (e.g., DPO), there is no guarantee that the updated policy consistently satisfies (or violates) the constraint. Hence, gradually reducing or increasing the dual variable, even starting from a sufficiently large or small initial value, may not converge to the optimum.
Moreover, small step sizes slow down the convergence speed of the dual variable, while large step sizes often result in severe oscillation of policy iterates. Therefore, iterative primal-dual algorithms often suffer from high computational burden and instability issues, as reported in Figure D.2 and the conclusion section of [R1], and Figure 2 of [22].
We would also like to note that the heuristic tuning strategy does not scale well to multi-constrained cases as the search space grows exponentially. In contrast, our method provides a principled and guaranteed methodology for multi-constraint cases.
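For concreteness, below is a minimal, hypothetical sketch of the monotone dual-tuning loop discussed above (Python pseudocode; the function names and update rule are illustrative, not our implementation):

```python
def heuristic_dual_tuning(align_lm, eval_constraint, lam_init=5.0,
                          step=0.5, rounds=20):
    """Illustrative monotone dual-tuning loop (NOT our method).

    align_lm(lam)       -> policy from solving the unconstrained problem with
                           the combined reward r + lam * g (a full fine-tuning run).
    eval_constraint(pi) -> Monte-Carlo estimate of the constraint slack
                           E[g(x, y)] - b, requiring on-policy generation.
    """
    lam, policy = lam_init, None
    for _ in range(rounds):
        policy = align_lm(lam)            # expensive inner problem each round
        slack = eval_constraint(policy)   # noisy on-policy estimate
        if slack >= 0:                    # constraint satisfied:
            lam = max(0.0, lam - step)    #   decrease the dual variable
        else:                             # constraint violated:
            lam = lam + step              #   increase the dual variable
    return lam, policy
```

Each round requires a full fine-tuning run plus on-policy generation to estimate the slack, and noisy slack estimates combined with a fixed step size can produce the oscillation described above; in contrast, our approach determines the dual variable once from offline data.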
3. Insufficiency of model-based evaluation.
We agree on the limitations of model-based evaluation, which is why we include the GPT-based evaluation. While human evaluation would be ideal, it is costly and can be subject to biases, as you mentioned. There seems to be no gold-standard evaluation for language models. We believe that human and AI evaluations should be complementary [R2].
4. Choices of safety margins
We chose the set of safety margins to achieve a diverse range of safety improvement levels. These margins were made primarily to better visualize the trend of safety improvement versus the dual variable in Figure 2 (left).
5. Code Availability
According to the rebuttal policy, we have shared our source code with the AC in an anonymized link. Please feel free to contact the AC for access to our code. We plan to officially release our code to the public after cleaning it up and adding detailed instructions.
6. Details on sample Responses
Thank you for your great interest in our experiments. We provide the GPT-evaluated safety scores for the sample outputs to the malicious red-teaming prompts on pages 29-31 of the submitted manuscript below. We observe noticeable improvements brought by MoCAN- and PeCAN-alignment. However, we would like to remark that these sample prompts were handcrafted in the literature [12] mainly to test safety performance. The helpfulness scores evaluated on these prompts may not fully reflect the ground-truth helpfulness of each LM.
Table 1. GPT-evaluated safety levels for sample responses in the submitted manuscript.
| Model | SFT | Safe-RLHF | DPO | DPO | MoCAN | PeCAN |
|---|---|---|---|---|---|---|
| Table 5 | 5 | 8 | 9 | 8 | 7 | 8 |
| Table 6 | 0 | 10 | 0 | 10 | 10 | 10 |
| Table 7 | 2 | 9 | 4 | 10 | 10 | 9 |
| Table 8 | 0 | 10 | 10 | 10 | 10 | 10 |
Table 2. GPT-evaluated helpfulness levels for sample responses in the submitted manuscript.
| Model | SFT | Safe-RLHF | DPO | DPO | MoCAN | PeCAN |
|---|---|---|---|---|---|---|
| Table 5 | 6 | 8 | 8 | 8 | 8 | 7 |
| Table 6 | 1 | 10 | 1 | 10 | 10 | 10 |
| Table 7 | 2 | 7 | 7 | 10 | 9 | 9 |
| Table 8 | 1 | 10 | 9 | 10 | 9 | 9 |
7. Typos and Other Minor Suggestions
Thank you for your careful reading of our paper and for catching typos. We will fix these typos and double-check the paper's writing in revisions.
8. Social impact.
Thank you for bringing our attention to the broader scope of social impact and for pointing out an excellent reference on the dual use of AI. We will acknowledge the dual use of constrained alignment methods and remark on potential applications that could negatively impact society, such as dialogue systems with gender biases [R3].
References
[R1] Confronting Reward Model Overoptimization with Constrained RLHF
[R2] Complementarity in Human-AI Collaboration: Concept, Sources, and Evidence
[R3] GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models
Thank you for your answer. My questions and criticism are fully addressed except regarding code availability/reproducibility. Thank you for providing the code. Could you point me to the method/lines where step 4 of MoCAN takes place? I think I understand where step 5 takes place, but I haven't found step 4 so far.
Additional remark: the setup procedure described in the README is incomplete: I also had to run `pip install -r requirements.txt` after creating a conda environment using `conda env create --file conda-recipy.yaml`.
First, we are glad that most questions and criticism have been addressed. We are also grateful to reviewer TvNM for carefully checking our implementation and identifying the missed step in the setup process. We will incorporate this step into our README file for the released version.
Regarding step 4 of MoCAN, please refer to OneShot/safe_rlhf/trainers/model_based_dual_trainer.ipynb.
Should you have any further questions or need clarification regarding our implementation, please feel free to reach out. We are more than happy to provide additional information or address any concerns.
Thank you for pointing me to the file where step 4 of MoCAN is implemented. This is a very strong paper. Since my concerns regarding the computing resources have been addressed and the provided code offers good reproducibility, I raise my score to Strong Accept.
Thank you so much for taking the time to check the code and for kindly raising your rating in support of our paper. We really appreciate the valuable feedback you shared regarding the details of the experiments and the social impacts where we can improve. Thank you again!
This paper introduces a novel approach to aligning large language models (LLMs) with safety constraints using a dualization perspective. The key contributions are:
- A method to reduce constrained alignment to an equivalent unconstrained problem by pre-optimizing a dual function.
- Two practical algorithms, MOCAN and PECAN, for model-based and preference-based scenarios respectively.
- A theoretical analysis of the dual function's properties and the stability of the approach.
- Extensive experiments demonstrating the effectiveness of the proposed methods in improving both helpfulness and safety of LLMs.
Strengths
- Novel theoretical approach to constrained LLM alignment with strong mathematical foundations
- Practical algorithms that reduce computational burden compared to iterative primal-dual methods
- Comprehensive experimental evaluation across multiple tasks and baselines
- Flexibility to work with both model-based and preference-based scenarios
- Theoretical guarantees on the stability and effectiveness of the approach
Weaknesses
- Limited to single safety constraint in experiments due to dataset limitations
- Assumes Bradley-Terry preference setup, which may not always hold in practice
- Potential sensitivity to the quality of pre-trained reward and safety models in MOCAN
- PECAN slightly underperforms MOCAN, suggesting room for improvement in preference-based scenarios
- Limited discussion of potential negative effects or failure cases
Questions
- How does the computational complexity of MOCAN and PECAN compare to iterative primal-dual methods?
- Have you explored the effectiveness of your approach with multiple simultaneous safety constraints?
- How sensitive are MOCAN and PECAN to the quality of pre-trained reward and safety models?
- Could you provide more insight into why PECAN slightly underperforms MOCAN?
- Have you considered extending the approach to more general preference models beyond Bradley-Terry?
- How does the performance of your methods scale with larger language models (e.g., 13B or 70B parameters)?
- Have you explored the potential for using your dualization approach in other constrained optimization problems in machine learning?
- How robust is the method to potential adversarial attacks or attempts to circumvent the safety constraints?
Limitations
The authors have addressed some limitations, but there is room for improvement:
- Addressed:
- Experiments limited to single safety constraint due to dataset availability
- Assumption of Bradley-Terry preference setup
- Slight underperformance of PECAN compared to MOCAN
- Could be better addressed:
- Scalability to multiple simultaneous safety constraints
- Sensitivity to quality of pre-trained reward and safety models
- Robustness to potential adversarial attacks or constraint circumvention
- Missing:
- Discussion of potential biases introduced by the method
- Analysis of computational requirements compared to baseline methods
- Exploration of failure cases or scenarios where the method might not perform well
- Consideration of privacy implications when using human preference data
Suggestions for improvement:
- Conduct experiments with multiple simultaneous safety constraints, if possible
- Explore the method's effectiveness with more general preference models
- Provide a more detailed analysis of computational requirements and scalability
- Investigate potential failure cases and limitations of the approach
- Discuss potential biases and privacy implications of the method
We thank the reviewer for the very positive evaluation and the valuable feedback. We have answered all questions to the best of our ability, and we are glad to address any further questions you might have.
1. Experiments are limited to a single safety constraint.
As stated in the limitations section, our experiments use a single safety constraint due to the lack of suitable datasets. We are eager to test our method on multi-constraint safety datasets once they become available. To the best of our knowledge, all existing works in safety alignment, such as [12, 22, 35], involve only one constraint, even in theory. In contrast, our algorithms and theoretical guarantees already cover multiple constraints.
2. Restriction of the Bradley-Terry preference setup.
We agree that the Bradley-Terry model may not reflect all ground-truth human preferences. However, we want to remark that our two-stage strategy is orthogonal to the preference setup. In fact, our two presented algorithms readily adapt to more general preference setups by generalizing the DPO optimization backbone to the more generic $\Psi$PO (see Equation (6) of [R1]). Specifically, we can consider the preference terms $\Psi(p_h(y \succ y' \mid x))$ and $\Psi(p_s(y \succ y' \mid x))$ with $y' \sim \mu(\cdot \mid x)$, where $p_h$ is the preference for helpfulness, $p_s$ is the preference for safety, $\Psi$ is a preference-based utility function, and $\mu$ is an underlying behavior policy. When $\Psi$ is the logit function $\Psi(q) = \log\frac{q}{1-q}$, it recovers the standard Bradley-Terry preference setup (e.g., Proposition 1 of [R1]). We have remarked on the general preference setup in the conclusion section and left the extension for future work.
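For concreteness, a schematic form of this generalization is sketched below (notation loosely follows [R1]; combining the two preference terms additively through a dual weight $\lambda$ is an illustrative assumption rather than the exact objective in our paper):

```latex
% Schematic Psi-PO-style constrained objective (illustrative only)
\max_{\pi}\;
\mathbb{E}_{x\sim\rho,\; y\sim\pi(\cdot\mid x),\; y'\sim\mu(\cdot\mid x)}
\Big[\Psi\big(p_h(y \succ y' \mid x)\big)
   + \lambda\,\Psi\big(p_s(y \succ y' \mid x)\big)\Big]
\;-\;\beta\,\mathrm{D}_{\mathrm{KL}}\!\big(\pi \,\|\, \pi_{\mathrm{ref}}\big),
\qquad
\Psi(q)=\log\tfrac{q}{1-q}\;\Rightarrow\;\text{Bradley--Terry case}.
```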
3. Potential sensitivity to pre-trained reward and safety models.
In practice, pre-training reward and safety models can only be estimated up to some error, as characterized by Definition 1. Theorem 3 shows that our strategy can find a nearly optimal LLM policy up to some estimation errors. Therefore, in theory, MoCAN enjoys stability against perturbations in reward/safety models.
Experimentally, we supplemented new experiments of MoCAN using the beaver-7b-v3.0-reward/cost models (see Table 3 in the PDF attached to the global response). It is observed that MoCAN-aligned LMs attain reasonably good performance under the new reward/cost models.
4. PECAN slightly underperforms MOCAN.
Please refer to the first point in our global response.
5. Potential negative effects or failure cases.
Potential negative effects can arise from misusing our constrained alignment method in applications that harm society, such as promoting gender biases [R2]. Our method assumes IID preference datasets, and potential failures can be caused by out-of-distribution data [R3]. We will discuss potential negative effects and failure cases in future versions.
6. The comparison of computational complexity.
Our alignment methods instead solve a one-shot unconstrained problem, while iterative primal-dual algorithms [12, R4, 22] must solve an unconstrained optimization problem for each round of dual updates. Moreover, these algorithms need to generate a large batch of on-policy responses to evaluate the update of the dual variable, which is computationally expensive.
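To illustrate the one-shot workflow, the sketch below optimizes a one-dimensional dual objective over offline reward/safety scores before a single fine-tuning run. The closed form used here is the standard dual of a KL-regularized objective and is meant only as an illustration; the exact dual function derived in the paper may differ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dual_value(lam, rewards, safety, beta, margin):
    """Schematic dual objective for a KL-regularized constrained problem.

    rewards, safety: arrays of shape (num_prompts, num_responses) holding
    r(x, y) and g(x, y) scored on offline responses sampled from pi_ref.
    Illustrative only; the exact dual derived in the paper may differ.
    """
    z = (rewards + lam * safety) / beta
    z_max = z.max(axis=1, keepdims=True)
    # per-prompt log E_{y ~ pi_ref}[exp((r + lam * g) / beta)], computed stably
    log_partition = np.log(np.mean(np.exp(z - z_max), axis=1)) + z_max[:, 0]
    return beta * np.mean(log_partition) - lam * margin

def solve_dual(rewards, safety, beta=0.1, margin=0.0, lam_max=50.0):
    # One-dimensional convex problem: minimize the dual over lam in [0, lam_max].
    res = minimize_scalar(
        lambda lam: dual_value(lam, rewards, safety, beta, margin),
        bounds=(0.0, lam_max), method="bounded")
    return res.x  # lam*; then run a single DPO-style fine-tune with r + lam* * g
```

Once the scalar dual variable is fixed this way, only one unconstrained alignment run is needed, without any on-policy generation during the dual search.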
In practice, our experiments are conducted on a single 48G NVIDIA A6000 GPU, taking about 15 hours to align each model. For comparison, constrained RLHF [R4] reports 20 hours for each run on a more powerful NVIDIA A100 GPU. The computing resources and running time are not reported for Safe-RLHF [12]. However, since Safe-RLHF, like constrained RLHF, uses PPO for policy updates but on a much larger dataset, we expect its running time to be roughly 2-3x that of constrained RLHF (i.e., 40-60 hours). Constrained DPO [22] uses a single 40G NVIDIA A100 GPU without reporting the running time. Therefore, our methods reduce running time by at least 25% compared to the aforementioned methods while using a much cheaper GPU. We will include this discussion in future revisions.
7. Scaling with larger language models.
Due to resource and compute limits, we cannot experiment with our alignment methods on larger language models. To our knowledge, 7B models are commonly used to evaluate generation performance in recent NeurIPS papers (e.g., [R5, R6]). We believe it is acceptable to experiment with 7B models for fair comparison.
8. Exploration in other constrained ML problems.
Our dualization approach also applies to MinMax-RLHF [10] and alignment with f-divergence [R7]. Please check Appendix I and Appendix A for more details.
9. Resistance to adversarial attacks.
We thank the reviewer for highlighting this interesting direction. We are not aware of any common attackers that specifically target safety constraints. However, one could naturally formulate the improvement of adversarial robustness as additional constraints. Please refer to the second point in our global response.
10. Suggestions for improvement.
Thank you for carefully reviewing our paper and suggesting several important points for improvement. We will discuss them point-by-point in revisions.
References
[R1] A General Theoretical Paradigm to Understand Learning from Human Preferences. Azar et al., 2023.
[R2] GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models. Zhang et al., 2024.
[R3] Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization. Hassan et al., 2023.
[R4] Confronting Reward Model Overoptimization with Constrained RLHF. Moskovitz et al., 2023.
[R5] Training language models to follow instructions with human feedback. Ouyang et al., 2022.
[R6] RRHF: Rank Responses to Align Language Models with Human Feedback. Yuan et al., 2023.
[R7] Aligning Language Models with Preferences through f-divergence Minimization. Go et al., 2023.
I thank the authors for their detailed response. I have now a much clearer idea of my concerns and I would like to keep my score. I look forward to seeing all of the changes in the next version.
This paper aims to address the issue that Lagrangian-based primal-dual policy optimization methods for constrained RLHF are computationally expensive and often unstable. The authors pre-optimize a smooth and convex dual function in closed form, eliminating the need for cumbersome primal-dual policy iterations and thereby enhancing stability. Experiments show that the proposed approach yields a good trade-off between helpfulness and safety improvements, and the proposed method can achieve better safety performance than the existing DPO algorithm.
Strengths
- The paper is logically clear and easy to follow.
- The motivation of the paper is clear.
- Very detailed mathematical formulations.
- The proposed method is promising for enhancing a model's safety and helpfulness.
Weaknesses
- The existing benchmark Alpaca-eval mainly assesses the model's instruction-following capability. To confirm that this method does not undermine the model's overall performance, additional benchmarks such as MMLU and TruthfulQA should be considered.
- Additional testing on benchmarks such as AdvBench or the PKU-SafeRLHF evaluation set is suggested to evaluate the safety of the model's open-ended generation.
Questions
Please see the weaknesses above.
Limitations
YES
We thank the reviewer for the positive evaluation and the valuable feedback. We have answered all questions to the best of our ability. We are glad to address any further questions you might have.
Additional benchmarks.
We have conducted additional experiments on the benchmark datasets TruthfulQA and AdvBench. Please refer to Tables 1 and 2 in the PDF attached to the global response for detailed results. These tables show that our aligned models achieve higher safety scores with increased values of the dual variable, even when evaluated on these out-of-distribution prompts.
Dear Reviewer EtW9,
We are grateful for your insightful comments and feedback. In our detailed responses, we have carefully addressed each of your concerns. Given the impending rebuttal deadline, we kindly request that you review our responses to ensure that all issues have been adequately resolved. If further clarification or elaboration is required, we would be happy to provide additional explanations. Thank you for your time and consideration.
Best regards,
The authors of Submission 13052
I would like to thank the authors for addressing my concerns, and I have raised my rating from "weak accept" to "accept".
Thank you so much for taking the time to review our paper and for kindly raising your rating in support of our paper. We appreciate the valuable feedback you shared. Thank you again!
The paper proposes a novel dualization-based method to convert a constrained alignment of a LLM to an unconstrained alignment. The proposed two-stage policy learning method, CAN, eliminates the need for cumbersome primal-dual iterations with theoretical analysis. Based on CAN, two practical algorithms, MOCAN and PECAN, are compatible with cases when reward/safety models are known and when they are unknown, e.g., human preference, respectively. The paper also presents strong empirical results that support the claims.
Strengths
- By and large, the paper is written well. I especially appreciated the discussion comparing with existing works and the demonstration of the importance of the reduction to unconstrained alignment, which comes with a stability analysis.
- The idea of using a dualization perspective to build a two-stage approach that reduces constrained LM alignment to unconstrained LM alignment is novel and useful.
- The section on the two practical algorithms based on CAN is well written, and the two proposed algorithms cover the cases where reward and safety models are known or unknown.
- The observation on the influence of offline data with respect to the difference between the number of prompts and responses is also very helpful.
- The theoretical results appear to be correct.
Weaknesses
- It would have been interesting to see how this method could be combined with LM adversarial attack techniques to improve model robustness.
- Some additional commentary on how to improve the pre-alignment mentioned in the MOCAN vs PECAN comparison would have been very helpful.
Questions
- Could you comment on how LM adversarial attack techniques can be used with CAN to improve model robustness?
- Could you comment on directions that may help with pre-alignment to improve the PECAN performance?
Limitations
Yes, the author sufficiently addressed the limitations.
We thank the reviewer for the very positive evaluation and the valuable feedback. We have answered all questions as best we could, and we are glad to address any further questions you might have.
1. Incorporating adversarial robustness.
Please refer to the second point in our global response.
2. Ways to improve pre-alignment.
There are several ways to improve pre-alignment and, consequently, PeCAN's performance. As clarified in the global response, models can always be pre-aligned over larger datasets or using stronger reference LMs. With the same experimental resources, properly regularizing the pre-alignment process, as done in the training of reward/cost models in reference [12] (see Equations (13) and (14)), could enhance the reliability of the pre-aligned pseudo-preference labelers. Additionally, integrating more safeguard measures into the vanilla log-probability computation, as demonstrated in our supplementary experiments, could significantly improve the quality of pseudo-preference labeling and PeCAN's overall performance.
This paper proposes a new constrained optimization formulation to ensure the safety of language models (LMs). Based on the proposed constrained optimization problem, the authors derive the dual formulation and provide a closed-form solution to it. Through theoretical derivations, they show that the dual function has four desirable properties, including smoothness and local strong convexity in the dual variable. The authors then optimize the Lagrange multiplier and update the LM using the optimized multiplier. Based on these theoretical derivations, the authors propose two practical algorithms: Model-based Constrained Alignment via dualizatioN (MoCAN) and Preference-based Constrained Alignment via dualizatioN (PeCAN). First, MoCAN assumes that the reward and safety models are given. It then estimates the reward and safety values for each offline data point, optimizes the dual variable using the offline data and the estimated values, and finally updates the LM with pseudo-preferences constructed from the resulting combined reward. Second, PeCAN trains pre-aligned helpfulness-only and safety-only LMs. Using the pre-aligned LMs, PeCAN generates a dataset, optimizes the dual variable using the collected dataset, and updates the LM with pseudo-preferences constructed from an implicit reward derived from the pre-aligned LMs instead of explicit reward/safety models. In the experiments section, the authors demonstrate that the proposed algorithms successfully enhance the safety and helpfulness of LMs. Notably, the proposed algorithms provide Pareto frontiers in terms of helpfulness and safety win rates with respect to changes in the dual variable.
Strengths
The proposed algorithm is based on rigorous theoretical derivations.
Weaknesses
The experiments are not sufficient to fully support the authors’ claims.
Firstly, there is a notable gap between the empirical results of MoCAN and PeCAN. In practice, MoCAN utilizes reward and cost models pre-trained on a pre-collected dataset, while PeCAN relies on pre-aligned helpfulness-only and safety-only LMs. This means PeCAN replaces the pre-trained reward and cost models with pre-aligned LMs. However, MoCAN shows better performance than PeCAN. Therefore, contrary to the authors' claim, it seems that training reward and cost models is more effective than using pre-aligned LMs.
Secondly, based on the empirical results of MoCAN, changes in the dual variable do not show a clear trade-off between safety and helpfulness (as shown in the middle of Figure 3). In addition, it is hard to say that PeCAN outperforms the DPO baselines.
Questions
- Difference between MoCAN and PeCAN: In theory, MoCAN and PeCAN are not significantly different, as the reward and cost models can be replaced by implicit rewards derived from the pre-aligned LMs (i.e., their log-probability ratios with respect to the reference policy). Then, why are there such different empirical results between MoCAN and PeCAN?
- Using the closed-form solution to update the policy: Why not use the closed-form solution to update the policy? The authors have already derived the closed-form optimal policy. Then, it seems straightforward to update the policy to be proportional to the reference policy reweighted by the exponentiated combined reward. However, the authors adopt preference-based approaches, which are sparser than the value of the combined reward.
- Model-based evaluation of DPOs and RLHF: In Figure 3 (left), there are no empirical results for DPOs and RLHF. Are there any results for model-based evaluation?
Limitations
We thank the reviewer for the valuable feedback. We have addressed all your comments/questions to the best of our ability; see our detailed response below. We hope our response will encourage you to re-evaluate our paper. We will be happy to address any further questions you might have.
1. The empirical gap between MoCAN and PeCAN.
Please refer to the first point in our global response to all the reviewers above.
2. No clear trade-off in MoCAN.
We note that the trade-off in MoCAN is evident in the model-based evaluation in Figure 3 (left), which is consistent with our theoretical findings. The less clear trade-off in the pairwise win-rate comparison in Figure 3 (middle) is somewhat understandable, as the win-rate performance does not directly correspond to the associated model-based performance. The model-based evaluation and the MoCAN algorithm are associated with proxy reward/safety models, while the pairwise win-rate evaluation is more closely tied to the ground-truth reward/safety and the specific baseline LM (i.e., the reference policy). Additionally, the GPT-based evaluation uses handcrafted prompts from [12] for the safety study and prompts from the Alpaca-eval dataset [20] associated with the "helpful_base" category, following the practice in [22]. The model-based evaluation uses prompts from the SafeRLHF-30K test set. These prompts come from different sources, leading to expected distributional shifts. In particular, the GPT-based evaluation result can be viewed as out-of-distribution performance, and thus, it may differ from the in-distribution performance seen in model-based evaluation. We also note that in related literature, such as [35], the helpfulness-safety trade-off under pairwise win-rate evaluation is unclear.
3. Infeasibility of a straightforward policy update.
There seems to be a misunderstanding regarding the alignment of language models. Sampling responses from a fixed and known policy is generally infeasible, even if the policy can be expressed in a closed form. In alignment, we aim to fine-tune language models, which contain billions of weights/parameters. Any change in the sampling policy for an LM must be implemented by updating its internal weights. Moreover, the input space (prompt tokens) and the output space (response tokens) of an LM policy can be viewed as infinitely large. Sampling in a large discrete space is notoriously difficult [R1, R2]. There is no direct way to force a language model to generate responses by exactly following a mathematically known policy. This practical impossibility is ubiquitous in LM alignment, even in unconstrained problems, departing from the common belief in classic reinforcement learning.
Regarding the reviewer's follow-up comment that "preference-based approaches are sparser than the value of the combined reward", we do not have a clear understanding of its meaning. We kindly ask the reviewer for more clarification on this comment so that we can respond accurately.
4. Model-based evaluation of DPOs and RLHF.
We have supplemented the model-based evaluation results of DPOs and Safe-RLHF (see Figure 1 in the PDF attached to the global response). Our findings indicate that the model-based scores of the DPO LMs align closely with those in Figure 3 (middle). Notably, Safe-RLHF exhibits a better helpfulness vs. safety trade-off under model-based evaluation, supporting our claim that there can occasionally be a discrepancy between model-based and GPT-based evaluations. The surprisingly good safety level of Safe-RLHF under in-distribution model-based evaluation may be attributed to the large dataset used in training (nearly one million data points, see https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF).
References
[R1] Oops I Took A Gradient: Scalable Sampling for Discrete Distributions. Grathwohl et al., 2021.
[R2] A Langevin-like Sampler for Discrete Distributions. Zhang et al., 2022.
Dear Reviewer wcCD,
We are grateful for your insightful comments and feedback. In our detailed responses, we have carefully addressed each of your concerns. Given the impending rebuttal deadline, we kindly request that you review our responses to ensure that all issues have been adequately resolved. If further clarification or elaboration is required, we would be happy to provide additional explanations. Thank you for your time and consideration.
Best regards,
The authors of Submission 13052
Thank you for your response.
However, I still have a question regarding the policy update. Specifically, I'm curious why we don't use the closed-form solution of the optimal policy, which allows us to express it as a function of the reference policy and the reward/safety models. Why not use this function instead of explicitly learning the policy?
This is a minor question, and I will raise my score.
We thank the reviewer for responding to our rebuttal and for being willing to raise the score. We would like to offer a few more clarifications regarding your concern.
First, we would like to remark on the relation between the optimal policy, the reference policy, and the reward/safety models: the optimal policy is proportional to the reference policy reweighted by the exponentiated combined reward. However, this relation does not give an explicit formula for the optimal policy due to the missing prompt-dependent normalization factors (partition functions), each defined as a sum of the reweighted reference probabilities over all possible responses. Moreover, such normalization constants are intractable to compute due to the high-dimensional sampling space and input space (i.e., response tokens and prompt tokens).
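In generic KL-regularized notation (the symbols below are illustrative and not verbatim from our paper), the relation and the missing normalization factor read:

```latex
% Illustrative KL-regularized closed form (generic notation)
\pi^{*}_{\lambda}(y \mid x)
  = \frac{\pi_{\mathrm{ref}}(y \mid x)\,
          \exp\!\big(\tfrac{1}{\beta}\,[\,r(x,y) + \lambda^{*} g(x,y)\,]\big)}
         {Z_{\lambda^{*}}(x)},
\qquad
Z_{\lambda^{*}}(x)
  = \sum_{y}\pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\big(\tfrac{1}{\beta}\,[\,r(x,y) + \lambda^{*} g(x,y)\,]\big),
```

where the sum ranges over all possible response token sequences, making $Z_{\lambda^{*}}(x)$ intractable.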
Second, even assuming such a probability oracle were available, sampling from it remains far more challenging. The reasons are manifold, including the high-dimensional sampling and input spaces and the complex, hard-to-characterize landscape of the resulting probability distribution, unlike sampling from a common distribution such as a normal or uniform distribution.
We hope this response can clarify the reviewer's questions. We look forward to the follow-up discussion and are happy to address any further comments or questions.
Thank you for your response!
We thank the reviewer for the quick response. Since your previous comment noted that "This is a minor question, and I will raise my score", we were wondering whether there are any remaining clarification issues regarding the question that we could address better. Many thanks.
We thank all the reviewers for their careful review and valuable feedback. We have addressed all questions raised by the reviewers in the separate rebuttals below and are glad to address any further concerns. Multiple reviewers have brought up several matters, so we present a global response to these concerns below.
1. The empirical gap between MoCAN and PeCAN.
First, we want to clarify that we did not intend to claim that PeCAN outperforms MoCAN throughout the manuscript. The main motivation for having both PeCAN and MoCAN is to facilitate flexible working scenarios, specifically model-based and model-free scenarios. Since PeCAN requires pre-aligned models and uses log-probabilities to indicate ground-truth helpfulness and safety, its empirical performance compared to MoCAN is heavily tied to the quality comparison of the pre-trained reward/safety models (used in MoCAN) and the pre-aligned helpfulness-only/safety-only LMs (used in PeCAN). If the pre-alignment and downstream log-probability computation do not sufficiently indicate the ground-truth helpfulness/safety levels, PeCAN can underperform MoCAN.
Second, we would like to highlight the differences in training details between the beaver reward/cost models [12] and our pre-aligned DPO models. The first difference is in the size of the training data. The beaver reward/cost models are trained on the full Safe-RLHF dataset, which contains roughly 1 million data points (please refer to https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF). In contrast, the pre-alignment in our paper, due to resource limits, is conducted over a smaller dataset containing approximately 30K data points (please refer to https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K), as practiced in references [22,35]. The second difference lies in the training loss objective. While both the literature [12] and our work assume the Bradley-Terry preference setup, they additionally impose regularization upon the DPO-type loss function (see Equations (13) and (14) in [12]), which may greatly boost the robustness and empirical performance of the reward/cost models. Conversely, we pre-align models by faithfully following the DPO loss without regularization. The lack of regularization may result in less robust log-probabilities as proxies for ground-truth helpfulness and safety, as partly manifested in our calibration results in Figure 6.
Additionally, log-probabilities, by definition, can be sensitive to special tokens such as termination and new-line tokens, as well as a few potential outlier tokens (e.g., those caused by hallucination). Therefore, using the plain log-probabilities faithfully, as in the PeCAN algorithm, can be less indicative than pre-trained scalar reward/cost models for the ground truth. To address this, we supplement new experiments by adopting the log-probability computation module in the DPO pipeline on the Huggingface platform (https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py#L72) in pseudo-preference labeling, which incorporates many safeguard engineering tricks. For example, lines 760-771 adjust the difference between tokenizing the concatenated string of prompt+response and merging the separate tokenization results of prompt and response; lines 851-859 clip chosen and rejected responses to be of the same length; lines 862-870 add a BOS token in front of prompts; and lines 880-895 truncate responses to avoid the resulting token sequences being too long. Only the pseudo-preference labeling part is updated with the new computation module, while the pre-alignment maintains the same DPO manner. We find these additional safeguard tricks can significantly improve the performance of PeCAN and offset the visible gap between MoCAN and PeCAN (see Figure 2 in the PDF attached below).
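For illustration, below is a minimal sketch of the kind of response log-probability computation involved in pseudo-preference labeling (assuming a Huggingface-style causal LM and tokenizer; the function is hypothetical and only mirrors the tokenization-consistency and truncation concerns above, not the exact trl module):

```python
import torch

def response_logprob(model, tokenizer, prompt, response, max_len=512):
    """Sum of token log-probs of `response` given `prompt` under a causal LM.

    Illustrative sketch; the trl DPO trainer additionally handles BOS tokens,
    length clipping of chosen/rejected pairs, and other safeguards.
    """
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    # Tokenize the concatenated string once so prompt/response boundaries stay consistent.
    full_ids = tokenizer(prompt + response, add_special_tokens=False)["input_ids"]
    full_ids = full_ids[:max_len]                      # truncate overly long inputs
    input_ids = torch.tensor([full_ids])
    with torch.no_grad():
        logits = model(input_ids).logits               # (1, T, vocab_size)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = max(len(prompt_ids) - 1, 0)                # first position predicting a response token
    return token_lp[0, start:].sum().item()
```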
2. Incorporating adversarial robustness.
We thank the reviewer for highlighting this interesting direction. One promising approach could be incorporating robustness requirements as a constraint into our alignment framework. For instance, one could constrain the improvement margin of adversarial robustness by requiring the aligned LM to improve the expected safety score on adversarially modified prompts by a prescribed margin, where the modification is performed by an external entity that maliciously alters the original prompt input. While this basic formulation interprets the attacker as a function that acts only on prompts, one could also consider more sophisticated constraints that account for the interaction between the LM and the attacker. We refer the reviewer to several related studies [R1, R2, R3] for more details.
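One possible (illustrative) formalization of such a constraint, with an attacker $\mathcal{A}$ acting on prompts and a prescribed margin $b'$ (both of which are assumptions for illustration), is:

```latex
% Illustrative robustness-improvement constraint (attacker model and margin assumed)
\mathbb{E}_{x}\,\mathbb{E}_{y\sim\pi(\cdot\mid \mathcal{A}(x))}\big[g(\mathcal{A}(x), y)\big]
\;-\;
\mathbb{E}_{x}\,\mathbb{E}_{y\sim\pi_{\mathrm{ref}}(\cdot\mid \mathcal{A}(x))}\big[g(\mathcal{A}(x), y)\big]
\;\ge\; b'.
```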
However, it is questionable whether the one-shot computational benefit of exact dualization remains valid after incorporating such sophisticated constraints. We would like to leave this direction for future work and are glad to discuss it further in revisions.
3. Supplemented experiments.
As suggested by the reviewers, we supplement new experimental results, including (i) the evaluation of MoCAN- and PeCAN-alignment on the new benchmarks TruthfulQA and AdvBench, (ii) MoCAN-alignment using the new pre-trained beaver-7b-v3.0-reward/cost models, (iii) the model-based evaluation scores of the DPO and Safe-RLHF models, and (iv) PeCAN-alignment using the Huggingface log-probability computing module for pseudo-preference labeling. Please refer to the attached PDF for more details.
References
[R1] Representation Engineering: A Top-Down Approach to AI Transparency. Zhou et al., 2023.
[R2] Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering. Li et al., 2024.
[R3] Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector. Xu et al., 2024.
This paper presents a novel dualization-based method for transforming a constrained alignment of a large language model (LLM) into an unconstrained alignment. The method, named CAN, employs a two-stage policy learning framework that avoids the computationally intensive and unstable primal-dual iterations typically associated with Lagrangian-based policy optimization in reinforcement learning from human feedback (RLHF). The authors provide a theoretical analysis demonstrating that CAN effectively pre-optimizes a smooth and convex dual function, thereby enhancing both stability and performance. The paper introduces two practical algorithms based on CAN: MOCAN, for scenarios where reward or safety models are known, and PECAN, for cases where such models are unknown. Empirical results validate the effectiveness of these methods, showing a favorable balance between helpfulness and safety improvements.
Initially, there were concerns regarding the technical differences and performance gaps between MOCAN and PECAN. However, the authors addressed these concerns in their rebuttal with compelling arguments. It is recommended that the final version of the paper be updated to incorporate these clarifications. A notable limitation of the paper is that, while it presents algorithms designed to handle multiple safety constraints, the evaluation was limited to benchmarks involving only a single safety constraint. Overall, the paper makes a substantial contribution to the development of safety-aware language models by overcoming the limitations of traditional methods and providing strong theoretical and empirical support for its approach.