Towards Minimal Targeted Updates of Language Models with Targeted Negative Training
We propose Targeted Negative Training, a method to reduce the probability that a language model assigns to unwanted text while minimally changing model behavior otherwise.
Abstract
Reviews and Discussion
This work tackles the problem of updating a language model to reduce undesirable behavior (e.g. generating offensive content or hallucinations), while minimally changing generations elsewhere. The authors propose a method, TNT (Targeted Negative Training) which aims to keep the probability distribution of text close to the original except for instances which are intended to be removed. The paper presents some experiments with a T5 model on reducing hallucinations in summarization and in avoiding toxic responses.
Strengths
- The topic studied by this work is timely, and progress in this direction would be of interest to many in the community.
- The proposed method is novel and a significant contribution to the literature
Weaknesses
- The experiments could be greatly strengthened. The authors study a single model, T5, at a fairly small scale (220M parameters). No purely autoregressive model is studied. Only two tasks are examined. This lack of breadth makes it hard for readers to assess the generality of the claims in the paper. I believe this paper would benefit greatly from more experiments and ablations.
- There are very few comparisons to previous work, which in many cases tackle the same tasks presented in the experiments. As an example, avoiding toxic generations has been studied by [1,2]
[1] Lu, Ximing, et al. "Quark: Controllable text generation with reinforced unlearning." Advances in neural information processing systems 35 (2022): 27591-27609. [2] Ilharco, Gabriel, et al. "Editing models with task arithmetic." arXiv preprint arXiv:2212.04089 (2022).
Questions
I have no questions.
Thank you for your review. We have responded to your concerns below:
- [The experiments could be greatly strengthened. The authors study a single model T5 (220M). No purely autoregressive model is studied. Only two tasks are examined. This lack of breadth makes it hard for readers to assess the generality of the claims in the paper. I believe this paper would benefit greatly from more experiments and ablations.]
Thank you for the feedback. First, we would like to emphasize that in order to generate Figure 2, we ran 7 * 9 = 63 finetuning jobs on each of two very different generation tasks (hallucination in summarization and toxicity in response generation), for 126 jobs in total. Figure 2 shows that TNT consistently outperforms or at least matches baseline methods across all rates of reduction in both tasks, demonstrating that its benefit is broad and robust to different definitions of unwanted behavior. We have additionally added a set of ablations comparing methods on external data rather than model generations; TNT still outperforms the baselines, illustrating that the proposed losses are better even under a different data context, and the main results are better than the ablations, confirming our hypothesis that prioritizing more common token conditionals yields a more minimal update. Due to the extensive nature of each experiment on just one model and task, we were not able to complete such experiments for larger models (magnitudes more resource-intensive) or additional tasks (which require an appropriate dataset), but we hope the mathematical analysis and updated experiments convince the reviewer that there is merit in the proposed approach.
- [There are very few comparisons to previous work, which in many cases tackle the same tasks presented in the experiments, e.g. avoiding toxic generations has been studied by [1,2]]
We chose as baselines other finetuning-based methods that take into account token-level information for learning. The works the reviewer mentioned do not fall into this setting, but we have instead included them as background or related work:
- Quark finetunes an existing language model on its generations by prepending inputs with sequence-level rewards and regularizing with a KL penalty on the token-level conditional distributions. Then, during inference time one conditions on high reward to sample from the model. TNT differs in that it utilizes token-level annotations and directly specifies the output token probabilities rather than conditioning the inputs on the rewards.
- Task Arithmetic subtracts the weights of pretrained and finetuned models to yield task vectors which are composed together to add or negate behaviors. As these task vectors are a function of vanilla finetuning, e.g. via filtering the dataset of toxic language, this approach, like finetuning, will not yield minimal targeted updates.
Do you have any additional questions or concerns that you would like to share? If not, would you kindly consider raising your score?
Thank you for responding to my comments. While the number of experiments itself is not very small, there is a lot of redundancy in the experimental settings, which makes it hard to know whether the findings from this paper are broad and general. After this response, my concerns still stand, and I am thus sticking with my original scores. I believe this paper has great potential and would highly encourage authors to improve the breadth of their experiments.
This work provides a fine-tuning-based algorithm to update a language model to suppress harmful content generation while remaining as close to the original model as possible. Given a set of harmful responses to avoid (defined by a function that takes in a string and produces binary output), the authors define the ideal target model as one that matches the original model conditioned on never producing an undesirable string. The authors optimize a loss function based on this definition and find that the resulting model better satisfies the desired notion of safety.
Strengths
- The paper's key problem is relevant to practice and is well-specified.
- The paper's solution is similarly elegant and a natural consequence of the problem specification, providing a tractable solution to the proposed problem.
- The results demonstrate a solid improvement over reasonable baselines to provide targeted updates to the model.
Weaknesses
- The authors are missing connections to existing methods. For example, Korbak et al., 2022 (among others) show that PPO would converge to the same closed form presented in Eq. 2 when using the positivity/negativity classification as a reward function; this method would also avoid the drawbacks mentioned in the related work for inference-time procedures.
- The paper demonstrates the technique on some relatively easy benchmarks, and it would be much more interesting to try more complicated schemes. There are three ways in which I find the benchmarks weak:
- In the specific case where p_neg is defined as the presence of a bad token in the string, there is no need to do any training, and by simply ignoring bad tokens at decoding, one recovers the true optimal solution for both greedy and temperature decoding with zero overhead. Both the toxicity and hallucinations benchmarks provided in this text are dangerously close to "reject any sentence with a bad (word/entity)", which makes the application rather uninteresting. It would be cooler to see benchmarks where the reward model may still be automated but captures some global property of the sentence that requires a learning-based technique such as yours to solve.
- For the toxicity benchmark, the authors mention that 1.6% of the time, the completion is toxic. I believe a very natural baseline to this problem is performing temperature 1 decoding and regenerating anytime the output is toxic. I get the sense this will rarely require more than a few regenerations for this specific benchmark, making the learning rather excessive. Having a benchmark that requires a larger change will be a true test of this problem.
- For baselines, as mentioned in Weakness 1, I believe PPO is a more natural baseline for learning from a reward model
Overall, I believe the work can better demonstrate the technique.
Questions
- At the top of page 7, the paper mentions that all experiments were done with greedy decoding. Does that mean fine-tuning was also based on greedy generations? If only 1.6% of model generations were toxic, would the model get any gradient signal for 98.4% of the generations?
- Why is TNRLL not called TNFR?
- Just to be clear, does "token level annotations" refer to a function that can take a sentence and assess whether it is negative or positive? This was not super clear to me, even though there were a few sentences dedicated to it on page 4. If it is a function, it might be better specified as such.
Thank you for your review! Addressing each comment below:
- [The authors are missing connections to existing methods. For example, PPO would converge to the same closed-form presented in Eq. 2 when using the positivity/negativity classification as a reward function; this method would also avoid the drawbacks mentioned in the related work for inference time procedures.]
Thank you for bringing this up. We have added to the related work that, given a positivity/negativity classifier as a reward function, the objective PPO seeks to maximize (equivalent to minimizing a reverse KL) has an optimal solution that is a soft version of the result TNT targets: the PPO optimum reweights the original distribution by an exponentiated reward term, while the TNT target zeroes out negative sequences and renormalizes, and the former is approximately equal to the latter when the KL-penalty coefficient is small.
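For concreteness, a hedged sketch of the relationship (our notation, not fixed by the thread above: $p_0$ is the original model, $r(x)\in\{0,1\}$ the classifier-based reward, and $\beta$ the KL-penalty coefficient):

```latex
\begin{align*}
p_{\mathrm{PPO}}(x) &\propto p_0(x)\,\exp\!\big(r(x)/\beta\big)
  && \text{(KL-regularized RL optimum)} \\
p_{\mathrm{TNT}}(x) &\propto p_0(x)\,\mathbb{1}[r(x)=1]
  && \text{(targeted update: negatives zeroed, then renormalized)}
\end{align*}
% As \beta \to 0, the weight \exp(1/\beta) on sequences with r(x)=1 dominates
% the weight \exp(0)=1 on sequences with r(x)=0, so the normalized PPO optimum
% approaches the TNT target.
```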
- [It would be cooler to see benchmarks where the reward model may still be automated but captures some global property of the sentence that requires a learning-based technique such as yours to solve.]
We agree that the wordlist setup does not demonstrate the full practical utility of TNT and have replaced this experiment with one where we train a token-level classifier to identify toxic spans. In this updated experiment, the benefit of TNT over baselines is even more pronounced, with TNT methods overall (and each of TNRF, TNRR, and TNRLL individually) being strictly better than baselines across all ranges of similarity and reduction (see Figures 2(c) and (d)).
- [(paraphrased for brevity) For the toxicity benchmark, learning is rather excessive for a base rate of 1.6% (e.g., natural baseline is to regenerate anytime output is toxic). Having a benchmark that requires a larger change will be a true test of this problem.]
We agree, and with our new broader definition of toxicity defined by a token-level classifier trained on human annotations, the level of toxicity of the original model is 8%. Moreover, this 8% rate is not constant across inputs; for instance, we see that for examples where the input contains toxicity (based on the same classifier definition), the model generations have toxic language 23.3% of the time. Thus in our updated experiment we have a setting where regeneration would be quite expensive for a certain subset of inputs, making finetuning a more practical alternative.
- [I believe PPO is a more natural baseline for learning from a reward model.]
Thanks for the suggestion. We initially did not consider PPO as a baseline since it is an algorithm generally used to optimize sequence-level rewards. Wu et al. 2023 recently introduced Fine-Grained RLHF, which could be used to consider token-level rewards, but we did not have the chance to implement this baseline given time constraints. (It is also worth noting that this baseline is much more complicated than the proposed approach, requiring sampling from the model during training as well as keeping track of four separate models, including a separate value model that is trained simultaneously with the policy model.) Our experiments do hint at the opportunity for considering alternatives beyond such a baseline, though. Whereas the objective considered by the PPO algorithm is equivalent to minimizing the reverse KL divergence between the current model and the corresponding target distribution, regardless of whether we consider sequence- or token-level rewards, TNT encompasses a suite of objectives that minimize not just the reverse KL between token-level conditional distributions but also other combinations of divergences. Our experiments show that other divergence combinations may be preferable in certain scenarios, even when the algorithm is fixed regardless of objective; in particular, at lower levels of reduction, TNT losses that use forward KL (i.e., TNRF and TNFF) tend to outperform TNRR at maintaining similarity with the original model generations.
- [At the top of page 7, the paper mentions that all experiments were done with greedy decoding. Does that mean fine-tuning was also based on greedy generations? If only 1.6% of model generations were toxic, would the model get any gradient signal for 98.4% of the generations?]
Thanks for the question. The fine-tuning was also based on the greedy generations; however, the token-level KL divergence terms are computed using the full token-level output distributions, such that any deviation of any probability mass between the current model and the desired target output distributions will yield gradient signal. We have also confirmed empirically that the model gets gradient signal throughout training (via logging the gradient norm).
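To illustrate why the signal is not limited to toxic generations, here is a minimal PyTorch-style sketch of a token-level reverse-KL term computed against full output distributions (function and variable names are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def token_level_reverse_kl(new_logits, target_probs):
    """Per-position KL(p_new || p_target), summed over the full vocabulary.

    new_logits:   (batch, seq_len, vocab) logits of the model being finetuned.
    target_probs: (batch, seq_len, vocab) target conditionals (the original
                  model's probabilities, renormalized with negative tokens
                  zeroed where annotated, unchanged elsewhere).
    """
    new_log_probs = F.log_softmax(new_logits, dim=-1)
    new_probs = new_log_probs.exp()
    # Every vocabulary entry at every position contributes, so any mismatch in
    # probability mass yields gradient signal even when the sampled token at
    # that position was not annotated as negative.
    kl = (new_probs * (new_log_probs - torch.log(target_probs + 1e-12))).sum(dim=-1)
    return kl.mean()
```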
- [Why is TNRLL not called TNFR?]
We use R to denote reverse KL, F to denote forward KL, and LL to denote log-likelihood (forward KL based on the sample only, rather than full token output distributions). The first character(s) denote loss on negative tokens and the second character(s) denote loss on non-negative tokens. TNRLL uses reverse KL on the indices where there is a negative token and otherwise maximizes the log-likelihood of the sampled token.
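As a compact summary of the naming convention (our notation: $p_\theta$ is the model being finetuned and $p_{\mathrm{tgt}}$ the per-prefix target; the KL directions shown are our reading of the standard convention, not a quotation from the paper):

```latex
\begin{align*}
\text{R (reverse KL):}\quad & \mathrm{KL}\!\left(p_\theta(\cdot \mid x_{<t}) \,\|\, p_{\mathrm{tgt}}(\cdot \mid x_{<t})\right) \\
\text{F (forward KL):}\quad & \mathrm{KL}\!\left(p_{\mathrm{tgt}}(\cdot \mid x_{<t}) \,\|\, p_\theta(\cdot \mid x_{<t})\right) \\
\text{LL (log-likelihood):}\quad & -\log p_\theta(x_t \mid x_{<t})
\end{align*}
% e.g. TNRLL = R at indices with a negative token, LL at all other indices.
```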
- [Just to be clear, does "token level annotations" refer to a function that can take a sentence and assess whether it is negative or positive?]
Token-level annotations give a label to every token of a sequence, e.g. hallucinated entity as negative while the rest of the sentence is positive. An example function is the newly added token-level classifier for labeling toxic spans.
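For instance, a hypothetical token-level annotation for a generated summary might look like the following (format and example purely illustrative):

```python
# Hypothetical token-level annotation: 1 marks an unwanted span (here, a
# hallucinated entity), 0 marks tokens to leave unchanged.
tokens = ["The", "plant", "was", "inspected", "by", "Dr.", "Smith", "."]
labels = [0,     0,       0,     0,           0,    1,     1,       0]
annotations = list(zip(tokens, labels))
```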
We appreciate your valuable comments and questions, as they have significantly contributed to strengthening our paper. If you have any further comments, we would be more than happy to address them. Alternatively, if the reviewer agrees that the recent modifications have improved the paper, would you be willing to consider raising your score? Thank you for your time and thoughtful feedback!
Thank you for the further clarification and experiments during the rebuttal process. My score remains at a 6.
Summary:
The paper introduces Targeted Negative Training (TNT), a method for minimally updating language models to prevent unwanted outputs. Unlike previous techniques, TNT fine-tunes models using negative examples generated by the model itself, with the goal of closely aligning the updated model's distribution to the original while avoiding specific undesired behaviors. The method operates by minimizing reverse KL-divergence, ensuring the updated model does not deviate significantly from its initial training. Experiments demonstrate that TNT effectively maintains original model performance better than other negative training approaches while reducing unwanted behaviors. However, it requires access to the original model and detailed token-level annotations, presenting potential practical challenges. TNT's iterative nature also suggests it could be used to enhance model safety over time through continuous refinement.
Strengths
Advantages:
- Iterative Model Updates: TNT allows for iterative updates to language models without needing all negative tokens to be specified upfront. This flexibility is advantageous for practical applications where updates may be continuous and ongoing.
- Maintained Model Performance: The experimental setup indicates that TNT can effectively maintain the original model's performance better than baseline methods while also reducing unwanted behaviors.
- Reproducibility and Accessibility: The experiments are reproducible, with the promise of making the code public and using publicly available datasets, which enhances the credibility and utility of the research.
Weaknesses
Disadvantages:
- Limited Novelty: The core ideas and methods of TNT may not be as novel as claimed, given the prior existence of the NADO framework [NeurIPS 2022, https://arxiv.org/pdf/2205.14219.pdf], which appears to address similar goals using related techniques. Furthermore, NADO has proven its objective to be the theoretical closed-form solution of the shared target of TNT/NADO, which weakens the value of the approximated solution (step-level branch-cutting) given by TNT. TNT's flexibility is also less than that of NADO, as it requires auxiliary negative annotation, which is a setup closer to that of the FUDGE algorithm [NAACL 2021, https://arxiv.org/abs/2104.05218].
- Oversight in Literature Review: Even if TNT can differentiate itself from NADO and FUDGE (since the two previous methods study different tasks, yet essentially with a similar mathematical setup), the absence of a citation and/or discussion of these existing works suggests a possible gap in the literature review process, which might call into question the thoroughness of the background research conducted for the paper.
Questions
What is the essential difference between TNT and previous constrained decoding algorithms (FUDGE, NADO, NeuroLogic, etc.) that aim to maximize/minimize a given sequence-level boolean function (specified explicitly through a symbolic process or implicitly through negative samples) that defines the negativity of samples?
Thank you for your review. Thank you especially for highlighting a key strength of our approach in its ability to enable iterative model updates: “TNT allows for iterative updates to language models without needing all negative tokens to be specified upfront. This flexibility is advantageous for practical applications where updates may be continuous and ongoing.”
We first wish to clarify and address two points the reviewer made in the paper summary:
- [The method operates by minimizing reverse KL-divergence, ensuring the updated model does not deviate significantly from its initial training.]
The target distributions of interest are those that are optimal in terms of reverse KL divergence with the original under the constraint that negative tokens are avoided. The suite of TNT algorithms then optimizes for these target distributions via different combinations of forward and reverse KL divergence (depending on whether the token conditional requires a negative update or not). This is different from "the method operates by minimizing reverse KL-divergence."
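Concretely, a sketch of the per-prefix target (our notation: $p_0$ is the original model and $\mathcal{N}(x_{<t})$ the set of tokens annotated as negative after prefix $x_{<t}$):

```latex
p_{\mathrm{tgt}}(x_t \mid x_{<t}) \;=\;
  \frac{p_0(x_t \mid x_{<t})\,\mathbb{1}\!\left[x_t \notin \mathcal{N}(x_{<t})\right]}
       {\sum_{v \notin \mathcal{N}(x_{<t})} p_0(v \mid x_{<t})}
% The TNT losses then match the finetuned model's conditionals to this target
% with forward KL, reverse KL, or log-likelihood, depending on the variant and
% on whether the index carries a negative annotation.
```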
- [It requires access to the original model and detailed token-level annotations, presenting potential practical challenges.]
A finetuning method like TNT is specifically meant for the maintainers of a given language model (i.e., to iteratively update their model), so we do not foresee a problem in the method requiring access to the original model. We agree that the fact that the training algorithms require the output probabilities of the original model could present some issues around the amount of memory needed during training, but there are strategies to reduce memory consumption by trading off with compute time, e.g. relevant output probability distributions could be precomputed.
As for token-level annotations, we address the practicality of this information in Section 2.1. To summarize, while we agree that in some cases token-level annotations can be more expensive to collect than sequence level ones (e.g. it is easier to identify that a piece of code fails than to find the source of a bug), in many cases the two involve comparable effort, e.g. labeling an overall sequence with “has hallucination” or “has offensive language” generally requires identifying the hallucination or offensive language itself. In such a setting, we expect a method that takes this more fine-grained information into account to be more sample-efficient (a recent example being fine-grained RLHF vs sequence-level RLHF).
Next, we wish to address the reviewer’s concerns about novelty.
- [The core ideas and methods of TNT may not be as novel as claimed, given the prior existence of the NADO algorithm framework which appears to address similar goals using related techniques.]
Thank you for bringing the NADO paper to our attention. It presents a great result that one can turn sequence-level constraint information into token-level guidance by estimating, for every prefix, the probability that the completed sequence satisfies the constraint. However, we disagree that the existence of this analysis and the NADO algorithm weakens the novelty or value of TNT. Ways in which the methods differ:
- NADO provides an inference-time algorithm for sampling from the desired target distribution while TNT is a suite of finetuning objectives to approximate a target distribution by matching token conditionals.
- NADO uses sequence-level annotations while TNT uses token-level annotations.
- NADO involves training an auxiliary model while TNT involves finetuning the language model itself.
- NADO seeks to learn, for every prefix, the probability that the full sequence will satisfy the sequence-level constraint, while TNT seeks to learn the target token-level conditional distribution at every prefix. For the former, every element in the set is intractable and so can only be estimated via samples, whereas for the latter, every element in the set can be queried exactly via a model forward pass.
Moreover, where the two works overlap is not a contribution of either approach. Namely, we define a minimal targeted update as the closest distribution in reverse KL to the original given that negative examples are zeroed out (Eq 1), and the solution (Eq 2) matches the distribution given in Eq 3 of the NADO paper (the authors define the problem using the forward KL, but we assume they intend the resulting distribution and not the stated KL divergence, since the forward KL is infinite for any distribution whose support is more constrained than the original's). This setup was also given in [1, 2] before both the TNT and NADO papers, and our paper also connects this formulation to other works in the controllable generation space that had not previously mentioned it (see Relationship to Conditioning under 2.3).
TNT’s contribution relative to [1, 2] is to show how moving from the sequence-level to token-level allows for a simple suite of finetuning methods (instead of the multi-step training algorithms proposed by [1, 2]). NADO’s contribution relative to [1, 2] instead is to show how one can translate sequence-level annotations to token-level guidance.
These contributions are in fact synergistic; as we have now highlighted in our related work, it is possible to combine the complementary contributions of both works:
Meng et al. 2022 also consider sequence-level constraints but prove that this setting can be translated into token-level guidance (i.e., relating the sequence-level constraint to token-level conditionals) via approximating, for every prefix, the probability that a completion satisfies the sequence-level boolean constraint function. Consequently, Meng et al. 2022 are able to propose a simpler algorithm than existing work, and by combining ideas in their work with ours, it is possible to define a finetuning algorithm that optimizes analytical token-level divergences even given only sequence-level annotations (define the target using results in Meng et al. 2022, optimize it using TNT).
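One way such a combination might look, purely as our own sketch (neither paper specifies this interface; `constraint_prob` is a hypothetical NADO-style estimator of the probability that a completion of the given prefix satisfies the sequence-level constraint):

```python
# Hypothetical recipe: derive token-level negative annotations from a
# sequence-level constraint via a NADO-style auxiliary estimator, then
# finetune with a TNT loss on those annotations.
def annotate_tokens(generation, constraint_prob, threshold=0.5):
    labels = []
    for t in range(1, len(generation) + 1):
        # Mark token t as negative if appending it makes constraint
        # satisfaction unlikely relative to the preceding prefix
        # (an empty prefix corresponds to the prompt alone).
        p_after = constraint_prob(generation[:t])
        p_before = constraint_prob(generation[:t - 1])
        labels.append(int(p_after < threshold * p_before))
    return labels
```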
- [NADO has proven its objective to be the theoretically closed form solution of the shared targets of TNT/NADO, which weakens the value of the approximated solution (step-level branch-cutting) given by TNT]
We disagree with the reviewer's claim that TNT is an approximated solution. The solution TNT targets yields the desired sequence-level distribution exactly, as the sequence-level distribution is correct when all token-level conditionals are correct. However, as it is not possible to enumerate all token-level conditionals, TNT focuses on matching those that are more commonly encountered under the original model. This approximation is analogous to that of any objective targeting a solution that cannot be enumerated, even though the optimal solution itself is exact. Similarly, NADO relies on the auxiliary model accurately estimating, for every prefix, the constraint-satisfaction probability given only samples.
- [TNT's flexibility is also less than that of NADO as it requires auxiliary negative annotation]
We agree that TNT's assumption of token-level annotations is stronger than NADO's assumption of sequence-level annotations, but as discussed earlier in point 2, there exist situations where collecting more fine-grained annotations is more advantageous from an effort-to-value ratio (e.g., labeling a sequence as "has hallucination" requires identifying the hallucination), and a method such as TNT allows one to take advantage of such fine-grained information via an easy-to-implement algorithm. Furthermore, as mentioned above, the theoretical result in NADO makes it possible to define a TNT variant that continues to match token-level conditionals even when only sequence-level information is given.
- [Absence of citation and/or discussion of NADO and FUDGE suggests a possible gap in the literature review process]
We first would like to point out that we have already cited FUDGE in our paper and explained how it is an inference-time controllable generation approach, in contrast to our finetuning-based approach. Thank you for bringing NADO to our attention. We have now added a citation and discussion of it as well.
- [What is the essential difference between TNT and previous constrained decoding algorithms (FUDGE, NADO, Neural Logic, etc.)?]
Thank you for the question. The primary difference between TNT and the mentioned algorithms is that TNT is a finetuning method that allows for iterative updates with no changes to the generation pipeline, whereas the mentioned algorithms change the inference-time generation procedure and thus incur additional prediction cost, increasingly so if one wishes to iteratively add constraints. Comparing the algorithms individually:
- FUDGE uses an auxiliary token-level classifier to modify the output probabilities of a language model during inference, while TNT uses such a classifier (or other methods for token-level annotations) to label model generations for finetuning.
- NADO trains an auxiliary language model to predict, for every prefix of a model generation, the same label as the overall sequence; then, during inference, NADO modifies the output probabilities of the primary language model based on the outputs of the auxiliary model. TNT instead finetunes the original language model by matching its token conditionals with target distributions, defined by the original model and external annotations (which, as discussed above, could come from the NADO auxiliary model when only sequence-level constraints are given). Also, TNT only runs a single language model during inference.
- NeuroLogic Decoding is a decoding process that takes into account logical constraints, while TNT involves learning from negative examples. As a result, TNT can handle unwanted behavior that is hard to define based on logical constraints.
Are there any additional concerns that the reviewer would like us to address? If not, could the reviewer kindly reconsider their score? We sincerely appreciate your consideration, thank you.
References:
[1] Khalifa et al. A distributional approach to controlled text generation. ICLR 2021.
[2] Korbak et al. Controlling conditional language models without catastrophic forgetting. ICML 2022.
- The paper tackles the problem of modifying the generative distribution of LLMs to reduce the likelihood of generating undesirable tokens whilst simultaneously not deviating too much from the original distribution.
- Concretely, the authors present their approach (dubbed Targeted Negative Training, or TNT), wherein for each prefix and continuation from an already trained model, if the target token is annotated to be undesirable then the approach minimizes a divergence between the learned model's distribution and a modified target distribution obtained from the original model by setting the probability of the undesirable token to 0 and renormalizing. Otherwise the objective minimizes a divergence between the original and the learned model distribution.
- The authors explore different instantiations of divergences: (forward KL, forward KL), (reverse KL, reverse KL) and (reverse KL, forward KL) for the (negative token signal, positive token signal), which they dub (D_{n}, D_{p}) respectively. In addition, they also explore (forward KL, maximum likelihood) and (reverse KL, maximum likelihood) for (D_{n}, D_{p}).
- The results presented show that the proposed method allows for better tradeoffs between controlling for hallucinations and keeping the model close to the original distribution.
Strengths
- The proposed method for leveraging sentence level annotations in order to modify model behaviour is quite interesting.
- I quite like the extensiveness of explorations in terms of exploring the different divergences for both the positive and negative signals. From the results, both TNFF and TNRF seem to achieve a pretty good tradeoff between being faithful to the original distribution and controlling for hallucinations.
Weaknesses
- For the different proposed models, without an equivalent table similar to Table 1, it is hard to understand the effectiveness of the approach. Concretely, similar to the TNFLL and TNRLL rows, it would also be good to have an equivalent row for TNFF, TNRR and TNRF.
- From Table 1, for both the TNFLL and TNRLL approaches, the performance is considerably worse compared to the baseline methods. For TNFLL, the hallucination rate is substantially higher, while for TNRLL, the BLEU score is much lower. Given this observation, I am hesitant to believe that the proposed approach is actually substantially better than the baseline approach.
- In my opinion, intuitively, because this approach minimizes the KL at a prefix level, especially considering the fact that the annotations obtained are from a noisy source, it is possible that this approach would steer the model towards not predicting certain words in certain contexts. Concretely, (based from the example in Figure 5), for the sentence "In some regions of the country, the sex ratio is still quite concerning", because the annotations are noisy, the word "sex" would be (incorrectly) marked as offensive. Consequently, because of the proposed objective, the model might not be able to produce the token "sex" for a similar prefix as "In some regions of the country, the", even if it did make sense in the context. I think this is a reasonably big limitation of the approach, and it would have been nice to have some discussion on this in the paper.
Questions
- Would it be possible to add rows for TNFF, TNRR and TNRF in Table 1?
- Would it be possible to provide some clarification on how Figure 2 was constructed? Specifically, is the level of hallucination mapped to a different value of \alpha used (so higher \alpha -> lower hallucination rates and original distribution fidelity)?
Thank you for your review. We address each of your points below:
- [For the different proposed models, without an equivalent table similar to Table 1, it is hard to understand the effectiveness of the approach. Would it be possible to add rows for TNFF, TNRR and TNRF in Table 1?]
Thank you for the suggestion; we have updated Table 1 to include all TNT methods. We’d also like to clarify that Figure 2 (rather than Table 1) is meant to encapsulate the main result, as it considers all methods across all alpha values in one plot; Table 1 on the other hand only looks at a single fixed alpha value and is instead meant to focus on how changing the loss on negative tokens alone can reduce disfluencies. This is the reason why the original Table 1 focused on TNT methods that share the same loss as baselines on non-negative tokens (i.e., TNFLL and TNRLL). We’ve updated the text to draw attention to Figure 2 as the main result.
- [From Table 1 TNFLL hallucination rate is substantially higher (than baselines), while TNRLL BLEU score is much lower (than baselines). Given this observation, I am hesitant to believe that the proposed approach is actually substantially better than the baseline approach.]
First, we wish to reiterate that Figure 2 gives the main result, that TNT methods yield a better trade-off of reducing unwanted behavior vs. maintaining similarity to the original. While we intended Table 1 to focus on disfluency results, based on your feedback we realize how the current presentation of Table 1 could be confusing, as we also report similarity and reduction metrics but only for a single fixed value of alpha, which puts different methods at different points along their similarity vs. reduction curves. This fixed value of alpha resulted in rows in Table 1 that are hard to compare along similarity and reduction metrics, e.g. if one row has more reduction but lower similarity than another, which is better? Figure 2 is meant to address this challenge. To make Figure 2 and Table 1 congruent, we have additionally updated Table 1 to choose the alpha that yields the best BLEU score among alpha values that achieve a given rate of reduction (75%). See the updated table for toxicity below. Now, the methods can be more easily compared across all metrics (similarity, reduction, and disfluency); for instance, one can see that, given a 75% rate of reduction is achieved, TNRF is strictly better than baselines on all metrics, all TNT methods except TNFLL yield better similarity results than baselines, and all TNT methods perform better than baselines on the number of total introduced disfluencies (sum of Repeats and Random ??).
| Method | BLEU | ROUGE-L | Seq Acc | Toxicity | Repeats | Random ?? |
|---|---|---|---|---|---|---|
| Original | 100.0000 | 100.0000 | 100.0000 | 8.1830 | 16 | 4 |
| NL + LL () | 13.6497 | 32.6104 | 2.0661 | 1.8150 | 287 | 136 |
| UL + LL () | 37.1265 | 59.9806 | 20.4405 | 1.6784 | 23 | 1122 |
| TNFLL () | 33.7884 | 57.1009 | 18.1630 | 1.7577 | 36 | 1 |
| TNRLL () | 39.1922 | 61.4776 | 22.9207 | 1.9471 | 23 | 1 |
| TNRR () | 55.9532 | 71.5365 | 35.1366 | 1.4493 | 34 | 3 |
| TNRF () | 60.2071 | 74.8574 | 39.6167 | 1.0396 | 21 | 3 |
| TNFF () | 61.0565 | 74.2388 | 40.0749 | 1.9031 | 33 | 3 |
- [It is possible that this approach would steer the model towards not predicting certain words in certain contexts. I think this is a reasonably big limitation.]
We take the reviewer's comment to mean that if the model is not context-aware when pushing down probability mass over certain tokens, then it is possible for the model to begin avoiding a given token even in contexts where it would be acceptable. Fortunately, this concern is not an issue for TNT because it is context-aware: all negative annotations are a function of the prefix, which means that the negative signal the model sees is of the form "X is bad when preceded by Y" rather than just "X is bad." As the wordlist example does not showcase this context-aware property of the method since the annotations themselves do not take into account context, we replaced the wordlist experiment with one based on a token-level classifier that labels spans as toxic based on their context.
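To illustrate the prefix-dependence, consider the following hypothetical annotations (examples ours, echoing the reviewer's "sex ratio" case):

```python
# Hypothetical annotations: labels are a function of the prefix, so the same
# surface token can be negative in one context and acceptable in another.
examples = [
    {"prefix": "You disgusting piece of", "token": "filth",
     "negative": True},   # toxic continuation: pushed down for this prefix only
    {"prefix": "In some regions of the country, the", "token": "sex",
     "negative": False},  # benign usage (reviewer's example): left untouched
]
```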
- [Would it be possible to provide some clarification on how Figure 2 was constructed? Specifically, is the level of hallucination mapped to a different value of \alpha used (so higher \alpha -> lower hallucination rates and original distribution fidelity)?]
To construct Figure 2, for each method we performed nine separate updates using 9 different alpha values. Then, we iterate through hallucination (or toxicity) rates in increments of 0.1 and plot the highest BLEU achieved across alpha values that achieve less than that given level of hallucination (or toxicity). The model selection was based on the validation set results, while the results themselves are on the test set. And yes, in general a higher alpha corresponds to lower hallucination rates and lower original distribution fidelity (exceptions include baseline models, which struggle to increase original distribution fidelity above a certain level even as alpha becomes smaller, and cases where an alpha that is too large causes optimization issues).
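In pseudocode, the curve construction roughly follows the procedure below (our reconstruction of the description above, not the authors' code; field names are placeholders):

```python
def tradeoff_curve(runs, rate_step=0.1):
    """runs: one entry per alpha value for a single method, e.g.
    {"val_rate": ..., "val_bleu": ..., "test_bleu": ...}."""
    curve = []
    for i in range(1, int(round(1 / rate_step)) + 1):
        threshold = i * rate_step
        # Select the alpha using validation metrics only...
        eligible = [r for r in runs if r["val_rate"] < threshold]
        if not eligible:
            continue
        best = max(eligible, key=lambda r: r["val_bleu"])
        # ...then report the corresponding test-set point.
        curve.append((threshold, best["test_bleu"]))
    return curve
```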
Thank you for your comments. Please let us know if you have additional questions or concerns! If not, would you consider raising your score?
Thank you to the reviewers for providing feedback on this work. We appreciate that reviewers noted the following strengths:
- Our work is relevant: “The key problem is relevant to practice and is well-specified” (VP85), “the topic studied by this work is timely” (eQid).
- Our contribution is novel, elegant, and advantageous: “The paper’s solution is similarly elegant [to the well-specified problem]” (VP85), “The proposed method… is quite interesting” (DCRf), “The proposed method is novel and a significant contribution to the literature” (eQid), “TNT allows for iterative updates…advantageous for practical applications where updates may be continuous and ongoing” (fDnb).
- Our paper is sound and well-executed: “I quite like the extensiveness…of exploring the different divergences for both the positive and negative signals” (DCRf), “results demonstrate a solid improvement over reasonable baselines” (VP85), “experimental setup indicates that TNT can effectively maintain the original model's performance better than baseline methods while also reducing unwanted behaviors” (fDnb).
To summarize, our technical innovations are the following:
- We introduce Targeted Negative Training, a suite of algorithms to update an existing language model to remove probability mass over undesirable outputs.
- Relative to related work (reference numbers match the updated paper), TNT provides
- a suite of finetuning algorithms which allow the existing model to be updated in place, rather than inference-time approaches (e.g., Dathathri et al. 2019, Krause et al. 2021, Yang & Klein. 2021, Liu et al. 2021, Meng et al. 2022) which incur additional generation cost for each additional constraint imposed.
- a simple way to take advantage of token-level annotations which are more efficient for targeted updates than their sequence-level counterparts (considered in Ziegler et al. 2019, Khalifa et al. 2020, Korbak et al. 2022, Meng et al. 2022).
- objectives whose solutions are minimal targeted updates that push down probability mass over unwanted outputs while minimally changing model behavior otherwise, unlike methods whose optimum/optima are not sufficiently constrained (He & Glass 2020, Welleck et al. 2020) or only interpolate between these goals (Ziegler et al. 2019, Lu et al. 2022, Wu et al. 2023).
- The approach enables the iterative updating of a model as new constraints or specifications of what is unwanted arise. The commutative nature of negative updates allows them to be applied in any order without undoing the work of previous updates. This setup stands in contrast to “positive updates” where catastrophic forgetting from iterative finetuning is a common concern (e.g. Luo et al. 2023).
Empirical contributions:
- We show that using previously proposed negative losses alone (i.e., without corresponding positive signal on negative token indices) results in a substantial increase in disfluencies.
- We show that switching to TNT losses on negative tokens alone can reduce the number of disfluencies introduced in an update by multiple orders of magnitude relative to baseline methods.
- We show that TNT losses enable a better trade-off between reducing unwanted behavior and keeping generations as close to their original as possible, compared to baseline losses which consider the same token-level negative signal (composite BLEU vs. unwanted rate AUCs improved by 24% and 125% in tasks considered).
- We show that individual TNT losses can be globally better than all baseline methods studied. For example, in our toxicity reduction experiment, we find that TNT methods TNRR and TNRF outperform baseline methods in maintaining original generation behavior across all possible reduction rates.
- We study an extensive set of methods and hyperparameters, resulting in the comparison of over 190 different finetuned generative models across two distinct tasks (reducing hallucination in summarization and reducing toxicity in response generation).
In response to your comments (reviewer in brackets), we have made the following changes:
- Reproduced Figure 2 for other similarity metrics and updated Table 1 to include all TNT methods, to underscore the robustness of the results [DCRf]. The former shows that TNT outperforms baselines across a wide range of similarity metrics, and the latter shows that all TNT methods additionally outperform baselines in disfluencies introduced. Also added disfluency vs. rate reduction scatterplots (Figure 6) in the appendix to showcase disfluency results analogous to Table 1 for other reduction rates.
- Replaced the current toxicity reduction experiment based on a wordlist with an experiment involving a token-level classifier for annotating toxic content, to emphasize the utility of the proposed method on a more interesting setting / definition of toxic [VP85]. In this updated task, the superiority of TNT over baselines is even more stark.
- Added NADO to the related work and how TNT differs [fDnb]. Also included how the two can be combined in one algorithm that takes advantage of the contributions of both works.
- Added an ablation which uses token-level annotations on external data instead of model generations, to increase the breadth of the experiments [eQid]. Under the additional experiments, TNT continues to outperform baselines.
Overall, our experiments analyze the results of over 190 finetuning runs, and we have a full page of related work. We've addressed each reviewer's specific concerns individually below.
We hope that reviewers find our response and updates satisfactory and are willing to consider raising their scores. Thank you!
The paper proposes a method to make small corrections to LLMs to avoid generating certain tokens while keeping the generation distribution close to the original.
Strengths of the paper:
- the method uses sentence level annotations in order to modify the original model.
- experiments on T5 show improvement over baselines on the toxicity metric.
Weakness of the paper:
- The quality of the model (BLEU/ROUGE) is significantly worse than the original. This is contrary to the paper's desired goal of maintaining the original behavior.
- The method is only tested on one model (T5), although many fine-tuned models are derived from it. It would be better to add a decoder-only model.
- The result table and main conclusion are a bit confusing given the current presentation.
Reviewers also pointed out missing comparisons with NADO [https://arxiv.org/pdf/2205.14219.pdf] and Task Vector [https://arxiv.org/abs/2212.04089]. The authors provided valid responses. NADO is quite a different method on a different task; the authors should not feel obligated to discuss or compare against it. Task Vector does not satisfy this paper's problem definition of minimally changing model behavior.
Why Not a Higher Score
The paper's experiments are not substantial enough to justify the generality of its claims. But the paper does have some merit, so I do not mind if it is accepted.
Why Not a Lower Score
N/A
Reject