Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models
Introduce refusal tokens to enable control over a single model’s refusal rates and discuss desirable data properties for optimizing this approach.
Abstract
Reviews and Discussion
The paper proposes a simple technique: introduce refuse and respond tokens (and category-level refuse tokens) when doing SFT with refusal data. The technique provides out-of-the-box control by thresholding the probability of the refusal token. Through experiments with Llama 3 8B, they show that the method enables controlling the TPR and FPR to produce a good ROC curve. Additionally, category-level tokens enable controlling specific categories of refusal while not affecting the other categories.
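In rough pseudocode, the inference-time thresholding described here could look like the sketch below (the checkpoint path, the control-token strings, the omission of chat-template formatting, and the renormalization over the two control tokens are assumptions for illustration, not details confirmed by the paper):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint path and control-token strings.
MODEL = "path/to/refusal-token-sft-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

REFUSE_ID = tokenizer.convert_tokens_to_ids("[refuse]")
RESPOND_ID = tokenizer.convert_tokens_to_ids("[respond]")

def refusal_probability(prompt: str) -> float:
    """P([refuse]) at the first response position, renormalized over the two control tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # next-token logits after the prompt
    return torch.softmax(next_logits[[REFUSE_ID, RESPOND_ID]], dim=-1)[0].item()

def should_refuse(prompt: str, threshold: float = 0.5) -> bool:
    # Sweeping the threshold trades TPR against FPR, tracing out the ROC curve.
    return refusal_probability(prompt) >= threshold
```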
Update - The authors have addressed my concern and hence I'm increasing my score.
Reasons to Accept
- Proposed technique is simple and easy to adopt on top of any model/architecture/data.
- The technique inherently provides an easy way to control refusals by thresholding on the refusal token probability.
Reasons to Reject
While I appreciate the simplicity and ease of use of the method, the main idea of this technique has already been extensively explored in prior work for controlling style [1], attributes [3,5], tasks [2], and domains [1,4], as acknowledged in the related work section. The paper explicitly shows the benefit of this technique for refusals, but the specific technique lacks novelty. Though control tokens have been previously explored in different domains, a primary reason they haven't been used in practice is the challenge of choosing the right thresholds, balancing multiple control tokens, and maintaining consistency between the control token and the response. All of these limitations still exist when the technique is applied to the current domain as well.
Further, the experimental contributions and insights do not compensate sufficiently for studying a known technique in a new domain. The main paper primarily presents results from training one model and evaluating on CoCoNot. I acknowledge the results presented in the appendix on additional models and evaluations on XSTest, but these results are presented at a very high level without any additional insights or comparisons. The lack of experimental depth in the main paper makes the contributions incomplete. I recommend the authors present a more focused study of the benefits of refusal tokens, comparing their addition at various stages of model training and studying their generalization to more models and data sources.
- [1] CTRL: A Conditional Transformer Language Model for Controllable Generation. Keskar et al., 2019.
- [2] Defending Against Neural Fake News. Zellers et al., 2020.
- [3] SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF. Dong et al., 2023.
- [4] Metadata Conditioning Accelerates Language Model Pre-training. Gao et al., 2025.
- [5] Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. Rashkin et al., 2019.
Questions to Authors
- Results in the CoCoNot paper suggest that DPO with refusal data was more effective than SFT and LoRA. Results with the temporal data presented here contradict those findings. A deeper dive to understand the results, or even a comparison of refusal tokens against a DPO setting, would be beneficial.
- There are many interesting results in the appendix on additional models and eval datasets. I would suggest bringing these into the main paper to present a stronger study on refusal control tokens.
- The evaluation setup section mentions: "Furthermore, with llama-3.1 70B showing similar performance as GPT-3.5, we decided that an open-source model would be easier to reproduce as API models change and deprecate constantly." Prior work had verified the correlation of GPT-4 and GPT-3.5 with human evaluations and hence showed that these models are reliable judges. The shift to using llama-3.1 70B without any such study or verification is concerning. I would suggest using GPT-4/3.5 or running your own human evaluation to verify reliability.
- Line numbers are missing from the submission.
Thank you for your time and effort! We appreciate that you found our method simple and easy to adopt. Furthermore, I would like to highlight that Reviewer Kbim found our application very novel and our experiments well thought out.
Challenge of choosing right thresholds, balancing multiple control tokens, and consistency between control token and responses
This is a good point. We also found these challenges interesting, which is why we propose some form of solution for each and thereby make progress in these areas.
Choosing the Right Thresholds – In our case study using the category token (Table 2), we performed what we referred to as a “cheap sweep” over the category tokens to choose the ideal thresholds. This cheap sweep involves independently adjusting the thresholds over each of the categories to choose the final thresholds for each of the categories.
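As an illustration, a per-category sweep of this kind could look roughly like the following sketch (the data structures and the use of F1 as the selection criterion are our assumptions, not the authors' code):

```python
import numpy as np

def cheap_sweep(category_probs, labels, thresholds=np.linspace(0.0, 1.0, 21)):
    """Independently sweep a threshold for each category token and keep the best one.

    category_probs: dict mapping category -> np.array of P([category-refuse]) per example
    labels:         dict mapping category -> np.array of 1 (should refuse) / 0 (should respond)
    """
    chosen = {}
    for cat, probs in category_probs.items():
        best_t, best_f1 = 0.5, -1.0
        for t in thresholds:
            preds = (probs >= t).astype(int)
            tp = int(((preds == 1) & (labels[cat] == 1)).sum())
            fp = int(((preds == 1) & (labels[cat] == 0)).sum())
            fn = int(((preds == 0) & (labels[cat] == 1)).sum())
            f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        chosen[cat] = best_t  # each category's threshold is tuned independently
    return chosen
```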
Balancing Multiple Control Tokens – We introduced algorithms to address scenarios with multiple tokens. Figure 4 and Table 2 show how one of these algorithms can be used, while the second algorithm is presented in the appendix.
Consistency – This is an important point. As shown on Slide 7 of our Google Slides, when the ratio of contrast to refusal data is balanced (1:1), the error is significantly reduced—approaching zero—when comparing against the baseline model. This insight about contrast data is further explored in Section 6, which discusses the role of token and contrast data during training. Additionally, we added an experiment where we place the control codes in the user prompt, as proposed by Keskar et al. (2019) and as used for the think/no-think tokens in the latest Qwen3 models (Slide 3 of our Google Slides). We find that this placement reduces the consistency of the respond token: the TPR and FPR are both higher than with our placement of the token. This experiment highlights that our proposed placement improves consistency relative to the previously proposed and currently used alternative (i.e., Qwen3).
GPT-3.5/4
Unfortunately, our concerns in the paper were warranted: GPT-3.5 and GPT-4 are now deprecated, making them difficult to use as comparison points. We did have validity concerns about LLaMA-3.1 70B; nevertheless, we believe that using open models as judges better supports open science. Furthermore, as noted in the paper, we manually verified that the judge model correctly marked refusals and responses. The few errors we observed typically involved cases where the model qualified its answer, though such instances were rare. Overall, the model generally followed the CoCoNot rubric. For the temporal setting, we manually verified approximately 300 questions and found only a handful of incorrect labels.
DPO in CoCoNoT Paper
This is a great question—I had the exact same thought. However, after closely reviewing the CoCoNot data, I'm not sure how they support their claim. They apply LoRA on refusals, then DPO on contrast data. The table below shows the response rates reported in their paper for the CoCoNot evaluation. They claim this is an improvement, but applying DPO on contrast data actually increases the response rates for categories where the model should refuse (except for "Humanizing") and improves the response rate on contrast (cases the model should respond to) by only 0.2 points. I hesitate to call this a clear improvement based on the following data—let me know if you interpret it differently:
| Model | Humanizing | Incomplete | Indeterminate | Safety | Unsupported | Contrast |
|---|---|---|---|---|---|---|
| Cont. LoRA (Tulu2-7B merged) | 20.0 | 12.8 | 0.7 | 9.1 | 4.9 | 88.9 |
| + Contrast DPO | 17.3 | 15.5 | 3.5 | 12.3 | 9.9 | 89.1 |
Writing Suggestions
Thank you so much for your suggestions—they were very helpful! We're excited to incorporate them. Additionally, we will include the citations you mentioned that we have not currently included (i.e., [3]–[5]).
Please let us know if you have any additional questions. Otherwise, if you feel some of your questions have been addressed, we kindly ask you to consider raising your score accordingly. We sincerely appreciate the time and effort you’ve invested in your review.
Thank you for addressing my comments. Could you share more details on the human validation done for the LLaMA-3.1 70B judge model? Including the actual evaluation performed and how the model did, both here and in the paper, would go a long way toward validating all the experiments in the paper.
Thank you for your engagement!
The labels were produced by the authors (and thus did not require an IRB), following the rubric provided by CoCoNot and the system prompt used for the temporal evaluation. Most cases were clear-cut; however, for ambiguous cases, limited to the CoCoNot evaluation, a discussion was held to interpret the rubric. After these discussions, we found that the rubric was generally quite clear.
I would also like to highlight that in AlpacaEval's human agreement, LLaMA-3 70B is included and achieves only 1.7% lower agreement than GPT-4.
Following the rubric provided by CoCoNot for the CoCoNot evaluation and the system prompt for the temporal evaluation, we labeled 150 randomly sampled examples from the LLaMA-3 + UltraChat baseline. We found that the judge achieved approximately 91% agreement on CoCoNot and approximately 95% agreement on the temporal evaluation. Additionally, of the 13 incorrect annotations on CoCoNot, all but one involved a qualified answer marked as a response where the rubric indicates the label should be a refusal.
Thank you for providing the information. Please include these details in the paper. I'm happy with the response and hence increasing the score.
This paper presents a novel method for training LMs for refusal that enables controlling refusal rates at inference time: the models are instruct-tuned with special tokens like "[refuse]" and "[respond]" prepended to target responses depending on whether the response is a refusal or not, and at test time, the probability of generating these tokens is controlled, thereby calibrating the refusal rates of the LM. In addition, the proposal also includes introducing different refusal tokens for fine-grained refusal-type control.
Experiments on models trained with supervised finetuning (SFT) on a combination of general instruction-tuning and a refusal-specific dataset (CoCoNot) show that training with refusal tokens enables test-time control both in a setting where there is a single refusal token, and also in a setting where there are multiple refusal category-specific tokens. Moreover, it is shown that in the fine-grained refusal setting controlling one type of category generally does not affect the other refusal types.
Strengths: The paper proposes a simple and elegant idea for enabling test-time control of refusal rates. The results show that training models with this simple modification does provide some control of test-time behavior.
Weaknesses:
W1: While it is clear that refusal tokens enable some test-time control of refusal behavior, the extent to which the behavior can be controlled using the threshold is not entirely clear. Concretely, if the user sets the threshold for refusal to, say 0.2, is the actual refusal rate expected to be 20%? This calibration error can be easily measured, and is necessary to understand the effectiveness of the proposed method. If I understand the results correctly, Figure 3 implies that this error is high in the multi-category setting. Even after completely suppressing refusal tokens of specific types, the refusal rates in those categories are fairly high, e.g., the refusal rate in the incomplete category seems to be around 50% when refusal tokens of that category are completely suppressed. Assuming this interpretation is correct, I recommend:
- Explicitly measuring this calibration error both in the single and multi-category settings.
- If the error is high, discussing how it can be reduced, e.g.: training on more data? methods other than SFT?
W2: The presentation of various details can be improved:
- The details of the category-thresholding scheme are unclear. The description in Section 5.1 and Algorithm 1 say that the argmax token is generated if the highest probability token is not from the subset of tokens to consider, but this means that the token actually being generated can be outside the subset of tokens being considered. Moreover, it seems like only category thresholding is used for the actual experiments. It is unclear why the other kind is presented.
- The point of the experiments described under "Increasing F1 scores via category refusal tokens" in Section 5.1 is not clear. What is the F1 score being measured in Figure 4? Is it for only refusal, or refusal + contrast sets? Is it in just one refusal category? In Table 2, a comparison between thresholding and logit bias is shown, but such a comparison seems unrelated to the main point of the paper. Relatedly, since the purpose of the logit bias is the same as thresholding, it can be removed to make the paper easier to read and understand.
Reasons to Accept
Simple and elegant method to enable test-time control of refusal rates.
Reasons to Reject
Important experiments to quantify the extent of controllability are missing, and the writing of the paper can be improved.
Point 2: Writing
Thank you for the writing suggestions—we’re excited to improve the clarity of the paper. This is a great help!
Your understanding of the algorithm is correct. The mechanism changes the relationship between the respond token and the category token without interfering with queries from other categories. This approach assumes that the category token with the highest probability reflects the classification category of the query. One reason to do this is that the refusal messages differ between categories. If you do not care about the refusal message text, then altering the algorithm to emit a token from the subset whenever any one token passes its threshold is sufficient. There are many ways of utilizing the refusal tokens when setting the thresholds; we offered two different algorithms for this.
You're also correct that the sum-thresholding scheme can be moved to the appendix. We initially included it to (1) highlight the flexibility of the token mechanism and (2) show that category refusal tokens can be used similarly to a single token. Similarly, the logit bias approach can be moved to the appendix as well, though we note that in practice, logit bias is easier to apply in API-based services.
Regarding Figure 4: it originates from our temporal experimental setting (described in Sec. 4) and is meant to highlight the case where the ratio of refusal messages to contrast data is better balanced (i.e., 1:1 instead of 10:1 as in CoCoNot). Here, the error associated with the refusal token—measured as the difference between the baseline model and the model trained with both refusals and contrast data—is close to zero. The figure reports the F1 score with temporal refusals as the refusal class, and with the CoCoNot refusal categories (excluding the temporal subcategory) and TriviaQA as the contrast class. We'll revise the figure caption to make this clearer. Thank you for the opportunity to clarify this.
Please let us know if you have further questions. Otherwise, if you feel your points have been addressed, we kindly ask you to consider raising your score. We sincerely appreciate the time and insight you’ve contributed to this review.
Algorithm 1
I understand that you do not want suppression of tokens in one category to interfere with the probabilities of tokens in the other categories. But the logic of what happens in the else block is not clear to me.
I see that the algorithm considers two constraints: 1) a threshold for refusal: emit a refusal token only if its probability is greater than the threshold; 2) a set to consider: only emit refusal tokens within the set. I see that the if block represents the case where both the constraints are satisfied.
The part that is unclear is what happens in the case where either or both constraints are not satisfied (the else block), which seems to correspond to outputting a token while disregarding both constraints. This does not seem like the behavior you would expect. You instead want to output the most probable refusal token whose probability is above the threshold, and if such a token does not exist, output the "respond" token. Can you confirm if my understanding is correct?
Also, I think the notation in Algorithm 1 needs to be fixed.
- The set of tokens to consider is not actually defined. I assume it is a set containing only the selected category's refusal token.
- The output variable is supposed to be a refusal token, but in the final line (in the else block), it refers to either a refusal or a respond token.
You are correct about the notation. Thank you so much for pointing that out to us! On the following point,
This does not seem like the behavior you would expect. You instead want to output the most probable refusal token whose probability is above the threshold, and if such a token does not exist, output the "respond" token.
We believe the behavior you're referring to can be achieved either by adding all category refusal tokens to the considered subset, or by using prioritized thresholding: assigning a threshold to each category token and evaluating them sequentially. However, in the current version of the else statement, the goal is to revert to the model's original behavior. The algorithm is intended to emit a refusal with a category token when a token from the selected category set has the highest probability among all refusal tokens and exceeds the threshold. Otherwise, the model outputs the argmax among the category tokens and the respond token.
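For concreteness, a minimal sketch of the decision rule described above might look as follows (variable names are ours and do not follow the paper's Algorithm 1 notation):

```python
def first_response_token(probs, category_ids, respond_id, selected_subset, threshold):
    """Category thresholding as described above (a sketch, not the paper's exact Algorithm 1).

    probs:           dict token_id -> probability at the first response position
    category_ids:    list of all category refusal token ids
    respond_id:      the [respond] token id
    selected_subset: set of category token ids the user wants to threshold
    """
    # Highest-probability refusal token across all categories.
    top_refusal = max(category_ids, key=lambda tok: probs[tok])
    if top_refusal in selected_subset and probs[top_refusal] > threshold:
        return top_refusal  # refuse, using that category's token
    # Else branch: revert to the model's original behavior, i.e., the argmax
    # over the category tokens and the respond token.
    return max(category_ids + [respond_id], key=lambda tok: probs[tok])
```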
We will add clarifying text next to the relevant lines in the algorithm and carefully review the notation. We truly appreciate your feedback in helping us improve the clarity. Please let us know if this clarified your understanding.
We appreciate your thoughtful review and are glad you found our idea elegant.
Point 1: Calibration error in both single- and multi-category settings
This is a great question.
This model was trained with UltraChat as the baseline, which has an inherent refusal rate. Since we tagged all the UltraChat data with respond tokens, the model may still refuse even when the respond token is present. Thus, we calculate the error as the difference between the refusal rate of UltraChat and the refusal rate when the token is suppressed.
In the table below:
- The top row corresponds to suppression of individual category tokens (e.g., if “Humanizing” is suppressed, we report the error only for humanizing).
- The second row reflects suppression of the single refusal token.
- The third row is a model trained without any refusal/respond tokens.
- The last row is the base UltraChat model with no additional refusal messages or contrast data.
We will update Figure 4 in the paper to include these baseline numbers. In the meantime, this data is available on Slide 5 of the Google Slides.
| Suppression Type | Humanizing | Incomplete | Indeterminate | Safety | Unsupported | Average |
|---|---|---|---|---|---|---|
| Cat Token Suppressed | 0.189 | 0.193 | 0.280 | 0.178 | 0.189 | 0.2058 |
| Single Token | 0.185 | 0.205 | 0.322 | 0.178 | 0.185 | 0.215 |
| UltraChat + CoCoNot | 0.489 | 0.417 | 0.440 | 0.471 | 0.489 | 0.4612 |
| UltraChat | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.0000 |
Improving Calibration
Improving calibration requires addressing two aspects: (1) Calibration of the refusal token itself and (2) Consistency between the token and the generated response.
For (1), we propose using softmax temperature scaling on the refusal/respond tokens. As shown on Slide 4 of the Google Slides, applying temperature scaling results in a more linear relationship between increasing the threshold and the probability of emitting the refusal token.
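A minimal sketch of this kind of scaling, assuming the temperature is applied only to the refuse/respond pair (the authors may instead scale the full distribution):

```python
import torch

def scaled_refusal_probability(next_token_logits, refuse_id, respond_id, temperature=2.0):
    """Temperature-scale the refuse/respond logits before thresholding (a sketch).

    A temperature > 1 flattens the two-way distribution, so sweeping the threshold
    changes the emission rate of the refusal token more linearly.
    """
    pair = torch.stack([next_token_logits[refuse_id], next_token_logits[respond_id]]) / temperature
    return torch.softmax(pair, dim=-1)[0].item()  # P([refuse]) after scaling
```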
For (2), we found that adding contrast data is essential for aligning the refusal/respond tokens with the generated text. CoCoNot contains a 10:1 ratio of refusal messages to contrast/borderline examples, which is suboptimal. One of the key motivations behind exploring our temporal setting was to evaluate a 1:1 ratio of refusal to contrast examples. As seen on Slide 6 of the Google Slides, we found that when the respond token is present under this balanced ratio, the model returns to the baseline refusal rate. This suggests that the refusal/respond tokens behave more reliably when trained with balanced contrastive data.
Thanks for sharing these results. They show that suppressing the refusal tokens reduces the refusal rate, but this was clear from the original draft of the paper as well. These results do not show the calibration error in the proposed method. You could concretely measure the calibration error, e.g.: vary the threshold from 0.0 to 1.0 in intervals of 0.1, and in each setting, compute the absolute difference between the threshold and the refusal rate, and compute an average of the absolute differences over all the settings. This is essentially Expected Calibration Error (ECE), without any binning because you don't need that. This value would tell us how much you can actually control the refusal behavior of the model using your approach.
we tagged all the UltraChat data with respond tokens, the model may still refuse even when the respond token is present
Can you elaborate why you chose to do this? Since you already have access to a refusal classifier (used for measuring refusal rates), would it not be cleaner to tag only non-refusals with "respond" tokens, and refusals with "refuse"? This might explain why you see suppression being more effective when you mix in CoCoNot, where the refusal labels are cleaner.
cleaner to tag only non-refusals with "respond" tokens, and refusals with "refuse"
This is a great question! In the CoCoNot paper, they found that even after removing the refusal messages from the Tulu mix, the model would still refuse at fairly high rates, around 65% averaged over the five categories (not including the contrast holdout category). Additionally, in the construction of UltraChat, they did attempt to filter out refusal messages. Thus, we found that having some baseline refusal rate is inevitable. Furthermore, in scenarios where refusal-message splits are added to some base dataset like Tulu, the tag can be easily applied without extensive filtering over potentially millions of examples. This scenario highlights that the refusal token is still effective at moving the refusal rate in the presence of noise. Surprisingly, even Alpaca had a native refusal rate despite containing little to no refusal messages. We believe this behavior is additionally influenced by what is found in the pretraining data, and OLMoTrace might be useful for analyzing this. We believe this is an incredibly interesting question that should be studied extensively, but it is beyond the scope of this paper.
ECE
Below is the requested table for the single-token setting. We report the suggested metric for the refusal token being generated ("Token ECE"), both with and without applying a softmax temperature of 2 (note: this temperature was not extensively tuned). We call ECE the quantity |refusal rate − (1 − threshold)|, computed using both the raw refusal rate and an adjusted refusal rate. These metrics are presented for both the original model and the softmax-temperature-scaled variant.
| Threshold | Original Token ECE | Original ECE | Original Adjusted ECE | Soft Temp 2 Token ECE | Soft Temp 2 ECE | Soft Temp 2 Adjusted ECE |
|---|---|---|---|---|---|---|
| 0 | 0.0000 | 0.1196 | 0.1854 | 0.0000 | 0.1196 | 0.1840 |
| 0.1 | 0.1326 | 0.1421 | 0.2736 | 0.0783 | 0.1284 | 0.2513 |
| 0.2 | 0.0536 | 0.0474 | 0.1817 | 0.0145 | 0.0693 | 0.2144 |
| 0.3 | 0.0326 | 0.0343 | 0.1098 | 0.0435 | 0.0277 | 0.1188 |
| 0.4 | 0.1217 | 0.1323 | 0.0128 | 0.1007 | 0.1019 | 0.0587 |
| 0.5 | 0.1942 | 0.2109 | 0.0544 | 0.1594 | 0.1846 | 0.0148 |
| 0.6 | 0.2819 | 0.3037 | 0.1433 | 0.1964 | 0.2574 | 0.0730 |
| 0.7 | 0.3688 | 0.3976 | 0.2338 | 0.2268 | 0.3210 | 0.1169 |
| 0.8 | 0.4514 | 0.4928 | 0.3264 | 0.2217 | 0.3838 | 0.1596 |
| 0.9 | 0.5232 | 0.5693 | 0.3905 | 0.1739 | 0.4527 | 0.2118 |
| 1.0 | 0.0000 | 0.5282 | 0.2736 | 0.0000 | 0.5282 | 0.2741 |
| Average | 0.1964 | 0.2708 | 0.1987 | 0.1105 | 0.2341 | 0.1525 |
From this table, we observe that both the token-level and response-level ECEs are higher near the extreme thresholds. This occurs because the refusal rate never reaches 0% or 100%, limiting calibration quality at the endpoints. If we instead compute the adjusted refusal rate using the model's empirical maximum and minimum refusal rates—i.e., using the refusal rate at threshold 0 as the effective upper bound and the rate at threshold 1 as the lower bound—we obtain improved calibration scores: 0.08 with softmax temperature scaling and 0.13 without.
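For reference, a small sketch of how this ECE-style metric could be computed, assuming the target refusal rate at threshold t is (1 − t) and using the endpoint adjustment described above (both are our reading of the discussion, not the authors' exact code):

```python
import numpy as np

def refusal_token_ece(refusal_rates, thresholds=None, rescale=False):
    """ECE-style error for refusal-token thresholding (a sketch of the discussion above).

    refusal_rates: observed refusal rate at each threshold, ordered from threshold 0 to 1.
    Because lowering the threshold raises the refusal rate, the target at threshold t is (1 - t).
    """
    rates = np.asarray(refusal_rates, dtype=float)
    t = np.linspace(0.0, 1.0, len(rates)) if thresholds is None else np.asarray(thresholds, dtype=float)
    if rescale:
        # Endpoint adjustment: treat the rate at threshold 0 as the effective maximum
        # and the rate at threshold 1 as the effective minimum.
        rates = (rates - rates[-1]) / (rates[0] - rates[-1])
    return float(np.mean(np.abs(rates - (1.0 - t))))
```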
Thank you for your engagement in the discussion!
This scenario highlights that the refusal token is still effective at moving the refusal rate in the presence of noise.
Thanks for providing more details here. While it is clear that your proposed method seems to change refusal behavior even in noisy setups, I think it would be informative to see how effective it would be when the noise is eliminated as much as possible. As it stands now, from the ECE results you shared, it seems like the calibration error of your method is high. I wonder if this is because of noisy training data, or some deeper difficulty in controlling the refusal behavior of LMs using refusal tokens (and that would be a direct assessment of your approach).
ECE results
Thanks also for providing these. Since your definition of ECE is essentially (1/N) Σ_t |r(t) − (1 − t)| (where N is the number of threshold values and r(t) is the refusal rate at threshold t), a higher value of this means better calibration, correct? Based on this understanding, I see that the models are not very well calibrated. This is an informative result. Though, as I wrote above, it would be helpful to know how much of this is due to noisy data, and how much is due to the limitation of the method.
I recommend you include the ECE numbers in the paper, and if you do, it would be better to instead use (1/N) Σ_t |r(t) − t| as the definition, to actually make this an error rate, and more similar to ECE in the traditional sense.
ECE results
Sorry about the confusion.
On the ECE results, lower is better. For example, if the threshold is 0.1 and the probability of the refusal token is 0.2, then the refusal token will be emitted. The lower the threshold, the higher the refusal rate.
So, although the calibration is not perfect, decreasing the threshold monotonically increases your refusal rate. We can provide these in a table if requested.
Please let us know if this clarifies your understanding.
Ah, I am sorry, I missed the inverse relationship between the threshold and the refusal rate. Thanks for the clarification. Your definition of ECE does indeed seem correct.
In light of this clarification, we want to again highlight that, in the case where we remove the error inherited from the data (i.e., using the refusal rate at threshold 0 as the effective upper bound and the rate at threshold 1 as the lower bound), we obtain an ECE of 0.08 with softmax temperature scaling, which we believe is a respectable error.
May we kindly ask you to reconsider raising your score? Thank you so much for your engagement and time!
I have increased my score. Thanks! I recommend including the ECE results in the paper.
Authors propose a new method for test-time control of refusal/response rates by prepending respond or refuse tokens to existing refusal datasets. This method allows the user to set an inference-time threshold and roughly control the sensitivity of the model's refusal, without expensive retraining of the model. The idea is simple but intriguing. The authors then proceed to experiment with more fine-grained control, creating separate tokens for each type of (i.e., reason for) refusal. This hypothetically allows the user to disable specific types of refusals at inference time, a very powerful concept.
Reasons to Accept
The paper is well-written, the experiments are well thought out, and the basic concept is simple but quite powerful.
The experiments with separate types of refusal tokens are also very interesting, which is a capability I don't believe I have seen before in the literature. This is very novel.
The insights about the benefits of contrast data are also interesting.
The technique is straightforward and should be easily reproducible and widely applicable.
All in all, this is a very novel paper with strong practical implications. The authors did great work. Although this is effectively an instantiation of existing control code work, the application is very novel.
Reasons to Reject
One relatively major concern with the paper is the lack of any experimental baselines. I would've liked to see a comparison with at least one other method, e.g., modulating the system prompt. There exist other test-time interventions to modulate the refusal rate; they are just probably bad (e.g., explicitly requesting in the prompt that the model refuse X% of the time). It would have greatly strengthened the authors' argument if at least one other method was presented alongside their results.
Unfortunately this is a major gap. If authors are able to present at least one properly-set-up baseline during the rebuttal period, my score can easily go up several points (I could see 8-9). I see genuine merit in the work, but lack of baselines significantly damages the claims.
Questions to Authors
Some analysis of the calibration situation would have been interesting. In other words, if as a user I set a token probability threshold (for the refusal token) of 0.5, how closely does that correspond to an actual refusal rate of 0.5? How reliable is the correspondence? More analysis on this would have been good.
Some ablations on the specific choice of refusal token and if that matters also.
Refusal rates don't drop to zero (or even near zero) even if you suppress the refusal token, meaning that although the refusal tokens provide a lever to use, it is a relatively imperfect one.
We appreciate that you found our findings and application interesting.
Adding System Prompt Baseline
This is a great idea!
We tried three different system prompts. We first wrote one that we thought closely aligns with what the refusal token aims to achieve (i.e., only responding when confident that the response is appropriate):
“Before responding, determine whether the query warrants a response. Only proceed if you are at least ${per}% confident that a response is appropriate. If not, refuse to answer. If you do respond, ensure your answer is faithful to the original prompt.”
From this base prompt, we described the task to GPT-4o and asked it to generate two additional versions: one more detailed and another refined version of similar length. Note: if there is a specific prompt you would like us to test, we’d be happy to do so.
As we increased the confidence level from 0 to 100 in increments of 10, we observed that the overall refusal rate remained largely unchanged. You can find the graph on Slide 2 of the Google Slides.
Additionally, we experimented with placing CTRL Code tokens at the end of the prompt, closer to how Keskar et al., 2019 (i.e., user-controlled) employed such tokens, similar to the no think/think tokens used in Qwen3. However, we found that this setup reduced our ability to achieve fine-grained control over refusal rates. Specifically, placing the [respond] CTRL Code at the end of the prompt appears to diminish its effectiveness, as shown on Slide 3 of the Google Slides.
Analysis of the Calibration Situation
This is a great suggestion! We added an analysis chart on Slide 4 of the Google Slides, which shows that softmax temperature scaling helps the refusal token be emitted more consistently with changes in the threshold.
Choice of the Refusal Token
We added the refusal token as a new entry in the embedding matrix. Therefore, the specific string used does not matter, in contrast to reusing an existing token from the vocabulary. We believe introducing a new token into the embedding matrix is the more universal approach.
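A minimal sketch of what adding such tokens as new embedding entries typically looks like with the Hugging Face transformers API (the base model name and token strings are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model; the token strings below are also illustrative.
base = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Register [refuse]/[respond] as brand-new special tokens, i.e., fresh rows in the
# embedding matrix, rather than overloading strings that already exist in the vocabulary.
tokenizer.add_special_tokens({"additional_special_tokens": ["[refuse]", "[respond]"]})
model.resize_token_embeddings(len(tokenizer))
```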
Refusal Rates Don't Drop to Zero
The models were trained with UltraChat as the baseline, which has an inherent refusal rate. Since we tagged all UltraChat examples with respond tokens, the model may still refuse even when the [respond] token is present. When calculating the difference between the refusal rate of the UltraChat model and that of the token-suppressed version, we find a ~20% refusal-rate gap on queries the model should refuse: token suppression takes the refusal rate from 80% to 60%, while 40% represents UltraChat's baseline refusal rate on these questions. To alleviate this, more contrast data is needed. For example, when the ratio of refusal messages to contrast data is one to one (as in our temporal setting), we find that this error is close to zero (see Slide 7).
Please let us know if you have any additional questions. Otherwise, if you feel your questions have been addressed, we kindly ask you to consider raising your score. We sincerely appreciate the time and insight you’ve put into your review.
I am satisfied, I thank the authors for their engagement and additional work. Raising the score.
The paper introduces refusal tokens that aim to control the refusal rate of language models during inference without finetuning the model again. The refusal token, which can be specific to a type of refusal (e.g., safety), is prepended to the model's response during training. It is also possible to use all the types of refusal tokens together. The paper shows the effectiveness of the approach via a detailed analysis, experimenting on the Llama 3.1 and Mistral models, and using CoCoNot and TriviaQA as evaluation datasets. It also shows the usefulness of introducing contrast or borderline examples in the training data.
Reasons to Accept
- The problem addressed is central in current language models research and very important for their practical use - the ability of language models to refuse answering certain questions or to follow certain instructions.
- The background is clearly explained and the related work section is very comprehensive.
- The paper presents extensive experiments and multiple ablations are performed.
Reasons to Reject
I don't see a reason to reject the paper. However, I do think the presentation of some of the sections should be improved (see comments below).
Questions to Authors
Presentation Comments:
I think the clarity of Sections 5 and 6 should be improved by adding structure and better connecting them to the previous sections of the paper. Specifically, I think Section 5 can be improved by structuring the different results via subsections/paragraphs with short titles. Section 6 is titled "Discussion," but to my understanding it introduces new experiments, so perhaps "Additional Experiments" would be more suitable. I also think a conclusion section for the paper is needed.
Citations:
- The paper for the TriviaQA dataset mentioned and used in the work should be cited: Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
- It will be useful to cite the survey: Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. The art of refusal: A survey of abstention in large language models. arXiv preprint arXiv:2407.18418, 2024.
Minor: page 3, section 2, 2nd paragraph: extra full stop.
Thank you so much for finding our work important and practical, and for the suggested writing changes! We are excited to improve the structure of Sections 5 and 6. Furthermore, we will include the citations you have mentioned. We appreciate your suggestions.
Please let us know if you have any additional questions. We sincerely appreciate the time you’ve put into your review.
Thank you very much for your response and for the willingness to apply the suggested changes.
This paper presents refusal tokens, a test-time control mechanism that allows LLMs to flexibly refuse certain categories of user queries that are unsafe, unanswerable, or ill-posed. The proposed mechanism uses meta-tokens like [refuse] or category-specific tokens (ie [humanizing-refuse], [unsupported-refuse]) prepended during SFT. At inference, the softmax probability of these tokens enables test-time calibration of refusal sensitivity via thresholding or logit bias. The authors evaluate the method using the CoCoNot dataset and a temporally constructed refusal dataset, demonstrating improvements in refusal F1 scores, controllability of specific refusal categories, and the utility of contrastive examples during training. The method offers advantages over prior techniques such as tagging, system prompts, and activation steering, with minimal inference-time overhead and no need for retraining.
The reviewers generally agree that the paper is well-motivated, practically impactful, and methodologically sound, specifically citing the simplicity of the method, its ease of integration, and its potential for test-time controllability.
Strengths (also pointed out by the reviewers):
- The method is easy to implement, architecture-agnostic, and adds practical utility.
- Fine-grained control via category-specific tokens is seen as a novel and useful extension.
- The authors provide a thorough empirical analysis across multiple models and refusal settings, including calibration, consistency, and contrastive training.
Concerns, feedback, and discussion: The author responses were comprehensive, with additional experiments.
- In response to Reviewer cYf6’s concern about the controllability and calibration of refusal rates, the authors computed and shared a full ECE table, varying the refusal token threshold from 0.0 to 1.0 and showing calibration both with and without softmax temperature scaling.
- Reviewer gWX3 raised concerns about relying on LLaMA-3 as a judge without validation. The authors responded with manual validation of 150 examples using the CoCoNot rubric, showing 91–95% agreement and noting that nearly all disagreement involved nuanced qualification cases.
- Reviewer Kbim requested comparisons to other refusal control baselines. The authors tested three system prompts, varying refusal confidence levels and showing that refusal rates remained mostly unchanged regardless of prompt tuning.