PaperHub
Overall score: 5.7 / 10 (Poster, 3 reviewers; min 4, max 7, std 1.2)
Reviewer ratings: 7, 4, 6
Average confidence: 3.0
COLM 2025

Out-of-Distribution Detection using Synthetic Data Generation

OpenReview · PDF
Submitted: 2025-03-18 · Updated: 2025-08-26
TL;DR

This work presents an effective OOD detection method using LLM-generated synthetic proxies, eliminating the need for external OOD data. Experiments show it reduces false positives and outperforms baseline methods in text classification tasks.

Abstract

Keywords

out-of-distribution detection, out-of-distribution generalization, synthetic data

Reviews and Discussion

Official Review
Rating: 7

This paper presents an alternative for out-of-distribution (OOD) detection. Instead of relying on real OOD data or unlabeled wild data, the authors leverage an LLM to generate synthetic OOD data: a two-stage method for far-OOD data and a single-stage method for near-OOD data. The authors use some classical NLP tasks (toxicity detection and sentiment analysis) as well as classification tasks in LLM training (RLHF) to verify the effectiveness of the synthetic OOD data.

Reasons to Accept

To be honest, I am not familiar with this area, so I can only provide feedback based on my limited knowledge.

  • The paper is well-written and easy to follow. I feel I indeed learned something about OOD detection from the paper.
  • The paper is novel: using an LLM to generate synthetic OOD data, which has not been well explored in prior work.
  • The experiments are extensive, and the results are quite promising.
  • The analyses are also quite interesting and nice.

Reasons to Reject

Weakness:

I don't see any obvious weakness -- but it may be due to my limited knowledge in this area.

Questions for the Authors

Theoretically, you can directly prompt the model to generate OOD examples even for the far-OOD, instead of applying the two-stage approach. What is the reason for not doing this, and what would be the actual drawbacks? I think some experimental results would be nice to support this approach.

Line 226-227 "we believe is due to the significant semantic similarity between the InD and OOD data, making the task especially challenging." Would there be any further experimental results that can qualitatively show this is the case?

Comment

We would like to express our gratitude for your time and effort in reviewing our paper and providing a favorable recommendation. We appreciate your recognition of the novelty and clarity of our work, as well as your acknowledgment of the extensive experiments and promising results. We're glad to hear that the paper has provided useful insights into OOD detection.

1. Theoretically, you can directly prompt the model to generate OOD examples even for the far-OOD, instead of applying the two-stage approach.

While it is possible to directly prompt the model for far-OOD examples, we chose the two-stage approach to promote greater diversity in the generated samples. In our preliminary experiments, we did try the one-stage direct prompting method, but it produced less diverse outputs. By first generating a few seed samples and using them as in-context examples in the second stage, we guide the model to explore a wider range of plausible OOD variations (as confirmed later in Figure 4), rather than producing the overly similar or constrained outputs that a direct prompt tends to yield. We will include these preliminary findings and the comparison with the direct prompting method in the final draft.

2. Line 226-227 "we believe is due to the significant semantic similarity between the InD and OOD data, making the task especially challenging."

To elaborate on the point made in lines 226-227, the semantic similarity between InD and these OOD datasets stems from the fact that both BT (SEAC) and BT (DAWBS) are OOD datasets from the same BeaverTails dataset that contains the InD dataset "Non-Violent Unethical Behavior" (for details, please see Appendix B.2). BT (SEAC) corresponds to the "Sexually Explicit, Adult Content" category, and BT (DAWBS) corresponds to the "Drug Abuse, Weapons, Banned Substance" category. Although these OOD categories are different, they share substantial semantic similarity with the InD dataset (Non-Violent Unethical Behavior), as all these categories involve content moderation issues and deal with behaviors considered unethical or inappropriate.

This overlap in thematic content makes it particularly challenging for the model to distinguish between OOD and InD samples, as the content in these OOD categories is not drastically different in nature from the InD examples. Despite the difficulty of these near-OOD datasets, our synthetic model is the only approach that performs close to the ideal model (Table 1).

Official Review
Rating: 4

This paper presents an approach to generate synthetic OOD data to train OOD detectors (classifiers). The crux of the approach is to 'carefully' build prompts that induce useful OOD data via LLM prompting. The approach is evaluated on the following classification tasks where OOD data is scarce: 1) toxicity detection, 2) harm detection, 3) RLHF reward modeling. It is also evaluated in a downstream setting with selective classification using OOD predictions. The authors propose a classifier which is trained on the iid data plus the synthetic data. Comparisons are made against OOD approaches in the machine learning literature (mostly image classification).

The paper addresses an interesting topic of OOD in text processing tasks. However, the underlying process of OOD generation proposed is not clear. A motivating point, such as a comparison with other (non-LLM) synthetic OOD generation methods, is missing. Using an LLM to generate synthetic data is not novel, so the why and how of the approach should be made clear to make the paper strong.

Reasons to Accept

  • The paper tackles the difficult and less researched topic of OOD in text processing tasks.

Reasons to Reject

  • The synthetic data generation method is not clearly explained. Besides the illustrated workflow, there is only a reference to the prompts in the Appendix. So, we do not know the intuition and the way of inducing the diverse OOD generation paths. How do we know that the OOD data is diverse and relevant? (In the intro this is mentioned as a limitation of existing approaches.)

  • There is no baseline where the classifier is trained on OOD data generated in another, cheaper way, e.g., by corrupting existing iid pairs.

  • Although the authors carried out a downstream evaluation with selective answering, this was done on a classifier trained by the authors. Experiments with more powerful models, maybe zero-shot, and existing toxicity detection models would make the paper stronger.

Questions for the Authors

  • Table 1 is too small. Maybe the most important numbers (metrics) can be included in the main part of the paper.

  • The far-OOD task seems rather trivial.

  • Why is ReAct not included in the experiments of Section 4.2.3 and Figure 3?

  • Why is the OOD proportion on the top axis of Figure 3 different for different approaches? The explanation of 'Risk' is not clear. Why not use accuracy at a given percentage?

Details of Ethics Concerns

I'm not sure if the paper requires an ethics review due to the content of the datasets used.

Comment

6. Why is ReAct not included in the experiments of Section 4.2.3 and Figure 3?

Thank you for the comment. In our preliminary experiments, we found ReAct to perform similarly to the other scoring baselines (e.g., MSP, Energy, DICE). We chose to exclude ReAct from the experiments in Section 4.2.3 and Figure 3 because it did not show significant improvements over the other baselines in our main results. Furthermore, to keep the presentation clear and focused, we decided not to add it, as the upper axis in Figure 3 was already becoming overcrowded. We will include it in the final draft to provide a complete comparison.

7. Why is the OOD proportion on the top axis of Figure 3 different for different approaches? The explanation of 'Risk' is not clear. Why not use accuracy at a given percentage?

The top axis in Figure 3 shows the remaining proportion of OOD data after each method has removed uncertain samples. Different methods remove OOD samples at different rates, which is why the proportions vary. The goal of selective classification (Geifman and El-Yaniv, 2017) is to abstain from making predictions when the model is uncertain, which is achieved by removing uncertain samples at each coverage level. Ideally, these uncertain samples should be OOD, so the more OOD samples removed, the better. As shown in the figure, our method removes the majority of OOD samples at a given coverage level, compared to the baselines.

Regarding the use of "Risk", this is a standard metric in the selective classification literature (Geifman and El-Yaniv, 2017) and not a choice we explicitly made. Risk is essentially the complement of accuracy (1 - accuracy), which is commonly used to quantify model uncertainty and performance in this context.
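
As a concrete illustration (a sketch, not our actual evaluation code), the risk-coverage curve and the retained OOD proportion on the top axis of Figure 3 can be computed from per-example uncertainty scores roughly as follows:

```python
import numpy as np

def risk_coverage_curve(scores, correct, coverages=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Selective-classification risk (1 - accuracy) at each coverage level.

    scores:  per-example uncertainty scores, higher = more uncertain (abstained on first)
    correct: 1 if the underlying classifier's prediction is right, 0 otherwise
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(scores)            # most confident predictions first
    correct_sorted = correct[order]
    n = len(scores)
    curve = []
    for c in coverages:
        k = max(1, int(round(c * n)))     # keep the k most confident predictions
        curve.append((c, 1.0 - correct_sorted[:k].mean()))
    return curve

def ood_proportion_retained(scores, is_ood, coverage):
    """Fraction of retained test examples that are OOD at a given coverage
    (the quantity on the top axis of Figure 3). It differs across methods
    because each method's scores rank, and therefore remove, OOD samples
    at different rates."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    order = np.argsort(scores)
    k = max(1, int(round(coverage * len(scores))))
    return float(is_ood[order][:k].mean())
```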

Comment

Thank you for reviewing our paper. While our responses may be detailed, we have made every effort to be concise while ensuring clarity and comprehensiveness. We look forward to further discussion.

1. The underlying process of OOD generation proposed is not clear. The synthetic data generation method is not clearly explained. Besides the illustrated workflow, there is only a reference to the prompts in the Appendix. So, we do not know the intuition and the way of inducing the diverse OOD generation paths.

We thank the reviewer for raising this important point. We would like to clarify that the underlying OOD generation process is described in Section 3.1 of the paper. As noted, Figure 1 serves only as a high-level illustration, while the specific prompt designs used for generating OOD data are comprehensively documented in Tables 10–14 of the Appendix.

To address the reviewer's main concern regarding how diverse OOD generation paths are induced, we emphasize the following:

We leverage in-context learning to induce diversity, with different strategies for near-OOD and far-OOD data:

For near-OOD, since it originates from the same domain as the InD data, we use InD examples as in-context demonstrations within prompts. For far-OOD, which comes from a different domain, we employ a two-stage prompting strategy. Initially, we prompt the LLM to generate a few seed samples, which are then used as in-context examples in the second stage to guide further generation. This process allows the LLM to generalize beyond the narrow decision boundary of the InD distribution and produce a diverse set of plausible yet OOD samples.
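
To make this concrete, below is a minimal sketch of the two prompting strategies. This is an illustration under our own simplifying assumptions, not the exact prompts (those are given in Tables 10-14 of the Appendix); `generate` is a hypothetical wrapper around an LLM sampling API, and the prompt wording is illustrative.

```python
import random

def generate(prompt: str, n: int) -> list[str]:
    """Hypothetical wrapper around an LLM sampling API; returns n sampled completions."""
    raise NotImplementedError  # e.g., call a Llama-style model with temperature sampling

def near_ood_samples(ind_examples: list[str], n: int) -> list[str]:
    # Near-OOD (same domain as InD): single stage, with InD texts used
    # directly as in-context demonstrations in the prompt.
    demos = "\n".join(f"Example: {x}" for x in random.sample(ind_examples, 5))
    prompt = (
        "Here are examples from the in-distribution dataset:\n"
        f"{demos}\n"
        "Write new texts from the same domain that do NOT belong to the "
        "categories covered by the examples above.\n"
    )
    return generate(prompt, n)

def far_ood_samples(ind_description: str, n: int, n_seeds: int = 5) -> list[str]:
    # Far-OOD (different domain), stage 1: prompt the LLM for a few seed samples.
    seed_prompt = (
        f"The in-distribution data consists of {ind_description}. "
        f"Write {n_seeds} short texts from completely unrelated domains "
        "(for example, math problems or source code).\n"
    )
    seeds = generate(seed_prompt, n_seeds)

    # Stage 2: reuse the seeds as in-context examples to broaden the generations.
    demos = "\n".join(f"Example: {s}" for s in seeds)
    expand_prompt = (
        "Here are examples of out-of-distribution texts:\n"
        f"{demos}\n"
        "Write more texts of this kind, covering as many different topics and styles as possible.\n"
    )
    return generate(expand_prompt, n)
```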

2. How do we know that the OOD is diverse and relevant?

We thank the reviewer for the thoughtful question and respond with two key empirical findings:

a) Effectiveness of Synthetic OOD Data (Figure 4): We demonstrate that the synthetic proxy data forms more generalized clusters and expands the decision boundary around the InD data. This suggests the generated OOD data captures a wider, non-linear space, thereby enabling detection of a diverse range of OOD test samples.

b) Cross-modal OOD generalization (Figure 5): We show that a model trained on our synthetic OOD data (ToxiGen and GSM8K) achieves perfect FPR95 on the real MBPP test set—despite not being explicitly trained on MBPP. The performance matches that of an ideal model, indicating the diversity and relevance of our synthetic OOD data across modalities.

We hope this addresses the reviewer’s concerns and clarifies both the intuition and the effectiveness behind our OOD data generation methodology.

3. A motivating point such as comparison with other (non-LLM) synthetic OOD generation methods is missing.

We appreciate the reviewer’s point. The use of synthetic OOD generation, particularly in the text modality, is still an emerging area, and we are not aware of established non-LLM synthetic OOD generation methods that serve as competitive baselines for comparison.

To address this gap, we employ the ideal baseline, which is trained on real OOD data. This baseline is highly competitive, offering a benchmark that reflects optimal performance. However, it is rarely used in OOD research, as it does not align with real-world conditions where OOD data is both highly diverse and often inaccessible. As shown in Table 1, our synthetic approach either outperforms or matches the ideal baseline on far-OOD datasets. Furthermore, for the challenging near-OOD datasets, our synthetic model is the only method that performs closely to the ideal model.

Moreover, if the reviewer has any specific baselines in mind, we would be grateful if they could share them, and will gladly consider incorporating them in the final version.

4. Table 1 is too small. Maybe the most important numbers (metrics) can be included in the main part of the paper.

Thank you for the suggestion. We will incorporate this change in the final draft to improve clarity and presentation.

5. The far-OOD task seems rather trivial.

We appreciate the feedback. However, we want to emphasize that far-OOD detection is crucial in real-world systems that need to detect and handle tasks such as math or coding problems differently. For instance, these tasks often need to bypass unnecessary processes, like harmful content filters, which are useful for general text but become costly and irrelevant when applied to specialized domains like math or code. Thus, far-OOD detection is not just a theoretical challenge but a practical necessity for optimizing system performance across diverse use cases.

Official Review
Rating: 6

This paper proposes an LLM-based OOD detection method. The final detector is still supervised, but the OOD examples required to train the detector are replaced by synthetic examples generated by an LLM. The experiments were conducted on the Llama model and multiple classic OOD detection tasks, in addition to RLHF reward modeling. The results and the following analyses reveal that the proposed method is highly effective, matching and even surpassing the performance of an "ideal" case where the original OOD examples are used for training the detector.

Reasons to Accept

  1. Writing is very clear, along with clear and easy-to-interpret results. I also find the analyses (section 4.2.4) very interesting.
  2. Experiments are relatively wide-ranging (maybe could've experimented with more LLMs though) and cover many tasks, old and new. And the gains seem solid when compared to the non-LLM-based methods.

Reasons to Reject

My main reason for rejection is the relatively old-fashioned methodology that underlies the whole paper, along with a lack of discussion of LLM-based alternatives. Replacing OOD samples with LLM-generated ones is well and good, and a legitimate use of a powerful LLM, but one could easily devise a baseline that prompts the LLM to detect OOD samples directly (e.g., by showing the LLM a few in-domain examples, then adding the query text at the end and asking the model to determine whether it's in-domain). In fact, the survey below [1] might contain relevant LLM-based methods that could serve as the baseline. The lack of a competitive LLM-based alternative is concerning, given the rapid pace of advancement in LLM capabilities. At the very least, a discussion of recent LLM-based methods is warranted. In case the alternative methods aren't fair to compare against, one should make the case for this claim too.

[1] Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey. Ruiyao Xu, and Kaize Ding.

Questions for the Authors

My main question is already raised above. And in case more LLM-based alternatives are evaluated in the revision, I would ask preemptively for their respective pros and cons, when compared with the authors' proposed method.

Comment

We sincerely thank the reviewer for their thoughtful feedback. We greatly appreciate your recognition of the clarity of our work, as well as your acknowledgment of the extensive experiments and promising results.

1. My main reason for rejection is the relatively old-fashioned methodology that underlies the whole paper, along with a lack of discussion on LLM-based alternatives. I would ask preemptively for their respective pros and cons, when compared with the authors' proposed method.

We appreciate the reviewer’s thoughtful and detailed feedback.

Regarding the suggestion of prompting an LLM with a few in-domain examples followed by a query to determine in-domain status: while conceptually valid, this approach is computationally intensive and difficult to scale in practice. Text inputs can be excessively long, and including several in-domain examples for each query increases input length substantially. This not only slows down inference but also imposes significant memory and latency costs, which makes it impractical for real-world deployment—especially in scenarios where OOD detection is expected to act as a fast, lightweight filtering step.

Additionally, relying on in-context learning for OOD detection on base LLMs can be unstable and yield suboptimal performance, as base models are not fine-tuned to discriminate in/out-of-distribution samples based on limited demonstrations alone. In contrast, well-trained OOD detectors are typically designed for efficiency and robustness at scale, which our approach directly supports.

We also examined the survey mentioned and would like to clarify that most methods discussed—including AnomalyGPT (Zhang et al., 2023), Myriad (Li et al., 2023b), Tabular (Li et al., 2024a), AnoCLIP (Deng et al., 2023), CLIP-AD (Chen et al., 2023b), MCM (Ming et al., 2022), Miyai et al. (2023), SETAR (Li et al., 2024d), and AnomalyRuler (Yang et al., 2024c)—are developed for image, video, tabular, or multimodal modalities, and are not directly applicable to the text-based OOD detection setting we study.

In fact, text-focused, LLM-based OOD generation or detection is still an emerging area, and we are not aware of established LLM-based methods that could serve as competitive baselines for comparison.

To address this gap, we employ the ideal baseline, which is trained on real OOD data. This baseline is highly competitive, offering a benchmark that reflects optimal performance. However, it is rarely used in OOD research, as it does not align with real-world conditions where OOD data is both highly diverse and often inaccessible. As shown in Table 1, our synthetic approach either outperforms or matches the ideal baseline on far-OOD datasets. Furthermore, for the challenging near-OOD datasets, our synthetic model is the only method that performs closely to the ideal model.

Moreover, if the reviewer has any specific baselines in mind, we would be grateful if they could share them, and will gladly consider incorporating them in the final version.

Finally, we agree that a discussion of recent LLM-based approaches is valuable. In the final draft, we will include a dedicated section discussing these methods from the survey, along with their applicability and limitations in the context of text-based OOD detection.

Comment

Thanks for the detailed response. I agree in principle with the cons of an LLM-based OOD detector raised in the response, and I applaud the authors' intention to include more discussion of LLM-based methods in the final draft.

Having said that, I still think a few-shot LLM-based baseline should be included, and you can use the probability of the predicted class label as the score to compute your AUROC. It's straightforward and can be thought of as a more direct way of utilizing the domain knowledge an LLM already possesses; after all, asking the LLM to generate synthetic OOD examples is just a more implicit way of harnessing that same knowledge. Given the time constraint, I'm willing to re-assess my score even given partial results.

Comment

I still think a few-shot LLM-based baseline should be included... Given the time constraint, I'm willing to re-assess my score even given partial results.

We sincerely thank the reviewer for engaging in the discussion period and for their willingness to re-assess their score. As requested, we implemented the baseline suggested by Reviewer tPhL. Here are the details of the experiment:

We used a few-shot setting, where five InD samples were provided as in-context examples to guide the LLM. We used Civil Comments (CC) as the InD data. Following the reviewer’s suggestion, we appended the query text and asked the model to determine whether it was InD or OOD. Specifically, we used the following prompt template in triple quotes with label space {“Yes”, “No”}:

"""Task: Out-of-Distribution (OOD) Detection

You are given several examples of In-Distribution (ID) texts. All ID examples come from the Civil Comments dataset, which consists of public comments written between 2015 and 2017 on approximately 50 English-language news sites worldwide. This dataset is used for toxicity classification research and contains a wide range of civil discourse and online discussions.

Your goal is to determine whether a new text sample is Out-of-Distribution (OOD) or not, based on your understanding of the ID examples below.

Below are several In-Distribution (ID) examples:

Example:

Text: {{ InD Sample 1 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 2 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 3 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 4 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 5 }}

Out-of-Distribution (OOD)?: No

Now, based on the above five examples, determine whether the following text Example is Out-of-Distribution (OOD). Answer with 'Yes' if the Example is Out-of-Distribution, or 'No' if it is not.

Example:

Text: {{ Test Sample }}

Out-of-Distribution (OOD)?:"""

We ensured that the five-shot samples were mutually exclusive from the test set.
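
For reference, here is a minimal sketch of how such a few-shot baseline can be scored (a simplified illustration, not our exact implementation). It assumes OOD is treated as the positive class, and `yes_probability` is a hypothetical helper returning the LLM's next-token probability of 'Yes' for a filled-in prompt:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def yes_probability(prompt: str) -> float:
    """Hypothetical helper: the LLM's next-token probability of 'Yes' (i.e., OOD)."""
    raise NotImplementedError

def fpr_at_95_tpr(ood_scores, ind_scores):
    # Pick the threshold at which 95% of OOD samples are detected (TPR = 0.95
    # with OOD as the positive class), then report how many InD samples are
    # wrongly flagged as OOD at that threshold.
    thresh = np.percentile(ood_scores, 5)
    return float(np.mean(np.asarray(ind_scores) >= thresh))

def evaluate_fewshot_baseline(ind_prompts, ood_prompts):
    ind_scores = [yes_probability(p) for p in ind_prompts]   # score = P("Yes") = P(OOD)
    ood_scores = [yes_probability(p) for p in ood_prompts]
    labels = [0] * len(ind_scores) + [1] * len(ood_scores)   # 1 = OOD
    auroc = roc_auc_score(labels, ind_scores + ood_scores) * 100
    fpr95 = fpr_at_95_tpr(ood_scores, ind_scores) * 100
    return fpr95, auroc
```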

| InD | Method | GSM8K (FPR95 ↓, AUROC ↑) | MBPP (FPR95 ↓, AUROC ↑) | SST-2 (FPR95 ↓, AUROC ↑) | ToxiGen (FPR95 ↓, AUROC ↑) |
|-----|--------|--------------------------|-------------------------|--------------------------|----------------------------|
| CC | Original (Ideal) | 0.00, 100.00 | 0.00, 100.00 | 0.055, 99.99 | 4.79, 98.67 |
| CC | MSP | 100.00, 41.11 | 100.00, 78.47 | 92.31, 54.27 | 92.77, 65.80 |
| CC | Energy | 96.36, 54.81 | 80.80, 82.83 | 70.35, 73.25 | 84.89, 68.74 |
| CC | ReAct | 96.74, 69.78 | 92.20, 88.16 | 61.89, 82.31 | 84.04, 67.60 |
| CC | DICE | 97.57, 65.10 | 88.40, 81.66 | 69.63, 80.31 | 83.83, 63.43 |
| CC | Few-shot LLM-based | 99.85, 15.15 | 99.40, 51.75 | 97.97, 40.33 | 94.04, 58.38 |
| CC | Synthetic (Ours) | 0.00, 100.00 | 0.00, 100.00 | 10.16, 97.66 | 12.66, 96.59 |

As shown in the results, the baseline performs significantly worse than our proposed method. We attribute this to the fact that only InD samples were used as in-context demonstrations, whereas prior work [A, B] has shown the importance of the label space for effective in-context learning. The absence of OOD few-shot samples (a limitation of this baseline, as OOD samples are inherently unknown and thus unavailable for few-shot demonstrations) likely hindered the model's ability to form a robust decision boundary between InD and OOD samples. This highlights a key limitation of purely in-context learning-based approaches to OOD detection with base LLMs. Consequently, this baseline's underperformance reinforces the importance of dedicated OOD detection techniques that explicitly incorporate OOD signals during training or evaluation, such as the method we propose, which are more robust and better suited to practical deployments.

We hope this addresses the reviewer’s concerns, and we would appreciate it if the reviewer could consider revising their score. We remain open to further engagement should the reviewer have any additional questions or suggestions.

[A] Jannik Kossen, Yarin Gal, and Tom Rainforth. 2023. In-context learning learns label relationships but is not conventional learning. ICLR 2024

[B] Haokun Chen, Xu Yang, Yuhang Huang, Zihan Wu, Jing Wang, and Xin Geng. Manipulating the label space for in-context classification. arXiv preprint arXiv:2312.00351, 2023.

Comment

Thanks for the quick response. It's clear that the few-shot baseline completely collapsed, and you are right to suspect it's due to only having explicit ID examples in the prompt, so the model is strongly biased towards always answering "No". If it's not too much, can you also try removing the explicit labels and only listing the ID examples, in a format like this:

<old introduction>
Below are several In-Distribution (ID) examples:
Example 1:
Example 2:
...
Now, based on the above five examples, determine whether the following Example is likely to be from the same dataset. Answer with 'Yes' if the Example is from the same dataset, or 'No' if it is not.
Example:
Answer:

Note that I reversed the meaning of the "Yes/No" label at the end. My hypothesis is that the model might become too confused about the notion of ID vs OOD, and I want to see if it's able to pick up semantic similarity by asking whether the example might be from the same dataset.

Given the time constraint, I think one test set might be enough to show whether the strategy works or not.

Comment

We sincerely thank the reviewer for their understanding regarding the time constraints. We were able to carry out this new experiment using the format suggested by Reviewer tPhL. We used Civil Comments (CC) as the InD data and SST-2 as the OOD data. The results are shown below, where (format 1) refers to the original template we used and (format 2) to the new template suggested by Reviewer tPhL.

| InD | Method | SST-2 (FPR95 ↓, AUROC ↑) |
|-----|--------|--------------------------|
| CC | Original (Ideal) | 0.055, 99.99 |
| CC | MSP | 92.31, 54.27 |
| CC | Energy | 70.35, 73.25 |
| CC | ReAct | 61.89, 82.31 |
| CC | DICE | 69.63, 80.31 |
| CC | Few-shot LLM-based (format 1) | 97.97, 40.33 |
| CC | Few-shot LLM-based (format 2) | 99.67, 59.32 |
| CC | Synthetic (Ours) | 10.16, 97.66 |

As seen in the table above, the results using this new template still fall significantly short of our baselines. This again reinforces our previous hypothesis that using only InD samples as in-context demonstrations is not an optimal choice. The absence of OOD few-shot samples likely hindered the model’s ability to form a robust decision boundary between InD and OOD samples.

Consequently, the underperformance of this baseline highlights the necessity for well-trained OOD detectors that explicitly leverage OOD signals—like the method we propose—which offer greater robustness and are better suited for real-world applications.

We hope that our response satisfactorily addresses the reviewer’s concerns, and we would be grateful if the reviewer would consider revising their score. We remain fully open to further discussion and would welcome any additional questions or suggestions the reviewer may have.

Final Decision

This paper introduces a process for synthetically generating OOD data via LLMs. It evaluates the process on a diverse set of tasks spanning toxicity and harm detection as well as RLHF reward modeling. Multiple reviewers comment on the rigorous set of provided analyses and remark that the paper is very clear and easy to follow, even for people not familiar with the topic.

While there could be improvements as suggested by the reviewers (e.g., added context about the workflow, validation of the workflow by adding additional baselines), the results appear sound, especially considering the multiple follow-ups with reviewer tPhL that included additional few-shot experiments.

I thus recommend acceptance of the paper.

This paper went through ethics reviewing. Please review the ethics decision and details below.
Decision: All good, nothing to do or only minor recommendations