PaperHub
Overall score: 5.7 / 10 (Poster, 3 reviewers; min 4, max 7, std 1.2)
Reviewer ratings: 7, 4, 6
Average confidence: 3.0
COLM 2025

Out-of-Distribution Detection using Synthetic Data Generation

OpenReview · PDF
Submitted: 2025-03-18 · Updated: 2025-08-26
TL;DR

This work presents an effective OOD detection method using LLM-generated synthetic proxies, eliminating the need for external OOD data. Experiments show it reduces false positives and outperforms baseline methods in text classification tasks.

Abstract

Keywords

out-of-distribution detection, out-of-distribution generalization, synthetic data

Reviews and Discussion

Official Review
Rating: 7

This paper presents an alternative for out-of-distribution (OOD) detection. Instead of relying on real OOD data or unlabeled wild data, the authors leverage an LLM to generate synthetic OOD data: a two-stage method for far-OOD data and a single-stage method for near-OOD data. The authors use some classical NLP tasks (toxicity detection and sentiment analysis) as well as classification tasks in LLM training (RLHF) to verify the effectiveness of the synthetic OOD data.

Reasons to Accept

To be honest, I am not familiar with this area, so I can only provide feedback based on my limited knowledge.

  • The paper is well-written and easy to follow. I feel I indeed learned something about OOD detection from the paper.
  • The paper is novel: using an LLM to generate synthetic OOD data, which has not been well explored in prior work.
  • The experiments are extensive, and the results are quite promising.
  • The analyses are also quite interesting and nice.

Reasons to Reject

Weakness:

I don't see any obvious weakness -- but it may be due to my limited knowledge in this area.

Questions for the Authors

Theoretically, you can directly prompt the model to generate OOD examples even for the far-OOD, instead of applying the two-stage approach. What is the reason for not doing this, and what would be the actual drawbacks? I think some experimental results would be nice to support this approach.

Line 226-227 "we believe is due to the significant semantic similarity between the InD and OOD data, making the task especially challenging." Would there be any further experimental results that can qualitatively show this is the case?

Comment

We would like to express our gratitude for your time and effort in reviewing our paper and providing a favorable recommendation. We appreciate your recognition of the novelty and clarity of our work, as well as your acknowledgment of the extensive experiments and promising results. We're glad to hear that the paper has provided useful insights into OOD detection.

1. Theoretically, you can directly prompt the model to generate OOD examples even for the far-OOD, instead of applying the two-stage approach.

While it is possible to directly prompt the model for far-OOD examples, we chose the two-stage approach to promote greater diversity in the generated samples. In our preliminary experiments, we did try the one-stage direct prompting method, but it produced less diverse outputs. By first generating a few seed samples and using them as in-context examples in the second stage, we guide the model to explore a wider range of plausible OOD variations (as confirmed later in Figure 4), rather than producing the overly similar or constrained outputs that a direct prompt tends to yield. We will include these preliminary findings and the comparison with the direct prompting method in the final draft.

2. Line 226-227 "we believe is due to the significant semantic similarity between the InD and OOD data, making the task especially challenging."

To elaborate on the point made in lines 226-227, the semantic similarity between InD and these OOD datasets stems from the fact that both BT (SEAC) and BT (DAWBS) are OOD datasets from the same BeaverTails dataset that contains the InD dataset "Non-Violent Unethical Behavior" (for details, please see Appendix B.2). BT (SEAC) corresponds to the "Sexually Explicit, Adult Content" category, and BT (DAWBS) corresponds to the "Drug Abuse, Weapons, Banned Substance" category. Although these OOD categories are different, they share substantial semantic similarity with the InD dataset (Non-Violent Unethical Behavior), as all these categories involve content moderation issues and deal with behaviors considered unethical or inappropriate.

This overlap in thematic content makes it particularly challenging for the model to distinguish between OOD and InD samples, as the content in these OOD categories is not drastically different in nature from the InD examples. Despite the difficulty of these near-OOD datasets, our synthetic model is the only approach that performs close to the ideal model (Table 1).

Official Review
Rating: 4

This paper presents an approach to generate synthetic OOD data to train OOD detectors (classifiers). The crux of the approach is to 'carefully' build prompts that induce useful OOD data via LLM prompting. The approach is evaluated on the following classification tasks where OOD data is scarce: 1) toxicity detection, 2) harm detection, 3) RLHF reward modeling. It is also evaluated in a downstream setting with selective classification using OOD predictions. The authors propose a classifier which is trained on the iid data plus the synthetic data. Comparisons are made against OOD approaches in the machine learning literature (mostly image classification).

The paper addresses an interesting topic of OOD in text processing tasks. However, the underlying process of OOD generation proposed is not clear. A motivating point, such as a comparison with other (non-LLM) synthetic OOD generation methods, is missing. Using an LLM to generate synthetic data is not novel, so the why and how of the approach should be made clear to make the paper strong.

Reasons to Accept

  • The paper tackles the difficult and less researched topic of OOD in text processing tasks.

Reasons to Reject

  • The synthetic data generation method is not clearly explained. Besides the illustrated workflow, there is only a reference to the prompts in the Appendix. So, we do not know the intuition and the way of inducing the diverse OOD generation paths. How do we know that the OOD data is diverse and relevant? (In the intro this is mentioned as a limitation of existing approaches.)

  • There is no baseline where the classifier is trained on OOD data generated in another, cheaper way, e.g., by corrupting existing iid pairs.

  • Although the authors carried out a downstream evaluation with selective answering, this was done on a classifier trained by the authors. Experiments with more powerful models, maybe zero-shot, and existing toxicity detection models would make the paper stronger.

Questions for the Authors

  • Table 1 is too small. Maybe the most important numbers (metrics) can be included in the main part of the paper.

  • The far-OOD task seems rather trivial.

  • Why is ReAct not included in the experiments of Section 4.2.3 and Figure 3?

  • Why is the OOD proportion on the top axis of Figure 3 different for different approaches? The explanation of 'Risk' is not clear. Why not use accuracy at a given percentage?

Details of Ethics Concerns

I'm not sure if the paper requires an ethics review due to the content of the datasets used.

Comment

6. Why is ReAct not included in the experiments of Section 4.2.3 and Figure 3?

Thank you for the comment. In our preliminary experiments, we found ReAct to perform similarly to the other scoring baselines (e.g., MSP, Energy, DICE). We chose to exclude ReAct from the experiments in Section 4.2.3 and Figure 3 because it did not show significant improvements over the other baselines in our main results. Furthermore, to keep the presentation clear and focused, we decided not to add it, as the upper axis in Figure 3 was already becoming overcrowded. We will include it in the final draft to provide a complete comparison.

7. Why is the OOD proportion on the top axis of Figure 3 different for different approaches? The explanation of 'Risk' is not clear. Why not use accuracy at a given percentage?

The top axis in Figure 3 shows the remaining proportion of OOD data after each method has removed uncertain samples. Different methods remove OOD samples at different rates, which is why the proportions vary. The goal of selective classification (Geifman and El-Yaniv, 2017) is to abstain from making predictions when the model is uncertain, which is achieved by removing uncertain samples at each coverage level. Ideally, these uncertain samples should be OOD, so the more OOD samples removed, the better. As shown in the figure, our method removes the majority of OOD samples at a given coverage level, compared to the baselines.

Regarding the use of "Risk", this is a standard metric in the selective classification literature (Geifman and El-Yaniv, 2017) and not a choice we explicitly made. Risk is essentially the complement of accuracy (1 - accuracy), which is commonly used to quantify model uncertainty and performance in this context.
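
As a concrete illustration (a sketch, not our actual evaluation code), the risk-coverage curve and the retained OOD proportion on the top axis of Figure 3 can be computed from per-example uncertainty scores roughly as follows:

```python
import numpy as np

def risk_coverage_curve(scores, correct, coverages=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Selective-classification risk (1 - accuracy) at each coverage level.

    scores:  per-example uncertainty scores, higher = more uncertain (abstained on first)
    correct: 1 if the underlying classifier's prediction is right, 0 otherwise
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(scores)            # most confident predictions first
    correct_sorted = correct[order]
    n = len(scores)
    curve = []
    for c in coverages:
        k = max(1, int(round(c * n)))     # keep the k most confident predictions
        curve.append((c, 1.0 - correct_sorted[:k].mean()))
    return curve

def ood_proportion_retained(scores, is_ood, coverage):
    """Fraction of retained test examples that are OOD at a given coverage
    (the quantity on the top axis of Figure 3). It differs across methods
    because each method's scores rank, and therefore remove, OOD samples
    at different rates."""
    scores = np.asarray(scores, dtype=float)
    is_ood = np.asarray(is_ood, dtype=bool)
    order = np.argsort(scores)
    k = max(1, int(round(coverage * len(scores))))
    return float(is_ood[order][:k].mean())
```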

Comment

Thank you for reviewing our paper. While our responses may be detailed, we have made every effort to be concise while ensuring clarity and comprehensiveness. We look forward to further discussion.

1. The underlying process of OOD generation proposed is not clear. The synthetic data generation method is not clearly explained. Besides the illustrated workflow, there is only a reference to the prompts in the Appendix. So, we do not know the intuition and the way of inducing the diverse OOD generation paths.

We thank the reviewer for raising this important point. We would like to clarify that the underlying OOD generation process is described in Section 3.1 of the paper. As noted, Figure 1 serves only as a high-level illustration, while the specific prompt designs used for generating OOD data are comprehensively documented in Tables 10–14 of the Appendix.

To address the reviewer's main concern regarding how diverse OOD generation paths are induced, we emphasize the following:

We leverage in-context learning to induce diversity, with different strategies for near-OOD and far-OOD data:

For near-OOD, since it originates from the same domain as the InD data, we use InD examples as in-context demonstrations within prompts. For far-OOD, which comes from a different domain, we employ a two-stage prompting strategy. Initially, we prompt the LLM to generate a few seed samples, which are then used as in-context examples in the second stage to guide further generation. This process allows the LLM to generalize beyond the narrow decision boundary of the InD distribution and produce a diverse set of plausible yet OOD samples.
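
To make this concrete, below is a minimal sketch of the two prompting strategies. This is an illustration under our own simplifying assumptions, not the exact prompts (those are given in Tables 10-14 of the Appendix); `generate` is a hypothetical wrapper around an LLM sampling API, and the prompt wording is illustrative.

```python
import random

def generate(prompt: str, n: int) -> list[str]:
    """Hypothetical wrapper around an LLM sampling API; returns n sampled completions."""
    raise NotImplementedError  # e.g., call a Llama-style model with temperature sampling

def near_ood_samples(ind_examples: list[str], n: int) -> list[str]:
    # Near-OOD (same domain as InD): single stage, with InD texts used
    # directly as in-context demonstrations in the prompt.
    demos = "\n".join(f"Example: {x}" for x in random.sample(ind_examples, 5))
    prompt = (
        "Here are examples from the in-distribution dataset:\n"
        f"{demos}\n"
        "Write new texts from the same domain that do NOT belong to the "
        "categories covered by the examples above.\n"
    )
    return generate(prompt, n)

def far_ood_samples(ind_description: str, n: int, n_seeds: int = 5) -> list[str]:
    # Far-OOD (different domain), stage 1: prompt the LLM for a few seed samples.
    seed_prompt = (
        f"The in-distribution data consists of {ind_description}. "
        f"Write {n_seeds} short texts from completely unrelated domains "
        "(for example, math problems or source code).\n"
    )
    seeds = generate(seed_prompt, n_seeds)

    # Stage 2: reuse the seeds as in-context examples to broaden the generations.
    demos = "\n".join(f"Example: {s}" for s in seeds)
    expand_prompt = (
        "Here are examples of out-of-distribution texts:\n"
        f"{demos}\n"
        "Write more texts of this kind, covering as many different topics and styles as possible.\n"
    )
    return generate(expand_prompt, n)
```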

2. How do we know that the OOD is diverse and relevant?

We thank the reviewer for the thoughtful question and respond with two key empirical findings:

a) Effectiveness of Synthetic OOD Data (Figure 4): We demonstrate that the synthetic proxy data forms more generalized clusters and expands the decision boundary around the InD data. This suggests the generated OOD data captures a wider, non-linear space, thereby enabling detection of a diverse range of OOD test samples.

b) Cross-modal OOD generalization (Figure 5): We show that a model trained on our synthetic OOD data (ToxiGen and GSM8K) achieves perfect FPR95 on the real MBPP test set—despite not being explicitly trained on MBPP. The performance matches that of an ideal model, indicating the diversity and relevance of our synthetic OOD data across modalities.

We hope this addresses the reviewer’s concerns and clarifies both the intuition and the effectiveness behind our OOD data generation methodology.

3. A motivating point such as comparison with other (non-LLM) synthetic OOD generation methods is missing.

We appreciate the reviewer’s point. The use of synthetic OOD generation, particularly in the text modality, is still an emerging area, and we are not aware of established non-LLM synthetic OOD generation methods that serve as competitive baselines for comparison.

To address this gap, we employ the ideal baseline, which is trained on real OOD data. This baseline is highly competitive, offering a benchmark that reflects optimal performance. However, it is rarely used in OOD research, as it does not align with real-world conditions where OOD data is both highly diverse and often inaccessible. As shown in Table 1, our synthetic approach either outperforms or matches the ideal baseline on far-OOD datasets. Furthermore, for the challenging near-OOD datasets, our synthetic model is the only method that performs closely to the ideal model.

Moreover, if the reviewer has any specific baselines in mind, we would be grateful if they could share them, and will gladly consider incorporating them in the final version.

4. Table 1 is too small. Maybe the most important numbers (metrics) can be included in the main part of the paper.

Thank you for the suggestion. We will incorporate this change in the final draft to improve clarity and presentation.

5. The far-OOD task seems rather trivial.

We appreciate the feedback. However, we want to emphasize that far-OOD detection is crucial in real-world systems that need to detect and handle tasks such as math or coding problems differently. For instance, these tasks often need to bypass unnecessary processes, like harmful content filters, which are useful for general text but become costly and irrelevant when applied to specialized domains like math or code. Thus, far-OOD detection is not just a theoretical challenge but a practical necessity for optimizing system performance across diverse use cases.

Official Review
Rating: 6

This paper proposes an LLM-based OOD detection method. The final detector is still supervised, but the OOD examples required to train the detector are replaced by synthetic examples generated by an LLM. The experiments were conducted on the Llama model and multiple classic OOD detection tasks, in addition to RLHF reward modeling. The results and the following analyses reveal that the proposed method is highly effective, matching and even surpassing the performance of an "ideal" case where the original OOD examples are used for training the detector.

Reasons to Accept

  1. Writing is very clear, along with clear and easy-to-interpret results. I also find the analyses (section 4.2.4) very interesting.
  2. Experiments are relatively wide-ranging (maybe could've experimented with more LLMs though) and cover many tasks, old and new. And the gains seem solid when compared to the non-LLM-based methods.

Reasons to Reject

My main reason for rejection is the relatively old-fashioned methodology that underlies the whole paper, along with a lack of discussion of LLM-based alternatives. Replacing OOD samples with LLM-generated ones is well and good, and a legitimate use of a powerful LLM, but one could easily devise a baseline that prompts the LLM to detect OOD samples directly (e.g., by showing the LLM a few in-domain examples, then adding the query text at the end and asking the model to determine whether it's in-domain). In fact, the survey below [1] might contain relevant LLM-based methods that could serve as the baseline. The lack of a competitive LLM-based alternative is concerning, given the rapid pace of advancement in LLM capabilities. At the very least, a discussion of recent LLM-based methods is warranted. In case the alternative methods aren't fair to compare against, one should make the case for this claim too.

[1] Large Language Models for Anomaly and Out-of-Distribution Detection: A Survey. Ruiyao Xu, and Kaize Ding.

Questions for the Authors

My main question is already raised above. And in case more LLM-based alternatives are evaluated in the revision, I would ask preemptively for their respective pros and cons, when compared with the authors' proposed method.

Comment

We sincerely thank the reviewer for their thoughtful feedback. We greatly appreciate your recognition of the clarity of our work, as well as your acknowledgment of the extensive experiments and promising results.

1. My main reason for rejection is the relatively old-fashioned methodology that underlies the whole paper, along with a lack of discussion on LLM-based alternatives. I would ask preemptively for their respective pros and cons, when compared with the authors' proposed method.

We appreciate the reviewer’s thoughtful and detailed feedback.

Regarding the suggestion of prompting an LLM with a few in-domain examples followed by a query to determine in-domain status: while conceptually valid, this approach is computationally intensive and difficult to scale in practice. Text inputs can be excessively long, and including several in-domain examples for each query increases input length substantially. This not only slows down inference but also imposes significant memory and latency costs, which makes it impractical for real-world deployment—especially in scenarios where OOD detection is expected to act as a fast, lightweight filtering step.

Additionally, relying on in-context learning for OOD detection on base LLMs can be unstable and yield suboptimal performance, as base models are not fine-tuned to discriminate in/out-of-distribution samples based on limited demonstrations alone. In contrast, well-trained OOD detectors are typically designed for efficiency and robustness at scale, which our approach directly supports.

We also examined the survey mentioned and would like to clarify that most methods discussed—including AnomalyGPT (Zhang et al., 2023), Myriad (Li et al., 2023b), Tabular (Li et al., 2024a), AnoCLIP (Deng et al., 2023), CLIP-AD (Chen et al., 2023b), MCM (Ming et al., 2022), Miyai et al. (2023), SETAR (Li et al., 2024d), and AnomalyRuler (Yang et al., 2024c)—are developed for image, video, tabular, or multimodal modalities, and are not directly applicable to the text-based OOD detection setting we study.

In fact, text-focused, LLM-based OOD generation or detection is still an emerging area, and we are not aware of established LLM-based methods that could serve as competitive baselines for comparison.

To address this gap, we employ the ideal baseline, which is trained on real OOD data. This baseline is highly competitive, offering a benchmark that reflects optimal performance. However, it is rarely used in OOD research, as it does not align with real-world conditions where OOD data is both highly diverse and often inaccessible. As shown in Table 1, our synthetic approach either outperforms or matches the ideal baseline on far-OOD datasets. Furthermore, for the challenging near-OOD datasets, our synthetic model is the only method that performs closely to the ideal model.

Moreover, if the reviewer has any specific baselines in mind, we would be grateful if they could share them, and will gladly consider incorporating them in the final version.

Finally, we agree that a discussion of recent LLM-based approaches is valuable. In the final draft, we will include a dedicated section discussing these methods from the survey, along with their applicability and limitations in the context of text-based OOD detection.

Comment

Thanks for the detailed response. I agree in principle with the cons of an LLM-based OOD detector raised in the response, and I applaud the authors' intention to include more discussion of LLM-based methods in the final draft.

Having said that, I still think a few-shot LLM-based baseline should be included, and you can use the probability of the predicted class label as the score to compute your AUROC. It's straightforward and can be thought of as a more direct way of utilizing the domain knowledge an LLM already possesses; after all, asking the LLM to generate synthetic OOD examples is just a more implicit way of harnessing that same knowledge. Given the time constraint, I'm willing to re-assess my score even given partial results.

Comment

I still think a few-shot LLM-based baseline should be included... Given the time constraint, I'm willing to re-assess my score even given partial results.

We sincerely thank the reviewer for engaging in the discussion period and for their willingness to re-assess their score. As requested, we implemented the baseline suggested by Reviewer tPhL. Here are the details of the experiment:

We used a few-shot setting, where five InD samples were provided as in-context examples to guide the LLM. We used Civil Comments (CC) as the InD data. Following the reviewer’s suggestion, we appended the query text and asked the model to determine whether it was InD or OOD. Specifically, we used the following prompt template in triple quotes with label space {“Yes”, “No”}:

"""Task: Out-of-Distribution (OOD) Detection

You are given several examples of In-Distribution (ID) texts. All ID examples come from the Civil Comments dataset, which consists of public comments written between 2015 and 2017 on approximately 50 English-language news sites worldwide. This dataset is used for toxicity classification research and contains a wide range of civil discourse and online discussions.

Your goal is to determine whether a new text sample is Out-of-Distribution (OOD) or not, based on your understanding of the ID examples below.

Below are several In-Distribution (ID) examples:

Example:

Text: {{ InD Sample 1 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 2 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 3 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 4 }}

Out-of-Distribution (OOD)?: No

Example:

Text: {{ InD Sample 5 }}

Out-of-Distribution (OOD)?: No

Now, based on the above five examples, determine whether the following text Example is Out-of-Distribution (OOD). Answer with 'Yes' if the Example is Out-of-Distribution, or 'No' if it is not.

Example:

Text: {{ Test Sample }}

Out-of-Distribution (OOD)?:"""

We ensured that the five-shot samples were mutually exclusive from the test set.
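
For reference, here is a minimal sketch of how such a few-shot baseline can be scored (a simplified illustration, not our exact implementation). It assumes OOD is treated as the positive class, and `yes_probability` is a hypothetical helper returning the LLM's next-token probability of 'Yes' for a filled-in prompt:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def yes_probability(prompt: str) -> float:
    """Hypothetical helper: the LLM's next-token probability of 'Yes' (i.e., OOD)."""
    raise NotImplementedError

def fpr_at_95_tpr(ood_scores, ind_scores):
    # Pick the threshold at which 95% of OOD samples are detected (TPR = 0.95
    # with OOD as the positive class), then report how many InD samples are
    # wrongly flagged as OOD at that threshold.
    thresh = np.percentile(ood_scores, 5)
    return float(np.mean(np.asarray(ind_scores) >= thresh))

def evaluate_fewshot_baseline(ind_prompts, ood_prompts):
    ind_scores = [yes_probability(p) for p in ind_prompts]   # score = P("Yes") = P(OOD)
    ood_scores = [yes_probability(p) for p in ood_prompts]
    labels = [0] * len(ind_scores) + [1] * len(ood_scores)   # 1 = OOD
    auroc = roc_auc_score(labels, ind_scores + ood_scores) * 100
    fpr95 = fpr_at_95_tpr(ood_scores, ind_scores) * 100
    return fpr95, auroc
```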

| InD | Method | GSM8K (FPR95 ↓, AUROC ↑) | MBPP (FPR95 ↓, AUROC ↑) | SST-2 (FPR95 ↓, AUROC ↑) | ToxiGen (FPR95 ↓, AUROC ↑) |
|-----|--------|--------------------------|-------------------------|--------------------------|----------------------------|
| CC | Original (Ideal) | 0.00, 100.00 | 0.00, 100.00 | 0.055, 99.99 | 4.79, 98.67 |
| CC | MSP | 100.00, 41.11 | 100.00, 78.47 | 92.31, 54.27 | 92.77, 65.80 |
| CC | Energy | 96.36, 54.81 | 80.80, 82.83 | 70.35, 73.25 | 84.89, 68.74 |
| CC | ReAct | 96.74, 69.78 | 92.20, 88.16 | 61.89, 82.31 | 84.04, 67.60 |
| CC | DICE | 97.57, 65.10 | 88.40, 81.66 | 69.63, 80.31 | 83.83, 63.43 |
| CC | Few-shot LLM-based | 99.85, 15.15 | 99.40, 51.75 | 97.97, 40.33 | 94.04, 58.38 |
| CC | Synthetic (Ours) | 0.00, 100.00 | 0.00, 100.00 | 10.16, 97.66 | 12.66, 96.59 |

As shown in the results, the baseline performs significantly worse than our proposed method. We attribute this to the fact that only InD samples were used as in-context demonstrations, whereas prior work [A, B] has shown the importance of the label space for effective in-context learning. The absence of OOD few-shot samples (a limitation of this baseline, as OOD samples are inherently unknown and thus unavailable for few-shot demonstrations) likely hindered the model's ability to form a robust decision boundary between InD and OOD samples. This highlights a key limitation of purely in-context learning-based approaches to OOD detection with base LLMs. Consequently, this baseline's underperformance reinforces the importance of dedicated OOD detection techniques that explicitly incorporate OOD signals during training or evaluation, such as the method we propose, which are more robust and better suited to practical deployments.

We hope this addresses the reviewer’s concerns, and we would appreciate it if the reviewer could consider revising their score. We remain open to further engagement should the reviewer have any additional questions or suggestions.

[A] Jannik Kossen, Yarin Gal, and Tom Rainforth. 2023. In-context learning learns label relationships but is not conventional learning. ICLR 2024

[B] Haokun Chen, Xu Yang, Yuhang Huang, Zihan Wu, Jing Wang, and Xin Geng. Manipulating the label space for in-context classification. arXiv preprint arXiv:2312.00351, 2023.

Comment

Thanks for the quick response. It's clear that the few-shot baseline completely collapsed, and you are right to suspect it's due to only having explicit ID examples in the prompt, so the model is strongly biased towards always answering "No". If it's not too much, can you also try removing the explicit labels and only listing the ID examples, in a format like this:

<old introduction>
Below are several In-Distribution (ID) examples:
Example 1:
Example 2:
...
Now, based on the above five examples, determine whether the following Example is likely to be from the same dataset. Answer with 'Yes' if the Example is from the same dataset, or 'No' if it is not.
Example:
Answer:

Note that I reversed the meaning of the "Yes/No" label at the end. My hypothesis is that the model might become too confused about the notion of ID vs OOD, and I want to see if it's able to pick up semantic similarity by asking whether the example might be from the same dataset.

Given the time constraint, I think one test set might be enough to show whether the strategy works or not.

Comment

We sincerely thank the reviewer for their understanding regarding the time constraints. We were able to carry out this new experiment using the format suggested by Reviewer tPhL. We used Civil Comments (CC) as the InD data and SST-2 as the OOD data. The results are shown below, where (format 1) refers to the original template we used and (format 2) to the new template suggested by Reviewer tPhL.

| InD | Method | SST-2 (FPR95 ↓, AUROC ↑) |
|-----|--------|--------------------------|
| CC | Original (Ideal) | 0.055, 99.99 |
| CC | MSP | 92.31, 54.27 |
| CC | Energy | 70.35, 73.25 |
| CC | ReAct | 61.89, 82.31 |
| CC | DICE | 69.63, 80.31 |
| CC | Few-shot LLM-based (format 1) | 97.97, 40.33 |
| CC | Few-shot LLM-based (format 2) | 99.67, 59.32 |
| CC | Synthetic (Ours) | 10.16, 97.66 |

As seen in the table above, the results using this new template still fall significantly short of our baselines. This again reinforces our previous hypothesis that using only InD samples as in-context demonstrations is not an optimal choice. The absence of OOD few-shot samples likely hindered the model’s ability to form a robust decision boundary between InD and OOD samples.

Consequently, the underperformance of this baseline highlights the necessity for well-trained OOD detectors that explicitly leverage OOD signals—like the method we propose—which offer greater robustness and are better suited for real-world applications.

We hope that our response satisfactorily addresses the reviewer’s concerns, and we would be grateful if the reviewer would consider revising their score. We remain fully open to further discussion and would welcome any additional questions or suggestions the reviewer may have.

Final Decision

This paper introduces a process for synthetically generating OOD data via LLMs. It evaluates the process on a diverse set of tasks spanning toxicity and harm detection as well as RLHF reward modeling. Multiple reviewers comment on the rigorous set of provided analyses and remark that the paper is very clear and easy to follow, even for people not familiar with the topic.

While there could be improvements as suggested by the reviewers (e.g., added context about the workflow, validation of the workflow by adding additional baselines), the results appear sound, especially considering the multiple follow-ups with reviewer tPhL that included additional few-shot experiments.

I thus recommend acceptance of the paper.

This paper went through ethics reviewing. Please review the ethics decision and details below.
Decision: All good, nothing to do or only minor recommendations