PaperHub
Rating: 5.8 / 10 (4 reviewers; individual ratings 6, 5, 6, 6; min 5, max 6, std 0.4)
Decision: Rejected
Average confidence: 4.0
ICLR 2024

Out-Of-Distribution Detection With Smooth Training

OpenReview · PDF
Submitted: 2023-09-19 · Updated: 2024-02-11

Abstract

Keywords
out-of-distribution detection

Reviews and Discussion

Review
Rating: 6

This paper addresses the issue of label smoothing in OOD detection. It first identifies the cross-entropy loss as a cause of overconfident predictions in neural networks. It then proposes SMOT, a new training scheme that applies label smoothing to perturbed inputs. SMOT trains the model to assign different confidence levels to inputs depending on the area of the masked-out regions.

Strengths

  1. It applies label smoothing beyond the original training data, and demonstrates that label smoothing on ID data alone is not enough.
  2. The proposed training method does not require an auxiliary OOD dataset; relying on masking perturbations instead is very promising when an auxiliary OOD dataset is not available.
  3. The supplementary sections justify many decisions made in the main paper, such as why masking is chosen. The paper is generally complete and clear.

Weaknesses

The discussions/conclusions from Theorems 1 and 2 are too abrupt and not very obvious, such as in the paragraph just before Sec. 3.2. More explanation is needed, especially for Theorem 2. This makes the paper less self-contained.

Questions

  1. What are the intuitions/interpretations of $\sqrt{C/n}$ in Eq. (4), and of the equations in Theorems 1 and 2? The connection between the equations and the implications is not clear.
  2. Can SMOT use the updated model trained on the fly to get the mask for perturbation?
Comment

Q1: What are the intuitions/interpretations of $\sqrt{C/n}$ in Eq. (4), and of the equations in Theorems 1 and 2? The connection between the equations and the implications is not clear.

A1: We apologize for the confusion. $n$ is the number of training samples and $C$ is a uniform constant. When the number of training samples is large enough, this term tends to 0, so that the risk of the empirical predictor $\mathbf{f}_{\boldsymbol{\theta}_S}$ approaches the optimal risk. Theorem 1 states that under the right conditions, the model is likely to be overconfident on the ID data. Theorem 2 states that when the model is overconfident on the ID data, it will also be overconfident on OOD data that have a small distribution discrepancy with the ID data. We will go into more detail in the revised version.
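To make the role of the $\sqrt{C/n}$ term concrete, here is a hedged reconstruction of the kind of bound the response describes; the exact form, constants, and notation of the paper's Eq. (4) may differ.

```latex
% A plausible reading of the bound discussed above (notation assumed, not the paper's verbatim Eq. (4)):
% with high probability over the draw of the n training samples,
\[
R\!\left(\mathbf{f}_{\boldsymbol{\theta}_S}\right)
\;\le\;
\min_{\boldsymbol{\theta}} R\!\left(\mathbf{f}_{\boldsymbol{\theta}}\right)
\;+\;
\sqrt{\frac{C}{n}},
\]
% so as n grows, the empirical predictor's risk approaches the optimal risk,
% which is the sense in which a well-fit model becomes (over)confident on ID data.
```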

Q2: Can SMOT use the updated model trained on the fly to get the mask for perturbation?

A2: Thanks for your kind suggestion; we performed the experiment you described. We still train for 300 epochs: in the first 100 epochs we perform standard training, and then we use the in-training model to obtain the CAM for SMOT training. The results are as follows:

| Method | Avg FPR95 | Avg AUROC |
|---|---|---|
| MSP | 51.13 | 91.27 |
| SMOT trained on the fly | 25.05 | 95.44 |

The results show that what you described works well! Thanks again for your insightful comments. We will add these results and discussions to our revision.

Review
Rating: 5

This paper proposes SMOT, a smooth training algorithm for OOD detection. SMOT is based on the heuristic that masking out certain features from the input image should correspondingly lead to a decrease in the network's prediction confidence. Specifically, SMOT leverages CAM together with random thresholding to determine the masking region, and the soft label (or essentially the prediction confidence encoded in the training target) is determined according to the threshold (which is related to the area of the masking region if I understand correctly). Experiments on CIFAR-10/100 and ImageNet-200 show that SMOT exhibits (moderate) performance improvements over certain existing methods.

Strengths

  • The manuscript is in general clearly written.

Weaknesses

Theoretical Investigation

I find that Sec. 3.1 is somewhat hard to follow. The message / motivation it tries to convey is unclear to me. See specific comments or questions below.

  1. The conclusion of Theorem 1 is "given a sufficient amount of training data and a small optimal risk, ..., the issue of over-confidence for ID data is highly probable to arise". However, the equation is only related to "over-confidence" (which I assume refers to excessive maximum softmax probability according to Eq. 2) when the loss is exactly cross-entropy loss. If we use label smoothing as the loss (although it is later empirically shown not to work), then there won't be over-confidence by looking at Eq. 4.

  2. My same argument could be applied to Theorem 2 as well. Furthermore, I can't see how exactly the "over-confidence in OOD data" is reflected in Theorem 2. More elaboration and clarification is necessary.

  3. The concluding paragraph under Theorem 2 makes me lost again. Why do we want to "access real OOD data to reduce the distribution discrepancy during training"? What does it mean to "reduce the distribution discrepancy" (the $d(\theta)$ in Theorem 2?) between ID and OOD? Meanwhile, why do "limited training ID data", "overfitting", and "the failure of ID classification" suddenly become issues for OOD detection?

  4. Lastly, where are the proofs of the Theorems (or where are the references if they were proved by existing works)?

Design of SMOT

  1. Eq. 9 seems a little arbitrary. Why using a temperature-scaled exponential function? What's the intuition behind it? Why (t - 255)? What is the value range of t?

Experiments

  1. One limitation of the experiments and presented results is the fact that all considered OOD datasets are far-OOD which are easier to be detected. I expect to see more results on near-OOD splits (e.g., CIFAR-100 or Tiny ImageNet for CIFAR-10, SSB or NINCO for ImageNet), which are more likely to translate to real-world where the OOD images can be extremely similar to ID images.

  2. The baseline selection seems a bit arbitrary. How does SMOT compare with recent top-performing methods (e.g., ASH [1] as identified by OpenOOD [2])? Also, a highly relevant baseline is missing (see below "Related Work" for details).

  3. Why are the training budget and learning rate scheduler different between base models and SMOT models? Specifically, base models are trained for 200 epochs, while the "final model" with the proposed SMOT loss is trained for 300 epochs. Meanwhile, the base model adopts a step-wise learning rate decay schedule, while the final model uses the more advanced cosine decay. Is this a fair comparison, especially given that both longer training and a sophisticated scheduler benefit OOD detection (Table 5 in [3])?

  4. Lastly, an important ablation study that I believe should be included is how SMOT compares with random masking / cropping. This would better justify SMOT's design of leveraging CAM to determine the masking region.

Related Work

Sec. 5 should be more thorough and informative. Specifically, notice that the general idea of using corrupted / perturbed images associated with soft labels has been explored in at least two works in the field of OOD detection [4, 5]. Among these, [4] is in particular relevant to this work. I put up a table below making high-level comparison between [4] and this work.

| | soft target | perturbation | needs a pre-trained model? |
|---|---|---|---|
| [4] | $y_\epsilon=(1-\epsilon)\cdot y + \epsilon / K \cdot u$ (see their Eq. 3) | image corruptions defined by ImageNet-C | a pre-trained classifier for determining $\epsilon$ |
| SMOT | $y_\epsilon=(1-\epsilon)\cdot y + \epsilon / K \cdot u$ (Eq. 5 in this work) | masking | a pre-trained classifier for generating the CAM mask |

From the above table, it is not obvious what advantages SMOT can offer over [4] (e.g., less compute, not requiring a pre-trained model). Therefore, I believe that [4] should not only be referenced but also included as an actual baseline to show that CAM-based masking and the associated method for assigning the $\epsilon$ value are better than the designs of [4].

Format

The references are inserted absurdly (e.g., "ResNet18 He et al. (2016)"), which I believe is not the most appropriate format. There is also some mis-formatting, e.g., "Eq.equation 8" in "Training details".


[1] Extremely simple activation shaping for out-of-distribution detection

[2] OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection

[3] Open-Set Recognition: A Good Closed-Set Classifier Is All You Need

[4] Bridging In- and Out-of-distribution Samples for Their Better Discriminability

[5] Mixture Outlier Exposure: Towards Out-of-Distribution Detection in Fine-grained Environments

Questions

Please see the questions in Weaknesses.

Comment

We sincerely thank you for your constructive comments! Please find our responses below.

Q1: The conclusion of Theorem 1 is "given a sufficient amount of training data and a small optimal risk, ..., the issue of over-confidence for ID data is highly probable to arise". However, the equation is only related to "over-confidence" (which I assume refers to excessive maximum softmax probability according to Eq. 2) when the loss is exactly cross-entropy loss. If we use label smoothing as the loss (although it is later empirically shown not to work), then there won't be over-confidence by looking at Eq. 4.

A1: Thanks for your valuable comments. Theorem 1 states that networks trained under the ERM principle are likely to be overconfident on ID data. We do not discuss vanilla label smoothing here. We agree that label smoothing can alleviate the problem of overconfidence in neural networks. However, this is more of a conclusion drawn from experimental results, and it is hard for us to follow your conclusion that "if we use label smoothing as the loss, then there won't be over-confidence by looking at Eq. 4". In fact, label smoothing is not widely used in OOD detection. Our experimental results also show that vanilla label smoothing does not improve the model's OOD detection performance. We intuitively believe that this may be because label smoothing is applied to raw ID samples, which is contrary to the goal of OOD detection.

Q2: I can't see how exactly the "over-confidence in OOD data" is reflected in Theorem 2. More elaboration and clarification is necessary.

A2: We apologize for the misunderstanding. Theorem 2 states that for any OOD sample $x$, the upper bound of the risk that the well-trained model $\mathbf{f}_{\boldsymbol{\theta}_S}$ misassigns it to the label space of the ID samples is the right-hand term of the inequality in Theorem 2 (we are sorry, but OpenReview does not seem to be able to compile that equation).

Q3: The concluding paragraph under Theorem 2 makes me lost again. Why do we want to "access real OOD data to reduce the distribution discrepancy during training"? What does it mean to "reduce the distribution discrepancy" (the $d(\theta)$ in Theorem 2?) between ID and OOD? Meanwhile, why do "limited training ID data", "overfitting", and "the failure of ID classification" suddenly become issues for OOD detection?

A3: Thanks for your insightful comments.

In the OE approach, we are allowed to use surrogate OOD data to regularize the model, i.e., to make the model have low confidence on this data. However, the surrogate OOD data clearly has distributional differences from the real OOD data encountered during testing. Here, "reduce the distribution discrepancy" refers to reducing the distribution discrepancy between the surrogate OOD data and the real OOD data.

Q4: Lastly, where are the proofs of the Theorems (or where are the references if they were proved by existing works)?

A4: The proofs are in APPENDIX A. We will highlight it in our revision. Thanks for your kind suggestion.

Q5: Eq. 9 seems a little arbitrary. Why using a temperature-scaled exponential function? What's the intuition behind it? Why (t - 255)? What is the value range of t?

A5: Thanks for pointing out this potentially confusing problem. We design the smoothing parameter function based on the following three principles:

  • The smoothing parameter should be a monotonically decreasing function of the masking threshold, because the more regions are masked, the smoother its label should be.
  • When no region is masked, the smoothing parameter should be (or close to) 0.
  • When all regions are masked, the smoothing parameter should be (or close to) 1.

The designed temperature-scaled exponential function satisfies the above rules, and the steepness of the function can be adjusted simply via one parameter $T$. Of course, other more complex or learnable functions could be considered; this is future work. $t$ is the masking threshold, ranging from 0 to 255, corresponding to the values in the generated heat map. We use $(t-255)$ because when the masking threshold $t$ is 255, the image is not masked, and we should use the hard label, i.e., $\epsilon = 0$.
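For concreteness, here is a minimal sketch of a smoothing-parameter function satisfying the three principles above; the normalization by 255 and the default temperature are our assumptions, not necessarily the paper's exact Eq. (9).

```python
import math

def smoothing_eps(t: float, T: float = 0.1) -> float:
    """Temperature-scaled exponential smoothing parameter (sketch).

    t: masking threshold in [0, 255] (values of the CAM heat map);
       t = 255 means nothing is masked, t = 0 means everything is masked.
    T: temperature controlling the steepness of the curve.
    """
    # Fraction of the heat-map value range above which regions get masked.
    frac = (255.0 - t) / 255.0
    # Monotonically decreasing in t, with eps(255) = 0 and eps(0) = 1.
    return (math.exp(frac / T) - 1.0) / (math.exp(1.0 / T) - 1.0)
```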

Comment

Q6: One limitation of the experiments and presented results is the fact that all considered OOD datasets are far-OOD which are easier to be detected. I expect to see more results on near-OOD splits (e.g., CIFAR-100 or Tiny ImageNet for CIFAR-10, SSB or NINCO for ImageNet), which are more likely to translate to real-world where the OOD images can be extremely similar to ID images.

A6: Thank you for your suggestion. We have added experiments under the near-OOD setting, comparing the OOD detection ability of a model trained with the SMOT loss against a model trained with the standard cross-entropy loss, using CIFAR-10 as the ID dataset and CIFAR-100 and Tiny-ImageNet as the OOD datasets. The results are as follows:

| Method | CIFAR-100 FPR95 | CIFAR-100 AUROC | Tiny-ImageNet FPR95 | Tiny-ImageNet AUROC |
|---|---|---|---|---|
| cross entropy | 62.03 | 87.31 | 59.68 | 87.23 |
| SMOT | 48.14 | 90.69 | 42.29 | 91.30 |

SMOT works well under the near-OOD setting as well.

Q7: The baseline selection seems a bit arbitrary. How does SMOT compare with recent top-performing methods (e.g., ASH [1] as identified by OpenOOD [2])? Also, a highly relevant baseline is missing (see below "Related Work" for details).

A7: Following your constructive comments, we compared ASH and SMOT on CIFAR10, and the results show that SMOT can outperform ASH:

| Method | Avg FPR95 | Avg AUROC |
|---|---|---|
| ASH | 52.17 | 91.04 |
| SMOT | 15.42 | 97.15 |

As for the other two works you mentioned, they do inspire us a lot, especially [4]. Using the classification accuracy of corrupted data on the base model to assign labels to the corrupted data is novel to us. However, since the authors do not seem to have published it at a conference or in a journal, it has very few citations and we had not noticed this work before. The biggest difference between our work and [4] is that we use masking as the perturbation function, and we believe that our approach is more in line with the way humans perceive the world. Also, our label assignment function is continuous while theirs is discrete. We are not sure which of these two approaches is better. We will explore this work more closely in a later revised version. We tentatively compared the performance of SMOT with [4] and [5] on CIFAR-10, as follows:

| Method | Avg FPR95 | Avg AUROC |
|---|---|---|
| [4] | 18.27 | 96.84 |
| MixOE [5] | 13.55 | 97.59 |
| SMOT | 15.42 | 97.15 |

As you can see, our method is slightly stronger than [4], but neither of them beats MixOE, which is reasonable because MixOE uses an extra dataset. Thanks again for your careful review!

Q8: Why are the training budget and learning rate scheduler different between base models and SMOT models? Specifically, base models are trained for 200 epochs, while the "final model" with the proposed SMOT loss is trained for 300 epochs. Meanwhile, the base model adopts a step-wise learning rate decay schedule, while the final model uses the more advanced cosine decay. Is this a fair comparison, especially given that both longer training and a sophisticated scheduler benefit OOD detection (Table 5 in [3])?

A8: Thanks for the heads up. We trained our base model following the standard way of training a CIFAR-10 classification model. When training the final model, we intuitively trained for more epochs due to the increased diversity of samples, and chose cosine decay because it does not require tuning. We re-trained the base model in the same way as the final model and found that it does lead to a small performance improvement (average AUROC from 91.27% to 92.02% and average FPR95 from 51.13% to 48.53%). We will correct the corresponding baselines in our revision for a fairer comparison.

Q9: Lastly, an important ablation study that I believe should be included is how SMOT compares with random masking / cropping. This would better justify SMOT's design of leveraging CAM to determine the masking region.

A9: Thank you for your suggestion; we have added the experiment with randomly generated masks. Please see A3 to Reviewer H2kX.

Q10: The references are inserted absurdly (e.g., "ResNet18 He et al. (2016)"), which I believe is not the most appropriate format. There is also some mis-formatting, e.g., "Eq.equation 8" in "Training details".

A10: Thanks for the heads up, we've corrected the error in the revised version.

Review
Rating: 6

This paper proposes label smoothing training framework for OOD detection. The authors use the CAM to identify the regions that have a strong correlation to the true label, and generate a masked input image and corresponding soft label for smooth training. Extensive experiments show that the smooth training strategy greatly improves the OOD performance with different score functions.

Strengths

  1. The proposed smooth training (SMOT) strategy, where soft labels are applied to the perturbed inputs, is technically sound for relieving the overconfidence problem.
  2. The image masking and label smoothing strategy is quite novel and makes sense.
  3. The paper is well structured, well presented, and well written.

Weaknesses

  1. It is a little bit expensive to use CAM for identifying those label-correlated regions. I would like to see the OOD detection performance with the randomly generated masks. For example, randomly masking 30%-70% of the image for smoothing training.
  2. The proposed SMOT utilizes data augmentation for OOD detection. Therefore, the authors should introduce and compare more related methods that investigate the effectiveness of data augmentation in OOD detection. I believe there have been many papers exploring data augmentation for calibration or OOD detection [1,2].
  3. The SMOT framework is similar to the Outlier Exposure (OE) framework, the author should also compare the proposal with other Outlier exposure (OE) based methods, and discuss the advantages compared with the OE framework.

[1] RankMixup: Ranking-Based Mixup Training for Network Calibration

[2] Out-of-Distribution Detection with Implicit Outlier Transformation

Questions

  1. What model is used in Table 2/3/4?
Comment

We sincerely thank you for your constructive comments! Please find our responses below.

Q1: It is a little bit expensive to use CAM for identifying those label-correlated regions. I would like to see the OOD detection performance with the randomly generated masks. For example, randomly masking 30%-70% of the image for smoothing training.

A1: Thank you for your kind suggestion. Following your constructive comments, we have added the experiment with randomly generated masks. More detailed results and discussions can be found in A3 to Reviewer H2kX.

Q2: The proposed SMOT utilizes data augmentation for OOD detection. Therefore, the authors should introduce and compare more related methods that investigate the effectiveness of data augmentation in OOD detection. I believe there have been many papers exploring data augmentation for calibration or OOD detection.

A2: Thank you for your kind suggestion.

For the first paper you mentioned, RankMixup [1], it does have similarities to our work at the conceptual level. It argues that mixed samples should have lower confidence, and that the more mixing, the lower the confidence. In our work, we argue that the less complete the sample, the lower the confidence. We take inspiration from the human perspective and design our algorithm accordingly, and we believe that our approach more closely aligns with how humans perceive the world. However, since RankMixup is a relatively new piece of work, the authors have not yet published its code, so we have not compared it with SMOT. As for the other mentioned work, DOE [2], it uses a min-max learning scheme, searching for synthesized OOD data that leads to the worst judgments and learning from such OOD data for uniform performance in OOD detection. The biggest difference between us and them is that we create a smooth transition between ID data and OOD data. We compared DOE and SMOT on CIFAR-10. The results are as follows:

| Method | FPR95 | AUROC |
|---|---|---|
| DOE | 5.15 | 98.78 |
| SMOT | 15.42 | 97.15 |

As can be seen from the experimental results, SMOT is not as good as DOE. However, this is not a fair comparison: DOE belongs to the Outlier Exposure (OE) framework and requires surrogate OOD data, while SMOT does not require any additional data. We will add a discussion of these outstanding works in the revised version.

Q3: The SMOT framework is similar to the Outlier Exposure (OE) framework, the author should also compare the proposal with other Outlier exposure (OE) based methods, and discuss the advantages compared with the OE framework.

A3: We agree with your point. Our approach is indeed similar to OE. Masked samples can be viewed as OOD samples.

We would like to highlight the differences between ours and OE. Specifically, we do smoothing between ID samples and OOD samples: the more an ID sample is masked, the more it is considered an OOD sample. This makes the transition between ID and OOD smoother. Also, our OOD samples are created by simply masking ID samples, without the need for an external dataset. We compared our method with some OE methods, with the following results:

| Method | FPR95 | AUROC |
|---|---|---|
| OE [3] | 12.41 | 97.85 |
| MixOE [4] | 13.55 | 97.59 |
| SMOT | 15.42 | 97.15 |

As can be seen from the experimental results, SMOT is not as good as OE and MixOE. We would like to note that this is not a fair comparison.

Thanks again for your constructive comments. We will add the results and discussions to the revised paper.

[1] RankMixup: Ranking-Based Mixup Training for Network Calibration

[2] Out-of-Distribution Detection with Implicit Outlier Transformation

[3] Deep Anomaly Detection with Outlier Exposure

[4] Mixture Outlier Exposure: Towards Out-of-Distribution Detection in Fine-grained Environments

Review
Rating: 6

This paper proposes a new training strategy called Smooth Training (SMOT) to improve out-of-distribution (OOD) detection performance. The key idea is to apply label smoothing to perturbed inputs rather than original inputs during training. Specifically, the authors randomly mask label-relevant regions of input images identified by class activation maps. The labels for these masked images are softened proportional to the size of the masked regions. This forces the model to output lower confidence for partial inputs, widening the gap between in- and out-of-distribution examples.
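Putting the described pipeline into code, the following is a rough PyTorch-style sketch of one SMOT training step; `compute_cam` is a hypothetical helper, and the threshold distribution and smoothing function are assumptions rather than the paper's exact choices.

```python
import math
import torch
import torch.nn.functional as F

def smot_step(model, cam_model, x, y, num_classes, T=0.1):
    """One SMOT training step (sketch). `compute_cam` is a hypothetical helper
    returning a per-image CAM heat map in [0, 255] for the true class."""
    heatmap = compute_cam(cam_model, x, y)                      # (B, H, W), values in [0, 255]
    # Sample a masking threshold per image (the paper samples from a chosen
    # distribution; uniform sampling here is an assumption).
    t = torch.randint(0, 256, (x.size(0),), device=x.device).float()
    # Mask out label-relevant regions: pixels whose CAM value exceeds the threshold.
    keep = (heatmap <= t.view(-1, 1, 1)).float().unsqueeze(1)   # (B, 1, H, W)
    x_masked = x * keep
    # Soften the label in proportion to how much is masked (t = 255 -> eps = 0).
    eps = (torch.exp((255.0 - t) / 255.0 / T) - 1.0) / (math.exp(1.0 / T) - 1.0)
    onehot = F.one_hot(y, num_classes).float()
    soft = (1.0 - eps).unsqueeze(1) * onehot + (eps / num_classes).unsqueeze(1)
    # Cross-entropy against the soft target.
    logp = F.log_softmax(model(x_masked), dim=1)
    return -(soft * logp).sum(dim=1).mean()
```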

Strengths

  • The proposed smooth training strategy is intuitive and simple to implement, requiring only small modifications to the standard training procedure.
  • Thorough theoretical analysis is provided on how the commonly used cross-entropy loss leads to overconfidence, and how smooth training can mitigate this issue.
  • Comprehensive experiments on CIFAR and ImageNet-200 benchmarks demonstrate SMOT consistently improves OOD detection across different base models, scoring functions, and datasets. Improvements are also shown when fine-tuning CLIP.
  • Ablation studies validate the efficacy of key components like the label smoothing function and masking threshold sampling distribution.

Weaknesses

  • Although smooth training enhances Out-Of-Distribution (OOD) detection, there's a minor decrease in in-distribution accuracy compared to conventional training. An in-depth exploration of this trade-off could be beneficial.
  • The authors employ class activation maps to pinpoint label-relevant regions for masking, necessitating a pre-trained model. Studying other perturbation techniques that don't require a pre-trained model could expand the method's applicability.
  • Further analysis could be devoted to how sensitive the results are to variations in hyperparameter settings. For instance, how essential is the use of a CAM heatmap to guide masking? What's the optimal way to establish the relationship between mask size and the label smoothing hyperparameter?

================================

Comments after Rebuttal:

Thanks for your response.

  1. The paper would greatly benefit from additional experimental analysis regarding the hyper-parameter $\lambda$. There is a discernible balance to be struck between in-distribution (ID) accuracy and OOD performance, which varies with $\lambda$. The sensitivity of this trade-off to different datasets is not sufficiently addressed. I encourage the authors to explore this aspect further to enhance the utility of their approach.

  2. The SMOT technique appears to depend on a pre-trained model to generate the masking (noting that the use of random noise results in subpar OOD performance), a process which adds computational complexity. The paper does not elaborate on alternative methods for 'dirtying' clean images beyond masking or the application of random noise. Exploring and discussing potential alternative techniques would provide a more comprehensive understanding of the method's applicability and limitations.

Given these considerations, I am inclined to raise my score. I am hopeful that the authors will take these comments into account and address them in the final manuscript to strengthen the paper's contribution.

Questions

See Weaknesses.

Comment

Q3: Further analysis could be devoted to how the results are sensitive to variations in hyperparameter settings. For instance, how essential is the use of a CAM heatmap to guide masking? What's the optimal way to establish the relationship between mask size and label smoothing hyperparameter?

A3: We apologize that we missed the important experiment of comparing with random masking.

As suggested by you and other reviewers, we added this experiment on CIFAR-10. Since the images in CIFAR-10 are 32×32, we divide each image into 64 small 4×4 squares; in each loop, we sample a probability $p$ from the distribution $\mathrm{Beta}(\alpha, \beta)$, mask each small square independently with probability $p$, and then smooth the label. The smoothing parameter is designed as $\epsilon(p) = (\exp(p/T) - 1)/(\exp(1/T) - 1)$. We conducted experiments with different $\alpha, \beta$, and $T$. The results are as follows (the numbers in the table are the average AUROC / the average FPR95); a minimal code sketch of this variant is given after the table:

| $(\alpha, \beta)$ | $T=10$ | $T=1$ | $T=0.3$ | $T=0.1$ |
|---|---|---|---|---|
| (1, 1) | 92.91 / 39.02 | 93.76 / 35.14 | 93.62 / 36.78 | 92.08 / 44.14 |
| (2, 2) | 91.50 / 37.48 | 91.27 / 42.00 | 90.30 / 41.09 | 91.47 / 45.07 |
| (2/3, 2/3) | 89.38 / 44.34 | 90.97 / 41.33 | 89.31 / 46.17 | 92.24 / 41.27 |
| (5, 2) | 91.79 / 43.34 | 90.95 / 44.55 | 91.59 / 46.87 | 91.79 / 43.34 |
| (2, 5) | 90.17 / 37.44 | 89.00 / 42.54 | 90.12 / 40.23 | 92.28 / 42.02 |
| (50, 20) | 91.23 / 45.54 | 93.28 / 36.55 | 85.33 / 54.66 | 91.83 / 42.89 |
| (20, 50) | 89.64 / 35.99 | 87.47 / 49.29 | 86.64 / 45.58 | 92.56 / 39.57 |
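For reference, a minimal sketch of this random-masking variant; the tensor layout and exact sampling details are assumptions, not the paper's implementation.

```python
import math
import torch

def random_block_mask(x, alpha, beta, T, block=4):
    """Randomly mask 4x4 blocks of 32x32 images and return the soft-label
    smoothing parameter, as described above (details assumed)."""
    b = x.size(0)
    # Per-image masking probability p ~ Beta(alpha, beta).
    p = torch.distributions.Beta(alpha, beta).sample((b,)).to(x.device)
    # Drop each block independently with probability p.
    keep = (torch.rand(b, 1, 32 // block, 32 // block, device=x.device)
            >= p.view(-1, 1, 1, 1)).float()
    keep = keep.repeat_interleave(block, dim=2).repeat_interleave(block, dim=3)
    # Smoothing parameter eps(p) = (exp(p/T) - 1) / (exp(1/T) - 1).
    eps = (torch.exp(p / T) - 1.0) / (math.exp(1.0 / T) - 1.0)
    return x * keep, eps
```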

As can be seen from the experimental results, when CAM is not used to guide the masking, the method does not work well and in many cases yields lower performance than the simple cross-entropy loss. This is because masking randomly does not necessarily change the label of the image; it does not make sense to soften the label when we only mask the background.

As for how to establish a relationship between the masking size and the label smoothing hyperparameter, this is indeed a major drawback of SMOT at the moment. In our paper, we simply use a temperature-scaled exponential function with an adjustable temperature. A better approach might be to obtain the relation between masking size and smoothing hyperparameter from prior knowledge, or to learn such a relation. This is our future research direction.

Comment

We sincerely thank you for your constructive comments! Please find our responses below.

Q1: Although smooth training enhances Out-Of-Distribution (OOD) detection, there's a minor decrease in in-distribution accuracy compared to conventional training. An in-depth exploration of this trade-off could be beneficial.

A1: Thanks for your suggestion.

Smooth training does lead to a minor decrease (from 95.12% to 94.54%) in the classification accuracy on ID samples. Following your suggestion, we further explore the trade-off between OOD detection performance and ID classification performance on CIFAR-10. We do this by varying the value of the hyperparameter $\lambda$. The experimental results are as follows:

(cells show FPR95 / AUROC)

| Method | Texture | SVHN | iSUN | Places | LSUN | Average | ID ACC |
|---|---|---|---|---|---|---|---|
| w/o SMOT | 58.59 / 88.59 | 55.71 / 91.92 | 50.80 / 91.80 | 57.85 / 88.70 | 32.71 / 95.33 | 51.13 / 91.27 | 95.12 |
| $\lambda = 0.01$ | 35.93 / 94.25 | 18.92 / 96.82 | 25.74 / 96.01 | 46.27 / 91.26 | 12.38 / 97.96 | 27.78 / 95.26 | 94.72 |
| $\lambda = 0.05$ | 27.44 / 95.15 | 12.15 / 97.57 | 14.27 / 97.38 | 31.79 / 93.73 | 2.01 / 99.35 | 17.53 / 96.63 | 94.56 |
| $\lambda = 0.1$ | 23.01 / 96.24 | 8.27 / 98.21 | 12.27 / 97.89 | 31.52 / 93.88 | 2.04 / 99.56 | 15.42 / 97.15 | 94.54 |
| $\lambda = 0.3$ | 37.92 / 93.91 | 26.92 / 95.81 | 28.84 / 95.67 | 45.06 / 91.79 | 21.16 / 96.81 | 31.98 / 94.79 | 94.22 |
| $\lambda = 0.5$ | 26.24 / 95.65 | 21.05 / 96.32 | 19.68 / 96.89 | 43.83 / 91.96 | 14.09 / 97.67 | 25.97 / 95.70 | 94.50 |
| $\lambda = 1$ | 31.96 / 94.46 | 27.23 / 95.72 | 17.25 / 97.07 | 42.03 / 91.85 | 8.53 / 98.15 | 25.5 / 95.45 | 94.02 |

The experimental results show that the ID accuracy of models trained with SMOT is lower than that of models trained with the normal cross-entropy loss, and that a larger $\lambda$ usually leads to lower ID accuracy but not necessarily to higher AUROC. In practice, we need to choose an appropriate $\lambda$ so that the model maintains both high ID classification ability and good OOD detection ability.

Q2: The authors employ class activation maps to pinpoint label-relevant regions for masking, necessitating a pre-trained model. Studying other perturbation techniques that don't require a pre-trained model could expand the method's applicability.

A2: Thank you for your kind suggestion.

In addition to using masking as the perturbation function, we also experiment with adding Gaussian noise to the inputs as the perturbation function, which does not require an additional pre-trained model. We report the experimental results in Table 9 in the Appendix:

(cells show FPR95 / AUROC)

| Method | Texture | SVHN | iSUN | Places | LSUN | Average |
|---|---|---|---|---|---|---|
| MSP | 58.59 / 88.59 | 55.71 / 91.92 | 50.80 / 91.80 | 57.85 / 88.70 | 32.71 / 95.33 | 51.13 / 91.27 |
| SMOT (mask) | 23.01 / 96.24 | 8.27 / 98.21 | 12.27 / 97.89 | 31.52 / 93.88 | 2.04 / 99.56 | 15.42 / 97.15 |
| SMOT (noise) | 41.40 / 93.05 | 36.78 / 94.64 | 27.91 / 95.87 | 44.64 / 91.40 | 23.40 / 96.77 | 34.68 / 94.35 |

It is shown that adding noise can also improve the OOD detection performance of the model.

AC Meta-Review

This paper proposes a label smoothing based training framework for OOD detection, called Smooth Training (SMOT). The key idea is to apply label smoothing to perturbed inputs rather than original inputs during training. Specifically, the authors use class activation maps to identify the regions that have a strong correlation to the true label, and randomly mask out those regions. The labels for these masked images are softened proportional to the size of the masked regions. This forces the model to output lower confidence for masked inputs. Experiments on CIFAR-10/100 and ImageNet-200 show that the smooth training strategy improves the OOD performance with different score functions over certain existing methods.

While the results look promising, the idea of training with perturbed data for better calibration of certainty is not novel; moreover, utilizing it for OOD detection requires extra hyperparameters, which is not ideal. The method requires a pretrained model to use CAM, as well as training all the models from scratch, as opposed to post-hoc OOD detection techniques. All the incurred complexity makes the method less favorable.

The reviewers give very reasonable suggestions to improve the paper, including additional experimental analysis regarding the hyper-parameter, as well as moving away from using CAM. More careful comparisons to closely related methods like OE and RankMixup would also be beneficial.

Why not a higher score

While the results look promising, the idea of training with perturbed data for better calibration of certainty is not novel; moreover, utilizing it for OOD detection requires extra hyperparameters, which is not ideal. The method requires a pretrained model to use CAM, as well as training all the models from scratch, as opposed to post-hoc OOD detection techniques. All the incurred complexity makes the method less favorable. Overall, this work's contribution is mainly engineering tricks on many levels, and it does not yet meet the bar of ICLR.

Why not a lower score

N/A

Final Decision

Reject