PaperHub
5.0 / 10
Rejected · 4 reviewers
Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.8
ICLR 2024

Split-Ensemble: Efficient OOD-aware Ensemble via Task and Model Splitting

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-02-11
TL;DR

We design a subtask splitting training objective for OOD-aware ensemble training, and correspondingly a tree-like Split-Ensemble architecture to efficiently learn the split tasks.

Abstract

Keywords
OOD · Ensemble · Efficient model architecture

Reviews and Discussion

Review (Rating: 5)

This paper proposes a subtask-splitting ensemble training objective to enhance out-of-distribution (OOD) detection as well as uncertainty estimation. In detail, the authors split the original classification task into several complementary subtasks. When we focus on one subtask, data from the other subtasks can be considered OOD data, so the training scheme can take both the ID and OOD tasks into consideration. In addition, the authors propose a tree-like Split-Ensemble architecture that splits and prunes the networks based on one shared backbone that extracts low-level features. To verify the proposed method, the authors conduct experiments on several image classification datasets such as CIFAR-10, CIFAR-100, and Tiny-ImageNet. The classification results on ID data show an improvement in accuracy, and by the OOD detection criteria the OOD detection ability seems to improve significantly.

Strengths

The authors offer a clear presentation of the proposed method, and the idea is quite interesting: it can be viewed as using multi-task learning and a domain classifier to enhance performance. The paper presents all the details of the training scheme clearly, including how class imbalance is handled and how the subtasks are split. For the splitting and pruning process, the authors propose a novel splitting criterion and utilize global pruning to reduce the model size. Extensive experiments are conducted to verify the proposed method, and under the proposed OOD setting the improvement is very significant. Further analysis of the task splitting is also presented.

Weaknesses

1. In Table 1, the authors present classification results on several datasets, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. On CIFAR-10, the proposed method is slightly better than single models, while the deep ensemble shows a significant drop; on CIFAR-100, however, the deep ensemble enhances performance significantly. This inconsistency is strange. In addition, since the proposed method can optimize the network structure, other methods focusing on structure search could be considered for a more complete comparison.

2. The improvement on Tiny-ImageNet is very significant; could the authors show the performance on ImageNet? A significant improvement on ImageNet would be exciting.

3. For OOD detection, could the authors use commonly used OOD detection datasets, or report the performance of other OOD detection methods in your setting?

4. For related work, it would be better to add some works on split-based structure search, such as [1]-[3].

[1] Wang D, Li M, Wu L, et al. Energy-aware neural architecture optimization with fast splitting steepest descent. arXiv preprint arXiv:1910.03103, 2019.

[2] Wu L, Wang D, Liu Q. Splitting steepest descent for growing neural architectures. Advances in Neural Information Processing Systems, 2019, 32.

[3] Wu L, Ye M, Lei Q, et al. Steepest descent neural architecture optimization: Escaping local optimum with signed neural splitting. arXiv preprint arXiv:2003.10392, 2020.

Questions

Please refer to Weakness.

Comment

We thank reviewer Fg3Y for your thorough reviews and valuable comments. Hopefully the following responses can address your concerns.

 

W1: Classification Results Comparison in Tab. 1

A1: In response to your concern about Table 1, we have re-implemented the benchmarking results on CIFAR-10 and re-evaluated Deep Ensemble's performance on CIFAR-10. These results are updated in the revised paper.

The following table shows the updated image classification results on CIFAR-10. All accuracies are given in percentage with ResNet-18/ResNet-34 as the backbone. Best score for each metric in bold, second-best italicized. We re-implement all baseline methods using default hyperparameters.

| Method | FLOPs | CIFAR-10* |
| --- | --- | --- |
| Single | 1x | 94.7 / 95.2 |
| Deep-ensemble | 4x | 95.7 / 95.5 |
| MC-Dropout | 4x | 93.3 / 90.1 |
| MIMO | 4x | 86.8 / 87.5 |
| MaskEnsemble | 4x | 94.3 / 90.8 |
| BatchEnsemble | 4x | 94.0 / 91.0 |
| Film-Ensemble | 4x | 87.8 / 94.3 |
| Split-Ensemble (ours) | 1x | 95.5 / 95.6 |

  • The observation is similar to the original Tab. 1 in the paper: Split-Ensemble beats the single model significantly without additional computation cost. Compared to previous ensemble methods, Split-Ensemble also achieves similar or better performance at significantly lower cost.

To understand the impact of structure-search methods, we report the results of applying the same sensitivity-based pruning strategy to prune a naive ensemble model down to single-model computation cost. We use the SC-OOD benchmark as suggested by Reviewer 37MR.

 

Image classification and OOD detection results on the SC-OOD benchmarks. Models with ResNet18 as the backbone are trained and tested on the CIFAR-10 dataset. Best score for each metric in bold, second-best italicized.

| Method | FLOPs | Accuracy ↑ | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- | --- |
| Naive Ensemble | 4x | 95.7 | 42.34 | 90.4 | 90.6 |
| Naive Ensemble pruned | 1x | 94.0 | 48.19 | 90.3 | 88.7 |
| Split-Ensemble (ours) | 1x | 95.5 | 45.5 | 91.1 | 89.9 |

  • Note that without the subtask-splitting training objective and the parameter sharing enabled by splitting, pruning the naive ensemble down to single-model cost significantly hurts its performance, whereas Split-Ensemble achieves performance similar to the naive ensemble at significantly lower cost.

 

W2: Performance on ImageNet

A2: Ensemble-based methods are not commonly tested on the ImageNet dataset due to their significant computation cost. As Split-Ensemble achieves single-model computation cost, here we perform an evaluation on ImageNet as suggested. The following table compares image classification results on the large-scale ImageNet1K dataset with ResNet18 as backbone. Best score in bold.

| Method | Acc ↑ |
| --- | --- |
| Single | 69.0 |
| Naive Ensemble | 69.4 |
| Split-Ensemble (ours) | 70.9 |

  • Our Split-Ensemble model outperforms the single model and the ensemble model by 1.9% and 1.5%, respectively, demonstrating the effectiveness of our method on a large-scale dataset. Details of the ImageNet experiment settings are available in the appendix of the revised paper.
Comment

W3: Common OOD Detection Dataset

A3: Thanks for the advice. We have tested our method on the recent SC-OOD benchmark [C1] and compared against other OOD detection methods, including ODIN [C2], EBO [C3], OE [C4], MCD [C5], MC-Dropout [C6], MIMO [C7], MaskEnsemble [C8], BatchEnsemble [C9], and FilmEnsemble [C10]. The results are listed below.

The following two tables show the comparison between previous state-of-the-art methods and ours on the SC-OOD CIFAR10 benchmarks. The results are reported for models with ResNet-18 backbone. Some baseline methods use the Tiny-Imagenet dataset as additional OOD training data. Best score for each metric in bold, second-best italicized.

| Method | Additional Data | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| ODIN | | 52.0 | 82.0 | 85.1 |
| EBO | | 50.0 | 83.8 | 85.1 |
| OE | | 50.5 | 88.9 | 87.8 |
| MCD | | 73.0 | 83.9 | 80.5 |
| UDG | | 55.6 | 90.7 | 88.3 |
| UDG | | 36.2 | 93.8 | 92.6 |
| Split-Ensemble (ours) | | 45.5 | 91.1 | 89.9 |
  • The Split-Ensemble model outperforms single-model approaches in OOD detection without incurring additional computational costs or requiring extra training data. Its consistent high performance across key metrics highlights its robustness and efficiency, underscoring its practical utility in OOD tasks.

 

| Method | FLOPs | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 4x | 42.3 | 90.4 | 90.6 |
| MC-Dropout | 4x | 54.9 | 88.7 | 88.0 |
| MIMO | 4x | 73.7 | 83.5 | 80.9 |
| MaskEnsemble | 4x | 53.2 | 87.7 | 87.9 |
| BatchEnsemble | 4x | 50.4 | 89.2 | 88.6 |
| FilmEnsemble | 4x | 42.6 | 91.5 | 91.3 |
| Split-Ensemble (ours) | 1x | 45.5 | 91.1 | 89.9 |

  • The Split-Ensemble model consistently outshines other ensemble-based methods in both image classification and OOD detection, achieving similar or even better performance at 4× lower computation cost.

 

The following table shows the comparison between previous state-of-the-art ensemble-based methods and ours on the SC-OOD CIFAR10-LT benchmarks. The results are reported for models with ResNet-18 backbone. Best score for each metric in bold, second-best italicized.

| Method | Accuracy ↑ | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 12.7 | 98.4 | 45.3 | 50.9 |
| MC-Dropout | 63.4 | 90.6 | 66.6 | 66.1 |
| MIMO | 35.7 | 96.3 | 55.1 | 56.9 |
| MaskEnsemble | 67.7 | 89.0 | 66.82 | 67.4 |
| BatchEnsemble | 70.1 | 87.45 | 68.0 | 68.7 |
| FilmEnsemble | 72.5 | 84.32 | 75.5 | 76.0 |
| Split-Ensemble (ours) | 73.7 | 80.5 | 81.7 | 77.6 |

  • The Split-Ensemble model excels at handling challenging long-tailed datasets, as evidenced by its top-tier performance across major metrics on the SC-OOD benchmarks. This is particularly notable given its efficiency, as it attains these results without incurring additional computational costs.

 

[C1] Yang et al., Semantically coherent out-of-distribution detection, ICCV, 2021

[C2] Liang et al., Enhancing the reliability of out-of-distribution image detection in neural networks, ICLR, 2018

[C3] Liu et al., Energy-based out-of-distribution detection, NeurIPS, 2020

[C4] Hendrycks et al., Deep anomaly detection with outlier exposure, ICLR, 2019

[C5] Yu et al., Unsupervised out-of-distribution detection by maximum classifier discrepancy, ICCV, 2019

[C6] Gal et al., Dropout as a bayesian approximation: Representing model uncertainty in deep learning, ICML, 2016

[C7] Havasi, et al., Training independent subnetworks for robust prediction, ICLR, 2021

[C8] Durasov et al., Masksembles for uncertainty estimation, CVPR, 2021

[C9] Wen et al., Batchensemble: an alternative approach to efficient ensemble and lifelong learning, ICLR, 2020

[C10] Turkoglu et al., Film-ensemble: Probabilistic deep learning via feature-wise linear modulation, NeurIPS, 2022

 

W4: Split-based structure search

A4: We would like to thank the reviewer for bringing up these papers. We have included them in the discussion of Sec. 2.3. Specifically, the mentioned papers propose architecture splitting to improve the learnability of a single model, where the goal is to increase the number of filters in certain layers for better model capacity. Split-Ensemble, on the other hand, uses architecture splitting as a way of deriving an efficient architecture under the multi-task learning scenario of subtask-splitting training. We split the architecture into a tree-like structure with each branch corresponding to a single subtask, which motivates our splitting and pruning algorithm based on correlation and sensitivity, as in Sec. 4.

Comment

Dear Reviewer Fg3Y

We apologize for any inconvenience our request may cause during your busy schedule.

We hope that our response has provided clarification for your concerns. As today is the last day for author-reviewer communication, we would greatly appreciate it if you could please let us know if we could provide any further clarifications about the paper.

Thanks so much again for taking the time to review our paper.

Best regards,

Split-Ensemble Authors

Comment

I appreciate the response, but I will keep my score.

Comment

Dear Reviewer Fg3Y

Thank you once again for your valuable insights and the time you've invested in reviewing our work. We have made a concerted effort to thoroughly address each of the concerns you raised. We hope that our responses and the subsequent revisions to our manuscript reflect a clear understanding and effective resolution of the issues identified.

We would be grateful to know if our revised approach and explanations meet your expectations and satisfactorily address all your concerns. Your feedback is instrumental in enhancing the quality of our research, and we are open to any further suggestions or clarifications you may have.

In light of the revisions and clarifications provided, we kindly request you to reconsider the aspects of our work that you found less convincing initially. If you feel that our responses and amendments have substantially improved the manuscript, we would greatly appreciate a re-evaluation of your score.

We look forward to your continued guidance and are ready to make any further improvements as per your recommendations.

Best regards,

Split-Ensemble Authors

Review (Rating: 5)

In this paper, a new method, Split-Ensemble, is proposed to improve the accuracy and OOD detection of a single model by splitting a multi-class classification task into multiple complementary subtasks. A dynamic splitting and pruning algorithm based on correlation and sensitivity is proposed to construct a more efficient tree-like Split-Ensemble model, which performs well in several experiments.

Strengths

  1. An innovative approach to task splitting and model partitioning is proposed, which can improve the performance and reliability of a single model without increasing the computational overhead.
  2. The data distribution information in the original task is effectively utilized to achieve OOD-aware training without external data.
  3. An automated splitting and pruning algorithm is designed that dynamically adjusts the model structure according to the correlation and sensitivity between subtasks.
  4. Full experiments on multiple publicly available datasets demonstrate that the Split-Ensemble approach outperforms the baselines.

Weaknesses

  1. There is no adequate theoretical analysis or discussion of the principles of subtask splitting, and no explanation of how to choose the optimal number of subtasks or how to divide the categories.
  2. Lack of a detailed explanation of the definition and importance of OOD-awareness in some sections.
  3. No experiments are conducted on more complex or larger datasets, and there is relatively little discussion of the limitations of the approach and potential directions for improvement.

Questions

  1. In the introduction on page 1, please enhance the background on uncertainty estimation.
  2. Does the subtask splitting mentioned in the text take category imbalance into account? Please clarify.
  3. The amount of visualization in the experimental section is low; it is suggested to add more.
  4. Please describe the structure of your Split-Ensemble model in detail in one paragraph, including the detailed construction of each submodel.
  5. For the evaluation of the model, could you provide more description of the evaluation metrics, such as the definition and calculation of AUROC?
  6. Please derive equations (1) and (2) in detail to help the reader better understand your thinking.
  7. In the concluding section, could there be a more detailed discussion of future directions or potential applications of this methodology?
  8. Throughout the paper, could an additional time complexity analysis of the method be considered?

Details of Ethics Concerns

N/A

Comment

We thank reviewer FimD for your thorough reviews and valuable comments. Hopefully the following responses can address your concerns.

 

W1: Optimal subtask number and category grouping

A1: This work focuses on the training objective of subtask-splitting training for OOD awareness and on achieving an efficient architecture with iterative splitting and pruning, while for the subtask number and category grouping we use a heuristic design. We design the subtask splitting based on the intuition that semantically close classes should be grouped together (Sec. 3.1) so as to make the ID and OOD data in each subtask more distinguishable, and the results in Table 5 show this choice truly makes a difference. For selecting the number of task splits, the tradeoff is that a larger number of splits enables each submodel to learn its OOD-aware objective more easily with fewer ID classes, therefore leading to better AUROC; yet the performance may suffer from aggressive pruning to fit the additional branches within the budget. This is observed in Tab. 6 in Appendix C. We choose the optimal subtask number for experiments through this ablation trial, and leave a more rigorous analysis of the optimal subtask grouping strategy as future work.

 

W2: OOD-awareness

A2: We take the intuition of OOD-awareness from the previous work Outlier Exposure (OE): a model better detects OOD data if it is exposed to OOD samples during training, thus becoming OOD-aware. While OE and its variants achieve OOD-awareness with an external OOD dataset, our Split-Ensemble utilizes a novel subtask-splitting training objective to make each submodel OOD-aware with only ID data.
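To make the label construction concrete, below is a minimal sketch of how per-subtask targets could be formed, with every class outside a submodel's subtask mapped to an extra OOD class. This is our own illustration, not the paper's code; the function name and tensor layout are assumptions.

```python
import torch

def make_subtask_targets(labels, subtask_classes):
    """Map global class labels to subtask labels: classes inside the subtask keep
    a local index, every other class is mapped to one extra 'OOD' class."""
    ood_index = len(subtask_classes)                      # last index reserved for OOD
    class_to_local = {c: i for i, c in enumerate(subtask_classes)}
    return torch.tensor(
        [class_to_local.get(int(y), ood_index) for y in labels], dtype=torch.long
    )

# Example: a 10-class task split into two 5-class subtasks.
labels = torch.tensor([0, 3, 7, 9])
print(make_subtask_targets(labels, subtask_classes=[0, 1, 2, 3, 4]))  # tensor([0, 3, 5, 5])
```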

 

W3: Experiment on complex dataset

A3: Following your suggestion, we further compare our approach with baselines on ImageNet1k. Ensemble-based methods are not commonly tested on the ImageNet dataset due to their significant computation cost. As Split-Ensemble achieves single-model computation cost, here we perform an evaluation on ImageNet as suggested. The following table compares image classification results on ImageNet1k with ResNet18 as backbone.

| Method | Acc ↑ |
| --- | --- |
| Single | 69.0 |
| Naive Ensemble | 69.4 |
| Split-Ensemble | 70.9 |

  • Split-Ensemble outperforms the single model and the 4× larger ensemble model by 1.9% and 1.5%, respectively. Details of the ImageNet experiment settings are available in the appendix of the revised paper.

 

Besides ImageNet, we follow the suggestion of Reviewer 37MR to include the new SC-OOD benchmark [C3] and the CIFAR-10-LT dataset [C4] in our experiments, along with additional baselines ODIN [C1], EBO [C5], OE [C2], MCD [C6], and UDG [C3]. The results are listed below and have been included in the Appendix of the revised manuscript.

The following two tables show the comparison between previous state-of-the-art methods and ours on the SC-OOD CIFAR10 benchmarks. The results are reported for models with ResNet-18 backbone. Some baseline methods use the Tiny-Imagenet dataset as additional OOD training data. Best score for each metric in bold, second-best italicized.

| Method | Additional Data | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| ODIN | | 52.0 | 82.0 | 85.1 |
| EBO | | 50.0 | 83.8 | 85.1 |
| OE | | 50.5 | 88.9 | 87.8 |
| MCD | | 73.0 | 83.9 | 80.5 |
| UDG | | 55.6 | 90.7 | 88.3 |
| UDG | | 36.2 | 93.8 | 92.6 |
| Split-Ensemble (ours) | | 45.5 | 91.1 | 89.9 |
  • The Split-Ensemble model outperforms single-model approaches in OOD detection without incurring additional computational costs or requiring extra training data. Its consistent high performance across key metrics highlights its robustness and efficiency, underscoring its practical utility in OOD tasks.

 

| Method | FLOPs | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 4x | 42.3 | 90.4 | 90.6 |
| MC-Dropout | 4x | 54.9 | 88.7 | 88.0 |
| MIMO | 4x | 73.7 | 83.5 | 80.9 |
| MaskEnsemble | 4x | 53.2 | 87.7 | 87.9 |
| BatchEnsemble | 4x | 50.4 | 89.2 | 88.6 |
| FilmEnsemble | 4x | 42.6 | 91.5 | 91.3 |
| Split-Ensemble (ours) | 1x | 45.5 | 91.1 | 89.9 |

  • The Split-Ensemble model consistently outshines other ensemble-based methods in both image classification and OOD detection, achieving similar or even better performance at 4× lower computation cost.
Comment

The following table shows the comparison between previous state-of-the-art ensemble-based methods and ours on the SC-OOD CIFAR10-LT benchmarks. The results are reported for models with ResNet-18 backbone. Best score for each metric in bold, second-best italicized.

| Method | Accuracy ↑ | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 12.7 | 98.4 | 45.3 | 50.9 |
| MC-Dropout | 63.4 | 90.6 | 66.6 | 66.1 |
| MIMO | 35.7 | 96.3 | 55.1 | 56.9 |
| MaskEnsemble | 67.7 | 89.0 | 66.82 | 67.4 |
| BatchEnsemble | 70.1 | 87.45 | 68.0 | 68.7 |
| FilmEnsemble | 72.5 | 84.32 | 75.5 | 76.0 |
| Split-Ensemble (ours) | 73.7 | 80.5 | 81.7 | 77.6 |

  • The Split-Ensemble model excels at handling challenging long-tailed datasets, as evidenced by its top-tier performance across major metrics on the SC-OOD benchmarks. This is particularly notable given its efficiency, as it attains these results without incurring additional computational costs.

 

[C1] Liang et al., Enhancing the reliability of out-of-distribution image detection in neural networks, ICLR, 2018

[C2] Hendrycks et al., Deep anomaly detection with outlier exposure, ICLR, 2019

[C3] Yang et al., Semantically coherent out-of-distribution detection, ICCV, 2021

[C4] Cao et al., Learning imbalanced datasets with label-distribution-aware margin loss, NeurIPS, 2019

[C5] Liu et al., Energy-based out-of-distribution detection, NeurIPS, 2020

[C6] Yu et al., Unsupervised out-of-distribution detection by maximum classifier discrepancy, ICCV, 2019

 

W4.1: Uncertainty Estimation Background

A4.1: We have followed previous work such as Film-Ensemble in organizing our introduction to uncertainty estimation, with more related work discussed in Sec. 2.1. Please let us know of any additional relevant work that we should cite, and we will include it in the revision.

 

W4.2: Category Imbalance in Subtask Splitting

A4.2: We discuss this issue extensively in Sec. 3.2. Within each subtask, the categories are naturally imbalanced due to the large number of samples falling into the OOD class. We therefore utilize the class-balance loss, as introduced in Eq. (1) and Eq. (2), to reweight the loss based on the number of samples in each category. Different subtasks can also have different numbers of categories, and we specifically accommodate this scenario when designing the OOD class objective in Eq. (3).

 

W4.3: Enhanced Visualizations

A4.3: Thanks for the advice. As we have already visualized the learned Split-Ensemble architectures in Appendix C, we have added a visualization of the features learned by each branch to the appendix of the revised paper.

 

W4.4: Detailed Split-Ensemble Model Description

A4.4: The Split-Ensemble model, detailed in Section 4 of our paper, is a tree-like architecture obtained automatically via iterative splitting and pruning. It starts with a shared backbone architecture, such as ResNet-18, which branches into multiple submodels in later layers, with each branch specialized for a different subtask of the original classification problem. The architecture of each submodel is determined through correlation-based splitting (Sec. 4.1) and sensitivity-aware pruning (Sec. 4.2), ensuring that each branch is both efficient and task-specific. The exact architectures learned by Split-Ensemble under different configurations are shown in Appendix C.
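For readers who prefer code, a toy sketch of the overall shape of such a model is given below. It is only illustrative: the real branches are full ResNet stages obtained by iterative splitting and pruning, whereas the layer sizes and module names here are our own assumptions.

```python
import torch.nn as nn

class SplitEnsembleToy(nn.Module):
    """Toy tree-like model: a shared trunk followed by one branch (here a single
    linear head) per subtask. Each branch outputs its ID classes plus one OOD class."""

    def __init__(self, num_subtasks, classes_per_subtask, feat_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.branches = nn.ModuleList(
            nn.Linear(feat_dim, classes_per_subtask + 1) for _ in range(num_subtasks)
        )

    def forward(self, x):
        h = self.trunk(x)                                # shared low-level features
        return [branch(h) for branch in self.branches]   # one logit tensor per subtask
```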

Comment

W4.5: Evaluation Metrics Explanation

A4.5: The evaluation metrics we use are the common metrics in previous OOD detection papers. Below, we provide detailed descriptions of the AUROC, FPR95, and AUPR metrics used in the paper. We have included this information in Appendix A of the revised paper.

AUROC (Area Under the Receiver Operating Characteristic Curve):

Definition: AUROC represents the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. It is a widely used metric for binary classification problems.

Calculation: AUROC is calculated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings and computing the area under the resulting curve.

FPR95 (False Positive Rate at 95% True Positive Rate):

Definition: FPR95 is the proportion of negative instances that are misclassified as positive when the true positive rate (TPR) is as high as 95%. It is a specific point on the ROC curve.

Calculation: FPR95 is determined by finding the point on the ROC curve where the TPR is 95% and noting the corresponding FPR.

AUPR (Area Under the Precision-Recall Curve):

Definition: AUPR is used to measure the trade-off between precision and recall for different thresholds. It's especially useful in datasets with a significant imbalance between positive and negative instances.

Calculation: AUPR is computed by plotting the precision against the recall at various thresholds and calculating the area under this curve.
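As a concrete reference, the three metrics can be computed from per-sample confidence scores roughly as follows (a sketch using scikit-learn, treating ID samples as positives; this is not the paper's evaluation code).

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, average_precision_score

def ood_metrics(id_scores, ood_scores):
    """AUROC, FPR95, and AUPR from confidence scores (higher = more ID-like)."""
    id_scores, ood_scores = np.asarray(id_scores), np.asarray(ood_scores)
    scores = np.concatenate([id_scores, ood_scores])
    labels = np.concatenate([np.ones(len(id_scores)), np.zeros(len(ood_scores))])

    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]   # FPR at the first threshold reaching 95% TPR
    return auroc, fpr95, aupr
```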

 

W4.6: Equations (1) and (2)

A4.6: Eq. (1) and (2) formulate the class-balance reweighting of the binary cross-entropy loss to accommodate the imbalanced number of samples across classes. The formulations are taken directly from the class-balanced loss paper [C7].

[C7] Cui et al., Class-balanced loss based on effective number of samples, CVPR, 2019
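For reference, the class-balanced weighting of [C7] can be sketched as below: each class is weighted by the inverse of its "effective number" of samples. This is a generic illustration of the cited formula, not the paper's implementation.

```python
import numpy as np

def class_balanced_weights(samples_per_class, beta=0.9999):
    """Per-class weights from Cui et al. (CVPR 2019): 1 / effective number,
    where the effective number of samples is (1 - beta^n) / (1 - beta)."""
    n = np.asarray(samples_per_class, dtype=np.float64)
    effective_num = (1.0 - np.power(beta, n)) / (1.0 - beta)
    weights = 1.0 / effective_num
    return weights * len(n) / weights.sum()   # normalize to sum to the number of classes

# Example: a subtask with 5 ID classes of 500 samples each and one OOD class that
# collects 4500 out-of-subtask samples puts a much smaller weight on the OOD class.
print(class_balanced_weights([500] * 5 + [4500]))
```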

 

W4.7: Future Directions and Applications

A4.7: As this work uses a heuristic design for the subtask-splitting strategy, one important future direction is to automatically determine the number of splits and the splitting strategy. We also see potential applications of Split-Ensemble in OOD detection for object detection and semantic segmentation, and in inspiring better-calibrated efficient models with improved robustness and interpretability on complicated tasks.

 

W4.8: Time Complexity Analysis

A4.8: As we constrain the FLOPs of the Split-Ensemble model to be the same as those of the original backbone model (Sec. 4.2), the inference speed is also similar to that of the backbone model. Moreover, the tree-like architecture of Split-Ensemble is potentially more parallelizable, as different branches can be computed independently on different processing units given a proper computation kernel implementation. For training, the loss of each submodel needs to be computed individually, so the training cost in early epochs is at a similar level to that of a full ensemble, as all submodels start from the backbone architecture. As training progresses and submodels are pruned, the training cost decreases in later epochs.

Comment

Dear Reviewer FimD

We apologize for any inconvenience our request may cause during your busy schedule.

We hope that our response has provided clarification for your concerns. As today is the last day for author-reviewer communication, we would greatly appreciate it if you could please let us know if we could provide any further clarifications about the paper.

Thanks so much again for taking the time to review our paper.

Best regards,

Split-Ensemble Authors

Review (Rating: 5)

This paper proposes an ensemble-based method for out-of-distribution (OOD) detection. Specifically, the original classification task is split into several sub-tasks trained on ID data but with OOD-aware class targets, and one model is trained for each sub-task. A weight splitting and pruning strategy is proposed to reduce the computational cost. In the inference stage, the probabilities produced by each model are concatenated, and a sample is considered OOD if all the probabilities are below some threshold.

Strengths

  1. The idea of using task splitting on ID data to train an ensemble for OOD detection is interesting.

Weaknesses

  1. The effectiveness of the proposed method is not convincingly evaluated, as the benchmarking experiments are not sufficient. Table 1: benchmarking results on CIFAR-10 and TinyIMNET are missing; the numbers reported for Deep Ensemble on CIFAR-10 are problematic, as it should not underperform a single network. Table 2: lacking benchmarking against SOTA methods.

Questions

  1. How is the optimal number of task splits determined? It seems that using a larger number of sub-tasks increases AUROC, but the computational cost is also increased.
  2. How can the method be applied to OOD detection in object detection and semantic segmentation?
Comment

We thank reviewer ZwnE for your thorough reviews and valuable comments. Hopefully the following responses can address your concerns.

 

W1: Benchmarking Results and Comparison

A1: In response to your concern about Table 1, we have re-implemented the benchmarking results on CIFAR-10 and TinyIMNET, and re-evaluated Deep Ensemble's performance on CIFAR-10. These results are updated in the revised paper.

The following table shows the updated image classification results on CIFAR-10/100 and Tiny-ImageNet. All accuracies are given in percentage with ResNet-18/ResNet-34 as the backbone. Best score for each metric in bold, second-best italicized. * refers to our own implementation using default hyperparameters.

| Method | FLOPs | CIFAR-10* | CIFAR-100* | Tiny-ImageNet* |
| --- | --- | --- | --- | --- |
| Single | 1x | 94.7 / 95.2 | 75.9 / 77.3 | 36.4 / 40.2 |
| Deep-ensemble | 4x | 95.7 / 95.5 | 80.1 / 80.4 | 46.8 / 47.8 |
| MC-Dropout | 4x | 93.3 / 90.1 | 73.3 / 66.3 | 58.13 / 60.3 |
| MIMO | 4x | 86.8 / 87.5 | 54.9 / 54.6 | 46.39 / 47.76 |
| MaskEnsemble | 4x | 94.3 / 90.8 | 76.0 / 64.8 | 61.2 / 62.6 |
| BatchEnsemble | 4x | 94.0 / 91.0 | 75.5 / 66.1 | 61.7 / 62.3 |
| FilmEnsemble | 4x | 87.78 / 94.25 | 77.4 / 77.2 | 51.5 / 53.2 |
| Split-Ensemble (ours) | 1x | 95.5 / 95.6 | 77.7 / 77.35 | 51.6 / 47.6 |

  • The observation is similar to the original Tab. 1 in the paper: Split-Ensemble beats the single model significantly without additional computation cost. Compared to previous ensemble methods, Split-Ensemble also achieves similar or better performance at significantly lower cost.

For Table 2, we follow the suggestion of Reviewer 37MR to include results on recent benchmarks and baselines, namely Yang et al. [C3] and Cao et al. [C4], as well as additional baselines ODIN [C1], EBO [C5], OE [C2], MCD [C6], and UDG [C3]. The results have been included in the Appendix of the revised manuscript.

The following two tables show the comparison between previous state-of-the-art methods and ours on the SC-OOD CIFAR10 benchmarks. The results are reported for models with ResNet-18 backbone. Some baseline methods use the Tiny-Imagenet dataset as additional OOD training data. Best score for each metric in bold, second-best italicized.

| Method | Additional Data | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| ODIN | | 52.0 | 82.0 | 85.1 |
| EBO | | 50.0 | 83.8 | 85.1 |
| OE | | 50.5 | 88.9 | 87.8 |
| MCD | | 73.0 | 83.9 | 80.5 |
| UDG | | 55.6 | 90.7 | 88.3 |
| UDG | | 36.2 | 93.8 | 92.6 |
| Split-Ensemble (ours) | | 45.5 | 91.1 | 89.9 |
  • The Split-Ensemble model outperforms single-model approaches in OOD detection without incurring additional computational costs or requiring extra training data. Its consistent high performance across key metrics highlights its robustness and efficiency, underscoring its practical utility in OOD tasks.

 

| Method | FLOPs | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 4x | 42.3 | 90.4 | 90.6 |
| MC-Dropout | 4x | 54.9 | 88.7 | 88.0 |
| MIMO | 4x | 73.7 | 83.5 | 80.9 |
| MaskEnsemble | 4x | 53.2 | 87.7 | 87.9 |
| BatchEnsemble | 4x | 50.4 | 89.2 | 88.6 |
| FilmEnsemble | 4x | 42.6 | 91.5 | 91.3 |
| Split-Ensemble (ours) | 1x | 45.5 | 91.1 | 89.9 |

  • The Split-Ensemble model consistently outshines other ensemble-based methods in both image classification and OOD detection, achieving similar or even better performance at 4× lower computation cost.

 

The following table shows the comparison between previous state-of-the-art ensemble-based methods and ours on the SC-OOD CIFAR10-LT benchmarks. The results are reported for models with ResNet-18 backbone. Best score for each metric in bold, second-best italicized.

| Method | Accuracy ↑ | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 12.7 | 98.4 | 45.3 | 50.9 |
| MC-Dropout | 63.4 | 90.6 | 66.6 | 66.1 |
| MIMO | 35.7 | 96.3 | 55.1 | 56.9 |
| MaskEnsemble | 67.7 | 89.0 | 66.82 | 67.4 |
| BatchEnsemble | 70.1 | 87.45 | 68.0 | 68.7 |
| FilmEnsemble | 72.5 | 84.32 | 75.5 | 76.0 |
| Split-Ensemble (ours) | 73.7 | 80.5 | 81.7 | 77.6 |

  • The Split-Ensemble model excels at handling challenging long-tailed datasets, as evidenced by its top-tier performance across major metrics on the SC-OOD benchmarks. This is particularly notable given its efficiency, as it attains these results without incurring additional computational costs.
Comment

[C1] Liang et al., Enhancing the reliability of out-of-distribution image detection in neural networks, ICLR, 2018

[C2] Hendrycks et al., Deep anomaly detection with outlier exposure, ICLR, 2019

[C3] Yang et al., Semantically coherent out-of-distribution detection, ICCV, 2021

[C4] Cao et al., Learning imbalanced datasets with label-distribution-aware margin loss, NeurIPS, 2019

[C5] Liu et al., Energy-based out-of-distribution detection, NeurIPS, 2020

[C6] Yu et al., Unsupervised out-of-distribution detection by maximum classifier discrepancy, ICCV, 2019

 

W2: Optimal Task Split Strategy

A2: As noted in Sec. 4.2, the floating-point operations (FLOPs) of the original backbone model are used as the computation budget for Split-Ensemble splitting and pruning, so the final computation cost is the same regardless of the number of splits. The tradeoff is that a larger number of splits enables each submodel to learn its OOD-aware objective more easily with fewer ID classes, therefore leading to better AUROC; yet the performance may suffer from aggressive pruning to fit the additional branches within the budget. This is observed in Tab. 6 in Appendix C. As the Split-Ensemble framework can work under a wide range of split numbers, we heuristically choose the optimal setting for experiments through ablation trials. We leave a more rigorous analysis of automatically designing the optimal subtask grouping strategy as future work.

 

W3: Application in Object Detection and Segmentation

A3: Beyond classification, Split-Ensemble may be used as a backbone architecture for OOD detection in object detection and segmentation tasks. Specifically, instead of attaching an independent classifier to each branch, we can attach detection or segmentation heads. The objective of each head can then be designed as detecting/segmenting a subset of classes while highlighting OOD objects of the corresponding subtask. Though this is beyond the scope of this paper, we would like to look into it in the future.

Comment

Dear Reviewer ZwnE

We apologize for any inconvenience our request may cause during your busy schedule.

We hope that our response has provided clarification for your concerns. As today is the last day for author-reviewer communication, we would greatly appreciate it if you could please let us know if we could provide any further clarifications about the paper.

Thanks so much again for taking the time to review our paper.

Best regards,

Split-Ensemble Authors

Review (Rating: 5)

The paper proposes a method to train a “Split-Ensemble” model for detection of OOD inputs. The main idea is to split classes into (semantically related) groups and train a submodel on each group. Further,

  • Submodels are trained to correctly classify a (disjoint) subset of classes plus an additional OOD class that refers to the rest of the classes (i.e., those in the subsets of other submodels).

  • Submodels share a backbone and a method is proposed to branch out from the backbone using sensitivity criteria until each submodel has an individual branch.

  • Submodels are “calibrated” so that classification may be performed as argmax of concatenated logits.

Experimental results on CIFAR-10/100, Tiny-ImageNet and other datasets (used as OOD data) show that:

  • The proposed model has better accuracy than a single model and some ensemble models with 4 members.

  • The proposed model has better OOD detection (e.g., in terms of AUROC) than a single model and a 4-member ensemble.

Strengths

S1. The method is well motivated and the presentation is easy to follow.

S2. The method shows a level of measurable success.

Weaknesses

W1. Some key aspects of the method are not discussed properly nor validated theoretically or experimentally. For example:

  • How is the OOD detection criteria probabilistically sound?

  • When a split is decided it is not stated what architecture and parameters are used for the new branches.

  • The experiments on subtask grouping are in the appendix and are not specified in detail.

  • There is a predefined computation budget that is also not specified.

W2. Important recent baselines and benchmarks were not discussed or incorporated. For example, (Yang et al. ICCV21) and (Wang et al. ICML22). The current set of benchmarks and baselines do not represent the more performant or challenging cases.

W3. For the OOD detection experiments it is not specified how the OOD detection threshold was determined for each model.

References:

Yang et al. “Semantically Coherent Out-of-Distribution Detection.” ICCV 2021.

Wang et al. “Partial and Asymmetric Contrastive Learning for Out-of-Distribution Detection in Long-Tailed Recognition.” ICML 2022.

Questions

Besides looking for some reply to the issues noted above,

Q1. Like other OOD detection methods, this method does not seem to address the issue of the distribution of OOD data being unknown. What would the authors say with regards to this in relation to the method and the reported results?

Details of Ethics Concerns

None.

Comment

We thank reviewer 37MR for your thorough reviews and valuable comments. Hopefully the following responses can address your concerns.

 

W1.1: OOD Detection Criteria

A1.1: We follow the practice of ODIN [C1] and Outlier Exposure [C2] in considering the softmax probability of the model output as an estimate of model confidence, which is used as the OOD detection criterion. Since we train each submodel with a separate OOD-aware objective, as in Eq. (4), we perform the uncertainty estimation with the submodel $f_i$ that contributes to the ensemble output label, as discussed in Sec. 3.3.
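A rough sketch of this inference rule is given below. It is our own illustration of the description above, not the paper's code; the branch/logit layout and the threshold handling are assumptions.

```python
import torch
import torch.nn.functional as F

def ensemble_predict(branch_logits, threshold=0.5):
    """Concatenate the ID logits of all submodels for classification, then score
    OOD-ness with the max softmax probability of the submodel owning the winning class."""
    sizes = [l.shape[1] - 1 for l in branch_logits]              # ID classes per branch
    id_logits = torch.cat([l[:, :-1] for l in branch_logits], dim=1)
    pred = id_logits.argmax(dim=1)                               # ensemble class prediction

    boundaries = torch.cumsum(torch.tensor(sizes), dim=0)
    owner = torch.searchsorted(boundaries, pred, right=True)     # branch owning each prediction

    probs = [F.softmax(l, dim=-1) for l in branch_logits]
    conf = torch.stack([probs[b][i, :-1].max() for i, b in enumerate(owner.tolist())])
    return pred, conf, conf < threshold                          # low confidence => flag as OOD
```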

 

W1.2: Post-Split architecture and parameter

A1.2: For a new branch created by a split, the architecture and parameters are initialized as an exact copy of the original layers it splits from. This guarantees the same model functionality before and after the split. The branches are then updated, pruned, and further split independently. We have added this clarification to Sec. 4.1.
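In code, this initialization amounts to a deep copy of the shared block, along the lines of the sketch below (illustrative only; the function name is ours).

```python
import copy
import torch.nn as nn

def split_layer(shared_block: nn.Module, num_branches: int) -> nn.ModuleList:
    """Replace one shared block with independent per-branch copies. Each copy starts
    with identical weights, so the network computes exactly the same function right
    after the split; the copies then diverge through further training and pruning."""
    return nn.ModuleList(copy.deepcopy(shared_block) for _ in range(num_branches))
```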

 

W1.3: Subtask Grouping Experiments

A1.3: This work focuses on the training objective of subtask-splitting training for OOD awareness and on achieving an efficient architecture with iterative splitting and pruning. We heuristically design the subtask splitting based on the intuition that semantically close classes should be grouped together (Sec. 3.1) so as to make the ID and OOD data in each subtask more distinguishable, and the results in Table 5 show this choice truly makes a difference. We will explain the intuition behind semantically close grouping more clearly in Sec. 3.1 and point to this result. Meanwhile, we leave a more rigorous analysis of the optimal subtask grouping strategy as future work.

 

W1.4: Computation Budget

A1.4: As stated at the end of Sec. 4.2, the floating-point operations (FLOPs) of the original backbone model are used as the computation budget to guide the splitting and pruning of the Split-Ensemble. This is also reflected in Tab. 1, where the FLOPs of Split-Ensemble are reported as 1x, the same as the single backbone model.

 

W2: Recent Baseline and Benchmark

A2: Please see the next post due to space limitation.

 

W3: OOD Detection Threshold Determination

A3: Note that with the OOD detection metrics reported in the paper, we do not manually determine an OOD detection threshold for each model. For AUROC and AUPR computation, we scan through all possible thresholds to obtain the corresponding tradeoff curves. For FPR and detection error evaluation, we report the number at the threshold achieving 95% TPR. We have clarified this in the caption of Tab. 2.

 

W4: Addressing Unknown OOD Data Distribution

A4: One unique feature of the proposed Split-Ensemble method is that we perform OOD-aware training without the need for external OOD data, so no knowledge of the OOD data to be faced is required. Specifically, we propose subtask-splitting training in Sec. 3, where each submodel is trained with part of the dataset as ID and the rest as OOD. With the OOD objective in Eq. (3), each submodel becomes OOD-aware in a way that generalizes to unknown OOD distributions. This generalizability is verified by our results in Tab. 2 and on the new SC-OOD benchmark you suggested: our model, trained with only ID data, achieves better performance across different OOD distributions than baseline methods, even those trained with additional OOD data.

Comment

W2: Recent Baseline and Benchmark

A2: Thank you for bringing up these highly relevant papers. Results on recent benchmarks and baselines, namely Yang et al. [C3] and Cao et al. [C4], along with additional baselines ODIN [C1], EBO [C5], OE [C2], MCD [C6], and UDG [C3], are listed below and have been included in the Appendix of the revised manuscript.

The following two tables show the comparison between previous state-of-the-art methods and ours on the SC-OOD CIFAR10 benchmarks. The results are reported for models with ResNet-18 backbone. Some baseline methods use the Tiny-Imagenet dataset as additional OOD training data. Best score for each metric in bold, second-best italicized.

| Method | Additional Data | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| ODIN | | 52.0 | 82.0 | 85.1 |
| EBO | | 50.0 | 83.8 | 85.1 |
| OE | | 50.5 | 88.9 | 87.8 |
| MCD | | 73.0 | 83.9 | 80.5 |
| UDG | | 55.6 | 90.7 | 88.3 |
| UDG | | 36.2 | 93.8 | 92.6 |
| Split-Ensemble (ours) | | 45.5 | 91.1 | 89.9 |
  • The Split-Ensemble model outperforms single-model approaches in OOD detection without incurring additional computational costs or requiring extra training data. Its consistent high performance across key metrics highlights its robustness and efficiency, underscoring its practical utility in OOD tasks.

 

| Method | FLOPs | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 4x | 42.3 | 90.4 | 90.6 |
| MC-Dropout | 4x | 54.9 | 88.7 | 88.0 |
| MIMO | 4x | 73.7 | 83.5 | 80.9 |
| MaskEnsemble | 4x | 53.2 | 87.7 | 87.9 |
| BatchEnsemble | 4x | 50.4 | 89.2 | 88.6 |
| FilmEnsemble | 4x | 42.6 | 91.5 | 91.3 |
| Split-Ensemble (ours) | 1x | 45.5 | 91.1 | 89.9 |

  • The Split-Ensemble model consistently outshines other ensemble-based methods in both image classification and OOD detection, achieving similar or even better performance at 4× lower computation cost.

 

The following table shows the comparison between previous state-of-the-art ensemble-based methods and ours on the SC-OOD CIFAR10-LT benchmarks. The results are reported for models with ResNet-18 backbone. Best score for each metric in bold, second-best italicized.

| Method | Accuracy ↑ | FPR95 ↓ | AUROC ↑ | AUPR ↑ |
| --- | --- | --- | --- | --- |
| Naive Ensemble | 12.7 | 98.4 | 45.3 | 50.9 |
| MC-Dropout | 63.4 | 90.6 | 66.6 | 66.1 |
| MIMO | 35.7 | 96.3 | 55.1 | 56.9 |
| MaskEnsemble | 67.7 | 89.0 | 66.82 | 67.4 |
| BatchEnsemble | 70.1 | 87.45 | 68.0 | 68.7 |
| FilmEnsemble | 72.5 | 84.32 | 75.5 | 76.0 |
| Split-Ensemble (ours) | 73.7 | 80.5 | 81.7 | 77.6 |

  • The Split-Ensemble model excels at handling challenging long-tailed datasets, as evidenced by its top-tier performance across major metrics on the SC-OOD benchmarks. This is particularly notable given its efficiency, as it attains these results without incurring additional computational costs.

 

[C1] Liang et al., Enhancing the reliability of out-of-distribution image detection in neural networks, ICLR, 2018

[C2] Hendrycks et al., Deep anomaly detection with outlier exposure, ICLR, 2019

[C3] Yang et al., Semantically coherent out-of-distribution detection, ICCV, 2021

[C4] Cao et al., Learning imbalanced datasets with label-distribution-aware margin loss, NeurIPS, 2019

[C5] Liu et al., Energy-based out-of-distribution detection, NeurIPS, 2020

[C6] Yu et al., Unsupervised out-of-distribution detection by maximum classifier discrepancy, ICCV, 2019

Comment

Thanks for the clarifications and additional experiments. The latter should strengthen the submission.

I still wonder what the authors can add regarding sections 3.2 and 3.3. As of now, the ensemble members are trained with different labels (e.g., eq. 3) but ultimately used as an ensemble -- for which at least some sort of calibration is required (e.g., eq 4).

Can more be said to justify these choices theoretically? For instance, is the resulting model a mixture model?

Comment

Thank you for your timely reply!

The motivation of subtask-splitting training is to reduce the redundancy of having multiple ensemble submodels learn the same task, and to allow each submodel to achieve better uncertainty calibration with the OOD-aware training objective. To this end, the OOD-aware training objective in Eq. (2) and (3) allows each submodel to classify the $K$ ID classes assigned to it well, and to output low confidence for OOD samples outside of these $K$ classes, including samples from the other $N-K$ classes of the overall training task; this generalizes to external unseen OOD data. Since this OOD detection ability is learned by each submodel, we use the submodel output to perform uncertainty estimation as in Sec. 3.3.

For the ensemble output, as introduced in Sec. 3.3, we concatenate all the ID logits from the submodels to form the ensemble logits for classification. In the ideal case, only the submodel whose subtask contains the correct class outputs high confidence in its ID logits, while all the other submodels output low confidence, so the correct classification can be made from the concatenated ensemble logits. In practice, there may be a mismatch between the ranges of logits output by different submodels, where a low-confidence logit from one submodel may take a larger value than a high-confidence logit from another independently trained submodel. We therefore introduce Eq. (4) to calibrate the logit ranges across submodels with the overall $\mathcal{L}_{CE}$. Note that the CE loss weight $\lambda$ can be as small as 1e-4 to achieve this calibration, so the calibration does not impact the individual performance of each submodel on its subtask, including its ability to detect OOD samples.
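As an illustration of how the calibration term enters training, a simplified loss could look like the sketch below. It is not the paper's exact objective (the subtask term in the paper is a class-balanced binary cross-entropy, Eqs. (1)-(3)); plain cross-entropy is used here only to keep the sketch short.

```python
import torch
import torch.nn.functional as F

def split_ensemble_loss(branch_logits, subtask_targets, global_targets, lam=1e-4):
    """Each branch is trained on its own OOD-aware subtask labels; a small CE term on
    the concatenated ID logits only calibrates logit ranges across branches."""
    subtask_loss = sum(
        F.cross_entropy(logits, targets)
        for logits, targets in zip(branch_logits, subtask_targets)
    )
    ensemble_logits = torch.cat([l[:, :-1] for l in branch_logits], dim=1)
    calibration_loss = F.cross_entropy(ensemble_logits, global_targets)
    return subtask_loss + lam * calibration_loss
```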

Comment

Dear Reviewer 37MR

We apologize for any inconvenience our request may cause during your busy schedule.

We hope that our response has provided clarification for your concerns. As today is the last day for author-reviewer communication, we would greatly appreciate it if you could please let us know if we could provide any further clarifications about the paper.

Thanks so much again for taking the time to review our paper.

Best regards,

Split-Ensemble Authors

Comment

We would like to express our sincere thanks to all the reviewers for your thorough reviews and valuable comments. We are glad that you find our work to be innovative (ZwnE, FimD, Fg3Y), well-motivated (37MR), easy to follow (37MR, Fg3Y), and showing good performance (37MR, FimD, Fg3Y). We have provided individual responses to your concerns under each of your reviews and have revised the submitted draft (changes in blue) in response to your suggestions. We hope these responses address your concerns, and we look forward to your positive feedback.

AC Meta-Review

The paper proposes a subtask-splitting ensemble training objective, where a common multiclass classification task is split into several complementary subtasks, where each subtask's training data can be considered as OOD to the other subtasks. A tree-like Split-Ensemble architecture is formed by performing iterative splitting and pruning from a shared backbone model, where each branch serves as a submodel corresponding to a subtask. Such an architecture leads to improved accuracy and uncertainty estimation across submodels under a fixed ensemble computation budget. Experiments conducted on three benchmark datasets show that the proposed model outperforms a single model baseline on OOD detection tasks.

A common concern shared by the reviewers is the lack of a rigorous justification of an optimal split strategy; a deeper theoretical analysis could largely strengthen the proposed approach. Reviewers also pointed out that the results presented in the original paper show some inconsistent trends across different datasets. While the authors provided updated results during the rebuttal, it is not entirely clear whether these new results are fully reliable. Benchmarking results are also missing for certain datasets, and several SOTA methods are not included in the comparison in the original paper. While the authors provided additional results during the rebuttal, the proposed method does not appear to be consistently better than the baselines.

Why not a higher score

The proposed approach lacks a rigorous justification and the presented results show inconsistent trends, making it less convincing.

Why not a lower score

N/A

Final Decision

Reject