SEAD: Unsupervised Ensemble of Streaming Anomaly Detectors
This paper introduces SEAD, the first technique for ensembling unsupervised anomaly detectors in the streaming setting; it adapts to the distribution of scores generated by the base anomaly detectors on each dataset.
Abstract
Reviews and Discussion
The paper proposes SEAD, an unsupervised ensemble method for streaming anomaly detection (AD) that dynamically selects the best base detectors without labeled data. It leverages multiplicative weights updates to adjust model weights based on normalized anomaly scores and introduces SEAD++ for runtime optimization via sampling. Experiments on 15 datasets demonstrate SEAD’s effectiveness and efficiency compared to base models and offline methods like MetaOD.
Questions for Authors
- Are the performance differences between SEAD and base models statistically significant (e.g., via paired t-tests)?
- How do choices of η and λ impact SEAD’s performance? I lack sufficient knowledge in this area, but I will refer to other reviewers' comments to revise the final rating.
Claims and Evidence
Supported claims: SEAD’s unsupervised nature and adaptability to non-stationarity are supported by experiments showing it outperforms base models and adapts weights over time (e.g., Figure 2). Efficiency claims are validated by runtime comparisons (e.g., SEAD++ reduces runtime by ~50% vs. SEAD in Table 4).
Problematic claims: The claim that SEAD "matches the performance of the best base algorithm" (Sec. 4.3) is not fully supported; Table 2 shows SEAD occasionally ranks lower (e.g., 9th on WBC). A statistical significance test is missing.
Methods and Evaluation Criteria
Normalizing anomaly scores via quantiles (Sec. 3.2) addresses score incomparability. The use of APS (Averaged Precision Score) is appropriate for imbalanced anomaly detection.
Theoretical Claims
Theorem 3.1 cites regret bounds from the FTRL literature but does not provide a self-contained proof. The bound is plausible, but the paper assumes anomalies are "rare" for at least one detector, and the impact of this assumption on the regret guarantees is not analyzed.
Experimental Design and Analysis
Comprehensive evaluation across 15 datasets with varying anomaly rates and dimensions. Runtime comparisons with MetaOD (Table 5) highlight streaming efficiency. No ablation study on SEAD’s hyperparameters (e.g., learning rate η, regularization λ). Missing comparison to state-of-the-art streaming ensembles (e.g., LODA, xStream variants).
Supplementary Material
Sufficient experiments were provided in the supplementary material, but the code was missing.
Relation to Prior Literature
SEAD addresses a gap in unsupervised online model selection for AD.
Essential References Not Discussed
Most of the relevant references were provided. Given my limited knowledge of the field, it is difficult for me to provide an accurate judgement.
Other Strengths and Weaknesses
Combines multiplicative weights with quantile normalization for unsupervised AD, a novel approach. Provides a practical solution for real-time monitoring systems where labeling is infeasible. The pseudo-code (Algorithm 1) is clear, but the loss function (Eq. 2) could be better motivated.
Other Comments or Suggestions
Typos: "t-digest" (Sec. 3.2), "detector" misspelled in Table 1 header. Figure 1 is referenced but not included in the provided content.
We thank the reviewer for the valuable comments and suggestions. We will add these to the final camera-ready version.
On hyper-parameter ablations
We add the following ablation over the learning rate η and the regularization strength λ; each column header in the table below gives the (η, λ) pair. We will add this to the camera-ready version of the paper.
| dataset | (η=1, λ=10^-6) | (η=1, λ=10^-4) | (η=1, λ=10^-2) | (η=0.1, λ=10^-6) | (η=0.1, λ=10^-4) | (η=0.1, λ=10^-2) | (η=0.01, λ=10^-6) | (η=0.01, λ=10^-4) | (η=0.01, λ=10^-2) |
|---|---|---|---|---|---|---|---|---|---|
| pima | 0.519 | 0.519 | 0.52 | 0.531 | 0.531 | 0.531 | 0.527 | 0.527 | 0.527 |
| pendigits | 0.091 | 0.255 | 0.188 | 0.164 | 0.164 | 0.159 | 0.147 | 0.147 | 0.147 |
| letter | 0.084 | 0.083 | 0.076 | 0.066 | 0.066 | 0.064 | 0.056 | 0.056 | 0.056 |
| optdigits | 0.084 | 0.031 | 0.03 | 0.041 | 0.041 | 0.039 | 0.042 | 0.042 | 0.042 |
| ionosphere | 0.555 | 0.544 | 0.557 | 0.568 | 0.568 | 0.568 | 0.57 | 0.57 | 0.57 |
| wbc | 0.486 | 0.483 | 0.471 | 0.487 | 0.487 | 0.487 | 0.488 | 0.488 | 0.488 |
| mammography | 0.118 | 0.128 | 0.125 | 0.118 | 0.118 | 0.12 | 0.123 | 0.123 | 0.123 |
| glass | 0.115 | 0.129 | 0.124 | 0.093 | 0.093 | 0.093 | 0.094 | 0.094 | 0.094 |
| vertebral | 0.154 | 0.158 | 0.158 | 0.152 | 0.152 | 0.152 | 0.154 | 0.154 | 0.154 |
| cardio | 0.505 | 0.508 | 0.173 | 0.24 | 0.24 | 0.231 | 0.209 | 0.209 | 0.209 |
On statistical tests comparing SEAD against the base models and other forms of aggregation such as mean, max, and min
We compare SEAD against the following competitor methods using several statistical tests. In each case, we compute the p-value for the null hypothesis that the Average Precision Score of SEAD is statistically identical to that of the competitor method across the 15 datasets we test on (a sketch of how these tests are computed follows the table). From the table, we see that with these 15 datasets we can only reject the null hypothesis for the RRCF methods. We believe this is an artifact of testing on only 15 datasets rather than a larger collection.
| Competitor_Model | T_Test_PValue_OneSided | Wilcoxon_PValue | MannWhitney_PValue | Sign_Test_PValue | Cohens_d |
|---|---|---|---|---|---|
| rule_based_models_0 | 0.48809 | 0.34162 | 0.3937 | 0.21198 | 0.00812 |
| rrcf_0 | 0.02438 | 0.01516 | 0.14504 | 0.02869 | 0.57685 |
| rrcf_1 | 0.04179 | 0.02399 | 0.22136 | 0.02869 | 0.4979 |
| rrcf_2 | 0.04579 | 0.0473 | 0.2807 | 0.15088 | 0.48413 |
| rrcf_3 | 0.03284 | 0.01292 | 0.33156 | 0.02869 | 0.53363 |
| xstream_0 | 0.02094 | 0.00076 | 0.04073 | 0.00049 | 0.59848 |
| xstream_1 | 0.20831 | 0.30767 | 0.48345 | 0.60474 | 0.22371 |
| xstream_2 | 0.13756 | 0.31934 | 0.39371 | 0.5 | 0.30355 |
| xstream_3 | 0.14407 | 0.27546 | 0.50827 | 0.39526 | 0.2951 |
| iforestasd_0 | 0.32432 | 0.48898 | 0.33913 | 0.15088 | 0.12444 |
| iforestasd_1 | 0.37355 | 0.55481 | 0.35445 | 0.30362 | 0.0879 |
| iforestasd_2 | 0.3849 | 0.56235 | 0.36997 | 0.21198 | 0.07974 |
| iforestasd_3 | 0.38341 | 0.73776 | 0.47519 | 0.69638 | 0.08081 |
| mean | 0.37577 | 0.53751 | 0.49173 | 0.39526 | 0.0863 |
| max | 0.07483 | 0.0535 | 0.20926 | 0.15088 | 0.40743 |
| min | 0.08475 | 0.11045 | 0.31664 | 0.21198 | 0.38714 |
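For concreteness, the following is a minimal sketch, not our exact evaluation script, of how such paired tests can be computed with SciPy given one APS value per dataset for SEAD and one competitor. The Cohen's d shown is the paired-samples (d_z) variant; the arrays and example numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

def paired_tests(sead_aps, comp_aps):
    """Paired significance tests between SEAD and one competitor (one APS per dataset)."""
    sead_aps, comp_aps = np.asarray(sead_aps), np.asarray(comp_aps)
    diff = sead_aps - comp_aps

    t_p = stats.ttest_rel(sead_aps, comp_aps, alternative="greater").pvalue
    w_p = stats.wilcoxon(sead_aps, comp_aps, alternative="greater").pvalue
    u_p = stats.mannwhitneyu(sead_aps, comp_aps, alternative="greater").pvalue

    # Sign test: binomial test on the number of datasets where SEAD wins (ties dropped).
    wins, ties = int((diff > 0).sum()), int((diff == 0).sum())
    s_p = stats.binomtest(wins, n=len(diff) - ties, p=0.5, alternative="greater").pvalue

    # Paired-samples Cohen's d: mean difference over the standard deviation of differences.
    d = diff.mean() / diff.std(ddof=1)
    return {"t_test": t_p, "wilcoxon": w_p, "mannwhitney": u_p, "sign": s_p, "cohens_d": d}

# Illustrative usage with synthetic APS values for 15 datasets:
rng = np.random.default_rng(0)
base = rng.random(15)
print(paired_tests(base + 0.05 * rng.random(15), base))
```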
The authors study unsupervised anomaly detection on data streams, where the data distribution can change over time and affect the performance of any single model. The authors introduce a weighted ensemble that combines individual anomaly detectors based on how low their normalized scores are. The method is tested on 15 datasets.
Questions for Authors
N/A
Claims and Evidence
The authors claim that the proposed method, SEAD, is an effective and efficient ensemble method. However, a comparison with other ensemble methods is not conducted. The comparison is carried out only w.r.t. individual models.
Methods and Evaluation Criteria
The method is evaluated on sufficient datasets.
Theoretical Claims
N/A
Experimental Design and Analysis
The reported variances seem very large. Not clear why this happens for SEAD, but this raises questions about the stability of the method.
Supplementary Material
Yes, all.
Relation to Prior Literature
The paper studies anomaly detection in streaming data, where the assumption is that data distribution can change over time. This is an interesting topic which is understudied in literature.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
- Typos:
- "we propose SEAD , the first" => "we propose SEAD, the first";
- "with state of the art streaming" => "with state-of-the-art streaming";
We thank the reviewer for their comments and address the main concern raised, namely that of the variance.
The reported variances seem very large. Not clear why this happens for SEAD, but this raises questions about the stability of the method.
As mentioned in the paper, SEAD has the lowest variance among all methods across the diverse datasets we test on. The large variances reflect the diversity of the datasets rather than instability of the method. In particular, among the competing methods, SEAD has the lowest variance in observed ranks across datasets, which also makes it more consistently performant than the baselines.
The main concern was the lack of comparison with other ensemble methods, which is not addressed in the rebuttal. Therefore, I will keep my original rating.
We thank the reviewer for this feedback. We did compare against simple online ensemble baselines (mean, max, and min) and the offline method MetaOD in Tables 3 and 5 of the main paper, respectively. In addition, to address feedback from reviewer dMSV, we also compared SEAD against the offline method in “Unsupervised Model Selection for Time Series Anomaly Detection”, ICLR 2023, by Goswami et al. We emphasize (as in the paper) that ours is the first unsupervised online ensembling technique; hence, the only online comparisons available are the simple baselines (mean, max, and min). We compared against state-of-the-art offline methods (MetaOD and Goswami et al.) to demonstrate that, in the online setting, SEAD achieves similar accuracy with much lower running time. We show the updated comparison table for offline methods below (also in the response to reviewer dMSV).
| Dataset | SEAD Average Precision | SEAD Runtime (s) | MetaOD Average Precision | MetaOD Runtime (s) | Goswami et al. Average Precision | Goswami et al. Runtime (s) |
|---|---|---|---|---|---|---|
| Ozone | 0.059 | 1746 | 0.052 | 173684 | 0.07 | 61543 |
Updated p-value results in response to reviewer rbaj
In addition to these results, we also performed a Wilcoxon signed-rank test to evaluate the statistical significance of our results. The datasets that we test on are diverse, and no single method performs well on all of them. Hence, each method has a high variance in Average Precision Score (APS) across the 15 datasets we tested on. The high variance makes statistical significance tests unreliable, as seen in our original rebuttal to reviewer rbaj.
To overcome this issue, we split each dataset into chunks of 50 contiguous data points. This is also relevant for the online learning paradigm, where we want to evaluate the model continuously over time. Since APS is not defined for chunks in which every data point is labeled non-anomalous, we only consider chunks with at least one anomalous label. This splitting mechanism gives us 3,282 chunks across all datasets. We report p-values for the Wilcoxon signed-rank test on the APS scores over these 3,282 chunks (a sketch of this procedure follows the table below). Using a threshold of 0.01, we conclude that our method is statistically different from all base models and ensemble baselines.
| Reference_Model | Competitor_Model | Wilcoxon_PValue |
|---|---|---|
| SEAD | rule_based_models_0 | 6.93E-258 |
| SEAD | rrcf_0 | 4.26E-68 |
| SEAD | rrcf_1 | 1.47E-43 |
| SEAD | rrcf_2 | 3.16E-14 |
| SEAD | rrcf_3 | 7.43E-29 |
| SEAD | xstream_0 | 2.09E-95 |
| SEAD | xstream_1 | 3.53E-12 |
| SEAD | xstream_2 | 0.00018 |
| SEAD | xstream_3 | 0.0016 |
| SEAD | iforestasd_0 | 1.88E-64 |
| SEAD | iforestasd_1 | 8.08E-63 |
| SEAD | iforestasd_2 | 3.48E-57 |
| SEAD | iforestasd_3 | 2.94E-51 |
| SEAD | mean | 6.41E-171 |
| SEAD | max | 6.48E-05 |
| SEAD | min | 6.17E-44 |
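For clarity, a minimal sketch of this chunked evaluation (hypothetical helper names, not our released code) is shown below; it assumes per-point anomaly scores and ground-truth labels are available for both methods on the same stream.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import average_precision_score

CHUNK = 50

def chunked_aps(labels, scores):
    """APS per window of 50 contiguous points; windows without any anomaly are skipped."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    out = []
    for start in range(0, len(labels), CHUNK):
        y, s = labels[start:start + CHUNK], scores[start:start + CHUNK]
        if y.sum() > 0:  # APS is undefined when every label in the chunk is non-anomalous
            out.append(average_precision_score(y, s))
    return out

def chunked_wilcoxon(labels, sead_scores, competitor_scores):
    """Wilcoxon signed-rank test on paired per-chunk APS values."""
    a = chunked_aps(labels, sead_scores)
    b = chunked_aps(labels, competitor_scores)
    return wilcoxon(a, b).pvalue

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
y = (rng.random(500) < 0.1).astype(int)
print(chunked_wilcoxon(y, 0.5 * y + rng.random(500), rng.random(500)))
```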
In light of these points, we sincerely hope that both this reviewer and rbaj will consider increasing their scores.
This paper proposes SEAD (Streaming Ensemble of Anomaly Detectors), a model selection algorithm for streaming, unsupervised AD. The key insight that SEAD leverages is that anomalies are by definition ‘rare’, which allows SEAD to work in a fully unsupervised fashion. SEAD assigns weights to the individual models and updates them using the classical multiplicative weights update (MWU) algorithm for prediction with expert advice. Experiments verify the idea.
Questions for Authors
see above
Claims and Evidence
SEAD relies on the assumption that anomalies are rare in the data stream; that is, it adjusts the weights under the assumption that algorithms with lower anomaly scores are more reliable. However, if the proportion of anomalies is high, some detection algorithms may systematically misjudge, yet SEAD will still give them higher weights. This assumption does not seem applicable to anomaly detection with complex patterns (such as concept drift causing some anomaly categories to become frequent).
Methods and Evaluation Criteria
SEAD needs to calculate anomaly scores for all base detection algorithms and perform MWU weight updates, which may be computationally expensive. In large-scale streaming data, calculating anomaly scores for all detection algorithms and normalizing them may be too expensive. SEAD++ attempts to reduce computation by subsampling detectors, but still needs to maintain and adjust weights between all detectors.
Theoretical Claims
n/a
Experimental Design and Analysis
- Only the Averaged Precision Score (APS) is used for evaluation, which is insufficient. I suggest the authors include additional metrics and comparisons.
- The experiments in the paper are mainly based on standard datasets (ODDS, USP database, etc.), but lack validation on real industrial data. These datasets may not fully reflect real-world anomaly distributions (e.g., network security, industrial fault detection). The paper includes only one internal telemetry dataset and does not provide detailed analysis or reproducible experimental details for it.
- The paper does not compare with the latest unsupervised anomaly detection model selection methods (such as the time-series anomaly detection model selection method proposed by Goswami et al., 2023). The paper compares only with MetaOD (an offline method) and not with other online unsupervised methods. I suggest the authors include the latest unsupervised methods for a fairer experimental comparison.
Supplementary Material
yes
Relation to Prior Literature
[1] SEAD relies on the accumulated information of historical anomaly scores for detector weighting, but does not explicitly handle concept drift. If the data stream distribution changes significantly (such as anomaly pattern changes), SEAD may maintain outdated detector weights, resulting in performance degradation. Some concept drift literature should be discussed.
[2] Not compared with "Unsupervised Model Selection for Time-series Anomaly Detection". ICLR 2023 spotlight.
Essential References Not Discussed
n/a
Other Strengths and Weaknesses
Other weaknesses:
SEAD cannot outperform the base detectors in all cases and only works when at least one detector performs well. If all detectors perform poorly on a given data stream, SEAD cannot achieve good results. SEAD does not actively improve the base detectors; it only performs a weighted combination, which cannot correct their inherent defects.
Other Comments or Suggestions
The presentation should be substantially improved.
How is SEAD adaptive to distribution changes
SEAD updates the weights of the base detectors using multiplicative weight updates (MWU). In the learning theory literature, it has been established that when the learning rate and regularization strength are chosen appropriately, the MWU algorithm is adaptive to distribution shifts; see Cesa-Bianchi et al., 1997, cited in the submission.
Empirically, we show that with the proposed hyper-parameters, SEAD works in the presence of different types of concept drift. We do this by evaluating SEAD on two variants of the large-scale INSECTS dataset, one with gradual concept drift and one with sudden shifts.
“SEAD++ still needs to maintain and adjust weights between all detectors.”
The main reduction in SEAD++ comes from not having to run the forward pass (inference) or the backward pass (gradient computation) for half of the AD models. For many large models, these forward and backward passes are the main computational bottlenecks. While SEAD++ does require maintaining a weight for each base detector, the updates and storage for the weights are trivial and independent of the size of the AD model. In this sense, we do not view maintaining separate weights for each base detector as computationally demanding.
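To make the cost profile concrete, here is a minimal, hypothetical sketch of a SEAD++-style step (illustrative variable names and loss, not our released implementation): the weight update is a handful of array operations per detector, while the expensive part, calling each detector's scoring routine, is performed for only a random half of the models.

```python
import numpy as np

def mwu_step(weights, losses, eta=0.1, lam=1e-4):
    """Multiplicative weights update, mixed with the uniform distribution for stability."""
    w = weights * np.exp(-eta * losses)
    w = w / w.sum()
    return (1.0 - lam) * w + lam / len(w)

def sead_pp_step(weights, detectors, x, rng, eta=0.1, lam=1e-4):
    """One SEAD++-style step: score only a random half of the base detectors."""
    n = len(detectors)
    active = rng.choice(n, size=max(1, n // 2), replace=False)
    scores = np.zeros(n)   # normalized anomaly scores in [0, 1] (assumed)
    losses = np.zeros(n)   # unsampled detectors incur zero loss this round
    for i in active:
        scores[i] = detectors[i].score(x)  # the expensive forward pass, skipped for half the models
        losses[i] = scores[i]              # "anomalies are rare": a low score is treated as a low loss
    weights = mwu_step(weights, losses, eta, lam)
    active_w = weights[active]
    ensemble_score = float(scores[active] @ active_w / active_w.sum())
    return weights, ensemble_score

# Toy usage with stand-in detectors exposing a .score(x) method:
class ToyDetector:
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)
    def score(self, x):
        return float(self.rng.random())    # pretend normalized anomaly score

rng = np.random.default_rng(0)
detectors = [ToyDetector(s) for s in range(8)]
weights = np.full(len(detectors), 1.0 / len(detectors))
for x in range(100):                       # x stands in for streaming data points
    weights, score = sead_pp_step(weights, detectors, x, rng)
```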
Why use Average Precision as the metric and not other metrics like PR-AUC
Both the APS used in the paper and PR-AUC are two different ways of computing the area under the precision-recall curve, a widely accepted performance metric for imbalanced binary classification tasks such as anomaly detection. We use the particular APS implementation reported in the paper rather than PR-AUC because the two differ in how the integral for the area under the curve is approximated (PR-AUC in sklearn uses the trapezoidal rule), and we found the APS implementation more faithful to downstream performance.
While there are other metrics such as best F1, we did not use them because best F1 only shows performance at an optimally chosen threshold, lacks interpretability, and is sensitive to the method used to tune that threshold.
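For illustration, the snippet below contrasts the two computations on synthetic scores: `average_precision_score` is the APS used in the paper, while `auc(recall, precision)` applies the trapezoidal rule to the precision-recall curve. The data here are synthetic and only meant to show the two code paths.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = (rng.random(2000) < 0.05).astype(int)   # ~5% anomalies
y_score = 0.6 * y_true + rng.random(2000)        # noisy anomaly scores

aps = average_precision_score(y_true, y_score)   # step-wise summation of the PR curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)                  # trapezoidal integration of the same curve

print(f"APS = {aps:.3f}   trapezoidal PR-AUC = {pr_auc:.3f}")
```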
On comparison against the Goswami et al., ICLR 2023 paper
We thank the reviewer for pointing out this oversight. We will add this work to the related-work section as the state of the art in offline unsupervised model selection and add the following empirical result to Table 5 as well.
Using the implementation from https://github.com/mononitogoswami/tsad-model-selection, we will add two extra columns for the method of Goswami et al. to Table 5 in the main paper, with an APS of 0.07 and a runtime of 61,543 seconds. Thus, the updated Table 5 will read:
| Dataset | SEAD Average Precision | SEAD Runtime (s) | MetaOD Average Precision | MetaOD Runtime (s) | Goswami et al. Average Precision | Goswami et al. Runtime (s) |
|---|---|---|---|---|---|---|
| Ozone | 0.059 | 1746 | 0.052 | 173684 | 0.07 | 61543 |
In summary, this result shows that while the method of Goswami et al. achieves accuracy similar to SEAD, its runtime is about 40x higher.
Conceptual differences with the Goswami et al. ICLR 2023 paper
The main issue with Goswami et al.'s algorithm is that it is designed for the offline case, where all training data is available up front and a single selected model is applied in batch to the entire inference dataset. To adapt it to the online setting, we ran inference in contiguous batches of 50 data points, retraining the selector before each batch. For each retraining, we used the entire data stream seen thus far and retrained from scratch.
Operating this way significantly increased runtime due to repeated training on the stream. Repeated retraining is needed because we are operating on data streams with non-stationarities. Adapting the algorithm of Goswami et al. to be truly online, where some computations can be reused (for example, retraining need not start from scratch), is non-trivial and requires further research beyond the scope of this paper. This 40x increase in runtime is the fundamental reason we did not evaluate the algorithm of Goswami et al. (or MetaOD) on all the datasets.
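A minimal sketch of this evaluation protocol is shown below; `OfflineSelector` is a toy placeholder for the Goswami et al. pipeline, not their actual code, and exists only to show why the runtime grows so quickly when the selector is retrained from scratch on all history every 50 points.

```python
import numpy as np

class OfflineSelector:
    """Toy placeholder for an offline model-selection pipeline (not the real code)."""
    def fit(self, X):
        self.center = X.mean(axis=0)      # stand-in for the expensive offline training
        return self
    def score(self, X):
        return np.linalg.norm(X - self.center, axis=1)

BATCH = 50
stream = np.random.default_rng(0).normal(size=(1000, 8))   # synthetic data stream
all_scores = []

for start in range(BATCH, len(stream), BATCH):
    history = stream[:start]                        # entire stream seen so far
    selector = OfflineSelector().fit(history)       # full retrain from scratch before each batch
    all_scores.append(selector.score(stream[start:start + BATCH]))

# If training cost is linear in the data size, total work grows roughly quadratically
# in the stream length, whereas SEAD's per-point update cost stays constant.
```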
The paper introduces SEAD (Streaming Ensemble of Anomaly Detectors), an unsupervised, online model selection algorithm for anomaly detection in streaming data, where labels are unavailable, and data distributions change over time. SEAD dynamically assigns weights to multiple anomaly detection models using Multiplicative Weights Update (MWU), ensuring that the best-performing model is prioritized without needing ground truth labels. The method operates in constant time per data point (O(1)), is model-agnostic, and adapts to concept drift by adjusting model weights as the data evolves. SEAD++ further optimizes runtime by sampling a subset of models per step, making SEAD a practical and scalable solution for real-time anomaly detection in streaming environments.
Questions for Authors
There are no critical questions for the authors.
Claims and Evidence
The claims in the paper are largely well-supported by both theoretical reasoning and experimental results. SEAD’s ability to perform unsupervised model selection without labeled data is convincingly demonstrated through its weighting mechanism, which dynamically adjusts model importance based on anomaly scores. Its constant time complexity per data point (O(1)) is justified by its efficient online update strategy, contrasting with slower offline approaches like meta-learning. However, some claims could be refined. While SEAD performs consistently well, it does not always outperform all methods across every dataset, making it more accurate to say it ranks among the top-performing models rather than claiming universal superiority. Additionally, SEAD assumes that at least one base model is effective, meaning it may struggle if all base models are weak, a limitation that is not fully addressed.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are appropriate for the problem of unsupervised anomaly detection in streaming environments, as they align with the challenges of real-time detection, model selection without labels, and handling non-stationary data. The use of a diverse set of anomaly detection algorithms as base models ensures that SEAD’s model selection approach is tested across different detection paradigms. The reliance on anomaly score weighting instead of labeled data makes sense in real-world applications where obtaining labeled anomalies is infeasible. The choice of evaluation datasets is also relevant, as the authors include 15 publicly available datasets that cover different types of anomalies and distributions, ensuring that SEAD is tested in diverse scenarios. The Averaged Precision Score (APS) metric is an appropriate choice for evaluating anomaly detection performance, as it accounts for the highly imbalanced nature of anomaly detection tasks. However, the evaluation could be expanded by testing SEAD’s performance in high-dimensional datasets or comparing it against deep learning-based anomaly detectors, which are increasingly used in large-scale anomaly detection problems.
Theoretical Claims
The paper presents theoretical guarantees for SEAD, in particular a regret bound. This theoretical result aims to show that SEAD's selection strategy is competitive with the performance of the best anomaly detector in hindsight, even without labeled data.
Experimental Design and Analysis
The experimental design in the paper appears well-structured, with evaluations conducted on 15 public datasets, covering a diverse range of anomaly detection scenarios. One aspect that could be further scrutinized is the robustness of the experimental setup across datasets with significantly different anomaly distributions. The paper does not provide a detailed breakdown of how SEAD performs on datasets with varying anomaly rates, which could impact its effectiveness.
Supplementary Material
The supplementary material was reviewed, specifically focusing on additional experimental details, dataset characteristics, and hyperparameter settings. One aspect that stands out is the expanded discussion on SEAD++'s trade-offs, including a sensitivity analysis on the number of sampled detectors. This helps clarify the computational savings and detection accuracy trade-offs, an area that was less detailed in the main paper. The dataset characteristics table also reveals the diversity in anomaly rates across datasets, from extremely imbalanced streams (0.03% anomalies) to more balanced cases (36% anomalies), confirming that SEAD was tested under varying conditions.
Relation to Prior Literature
The key contributions of the paper are well-situated within the broader anomaly detection and online model selection literature, but their novelty and impact appear somewhat incremental rather than groundbreaking. The idea of leveraging multiple anomaly detection models in a streaming setting is an interesting direction, aligning with prior work on ensemble-based anomaly detection and meta-learning for model selection. However, the main theoretical foundation of the method, Multiplicative Weights Update (MWU), is not novel and has been extensively studied in the online learning literature. While the paper applies MWU in a new context by using it for online model selection in anomaly detection, the underlying theory itself is borrowed from well-established frameworks.
From an empirical standpoint, SEAD demonstrates competitive but not significantly superior performance compared to existing baselines. The results indicate that while SEAD adapts dynamically to different data distributions, its overall detection accuracy does not show a substantial improvement over simpler ensemble baselines such as mean and max aggregation of anomaly scores. This suggests that while the method is effective, its contribution in terms of practical anomaly detection performance is not dramatically beyond existing approaches.
Furthermore, while the experimental validation is comprehensive in terms of the number of datasets (15 public datasets), the selection is somewhat limited in diversity, focusing primarily on tabular datasets. The inclusion of vision-based anomaly detection benchmarks or datasets with high-dimensional, structured data (e.g., image or sensor data) could have strengthened the paper’s practical contributions by demonstrating the generalizability of SEAD beyond classical tabular anomaly detection tasks. Expanding the evaluation scope would provide stronger evidence of SEAD’s applicability in real-world, high-dimensional anomaly detection problems.
In summary, while SEAD presents an interesting adaptation of MWU to streaming anomaly detection, its contribution is somewhat incremental in both theoretical and empirical dimensions. The method does not introduce a fundamentally new learning principle, and its empirical improvements over baselines are relatively modest. A broader experimental validation across diverse domains, such as vision-based anomaly detection, would have further enhanced its practical significance.
Essential References Not Discussed
While there may be newer or alternative approaches to anomaly detection ensembles that are not explicitly cited, none appear to be critical omissions that would significantly alter the context or positioning of SEAD’s contributions.
Other Strengths and Weaknesses
While the empirical results are comprehensive, a more detailed discussion of failure cases or performance variability across datasets would strengthen the argument for SEAD’s robustness.
Other Comments or Suggestions
The overall presentation and completeness of the paper could be improved. In particular, Table 2 has text that is too small, making it difficult to read, while Tables 4 and 5 have comparatively larger text, creating an inconsistency in formatting. Additionally, on page 8, there are noticeable empty spaces, which could have been utilized more effectively to improve readability and layout balance. A more efficient arrangement of tables and text could enhance the paper’s clarity and visual coherence, making it easier for readers to follow the results and comparisons.
We thank the reviewer for their valuable questions and comments. Our responses to the raised points are outlined below.
Limitations of SEAD: The performance can only be as good as the best detector
SEAD is the first online, unsupervised model-selection algorithm for anomaly detection (AD). One of the most pressing issues left open in the literature is the choice of which model to use and when. The vast AD literature suggests that there are good models for most situations encountered in practice. Nevertheless, in practical settings, making this choice is hard, and the challenge is exacerbated in the online unsupervised setting, where there are no ground-truth labels and the best model can change over time.
On comparison with simple baselines like mean aggregator
We thank the reviewer for this point. One advantage of SEAD over baselines like the mean aggregator is that its relative performance improves as more uninformative detectors are added. To complement the results in the paper, we performed one more experiment in which we added 13 random detectors that output a random number as the anomaly score for each input. The performance of SEAD does not diminish much, while that of baselines like mean, max, and min deteriorates significantly. The reason is that SEAD quickly identifies the bad detectors and down-weights them, while baselines like mean cannot.
In particular, we will add the table below showing that as more random detectors are added, SEAD's performance degrades far less than that of the baselines (a toy illustration of this effect follows the table).
| dataset | sead | mean | max | min |
|---|---|---|---|---|
| pima | 0.52 | 0.513 | 0.399 | 0.358 |
| pendigits | 0.246 | 0.099 | 0.051 | 0.034 |
| letter | 0.088 | 0.056 | 0.055 | 0.073 |
| optdigits | 0.048 | 0.045 | 0.052 | 0.024 |
| ionosphere | 0.491 | 0.493 | 0.323 | 0.385 |
| wbc | 0.292 | 0.436 | 0.176 | 0.121 |
| mammography | 0.06 | 0.081 | 0.053 | 0.029 |
| glass | 0.078 | 0.082 | 0.127 | 0.07 |
| vertebral | 0.137 | 0.154 | 0.193 | 0.192 |
| cardio | 0.518 | 0.195 | 0.147 | 0.098 |
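The following toy snippet (synthetic data and hand-set weights, not the experiment above) illustrates the mechanism: with one informative detector and 13 random scorers, equal-weight mean pooling is dragged down by the noise, while a weight vector concentrated on the informative detector, of the kind SEAD learns online, is barely affected.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 5000
y = (rng.random(n) < 0.05).astype(int)                        # ~5% anomalies

good = np.clip(0.7 * y + rng.normal(0.0, 0.2, n), 0.0, 1.0)   # one informative detector
noise = rng.random((13, n))                                   # 13 purely random detectors
scores = np.vstack([good[None, :], noise])

mean_agg = scores.mean(axis=0)                                # equal-weight mean pooling
w = np.array([1.0] + [0.01] * 13)                             # weights a SEAD-like learner would reach
weighted_agg = (w @ scores) / w.sum()

print("mean aggregation APS:", round(average_precision_score(y, mean_agg), 3))
print("down-weighted noise APS:", round(average_precision_score(y, weighted_agg), 3))
```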
The paper introduces SEAD, an unsupervised model selection procedure for streaming anomaly detection. This is the first work of its kind, and it provides both theoretical guarantees and robustness to uninformative AD models compared to baselines. However, the empirical evaluation is not entirely convincing in showing advantages over the mean-aggregation baseline. Considering critical difference diagrams for adjusted hypothesis testing might provide a clearer quantitative comparison.