PaperHub
Overall Rating: 6.5 / 10
Poster · 4 reviewers
Ratings: 6, 6, 6, 8 (min 6, max 8, std 0.9)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.5
NeurIPS 2024

Unsupervised Anomaly Detection in The Presence of Missing Values

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Keywords
missing data, anomaly detection, deep learning

Reviews and Discussion

Review
Rating: 6

In anomaly detection, where training data consists only of normal instances, conventional missing value imputation approaches may cause imputation bias, meaning that imputations are inclined to make anomalous incomplete instances appear normal. This study addressed this issue by proposing an end-to-end training method that incorporates missing value imputation and anomaly detection into a unified optimization problem. The experimental results demonstrated that the proposed method can mitigate imputation bias, thereby outperforming conventional "impute-then-detect" methods in various anomaly detection benchmarks.

Strengths

  • Anomaly detection in the presence of missing values is a very important problem in practice, while few existing studies have addressed this issue.
  • This study addresses the issue properly using a novel approach.
  • The manuscript is overall well written, providing detailed justification for the approach.

Weaknesses

  • The implementation details are obscure in the manuscript.
  • Discussion on when the proposed method is more effective is needed.
  • Evaluation on realistic scenarios that motivated this study is needed.

Questions

  • What does the imputer network look like? The network presumably cannot process missing values directly, so how were the missing values handled before being fed into the network? This should be clarified.
  • In Table 8, the hyperparameter settings differ between datasets. It is important to explain how they were chosen. It is very inappropriate if the authors chose them based on test performance. Additionally, how is overfitting prevented?
  • I suppose the proposed method may work well when the distribution of each feature is significantly different between normal and anomalous data. It would be great if the authors investigated when the proposed method was more effective compared to "impute-then-detect" baselines.
  • In the real world, the mechanism by which missing values appear in data is usually unknown and cannot be inferred from the data alone; it can only be conjectured based on domain knowledge. If we don't know the actual missing mechanism of the data, how do we choose the missing mechanism $M$ when generating pseudo-abnormal samples? Is the effectiveness sensitive to the suitability of the chosen missing mechanism? For example, what happens if the actual mechanism is MNAR but we set $M$ to MCAR?
  • As motivated by cases of abnormal user detection in recommendation systems and novel or anomalous cell detection in bioinformatics, where the missing rates can be higher than 30% or even 80%, the effectiveness of the proposed method should be evaluated under such conditions. The benchmark datasets used in the experiments do not cover such conditions.

Limitations

Author Response

We are grateful for your reviews and recognition of our work. Our responses to your questions are as follows.

Response to Weakness 1 and Question 1:

The imputer is an MLP whose backbone is $\text{input} \rightarrow 512 \rightarrow 128 \rightarrow 128 \rightarrow 512 \rightarrow \text{output}$; we use LeakyReLU as the activation function and no bias terms. As depicted in Figure 4 of our manuscript, the missing values in incomplete data are filled with zeros before being fed into the network.
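For concreteness, here is a minimal sketch of such an imputer, assuming PyTorch; the class name, the choice to overwrite only the missing positions with the network output, and the absence of an activation after the output layer are our illustrative assumptions, not details confirmed above.

```python
import torch
import torch.nn as nn

class Imputer(nn.Module):
    """MLP imputer with backbone input -> 512 -> 128 -> 128 -> 512 -> output,
    LeakyReLU activations, and no bias terms, as described in the response."""
    def __init__(self, dim):
        super().__init__()
        widths = [dim, 512, 128, 128, 512, dim]
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(w_in, w_out, bias=False), nn.LeakyReLU()]
        self.net = nn.Sequential(*layers[:-1])  # assumption: no activation on the output

    def forward(self, x, observed_mask):
        # observed_mask: 1.0 where a value is present, 0.0 where it is missing
        x_obs = torch.nan_to_num(x) * observed_mask  # zero-fill missing entries (cf. Figure 4)
        x_hat = self.net(x_obs)
        # assumption: keep observed values, take the network output only at missing positions
        return x_obs + (1.0 - observed_mask) * x_hat
```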

Response to Question 2:

As shown in the table, we have the latent dimension $d$, the learning rate $\eta$, and the loss parameters $\alpha, \beta, \lambda$ to tune. We have to say that hyperparameter tuning for unsupervised learning is a very challenging task, and so far there is no good solution.

  • Since different datasets usually have different ambient dimensions, we need to use different latent dimensions. Therefore, in the experiments, we let $d=4$ for low-dimensional data, $d=32$ for moderate-dimensional data, and $d=128$ for high-dimensional data. These values were chosen heuristically rather than tuned.

  • For the learning rate $\eta$, we simply set it to $0.0001$, which works well in most cases, as shown in the table. However, due to the high diversity of the datasets, we use other values for some datasets to ensure convergence speed.

  • For $\alpha, \beta, \lambda$, we set all values to 1, which works well in most cases. However, due to the high diversity of the datasets, in some cases we tune them slightly according to the test performance. This strategy is very common in unsupervised learning such as anomaly detection; almost all papers on unsupervised anomaly detection use it.

It should be emphasized that to ensure fairness, we have sufficiently tuned the hyperparameters of all compared methods in the experiments.

Regarding overfitting: since our neural network is not very complex, the number of data points in each dataset is relatively large, and we fixed the number of optimization epochs to a small value, overfitting does not appear to occur; otherwise, the AUROC score of our method would not be high.

Response to Weakness 2 and Question 3:

According to our experimental settings and empirical results, we have the following findings:

  • According to the experimental settings and results in our manuscript, the detection performance of our proposed method exceeds that of all baselines in most cases, which indicates that our method is more effective than "impute-then-detect" baselines under unsupervised settings.
  • Moreover, according to some new experimental results (see Figure 1 in the attached PDF), the effect of the imputation bias from normal data on the "impute-then-detect" methods gradually diminishes as the missing rate increases. Therefore, our proposed method is more effective when the missing rate is relatively small, say less than 50%.
  • The generated pseudo-abnormal samples are critical to our method, and our proposed method is more effective when they overlap more significantly with real abnormal samples.

Response to Question 4:

This is a very practical and valuable question. In real scenarios, we generally don't know the missing mechanism of incomplete data. In such a situation, our method chooses the simplest missing mechanism (MCAR) based on Occam's Razor. For your second question, we have designed two experiments to answer it.

  1. Observing the detection performance on a real incomplete dataset when using different missing mechanisms to generate pseudo-abnormal samples. In this situation, we don't know what the real missing mechanism is for the incomplete data.

  2. Observing the detection performance on synthetic incomplete datasets when using different missing mechanisms to generate incomplete data and pseudo-abnormal samples. In this situation, we know the missing mechanism of the incomplete data.

The experimental results are provided in the following table.

The MCAR, MAR, and MNAR column groups give the missing mechanism used to generate the pseudo-abnormal samples.

| Dataset | Mechanism on Normal Data | MCAR AUROC (%) | MCAR AUPRC (%) | MAR AUROC (%) | MAR AUPRC (%) | MNAR AUROC (%) | MNAR AUPRC (%) |
|---|---|---|---|---|---|---|---|
| Titanic | Unknown | 82.09 | 81.39 | 79.06 | 77.08 | 80.50 | 79.17 |
| MovieLens1M | Unknown | 66.32 | 65.34 | 63.14 | 63.39 | 61.44 | 60.91 |
| Bladder | Unknown | 100.00 | 100.00 | 99.95 | 99.95 | 100.00 | 100.00 |
| Seq2_Heart | Unknown | 96.62 | 96.40 | 96.79 | 96.60 | 95.56 | 94.41 |
| Adult | MCAR | 71.19 | 71.50 | 64.11 | 66.44 | 67.28 | 66.72 |
| Adult | MAR | 65.66 | 67.23 | 74.61 | 70.74 | 71.14 | 69.69 |
| Adult | MNAR | 70.69 | 69.17 | 68.35 | 68.78 | 71.60 | 68.97 |

According to the experimental results, on real incomplete data our method is robust to the choice of missing mechanism for the masks of the generated pseudo-abnormal samples, and it performs best overall with MCAR. Therefore, based on Occam's Razor and the empirical results, we recommend using MCAR as the missing mechanism for generated pseudo-abnormal samples when the real missing mechanism is unknown. Moreover, on synthetic incomplete data, detection performance degrades when different missing mechanisms are used to generate the incomplete normal data and the pseudo-abnormal data.
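For reference, masks like those used above can be generated as follows; this is a minimal NumPy sketch in which the self-masking MNAR variant is one illustrative choice among many, not the exact mechanism used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcar_mask(n, d, missing_rate):
    """MCAR: every entry is missing independently with the same probability."""
    return rng.random((n, d)) < missing_rate  # True = missing

def mnar_mask(x, missing_rate):
    """A simple self-masking MNAR variant: larger values are more likely
    to be missing (illustrative only)."""
    ranks = x.argsort(axis=0).argsort(axis=0) / (len(x) - 1)  # per-column ranks in [0, 1]
    probs = np.clip(2.0 * missing_rate * ranks, 0.0, 1.0)     # mean probability ~ missing_rate
    return rng.random(x.shape) < probs

X = rng.normal(size=(1000, 20))
mask = mcar_mask(*X.shape, missing_rate=0.5)
X_incomplete = np.where(mask, np.nan, X)
```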

Response to Question 5:

In fact, our work already covers such real scenarios in Section 4 of our manuscript. In Table 1 of our manuscript, we introduced four real incomplete datasets, including Titanic (pattern recognition), MovieLens1M (recommendation systems), Bladder, and Seq2-Heart (cell analysis), with missing rates of 10.79%, 82.41%, 86.93%, and 88.51%, respectively. The related experimental results are reported in Table 3 of our manuscript.

Thank you again for the comments.

Comment

Thank you for your response and clarification. After considering your reply to my comments and those of other reviewers, I've decided to increase the rating to 6.

Comment

We sincerely appreciate your recognition. We will keep working to improve every aspect of this work.

Review
Rating: 6

This study addresses the challenge of anomaly detection in the presence of missing data, which is common in various fields like recommendation systems and bioinformatics. Traditional methods struggle with missing data, leading to biased imputations and ineffective anomaly detection. The study proposes an integrated approach that combines data imputation with anomaly detection in a unified optimization framework. By generating pseudo-abnormal samples during training, the method mitigates imputation biases and enhances anomaly detection performance. The approach is supported by theoretical guarantees and outperforms baseline methods in experimental evaluations on diverse datasets.

Strengths

  • The paper is well-written
  • The proposed method is novel
  • Extensive experiments are conducted to robustly support the claims

Weaknesses

  • There is no report on computational time.

Questions

N/A

Limitations

N/A

Author Response

We are grateful for your review and recognition of our work. Regarding your concern about computational time, we provide comparisons of the theoretical time complexity and the empirical time cost of our proposed method and the baselines in Table 1 and Table 2 below.

The notations used in the complexity analysis (Table 1) are explained as follows:

  • $n, m$ denote the number of samples in the training phase and inference phase, respectively.

  • MissForest is a well-known data imputation algorithm based on random forests, with complexity $\mathcal{O}(t_1 \cdot v \cdot n\log{n})$, where $t_1$ denotes the number of trees and $v$ the number of attributes.

  • $T, T_g, T_d, T_{ae}, T_{oc}$ denote the numbers of iterations of the corresponding methods.

  • $\bar{L}$ and $\bar{d}$ denote the number of layers and the maximum layer width of the corresponding models, respectively.

  • $t_2$ denotes the number of trees of I-Forest, and $t$ is the maximum number of iterations of the Sinkhorn algorithm.

  • $p, \psi, K$ denote key parameters of the corresponding methods.

Table 1: Time complexity of the training and inference phases.

| DI Method | AD Method | Time Complexity (Training) | Time Complexity (Inference) |
|---|---|---|---|
| MissForest | I-Forest | $\mathcal{O}(T \cdot p(t_1 \cdot v \cdot n\log{n}) + t_2 \cdot \psi \log{\psi})$ | $\mathcal{O}(p(t_1 \cdot v \cdot m\log{n}) + t_2 \cdot m\log{\psi})$ |
| MissForest | Deep SVDD | $\mathcal{O}(T \cdot p(t_1 \cdot v \cdot n\log{n}) + (T_{ae} + T_{oc})(n\bar{d}^2\bar{L} + n))$ | $\mathcal{O}(p(t_1 \cdot v \cdot m\log{n}) + (m\bar{d}^2\bar{L} + m))$ |
| MissForest | NeutraL AD | $\mathcal{O}(T \cdot p(t_1 \cdot v \cdot n\log{n}) + T(n\bar{d}^2\bar{L} + nK))$ | $\mathcal{O}(p(t_1 \cdot v \cdot m\log{n}) + (m\bar{d}^2\bar{L} + mK))$ |
| MissForest | DPAD | $\mathcal{O}(T \cdot p(t_1 \cdot v \cdot n\log{n}) + T(n\bar{d}^2\bar{L} + n^2))$ | $\mathcal{O}(p(t_1 \cdot v \cdot m\log{n}) + (m\bar{d}^2\bar{L} + mn))$ |
| GAIN | I-Forest | $\mathcal{O}((T_g + T_d)n\bar{d}^2\bar{L} + t_2 \cdot \psi \log{\psi})$ | $\mathcal{O}(m\bar{d}^2\bar{L} + t_2 \cdot m\log{\psi})$ |
| GAIN | Deep SVDD | $\mathcal{O}((T_g + T_d)n\bar{d}^2\bar{L} + (T_{ae} + T_{oc})(n\bar{d}^2\bar{L} + n))$ | $\mathcal{O}(m\bar{d}^2\bar{L} + (m\bar{d}^2\bar{L} + m))$ |
| GAIN | NeutraL AD | $\mathcal{O}((T_g + T_d)n\bar{d}^2\bar{L} + T(n\bar{d}^2\bar{L} + nK))$ | $\mathcal{O}(m\bar{d}^2\bar{L} + (m\bar{d}^2\bar{L} + mK))$ |
| GAIN | DPAD | $\mathcal{O}((T_g + T_d)n\bar{d}^2\bar{L} + T(n\bar{d}^2\bar{L} + n^2))$ | $\mathcal{O}(m\bar{d}^2\bar{L} + (m\bar{d}^2\bar{L} + mn))$ |
| ImAD (Ours) | – | $\mathcal{O}(T(n\bar{d}^2\bar{L} + t \cdot n^2))$ | $\mathcal{O}(m\bar{d}^2\bar{L} + m)$ |

(The imputation step alone costs $\mathcal{O}(T \cdot p(t_1 \cdot v \cdot n\log{n}))$ for MissForest and $\mathcal{O}((T_g + T_d)n\bar{d}^2\bar{L})$ for GAIN.)

We used the Speech and Usoskin datasets to benchmark the time cost of all methods, ours and the baselines alike. These two datasets exemplify the two distinct categories of tabular datasets used in our experiments: Speech contains 3,686 instances with 400 attributes, and Usoskin contains 610 instances with 25,334 attributes. All experiments were conducted on a 20-core Intel(R) Xeon(R) Gold 6248 CPU with one NVIDIA Tesla V100 GPU (CUDA 12.0). The results are provided in Table 2.

Table 2: Time cost (seconds) on the Speech and Usoskin datasets.

| DI Method | AD Method | Speech (Training) | Speech (Inference) | Usoskin (Training) | Usoskin (Inference) |
|---|---|---|---|---|---|
| MissForest | I-Forest | 86.48 | 82.69 | 5648.62 | 5662.65 |
| MissForest | Deep SVDD | 109.12 | 82.59 | 5651.65 | 5660.07 |
| MissForest | NeutraL AD | 115.17 | 82.59 | 5658.38 | 5660.11 |
| MissForest | DPAD | 106.40 | 82.66 | 5652.64 | 5660.09 |
| GAIN | I-Forest | 149.95 | 0.11 | 3664.48 | 9.42 |
| GAIN | Deep SVDD | 172.59 | 0.01 | 3567.51 | 6.84 |
| GAIN | NeutraL AD | 178.64 | 0.01 | 3574.24 | 6.88 |
| GAIN | DPAD | 169.87 | 0.08 | 3568.50 | 6.86 |
| ImAD (Ours) | – | 471.43 | 0.02 | 95.04 | 0.04 |

Based on the theoretical analysis of time complexity in Table 1 and the empirical results in Table 2, the "impute-then-detect" pipelines incur high time costs on datasets like Usoskin with a large number of attributes. In contrast, our proposed method shows significant efficiency advantages in the inference phase on both Speech-like and Usoskin-like datasets.

Comment

Thank you for the explanation. The authors have addressed my concerns, so I have raised my score to 6.

Comment

Thank you so much for your feedback and support. Your suggestions further enhanced the quality of our work.

Review
Rating: 6

This paper introduces ImAD, an end-to-end approach to anomaly detection in the presence of missing data. It addresses the imputation bias observed in the traditional impute-then-detect approaches, where the imputation model trained only on normal data tends to normalize incomplete abnormal samples. The proposed method generates pseudo-abnormal samples in the latent space and uses them in joint learning of the imputor, reconstructor, and projector.

Strengths

  • It proposes a novel approach that combines data imputation and anomaly detection in a single framework, which has not been explored much in previous studies.
  • The idea of generating pseudo-abnormal samples in the latent space and using them for joint learning of the imputer, projector, and reconstructor makes sense and shows strong empirical performance.

Weaknesses

  • Although the authors provide extensive details on the data and the experimental setup, some crucial information is still missing. For instance, it does not explain how the optimal hyperparameters for each method (both the proposed and others) are selected for the results shown in the tables and figures (whether by using a validation set, or referring to the performance on the test set, or using other methods), which can significantly affect the results and generalizability. Additionally, the composition of the training and test set split is not explained.
  • The results from using only two missing rate values (0.2 and 0.5) may not provide enough evidence about its robustness and effectiveness. The previous study cited as a reference for this setting (Yoon et al., 2018) explored a broader range, from 0.1 to 0.8. Therefore, certain claims regarding the imputation bias (e.g., the authors state that the detection performance of "impute-then-detect" methods does not decrease as the missing rate increases from 0.2 to 0.5 because of the imputation bias issue) are not fully supported empirically.

Questions

  • Are the missing rates used for training and test data the same? How would it affect the performance if different values are used?
  • Could you provide more information on how to select the optimal hyperparameters in the given experiments and also in general situations?
  • I am curious to see how the performance would change to different missing rates in comparison with other methods (for a wider range of values).
  • Is the imputer learned by this model only effective for the task of anomaly detection, or can it be applied to other tasks as well?
  • What are the specific details of the architectures for MLPs used for each main component? How would different architectures affect the performance?
  • Is the method applicable to non-tabular data?

Limitations

Limited experimental setting and ablation study.

Author Response

We are grateful for your review and suggestions. Our responses to your questions are as follows.

Response to Weakness 1 and Question 2:

Since we study unsupervised anomaly detection, there is no validation set during the training stage. As shown in Table 8 of our paper, we have the latent dimension $d$, the learning rate $\eta$, and the loss hyperparameters $\alpha, \beta, \lambda$ to tune. Since we did not want to tune these hyperparameters, we initially set $d = 4, 32, 128$ for low-, moderate-, and high-dimensional data respectively, $\eta = 0.0001$, and $\alpha = \beta = \lambda = 1$. This setting works well in most cases. However, due to the high diversity of the datasets (they have different sizes, come from different fields, and have different missing rates and patterns), we had to tune the hyperparameters slightly w.r.t. the test performance. This is actually the convention of unsupervised anomaly detection; the strategy is commonly used in unsupervised learning such as clustering, novelty detection, and representation learning. Note that we used grid search w.r.t. the test set to find the optimal hyperparameters for all baselines to ensure fair comparisons. Tuning hyperparameters for unsupervised learning remains an open problem [1], although automated machine learning has made considerable progress for supervised learning.

[1] Fan et al. A simple approach to automated spectral clustering. NeurIPS 2022.

In Appendix I.1, we describe in detail the data sources of all the used datasets as well as the settings for normal and abnormal samples; the arrhythmia and Speech datasets are from ODDS (Outlier Detection DataSets) and come with inherent normal/abnormal labels. For the split of the training and test sets, we first set the ratio of normal to abnormal samples in the test set to 1:1, and then use all the remaining normal samples as the training set.
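A minimal sketch of this split (assuming NumPy arrays; drawing the held-out normal samples uniformly at random is our assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_dataset(X_normal, X_abnormal):
    """Test set: all abnormal samples plus an equal number of normal samples
    (1:1 ratio); training set: all remaining normal samples."""
    idx = rng.permutation(len(X_normal))
    n_test_normal = len(X_abnormal)
    test_idx, train_idx = idx[:n_test_normal], idx[n_test_normal:]
    X_train = X_normal[train_idx]
    X_test = np.vstack([X_normal[test_idx], X_abnormal])
    y_test = np.concatenate([np.zeros(n_test_normal), np.ones(len(X_abnormal))])
    return X_train, X_test, y_test
```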

Response to Weakness 2 and Question 3:

  1. In fact, besides the datasets with synthetic missing values (missing rates of 20% and 50%), our work includes real datasets with inherent missing values whose missing rates are about 10% and 80%. Please refer to Table 1 and Table 3 in our manuscript.

  2. The main experiments in (Yoon et al., 2018) were conducted with a 20% missing rate only; the paper tested only one dataset with missing rates ranging from 10% to 80%, in its ablation study.

  3. In this rebuttal, we added experiments on the Speech dataset with missing rate $\text{mr} \in \{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8\}$; the results are shown in Figure 1 of the attached PDF, where the detection performance of "impute-then-detect" methods does not degrade, and some of them even improve, as the missing rate increases from 0.1 to 0.8. Moreover, our proposed method (ImAD) outperforms all baselines in almost all cases.

Response to Question 1:

This is a very practical question. Yes, they are the same in our experiments, because this setting is consistent with existing data imputation works [Yoon et al., 2018; Muzellec et al., 2020] and also facilitates comparison among the baselines.

However, the issue you mention is meaningful and worth exploring in depth for practical imputation scenarios. We conducted related experiments on the Speech dataset: we keep the missing rate $\text{mr}=0.5$ on the training set and vary the missing rate from 0.2 to 0.8 on the test set. The experimental results are visualized in Figure 2 of the attached PDF. Under this setup, the performance of the Mean-Filling-based methods fluctuates little, while the performance of the MissForest- and GAIN-based methods fluctuates significantly. Our proposed method also shows some degree of performance fluctuation.

Response to Question 4:

Thanks for your question. It is hard to answer accurately from the current results. Since our work targets anomaly detection with missing data, the imputer is specially designed for the detector, and the overall method is end-to-end, which is very different from the "impute-then-detect" pipelines. Therefore, we have to say that, currently, the learned imputer in our method is only effective for the task of anomaly detection.

Response to Question 5:

The architectures of each module of our method are provided as follows:

  • Imputer: $\text{input} \rightarrow 512 \rightarrow 128 \rightarrow 128 \rightarrow 512 \rightarrow \text{output}$
  • Projector: $\text{input} \rightarrow 512 \rightarrow 256 \rightarrow 128 \rightarrow \text{output}$
  • Reconstructor: $\text{input} \rightarrow 128 \rightarrow 512 \rightarrow \text{output}$

In all three modules, we used LeakyReLU as the activation function and didn't use bias terms.

Based on the empirical results, the designed architecture exhibits superior performance compared to all baselines. Therefore, we did not further explore the effects of the network architecture; it is also not a key contribution of our work.

Response to Question 6:

Thanks for your question. Anomaly detection and data imputation are ubiquitous tasks across various data types. In this work, we primarily focus on incomplete tabular data, since missing values are quite common in tabular data, and anomaly detection with missing tabular data has many practical applications, such as identifying abnormal users in recommendation systems and discovering abnormal cells in bioinformatics. However, anomaly detection with missing data on other data types such as images may raise new questions and challenges, such as in what scenarios the method is needed and how to define a meaningful missing pattern for images. Therefore, we are not sure whether our method can be directly applied to non-tabular data with missing values, but it is worth studying in the future.

Comment

Thank you for your detailed explanation and additional results. While I still believe referencing test set performance is not a standard convention even in unsupervised AD, I acknowledge that hyperparameter tuning for unsupervised learning is a challenging issue. The authors' effort to set up a fair comparison scenario is reasonable. Most of my other concerns and questions are addressed, so I have raised my score to 6.

Comment

We are very grateful for your feedback and support. Hyperparameter tuning for unsupervised AD is a very important and practical problem. We hope that the researchers in the community can find a good solution to this problem in the future.

Review
Rating: 8

The paper proposes a unified framework to find anomalies in data with missing attribute values. Instead of relying on the impute-then-detect approach, which can lead to imputation bias, the authors propose a multi-objective learning framework in which imputation and data modeling are done together. The core assumption is that there is a latent space in which the normal data lies within a sphere and anomalous data lies in the region outside this sphere, and sampling from this anomalous region provides anomalous samples in the original space. This is done by learning two-way mappings between the latent space and the original space. An imputation map is also learnt for the anomalous data. Details of a neural-network-based implementation are given.

Experimental results on several benchmark data sets are given to showcase two capabilities: the proposed method can indeed identify anomalies when the data has missing attribute values better than other methods which do not impute, and the proposed method has better performance than impute-then-detect scheme.

Strengths

  • This is an interesting paper with an innovative approach. The idea of combining imputation and detection in a joint optimization framework is novel.

  • Authors have supported their assumptions with theoretical results.

  • The experimental evaluation is detailed and robust and largely supports the claims made in the paper.

Weaknesses

  • Experimental results are only marginally better than other solutions. I must admit that the method consistently outperforms the best existing approach so it might be better to use this approach.

Questions

  • How sensitive is the performance to the choice of $r_1$ and $r_2$? I feel that this choice could be very impactful.
  • I am not clear on the need for Sinkhorn distance. When will the samples not be pairwise?

Limitations

Limitations are discussed, though more discussion on the sensitivity to parameter choices might be included. The paper does not have any immediate societal concerns.

Author Response

We are very pleased and honored to receive your positive evaluation of our work. Our responses to your questions are as follows.

Response to Question 1:

As you are concerned, we have explored the influence of the constrained radii $r_1, r_2$ on detection performance; the related results are reported in Appendix G of our manuscript. According to Proposition A.1 in our manuscript, $r = \sigma\sqrt{F_d^{-1}(p)}$, where $\sigma$ and $d$ are the standard deviation and dimension of the target distribution, respectively, and $p$ denotes the sampling probability. We set the target distributions as $\mathcal{D}_{\mathbf{z}} \sim \mathcal{N}(\mathbf{0}, 0.5^2 \cdot \mathbf{I}_d)$ and $\mathcal{D}_{\tilde{\mathbf{z}}} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_d)$, with $p = 0.9$. We vary the dimension $d \in \{4, 8, 16, 32, 64, 128, 256, 512\}$ of the target space and obtain the corresponding $r_1, r_2$ (see the following table).

| Latent Dimension ($d$) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| $r_1 = 0.5\sqrt{F_d^{-1}(0.9)}$ | 1.39 | 1.82 | 2.42 | 3.26 | 4.44 | 6.10 | 8.45 | 11.76 |
| $r_2 = \sqrt{F_d^{-1}(0.9)}$ | 2.78 | 3.65 | 4.85 | 6.52 | 8.88 | 12.20 | 16.90 | 23.52 |
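The radii in the table can be reproduced directly, since for a Gaussian target $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_d)$ we have $\|\mathbf{z}\|^2 / \sigma^2 \sim \chi^2_d$, so $F_d$ is the chi-squared CDF with $d$ degrees of freedom. A minimal sketch assuming SciPy:

```python
import numpy as np
from scipy.stats import chi2

p = 0.9
for d in (4, 8, 16, 32, 64, 128, 256, 512):
    q = chi2.ppf(p, df=d)   # F_d^{-1}(0.9)
    r1 = 0.5 * np.sqrt(q)   # sigma = 0.5 for D_z
    r2 = np.sqrt(q)         # sigma = 1.0 for D_z-tilde
    print(f"d={d:3d}  r1={r1:5.2f}  r2={r2:5.2f}")
```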

According to the results (Appendix G of our manuscript), our method is not very sensitive to changes in the radii $r_1$ and $r_2$, but its performance degrades as the latent dimension is reduced. This is reasonable, since a smaller latent dimension results in more information loss.
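To make the role of the two radii concrete, here is a rejection-sampling sketch under our reading that pseudo-abnormal latent codes are drawn from the shell between $r_1$ and $r_2$; the authors' actual sampling procedure may differ.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
d, p = 32, 0.9
r1 = 0.5 * np.sqrt(chi2.ppf(p, df=d))  # boundary of the normal-data ball
r2 = np.sqrt(chi2.ppf(p, df=d))        # outer boundary for pseudo-abnormal codes

def sample_pseudo_abnormal(n):
    """Draw latent codes from N(0, I_d) and keep those with r1 < ||z|| <= r2."""
    accepted, total = [], 0
    while total < n:
        z = rng.normal(size=(4 * n, d))
        norms = np.linalg.norm(z, axis=1)
        keep = z[(norms > r1) & (norms <= r2)]
        accepted.append(keep)
        total += len(keep)
    return np.vstack(accepted)[:n]

z_tilde = sample_pseudo_abnormal(512)
```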

Response to Question 2:

In the proposed method ImAD, we utilize a projector $\mathcal{P}: \mathbb{R}^m \rightarrow \mathbb{R}^d$ to transform $\mathcal{D}_{\mathbf{x}}$ (the normal data distribution) and $\mathcal{D}_{\tilde{\mathbf{x}}}$ (the pseudo-abnormal data distribution) into $\mathcal{D}_{\mathbf{z}}$ and $\mathcal{D}_{\tilde{\mathbf{z}}}$, respectively. In this process, the samples from the data distribution ($\mathcal{D}_{\mathbf{x}}$ or $\mathcal{D}_{\tilde{\mathbf{x}}}$) and the target distribution ($\mathcal{D}_{\mathbf{z}}$ or $\mathcal{D}_{\tilde{\mathbf{z}}}$) have no pairwise correspondence, and we need to measure the discrepancy between $\mathcal{P}(\mathcal{D}_{\mathbf{x}})$ and $\mathcal{D}_{\mathbf{z}}$ using their finite samples. Thus, we use the Sinkhorn divergence to handle this situation.
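As a concrete illustration, the discrepancy between two unpaired sample sets can be measured with entropy-regularized optimal transport; below is a minimal sketch using the POT library as an assumed stand-in for the paper's implementation (note that the debiased Sinkhorn divergence additionally subtracts self-transport terms).

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
Z_proj = rng.normal(size=(256, 32))          # stand-in for P(x), x ~ D_x
Z_target = 0.5 * rng.normal(size=(256, 32))  # samples from D_z = N(0, 0.5^2 I_d)

# Uniform weights over the two (unpaired) sample sets
a = np.full(len(Z_proj), 1.0 / len(Z_proj))
b = np.full(len(Z_target), 1.0 / len(Z_target))

M = ot.dist(Z_proj, Z_target)          # pairwise squared Euclidean cost matrix
loss = ot.sinkhorn2(a, b, M, reg=0.1)  # entropy-regularized OT cost
print(float(loss))
```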

Response to Limitations:

        The key hyperparameters of our proposed method include the latent dimension $d$, the constrained radii $r_1, r_2$, the learning rate, and the trade-off coefficients $\alpha, \beta, \lambda$ in the optimization objective. According to the results in Appendix G of our manuscript, our method is not very sensitive to changes in the radii $r_1$ and $r_2$, but its performance degrades as the latent dimension $d$ is reduced. The choice of $d$ is based on the data dimension: typically, high data dimensions call for high latent dimensions. According to the ablation study on $\alpha, \beta, \lambda$ in Appendix H and the hyperparameters selected via grid search in Appendix I.5, changes in the three coefficients have a non-trivial impact on detection performance, which indicates that each module corresponding to $\alpha$, $\beta$, or $\lambda$ is indispensable to the proposed method. Moreover, when setting $\alpha = \beta = \lambda = 1$ (the same coefficient as the Sinkhorn divergence term), ImAD achieves good performance in most cases based on the empirical results. The learning rate is set to 0.0001 on most datasets; other choices mainly aim to make the optimization converge faster.

Comment

Thanks for your clarifications. I stand by my rating.

Comment

We highly appreciate your feedback and recognition.

Author Response

We appreciate the comments made by all reviewers. We summarize the major work of this rebuttal as follows:

  • As requested by Reviewer 4yRa, we added a time complexity analysis (in $\mathcal{O}(\cdot)$ form) and the running time cost of the compared methods in Table 1 and Table 2 of the attached PDF. On high-dimensional data (e.g., Usoskin with 25,000+ features), our method is at least 30 times faster than the competitors.

  • As requested by Reviewer US97, we added experiments with the missing rate varying from 0.1 to 0.8 and experiments with different missing rates on the training and test sets. Due to space limitations, we provide these results in Figure 1 and Figure 2 of the attached PDF. Our method outperforms the other methods in almost all cases.

  • As suggested by Reviewer aFbK, we added experiments studying the effects of different missing mechanisms for the masks of the generated pseudo-abnormal samples, where previously we only used MCAR. The related results are provided in Table 3 of the attached PDF. Our method is quite robust to the choice of missing mechanism for these masks.

In addition to this global rebuttal and the attached PDF, we have responded to each reviewer's specific questions separately. Thank you again for all the reviewers' comments and suggestions. We look forward to further discussion with you.

Final Decision

This paper proposes a novel unsupervised anomaly detection framework for incomplete data, with a focus on mitigating the imputation bias of the common "impute-then-detect" practice. It first generates incomplete pseudo-abnormal data and then jointly learns an imputation model and a detection model from both normal and pseudo-abnormal data. Theoretical support for the designed framework is provided, and experimental results on 11 datasets show the framework outperforms several baseline methods (in terms of AUROC and AUPRC) on most datasets.

All reviewers find the design novel and the paper acceptable. Most reviewers requested more detailed information on the experiments, especially regarding the implementation of the network and the choice of hyperparameters. These requests are, overall, well addressed in the rebuttal. In addition, the requested computational complexity analysis and a few more sensitivity analyses are also provided.

Overall, this is a well-written paper that presents a novel unsupervised anomaly detection framework for incomplete data, whose effectiveness is justified by theoretical promises and comprehensive empirical evaluations.