Fairness-aware Anomaly Detection via Fair Projection
Abstract
Reviews and Discussion
This paper presents analysis and methodology for anomaly detection in an unsupervised setting while ensuring group fairness. The paper presents basic definitions and a feasibility analysis for fairness in this setting, along with a method that maps groups to a common target distribution and an overall fairness metric that does not require a fixed threshold. The paper also presents computational results on practical benchmarks.
Strengths and Weaknesses
The strengths of the paper are in its overall framing of fairness issues in this setting and in its introduction of an effective method. Given the relatively little work on the topic, this is significant. The weaknesses are the relatively limited theoretical basis for the method (especially relative to alternatives) and the limited consideration of additional factors that appear in other studies (e.g., network effects).
Questions
- What is the best possible in terms of the fairness objective in the unsupervised setting?
- What is the worst-case complexity for achieving the optimal objective value?
- What is the worst case impact on anomaly detection with a fairness constraint?
Limitations
The authors have considered the limitations to some extent, but they could explore the overall impact on anomaly detection further.
Final Justification
I think all the points raised in the discussion were valid and that the authors responded well to them. I have not altered my score.
Formatting Issues
None
We sincerely appreciate your thoughtful comments and recognition of our work. We have carefully addressed each question and concern below.
Question 1: What is the best possible in terms of the fairness objective in the unsupervised setting?
Response: The ideal case involves achieving complete demographic parity across all groups for both normal and abnormal data. The necessary conditions for the ideal case are outlined below. For simplicity, we consider a sensitive attribute associated with two demographic groups.
- For the normal data, complete demographic parity across groups is achieved when Proposition 2 (Section 3.3.3) holds for the distribution projector.
- For the abnormal data, complete demographic parity across groups is achieved when both Proposition 2 (Section 3.3.3) and Assumption 2 (Section 3.2) hold. Specifically, if the anomalies are generated from the normal data by a perturbation function, then this condition ensures that the ideal case with respect to fairness is attained for both normal and abnormal data whenever Proposition 2 holds.
Question 2: What is the worst-case complexity for achieving the optimal objective value?
Response: We are a little confused about this question. Do you mean the iteration complexity for reaching the optimal objective value in (11)? Since the optimization is highly nonconvex, there is no guarantee of reaching the optimal objective value; this is common in deep learning. If you mean the complexity of reaching a zero objective value in (10), it depends on the data and the network size. The worst case for the network is that the widths of all layers equal the input data dimension; in that case, we would have to estimate the density in a high-dimensional space, which may increase the test error.
Question 3: What is the worst case impact on anomaly detection with a fairness constraint?
Response: For simplicity, we assume we protect a sensitive attribute associated with two demographic groups. If samples from one group are abnormal while those from the other group are normal, imposing a fairness constraint on the optimization objective will inevitably degrade detection accuracy in such a scenario. In the extreme case where complete group fairness is achieved between the groups, the detection capability of the detector is lost entirely (e.g., AUROC = 0.5).
I commend the authors on their responses to my comments and those of the other reviewers. I find their replies quite comprehensive.
Dear Reviewer,
We sincerely appreciate your response to our rebuttal.
Best,
Authors
The authors propose FairAD, a novel method addressing group fairness in unsupervised anomaly detection. The approach maps demographic groups into a common target distribution, ensuring fairness without explicit constraints. The paper also introduces ADPD, a new threshold-free fairness metric that addresses the threshold sensitivity of previous metrics, and empirically demonstrates superior accuracy-fairness trade-offs through comprehensive experiments.
Strengths and Weaknesses
Strengths
- The paper provides theoretical foundations that clearly articulate the necessary conditions for fairness in unsupervised AD settings.
- The proposed method is novel, effectively combining anomaly detection and fairness via distribution alignment.
- Comprehensive empirical validation across diverse datasets convincingly demonstrates the method's effectiveness.
- Introduction of the ADPD metric mitigates the limitations of current threshold-sensitive metrics.
Weaknesses
- Evaluation predominantly focused on tabular data (only one image dataset); potential applicability to other domains remains underexplored.
- The evaluation does not consider scenarios involving multiple protected attributes; exploring how FairAD and the ADPD metric generalize to such multi-attribute fairness settings would further strengthen the work.
Questions
- Given that your fairness method relies on mapping demographic groups to a common distribution, how would extreme demographic imbalances (e.g., very small minority groups) affect the quality of this mapping?
- The current work primarily validates performance on tabular data. Can you theoretically or empirically discuss how the compact-distribution mapping might behave differently on high-dimensional data such as images or sequential data like time series?
- Your fairness metric, ADPD, is threshold-free and well-suited for binary group scenarios. Can you extend or elaborate on its applicability and theoretical robustness when dealing with multiple protected attributes?
Limitations
The authors have adequately addressed methodological assumptions and limitations.
Final Justification
Thank you for addressing my questions. The clarifications have been helpful. I will keep my rating.
Formatting Issues
None
We sincerely appreciate your thoughtful comments and recognition of our work. We have carefully addressed each question and concern below.
To Weakness 1 & Question 2: We complement our experiments with a text dataset (SST_sentiment_fairness_data from HuggingFace) with gender as the protected attribute. In this experiment, we use BERT (bert-large-uncased) to extract embeddings (dim = 1024) and adopt balanced splitting for the two groups. The related results are reported in the following table.
| Method | AUC (%) | normal (ADPD %) | all (ADPD %) |
|---|---|---|---|
| FairOD | 42.28 | 8.10 | 5.57 |
| Deep Fair SVDD | 62.23 | 12.57 | 9.48 |
| Ex-FairAD (Ours) | 62.79 | 6.99 | 5.17 |
| Im-FairAD (Ours) | 63.64 | 6.68 | 5.03 |

We expect to extend the proposed framework to more data types (e.g., point clouds, graphs, video) in future work.
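For reference, a minimal sketch of the embedding step described above, using the HuggingFace transformers API with bert-large-uncased; the mean-pooling choice is our assumption, and the authors' preprocessing may differ:

```python
# Hypothetical embedding step (mean pooling is an assumption, not
# necessarily the authors' choice).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased").eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, 1024)
    mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)     # mean-pooled 1024-d embeddings
```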
To Weakness 2 & Question 3: The original optimization objective of Im-FairAD is as follows:
$$
\underset{\phi, \psi}{\text{min}} ~ \sum_{s \in S}\text{Sinkhorn}(h_\phi(\mathcal{X}_{S=s}), \mathcal{Z}) + \frac{\beta}{n}\sum_{i=1}^{n} \Vert \mathbf{x}_i - g_\psi(h_\phi(\mathbf{x}_i)) \Vert^2,
$$
where only a single protected attribute is considered. When protecting multiple sensitive attributes (e.g., race and gender), the optimization objective of Im-FairAD can be naturally reformulated into the following form (and similarly for Ex-FairAD):
$$
\underset{\phi, \psi}{\text{min}} ~ \sum_{S \in \Omega} \sum_{s \in S}\text{Sinkhorn}(h_\phi(\mathcal{X}_{S=s}), \mathcal{Z}) + \frac{\beta}{n}\sum_{i=1}^{n} \Vert \mathbf{x}_i - g_\psi(h_\phi(\mathbf{x}_i)) \Vert^2,
$$
where $\Omega$ denotes the set of sensitive attributes. Meanwhile, the proposed fairness metric, ADPD, becomes
$$
\text{ADPD} := \frac{1}{n \cdot \vert \Omega \vert} \sum_{S \in \Omega}\sum_{k=1}^{n} \Big\vert \mathbb{P}(\text{Score}(\mathcal{X}) > t_k \mid S=s_i) - \mathbb{P}(\text{Score}(\mathcal{X}) > t_k \mid S=s_j) \Big\vert.
$$
When the number of values of the protected attribute (e.g., race) exceeds two, i.e., $\vert S \vert > 2$, ADPD becomes
$$
\text{ADPD} := \frac{1}{n \cdot \vert \Omega \vert} \sum_{S \in \Omega}\sum_{k=1}^{n} \max \Big( \big\{ \big\vert \mathbb{P}(s_i) - \mathbb{P}(s_j) \big\vert \big\}^{s_i, s_j \in S}_{i \neq j} \Big),
$$
where $\mathbb{P}(s_i)$ is shorthand for $\mathbb{P}(\text{Score}(\mathcal{X}) > t_k \mid S = s_i)$, and the range of ADPD is still $[0, 1)$.
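To make the metric concrete, here is a minimal sketch of the threshold-free ADPD for a single sensitive attribute with two or more groups (for multiple attributes, average the result over all attributes in $\Omega$); variable names are illustrative, not the authors' code:

```python
import numpy as np

def adpd(scores, groups):
    """Threshold-free ADPD for one sensitive attribute (illustrative sketch)."""
    thresholds = np.sort(scores)                    # t_1, ..., t_n taken from the scores
    total = 0.0
    for t in thresholds:
        # Exceedance probability P(Score > t | S = s) for each group.
        probs = [np.mean(scores[groups == s] > t) for s in np.unique(groups)]
        total += max(probs) - min(probs)            # max pairwise gap |P(s_i) - P(s_j)|
    return total / len(scores)                      # value lies in [0, 1)
```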
To Question 1: This is a great question. In such a scenario, the accuracy of the distance measurement between distributions is compromised, which limits the distribution transformation capability of the proposed method and further impacts both detection performance and fairness for extremely small groups.
Dear reviewer,
Firstly, thank you for your service! Please take a look at the authors' response and engage with the content to maintain or revise your evaluation. Note that the discussion period has been extended to Aug 8, 11.59pm AoE.
Thank you,
-AC
Thank you for your thoughtful and detailed rebuttal. I appreciate the clarifications and additional experiments you provided in response to my questions.
We really appreciate your acknowledgement of our response.
The paper addresses the problem of fairness in unsupervised anomaly detection, particularly in high-stakes domains like finance and healthcare. The authors introduce two new methods (Im-FairAD and Ex-FairAD) that map data from different demographic groups to a shared, compact target distribution to ensure group fairness while maintaining high detection accuracy. A threshold-free fairness metric (ADPD) is also proposed for a more holistic fairness evaluation.
Strengths and Weaknesses
Strengths: The paper makes significant theoretical and empirical contributions by rigorously analyzing fairness feasibility in UAD, introducing strong foundational assumptions, and validating them experimentally. Im-FairAD performs as well as or better than prior methods in balancing detection accuracy and fairness across multiple real-world datasets. The proposed threshold-free ADPD metric is another important contribution, addressing key limitations of previous fairness evaluation techniques.
Weaknesses: While the theoretical framework is well-motivated, the practical complexity of computing Sinkhorn distances and training fairness-aware projections could raise scalability issues. Moreover, the proposed methods assume the availability of well-defined demographic groupings and rely on reasonably fair target distributions, which might not generalize well across all domains.
Questions
- You rely heavily on Assumptions 1–3 (learnable abnormality, transferable fairness, and generalizable parity). Could you elaborate on how these assumptions can be validated or approximated in practice, especially in domains where abnormal samples may differ significantly from normal data?
- The method hinges on mapping data to a shared, compact target distribution (a truncated Gaussian). How sensitive is the model performance to this specific choice? Have you explored other target distributions, and if so, how do they compare?
- The use of Sinkhorn distance and deep architectures may raise concerns about scalability. How will this work for bigger datasets?
- The approach focuses on group fairness across single protected attributes (e.g., race or gender). Have you considered how FairAD performs under intersectional fairness settings involving multiple overlapping attributes? If not, what challenges do you foresee in extending your method to this scenario?
Limitations
A limitation is that the fairness guarantees rest on assumptions (e.g., learnable abnormality and transferable fairness). These are plausible but cannot be verified or enforced in all scenarios. Also, since no anomalous samples are used during training, fairness on unseen anomalies cannot be theoretically guaranteed without further assumptions.
Final Justification
The authors addressed all of my questions and weaknesses. Having also read the other reviews, I would like to upgrade my score.
Formatting Issues
- Most figures are quite small and therefore hard to read
We sincerely appreciate your thoughtful comments and recognition of our work. We have carefully addressed each question and concern below.
To Weakness & Question 3: There are four key factors affecting the computational cost of the Sinkhorn distance. We list them below.
- $n$: sample size;
- $d$: feature dimension of a sample;
- $\alpha$: coefficient of the entropic regularization term;
- $\epsilon$: stop threshold on error.
To evaluate the time cost associated with the Sinkhorn distance, we conduct experiments to quantify the time required for a single computation of the Sinkhorn distance on both CPU and GPU. In this experiment, we utilize a real tabular dataset (Census from ADBench) as the source distribution and a truncated Gaussian as the target distribution, which preserves the similarity to the task faced by our proposed method. We keep $(n_\text{mini-batch}, d) = (4000, 500)$ fixed and adjust $\alpha$ and $\epsilon$. Note that $n_\text{mini-batch} = 4000$ means that 4000 samples are drawn from the source distribution and an additional 4000 samples from the target distribution, both of which are utilized in a single computation of the Sinkhorn distance. All experiments are conducted on Ubuntu 22.04 with a 10-core Intel Xeon processor, one NVIDIA Tesla V100S-32GB, and CUDA 12.4. The average results over 10 repeats are provided in the following table.
Single-computation time (seconds) of the Sinkhorn distance.

| | CPU (ε=1e-2) | GPU (ε=1e-2) | Speedup | CPU (ε=1e-3) | GPU (ε=1e-3) | Speedup | CPU (ε=1e-4) | GPU (ε=1e-4) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| α=1.0 | 11.89 | 0.1825 | 65× | 12.07 | 0.1836 | 65× | 12.66 | 0.2017 | 62× |
| α=0.5 | 12.77 | 0.1982 | 64× | 12.82 | 0.2020 | 63× | 13.14 | 0.2027 | 64× |
| α=0.1 | 14.49 | 0.2302 | 62× | 14.51 | 0.2339 | 62× | 14.63 | 0.2367 | 61× |
These results indicate that the GPU consistently accelerates the computation of the Sinkhorn distance by about 60× across various configurations on tasks of this kind. To validate the behavior on different data types, we further conduct experiments on two image datasets (CIFAR10, Fashion-MNIST) and three text datasets (IMDB, 20Newsgroups, Yelp), where we directly utilize the corresponding embeddings provided in ADBench. We keep $\alpha = 0.1$ and $\epsilon = 10^{-4}$ in the following results. Based on this estimation across different data types (different source distributions), the GPU accelerates the computation of the Sinkhorn distance by about 50× on average.
Single-computation time (seconds) of the Sinkhorn distance.

| Dataset | Census | CIFAR10-(0) | Fashion-MNIST-(0) | IMDB | 20Newsgroups-(0) | Yelp | Avg |
|---|---|---|---|---|---|---|---|
| $(n_\text{mini-batch}, d)$ | (4000, 500) | (4000, 512) | (4000, 512) | (3000, 768) | (3000, 768) | (3000, 768) | - |
| Time (CPU) | 14.63 | 18.29 | 14.18 | 12.14 | 10.61 | 11.86 | 13.61 |
| Time (GPU) | 0.2367 | 0.3260 | 0.2211 | 0.3308 | 0.2200 | 0.2848 | 0.2699 |
| Speedup | 61× | 56× | 64× | 36× | 48× | 41× | 50× |
When addressing a similar task (in terms of source distribution and feature or embedding dimension) at a larger scale, our empirical estimation suggests that a single computation of the Sinkhorn distance requires approximately 90 seconds on the GPU. If this process is repeated 100 times, the total computational time associated with the Sinkhorn distance amounts to roughly 150 minutes. We argue that such a time cost may be acceptable for scenarios of this scale.
We summarize several effective strategies for accelerating the computation of the Sinkhorn distance when handling high-dimensional and large-scale data as follows (a minimal GPU Sinkhorn sketch is given after this list):
- First and foremost, leveraging GPU acceleration for computing the Sinkhorn distance.
- Dimensionality reduction when handling high-dimensional data.
- Employing relatively large values for $\alpha$ and $\epsilon$.
- Decreasing the number of samples drawn from the target distribution.
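To illustrate the GPU-acceleration point, below is a minimal log-domain Sinkhorn sketch in PyTorch; it assumes uniform marginals and a squared Euclidean cost, and is an illustrative implementation rather than the authors' code. Moving `x` and `y` to a CUDA device yields the kind of speedups reported above.

```python
import math
import torch

def sinkhorn_distance(x, y, alpha=0.1, eps=1e-4, max_iter=1000):
    """Entropic OT cost between two samples (log-domain Sinkhorn iterations)."""
    n, m = x.shape[0], y.shape[0]
    C = torch.cdist(x, y, p=2) ** 2                 # squared Euclidean cost matrix
    log_a = torch.full((n,), -math.log(n), device=x.device)  # uniform marginals
    log_b = torch.full((m,), -math.log(m), device=x.device)
    f = torch.zeros(n, device=x.device)
    g = torch.zeros(m, device=x.device)
    for _ in range(max_iter):
        f_prev = f
        # Log-domain dual updates (numerically stable for small alpha).
        f = -alpha * torch.logsumexp((g[None, :] - C) / alpha + log_b[None, :], dim=1)
        g = -alpha * torch.logsumexp((f[:, None] - C) / alpha + log_a[:, None], dim=0)
        if (f - f_prev).abs().max() < eps:          # eps: stop threshold on error
            break
    P = torch.exp((f[:, None] + g[None, :] - C) / alpha
                  + log_a[:, None] + log_b[None, :])
    return (P * C).sum()                            # transport cost <P, C>
```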
Regarding well-defined demographic groups, the current version requires accurate knowledge of the protected attributes, and a compact target distribution is critical for distinguishing between normal and abnormal samples. We hope future work can relax the dependence on accurate demographics and propose more general target distributions.
To Question 1: First, when abnormal samples differ significantly from normal data, the anomaly detection task becomes trivial. In other words, the anomalies can be identified easily and accurately, and a zero False Positive Rate can be achieved. Under such conditions, no fairness issue arises technically, as no samples would be misidentified. Regarding the demographic parity of truly abnormal samples between groups, it is not a technical issue in this situation.
For these assumptions:
- Assumption 1 ensures the learnability of normal patterns while providing feasibility guarantees for the unsupervised anomaly detection task. Existing unsupervised anomaly detection works have validated this assumption. In real-world scenarios, especially fault detection in industry, it is quite common that anomalous samples emerge as perturbed normal samples and that the evolution from normality to anomaly is gradual.
- Similarly, Assumption 2 and Assumption 3 have already been verified by existing fairness-aware anomaly detection works, such as FairOD, Deep Fair SVDD, and CFAD. In our experiments, the results on real-world datasets again validate the reasonableness of these two assumptions.
- Certainly, these assumptions cannot hold in all domains and scenarios, nor is that our goal. In fact, we make (or summarize) these assumptions in order to clarify the necessary prerequisites for unsupervised anomaly detection (UAD) and fairness-aware UAD.
To Question 2: We conduct experiments to explore the sensitivity of performance to the truncation radius $r$. The truncation radius is determined by the sampling probability $p$ of a Gaussian distribution. We adjust $p$ and report the related results in the following table. We observe that the performance (including detection accuracy and fairness) exhibits only minor fluctuations within a reasonable range of $p$.
The performance with different sampling probabilities $p$ on COMPAS and Credit (balanced splitting). Note that $F_d$ denotes the CDF of the chi-square distribution with $d$ degrees of freedom.
**COMPAS**

| $r=\sqrt{F^{-1}_d(p)}$ | Ex-FairAD AUC (%) | Ex-FairAD normal (ADPD %) | Ex-FairAD all (ADPD %) | Im-FairAD AUC (%) | Im-FairAD normal (ADPD %) | Im-FairAD all (ADPD %) |
|---|---|---|---|---|---|---|
| p=0.80 | 62.00 | 4.56 | 6.08 | 63.29 | 5.17 | 7.35 |
| p=0.85 | 61.46 | 4.51 | 5.72 | 62.22 | 4.90 | 6.72 |
| p=0.90 | 63.13 | 3.40 | 6.31 | 63.17 | 4.05 | 6.97 |
| p=0.95 | 61.54 | 3.63 | 5.62 | 61.50 | 4.01 | 5.88 |

**Credit**

| $r=\sqrt{F^{-1}_d(p)}$ | Ex-FairAD AUC (%) | Ex-FairAD normal (ADPD %) | Ex-FairAD all (ADPD %) | Im-FairAD AUC (%) | Im-FairAD normal (ADPD %) | Im-FairAD all (ADPD %) |
|---|---|---|---|---|---|---|
| p=0.80 | 64.98 | 3.37 | 1.66 | 63.91 | 3.31 | 1.64 |
| p=0.85 | 63.38 | 3.50 | 1.80 | 62.76 | 3.32 | 2.10 |
| p=0.90 | 64.37 | 3.31 | 1.33 | 66.08 | 3.56 | 2.00 |
| p=0.95 | 63.21 | 2.64 | 1.31 | 63.47 | 2.68 | 1.48 |
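For concreteness, here is a hypothetical sketch of how the truncation radius $r=\sqrt{F^{-1}_d(p)}$ can be obtained from the sampling probability $p$ and used to draw the truncated Gaussian target by rejection sampling; the exact sampling procedure used in the paper may differ.

```python
import numpy as np
from scipy.stats import chi2

def sample_truncated_gaussian(n, d, p=0.90, rng=None):
    """Draw n points of N(0, I_d) truncated to the ball of radius r."""
    rng = np.random.default_rng(rng)
    # ||z||^2 ~ chi2(d) for standard normal z, so P(||z|| <= r) = p.
    r = np.sqrt(chi2.ppf(p, df=d))                  # r = sqrt(F_d^{-1}(p))
    kept = []
    while sum(len(s) for s in kept) < n:
        z = rng.standard_normal((2 * n, d))
        kept.append(z[np.linalg.norm(z, axis=1) <= r])   # rejection step
    return np.concatenate(kept)[:n]
```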
We also use a truncated uniform distribution in the hypersphere as the target distribution; the related results are reported in the following table.
**COMPAS**

| Target | Ex-FairAD AUC (%) | Ex-FairAD normal (ADPD %) | Ex-FairAD all (ADPD %) | Im-FairAD AUC (%) | Im-FairAD normal (ADPD %) | Im-FairAD all (ADPD %) |
|---|---|---|---|---|---|---|
| Uniform | 60.02 | 4.09 | 4.99 | 60.47 | 3.79 | 3.92 |
| Gaussian | 63.11 | 4.38 | 4.54 | 63.89 | 4.16 | 4.76 |

**Credit**

| Target | Ex-FairAD AUC (%) | Ex-FairAD normal (ADPD %) | Ex-FairAD all (ADPD %) | Im-FairAD AUC (%) | Im-FairAD normal (ADPD %) | Im-FairAD all (ADPD %) |
|---|---|---|---|---|---|---|
| Uniform | 62.57 | 3.79 | 2.25 | 63.73 | 4.85 | 3.08 |
| Gaussian | 64.96 | 2.95 | 1.92 | 63.70 | 2.20 | 1.97 |
To Question 4: The original optimization objective of Im-FairAD is as follows:
$$
\underset{\phi, \psi}{\text{min}} ~ \sum_{s \in S}\text{Sinkhorn}(h_\phi(\mathcal{X}_{S=s}), \mathcal{Z}) + \frac{\beta}{n}\sum_{i=1}^{n} \Vert \mathbf{x}_i - g_\psi(h_\phi(\mathbf{x}_i)) \Vert^2,
$$
where only a single protected attribute is considered. When protecting multiple sensitive attributes (e.g., race and gender), the optimization objective of Im-FairAD can be naturally reformulated into the following form (and similarly for Ex-FairAD):
$$
\underset{\phi, \psi}{\text{min}} ~ \sum_{S \in \Omega} \sum_{s \in S}\text{Sinkhorn}(h_\phi(\mathcal{X}_{S=s}), \mathcal{Z}) + \frac{\beta}{n}\sum_{i=1}^{n} \Vert \mathbf{x}_i - g_\psi(h_\phi(\mathbf{x}_i)) \Vert^2,
$$
where $\Omega$ denotes the set of sensitive attributes. Meanwhile, the proposed fairness metric, ADPD, becomes
$$
\text{ADPD} := \frac{1}{n \cdot \vert \Omega \vert} \sum_{S \in \Omega}\sum_{k=1}^{n} \Big\vert \mathbb{P}(\text{Score}(\mathcal{X}) > t_k \mid S=s_i) - \mathbb{P}(\text{Score}(\mathcal{X}) > t_k \mid S=s_j) \Big\vert.
$$
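A hedged sketch of the multi-attribute objective above, assuming an encoder `h_phi`, a decoder `g_psi`, and a `sinkhorn` routine (e.g., the log-domain version sketched elsewhere in this thread); all names are illustrative rather than the authors' code:

```python
import torch

def im_fairad_loss(h_phi, g_psi, x, attrs, z_target, beta, sinkhorn):
    """Multi-attribute Im-FairAD objective (illustrative names)."""
    latent = h_phi(x)
    # Reconstruction term: (beta / n) * sum_i ||x_i - g_psi(h_phi(x_i))||^2.
    loss = beta * ((x - g_psi(latent)) ** 2).sum(dim=1).mean()
    for labels in attrs.values():                   # sum over attributes S in Omega
        for s in labels.unique():                   # sum over groups s of S
            loss = loss + sinkhorn(latent[labels == s], z_target)
    return loss
```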
Thank you for your time! After reading all of them, things are clearer to me, and I have requested to upgrade my score.
Thank you so much for your careful consideration and acknowledgment of our work and response. We will incorporate the new explorations and extensions into the revised manuscript.
This paper addresses group fairness in unsupervised anomaly detection by proposing FairAD (Im-FairAD and Ex-FairAD), which maps data from different demographic groups to a shared compact distribution using density estimation and Sinkhorn distances, ensuring fairness without explicit regularization. A threshold-free fairness metric is introduced. Experiments on real datasets show better accuracy-fairness trade-offs under both balanced and skewed splits, transferring fairness to unseen anomalies while training only on normal data.
Strengths and Weaknesses
Strengths:
- Addresses a critical gap in fairness within UAD with clear problem formulation.
- Solid theoretical framework, clarifying feasibility, necessary assumptions, and fairness transferability in UAD.
- Introduction of ADPD metric is valuable for fair evaluation without arbitrary thresholds.
- Extensive empirical validation with competitive baselines and realistic skew/balance conditions.
- Reproducibility appears high with clear method descriptions.
Weaknesses:
- The reliance on Sinkhorn distances and density estimation could pose scalability concerns on large-scale high-dimensional data.
- The choice of target distribution (truncated Gaussian) may limit generalizability if data manifolds are highly complex.
Questions
- How does FairAD (especially with Sinkhorn) scale with higher dimensions and larger datasets?
- How sensitive is performance to the choice of the truncated Gaussian target distribution? Would learning a target distribution (e.g., via normalizing flows) yield further improvements while maintaining fairness?
- Could the authors clarify how ADPD behaves under highly imbalanced groups or severe covariate shifts across groups? Are there conditions under which ADPD might fail to capture fairness violations?
Limitations
Yes, the authors have adequately discussed limitations, assumptions, and potential societal impacts. They explicitly address the limitations of fairness guarantees in UAD, state the necessary assumptions for fairness transfer, and note that individual fairness is not tackled.
Final Justification
After reading the rebuttal, I appreciate the authors' detailed response. My main concerns were addressed. Overall, the paper makes a novel and solid contribution to fairness in unsupervised anomaly detection, introducing both methodological innovations and empirical validation.
Formatting Issues
No. All good.
We sincerely appreciate your thoughtful comments and recognition of our work. We have carefully addressed each question and concern below.
To Weakness 1 & Question 1: This is a practical and valuable question worthy of further exploration. There are four key factors affecting the computational cost of the Sinkhorn distance. We list them below.
- $n$: sample size;
- $d$: feature dimension of a sample;
- $\alpha$: coefficient of the entropic regularization term;
- $\epsilon$: stop threshold on error.
To evaluate the time cost associated with the Sinkhorn distance, we conduct experiments to quantify the time required for a single computation of the Sinkhorn distance on both CPU and GPU. In this experiment, we utilize a real tabular dataset (Census from ADBench) as the source distribution and a truncated Gaussian as the target distribution, which preserves the similarity to the task faced by our proposed method. We keep $(n_\text{mini-batch}, d) = (4000, 500)$ fixed and adjust $\alpha$ and $\epsilon$. Note that $n_\text{mini-batch} = 4000$ means that 4000 samples are drawn from the source distribution and an additional 4000 samples from the target distribution, both of which are utilized in a single computation of the Sinkhorn distance. All experiments are conducted on Ubuntu 22.04 with a 10-core Intel Xeon processor, one NVIDIA Tesla V100S-32GB, and CUDA 12.4. The average results over 10 repeats are provided in the following table.
Single-computation time (seconds) of the Sinkhorn distance.

| | CPU (ε=1e-2) | GPU (ε=1e-2) | Speedup | CPU (ε=1e-3) | GPU (ε=1e-3) | Speedup | CPU (ε=1e-4) | GPU (ε=1e-4) | Speedup |
|---|---|---|---|---|---|---|---|---|---|
| α=1.0 | 11.89 | 0.1825 | 65× | 12.07 | 0.1836 | 65× | 12.66 | 0.2017 | 62× |
| α=0.5 | 12.77 | 0.1982 | 64× | 12.82 | 0.2020 | 63× | 13.14 | 0.2027 | 64× |
| α=0.1 | 14.49 | 0.2302 | 62× | 14.51 | 0.2339 | 62× | 14.63 | 0.2367 | 61× |
These results indicate that the GPU consistently accelerates the computation of the Sinkhorn distance by about 60× across various configurations on tasks of this kind. To validate the behavior on different data types, we further conduct experiments on two image datasets (CIFAR10, Fashion-MNIST) and three text datasets (IMDB, 20Newsgroups, Yelp), where we directly utilize the corresponding embeddings provided in ADBench. We keep $\alpha = 0.1$ and $\epsilon = 10^{-4}$ in the following results. Based on this estimation across different data types (different source distributions), the GPU accelerates the computation of the Sinkhorn distance by about 50× on average.
Single-computation time (seconds) of the Sinkhorn distance.

| Dataset | Census | CIFAR10-(0) | Fashion-MNIST-(0) | IMDB | 20Newsgroups-(0) | Yelp | Avg |
|---|---|---|---|---|---|---|---|
| $(n_\text{mini-batch}, d)$ | (4000, 500) | (4000, 512) | (4000, 512) | (3000, 768) | (3000, 768) | (3000, 768) | - |
| Time (CPU) | 14.63 | 18.29 | 14.18 | 12.14 | 10.61 | 11.86 | 13.61 |
| Time (GPU) | 0.2367 | 0.3260 | 0.2211 | 0.3308 | 0.2200 | 0.2848 | 0.2699 |
| Speedup | 61× | 56× | 64× | 36× | 48× | 41× | 50× |
When addressing a similar task (in terms of source distribution and feature or embedding dimension) at a larger scale, our empirical estimation suggests that a single computation of the Sinkhorn distance requires approximately 90 seconds on the GPU. If this process is repeated 100 times, the total computational time associated with the Sinkhorn distance amounts to roughly 150 minutes. We argue that such a time cost may be acceptable for scenarios of this scale.
We summarize several effective strategies for accelerating the computation of the Sinkhorn distance when handling high-dimensional and large-scale data as follows:
- First and foremost, leveraging GPU acceleration for computing the Sinkhorn distance.
- Dimensionality reduction when handling high-dimensional data.
- Employing relatively large values for $\alpha$ and $\epsilon$.
- Decreasing the number of samples drawn from the target distribution.
Based on the truncated Gaussian as our target distribution, the anomaly score function can be defined in closed form, which eliminates the need for a density estimation process.
To Weakness 2: In fact, the proposed method does not depend much on the complexity of the data distribution in the original space. We primarily require that the normal data lie in high-density regions of the latent space so that a clear and reliable decision boundary exists. The transformation capability between distributions depends on the complexity of the neural networks and the accuracy of the distribution distance measurement. Theoretically, leveraging neural networks and the Sinkhorn distance, our method can effectively transform complex data distributions into a well-defined target distribution.
To Question 2: We conduct experiments to explore the sensitivity of performance to the truncation radius $r$. The truncation radius is determined by the sampling probability $p$ of a Gaussian distribution. We adjust $p$ and report the related results in the following table. We observe that the performance (including detection accuracy and fairness) exhibits only minor fluctuations within a reasonable range of $p$.
The performance with different sampling probabilities $p$ on COMPAS and Credit (balanced splitting). Note that $F_d$ denotes the CDF of the chi-square distribution with $d$ degrees of freedom.
**COMPAS**

| $r=\sqrt{F^{-1}_d(p)}$ | Ex-FairAD AUC (%) | Ex-FairAD normal (ADPD %) | Ex-FairAD all (ADPD %) | Im-FairAD AUC (%) | Im-FairAD normal (ADPD %) | Im-FairAD all (ADPD %) |
|---|---|---|---|---|---|---|
| p=0.80 | 62.00 | 4.56 | 6.08 | 63.29 | 5.17 | 7.35 |
| p=0.85 | 61.46 | 4.51 | 5.72 | 62.22 | 4.90 | 6.72 |
| p=0.90 | 63.13 | 3.40 | 6.31 | 63.17 | 4.05 | 6.97 |
| p=0.95 | 61.54 | 3.63 | 5.62 | 61.50 | 4.01 | 5.88 |

**Credit**

| $r=\sqrt{F^{-1}_d(p)}$ | Ex-FairAD AUC (%) | Ex-FairAD normal (ADPD %) | Ex-FairAD all (ADPD %) | Im-FairAD AUC (%) | Im-FairAD normal (ADPD %) | Im-FairAD all (ADPD %) |
|---|---|---|---|---|---|---|
| p=0.80 | 64.98 | 3.37 | 1.66 | 63.91 | 3.31 | 1.64 |
| p=0.85 | 63.38 | 3.50 | 1.80 | 62.76 | 3.32 | 2.10 |
| p=0.90 | 64.37 | 3.31 | 1.33 | 66.08 | 3.56 | 2.00 |
| p=0.95 | 63.21 | 2.64 | 1.31 | 63.47 | 2.68 | 1.48 |
When instantiating our framework with a normalizing flow, we obtain the following optimization problem:
$$
\mathop{\text{maximize}}_{\mathcal{W}} \sum_{s \in S}\sum_{\mathbf{x}\sim \mathcal{X}_{S=s}} \log \Big(p_{\mathcal{Z}}(F_\mathcal{W}(\mathbf{x}))\left\vert\det(\nabla_\mathbf{x}F_\mathcal{W}(\mathbf{x}))\right\vert\Big),
$$
where $S$ denotes the sensitive attribute. In this experiment, we use Real NVP as the flow $F_\mathcal{W}$. The related results are reported in the following table; RealNVP-1 and RealNVP-2 use two different anomaly score functions. In this experiment, Real NVP indeed improves the detection performance; however, it results in degraded fairness compared to the original instances (Im-FairAD and Ex-FairAD).
| Method | AUC (%) | normal (ADPD %) | all (ADPD %) |
|---|---|---|---|
| RealNVP-1 | 70.51 | 18.23 | 16.15 |
| RealNVP-2 | 64.54 | 10.93 | 10.74 |
| Ex-FairAD | 63.11 | 4.38 | 4.54 |
| Im-FairAD | 63.89 | 4.16 | 4.76 |
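For clarity, here is a minimal sketch of the per-group flow objective above, assuming a flow module (e.g., a Real NVP implementation) whose forward pass returns the latent code $F_\mathcal{W}(\mathbf{x})$ and the log-determinant of its Jacobian, with a standard Gaussian base density $p_\mathcal{Z}$; the flow class itself is hypothetical here.

```python
import math
import torch

def groupwise_flow_nll(flow, x, groups):
    """Negative of the per-group log-likelihood objective above (illustrative)."""
    loss = x.new_zeros(())
    for s in groups.unique():
        z, log_det = flow(x[groups == s])           # z = F_W(x), log|det grad F_W|
        log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)  # standard Gaussian
        loss = loss - (log_pz + log_det).sum()      # minimizing NLL maximizes likelihood
    return loss
```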
To Question 3: The definition of ADPD indicates its insensitivity to the sizes of the groups induced by the sensitive attribute, as it is probability-based. By definition, ADPD can effectively measure group fairness or unfairness across all such cases.
Dear reviewer,
Firstly, thank you for your service! Please take a look at the authors' response and engage with the content to maintain or revise your evaluation. Note that the discussion period has been extended to Aug 8, 11.59pm AoE.
Thank you,
-AC
Dear authors, thank you for your detailed rebuttal. It has addressed most of my concerns. I maintain my score of borderline accept.
Dear reviewer,
We sincerely appreciate your acknowledgement to our rebuttal.
Best,
Authors
This paper tackles group fairness in unsupervised anomaly detection. The authors first discuss the limitations of classical fairness notions and introduce two mild assumptions—transferable fairness and generalizable parity. They then propose FairAD, which learns a projection that maps each protected group onto the same latent distribution. Together with a threshold-free metric, ADPD, extensive experiments on public datasets show that FairAD achieves higher task accuracy and better fairness than baseline methods.
Strengths and Weaknesses
Strengths:
- This work begins with a comprehensive discussion of fairness definitions and evaluation methods, providing fresh insights for the field
- The proposed method is novel and supported by sound theoretical grounding
- The new metric and evaluation protocol are practical and thorough, reflecting rigorous research
Weaknesses:
- The analysis relies on assumptions—e.g., learnable abnormality, transferable fairness, and generalizable parity—that may fail when anomalies are structurally different (e.g., time-series faults or adversarial attacks)
- The datasets do not cover sequential, graph, or large-scale image/video benchmarks
- The impact of β (reconstruction weight) and λ (Ex-FairAD) on the fairness–accuracy trade-off is under-explored; more guidance on their influence and selection would be helpful
- Computational overhead is not fully addressed—for example, empirical run-time versus sample size for the Sinkhorn step is missing
Questions
- Is the ideal method theoretically achievable if we can perfectly select the projection, or by relying on any other oracles?
- Section H hints at experiments with minor anomaly contamination; could you quantify how much contamination degrades fairness?
- For multi-attribute fairness, can the projection be shared across combinations of sensitive attributes (e.g., race × gender) without exploding Sinkhorn cost?
Limitations
See weaknesses
Final Justification
Overall, the proposed FairAD achieves good performance and is promising for fairness-aware anomaly detection. I agree with the other reviewers that some assumptions may limit generalization. I retain my recommendation of borderline accept.
Formatting Issues
no
We sincerely appreciate your thoughtful comments and recognition of our work. We have carefully addressed each question and concern below.
To Weakness 1: Yes, anomalies in real data can be more complex. Anomaly detection under adversarial attacks and fairness-aware anomaly detection for time series are important and interesting problems worthy of exploration in future work. We hope our work can bring new perspectives to these future studies.
To Weakness 2: We complement our experiments with a text dataset (SST_sentiment_fairness_data from HuggingFace) with gender as the protected attribute. In this experiment, we use BERT (bert-large-uncased) to extract embeddings (dim = 1024) and adopt balanced splitting for the two groups. The related results are reported in the following table.

| Method | AUC (%) | normal (ADPD %) | all (ADPD %) |
|---|---|---|---|
| FairOD | 42.28 | 8.10 | 5.57 |
| Deep Fair SVDD | 62.23 | 12.57 | 9.48 |
| Ex-FairAD (Ours) | 62.79 | 6.99 | 5.17 |
| Im-FairAD (Ours) | 63.64 | 6.68 | 5.03 |
Existing studies on fairness-aware anomaly detection in graph data, such as FairGAD and DEFEND, adopt an inductive learning paradigm in which the training set and test set are identical. This setting differs from the learning paradigm (transductive learning) followed by the proposed method. We expect to extend the proposed framework to more data types in future work.
To Weakness 3: In Appendices D.1 and D.2 of our paper, we evaluated the fairness-accuracy trade-off by adjusting $\beta$ and $\lambda$, respectively. The corresponding results are shown in Figures 6 and 7 of our manuscript. From Figure 6, we observe that as $\beta$ increases, the accuracy (measured by AUC) first increases slightly and then decreases on both balanced and skewed data, while the fairness metric (ADPD) shows a pronounced increasing trend. From Figure 7, we observe that the impact of varying $\lambda$ on accuracy is marginal across most datasets; as $\lambda$ increases, ADPD tends to first decrease and then increase in most cases.
To Weakness 4: We report the training time and the time occupied by the Sinkhorn distance in a single epoch. For these experiments, we maintain the same configurations as in our manuscript for the stop threshold $\epsilon$ of the Sinkhorn distance on the three tabular datasets (Compas, Adult and Credit) and on CelebA, and the coefficient $\alpha$ of the entropic regularization term is set consistently across all experiments. Recall the definition of the Sinkhorn distance:
$$
\text{Sinkhorn}(\mathcal{X}, \mathcal{Y}) := \min_{\mathbf{P}} \ \langle \mathbf{P}, \mathbf{C} \rangle_F + \alpha \sum_{i,j}\mathbf{P}_{ij}\log(\mathbf{P}_{ij}), \quad \text{s.t.}~ \mathbf{P}\mathbf{1} = \mathbf{a},\ \mathbf{P}^{T}\mathbf{1}=\mathbf{b},\ \mathbf{P} \geq 0.
$$
The time cost (seconds) on the datasets with balanced splitting.
| Dataset | # Samples | Dimension | Ex-FairAD (all) | Ex-FairAD (Sinkhorn) | Im-FairAD (all) | Im-FairAD (Sinkhorn) |
|---|---|---|---|---|---|---|
| Compas | 2000 | 4 | 0.3842 | 0.1888 | 0.6028 | 0.3414 |
| Adult | 12000 | 4 | 2.932 | 1.645 | 4.645 | 2.900 |
| Credit | 10000 | 8 | 2.922 | 1.710 | 4.770 | 3.063 |
| CelebA | 16000 | 512 | 243.51 | 150.18 | 264.12 | 159.58 |
To Question 1: Yes. A perfect projection satisfies Proposition 2, which ensures complete group fairness among the different groups of normal data. Under this condition, the ideal case is achieved when both Assumption 1 and Assumption 2 hold.
To Question 2: We conduct experiments on Compas (balanced splitting) to explore how fairness changes as the contamination rate increases. The related results are reported in the following table. In terms of fairness, we observe that the ADPD of normal data exhibits minor fluctuations within a small range. However, the ADPD across the entire test set (including normal and abnormal samples) shows a declining trend (better fairness) as the contamination rate increases. This phenomenon aligns with our optimization objective: as the number of abnormal samples in the training set grows, the model achieves improved group fairness among the groups of abnormal data.
The performance changes on Compas with increasing contamination rate.
| Contamination rate | Ex-FairAD AUC (%) | Ex-FairAD normal (ADPD %) | Ex-FairAD all (ADPD %) | Im-FairAD AUC (%) | Im-FairAD normal (ADPD %) | Im-FairAD all (ADPD %) |
|---|---|---|---|---|---|---|
| 0.00 | 63.11 | 4.38 | 4.54 | 63.89 | 4.16 | 4.76 |
| 0.05 | 58.84 | 4.33 | 4.55 | 58.88 | 4.11 | 4.72 |
| 0.10 | 55.89 | 3.98 | 3.45 | 55.80 | 3.70 | 4.16 |
| 0.15 | 54.48 | 4.19 | 3.11 | 55.01 | 3.89 | 3.51 |
| 0.20 | 54.41 | 4.53 | 2.49 | 55.04 | 4.20 | 2.95 |
To Question 3: The computational cost associated with the Sinkhorn distance increases linearly with the number of sensitive attributes $\vert \Omega \vert$ when protecting multiple sensitive attributes. Given that $\vert \Omega \vert$ is small in most cases, this scenario does not lead to prohibitive computational cost. To empirically validate this claim, we conduct experiments on the Adult (gender & race) and Compas (gender & race) datasets, protecting two sensitive attributes at the same time, and report the computation cost (seconds) of the Sinkhorn distance in the following table.
Time cost (seconds) of the Sinkhorn distance when protecting one and two attributes in a single epoch.

| | Compas | Adult |
|---|---|---|
| Im-FairAD (one) | 0.3414 | 2.900 |
| Im-FairAD (two) | 0.6342 | 5.8610 |
Thank you to the authors for their response. Please include these new discussions in the manuscript. Overall, this is an interesting paper, and I will maintain my score.
Thank you so much for the acknowledgement. We will include these in the paper.
This work focuses on the problem of group fairness in unsupervised anomaly detection. The authors propose FairAD, a method that learns a projection to map data from different demographic groups into a shared, compact distribution, thereby ensuring fairness without explicit regularization. They also introduce ADPD, a threshold-free metric for fairness evaluation, and show improved results in balancing detection accuracy and fairness in experiments on public datasets.
All reviewers leaned towards accepting this work, and maintained (or increased) their positive scores throughout the rebuttal.
Main feedback:
- concerns around the reliance on strong assumptions, such as "learnable abnormality" and "transferable fairness." Several reviewers questioned whether these assumptions would hold in practice, especially for complex types of anomalies (Gmds, sGV5, ieQH, tpFa).
- computational overhead and scalability -- reviewers raised concerns around the computational cost of the Sinkhorn distance on large, high-dimensional datasets. To this, the authors provided empirical analysis showing that GPU acceleration and other strategies can make the method feasible in practice (Gmds, sGV5, ieQH).
- limited evaluation scope -- reviewers felt the work had a somewhat limited evaluation scope, focusing on tabular data. Reviewers requested additional experiments on other data types (e.g. text, images) and for scenarios with multiple protected attributes or extreme group imbalances (Gmds, sGV5, QFnW). The authors responded by providing new results on a text dataset and explaining how the framework might be extended to multi-attribute fairness.