PaperHub
Score: 5.5 / 10
Poster · 3 reviewers
Ratings: 3, 3, 3 (min 3, max 3, std 0.0)
ICML 2025

Stray Intrusive Outliers-Based Feature Selection on Intra-Class Asymmetric Instance Distribution or Multiple High-Density Clusters

OpenReview · PDF
Submitted: 2025-01-13 · Updated: 2025-07-29
TL;DR

This paper proposes the stray intrusive outliers-based feature selection method for high-dimensional data classification with intra-class asymmetric instance distribution or multiple high-density clusters.

Abstract

For data with intra-class Asymmetric instance Distribution or Multiple High-density Clusters (ADMHC), outliers are real and have specific patterns for data classification, where the class body is necessary and difficult to identify. Previous Feature Selection (FS) methods score features based on all training instances or rarely target intra-class ADMHC. In this paper, we propose a supervised FS method, Stray Intrusive Outliers-based FS (SIOFS), for data classification with intra-class ADMHC. By focusing on Stray Intrusive Outliers (SIOs), SIOFS modifies the skewness coefficient and fuses the threshold in the 3$\sigma$ principle to identify the class body, scoring features based on the intrusion degree of SIOs. In addition, the refined density-mean center is proposed to represent the general characteristics of the class body reasonably. Mathematical formulations, proofs, and logical exposition ensure the rationality and universality of the settings in the proposed SIOFS method. Extensive experiments on 16 diverse benchmark datasets demonstrate the superiority of SIOFS over 12 state-of-the-art FS methods in terms of classification accuracy, normalized mutual information, and confusion matrix. The SIOFS source code is available at https://github.com/XXXly/2025-ICML-SIOFS
Keywords

Feature selection, stray intrusive outliers, refined density-mean center, intra-class asymmetric instance distribution or multiple high-density clusters, data classification

Reviews and Discussion

Review (Rating: 3)

This paper proposes a supervised FS method, Stray Intrusive Outliers-based FS (SIOFS), for data classification with intra-class ADMHC. By focusing on Stray Intrusive Outliers (SIOs), SIOFS modifies the skewness coefficient and fuses the threshold in the 3σ principle to identify the class body, scoring features based on the intrusion degree of SIOs.

Questions for Authors

Please see Weaknesses.

Claims and Evidence

Please see Weaknesses.

Methods and Evaluation Criteria

Please see Weaknesses.

Theoretical Claims

Please see Weaknesses.

Experimental Design and Analysis

Checked.

Supplementary Material

Checked the theoretical part.

Relation to Prior Literature

Most current FS methods score features based on the characteristics of all training instances, and existing FS methods rarely aim to identify the class body in the context of intra-class multiple high-density clusters. This paper addresses both gaps.

Missing Essential References

None.

Other Strengths and Weaknesses

Strengths

  1. The paper is clearly structured and easy to follow.
  2. The proposed method is novel.
  3. The method is supported by theoretical evidence and empirical evidence.

Weaknesses

  1. In the third paragraph of the Introduction, the authors state that, as shown in Fig. 1b, class "2" has two high-density clusters, but only classes "1" and "3" appear in the figure.
  2. What is the definition of the $3\sigma$ principle? Can the authors explain its insight?
  3. The explanation in Line 132 for Eq. (3) assumes that outliers have low instance density and thus will not be included. However, there are different types of outliers, and the low-density assumption cannot guarantee that all outliers are excluded.
  4. Why is the $\ell_1$ distance used in Eq. (1)?
  5. The authors need to explain why the modified SC is formulated as in Eq. (4). Why is $(d_i^{(l)} - u^{(l)})^3$ used? Does this mean that other terms, such as $(d_i^{(l)} - u^{(l)})^1$, cannot be used?
  6. In Line 192, why is $\hat{s}^{(l)}$ normalized as $\hat{s}^{(l)}/3$?

Other Comments or Suggestions

Please see Weaknesses.

Author Response

We sincerely thank Reviewer 5k2j for the constructive and valuable comments. The concerns are addressed as follows.

Q1: In Fig. 1b, class "2" has two high-density clusters, but only classes "1" and "3" appear in the figure.

Many thanks for the comment. We have corrected this typo, changing class "2" to class "3".

Q2: What is the definition of the 3σ principle? Can the authors explain its insight?

Sorry for the unclear description. We add the classical literature (Harris & Stocker, 1998) and [1] to support the $3\sigma$ principle. For $\xi \sim N(\mu, \sigma^2)$, the probability satisfies $\Pr(\xi \in (\mu - 3\sigma, \mu + 3\sigma)) > 0.997$. This provides a robust framework for analysing variability in normally distributed data and guides thresholds for outlier detection in the context of ADMHC.

[1] Groeneveld, R. A., & Meeden, G. (1984). Measuring Skewness and Kurtosis. Journal of the Royal Statistical Society: Series D (The Statistician), 33(4), 391-399.
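
For illustration (our own sketch, not part of the rebuttal), the quoted bound can be checked numerically with scipy:

```python
from scipy.stats import norm

# Probability that a N(mu, sigma^2) variable falls within mu +/- 3*sigma.
# After standardization this is Phi(3) - Phi(-3), independent of mu and sigma.
p = norm.cdf(3) - norm.cdf(-3)
print(f"Pr(|xi - mu| < 3*sigma) = {p:.4f}")  # ~0.9973 > 0.997
```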

Q3: There are different types of outliers, and the low-density assumption cannot guarantee that all outliers are excluded.

We clarify that due to random errors in the data, especially in synthetic datasets, it is impossible to correctly identify all of them. In this paper, only low-density cases are identified and used as outliers. We will include the above details of the outliers we have identified in the final version.

Q4: Why is the $\ell_1$ distance used in Eq. (1)?

As presented in the original paper (see line 79, right column), the $\ell_1$ norm treats each component of the feature vector equally. $\ell_1$ treats all $\vert x_{if}^{(k)} - x_{jf}^{(k)} \vert$ linearly, while $\ell_2$ penalizes large $\vert x_{if}^{(k)} - x_{jf}^{(k)} \vert$ quadratically. This makes it easier to identify outliers under $\ell_1$ than under $\ell_2$.
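
To illustrate the point (a toy sketch of ours, not the paper's code), compare a vector with several moderate per-coordinate deviations against one with a single large deviation:

```python
import numpy as np

x_i = np.array([0.0, 0.0, 0.0, 0.0])
x_j = np.array([1.0, 1.0, 1.0, 1.0])   # many moderate deviations
x_k = np.array([4.0, 0.0, 0.0, 0.0])   # one large deviation

for name, x in [("x_j", x_j), ("x_k", x_k)]:
    d1 = np.sum(np.abs(x_i - x))          # l1 distance
    d2 = np.sqrt(np.sum((x_i - x) ** 2))  # l2 distance
    print(f"{name}: l1 = {d1:.1f}, l2 = {d2:.1f}")
# x_j: l1 = 4.0, l2 = 2.0;  x_k: l1 = 4.0, l2 = 4.0
# Under l1 both instances are equally far from x_i; under l2 the single large
# deviation dominates, so which instances look extreme can change.
```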

As suggested, we add a comparison between the $\ell_1$ distance and the typical $\ell_2$ distance. Note that we only replace the $\ell_1$ distance with the $\ell_2$ distance. As shown in Rebuttal Table D, SIOFS with the $\ell_1$ distance outperforms the $\ell_2$ variant on all datasets. We will include these details in the final version.

Rebuttal Table D. Comparative results of SIOFS with the $\ell_1$ distance (denoted as "SIOFS") and the $\ell_2$ distance (denoted as "w/ $\ell_2$") on some datasets. "w/ $(\cdot)^1$" means that the $(\cdot)^3$ terms in Eq. (4) are directly replaced by $(\cdot)^1$. The average ACC over all 11 datasets is also reported for a comprehensive comparison.

| ACC (%) ↑ | SIOFS | w/ $\ell_2$ | w/ $(\cdot)^1$ |
| --- | --- | --- | --- |
| CLL | 71.32±3.84 | 65.47±2.58 | 70.57±1.85 |
| TOX | 83.24±4.99 | 81.19±2.93 | 83.24±3.09 |
| Carcinom | 93.77±2.72 | 93.77±2.33 | 93.30±2.19 |
| Lung | 95.32±1.02 | 94.09±1.80 | 95.24±0.61 |
| Lymphoma | 89.76±2.20 | 89.76±2.44 | 89.76±1.52 |
| Over 11 datasets | 81.80 | 81.33 | 81.64 |

Q5: Explain the modified SC in Eq. (4). Why is $(d_i^{(l)}-\mathrm{u}^{(l)})^3$ used? How about other terms like $(d_i^{(l)}-\mathrm{u}^{(l)})^1$?

According to the literature (Linton, 2017), the original formula of the SC is $SC=\frac{\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^3}{\left(\frac{1}{n}\sum_{i=1}^n(X_i-\bar{X})^2\right)^{3/2}}$, where $X_i$ are the individual data points, $\bar{X}$ is the mean, and $n$ is the number of instances. As mentioned in lines 142-150 (right column), the modified SC can be obtained by substituting $\mathrm{u}^{(l)}$ and $\hat{\sigma}^{(l)}$ into the formula for $SC$. That is, the term $(d_i^{(l)}-\mathrm{u}^{(l)})^3$ is directly derived from the term $(X_i-\bar{X})^3$ and has the same statistical meaning. The SC is a statistical measure that quantifies the asymmetry of a probability distribution. It indicates the degree to which data deviate from a symmetric, bell-shaped normal distribution [1]. Since we address the highest-density subclusters in the context of ADMHC, it is reasonable to use and modify the SC via Eq. (4).
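
For reference, the standard SC above can be computed directly; a minimal sketch (ours, on hypothetical data) that agrees with scipy's biased skewness estimator:

```python
import numpy as np
from scipy.stats import skew

def skewness_coefficient(x):
    """Standard (biased) sample skewness: m3 / m2^(3/2)."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    m2 = np.mean((x - m) ** 2)
    m3 = np.mean((x - m) ** 3)
    return m3 / m2 ** 1.5

rng = np.random.default_rng(0)
d = rng.gamma(shape=2.0, scale=1.0, size=1000)  # right-skewed sample, SC > 0
print(skewness_coefficient(d))
print(skew(d, bias=True))                       # matches the formula above
```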

In addition, we appreciate the kind suggestion to discuss other terms such as $(d_i^{(l)}-\mathrm{u}^{(l)})^1$. We add a comparison between $(d_i^{(l)}-\mathrm{u}^{(l)})^3$ and $(d_i^{(l)}-\mathrm{u}^{(l)})^1$. To make the comparison reasonable, the two terms with $(\cdot)^3$ in Eq. (4) are directly replaced by $(\cdot)^1$ (denoted as "w/ $(\cdot)^1$"). As demonstrated in Rebuttal Table D, our SIOFS using the $(\cdot)^3$ term achieves higher ACCs than $(\cdot)^1$. We will incorporate the above discussions into the final version.

Q6: In Line 192, why normalize $\hat{s}^{(l)}$ by $\hat{s}^{(l)}/3$?

Please see our response to Reviewer aEFk, Q5.

Review (Rating: 3)

This paper proposes a supervised FS method, Stray Intrusive Outliers based FS (SIOFS), for data classification with intra-class ADMHC. By focusing on Stray Intrusive Outliers (SIOs), SIOFS modifies the skewness coefficient and fuses the threshold in the 3σ principle to identify the class body, scoring features based on the intrusion degree of SIOs.

Extensive experiments on 15 diverse benchmark datasets demonstrate the superiority of SIOFS over 12 state-of-the-art FS methods in terms of classification accuracy, normalized mutual information, and confusion matrix.

SIOFS has the potential to improve analysis results in domains, from healthcare to finance, whose data contain classes with intra-class asymmetric instance distribution or multiple high-density clusters. Overall, SIOFS provides a theoretical foundation for the further development of effective and interpretable models.

Questions for Authors

I am particularly concerned about the selected value of α, specifically whether the experimentally determined optimal α aligns with the expectations derived from theoretical considerations. My primary concern is the potential discrepancy between theory and experimental results: the experimentally selected optimal α may not satisfy the criteria of "larger" and "the most of" as anticipated in the theoretical derivation. This concern has been exacerbated by some minor errors identified in this paper, which have raised doubts about the consistency between theory and practice.

Line 50, right column: class "2" (there is no class 2 in Figure 1(b)); line 71, left column: "radii"; line 445, left column: "multipple".

Claims and Evidence

In Theorem 2, "the larger value in ..." is not rigorous; a better description would be "if $\alpha_1 > \alpha_2 \in (0,1)$, then ...". It is confusing to leave "larger" undefined. In particular, in Table 1 of the experiments, $\alpha$ varies widely across datasets, ranging from 0.1 to 0.6. The ablation experiment and the appendix also show that changing $\alpha$ has a significant impact on performance for some datasets. Is the final selection of $\alpha$ determined by the test results?

Methods and Evaluation Criteria

The scheme designed in the experiment is meaningful and fair, and can explain the superiority of the algorithm to a certain extent. The designed evaluation methods are consistent with most related work, which reflects the rationality of the validation framework.

Theoretical Claims

Theorem 2 indicates that a larger α is necessary. However, the significant variation in α across different datasets presented in Table 1 raises concerns about the alignment between theoretical derivation and experimental validation. From the results in Figures 5 and 7, why does α differ so significantly for different data sets?

Experimental Design and Analysis

The elements of this experiment have been fully addressed; however, there are additional considerations regarding the deep learning component. While feature selection does not appear to offer significant advantages over deep feature extraction in terms of performance, as noted at the end of the article, its primary advantage over deep learning lies in interpretability—specifically, the ability to identify key factors influencing health outcomes rather than focusing solely on accuracy.

Supplementary Material

The related work and additional experimental settings in the supplementary material are detailed, but the proof of Theorem 2 cannot dispel my doubts about its rigor.

Relation to Prior Literature

According to recent surveys, the challenge of high-dimensional data classification has always existed, and feature selection is a widely studied approach to it. In particular, the asymmetric instance distribution and multiple high-density clusters addressed in this paper constitute strong prior knowledge, which can be used for targeted model design with predictable performance improvement. The recent use of outliers for feature selection [Yuan et al., 2022; Yuan et al., 2024] is a novel and niche research topic that benefits the development of machine learning. However, we hope the authors maintain the spirit of open source, so that more people can find the highlights of the work and apply it in practice, rather than developing it in a closed manner. Open-source code will not only increase citations; it will also allow peers beyond the reviewers to examine the work and provide more professional opinions.

Missing Essential References

"Feature selection techniques for machine learning: a survey of more than two decades of research." This article's descriptive classification of feature selection is very similar to the Review of FS in the related work of the appendix, but it is not cited. Of course, this is not crucial, as the two most important and relevant articles, FSDOC (Yuan et al., 2022) and IOFS (Yuan et al., 2024), are cited.

Other Strengths and Weaknesses

The article demonstrates a high degree of originality and clarity in its narrative. However, the descriptions of the theorems and key quantities are somewhat ambiguous, which poses challenges for code implementation. For instance, at line 161, "the most of" should be specified with a precise percentage (e.g., 0.8, 0.75, or 0.6). Additionally, the transition from $3\sigma$ to Eq. (5), the adjustment of the coefficient "2" mentioned in the left column at line 189, and the normalization operation on line 193 all appear to be based on heuristic reasoning rather than rigorous derivation. Should such parameters be supported and validated by experiments?

Other Comments or Suggestions

  1. The discussion and expression need to be more rigorous.
  2. The experiments need to be considered more carefully.
Author Response

We sincerely appreciate the reviewer’s feedback. Below, we address the concerns in detail.

Q1: Some ambiguous descriptions in the theorems; inconsistency between theory and experiments regarding $\alpha$.

As addressed in our response to Reviewer 52hZ, Q1, we clarify that the condition for $\alpha$ in Theorem 2 does not contradict the experimental results. When $\alpha$ is small, the conclusion in lines 184-187 still holds. $\alpha$ has a significant impact on ACC because of the scattered distribution of clusters and multiple local high-density subclusters (see Fig. 1b). This problem can be solved by selecting more features.

In addition, we regret the unclear descriptions. We delete "larger" and "the most of", and revise Theorem 2 as follows: for $d_1^{(l)}, d_2^{(l)}, \dots, d_{n_l}^{(l)}$, when $\alpha \in (0,1)$ and $\hat{\sigma}^{(l)} > 0$, $\mathrm{u}^{(l)} + 2\hat{\sigma}^{(l)} > mode^{(l)}$ holds with probability 1 if $\hat{s}^{(l)} > 0$, and $\mathrm{u}^{(l)} + 2\hat{\sigma}^{(l)} < mode^{(l)}$ holds with probability 1 if $\hat{s}^{(l)} < 0$. We will include the above details in the final version.

Q2: Lack of evidence or practical examples to validate the advantage over deep learning-based FS.

We appreciate the constructive feedback provided. As mentioned in lines 417-418, deep FS has extensive applications, such as Remote Sensing (RS) scene classification and 3D object recognition. We add the ACC results obtained by selecting all features (denoted as "AllFeat") in Rebuttal Table B and the CMs with the top 5% of features (see https://i.postimg.cc/HLw5K8z6/CMs-AID.png) on the AID dataset. These results demonstrate that SIOFS is superior to AllFeat, while some comparative methods (such as ILFS and S^2DFS) are inferior. As shown in the CMs, our SIOFS can further improve the classification performance on deep features compared with other methods. Combined with Fig. 1a, for the "School" class with intra-class ADMHC, the number of instances correctly classified by SIOFS is 28, versus 24 by TRC and 27 by Fisher and ReliefF. SIOFS performs better on the confusing classes and is able to identify key factors influencing RS monitoring. On UCM, SIOFS (94.95) also performs better than AllFeat (94.48); on ModelNet, SIOFS (93.03) has a higher ACC than AllFeat (92.71). We will include these details in the final version.

Rebuttal Table B. ACC (%) on AID. Six FS methods are randomly selected for comparison.

| Method | ACC (%) ↑ |
| --- | --- |
| AllFeat | 84.36 |
| Fisher | 84.92 |
| ReliefF | 84.80 |
| TRC | 85.05 |
| ILFS | 83.93 |
| FSDOC | 84.51 |
| S^2DFS | 84.35 |
| SIOFS | 85.83 |

Q3: Maintain the spirit of open source.

We appreciate the valuable comment. The code will be released once accepted.

Q4: Add an essential reference.

As suggested, we will include this paper in the Related Work section in the final version.

Q5: Explain the transition from 3σ to Eq.(5), the adjustment of the coefficient "2" in line 189, and the normalization in line 193.

As mentioned in lines 129-131 (right column) and lines 179-187, and combined with the revised Theorem 2, the transition to Eq. (5) is a heuristic step that follows the conclusion in lines 183-187 (see "That is, ..."), and it is necessary to introduce the SC for normalization. The original formula of $SC$ (see our response to Reviewer 5k2J, Q5) and the literature [1] (see the response to Reviewer 5k2j, Q2) clarify that $s^{(l)} \in (-3,3)$ is an empirical guideline, providing intuitive thresholds to assess skewness severity and guide data adjustments, and reflecting realistic bounds for most practical datasets. That is, we have $\frac{1}{3}s^{(l)} \in (-1,1)$, and thus $2-\frac{1}{3}s^{(l)} \in (1,3)$. As mentioned in lines 198-205, $2-\frac{1}{3}s^{(l)}$, with its range of $(1,3)$, is more rational than the constant 2 as the coefficient of $\sigma^{(l)}$ for addressing the highest-density subcluster in the context of ADMHC. As demonstrated in Rebuttal Table C, the proposed $2-\frac{1}{3}s^{(l)}$ outperforms the other formulas in most cases, and the average ACCs over the 11 datasets further quantitatively prove the superiority of our method. We will incorporate the above discussions in the final version.
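
To make the adjustment concrete, a small sketch (ours, with hypothetical inputs standing in for $\mathrm{u}^{(l)}$, $\sigma^{(l)}$, and $s^{(l)}$) of the resulting threshold $\mathrm{u} + (2-\frac{1}{3}s)\sigma$:

```python
import numpy as np

def adaptive_threshold(u, sigma, s):
    """Threshold u + (2 - s/3) * sigma with the adaptive coefficient above."""
    s = np.clip(s, -3.0, 3.0)     # empirical guideline: skewness in (-3, 3)
    coeff = 2.0 - s / 3.0         # therefore coeff lies in (1, 3)
    return u + coeff * sigma

# Right skew (s > 0): coefficient drops below 2; left skew (s < 0): rises above 2.
print(adaptive_threshold(u=1.0, sigma=0.5, s=1.5))   # 1.0 + 1.5*0.5 = 1.75
print(adaptive_threshold(u=1.0, sigma=0.5, s=-1.5))  # 1.0 + 2.5*0.5 = 2.25
```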

Rebuttal Table C. Results of different adjustments and normalizations on some datasets. Our SIOFS is equipped with "$2-\frac{1}{3}s^{(l)}$". Additional settings are included in the table.

| ACC (%) ↑ | $2-\frac{1}{3}s^{(l)}$ | $1-\frac{1}{3}s^{(l)}$ | $3-\frac{1}{3}s^{(l)}$ | $2-\frac{1}{2}s^{(l)}$ | $2-\frac{1}{4}s^{(l)}$ |
| --- | --- | --- | --- | --- | --- |
| CLL | 71.32±3.84 | 69.52±1.42 | 66.37±1.24 | 70.57±2.53 | 69.52±4.58 |
| TOX | 83.24±4.99 | 83.63±4.64 | 82.26±2.84 | 83.43±3.72 | 82.65±3.98 |
| Carcinom | 93.77±2.72 | 93.10±1.05 | 92.82±2.14 | 92.72±2.36 | 93.10±1.96 |
| Lung | 95.32±1.02 | 95.63±0.91 | 95.07±0.94 | 95.22±1.73 | 95.07±1.21 |
| Lymphoma | 89.76±2.20 | 88.54±2.95 | 87.50±2.48 | 89.76±3.58 | 88.19±1.87 |
| Over 11 datasets | 81.80 | 81.35 | 80.98 | 81.68 | 81.48 |

Q6: Minor errors.

We will revise them in the final version.

Reviewer Comment

Dear authors, thank you for your reply. The imprecision of the theory and its experimental verification is still my concern. If heuristic, empirically driven operations are involved, is the theoretical derivation still rigorous? Experiments can sometimes be deceptive.

Author Comment

We sincerely appreciate the reviewer's feedback. We clarify that whether α\alpha is larger or smaller, it can be theoretically derived. The experimental results validate our theory. We summarize the details of derivations as follows.

Theoretical derivation about $\alpha$. As mentioned in our submission, from the definition of the RDM center (see Eq. (3) and lines 137-139) and the fact that $\hat{s}^{(l)}$ has the same property as $s^{(l)}$ (see lines 154-157, right column), it is natural to deduce (in lines 183-187) that when $\alpha$ is smaller in $(0,1)$, $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}$ is relatively large for obtaining the highest density values in $d_1^{(l)},\dots,d_{n_l}^{(l)}$ when $\hat{s}^{(l)}>0$. Similarly, if $\hat{s}^{(l)}<0$, $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}$ is relatively small for obtaining all highest density values in $d_1^{(l)},\dots,d_{n_l}^{(l)}$ (see "The detailed theoretical derivation when $\alpha$ is smaller" below).

To make this theory clear, we summarize it as follows: for $d_1^{(l)},d_2^{(l)},\dots,d_{n_l}^{(l)}$, $\mathrm{u}^{(l)}$ and $\hat{s}^{(l)}$ are the same as in Eq. (4), and $mode^{(l)}$ is the same as in the footnote of Section 3.1. When $\alpha\in(0,1)$ and $\hat{\sigma}^{(l)}>0$, $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}>mode^{(l)}$ holds with probability 1 if $\hat{s}^{(l)}>0$, and $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}<mode^{(l)}$ holds with probability 1 if $\hat{s}^{(l)}<0$.

The detailed theoretical derivation when $\alpha$ is smaller: According to the definition of the RDM center, $\mathrm{u}^{(l)}$ is the average of the top $\lceil\alpha\cdot n_l\rceil$ high density values in $d_1^{(l)},\dots,d_{n_l}^{(l)}$. When $\alpha$ is smaller in $(0,1)$ but $\lceil\alpha\cdot n_l\rceil=1$, we have $\mathrm{u}^{(l)}=mode^{(l)}$. Combined with $\hat{\sigma}^{(l)}>0$, we have $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}>mode^{(l)}$. Following Theorem 1, let $\xi$ be the distance between an instance and the center of class $l$, and $f(x)$ be the probability density function of $\xi$. Given that the SC of $f(x)$ satisfies $s^{(l)}>0$, there are two properties of statistical probability: (i) $f(mode+\varepsilon)>f(mode-\varepsilon)$ for any $\varepsilon>0$; (ii) $f(x)$ is monotonically increasing near the left side of $mode$ and decreasing near the right side. $d_1^{(l)},\dots,d_{n_l}^{(l)}$ is a random sample of $\xi$; let $d_2, d_3$ denote the values with the 2nd and 3rd largest density in $d_1^{(l)},\dots,d_{n_l}^{(l)}$, respectively, and $d_1=mode^{(l)}$. Obviously $f(d_1)>f(d_2)>f(d_3)$. When $\alpha$ is smaller in $(0,1)$ but $\lceil\alpha\cdot n_l\rceil=2$, $\hat{\sigma}^{(l)}>0$ and $\hat{s}^{(l)}>0$; $\hat{s}^{(l)}$ has the same property as $s^{(l)}$ (see lines 154-157, right column). Due to random errors in the sampling, the probability that $d_2>d_1$ holds w.r.t. property (i) is 1, then $\mathrm{u}^{(l)}=\frac{d_1+d_2}{2}>d_1$, i.e., at least $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}>mode^{(l)}$ holds with probability 1. When $\alpha$ is smaller in $(0,1)$ but $\lceil\alpha\cdot n_l\rceil=3$, $\hat{\sigma}^{(l)}>0$ and $\hat{s}^{(l)}>0$. If $d_3>d_1$, then $\mathrm{u}^{(l)}=\frac{1}{3}(d_1+d_2+d_3)>d_1$; if $d_3<d_1$ (see https://i.postimg.cc/GmzqF1fV/PDF.png), since $f(d_3)<f(d_2)$, combining property (i), $d_3$ can only be very close to $d_1$ when $\hat{s}^{(l)}>0$, namely $\mathrm{u}^{(l)}=\frac{1}{3}(d_1+d_2+d_3)$ is very close to $d_1$. Combined with $\hat{\sigma}^{(l)}>0$, it holds with probability 1 that $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}>mode^{(l)}$ for $\lceil\alpha\cdot n_l\rceil=3$. When $\lceil\alpha\cdot n_l\rceil=4,5,\dots$, the same conclusion can be obtained. Similarly, if $\hat{s}^{(l)}<0$, $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}$ is relatively small for obtaining all highest density values in $d_1^{(l)},\dots,d_{n_l}^{(l)}$.
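
As an illustrative, non-rigorous numerical check (ours, using a crude k-nearest-neighbour density estimate and the plain sample standard deviation as a stand-in for $\hat{\sigma}^{(l)}$, not the authors' implementation), one can draw right-skewed samples and verify that $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}>mode^{(l)}$ holds in practice even for a small $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)
trials, n, alpha, k = 1000, 200, 0.1, 10
holds = 0
for _ in range(trials):
    d = rng.gamma(shape=2.0, scale=1.0, size=n)  # right-skewed distances (SC > 0)
    # crude density estimate: inverse of the gap to the k-th nearest sample value
    gaps = np.sort(np.abs(d[:, None] - d[None, :]), axis=1)[:, k]
    density = 1.0 / (gaps + 1e-12)
    top = np.argsort(density)[-int(np.ceil(alpha * n)):]
    u = d[top].mean()             # mean of the top-density distances (RDM-style u)
    sigma = d.std()               # plain sample std as a stand-in for sigma-hat
    mode = d[np.argmax(density)]  # highest-density sample as a stand-in for mode
    holds += u + 2 * sigma > mode
print(f"u + 2*sigma > mode held in {holds}/{trials} trials")
```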

Experimental verification for $\alpha$. Based on the theory and explanations, the value of $\alpha$ is reasonable in $(0,1)$, and thus we set $\alpha=0.1,\dots,0.9$ in Figs. 7 and 9. As mentioned in lines 113-115 of our submission, $\alpha$ is the ratio of higher-density instances to all instances in a class, so it is affected by dataset characteristics. For some datasets, such as CLL (see Fig. 1b), due to the scattered distribution of clusters and multiple local high-density subclusters, $\alpha$ has a significant impact on ACC in some cases.

The details of the responses to the reviewer's concerns are shown in the previous rebuttal. We will polish the complete manuscript, not limited to "larger" in Theorem 2 and "the most of" in line 178. We promise to release the code once the paper is accepted.

As demonstrated by the above explanations, our theoretical derivation is complete and rigorous (whether $\alpha$ is larger or smaller), and the experimental verification is effective. To avoid confusion, all of the above details will be included in the final version. We hope this provides sufficient reason to consider raising the score.

Review (Rating: 3)

For the problem of high-dimensional data classification with intra-class asymmetric instance distribution or multiple high-density clusters (ADMHC), a novel supervised feature selection (FS) method named Stray Intrusive Outliers-based FS (SIOFS) is proposed. The proposed method uses the RDM center to characterize the class body, and the modified skewness coefficient (SC) is adjusted and fused into the $3\sigma$ principle to define the class body. Then, the intrusion degree is modeled based on the conclusion of intersecting spheres. Finally, the feature ranking is determined by the intrusion degrees of SIOs. Experimental results demonstrate the effectiveness of the proposed method over other state-of-the-art FS methods.

Questions for Authors

For classification problems, an increase in the number of classes has a significant impact on the performance of a method, i.e., it increases the difficulty of classification. I noticed that the number of classes in the datasets used in the experiments is relatively small, with at most only 40 classes. A natural question is whether the proposed method still has a significant advantage over existing methods when the number of classes is large. Therefore, as the number of classes increases, the effectiveness of the proposed method needs further experimental verification.

Claims and Evidence

• This paper mainly makes the following claim: in high-dimensional data classification with intra-class ADMHC, the distribution of distances is asymmetric or multi-peaked. Existing FS methods rarely aim to identify the class body in the context of intra-class ADMHC. The proposed SIOFS method targets intra-class ADMHC for data classification.

• The theoretical results in this paper and the related experimental verification provide strong and clear evidence for the claim.

Methods and Evaluation Criteria

Yes. The proposed method indeed demonstrates advantages over other state-of-the-art FS methods in experiments.

Theoretical Claims

Yes. I have reviewed the theoretical proofs and the correctness of the proofs supports the claim.

Experimental Design and Analysis

Yes. The experimental designs to verify the effectiveness of the proposed method are complete, and the comparison of other state-of-the-art FS methods and the comparison and analysis of experimental results on several diverse benchmark datasets well demonstrate the effectiveness of SIOFS.

Supplementary Material

Yes. I have reviewed the proofs of the theoretical results and the further details of the methods and experiments in the supplementary material.

Relation to Prior Literature

Prior FS methods rarely aim to identify the class body in the context of intra-class ADMHC. This paper explains the motivation for quantifying the intrusion degree of SIOs and proposes the SIOFS method to deal with intra-class ADMHC for data classification.

Missing Essential References

No. All the essential references have been adequately discussed.

Other Strengths and Weaknesses

• Strengths:

The proposed method effectively handles the data classification problem with intra-class ADMHC and achieves better performance than existing methods.

• Weaknesses:

  1. Theorem 2 assumes that $\alpha$ is a larger value, but the experimental results show that the value of $\alpha$ is often relatively small, with the largest being 0.6, which occurs only occasionally. What is the exact range of values of $\alpha$ in Theorem 2? How can the inconsistency between the theoretical results and the experiments be explained?

  2. In Eq. (6), the coefficient 2 is adjusted to $2-\frac{1}{3}\hat{s}^{(l)}$, and the motivation and basis for this adjustment are not explained in detail. Can the coefficient $\frac{1}{3}$ of $\hat{s}^{(l)}$ be adjusted to other values between 0 and 1?

Other Comments or Suggestions

Line 161, "Condition i" --> "Condition (i)",

Line 123, right column, "condition ii" --> "condition (ii)",

Line 130, right column, "condition ii" --> "condition (ii)".

Author Response

We sincerely thank Reviewer 52hZ for the recognition of our work and for providing constructive comments.

Q1: Explain the inconsistency between Theorem 2 and the experimental results regarding $\alpha$.

Sorry for the incomplete statement about Theorem 2. We clarify that the condition for α\alpha in Theorem 2 does not contradict the experimental results. When α\alpha is small, the conclusion in lines 183-187 (left column) still holds. For easy understanding, we add the following statement.

According to the definition of the RDM center (see lines 96-98, right column), $\mathrm{u}^{(l)}$ is the average of the top $\lceil\alpha\cdot n_l\rceil$ high density values in $d_1^{(l)},\dots,d_{n_l}^{(l)}$. When $\alpha$ is smaller but $\lceil\alpha\cdot n_l\rceil=1$, we have $\mathrm{u}^{(l)}=mode^{(l)}$. Following Theorem 1, let $\xi$ be the distance between an instance and the center of class $l$, and $f(x)$ be the probability density function of $\xi$. Assume that the SC of $f(x)$ is greater than 0. There are two properties of statistical probability (Harris & Stocker, 1998): (i) $f(mode+\varepsilon)>f(mode-\varepsilon)$ for any $\varepsilon>0$; (ii) $f(x)$ is monotonically increasing near the left side of $mode$ and decreasing near the right side. In Theorem 2, $d_1^{(l)},\dots,d_{n_l}^{(l)}$ is a random sample of $\xi$; let $d_2, d_3$ denote the values with the second and third largest density in $d_1^{(l)},\dots,d_{n_l}^{(l)}$, respectively, and $d_1=mode^{(l)}$. Obviously $f(d_1)>f(d_2)>f(d_3)$. When $\alpha$ is small but $\lceil\alpha\cdot n_l\rceil=2$, $\hat{\sigma}^{(l)}>0$ and $\hat{s}^{(l)}>0$. Given the chance of errors in the sampling process, the probability that $d_2>d_1$ holds according to condition (i) is 1, then $\mathrm{u}^{(l)}=\frac{d_1+d_2}{2}>d_1$, i.e., at least $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}>mode^{(l)}$ holds with probability 1. When $\alpha$ is small but $\lceil\alpha\cdot n_l\rceil=3$, $\hat{\sigma}^{(l)}>0$ and $\hat{s}^{(l)}>0$. If $d_3>d_1$, then $\mathrm{u}^{(l)}=\frac{1}{3}(d_1+d_2+d_3)>d_1$; if $d_3<d_1$ (see figure https://i.postimg.cc/GmzqF1fV/PDF.png), since $f(d_3)<f(d_2)$, combining property (i), $d_3$ can only be very close to $d_1$ in this skewed distribution, i.e., $\mathrm{u}^{(l)}=\frac{1}{3}(d_1+d_2+d_3)$ is very close to $d_1$. Combined with $\hat{\sigma}^{(l)}>0$, it holds with probability 1 that $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}>mode^{(l)}$ for $\lceil\alpha\cdot n_l\rceil=3$. When $\lceil\alpha\cdot n_l\rceil=4,5,\dots$, the same conclusion can be obtained. Similarly, when $\hat{s}^{(l)}<0$ and $\alpha$ is small, with $\lceil\alpha\cdot n_l\rceil=1,2,\dots$, $\mathrm{u}^{(l)}+2\hat{\sigma}^{(l)}<mode^{(l)}$ holds with probability 1.

In addition, we rewrite "Based on Theorem 2..." (line 181) as "Combining Theorem 2..." to make the discussion and expression more rigorous. Based on the above explanations and Theorem 2, the value of $\alpha$ is reasonable in $(0,1)$, and thus we set $\alpha=0.1,\dots,0.9$ in Figures 5 and 7. Since $\alpha$ is the ratio of higher-density instances to all instances in a class (see lines 113-115), it is affected by different dataset characteristics. For example, due to the scattered distribution of clusters and multiple local high-density subclusters in CLL (see Fig. 1b), $\alpha$ has a significant impact on ACC (see Fig. 7). This problem can be solved by selecting more features. The above statements will be added in the final version.

Q2: Explanations of the adjustment of the coefficient "2" mentioned in the left column at line 189 and the normalization operation on line 193.

Please see our response to Reviewer aEFk, Q5.

Q3: Typos: conditions i-->(i), ii-->(ii) and iii-->(iii).

As suggested, we have gone over the paper and will revise them in the final version.

Q4: Whether SIOFS can still have a significant advantage over other methods when the number of classes is large?

As suggested, we add comparisons with the baselines on the Caltech101 dataset, which has 101 classes and 3030 images. Like (Yuan et al., 2022), we also use Fisher Vectors (262144 dimensions). Additionally, in order to reduce the computation time while preserving the main properties of the original representation, we uniformly sample these very high-dimensional feature vectors with a step of 50 components, i.e., indices $1, 51, \dots, 262144$, and obtain the final feature vector (5243 dimensions) for each image. Following the setting in (Yuan et al., 2022), we select 60%, 70%, 80%, and 90% of the features. As in the original paper, ACC and NMI are calculated over all baselines, and SIOFS outperforms all comparative methods in all cases. Some results are shown in Rebuttal Table A.
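
For clarity, the uniform sub-sampling described above amounts to keeping every 50th component; a minimal numpy sketch (with a hypothetical feature matrix `fv`):

```python
import numpy as np

fv = np.random.rand(10, 262144)  # 10 images x 262144-dim Fisher Vectors (toy data)
fv_sub = fv[:, ::50]             # keep components 1, 51, 101, ... (1-based)
print(fv_sub.shape)              # (10, 5243)
```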

Rebuttal Table A. ACC (%) on the large-scale high-dimensional Caltech101 dataset. Details will be added in the final version.

| ACC (%) ↑ | 60% | 70% | 80% | 90% |
| --- | --- | --- | --- | --- |
| FSDOC | 43.56 | 43.43 | 43.93 | 43.20 |
| S2DFS | 42.38 | 43.43 | 43.56 | 43.66 |
| IOFS | 45.45 | 48.84 | 51.52 | 53.83 |
| EGCFS | 43.96 | 43.89 | 43.86 | 44.26 |
| FSDK | 39.14 | 40.46 | 42.15 | 42.97 |
| SIOFS | 45.74 | 48.98 | 51.58 | 54.13 |
Final Decision

This paper proposes a novel Stray Intrusive Outliers-based Feature Selection (SIOFS) method for high-dimensional data classification with intra-class asymmetric instance distribution or multiple high-density clusters (ADMHC). The key innovation lies in focusing on stray intrusive outliers that intrude into other class bodies, using a refined density-mean center to characterize class bodies and modifying the skewness coefficient fused with the 3σ principle to identify class boundaries. Theoretical guarantees are provided for the proposed method, and extensive experiments on 15 benchmark datasets demonstrate its effectiveness.

The paper received generally positive reviews, with all three reviewers acknowledging its ​​novelty​​ in addressing ADMHC scenarios and its ​​strong empirical results​​. Reviewer aEFk highlighted the method's potential applications but initially expressed concerns about theoretical-experimental alignment. While minor weaknesses were noted, the reviewers unanimously leaned toward acceptance after the rebuttal.