Unsupervised Feature Selection using a Basis of Feature Space and Self-Representation Learning
A novel representation learning method, Graph Regularized Self-Representation and Sparse Subspace Learning (GRSSLFS), is proposed for unsupervised feature selection.
Abstract
Reviews and Discussion
This paper introduces a novel unsupervised feature selection method called GRSSLFS, which combines matrix factorization with self-representation subspace learning and applies graph regularization to preserve the geometric structure of the feature vectors. The method is shown to be effective both theoretically and experimentally.
Strengths
In this paper, the authors introduce the problem of redundant data in traditional self-representation and then apply matrix factorization to the self-representation problem in order to reduce the dimension of the basis matrix. Here are some strengths of this article:
- This paper introduces the novel problem of redundant data in self-representation and then proposes a method to solve it.
- Plenty of theoretical proofs are given in the paper and appendix; the convergence analysis indeed increases the persuasiveness of the article.
- The proposed method is compared with a variety of baseline algorithms on multiple datasets, demonstrating its effectiveness.
Weaknesses
However, there are still some weaknesses in this paper.
- At the end of the Introduction section, the second and third contribution points are not sufficient, as these regularization constraints are not proposed in this article.
- In the methodology section, some formula calculations are confusing and not very convincing, such as the multiplication in formula (6) and the optimization target in formula (7). These issues are described in detail in Questions 1 and 2.
- In the methodology section, the description of the algorithm is not complete enough. The specific process of selecting features according to the matrix U in Algorithm 1 has not been described in detail.
Questions
- In the methodology section, equation (6) is confusing and not very clear. It seems impossible to subtract the matrix XUV of shape m×m from the matrix X of shape m×n?
- As the feature matrix B is fixed by the VBE method proposed in Section 3.3, it is unclear why the basis coefficient matrix G in equation (7) is a parameter to be optimized. Why can the matrix G not be determined directly from equation (4), thereby reducing the number of parameters?
- In Section 3.1, the subspace learning that introduces graph regularization seems to be an existing method. Should this part of the content be moved to related work?
Response to Questions:
[Question1]. In the methodology section, equation (6) is confusing and not very clear. It seems impossible to subtract the matrix XUV of shape m×m from the matrix X of shape m×n?
[Response]. With respect to the second comment of the previous part, we have adjusted the way we introduce the matrices $\mathbf{U}$ and $\mathbf{V}$ after formula (6) by incorporating their respective sizes. This subtraction is feasible because $\mathbf{X}$ is of size $m\times n$ and the sizes of $\mathbf{U}$ and $\mathbf{V}$ are such that the product $\mathbf{X}\mathbf{U}\mathbf{V}$ is also of size $m\times n$. Therefore, the confusion has been resolved by including the dimensional information of the matrices.
[Question2]. As the feature matrix B is fixed by the VBE method proposed in Section 3.3, it is unclear why the basis coefficient matrix G in equation (7) is a parameter to be optimized. Why can the matrix G not be determined directly from equation (4), thereby reducing the number of parameters?
[Response]. This is an insightful question, and initially, we believed it could aid in reducing computations by fixing the basis matrix $\mathbf{B}$ and computing $\mathbf{G}$ from Equation (4). However, it is essential to note that Equation (4) constitutes the first term of our objective function in Equation (8). To be more precise, in addition to relation (4), the matrix $\mathbf{G}$ simultaneously plays a crucial role in preserving the geometrical information of the data in the final equation of Subsection 3.1 and is actively involved in the subspace learning process in the concluding equation of Subsection 3.2. Therefore, to ensure that all parts of the objective function in (8) decrease during the updating process, we must consider $\mathbf{G}$ as a part of the whole and update it within the general framework, as we did. If we only computed $\mathbf{G}$ from equation (4), we could only guarantee the decrease of the first term of the objective function.
[Question3]. In Section 3.1, the subspace learning that introduces graph regularization seems to be an existing method. Should this part of the content be moved to related work?
[Response]. The reviewer highlighted the graph regularization term within the framework of manifold learning, which is an established technique widely used in dimensionality reduction. However, the novelty of our regularization term lies in establishing a connection between the concept of the basis matrix and the geometrical structure of the data. Specifically, we introduce a new term, $\mathrm{Tr}\left(\mathbf{B}\mathbf{G}\mathbf{L}\mathbf{G}^T\mathbf{B}^T\right)$, in which the basis matrix $\mathbf{B}$ plays a crucial role, facilitated by the coefficient matrix $\mathbf{G}$. Therefore, as a consequence of our self-representation framework in Section 3.1, we use the basis matrix for preserving the geometry of the data.
Response to Weaknesses:
[Weaknesses1]. At the end of the Introduction section, the second and third contribution points are not sufficient, as these regularization constraints are not proposed in this article.
[Response]. We thank the reviewer for raising this point. In response to the reviewer's comment, we would like to mention that we have revised the contribution points of our proposed method. The main contributions of the paper are summarized below:
1- To the best of our knowledge, GRSSLFS is the first feature selection method that incorporates a basis of linearly independent features into a unified framework of subspace learning and self-representation.
2- GRSSLFS characterizes the challenges of subspace learning and self-representation by utilizing a basis matrix derived from the feature space. The objective is to eliminate unnecessary features and identify informative ones within high-dimensional data.
3- GRSSLFS constructs a basis for the feature space by incorporating the variance information of features into the Basis Extension method. The goal is to create a feature basis with the highest variance score. This ensures that the basis for the feature space comprises elements with the most dispersion in space, thereby minimizing the accumulation of features.
[Weaknesses2]. In the methodology section, some formula calculations are confusing and not very convincing, such as the multiplication in formula (6) and the optimization target in formula (7). These issues are described in detail in Questions 1 and 2.
[Response]. We greatly appreciate the reviewer for providing valuable comments that helped clarify certain aspects. To address the concerns raised, we thoroughly reviewed all relevant matrices involved in formulations (6) and (7) and included their sizes, which were omitted in the initial submission. Specifically, we appended the sizes of the matrices $\mathbf{U}$ and $\mathbf{V}$ after equation (6) and the size of the matrix $\mathbf{G}$ before equation (7).
[Weaknesses3]. In the methodology section, the description of the algorithm is not complete enough. The specific process of selecting features according to the matrix U in Algorithm 1 has not been described in detail.
[Response]. We thank the reviewer for raising this issue. Regarding this comment, we have added an explanation about choosing the features according to the matrix $\mathbf{U}$ in the revised manuscript, just after the main problem on page 6. Please note that, as we explain in Section 3.2 on subspace learning, the subspace learning term and its counterpart in our basis-based model allow us to interpret the product $\mathbf{X}\mathbf{U}$ as an approximation of the sub-matrix of selected features. Writing $\mathbf{X}\mathbf{U}=\sum_{i=1}^{n}\mathbf{f}_i\mathbf{u}_i$, where $\mathbf{u}_1,\ldots,\mathbf{u}_n$ are the rows of the matrix $\mathbf{U}$, this representation indicates that each row of the matrix $\mathbf{U}$ can be interpreted as a weight in displaying the sub-matrix of selected features. So, we use the norms of the rows of the matrix $\mathbf{U}$ as the importance measure of each feature (column) of the input data $\mathbf{X}$.
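For illustration, the following minimal NumPy sketch shows this row-norm selection rule; the matrix sizes and variable names are illustrative and not taken from the manuscript.

```python
import numpy as np

def select_features_by_row_norms(U, num_selected):
    """Rank features by the l2-norms of the rows of U and return the indices
    of the top-ranked ones (a larger row norm means a more important feature)."""
    row_norms = np.linalg.norm(U, axis=1)      # one score per original feature
    ranked = np.argsort(row_norms)[::-1]       # descending order of importance
    return ranked[:num_selected]

# Hypothetical usage: U has one row per original feature of X.
U = np.abs(np.random.randn(500, 40))           # 500 features, 40 basis elements
idx = select_features_by_row_norms(U, num_selected=50)
# X_selected = X[:, idx] would then keep the 50 chosen columns of X.
```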
Dear Reviewer, We kindly request your prompt review of our responses as the deadline is in a few hours.
Thank you for your timely assistance
The authors of this paper propose Graph Regularized Self-Representation and Sparse Subspace Learning (GRSSLFS) for unsupervised feature selection. The basis extension method is modified to select bases with the highest variance scores. These bases are used to build graph-regularized self-representation learning and subspace learning.
Strengths
The graph-regularized self-representation learning and subspace learning are combined in terms of a set of bases selected from the input space with the highest variance scores. Experiments on various datasets demonstrate the advantage of the proposed method compared with baselines. The ablation study also shows the necessity of each component.
Weaknesses
The subspace learning module is the key component, but its derivation from the selected bases relies heavily on the assumption that XU=B. This might not hold if B is selected according to the proposed variance basis extension method. Moreover, the selection based on the subspace learning module lacks a convincing explanation, since G does not exactly represent X based on B as a fixed set of feature vectors.
Questions
It is confusing to explain (4) as the self-representation problem if B is an arbitrary basis matrix, since its columns may not come from the input data matrix X. Taking PCA for example, the columns of B are orthogonal, but they are not from the input feature space. Moreover, B defined as a square matrix of size m is inconsistent with the selected r bases in Section 3.3.
In Section 3.1, the authors mention that if two features have a similar structure in the feature space, it is expected that Bg_l and Bg_r also have a similar structure. What does similar structure mean? How is the similarity measured? In other words, it is unclear how the matrix A is constructed.
The derivation in Section 3.2 depends on the assumption that XU=B. As B is a set of feature vectors selected from the input data, it is unclear whether the assumption still holds. Similarly, for Theorem 3.1, the result is trivial if the assumption holds.
The variance basis extension simply changes the selection order of feature vectors according to their variance scores. It is possible that each individual feature has a high variance, but is that the same as saying the selection captures the largest amount of data dispersion?
For completeness, the authors should describe how equations (9)-(11) are derived. Since all three update rules are fractional, is it possible that any of the denominators becomes zero? How is this handled?
In Algorithm 1, the selected features are derived from U. However, U is not directly related to the input X but rather to B and G, unless BG=X. However, B consists of feature vectors selected from the input space; it is unclear why the assumption can hold. So why is the selection rule proper?
The computational complexity is quite high, since it is quadratic in both the number of samples and the number of features compared with most baseline methods. The application to the PneumoniaMNIST dataset is quite interesting. However, the presentation of the outcomes can be improved significantly. For example, what is the region of interest? How many selected features fall in the region of interest? How do the other compared methods perform? The validation is not quantified: how many radiologists are involved in the evaluation, and what performance is measured? The plots in Fig. 2 deliver little useful information except that more red points accumulate in the center when the number of selected features increases.
We thank the reviewer for raising this point. According to the self-representation formulation (4), and letting $\mathbf{G}=[\mathbf{g}_1,\mathbf{g}_2,\ldots,\mathbf{g}_n]$, we get
$
\mathbf{X}\simeq\mathbf{BG}\Rightarrow[\mathbf{f}_1,\mathbf{f}_2,\ldots,\mathbf{f}_n]\simeq[\mathbf{B}\mathbf{g}_1,\ldots,\mathbf{B}\mathbf{g}_n].
$
Therefore, each feature vector $\mathbf{f}_l$ is approximated by the vector $\mathbf{B}\mathbf{g}_l$.
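As a small illustration of this basis representation (not the optimization used in the paper), a coefficient matrix $\mathbf{G}$ with $\mathbf{X}\simeq\mathbf{B}\mathbf{G}$ can be obtained by least squares once a basis of feature columns is fixed; the sizes and the basis choice below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 30, 200                        # m samples, n features (columns of X)
X = rng.random((m, n))

# Take m linearly independent columns of X as a basis B of the feature space
# (an illustrative choice, not the VBE ordering of the paper).
B = X[:, :m]                          # shape (m, m)

# Least-squares coefficients G such that X ~= B @ G (exact when B is invertible).
G, *_ = np.linalg.lstsq(B, X, rcond=None)
print(np.allclose(X, B @ G))          # True up to numerical precision
```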
To address the reviewer's comments, we have clarified the meaning of similar structure by adding a description in the main paper stating that our metric for measuring the similarity between features relies on their geometric distance in space. Specifically, to incorporate the geometric structure of the data, the elements of the similarity matrix $\mathbf{A}$ are constructed as follows:
$
\mathbf{a}_{ij} = \begin{cases} e^{\frac{-\Vert \mathbf{f}_{i} - \mathbf{f}_{j} \Vert_2^2}{t^2}}, & \text{if}\quad \mathbf{f}_{j} \in N_{k}(\mathbf{f}_{i})\quad \text{or} \quad \mathbf{f}_{i} \in N_{k}(\mathbf{f}_{j}), \\ 0, & \text{otherwise,} \end{cases}
$
where $N_{k}(\mathbf{f}_{i})$ refers to the set of the $k$ feature vectors that are closest to $\mathbf{f}_{i}$, and $t$ denotes the scale parameter of the Gaussian distribution. Please note that, by this definition, if the vectors $\mathbf{f}_{i}$ and $\mathbf{f}_{j}$ are close, then the value of $\mathbf{a}_{ij}$ is large, and vice versa. One advantage of this formulation, based on the k-nearest-neighborhood approach, is that it makes the matrix $\mathbf{A}$ sparse, and the computational complexity of the manifold learning process is considerably reduced.
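The construction above can be sketched in a few lines of NumPy; k and t are free parameters here and the implementation details are illustrative.

```python
import numpy as np

def knn_gaussian_similarity(X, k=5, t=1.0):
    """Similarity a_ij between the feature columns f_i, f_j of X: a Gaussian
    kernel value if one feature is among the k nearest neighbours of the
    other, and 0 otherwise (which keeps the matrix A sparse)."""
    F = X.T                                                     # rows are feature vectors
    d2 = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)  # squared distances
    n = F.shape[0]
    A = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, 1:k + 1]                     # k nearest neighbours (self excluded)
    for i in range(n):
        for j in nn[i]:
            w = np.exp(-d2[i, j] / t ** 2)
            A[i, j] = w                                         # f_j in N_k(f_i)
            A[j, i] = w                                         # symmetrise for the "or" condition
    return A
```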
In the theory of manifold learning, it is commonly assumed that if two vectors are similar to each other in the original space, then their representations in the transformed space should also be close. Based on this observation, if the two features $\mathbf{f}_{q}$ and $\mathbf{f}_{r}$ are similar, it is expected that their mapped vectors $\mathbf{B}\mathbf{g}_{q}$ and $\mathbf{B}\mathbf{g}_{r}$ will also be similar. To this end, the problem of manifold learning for the transformation can be defined as:
$
\min\_G \frac{1}{2}\sum\_{q=1}^n\sum\_{r=1}^n\left\Vert \mathbf{B}\mathbf{g}\_{q}-\mathbf{B}\mathbf{g}\_{r} \right\Vert\_2^2a\_{qr}
$
implying that $\left\Vert \mathbf{B}\mathbf{g}_{q}-\mathbf{B}\mathbf{g}_{r} \right\Vert_2^2$ must be small whenever $a_{qr}$ is large (which means that $\mathbf{f}_{q}$ and $\mathbf{f}_{r}$ are close). Computations show that
$
\begin{aligned}
\frac{1}{2}\sum_{q=1}^n\sum_{r=1}^n\left\Vert \mathbf{B}\mathbf{g}_{q}-\mathbf{B}\mathbf{g}_{r} \right\Vert_2^2a_{qr}
&=\frac{1}{2}\sum_{q=1}^n\sum_{r=1}^n\left(\mathbf{B}\mathbf{g}_{q}-\mathbf{B}\mathbf{g}_{r}\right)^T\left(\mathbf{B}\mathbf{g}_{q}-\mathbf{B}\mathbf{g}_{r}\right)a_{qr}\\
&=\sum_{r=1}^n\left(\mathbf{g}_{r}^T\mathbf{B}^T\mathbf{B} \mathbf{g}_{r}\right){p}_{rr}-\sum_{q=1}^n\sum_{r=1}^n\left(\mathbf{g}_{r}^T\mathbf{B}^T\mathbf{B} \mathbf{g}_{q}\right)a_{qr}\\
&=\mathrm{Tr}\left(\mathbf{B}\mathbf{G}\mathbf{L}\mathbf{G}^T\mathbf{B}^T\right)
\end{aligned}
$
in which $\mathbf{L}=\mathbf{P}-\mathbf{A}$ is the Laplacian matrix, where $\mathbf{P}$ is the diagonal matrix with $p_{rr}=\sum_{q=1}^n a_{qr}$, for $r=1,\ldots,n$.
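A quick numerical check of this identity (our own sketch with arbitrary small matrices) is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 10
B = rng.random((m, m))                      # basis matrix
G = rng.random((m, n))                      # coefficient matrix
A = rng.random((n, n)); A = (A + A.T) / 2   # symmetric similarity matrix

P = np.diag(A.sum(axis=0))                  # diagonal matrix with p_rr = sum_q a_qr
L = P - A                                   # graph Laplacian

lhs = 0.5 * sum(A[q, r] * np.sum((B @ G[:, q] - B @ G[:, r]) ** 2)
                for q in range(n) for r in range(n))
rhs = np.trace(B @ G @ L @ G.T @ B.T)
print(np.isclose(lhs, rhs))                 # True
```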
We thank the reviewer for raising this point. In order to address the reviewer’s question, we would like to confirm that the variance of a feature measures its dispersion or range. When a feature exhibits high variance, it suggests that its values are widely spread over a broader range. In the context of the variance basis extension, altering the selection order of feature vectors based on their variance scores implies a focus on features with significant variability or dispersion.
However, it is important to note that while high variance indicates a wide spread of values for an individual feature, it does not necessarily mean that this feature contributes the most to the overall dispersion or variability in the dataset. The idea of the "largest amount of data dispersion" is better captured by considering the overall variance or variability of the entire dataset, which takes into account the combined effects of all features. In summary, high variance for individual features reflects their own level of dispersion, but identifying the features contributing the most to the overall data dispersion requires considering the collective impact of all features. For more information, we would like to mention that the theoretical investigations into this issue can be found in some related studies, such as ``Unsupervised feature selection based on variance-covariance subspace distance, Neural Networks, 2023."
We thank the reviewer for raising this point. In response, we should mention that, considering the explanations provided in response to the Weaknesses section, in our algorithm for generating the basis (the proposed Variance Basis Extension (VBE) method in Subsection 3.4), we select the basis vectors from the columns of the input data matrix $\mathbf{X}$, denoted as $\mathbf{f}_1,\ldots,\mathbf{f}_n$. Therefore, in our subspace learning formulation in Subsection 3.2, a matrix $\mathbf{U}$ exists such that $\mathbf{X}\mathbf{U}=\mathbf{B}$.
Because of this, in the proof of Theorem 3.1, we explained how to construct this matrix $\mathbf{U}$, and then, as the reviewer mentioned, the proof is straightforward.
We thank the reviewer for raising this point. In response to the reviewer's comment, we would like to mention that, in general, a basis for a vector space generated by a set of vectors such as $\{\mathbf{f}_1,\ldots,\mathbf{f}_n\}$ can be formed in two ways:
First case: the basis members are chosen directly from the vectors $\mathbf{f}_1,\ldots,\mathbf{f}_n$. A well-known method for this selection is the Basis Extension (BE) method, detailed in the manuscript. The BE method is capable of constructing a set of basis vectors for any vector space, including the feature space in our problem. In this scenario, as addressed in the initial question from the reviewer, there exists a $0,1$-matrix $\mathbf{U}$ such that $\mathbf{X}\mathbf{U}=\mathbf{B}$. This implies that the columns of $\mathbf{B}$ are derived from the input feature space.
Second case: the basis members are extracted using the vectors $\mathbf{f}_1,\ldots,\mathbf{f}_n$ without necessarily coinciding with them. For instance, as noted by the reviewer, the basis vectors of PCA can be utilized to ensure that the constructed basis vectors are mutually perpendicular. However, this is not the case for our proposed feature selection framework.
In response to the reviewer's other comment, we would like to mention that, as stated in Subsection 3.1 (just after Equation (2)), our manuscript assumes throughout that $m\le n$ and that the feature space spanned by the columns of $\mathbf{X}$ has full dimension $m$. Consequently, the matrix $\mathbf{B}$ is consistently a square matrix of size $m\times m$.
It is worth emphasizing that, even in scenarios where these assumptions do not hold, the creation of a basis matrix for the feature space is still viable and does not pose any challenges for the development of our feature selection framework. To elaborate further, for any matrix $\mathbf{X}$ whose feature space has dimension $r$ (where $r$ is always at most $\min(m,n)$), the BE method can be applied to generate the basis matrix $\mathbf{B}\in\mathbb{R}^{m\times r}$. This basis matrix can then be utilized in our proposed feature selection method. In such cases, it is sufficient to assume that the matrix $\mathbf{U}$ belongs to $\mathbb{R}^{n\times r}$, so that the matrix product $\mathbf{X}\mathbf{U}$ is in $\mathbb{R}^{m\times r}$ and has meaningful implications; moreover, it can be easily seen that the remaining expressions of the objective function stay well-defined. To address the reviewer's comment, we have incorporated a brief explanation of this issue in a remark on Page 4 of the revised manuscript.
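As an illustration of such a basis-extension pass over the feature columns, a plausible sketch is given below: it visits the columns in decreasing variance and keeps only those that are linearly independent of the columns kept so far. This is our own simplification under stated assumptions, not the exact VBE procedure of the paper.

```python
import numpy as np

def variance_ordered_basis(X, tol=1e-10):
    """Greedy basis-extension sketch: visit the feature columns of X in
    decreasing variance and keep a column only if it is linearly independent
    of the columns kept so far (checked via the matrix rank)."""
    order = np.argsort(X.var(axis=0))[::-1]   # candidate order by variance score
    kept = []
    for j in order:
        candidate = X[:, kept + [j]]
        if np.linalg.matrix_rank(candidate, tol=tol) == len(kept) + 1:
            kept.append(j)
        if len(kept) == X.shape[0]:           # at most m independent columns in R^m
            break
    return X[:, kept], kept                   # basis matrix and selected indices
```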
We thank the reviewer for raising this point. In response to the reviewer's comment, we would like to mention that the assumption holds and there always exists a matrix like $\mathbf{U}$. The reason is that in our proposed Variance Basis Extension (VBE) method, which is explained in Subsection 3.4 (Subsection 3.3 in the old version), we generate the basis matrix as $\mathbf{B}=[\mathbf{f}_{i_1},\ldots,\mathbf{f}_{i_m}]$, in which the basis vectors are selected from the input feature set $\{\mathbf{f}_1,\ldots,\mathbf{f}_n\}$, that is, from the columns of the data matrix $\mathbf{X}$. Now, as in the proof of Theorem 3.1, we can define the matrix $\mathbf{U}$ as $\mathbf{U}=[\mathbf{e}_{i_1},\ldots,\mathbf{e}_{i_m}]$, where $\mathbf{e}_{i_j}$ (for $j=1,\ldots,m$) is a vector whose $i_j$-th element is 1 and whose other elements are 0. Then it can be seen that
$
\mathbf{X}\mathbf{U}=[\mathbf{f}_1,\ldots,\mathbf{f}_n]\,[\mathbf{e}_{i_1},\mathbf{e}_{i_2},\ldots,\mathbf{e}_{i_m}]=[\mathbf{f}_{i_1},\ldots,\mathbf{f}_{i_m}]=\mathbf{B}.
$
Thus, the $0,1$-matrix $\mathbf{U}$ always exists in our method of generating the basis matrix, and the equality $\mathbf{X}\mathbf{U}=\mathbf{B}$ is satisfied. On the other hand, considering that the matrix $\mathbf{B}$ serves as a basis for the feature space, it follows from a fundamental theorem in linear algebra (namely, each vector in a vector space can be expressed as a linear combination of basis members) that a matrix $\mathbf{G}\in\mathbb{R}^{m\times n}$ exists such that $\mathbf{X}$ can be represented in its basis form, denoted as $\mathbf{X} = \mathbf{BG}$. According to the $0,1$-matrix $\mathbf{U}$ mentioned above, it can be seen that $\mathbf{X}=\mathbf{B}\mathbf{G}=\mathbf{X}\mathbf{U}\mathbf{G}$. Hence, $\mathbf{G}$ as the coefficient matrix can be used to recover the data matrix $\mathbf{X}$ from the basis matrix $\mathbf{B}$. This explanation is provided to show how the matrices $\mathbf{G}$ and $\mathbf{B}$ are used in the framework of subspace learning.
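This argument can be verified numerically; the sketch below uses arbitrary sizes and indices of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 9
X = rng.random((m, n))                  # columns f_1, ..., f_n
basis_idx = [2, 5, 7, 8]                # illustrative indices i_1, ..., i_m
B = X[:, basis_idx]

# 0,1 selection matrix: column j of U is the standard basis vector e_{i_j}.
U = np.zeros((n, m))
U[basis_idx, np.arange(m)] = 1.0
print(np.array_equal(X @ U, B))         # True: XU = B

# Since B is a basis (here an invertible m x m matrix), X = B G, and hence X = X U G.
G = np.linalg.solve(B, X)
print(np.allclose(X, X @ U @ G))        # True
```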
In response to the reviewer's comment on the update rules, we would like to mention that, as stated on Page 6 of the paper, the extraction of the update rules introduced for $\mathbf{G}$, $\mathbf{U}$, and $\mathbf{V}$ involves three steps:
(1) Calculating the gradient of the objective function with respect to one of the variables while keeping the other variables constant;
(2) Setting the obtained gradient equal to zero;
(3) Applying the Karush-Kuhn-Tucker conditions (for example, in our problem, the complementary slackness conditions associated with the non-negativity constraints).
To ensure that the denominator of the fraction does not become zero, two common workarounds are employed in the problems based on matrix factorization:
First option: Utilizing the assumption of non-negativity for the input matrices of the problem. Notably, one of the most famous dimension reduction methods, Non-negative Matrix Factorization (NMF), assumes the non-negativity condition for the data matrix and the unknown matrices used in the optimization problem, such as the matrices $\mathbf{G}$, $\mathbf{U}$, and $\mathbf{V}$ in our proposed framework. This guarantees that the updated matrices $\mathbf{G}$, $\mathbf{U}$, and $\mathbf{V}$ remain non-negative, avoiding the zero-denominator problem. Theoretical investigations into this issue can be found in some studies related to NMF, such as Lemma 1 on Page 3 of the following reference:
``A convergent algorithm for orthogonal nonnegative matrix factorization. Journal of Computational and Applied Mathematics, 260, 149-166."
Second option: This method, which is applicable even when assuming non-negativity, has a more practical aspect. To prevent the denominator from reaching zero and causing numerical calculation errors, a small positive number $\epsilon$ is used. Specifically, each denominator is guarded by $\epsilon$ so that every division is performed on a strictly positive quantity.
Employing this strategy ensures that fraction calculations are performed without encountering zero-related problems.
In the framework designed for our problem, the first option (assuming non-negativity of inputs) is used. Additionally, when implementing the method, the second option (comparing the denominator with a very small non-negative number) is also considered.
It is worth mentioning that even if the assumption of non-negativity on the input variables is not considered, there is a crucial principle in matrix analysis stating that any matrix with real elements can be expressed as the difference of two non-negative matrices. This is demonstrated as follows:
$
\mathbf{M}=\mathbf{M}^{+}-\mathbf{M}^{-},
$
where $\mathbf{M}^{+}=\frac{\vert\mathbf{M}\vert+\mathbf{M}}{2}$ and $\mathbf{M}^{-}=\frac{\vert\mathbf{M}\vert-\mathbf{M}}{2}$, with $\vert\mathbf{M}\vert$ taken element-wise. By utilizing this representation and the two options mentioned above, the problem of encountering zero in dimension reduction problems based on matrix factorization is effectively prevented.
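The two safeguards can be sketched as follows; the epsilon value, the update shape, and the symbol M are illustrative and do not reproduce the exact update rules (9)-(11) of the paper.

```python
import numpy as np

EPS = 1e-12

def safe_multiplicative_step(W, numerator, denominator):
    """Generic multiplicative-update step W <- W * numerator / denominator,
    with the denominator guarded away from zero."""
    return W * numerator / np.maximum(denominator, EPS)

def split_pos_neg(M):
    """Write a real matrix as the difference of two non-negative matrices:
    M = M_plus - M_minus with M_plus = (|M| + M)/2 and M_minus = (|M| - M)/2."""
    M_plus = (np.abs(M) + M) / 2.0
    M_minus = (np.abs(M) - M) / 2.0
    return M_plus, M_minus

M = np.array([[1.5, -2.0], [0.0, 3.0]])
Mp, Mm = split_pos_neg(M)
print(np.allclose(M, Mp - Mm))          # True; both Mp and Mm are non-negative
```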
To respond to the reviewer's comment, we list below the feature selection methods that are utilized as comparative benchmarks in our experiments, along with the category of approach each one follows; the computational complexity of each depends on the number of samples, the number of features, the number of iterations, the number of classes, and the batch size:
1- LS: Structure Preserving-based method,
2- CNAFS: Matrix factorization and Structure Preserving-based method,
3- OCLSP: Structure Preserving-based method,
4- RMFFS: Matrix factorization-based method,
5- RNE: Structure Preserving-based method,
6- VCSDFS: Matrix factorization-based method,
7- CAE: Encoder-decoder-based method,
8- CD-LSR: Least Square Regression-based method,
9- Our proposed method, GRSSLFS: Matrix factorization and Structure Preserving-based method.
In order to address the reviewer's comment, it is noteworthy that in the literature on matrix factorization-based feature selection, the computational complexity of such approaches is typically a function of the number of samples, the number of features, and the number of iterations. This pattern is observed in various matrix factorization-based methods, including CNAFS, RMFFS, VCSDFS, and our proposed method, GRSSLFS. Notably, our proposed method, GRSSLFS, introduces an additional step involving three unknown variables, in contrast to many unsupervised feature selection methods like CNAFS, RMFFS, and VCSDFS, which typically use two unknown variables. Due to this and the inclusion of an extra step considering the basis matrix, the computational complexity of our proposed method is quadratic in both the number of samples and the number of features. However, when the number of features far exceeds the number of samples, as in our study, the cost is dominated by the terms in the number of features.
The reviewer's great attention to this issue is much appreciated. In order to address the reviewer's question, we would like to highlight that the primary objective of this paper is to further investigate the impact of a basis for the feature space on tasks related to feature learning. As discussed in the paper, we first defined the self-representation problem based on the basis matrix $\mathbf{B}$ (Equation (4)). Moreover, in order to learn a low-dimensional representation of the high-dimensional data, we established the subspace learning problem based on the basis matrix (Equation (7)), in which the sizes of the unknown matrices are determined by the dimension of the selected feature space. Finally, we defined our proposed feature selection method based on these two problems, both of which are designed via the basis for the feature space. In summary, our proposed feature selection method employs the following structure:
$
\small \text{Dataset}\ \mathbf{X}\ \rightarrow\ \text{Construct the Basis}\ \mathbf{B}\ \text{for the Feature Space}\ \rightarrow\ \text{Define the Feature Selection Problem Using}\ \mathbf{B}.
$
To provide further clarification, in contrast to classical feature selection methods that select a subset of features by searching the original feature set, our proposed feature selection method conducts the selection process by searching through a basis of features rather than all features. As discussed in Subsection 3.3 of the manuscript, the selected feature matrix can be written as $\mathbf{B}=\mathbf{X}\mathbf{U}$.
On the other hand, considering that the matrix $\mathbf{B}$ serves as a basis representation for the data matrix $\mathbf{X}$, it follows from a fundamental theorem in linear algebra (namely, each vector in a vector space can be expressed as a linear combination of basis members) that $\mathbf{X}$ can be represented in its basis form, denoted as $\mathbf{X}=\mathbf{B}\mathbf{G}$. Consequently, it can be inferred that $\mathbf{X}=\mathbf{X}\mathbf{U}\mathbf{G}$.
In order to address the reviewer's question, it is worthwhile to note that
1- The representation of the data matrix in terms of the selected feature matrix, $\mathbf{X}=\mathbf{B}\mathbf{G}$, is based on the aforementioned theorem in linear algebra, stating that each vector in a vector space can be written as a linear combination of basis members (for instance, in our case, each feature $\mathbf{f}_l$ is written as $\mathbf{B}\mathbf{g}_l$).
2- The selected feature matrix $\mathbf{B}$ and the matrix $\mathbf{U}$ are explicitly connected through the relationship $\mathbf{B}=\mathbf{X}\mathbf{U}$. In essence, our proposed feature selection method conducts the selection process by exploring a basis of features rather than considering all features.
3- The input data matrix $\mathbf{X}$ and the matrix $\mathbf{U}$ can be implicitly linked through the relationship $\mathbf{X}=\mathbf{B}\mathbf{G}=\mathbf{X}\mathbf{U}\mathbf{G}$.
In both scenarios, the matrix $\mathbf{U}$ functions as the feature weight matrix, and the feature selection process is carried out based on $\mathbf{U}$.
Our responses are as follows:
1- What is the interested region?
For patients diagnosed with pneumonia, the chest X-rays reveal various diagnostic features, particularly in the cardiac section. For more information, please refer to ``Chest Radiology: Patterns and Differential Diagnoses, Elsevier Health Sciences, 2017." Notably among these characteristics are an augmented cardiac shadow, alterations in cardiac positioning, the presence of effusion, pericardial effusion, and instances of cardiomegaly. In this study, the application of our proposed feature selection method to the PneumoniaMNIST dataset is assessed with the goal of precisely identifying and analyzing the cardiac silhouette in chest X-ray images, which is located in the central part of the chest.
2- How many selected features are in the interested region?
For our proposed method when the number of selected features is 50 and 100 (as depicted in Figure 1 on Page 9), all the selected features are consolidated within our area of interest.
3- The validation is not quantified. How many radiologists are involved in the evaluation? What is the performance measured?
The validation process incorporated the expertise of two board-certified radiologists. Both professionals thoroughly reviewed the results of chest X-ray images, offering valuable insights to ensure the accuracy and reliability of our proposed method. Their assessments played a pivotal role in validating the model's performance within a clinical context. Additionally, given the study's exclusive focus on detecting the cardiac silhouette without making any diagnoses, both radiologists concurred that assessing specific performance criteria is unnecessary. For this reason, our analysis is focused solely on identifying and examining the cardiac silhouette in chest X-ray images.
4- How do other compared methods perform?
In agreement with the reviewer, we have included results related to the selection of 100 features using the comparative methods in Section J on Page 23 (refer to Figure 8 in the Appendix). Examining this figure, it is evident that some methods, including our proposed method, RMFFS, and VCSDFS, exhibit a tendency to identify the central area of chest X-ray images. Conversely, some other methods, while successfully identifying portions of the central areas, may overlook additional regions that could provide valuable information. These methods fail to capture such information in radiological analysis.
Finally, in response to the reviewer's questions, we have incorporated additional explanations regarding ``the region of interest" and clarified our purpose for conducting the test on the PneumoniaMNIST dataset in the revised manuscript. Furthermore, the results concerning the selection of 100 features by other comparative methods are presented in Figure 8 of the revised manuscript.
Dear Reviewer, We kindly request your prompt review of our responses as the deadline is in a few hours.
Thank you for your timely assistance
Considering that there is a major gap between the mathematical principles that underlie self-representation-based unsupervised feature selection approaches and their capacity to represent the feature space, this paper proposes Graph Regularized Self-Representation and Sparse Subspace Learning (GRSSLFS) for unsupervised feature selection, which expresses the self-representation problem based on the concept of "a basis of feature space" to represent the original feature space as a low-dimensional space made of linearly independent features. Experiments on widely used datasets are conducted to validate the efficacy of the proposed method.
Strengths
- The computational complexity of the proposed GRSSLFS method is low, making it efficient for large-scale and high-dimensional data;
- The results of the proposed method seem better than those of the other methods.
Weaknesses
- Most of the compared methods are out of date; only one method used for comparison was published in 2023, and the others are from before 2020;
- The motivation of the proposed method is not clear. In Eq. (8), the first three terms are well explained, but the final regularization term is not explained.
Questions
See weakness.
We thank the reviewer for raising this point. We agree that this was not noticed previously, and we have considered three state-of-the-art feature selection methods in the revised version of the manuscript in Table 1. Our proposed approach can outperform these recent SOTA methods as well. The considered SOTA methods are as follows:
1- CNAFS, Convex Non-Negative Matrix Factorization With Adaptive Graph for Unsupervised Feature Selection, IEEE Transactions on Cybernetics, vol. 52, Pages 5522-5534, 2022.
2- OCLSP, Unsupervised Feature Selection via Orthogonal Basis Clustering and Local Structure Preserving, IEEE Transactions on Neural Networks and Learning Systems, vol. 33, Pages 6881-6892, 2022.
3- VCSDFS, Unsupervised Feature Selection Based on Variance-Covariance Subspace Distance, Neural Networks, vol. 166, Pages 188-203, 2023.
The primary rationale for using these methods is their foundation in matrix factorization or subspace learning. These methods have been recently introduced and have shown promising performance in the field of feature selection.
The results for the three previously employed methods, namely MFFS, MPMR, and SCFS, which could be considered older methods, have been moved to the Appendix (Section I.5). Instead, the results of the recent methods CNAFS, OCLSP, and VCSDFS have been added to Table 1 on Page 8 of the revised manuscript. Along with these numerical results, all other tables and plots have been updated to reflect these changes.
Table 1: Comparing the best values of the clustering ACC of nine algorithms on all datasets. Here, Baseline refers to the case where all features are considered. The highest value in each row is in bold.
| Datasets | Baseline | LS | CNAFS | OCLSP | RNE | RMFFS | VCSDFS | CAE | CD-LSR | GRSSLFS (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| CNS | 53.33 | 58.66 | 63.84 | 65.31 | 58.33 | 63.33 | 66.49 | 66.61 | 61.67 | **73.33** |
| GLIOMA | 44.05 | 44.00 | 50.11 | 50.77 | 47.11 | 49.80 | 51.23 | **54.52** | 48.70 | 54.10 |
| TOX-171 | 41.25 | 40.35 | 48.06 | 48.67 | 51.69 | 49.64 | 52.69 | 53.36 | 48.77 | **57.39** |
| SRBCT | 25.66 | 46.02 | 44.17 | 44.27 | 42.12 | 48.97 | 44.39 | 46.08 | 52.63 | **53.49** |
| SMK | 51.34 | 55.11 | 60.12 | 62.31 | 61.51 | 52.94 | 62.91 | 61.80 | 64.71 | **65.72** |
| ATT | 63.84 | 46.75 | 58.68 | 59.71 | 59.08 | 46.25 | 61.97 | 56.86 | 58.03 | **64.11** |
| ORL | 50.25 | 38.51 | 49.17 | 50.45 | 51.31 | 47.71 | 47.51 | 50.09 | **54.05** | 53.45 |
| warpAR10P | 26.88 | 21.50 | 38.83 | 41.68 | 34.62 | 41.55 | 36.85 | 39.26 | 31.94 | **46.66** |
Table 2: Comparing the best values of the clustering NMI of nine algorithms on all datasets. Here, Baseline refers to the case where all features are considered. The highest value in each row is in bold.
| Datasets | Baseline | LS | CNAFS | OCLSP | RNE | RMFFS | VCSDFS | CAE | CD-LSR | GRSSLFS (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|
| CNS | 1.17 | 2.44 | 8.15 | 10.76 | 2.29 | 9.35 | 10.35 | 11.05 | 13.76 | **18.56** |
| GLIOMA | 17.94 | 18.67 | 27.21 | 28.93 | 25.84 | 29.81 | 28.92 | **32.83** | 25.35 | 32.09 |
| TOX-171 | 13.54 | 11.68 | 23.47 | 25.27 | 26.78 | 24.60 | 30.98 | 31.12 | 31.28 | **36.96** |
| SRBCT | 11.55 | 56.23 | 41.21 | 42.44 | 29.48 | 47.82 | 48.00 | 45.28 | 49.53 | **59.67** |
| SMK | 0.01 | 2.17 | 3.75 | 4.72 | 3.77 | 3.21 | 5.63 | 3.51 | 7.12 | **8.20** |
| ATT | 81.19 | 68.46 | 77.05 | 78.61 | 77.17 | 65.27 | 78.62 | 75.27 | 77.02 | **81.71** |
| ORL | 72.68 | 63.85 | 71.44 | 72.19 | 72.59 | 69.72 | 73.02 | 73.64 | 72.57 | **74.56** |
| warpAR10P | 28.48 | 19.39 | 43.00 | 45.46 | 35.85 | 46.60 | 41.62 | 42.35 | 34.22 | **48.63** |
We appreciate the reviewer's thorough attention to these issues. To respond to the reviewer's comment regarding the motivation, we would like to mention that the main goal of the feature selection process is to opt for a subset of features with the purpose of diminishing noise and improving the efficiency and effectiveness of a learning task, such as classification and clustering. In essence, this process of selection aims to identify features that are important for the learning task while ensuring that the chosen set of features is non-redundant, thereby strengthening robustness. Within this framework, a crucial factor in feature selection is redundancy, which involves examining the similarity among features and assessing how effective it is to introduce a new feature into a designated feature set for data analysis.
In the field of linear algebra, a direct connection exists between redundancy and the concept of linear independence. Specifically, a set of vectors in a vector space is deemed linearly dependent if scalars, not all zero, can be identified such that the corresponding linear combination of the vectors equals zero. This means that one of these vectors can be expressed as a combination of the others. In contrast, a collection of vectors is deemed linearly independent if there is no linear dependence among them. Consequently, we can deduce from this principle that a set of vectors exhibiting linear dependence contains redundancy, enabling the removal of one vector from the initial set without altering the span of the remaining vectors. A significant benefit of eliminating linearly dependent features from the initial feature set is that it aids in fulfilling a crucial aim of feature selection, namely the reduction of redundancy. In the pursuit of this goal, the concept of a basis, a fundamental element in the theory of linear vector spaces, becomes crucial.
Each basis for the feature space exhibits two essential traits. Firstly, it covers the entire feature space, and secondly, all features within a basis are linearly independent. These attributes positively and constructively influence the discriminative feature selection process. The initial property guarantees that a basis can represent the complete feature space through its elements, implying that the fundamental characteristics of the original features are present in the basis with a significantly reduced number of elements compared to the entire feature space. The second characteristic of a basis, which involves its elements being linearly independent, leads to a significant reduction in data redundancy. Exploiting the advantages of a basis in a vector space, the primary objective of this paper is to further investigate the impact of a basis for the feature space on tasks related to feature learning.
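This span argument is easy to illustrate numerically (our own sketch): removing a linearly dependent feature leaves the rank, and hence the span, unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
f1, f2 = rng.random(5), rng.random(5)
f3 = 2.0 * f1 - 0.5 * f2                 # linearly dependent on f1 and f2 (redundant)

F = np.column_stack([f1, f2, f3])
F_reduced = np.column_stack([f1, f2])    # drop the redundant feature

# The rank (and hence the span) is unchanged after removing f3.
print(np.linalg.matrix_rank(F), np.linalg.matrix_rank(F_reduced))   # 2 2
```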
In order to reply to and address the other comment of the reviewer, we follow the matrix structure of the basis matrix and its coefficient matrix as stated in the manuscript. According to the subspace learning problem mentioned in Problem (7) of the manuscript, each feature is represented as a linear combination of the members of the basis matrix $\mathbf{B}$, with the columns of the coefficient matrix supplying the combination weights.
Hence, the sparser the columns of the coefficient matrix are, the fewer members of the basis matrix $\mathbf{B}$ are used in the representation of the features, and a reduction in redundancy is therefore more likely to occur. In order to reflect the importance of this representation matrix in our proposed feature selection process, the idea of inner product regularization is used. The Gram matrix of the representation matrix collects the pairwise inner products of its columns, and the final regularization term of (8) readily corresponds to the summation of the off-diagonal elements of this Gram matrix, that is, the sum of the inner products over all pairs of distinct columns.
Under the non-negativity assumption, the resulting optimization problem attains its optimal solution when these inner product terms tend toward zero for all pairs of distinct columns. Since the columns are non-negative, the inner product of two columns tends towards zero only when at least one of them becomes a zero or extremely sparse vector. Consequently, the optimization problem defined based on the inner product regularization can result in a significant degree of sparsity within the coefficient matrix.
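For illustration, the off-diagonal Gram-matrix penalty described above can be computed as follows for a generic non-negative coefficient matrix (the symbol C and the sizes are ours); note that it vanishes exactly when the columns have disjoint supports.

```python
import numpy as np

def offdiag_gram_penalty(C):
    """Sum of the off-diagonal entries of the Gram matrix C^T C, i.e. the sum
    of inner products <c_i, c_j> over all pairs of distinct columns of C."""
    gram = C.T @ C
    return gram.sum() - np.trace(gram)

rng = np.random.default_rng(4)
dense = np.abs(rng.random((6, 4)))            # non-negative, dense columns
sparse = np.zeros((6, 4))
sparse[[0, 2, 4, 5], [0, 1, 2, 3]] = 1.0      # non-negative, disjoint supports

print(offdiag_gram_penalty(dense))            # strictly positive
print(offdiag_gram_penalty(sparse))           # 0.0: distinct columns are orthogonal
```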
In order to address the reviewer's comment, we have included new explanations following the above-mentioned details. For more information, we kindly ask the reviewer to refer to Subsection 3.3 on Page 5 and Section A on Page 13 of the revised manuscript.
Dear Reviewer, We kindly request your prompt review of our responses as the deadline is in a few hours.
Thank you for your timely assistance
Dear Reviewers and Area Chairs,
We would like to express our gratitude to all reviewers for their valuable comments and feedback.
We would be grateful if reviewers could check our responses and revisions in order to have possible interactions before 22 November. This way, we will have enough time to respond to any further comments.
We have conducted additional experiments, provided detailed explanations for every weakness and question, and updated the paper accordingly based on the reviewers' feedback.
The paper presents an unsupervised feature selection method called Graph Regularized Self-Representation and Sparse Subspace Learning (GRSSLFS) and experimentally evaluates the method in the context of clustering. The reviewers raise various concerns, and the majority of the reviewers suggest rejection after the authors' rebuttal. I have also read the paper. I find the writing confusing and the technical novelty quite limited. In addition, the deductions from Eq. (1) to Eq. (4) seem incorrect. Overall, I think the current paper is not ready for publication and would recommend rejection.
Why not a higher score
The reviewers raise various concerns, and the majority of the reviewers suggest rejection after the authors' rebuttal. I have also read the paper. I find the writing confusing and the technical novelty quite limited. In addition, the deductions from Eq. (1) to Eq. (4) seem incorrect.
Why not a lower score
N/A
Reject