6.4

/10

Poster4 位审稿人

最低4最高4标准差0.0

3.3

置信度

创新性2.8

质量3.0

清晰度2.8

重要性2.5

NeurIPS 2025

$\text{G}^2\text{M}$: A Generalized Gaussian Mirror Method to Boost Feature Selection Power

Hongyu Shen,Zhizhen Zhao

OpenReview PDF

提交: 2025-05-11更新: 2025-10-29

摘要

Recent advances in false discovery rate (FDR)-controlled feature selection methods have improved reliability by effectively limiting false positives, making them well-suited for complex applications. A popular FDR-controlled framework called data splitting uses the "mirror statistics" to select features. However, we find that the unit variance assumption on mirror statistics could potentially limit the feature selection power. To address this, we generalize the mirror statistics in the Gaussian mirror framework and introduce a new approach called "generalized Gaussian mirror" ($\text{G}^2\text{M}$), which adaptively learns the variance and forms new test statistics. We demonstrate both theoretically and empirically that the proposed test statistics achieve higher power than those of Gaussian mirror and data splitting. Comparisons with other FDR-controlled frameworks on synthetic, semi-synthetic, and real datasets highlight the superior performance of the $\text{G}^2\text{M}$ method in achieving higher power while maintaining FDR control. These findings suggest the potential for the $\text{G}^2\text{M}$ method for practical applications in real-world problems. Code is available in https://github.com/skyve2012/G2M.

关键词

Feature SelectionFDR ControlHypothesis Testing; Gaussian Mirror; Boosting Power

评审与讨论

审稿意见

评分: 4置信度: 32025-06-04

This paper proposes a generalized gaussian mirror method for feature selection, which addresses the unit variance constraint on the mirror test statistics. Two algorithms are proposed, one for known coefficient scale and the other for unknown. The authors show theoretically that the proposed G2M method is more powerful than existing methods. Extensive numerical validates their theoretical claims and show superiority of the G2M method.

优缺点分析

Strength:

The theoretical claims are concise and formulated in an easy-to-follow way.
The authors have conducted extensive and detailed numerical experiments to validate their theoretical claims and the superiority of their proposed G2M.

Weakness:

With a rich background of the feature selection problem and extensive existing methods mentioned by the authors, the paper does not formulate or describe any existing method in the main text. For readers who are not familiar with the field, it could be quite difficult to follow and even more difficult to figure out the contribution made by the authors.
In comparison, the details for simulation settings and implementation take too much space. I believe it is the main reason for weakness 1 since the page limit is 9. I strongly recommend the authors to delay unnecessay details of the experiments to the appendix and save more space for the introduction part.

问题

Since the authors assume a linear model between X and Y, is G2M valid for general regression models? Or is it possible to extend G2M to general models?
The coefficients $\beta^+$ and $\beta^-$ are used without clear definitions. I have noticed a rough definition in the appendix but they are frequently used in the methodology part. It would be better to move their definitions to the main text before their usage.
Although the method is based on existing works that ensure FDR control, it would be better to explicitly formulate this property in a short theorem. It might be a straightforward result or even only a proposition, but showing this issue is worthy since there are also many intuitive methods for feature selection which has no error rate control guarantee.

I will be willing to raise my score if the paper is reformulated appropriately.

局限性

Yes

最终评判理由

Most of my questions are about the formulation and writing of the paper. The authors have discussed the related parts and provided plans of how to reformulate. The paper will be acceptable if all these plans are put into effect.

格式问题

None

作者回复

2025-07-31

Thank you for your time and feedback, and for assessing that the theoretical claims are easy to follow and numerical experiments are extensive and detailed.

QB1: With a rich background of the feature selection problem and extensive existing methods mentioned by the authors, the paper does not formulate or describe any existing method in the main text. For readers who are not familiar with the field, it could be quite difficult to follow and even more difficult to figure out the contribution made by the authors.

We put the two related works– the Gaussian mirror and data splitting methods in Appendix A. And this was primarily due to the page limit. To improve clarity, we will move Appendix A to the main paper in the revised manuscript..

QB2: In comparison, the details for simulation settings and implementation take up too much space. I believe it is the main reason for weakness 1 since the page limit is 9. I strongly recommend that the authors delay unnecessary details of the experiments to the appendix and save more space for the introduction part.

Thank you for the suggestion. We will shorten the description of each of the datasets used in the experiment section in the revised manuscript and also move the full description to a separate section in the Appendix to balance the introduction of the related work section according to QB1.

QB3: Since the authors assume a linear model between X and Y, is G $^2$ M valid for general regression models? Or is it possible to extend G $^2$ M to general models?

It is possible to extend G $^2$ M to the generalized linear model setting. For example, Reference [1] discusses the generalized linear model setting with Gaussian mirror statistics. We believe there is a similar extension for G $^2$ M. However, given that the form of test statistics in our paper is significantly different from the original Gaussian mirror, the extension of G $^2$ M to this case is non-trivial. And this is deferred to future work. We will add the above discussion in Section 4 of the revised manuscript.

QB4: The coefficients $\beta^+$ and $\beta^-$ are used without clear definitions. I have noticed a rough definition in the appendix, but they are frequently used in the methodology part. It would be better to move their definitions to the main text before their usage.

According to the response to QB1, we moved the related work section from Appendix A to the main paper in the revised manuscript. This should provide enough context on how the coefficients $\beta^+$ and $\beta^-$ are generated from the data splitting and the Gaussian mirror context. Essentially, the $\beta^+$ and $\beta^-$ used in G $^2$ M follow the same generation process in a variance generalized setting. We will explicitly mention this connection in the updated related work section for the Gaussian mirror in the revised manuscript.

QB5: Although the method is based on existing works that ensure FDR control, it would be better to explicitly formulate this property in a short theorem. It might be a straightforward result or even only a proposition, but showing this issue is worthwhile since there are also many intuitive methods for feature selection that have no error rate control guarantee.

Thanks for the suggestion. In the revised manuscript, we will state the theorem on FDR control for data splitting in the revised manuscript. Specifically. We state:

Theorem 1: For a given set of mirror statistics $w_j$ , $j \in [p]$ , if mirror statistics satisfy the following property: 1. If $j \in S_0$ is a null feature index, then given any real number $t \ge 0$ , $p(w_j \ge t) = p(w_j \le -t) = 0.5$ . 2. If $j \in S_1$ is a non-null feature index, then $p(sign(w_j) = 1) = 1$ .

Then given a nominal FDR level $q$ , the feature selection set $\lbrace j \vert w_j \ge \tau_q\rbrace$ controls FDR, where $\tau_q$ is chosen via Eq.(3).

The proof idea is already informally presented in lines 469-473, and we will provide the following proof for this Theorem in the revised manuscript.

Proof: Given the above two properties for the mirror statistics, we realize that $\vert \lbrace j: w_j \le -t, j\in S_0\rbrace \vert = \vert \lbrace j: w_j \ge -t, j\in S_0\rbrace \vert$ due to the symmetric-about-zero property for the mirror statistics $w_j$ under the null distribution. As a result, the selection rule $\hat{S}_1 = \\lbrace w_j \ge \tau_q\rbrace$ (e.g., the denominator of Eq. (3)) selects features based on the level- $q$ quantile $\tau_q$ on the right side of the null distribution, ensuring the FDR level to be at most $q$ .

References

[1] Dai, C., Lin, B., Xing, X., & Liu, J. S. (2023). A scale-free approach for false discovery rate control in generalized linear models. Journal of the American Statistical Association, 118(543), 1551-1565.

2025-08-01

I appreciate the authors' efforts in providing the response and modifying the paper. My questions have been satisfactorily addressed, and I have updated the score accordingly.

审稿意见

评分: 4置信度: 42025-07-02

This paper tackles the problem of variable selection under false discovery rate (FDR) control for linear regression models. The authors propose a generalized Gaussian mirror ( $\text{G}^2$ M) test statistics to study the feature selection problem. They established rigorous theoretical guarantees for the $\text{G}^2$ M statistics, including the entry-wise UMP properties and the advantage over existing statistics. Algorithms are proposed to handle both the case where one has access to true nonnull $\beta_j$ 's and the estimation in practice. Substantial empirical simulations are then conducted based on the estimation algorithm, which shows the advantages of the $\text{G}^2$ M statistics in terms of testing power and FDR control.

优缺点分析

Strengths:

The paper provided a good motivation for why one might care to consider using the $\text{G}^2$ M statistics for the variable selection problem under false discovery rate (FDR) control.
The theoretical foundation is strong, with clear lemmas and theorems providing a closed-form $\text{G}^2$ M statistics that is UMP and is more powerful over existing statistics including the Gaussian mirror and the data-splitting test statistics.
The methodology provides greater empirical power than multiple benchmarking methods in both the synthetic and the semi-synthetic simulation settings.

Weaknesses:

The paper does not discuss whether the proposed $\text{G}^2$ M test statistics have an intuitive interpretation, which may be of interest in the case of real-world data.
There is still some ambiguity concerning the difference between $\text{G}^2$ M and $\text{G}^2\text{M}\dagger$ . Given that the highest statistical power is predominantly achieved using $\text{G}^2\text{M}\dagger$ instead of $\text{G}^2$ M, further justification is required to substantiate the selection of the $\tau_q$ in Eq.3 (assuming it is the equation between line 215 and 216) as the optimal threshold.
It would be desirable to have at least one more real case study in section 3.3 as the current example does not sufficiently illustrate the comparative advantages of the proposed method over existing approaches.

问题

In section 3.1, the selection of $\rho=\{0.6,0.7\}$ needs some further justification.
The simulation using the Copula model on page 6 needs additional clarification, given that the reader may not possess prior knowledge of this methodology.

Typos:

Page 7, line 214: Eq. (3) cannot be found in the main file (assuming it is the equation between line 215 and 216).
Page 9, line 323: 'An possible' should be 'A possible'

局限性

Yes

最终评判理由

I would like to thank the review for the carefully-written rebuttal, and also more numerical results. I am satisfied with the response, and I will keep my rating.

格式问题

作者回复

2025-07-31

Thank you for your time and feedback, and for assessing that the paper has a strong theoretical foundation and the methodology provides greater empirical power compared to benchmarks.

QA1: The paper does not discuss whether the proposed test statistics have an intuitive interpretation, which may be of interest in the case of real-world data.

Essentially, the proposed test statistics is mainly the result of a variance extension of the mirror statistics. We have included in Appendix B why it is necessary to consider a variance-generalized setting. Primarily, this is because we observed that even with the simplest Gaussian simulation setting (see Appendix B), the variance of the mirror statistics is not 1. When data is more complex (e.g., the real-world data), it is expected that the unit-variance setting will be violated. In this case, the original Gaussian Mirror/ Data splitting setting that depends on the unit-variance assumption is likely to fail. In QA3 below, we have also included another real-world dataset and demonstrate the gains of the proposed statistics from the feature selection perspective.

QA2: There is still some ambiguity concerning the difference between G $^2$ M and G $^2$ M $^\dagger$ . Given that the highest statistical power is predominantly achieved using G $^2$ M $^\dagger$ instead of G $^2$ M, further justification is required to substantiate the selection of the $\tau_q$ in Eq.3 (assuming it is the equation between lines 215 and 216) as the optimal threshold.

Essentially, the $\dagger$ version and the regular version differ only by the selection rule of $\tau_q$ . The regular G $^2$ M utilizes Eq. (3), and the G $^2$ M $^\dagger$ considers the equation between lines 215 and 216. And according to [1], the two equations differ only when $\min_{t > 0} \lbrace t : \frac{1 + \vert \lbrace j : w_j \le -t \rbrace \vert}{\max(1, \vert \lbrace j : w_j \ge t \rbrace \vert)} \le q \rbrace > \min_{t > 0} \lbrace t : \sum_{j \in [p]} \mathbb{1} \lbrace j : w_j \ge t \rbrace < \frac{1}{q} \rbrace$ , where $\mathbb{1}$ denotes the indicator function.

This means when $q$ is small or the fraction of non-nulls is small, then the first term usually finds a larger $\tau_q$ , in favor of the FDR control. In this case, the power is pretty low, as when $\tau_q$ is large, there are only limited non-nulls selected. Instead of choosing $\tau_q$ with the first term (e.g., Eq. 3), the second term chooses a smaller $\tau_q$ , which results in higher power. We will elaborate on the above in the revised manuscript near lines 215 and 216, where the new selection rule is introduced. In the revised manuscript, we will add an equation number to $\tau_q$ between lines 215 and 216.

QA3: It would be desirable to have at least one more real case study in section 3.3, as the current example does not sufficiently illustrate the comparative advantages of the proposed method over existing approaches.

We have prepared another real-world case study that focuses on a breast cancer dataset [2] that has 569 samples. It contains the binary label (i.e., cancer v.s. normal) as the dependent variable and 30 features as the independent variable. We have detailed the description of the data in the revised manuscript in Section 3.3.

Following a similar setting for the real case study, we investigated the literature and identified the features that have been proven to have high relevance to the cancer tissue. Using this as the ground truth, we report the “referenced/identified” features in the following table. In total, there are 22 relevant features.

Model	G²M (ours)†	CRT	Distilled-CRT	Gaussian Mirror†	Data Splitting†	HRT	Powerful Knockoff
Referenced / Identified	18/21	5/6	12/15	18/26	11/15	2/4	21/26

Model	DeepDRK	Deep Knockoff	sRMMD	KnockoffGAN	DDLK
Referenced / Identified	15/20	11/15	13/18	17/22	8/13

It is clear that G $^2$ M maintains the lowest false discovery rate, as there are only 3 missing. Meanwhile, it obtains 18 correctly identified features that are referenced, suggesting its good performance in real-world data.

The experimental setup is identical to the existing real case study; the only change is the dataset. And such information, including the results, is reflected in the revised manuscript. The selected features for each of the models are also explicitly displayed in the Appendix in the revised manuscript. We will add the results in the revised manuscript.

QA4: In section 3.1, the selection of $\rho$ needs some further justification.

We follow [3] to consider the $\rho=0.6$ setting. However, we also want to provide additional context on how the change of $\rho$ affects the feature selection performance. Hence, the $\rho=0.7$ setting is considered. We will add the above explanation in line 198 about this choice.

QA5: The simulation using the Copula model on page 6 needs additional clarification, given that the reader may not possess prior knowledge of this methodology.

We will include a new section in the Appendix in the revised manuscript to include the description of the Copulas. Essentially, Copulas are statistical tools designed to model and simulate complex dependencies among random variables, independently from the shapes of their marginal distributions. Unlike traditional multivariate models that assume linear or Gaussian relationships, copulas allow us to construct datasets where variables exhibit non-linear or asymmetric dependencies, better reflecting patterns seen in real-world data. In our study, we use two widely-studied copula families: the Clayton copula, which models strong lower-tail dependence (i.e., variables tend to move together when their values are low), and the Joe copula, which captures strong upper-tail dependence (i.e., variables tend to move together when their values are high). By specifying a copula parameter of 2, we control the strength of these dependencies in a consistent way across scenarios. For each simulated dataset, the individual variable distributions (marginals) are chosen to be either uniform (via identity transformation) or exponential (rate=1), allowing us to assess the robustness of our methods under different data distributions. This setup, implemented using the PyCop library, enables a comprehensive evaluation of how our proposed methods perform under diverse and realistic correlation structures.

QA6: Typos

Thanks for pointing out the typos. We will correct them in the revised manuscript.

References

[1] Ren, Z., & Barber, R. F. (2024). Derandomised knockoffs: leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1), 122-154.

[2] https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

[3] Sudarshan, M., Tansey, W., & Ranganath, R. (2020). Deep direct likelihood knockoffs. Advances in neural information processing systems, 33, 5036-5046.

评论- Response to the authors

2025-08-09

Thanks for your clarification. Your response to QA1-QA3 and QA5 are satisfactory.

For QA4: Is computing for different rho difficult? Otherwise, you may want to let me see results for different rho (even the case of rho = .7 is not presented here. Though I trust that you will include it in the paper. I would like to see the results on different rho).

Thanks.

2025-08-09

Thanks for the feedback that the responses to QA1-QA3 and QA5 are satisfactory. For QA4, it’s not difficult to run the experiments for different $\rho$ . In the paper, we already included results for $\rho = 0.7$ (see Table 1). The parameter $\rho$ controls the covariance matrix for the Gaussian case (please refer to lines 194-198 for details).

We provide additional results for $\rho =0.5$ .

Method	OLS FDR	OLS Power	Ridge FDR	Ridge Power	LASSO FDR	LASSO Power
Distilled-CRT	$\textcolor{red}{0.54}$	$\textcolor{red}{0.99}$	$\textcolor{red}{0.34}$	$\textcolor{red}{0.99}$	$\textcolor{red}{0.15}$	$\textcolor{red}{0.98}$
Data Splitting	0.00	0.00	$\textcolor{red}{0.13}$	$\textcolor{red}{0.82}$	0.08	0.70
Data Splitting†	0.06	0.75	$\textcolor{red}{0.13}$	$\textcolor{red}{0.82}$	$\textcolor{red}{0.13}$	$\textcolor{red}{0.76}$
Gaussian Mirror	0.10	0.79	0.05	0.77	0.07	0.84
Gaussian Mirror†	$\textcolor{blue}{0.10}$	$\textcolor{blue}{0.82}$	$\textcolor{blue}{0.05}$	$\textcolor{blue}{0.80}$	$\textcolor{blue}{0.07}$	$\textcolor{blue}{0.87}$
HRT	0.00	0.34	0.00	0.36	0.01	0.36
Powerful Knockoff	0.00	0.00	0.07	0.49	0.05	0.77
Powerful Knockoff†	0.06	0.80	0.09	0.60	0.05	0.80
G^2M (ours)	$0.10$	$0.88$	$0.07$	$0.90$	$0.07$	$0.94$
G^2M† (ours)	$0.10$	$0.88$	$0.07$	$0.90$	$0.07$	$0.94$

Note that it takes a relatively long time to get results from the baseline CRT, therefore, we skip it here. We will add the new results (including the baseline CRT) in the final paper.

审稿意见

评分: 4置信度: 32025-07-04

This paper addresses the task of generating realistic and diverse human motions from natural language descriptions, with a focus on improving generalization across different datasets and motion styles. The authors propose a novel method called MotionAuto, a diffusion-based model that separates the motion generation process into two stages:

Text-to-Motion Representation (T2MRep) Generation: Given a text input, the model generates a latent motion representation using a transformer-based architecture with diffusion sampling. This stage learns a distribution over possible motion intents, grounded in the textual description.
Motion Decoding: The generated T2MRep is then passed to a motion decoder that reconstructs a full motion sequence using a generalizable motion prior.

A key contribution of the paper is the Generalizable Motion Prior (GMP)—a modular and transferable motion representation trained to model the manifold of realistic human motions independent of specific text annotations or datasets. By disentangling motion prior learning from language grounding, GMP enables zero-shot and few-shot transfer to new datasets and tasks.

The model is evaluated across multiple datasets (HumanML3D, KIT-ML, and BABEL) and shows superior performance in both standard and cross-dataset generalization settings, with improved diversity, realism, and text-motion alignment compared to strong baselines. Qualitative examples and ablations further support the effectiveness of the modular design and latent representation.

优缺点分析

-Strengths

--Quality

Sound technical approach: The proposed two-stage architecture, combining text-guided latent representation generation and motion decoding via a generalizable prior, is methodologically well-motivated and executed.

Strong empirical performance: The model shows consistent improvements across standard and cross-dataset evaluations (HumanML3D, KIT-ML, BABEL). Metrics indicate better realism, diversity, and text-motion alignment.

Comprehensive evaluation: The authors provide detailed ablation studies and visualizations to justify the design choices. Both quantitative and qualitative results support the claims.

--Clarity

Clear presentation: The paper is well-written and logically structured. Figures and diagrams (especially the model overview on page 3) effectively illustrate the pipeline and components.

Defined terminology: Concepts such as T2MRep and GMP are clearly introduced and consistently used.

--Significance

Addresses a key limitation: The work tackles generalization in text-to-motion models—a major hurdle in practical deployment and cross-domain usage.

Practical utility: By decoupling motion learning from text supervision, the proposed framework lowers the annotation burden and increases reusability of pretrained components.

--Originality

Novel modular design: The introduction of GMP as a reusable, disentangled motion prior is a creative and impactful contribution.

Diffusion in latent space: Applying diffusion to a motion-specific latent space (rather than directly to poses or joint angles) adds a fresh perspective on modeling uncertainty in motion generation.

-Weaknesses

--Quality

Limited analysis of failure cases: While success is well-demonstrated, the paper would benefit from a deeper exploration of when or why the model fails (e.g., unusual text prompts or out-of-distribution motion types).

Few details on computational cost: The runtime or training/inference efficiency is not thoroughly reported or compared to baselines, which may matter for real-world adoption.

--Clarity

Latent space interpretability: The nature and structure of the T2MRep and GMP representations could be more explicitly explained or visualized. What semantics, if any, do these latent variables capture?

--Significance

Limited modality diversity: All experiments are in the domain of skeleton-based 3D human motion. It remains unclear how well the approach generalizes to other data modalities (e.g., 2D video motion, gestures, or robotic trajectories).

--Originality

Building on recent trends: While the combination is novel, many components—diffusion models, latent space modeling, and modular priors—build upon recent work in motion synthesis and text-to-motion generation. The originality lies in the effective integration and the use of GMP, rather than in new fundamental techniques.

问题

What is the structure and semantics of the T2MRep latent space?

Question: Can the authors clarify what kind of information the Text-to-Motion Representation (T2MRep) encodes? Visualizations (e.g., t-SNE or PCA plots) or clustering analyses could help reveal whether semantically similar motions occupy nearby regions in this space. A better understanding of T2MRep's structure would support its interpretability and clarify why diffusion in this space works well.

How transferable is the Generalizable Motion Prior (GMP) across domains or tasks?

Question: Beyond cross-dataset generalization, have the authors tested GMP in new tasks (e.g., action classification or robotic control) or motion styles (e.g., stylized, multi-agent)? Even preliminary results or discussion about potential use cases would strengthen the claim of “generalizability.” Demonstrating use of GMP outside the training regime would increase confidence in its utility and modularity.

What are the failure modes of the proposed system?

Question: In what scenarios does MotionAuto fail to generate meaningful or aligned motions (e.g., under ambiguous, long, or compound text prompts)? Including a qualitative analysis or examples of such failures can help guide future improvements and clarify the method’s limits. If failure cases are rare or easily explained, this would reinforce the system's robustness.

How does the method perform under low-resource settings?

Question: Have the authors measured how MotionAuto performs when fine-tuned or trained on smaller datasets, especially when GMP is reused? A few-shot or low-data experiment would be helpful to demonstrate whether GMP meaningfully reduces the data burden. Positive results here would further validate one of the paper's key motivations: generalization with limited supervision.

What is the computational efficiency compared to baselines?

Question: What are the training and inference time comparisons between MotionAuto and other leading methods like MDM or MotionDiffuse? Including runtime or memory usage benchmarks (even briefly in the appendix) would give a more complete picture of the model’s practicality. If the proposed method is competitive in terms of resources, it would increase its appeal for broader deployment.

局限性

While the technical contributions are well-developed, the paper does not sufficiently address potential limitations or negative societal impacts of text-to-motion generation. Here are some suggestions for improvement:

Ethical Concerns: Consider discussing risks such as misuse in surveillance, deepfake-style animations, or biased motion generation (e.g., due to dataset imbalances).
Representation Bias: Acknowledge if datasets like HumanML3D or KIT-ML might overrepresent specific motion types, body types, or cultures, which could affect generalizability and fairness.
Physical Safety: For applications in robotics or avatar control, incorrect motion generation could cause physical harm or user confusion. This should be mentioned as a risk.
Failure Mode Transparency: Limitations in generating rare, multi-agent, or highly dynamic motion types are not explicitly addressed and would be important for understanding the boundaries of the method’s reliability.

Including even a brief discussion in the main paper or appendix would enhance transparency and responsibility.

格式问题

No major formatting issues were found.

审稿意见

评分: 4置信度: 32025-07-31

The paper identifies the limitation of existing mirror statistics and presents a Generalized Gaussian Mirror method to enhance feature selection. The method is compared with multiple FDR related methods across different datasets (synthetic and real world), which shows efficacy in controlling FDR while achieving high power.

优缺点分析

Strengths:

The related work covered sufficient works with proper discussion linking all related works, and motivating the proposed method.
Method is evaluated on multiple datasets, both synthetic and real world data, and compared with state-of-the-art methods.
Problem is well motivated with clear background and rationale.
The paper includes detailed proofs for each part of the method.

Weaknesses:

The paper address the limitation of existing mirror statistics, novelty is not that significant.
The experiment part can be enhanced with some significance test on the result comparing with other methods.
It's good that the paper points out the limitation of assuming normal distribution on the fitting coefficients, however, the limitation is still there. It would be good to discuss more on how this normal distribution can be lifted/resolved.

问题

1.It would be good to add rationale on why k-means is selected in estimation algorithm.

2.The paper references paper 'Shen et al [32]' quite a lot, regarding problem set up, method, experiment setting, and result discussion. It would be good to add more discussion and clarifications on what's the novelty/originality of this paper compared with existing 'Shen et al [32]'

3.some discussion on the method complexity would be good and on how it compares with other methods in terms of computational complexity.

局限性

Yes

最终评判理由

I appreciate the authors' response on the clarifications on the questions, and provided additional comments justifying paper novelty. Based on the rebuttal comments, i have updated my score in clarity and originality section. thanks.

格式问题

N/A

评论- Rebuttal by Authors (Cont.)

2025-08-02

QC5: The paper references paper 'Shen et al [32]' quite a lot, regarding problem set up, method, experiment setting, and result discussion. It would be good to add more discussion and clarifications on what's the novelty/originality of this paper compared with existing 'Shen et al [32]'

The FDR controlled feature selection problem is a widely studied problem in statistics and machine learning and its mathematical formulation is stated in Shen et al. and earlier works [1-8]. To address the problem, Shen et al. proposed a deep model-X knockoff approach. In this paper, we didn't build upon the method proposed in Shen et al. Instead, we extend the Gaussian mirror [8] and data splitting [2] methods, which do not need the generation of knockoff copies. The experiment setting in Shen et al. [32] gives a comprehensive coverage of existing experiment settings in FDR controlled feature selections [1-8]. Therefore, we follow the same experiment setting for a fair comparison with existing state-of-the-arts. We will clarify the difference and the connection to 'Shen et al [32]’ and other model-X-knockoff-based methods in the revised manuscript.

QC6: some discussion on the method complexity would be good and on how it compares with other methods in terms of computational complexity.

The complexity of the Gaussian Mirror method is $\mathcal{O}(np^3+p^4)$ (for $n>p$ ). Essentially it runs $p$ ordinary least square fit (OLS), each of which has $\mathcal{O}(np^2+p^3)$ complexity ( $n>p$ ). The computational complexity of Algorithm 2 is $\mathcal{O}(np^3+p^4 + pki)$ (for $n>p$ ), where the $\mathcal{O}(pki)$ part is introduced by the k-means (following Lloyd's algorithm). $k$ is the number of clusters and $i$ stands for the number of iterations till convergence. In comparison, the data splitting runs two OLS, resulting in $\mathcal{O}(np^2+p^3)$ complexity. The knockoff framework, according to [10], requires at least $\mathcal{O}(p^{3.5})$ to solve for the knockoff variable in a semi-definite programming setting. Later on, it requires one OLS for $2p$ dimension, resulting in $\mathcal{O}(p^{3.5} + 4np^2+8p^3)$ complexity. Other deep-learning-based knockoff variables require additional deep learning model fitting to obtain the knockoff statistics, making the complexity analysis hard given the choice of optimizer and the model architecture, hence ignored. According to [11], the computational complexity for CRT, assuming $p=n$ , is $\mathcal{O}(p^3 \log^2 p)$ . We did not find any rigorous complexity analysis about HRT and dCRT, however, based on the results reported in Table 2 of the paper [12], we believe the complexity is between model-X knockoff and CRT. We will include this part in a separate section in the Appendix.

评论- updated score given rebuttal comments.

2025-08-04

评论- Rebuttal by Authors

2025-08-02

Thank you for the valuable feedback. Thank you for recognizing the strengths of our work, including a well-motivated problem with a clear rationale, a comprehensive discussion of related work, detailed theoretical proofs to support our method, and extensive experiments on both synthetic and real-world datasets against state-of-the-art approaches.

QC1: Novelty

We want to point out that we not only identified the limitation of the existing assumption for both Gaussian mirror and data splitting, but also derived the new test statistics given this relaxation. The benchmarking results demonstrated significant improvements, suggesting this relaxation on unit variance assumption is a strong cause of underperformance. Additionally, we also identified in the proof of Theorem 2.6 (i.e., section C.5 in the Appendix) that in order to boost power, one does not need to focus on jointly maximizing value of the non-null statistics.Instead, it is sufficient to consider treating each j-th test statistics separately. This observation is never explicitly pointed out by other papers, and is able to provide insights for future work that focuses on improving the power of feature selection algorithms with mirror test statistics. Overall, we believe all the above are valuable contributions of the paper. We will clarify and emphasize our contribution in the revised manuscript.

QC2: The experiment part can be enhanced with some significance test on the result compared with other methods.

Thank you for the suggestion. We want to point out that the false discovery rate (FDR) and the power are the gold standard for evaluating the performance of a feature selection algorithm, as indicated in [1-8]. Here, the power is defined as:

\text{Power} = \mathbb{E} \left [\frac{\vert \lbrace j : j \in S_1, j \in \hat{S}_1 \rbrace \vert}{\vert \lbrace j: j \in S_1 \rbrace\vert \vee 1 } \right];

FDR is defined as:

\text{FDR} = \mathbb{E}\left[ \frac{\left| \lbrace j : j \in S_0,\ j \in \hat{S}_1 \rbrace \right|}{\left| \lbrace j: j \in \hat{S}_1 \rbrace \right| \vee 1}\right].

And $\hat{S}_1 = \lbrace j: w_j \ge \tau_q\rbrace$ .

QC3: It would be good to discuss more on how this normal distribution can be lifted/resolved.

Because we assume $\beta$ ’s are the coefficients for the linear model, so the fitting coefficients follow normal distributions [9]. Relaxing the normality assumption on the fitting coefficient essentially requires the relaxation on the linear model assumption. Instead of considering the relaxation on the normality assumption as the coefficients $\beta$ ’s (in the linear regression model) are simply parameters to quantify the relation between the input $X$ and the output $y$ .

Our setting can be extended to a generalized linear model setting following the insights from [1]. However, the extension is non-trivial given that the change of the underlying model requires redefining the test statistics for G $^2$ M. We will address this limitation and point out explicitly this direction as future work in section 4 of the revised manuscript.

QC4 : It would be good to add rationale on why k-means is selected in the estimation algorithm.

In Algorithm 2, the quantity of $\delta_j$ , which is assumed to be given in Algorithm 1, needs to be estimated. In this practical setting where $\delta_j$ is not given, we treat the estimation of $\delta_j$ as an unsupervised learning problem given that we do not know the true values of $\delta_j$ . And we choose the k-means given it has been widely considered as an efficient unsupervised algorithm. Specifically, we find the $k$ cluster centers from $\lbrace v_j\rbrace_{j=1}^{p}$ . And the $k$ is chosen via the silhouette score (mentioned in line 164 in the manuscript). Then, we can give an estimate of $\delta_j$ by computing the deviation between $v_j$ and its cluster center as stated between line 10-12 in Algorithm 2. We will add the rationale in line 160 to link the estimation problem with k-means clustering.

评论- rebuttal comments read with updated score

2025-08-06

The rebuttal comments provided by the authors have been digested. As I commented below, per the clarifications on the questions, I have updated my score in clarity and originality section. thanks.

最终决定Accept (poster)

2025-09-17

The paper proposes a generalized Gaussian mirror test statistics for feature selection under false discovery rate (FDR) control for linear regression models. The experimental results of the proposed method demonstrate significant improvements in power while still maintaining the nominal FDR level compared to the baselines. Through discussions with the reviewers, the authors clarified their novelty, provided the results on more datasets, and promised to improve the clarity by moving some of the appendices to the main body. Overall, the reviewers agreed that the paper made a solid contribution for feature selection under FDR control.