Robust group and simultaneous inferences for high-dimensional single index model
This paper introduces high-dimensional robust inference procedures by recasting the single index model into a pseudo-linear model with transformed responses.
摘要
评审与讨论
This paper studies the high-dimensional single index model (SIM), which takes the form with and being orthogonal. Although this model has flexibility and interpretability, its efficiency is adversely affected by outlying observations and heavy-tailed distributions. The paper improves this in the following 3 aspects:
(1) they extend the rank-LASSO procedure to include both convex and non-convex penalties and establish error bound of any local optimum of the empirical objective;
(2) they provide asymptotically honest group inference procedures based on the idea of orthogonalization for testing the joint effect of many predictors;
(3) they develop a multiple testing procedure for determining if the individual coefficients are relevant simultaneously, and show that it is able to control the FDR asymptotically.
优点
Provides new ideas that deal with the inefficiency of single index model due to outlying observations and heavy-tailed distributions.
缺点
I think the paper is quite well-written. The only issue may be that the authors can provide more about the background of the problem,
问题
N/A
局限性
N/A
Thanks very much for your valuable comments.
Weaknesses
The group inference is helpful to decide whether a group of predictors are important or not for the response. If we find a group of predictors are important, we would like to know which specific predictors in the group are significant. For this aim, our developed multiple testing procedure is useful. For instance, researchers may aim to test whether a gene pathway, consisting of high-dimensional genes for the same biological function, is important for a certain clinical outcome, given the other high-dimensional genes. When determining that a certain gene pathway is important, researchers need to further identify specific genes within the pathway which are important for a certain clinical outcome.
According to your helpful suggestion, we will add a real data analysis in the appendix of the revision. For your convenience, we also copy the real data analysis in the following:
We apply our methods on a dataset about riboflavin (vitamin B2) production rate with Bacillus Subtilis. This dataset is made publicly by Buhlmann et al. (2014) and has been analyzed by many authors, for instance Meinshausen et al. (2009), Van de Geer et al. (2014), Javanmard and Montanar (2014), and Fei et al. (2019). The dataset riboflavin can be obtained from the R package . It consists of observations of strains of Bacillus Subtilis and covariates, measuring the log-expression levels of 4088 genes. The response variable is the logarithm of the riboflavin production rate.
Our goal is to detect which genes are associated with riboflavin production rate. Like most existing studies, we first reduce ultrahigh-dimension to a moderate high-dimension. Here we pick out first 300 genes by distance correlation based screening Li et al. (2012). We first conduct global testing on these 300 genes. The -value of our group inference procedure is 1.29e-04, indicating that the null hypothesis is rejected and the selected 300 genes are influential for riboflavin production rate. Next, we further use FDR control procedure to select the important genes in these 300 genes.
By implementing our proposed FDR control procedure with the FDR level of 0.1, we identify 10 genes that are significantly associated with the response. That is
= {YTGB_at, YCKE_at, YXLE_at, YXLD_at, YJCJ_at, XHLA_at, xepA_at, YCGO_at, RPLP_at, XKDS_at}.
If the FDR level is set as 0.2, 5 more genes will be selected. That is
= {SPOIISA_at, YHCB_at, XKDI_at, YJCF_at, XHLB_at}.
We further conduct group inference on the selected subsets and their complement sets . As expected, our group inference procedure finds again that are significant while are not. The corresponding -values of , , and are 6.75e-06, 7.26e-01, 9.33e-06 and 9.37e-01, respectively. These results suggest that the genes selected by the FDR control procedure are really influential.
We compare these selected genes with other methods. For example, the multi-sample-splitting method proposed in Meinshausen et al. (2009) identified YXLD_at; Van de Geer et al. (2014) did not select any gene using the de-sparsified Lasso; Javanmard and Montanar (2014) only selected two genes: YXLD_at and YXLE_at and Fei et al. (2019) claimed YCKE_at, XHLA_at, YXLD_at, YDAR_at and YCGN_at as significant. From and , we can see clearly that the gene YXLD_at is detected by not only Meinshausen et al. (2009), Javanmard and Montanar (2014), Fei et al. (2019), but also by our procedure. Besides, the genes YCKE_at, YXLE_at and XHLA_at which are detected by Javanmard and Montanar (2014) and Fei et al. (2019), are also found by our method. Further, our procedure detects some additional important genes.
Thanks for the rebuttal. I tend to keep the score.
This paper proposes a robust group hypothesis testing procedure for a high-dimensional single index model based on a data-driven transformation of the response variable. The key observation is that under the the linearity condition (LC) of the predictors, a wide class of single index models can be equivalently (for testing purposes) converted into a linear model with a transformed response variable. The main contribution of this paper is the introduction of the distribution transformation (a data-driven approach) of the response variable, and derive the asymptotic distribution of the resulting test statistic. Numerical performance of the proposed method is evaluated through simulation studies.
优点
The proposed method is interesting and potentially useful. The paper is well-written and the presentation is clear.
缺点
The simulation study can be improved.
问题
-
The introduction of the distribution function in (2.7) is interesting but also a little arbitrary. I suspect that any other functions that can be written as linear combinations of functions of 's can be used in (2.7), for example, the kernel density estimator, the kernel smoother of 's. The resulting test statistic should still be asymptotically equivalent to a U-statistic. Could the authors comment on this?
-
Whenever the term "robustness" is involved, it implies loss of efficiency in some cases. In my opinion, it is important to make a transparent evaluation of the limitations of the proposed procedure. In the simulation study, one can add a comparison to the method with as described in section 3.3. It will be informative if one can showcase in which cases which procedure is more powerful, and thus understand which method to use in practice.
-
In the simulation study, I would also suggest a graphical comparison between different methods by gradually increasing from to, say, for a more complete picture.
局限性
The simulation study can be improved.
Thanks very much for your valuable comments.
Weaknesses
We have added more simulations to illustrate our procedures and compare them with other methods. The simulation settings are summarized in global response, and the simulated results are displayed in the attached pdf file.
Question 1
As noted by you, there are many possible choices for the transformation function . There are several reasons for us to choose the distribution function as the transformation function:
- Firstly, the response-distribution transformation function is bounded. Actually with the equation (2.3), given the widely imposed sub-gaussian assumption on the predictors, any bounded transformation function would lead the transformed error term being sub-gaussian, even if the original error term in the single index model comes from Cauchy distribution.
- As noted by Rejchel and Bogdan (2020), in the empirical distribution function, the term is the rank of . Since statistics with ranks such as Wilcoxon test and the Kruskall-Wallis ANOVA test, are well-known to be robust, this then intuitively explains why our procedures with distribution function are robust with respect to outliers in response.
- The distribution function is very easy to estimate and thus our approach is straightforward to implement and understand. While the kernel density estimator, the kernel smoother of ’s require additionally tuning bandwidth.
Question 2
According to your valuable suggestion, we have conducted detailed comparisons with the method based on as described in section 3.3. The simulation settings are summarized in global response, and the corresponding numerical results are displayed in Figure R.1 in the attached pdf file.
- In Figure R.1, we compare our method with the procedure based on when the error term follows the standard normal distribution. In the standard case (normal distribution error, linear model and no response polluted), the performance of test statistic with is slightly better than our method with . However, in other settings, our method with performs much better than method with .
Although performs well in the standard case, one cannot know the distribution of the error term or the formula of the model in practice. Thus compared to , we recommend using our procedure in practice.
Question 3
We agree that a graphical comparison between different methods by gradually increasing from 0 to 0.5 would provide a more complete picture. According to your valuable suggestion, we have conducted detailed simulations for comparisons. The simulation settings are summarized in global response, and the results are displayed in Figure R.1 of the attached pdf file.
We consider three transformation procedures: (1) (Our method); (2) ; (3) . We vary from and the simulation results are shown in Figure R.1.
- Firstly, under the null hypothesis (), cannot control the type I error for Model 2, while other procedures perform well.
- Secondly, under the alternative hypothesis (), the empirical powers of other procedures decrease rapidly with the increase of , while the powers of our procedure remain stable, which is particularly noticeable when . This finding indicates that our method has strong robustness when the responses are polluted.
- Thirdly, our method performs well for both Model 1 and Model 2, indicating that our method is robust across different single-index models.
Limitations
Following your suggestions, we have added more simulations to illustrate our procedures and compare them with other methods. The simulation settings are summarized in global response, and the simulated results are displayed in Figure R.1 of the attached pdf file.
I want to thank you for addressing my concerns. They are very helpful. Congratulations for a solid work!
The paper proposes an algorithm for group inference in single-index models with an unknown link function, that is robust to heavy-tailed noise in the responses. The central idea of the approach is based on the property of elliptical distributions that the linear input-output correlations remain along a fixed direction for any transformation of the labels. The algorithm introduces a linear estimation objective for transformed labels resulting in an estimator converging to the true parameters under sparsity assumptions. Crucially, the obtained estimator yields a robust test for group inference satisfying the orthogonality property, which relaxes the requirements of separation between zero and non-zero coefficients. The obtained test further satisfies honesty, namely the uniform convergence of Type 1 error. The work concludes with numerical tests of the proposed approach.
优点
The paper is generally well written and provides an adequate motivation for the problem of group inference under heavy-tailed response, and the idea of orthogonalization. The proposed approach is straightforward to implement and understand. The paper provides an extensive discussion of some important properties of the test such as honesty and power. The estimation error bounds are non-asymptotic and hold for general scalings of p,n. The numerical experiments support the efficiency and robustness of the approach.
缺点
A few limitations of the setup are not explicitly discussed:
- The robustness is restricted to variability in the noise of the response function, not w.r.t general perturbations to the distribution. The paper also doesn’t provide quantitative bounds on robustness.
- The condition kappa_h \neq 0 excludes even link functions and in particular the problem of sparse phase retrieval.
- Theorem 3.1 imposes limitations on the sparsity level that should be explicit. The results in the paper seems to be restricted to constant sparsity levels/
The choice of h=F for robustness appears to be due to F(Y) being uniformly distributed in [0,1] independent of the distribution of Y. This should be more discussed explicitly.
问题
Questions:
Why are λX, λY not directly described in Theorem 3.1?
Why is the transformation H=F particularly suitable for robustness? Is it because of the distribution of F(Y) being agnostic to the distribution of Y?
For sub-exponential variables, Lemma A.11 appears to be weaker than Bernstein’s inequality.
Could you clarify the difference w.r.t Bernstein’s inequality and related inequalities for random variables with bounded Orlicz norms?
局限性
Some missing discussion on limitations is described above under “weaknesses”. The work is primarily of a theoretical nature and has no potential negative societal impact.
Thanks very much for your valuable comments.
Weakness 1
As noticed by you, in our paper, we say our procedure is robust since our methods do not need any moment condition for the error term in the single-index model. For your concern, we further discuss the robustness of our procedure based on the tool of efficient influence function. Recall that our test procedure is inspired by the quantity where is distribution of . Next we derive the efficient influence function (EIF) of . Consider where , and is a point mass at a single observation . By some calculation, the EIF for at observation is Since and are bounded, then given , is bounded for any . In terms of the EIF, our test statistics are robust with respect to the perturbations in the responses.
Weakness 2
We agree with you. The condition does exclude even link functions and in particular the problem of sparse phase retrieval. Neykov et al. (2020) introduced a novel procedure to deal with the problem of sparse phase retrieval when . We will incorporate their insights in the reivision.
Weakness 3
For consistency in -loss, we require the sparsity level satisfying . For -loss, the restriction becomes . The sparsity level is allowed to be diverging with and .
Weakness 4
With the equation (2.3), given the widely imposed sub-gaussian assumption on the predictors, any bounded transformation function would lead the transformed error term being sub-gaussian, even if the original error term comes from Cauchy distribution. However, the response-distribution transformation is preferred due to the following additional reasons.
- As noted by Rejchel and Bogdan (2020), in the empirical distribution function, the term is the rank of . Since statistics with ranks such as Wilcoxon test and the Kruskall-Wallis ANOVA test, are well-known to be robust, this then intuitively explains why our procedures with distribution function are robust with respect to outliers in response.
- The distribution function is very easy to estimate, and thus, our approach is straightforward to implement and understand.
- Lastly, besides being robust, the choice of would also have relatively high efficiency. Specifically, we conduct detailed simulations to illustrate this point. The simulation settings are summarized in global response, and the simulated results are displayed in the attached pdf file. Besides distribution function, we also consider and . Compared to other functions, our procedure has high power performance under nearly all the settings. The better performance compared to demonstrates that the superiority of our procedure is not merely due to the boundedness of the transformation function.
Question 1
To simplify the presentation, all main assumptions are given in the A.5 of the Appendix. In Assumption A.1 (ii), we assume that for some constants .
Question 2
Please refer to response to weakness 4 for details.
Question 3
After some careful calculation, we agree with you that for sub-Exponential variables, Lemma A.11 is weaker than Bernstein’s inequality. Let be independent, mean zero, sub-Exponential random variables. Then for , Bernstein inequality implies thatwhere is a constant and . And Lemma A.11 implies thatfor , where and are constants. For ease of comparsion, suppose that for . Note that when is sufficiently large. That is, the bound of Bernstein's inequality is sharper than the bound of Lemma A.11.
Question 4
Let be a non-decreasing convex function with . The ``-Orlicz'' norm of a random variable is .
Specifically, let . The sub-Weibull norm of a r.v. for any is defined as . The r.v. follows sub-Weibull distribution for is equivalent to is bounded. When , is sub-Exponential. In this case, Bernstein inequality is applicable when . While Lemma A.11 is applicable for . Therefore, Lemma A.11 is a generalization of Bernstein's inequality for general sub-Weibull distributions.
In this article, Lemma A.11 is adopted in the proof of Lemma A.14. In this case, we need to derive the bound of . For , can be represented as , where are zero mean, i.i.d. variables with bounded sub-Weibull norm for .
Dear Program Chairs, Senior Area Chairs, Area Chairs and Reviewers,
Thank all of you for your insightful comments and valuable suggestions, which have significantly enhanced the quality of this work. In this response, all the comments have been carefully addressed and accommodated. A summary of response is as follows.
-
Discussions of robustness with respect to perturbations in the responses. In the revision, we carefully discuss the robustness of our inference procedure based on the tool of efficient influence function. Please see the response to weakness 1 of Reviewer wCvo for details.
-
Discussions of the response-distribution transformation function. In the revision, we discuss the reasons why the response-distribution transformation is preferred. Please see responses to weakness 4 of Reviewer wCvo and question 1 of Reviewer BqNs for details.
-
Relationships between Lemma A.11 and Bernstein' inequality. Please see responses to questions 3 and 4 of Reviewer wCvo for details.
-
Additional simulation studies. In the revision, we have conducted additional simulation studies to illustrate the robustness of our procedure. Please refer to the attached pdf file for details.
-
The simulation settings are summarized as follows. We consider the following two models:
-
Model 1: Linear model: .
-
Model 2: Non-linear model: .
The regression coefficients are generated as for and otherwise, where can be regarded as a signal strength parameter. We generate the error term from the standard normal distribution. We add outliers to pollute the observations: of the responses are picked at random and increased by -times maximum of original responses. Specifically, the detailed settings for the above parameters are as follows. Firstly, we consider the sample size and the dimension . Secondly, we set the signal strength parameter to vary from . Thirdly, we fix to 10. Lastly, we vary from to in increments of . For more complete comparisons, we consider three transformation procedures: (1) (Our method); (2) ; (3) . Other simulation settings are the same as described in section 4 of the main text. The simulation results are summarized in Figure R.1 of the attached pdf file.
-
-
Figure R.1 summarizes the results of empirical type I error and empirical power with the significant level of for different methods when the error term follows the standard normal distribution. Firstly, under the null hypothesis (), cannot control the type I error for Model 2, while other procedures perform well. Secondly, under the alternative hypothesis (), the empirical powers of other procedures decrease rapidly with the increase of , while the powers of our procedure remain stable, which is particularly noticeable when . This finding indicates that our method has strong robustness when the responses are polluted. Thirdly, our method performs well for both Model 1 and Model 2, indicating that our method is robust across different single-index models.
-
-
Discussion about the background of the problem. In the response, we discuss the practical background of the problem and demonstrate the practical value by a real data analysis. Please see the response to weaknesses of Reviewer Tg8m for details.
-
All the other comments from the Program Chairs, Senior Area Chairs, Area Chairs and Reviewers are also carefully addressed.
Thank you very much for giving us an opportunity to revise the paper.
Dear authors, dear reviewers,
the discussion period has begun as the authors have provided their rebuttals. I encourage the reviewers to read all the reviews and the corresponding rebuttals: the current period might be an opportunity for further clarification on the paper results and in general to engage in an open and constructive exchange.
Many thanks for your work. The AC
The focus of the paper is an algorithm for group inference in single-index models which is robust under heavy-tailed noise in high dimension: under some sparsity hypothesis, an equivalent linear estimation objective allows to obtain a robust estimator converging to the true parameters.
The paper received positive feedback from all reviewers because of its clarity and the proposed ideas to deal with the problem of group inference under heavy tails in the response in single-index models. The results are further supported by numerical evidence. During the rebuttal discussion, limitations in the analyzed set-up were addressed and clarified in the resubmitted version, and further numerical evidence was added to the manuscript. I share positive evaluations of the Reviewers and recommend therefore acceptance.