Permutation-based Rank Test in the Presence of Discretization and Application in Causal Discovery with Mixed Data
Abstract
Reviews and Discussion
This paper presents the Mixed data Permutation-based Rank Test (MPRT) to address the challenge of testing the rank of cross-covariance matrices in the presence of discretized variables. MPRT controls statistical errors even when some or all variables are discretized by establishing exchangeability and estimating the asymptotic null distribution through permutations, thereby effectively controlling Type I error. The method is validated through extensive experiments on both synthetic and real-world data, proving its effectiveness and applicability in causal discovery.
Strengths
This paper effectively handles the challenge of discretized variables in causal discovery, a significant advancement over traditional methods.
Weaknesses
The main limitation of this paper is its reliance on the multivariate Gaussian distribution assumption. This restriction significantly limits the method’s applicability, and it remains unclear how the proposed method would perform with non-normal variables.
Questions
- The authors consider discretizing continuous observations into only three categories. What happens when the number of categories increases? Additionally, what is the computational cost associated with the proposed method?
- The authors examine three scenarios for correlation estimation: between two continuous variables, between a continuous and a discrete variable, and between two discrete variables. How can one distinguish whether the data are discrete or continuous in practice, particularly when the number of categories is large?
- Does the estimation of correlations in different scenarios impact the asymptotic distribution of the test statistic? Is there any theoretical justification for this?
Thank you for the time dedicated to reviewing our paper, the insightful comments, and valuable feedback. Please see our point-by-point responses below.
Q1: It remains unclear how the proposed method would perform with non-normal variables.
A1: Thank you for your insightful question. The proposed method can still work with non-normal variables, as long as the parametric form is given. To be specific, if variables follow a linear non-gaussian causal model and our observation is discretized, we only need to modify the likelihood function according to the corresponding parametric form for correlation estimation and the proposed method can still work. As a comparison, the traditional CCA rank test must assume normality to infer the chi-square null distribution. On the other hand, if the parametric form is not given, which means we do not have any information about the shape of the distribution, it may not be possible to consistently recover the underlying correlation (due to insufficient information), and thus the problem cannot be solved. Thank you again for the comment and we have added a related discussion to our revision to avoid any further confusion.
Q2: What happens when the number of categories increases? Additionally, what is the computational cost associated with the proposed method?
A2: The asymptotic validity of the method is not influenced by the number of discretization levels, but more levels are beneficial. Specifically, the proposed method can handle any number of discretization levels greater than 1, with Type-I errors properly controlled. Further, more discretization levels are beneficial because they lead to less information loss during the discretization process, and thus the correlation matrix can be estimated more efficiently for building the test.
Our method is also quite computationally efficient: as we employ pseudo-likelihoods, the estimation of each entry of the correlation matrix can be parallelized. In our implementation, a test with 100 random permutations can be done within 2 seconds. Another advantage of our method is that the computational cost increases only marginally as the sample size grows. This is because we only need to compute the probability mass of each category once (which depends on the sample size) and save it for future use; once this step is done, the remaining computations do not depend on the sample size.
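The caching idea mentioned in A2 can be illustrated with a minimal sketch. The function name `category_mass` and the choice to store the joint mass of one variable pair in a dictionary are assumptions made for illustration, not the authors' implementation. The point is only that a single pass over the samples yields a table whose size depends on the number of categories rather than on the sample size.

```python
from collections import Counter

def category_mass(x_disc, y_disc):
    """Empirical joint probability mass of two discretized variables.

    Hypothetical helper: one O(n) pass over the data; anything computed
    from the returned table afterwards no longer touches the raw samples.
    """
    n = len(x_disc)
    counts = Counter(zip(x_disc, y_disc))
    return {cell: count / n for cell, count in counts.items()}

# Four samples, three categories per variable (labels 0, 1, 2).
mass = category_mass([0, 0, 1, 2], [1, 1, 0, 2])
```

Here `mass` has one entry per observed category pair, so later computations that only read this table scale with the number of categories.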
Q3: How can one distinguish whether a variable is discrete or continuous in practice, particularly when the number of categories is large?
A3: In real-world causal discovery applications, it is often the case that we know the name and meaning of each variable, as well as how they are measured. This allows us to leverage domain knowledge to determine whether a variable should be treated as discrete or continuous. For example, in the finance field, discrete survey measures of investor sentiment are often used to provide insights into investor behavior, while variables such as stock prices and annual revenue are inherently continuous.
Q4: Does the estimation of correlations in different scenarios impact the asymptotic distribution of the test statistic? Any theoretical justification?
A4: Yes, different scenarios of correlation estimation lead to different asymptotic null distributions (as different extents of information loss are introduced during discretization). However, we note that this does not influence the asymptotic correctness of the proposed permutation test. The reason is that, given our established exchangeability, the asymptotic distribution of the test statistic can always be estimated from data permutations (as in Theorems 4 and 5), so Type-I errors are properly controlled no matter which scenario of correlation estimation is considered.
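The general mechanism behind this answer, estimating a null distribution by permutation once exchangeability holds, can be sketched as follows. This is a generic permutation test using the absolute Pearson correlation as the statistic, not the MPRT rank statistic itself; the function names, the default of 100 permutations, and the toy data are assumptions for illustration.

```python
import random
from statistics import mean

def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_pvalue(x, y, n_perm=100, seed=0):
    """Permutation p-value for H0: no association between x and y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # under H0, every ordering of y is equally likely
        if abs(pearson(x, y_perm)) >= observed:
            exceed += 1
    # Count the observed statistic as one permutation so the p-value is never 0.
    return (exceed + 1) / (n_perm + 1)

# Toy data: y_dep depends strongly on x, so H0 should be rejected.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y_dep = [a + 0.5 * data_rng.gauss(0, 1) for a in x]
p_dep = permutation_pvalue(x, y_dep)
```

The permutation distribution is built from the data alone, which is why the form of the asymptotic null distribution does not need to be known in closed form.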
We genuinely appreciate the reviewer's effort and hope that your concerns/questions are addressed.
This paper tackles the challenge of conducting rank tests when continuous variables are discretized, a common scenario in practice that often leads to information loss and failure of existing rank tests. To address this, the authors propose the Mixed-data Permutation-based Rank Test (MPRT), the first approach of its kind in the literature. They provide an asymptotic analysis of MPRT, demonstrating effective control of Type I error. The test accommodates scenarios with fully continuous, partially discretized, or fully discretized variables. Numerical experiments further confirm MPRT’s control of Type I error and its effective performance on Type II error.
Strengths
To the best of my knowledge, this paper introduces the first statistical rank test designed to address the issue of discretized variables. Great presentation with clear motivation, notation and theorem statements.
Weaknesses
More theory on the asymptotic power and Type I error would be interesting to think about. Please refer to Questions section for more details.
Questions
- Line 208: should be "...sample covariance " not on the numerator.
- The authors have empirically verified the Type I and Type II errors. I’m curious if the asymptotic behavior of the Type I error and power of the test could also be theoretically verified. This can better help understand how the power can be further improved. Some convergence rate results of the power or Type I error can also help. Would like to hear more thoughts from the authors on this direction.
Thank you for the time dedicated to reviewing our paper, the insightful comments, and valuable feedback. Please see our point-by-point responses below.
Q1: Theoretical analysis of the asymptotic behavior of the Type I error and power of the test. Some convergence rate results of the power or Type I error can also help.
A1: Thank you for the valuable feedback. Regarding Type-I errors, as we establish exchangeability even in the discretized scenario, the asymptotic null distribution can be estimated by random permutations (as in Theorems 4 and 5). Consequently, Type-I errors can be properly controlled at any significance level. On the other hand, we fully agree that a theoretical analysis of the power and the convergence rate would also be interesting. However, even without considering discretization, such an analysis seems to require tools from advanced random matrix theory and is highly nontrivial. Furthermore, in our setting with discretized variables, the maximum likelihood step involved makes such an analysis even more challenging. To the best of our knowledge, no existing result is available for the analytic form of the power in our setting, and we plan to leave it for future exploration. Thank you again for the insightful question; we have added this discussion to our revision.
Regarding the typo on line 208, we thank the reviewer and have revised accordingly. We genuinely appreciate the reviewer's effort and hope that your concerns/questions are addressed.
This paper presents a novel statistical method that addresses limitations in rank tests for cross-covariance matrices with discrete data. The traditional rank tests assume continuous data, leading to challenges in controlling Type I errors in the presence of discretized variables. The proposed method allows effective control of Type I errors with the existence of discretized variables. The method is empirically validated through experiments to demonstrate it is a general rank test method when the data are all continuous and also able to handle the discretization of the data.
Strengths
- The problem is relevant and well-motivated.
- This paper introduces a novel approach for the permutation-based rank test, which extends the use of t-separation to the discrete data.
- The paper provides solid theoretical contributions, establishing the exchangeability of canonical variables in permutation tests with discretized data.
- The proposed method is demonstrated with both synthetic and real-world datasets, showing its broad applicability.
Weaknesses
- The results could benefit from deeper intuition and context to make the theoretical contributions more accessible to readers who are unfamiliar with the context. For example, the use of exchangeability for the permutation test and the definition of t-separation.
- Although the MPRT is compared with CCA rank tests, a comparison with some CI tests may provide a more comprehensive view of MPRT's positioning.
Questions
Typo:
l241. result of -> rest of
- How would the proposed method be affected by the level of discretization? A discussion of whether different discretization levels affect the method’s robustness would be beneficial.
- Is the result of Theorem 5 affected if the variables do not have unit variance?
- How does the proposed method work for general discrete variables instead of discretized jointly Gaussian variables?
Thank you for the time dedicated to reviewing our paper, the insightful comments, and valuable feedback. Please see our point-by-point responses below.
Q1: More context about the use of exchangeability for permutation tests and t-separations.
A1: Thank you for the valuable comment to improve the clarity of our paper. In light of your suggestion, we have revised to add a section with an illustrative figure to explain the exchangeability of permutation tests as well as a section with examples to introduce the definition of t-separations (in appendix). Regarding the typo on line 241, we have revised it accordingly.
Q2: Any comparison of MPRT with CI tests may provide a more comprehensive view.
A2: As rank tests take partial correlation / linear CI tests as special cases, the results in Figure 2 and Figure 3 already take the comparison between MPRT and CI tests into consideration. Further, Table 1 also provides a direct comparison between MPRT and the Fisher-Z CI test, where using MPRT consistently leads to the best causal discovery results compared to baselines that include CI tests. We thank the reviewer for the valuable feedback and will make this clearer in our revision to avoid any further confusion.
Q3: How would the proposed method be affected by the level of discretization?
A3: The short answer is that the correctness of the method is not influenced by the number of discretization levels, but more levels are beneficial. Specifically, the proposed method can handle any number of discretization levels greater than 1, with Type-I errors properly controlled. Furthermore, more discretization levels are beneficial because they lead to less information loss during the discretization process, and thus the correlation matrix can be estimated more efficiently for building the test.
Q4: Is the result of Theorem 5 affected if the variables do not have unit variance?
A4: Thank you for the insightful question. Theorem 5 does require that the variables have unit variance, yet we note that unit variance can be assumed without loss of generality. The reason is that rescaling some or all of the variables does not affect the rank of a cross-covariance matrix (also mentioned on lines 309-311).
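The invariance invoked in A4 can be written out in one line. In the sketch below, $D_X$ and $D_Y$ are assumed notation (not taken from the paper) for diagonal matrices holding the reciprocal standard deviations of the two variable sets $X$ and $Y$, and $\Sigma_{XY}$ denotes their cross-covariance matrix:

```latex
\operatorname{Cov}(D_X X,\, D_Y Y) = D_X \, \Sigma_{XY} \, D_Y,
\qquad
\operatorname{rank}\!\left(D_X \, \Sigma_{XY} \, D_Y\right)
  = \operatorname{rank}\!\left(\Sigma_{XY}\right)
\quad \text{since } D_X \text{ and } D_Y \text{ are invertible.}
```

Multiplying a matrix on either side by an invertible matrix preserves its rank, so standardizing the variables changes the entries of the cross-covariance matrix but not the hypothesis being tested.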
Q5: How does the proposed method work for general discrete variables instead of discretized jointly Gaussian variables?
A5: Thank you for the insightful question. If the variables are discretized linear non-Gaussian variables and the parametric form is known, we can modify the likelihood in Section 3.3 according to the parametric form and the proposed method still works. By comparison, the traditional CCA rank test must assume normality to infer the chi-square null distribution. On the other hand, if discrete variables are directly generated according to a probability mass table, the proposed method does not work; in this case, even with an infinite number of samples, the relation between rank and t-separation does not hold and remains an open problem in causal discovery, not to mention building a test for error control in the finite-sample scenario.
We genuinely appreciate the reviewer's effort and hope that your concerns/questions are addressed.
This paper proposes the rank test framework applicable for mixed data. The proposed framework is a novel statistical test for determining the rank of cross-covariance matrices in the presence of discretized data. It is shown and demonstrated in causal discovery across datasets with mixed continuous and discrete variables, where traditional rank tests fail due to discretization challenges.
Strengths
- The paper addresses a significant limitation in traditional rank tests, which struggle with discretized data. The proposed method effectively extends rank testing to datasets with mixed continuous and discrete variables, a common scenario in practice.
- Extensive empirical results on synthetic and real-world datasets demonstrate the ability of the proposed framework to maintain accurate Type I and Type II error rates, outperforming traditional methods. Additionally, the application to causal discovery with real-world personality data illustrates its practical value in inferring causal structures in realistic, mixed-type data scenarios.
- By using a permutation-based approach, MPRT avoids complex adjustments for non-continuous data distributions and offers a computationally feasible way to approximate the null distribution, making it suitable for large datasets with mixed data types.
Weaknesses
- Many real-world relationships are inherently non-linear. However, the proposed method is restricted to linear relationships. This limits its potential applicability.
Questions
Thank you for the time dedicated to reviewing our paper, the insightful comments, and valuable feedback. Please see our point-by-point responses below.
Q1: Can the proposed method handle nonlinear scenarios?
A1: Yes, the proposed method can be extended to handle certain forms of nonlinear relations among the continuous variables. For example, suppose the continuous variables follow a nonparanormal distribution, i.e., they are jointly Gaussian after some monotone mappings. In that case, we can apply the proposed method to the data transformed by the empirical distribution functions of those variables. As such, a valid test can still be built.
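One common way to realize such a transformation for a continuous variable is the rank-based normal-score map, sketched below. Whether the authors use exactly this map is an assumption; the function name `normal_scores` and the `rank / (n + 1)` scaling are illustrative choices, and the sketch assumes no ties in the sample.

```python
import math
from statistics import NormalDist

def normal_scores(values):
    """Map a sample to Gaussian scores through its empirical CDF.

    Assumes a continuous variable without ties; tied values would be
    ordered by their position in the input.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        # rank / (n + 1) keeps the argument strictly inside (0, 1).
        scores[idx] = std_normal.inv_cdf(rank / (n + 1))
    return scores

# A monotone distortion (here exp) leaves the scores unchanged,
# because the scores depend on the data only through their ranks.
raw = [-1.2, 0.3, 2.5, 0.0, -0.4]
distorted = [math.exp(v) for v in raw]
scores_raw = normal_scores(raw)
scores_distorted = normal_scores(distorted)
```

Because the scores depend only on ranks, any strictly increasing distortion of a variable yields the same transformed data, which is the property the nonparanormal argument relies on.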
On the other hand, if the underlying model with latent variables is allowed to be fully non-parametric without any additional assumption, the problem is highly challenging; in this case, even without discretization and with an infinite number of samples, the relation between rank and t-separation does not hold, not to mention building a test for error control in the finite-sample scenario. We agree with the reviewer that handling such a fully non-parametric setting would be beneficial to the field, but this remains an important open problem in causal discovery and we plan to leave it for future exploration. If you have any thoughts or suggestions on this matter, we would greatly appreciate your input.
We genuinely appreciate the reviewer's effort and hope that your concerns/questions are addressed.
This paper uses a mixed data permutation based rank test for causal discovery on mixed data. Doing so requires showing that exchangeability holds so that the permutation can approximate a valid null. Reviewers are all in agreement, with borderline positive scores, but the reviews are quite brief. My own view is that the paper contains some good ideas, but extending to the discretization requires little innovation and the main results are straightforward to establish.
Additional Comments on Reviewer Discussion
Reviews were low quality and very brief. They were noncommittal, and no reviewers engaged in discussion.
Reject