PaperHub
Overall score: 7.0/10
Poster · 3 reviewers
Ratings: 4, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

Testing Conditional Mean Independence Using Generative Neural Networks

Submitted: 2025-01-19 · Updated: 2025-07-24
TL;DR

We introduce a novel method for conditional mean independence testing using deep generative neural networks, with strong performance in high-dimensional settings.

Abstract

Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional mean functions involved in the population measure. The test statistic is thoughtfully constructed to ensure that even slowly decaying nonparametric estimation errors do not affect the asymptotic accuracy of the test. Our approach demonstrates strong empirical performance in scenarios with high-dimensional covariates and response variable, can handle multivariate responses, and maintains nontrivial power against local alternatives outside an $n^{-1/2}$ neighborhood of the null hypothesis. We also use numerical simulations and real-world imaging data applications to highlight the efficacy and versatility of our testing procedure.
Keywords
Conditional Distribution, Maximum Mean Discrepancy, Kernel Method, Double Robustness

Reviews and Discussion

Official Review
Rating: 4

The paper introduces a new nonparametric test for conditional mean independence (CMI) that leverages deep generative neural networks to estimate conditional mean embeddings. The proposed method uses a novel population measure based on RKHS embeddings and constructs a test statistic in a multiplicative form that is robust to the slower convergence rates of nonparametric nuisance parameter estimators. To mitigate estimation errors, the authors combine sample splitting and cross-fitting with a generative moment matching network (GMMN) for sampling from the conditional distribution of covariates. The paper provides comprehensive theoretical guarantees (including asymptotic size control and power properties under local alternatives) and supports its claims via extensive simulation studies and real-world imaging applications (facial expression recognition and age estimation).

Questions for Authors

  1. How does computational complexity scale with the dimensionality of X, Y, Z, and sample size n compared to existing methods?
  2. Have you explored alternative conditional generative models beyond GMMNs (e.g., conditional GANs or score-based diffusion models; see Essential References Not Discussed)? How might they affect test performance?
  3. How robust is your method to model misspecification when estimating conditional mean functions in highly nonlinear relationships?

Claims and Evidence

The central claims are well-supported by theoretical analysis and empirical results. In particular:

  • The theoretical claims about precise asymptotic size control are supported by rigorous proofs and verified in simulation studies showing empirical sizes close to nominal levels. The claim of detecting local alternatives is validated through theoretical analysis in Theorem 5 and empirical power evaluations in simulations. However, the theoretical results rely on certain technical assumptions (e.g., on the decay rates of estimation errors) that might limit the generality of the results in practice.

  • The claim of strong empirical performance in high-dimensional settings is demonstrated through comprehensive simulations against multiple baseline methods and by experiments on real-world imaging applications.

Methods and Evaluation Criteria

This work proposes to use generative models to approximate conditional distributions for RKHS-based testing, which is a novel way to overcome challenges in high-dimensional nonparametric estimation. The evaluations on both synthetic experiments (with clear sparse and dense alternatives) and applications on real imaging datasets are appropriate and provide convincing evidence of the method’s effectiveness.

Theoretical Claims

The paper contains several theoretical contributions, with proofs detailed in the supplementary material. I reviewed the main steps in the proofs of Theorems 4, 5, and 6. Under the given assumptions (e.g., Assumptions 7 and 9), the arguments appear sound. However, some of those technical assumptions seem to be quite strong. Clarification on the practical implications of these assumptions would be beneficial.

Experimental Design and Analyses

The simulation studies (Examples A1 and A2) are well-designed and cover a range of scenarios (both linear and nonlinear models, and sparse versus dense alternatives). The experimental analyses also include a comparison with multiple state-of-the-art methods.

However, while the results demonstrate clear benefits of the proposed test in terms of both size control and power, I believe a more detailed ablation study—particularly regarding the sensitivity to hyperparameter choices and kernel bandwidth selection—could strengthen the empirical section further.

Supplementary Material

Yes, mostly on the proofs of theoretical results.

Relation to Prior Literature

I believe the authors have done a thorough job reviewing the CMI testing literature, clearly identifying limitations of existing methods.

Essential References Not Discussed

While the paper cites a wide range of relevant literature, it would benefit from discussing recent advances in generative modeling (GANs and diffusion models) for conditional distribution estimation (i.e., trying different designs of the generative network $\widehat{G}$), for example:

  1. Athey, S., Imbens, G. W., Metzger, J., and Munro, E. Using Wasserstein generative adversarial networks for the design of Monte Carlo simulations. Journal of Econometrics, 2021.

  2. Baptista, R., Hosseini, B., Kovachki, N. B., and Marzouk, Y. Conditional sampling with monotone GANs: From generative models to likelihood-free inference. arXiv preprint arXiv:2006.06755, 2020.

  3. Shi, Y., De Bortoli, V., Deligiannidis, G., and Doucet, A. Conditional simulation using diffusion Schrödinger bridges. In Uncertainty in Artificial Intelligence, pp. 1792–1802. PMLR, 2022.

  4. Nguyen, B., Nguyen, B., Nguyen, H. T., and Nguyen, V. A. Generative conditional distributions by neural (entropic) optimal transport. In Proceedings of the 41st International Conference on Machine Learning, pp. 37761–37775, 2024.

Other Strengths and Weaknesses

Good:

  • The paper is in general clear to follow, though it can be dense with notation in the first few pages.

Need to address:

  • There is limited discussion of hyperparameter sensitivity (kernel bandwidths, network architectures).

Other Comments or Suggestions

n/a

Author Response

We greatly appreciate your valuable comments, which have helped lead to a much-improved manuscript. In the following, we present our point-by-point responses to your questions and will take into account all your suggestions in a revised version of our manuscript.

Generality and practical implications of the assumptions. For a detailed discussion on Assumptions 7 and 9, please refer to our reply to Reviewer tmTe. In summary, our proposed CMI test is fully nonparametric, and Assumptions 7 and 9 do not impose explicit restrictions on the data distribution (e.g., boundedness, continuity, or sub-Gaussianity), enhancing its practical applicability. Furthermore, the double robustness property of the test statistic allows for mild assumptions on the error decay rates of nuisance parameters. For example, if we assume that $(Y, Z)$ follows a linear regression model, then $g_Y$ can be estimated at the $n^{-1/2}$ rate, which implies that the conditional distribution $P_{X|Z}$ only needs to be consistently estimated without strict rate requirements.

Following your comments, we will include a brief discussion in the revised version of our paper, highlighting the generality and practical implications of these assumptions, particularly the more technical ones.

Hyperparameter sensitivity. For the sensitivity to network hyperparameters, please refer to our reply to Reviewer tmTe. Regarding the choice of kernel bandwidths, we followed the median heuristic in our manuscript, as it is widely used in kernel-based tests. To further evaluate the sensitivity to bandwidth selection, we conducted additional simulations for Example A1 with a fixed sample size $n = 400$ using bandwidths determined by either the mean pairwise distance or the “$\gamma$th quantile heuristic” for $\gamma \in \{25\%, 75\%\}$. Specifically, the bandwidth for $\mathcal{K}_X$ was set as the mean or $\gamma$th quantile of $\{\|X_j - X_k\|_1 : j, k \in [n]\}$, with similar choices for $\mathcal{K}_Z$. The empirical sizes for the test using the mean, 25% quantile, and 75% quantile bandwidths are 7%, 6.8%, and 7%, respectively. The size-adjusted powers under the sparse alternative are 98.6%, 98.4%, and 98.6%, while under the dense alternative the power remains 100% in all cases. These results indicate that the test’s empirical performance remains stable across different bandwidth selection methods.
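For concreteness, here is a minimal NumPy sketch of the bandwidth rules compared above (our illustration; the function name, array shapes, and random data are placeholders, not the authors' code):

```python
import numpy as np

def l1_bandwidth(X, rule="median"):
    """Bandwidth from pairwise L1 distances ||X_j - X_k||_1, j < k.

    rule: "median" (the heuristic used in the manuscript), "mean",
    or a quantile level such as 0.25 or 0.75.
    """
    # All pairwise L1 distances (upper triangle, excluding the diagonal).
    dists = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)
    d = dists[np.triu_indices(X.shape[0], k=1)]
    if rule == "median":
        return np.median(d)
    if rule == "mean":
        return d.mean()
    return np.quantile(d, float(rule))

# Example: the three alternatives compared in the reply above.
X = np.random.default_rng(1).normal(size=(400, 10))
bandwidths = {r: l1_bandwidth(X, r) for r in ["mean", 0.25, 0.75]}
```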

Computational complexity. Given the trained neural networks and assuming Gaussian or Laplacian kernel functions, our statistic resembles the average of two U-statistics of degree two, with its value depending on pairwise distances between samples. As a result, the computational complexity scales linearly with the dimensions $(d_X + d_Y + d_Z)$ and quadratically with the sample size $n$. For comparison, the computational complexity of pMIT$_M$ (Cai et al., 2024) and DSP$_M$ (Dai et al., 2022), both DNN-based CMI tests focused on univariate $Y$, scales linearly with $n$. The network training complexity is $O(E \cdot n \cdot P)$, where $E$ is the number of epochs and $P$ is the total number of trainable parameters.
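As a rough illustration of where the quadratic-in-$n$ and linear-in-dimension costs come from, a generic degree-two U-statistic with a Gaussian kernel can be evaluated as follows (a schematic sketch, not the actual test statistic; all names and data are illustrative):

```python
import numpy as np

def gaussian_gram(A, bandwidth):
    # Pairwise squared Euclidean distances: O(n^2 * d) time.
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def degree_two_u_stat(K):
    # Average of K[j, k] over all distinct pairs (j, k): O(n^2) time.
    n = K.shape[0]
    return (K.sum() - np.trace(K)) / (n * (n - 1))

rng = np.random.default_rng(0)
n, d = 400, 50
W = rng.normal(size=(n, d))  # stand-in for concatenated (X, Y, Z) features
u_stat = degree_two_u_stat(gaussian_gram(W, bandwidth=np.sqrt(d)))
```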

In light of this comment, we will include a discussion on computational complexity in the revised version.

Alternative GNN structure and discussion on recent advances in generative modeling. At the early stages of this paper, we experimented with conditional GANs similar to those in Shi et al. (2021) to approximate $P_{X|Z}$. The empirical performance of the test was comparable to the current approach using GMMN. However, a key drawback of GANs is their longer training time, as both the generator and discriminator must be trained simultaneously. In contrast, GMMN has a more efficient training process and yields a test statistic whose empirical performance closely matches the oracle statistic; see Table 2 in Section 3. We will incorporate a more detailed discussion of recent advancements in generative modeling, particularly those highlighted in your review, in the revised version of our paper.
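For readers unfamiliar with the GMMN approach, the key practical difference from a GAN is that only the generator is trained, against an MMD loss between generated and observed samples. A toy PyTorch sketch of this idea (our illustration with synthetic data and placeholder dimensions; not the architecture, kernel, or loss used in the paper):

```python
import torch
import torch.nn as nn

# Toy dimensions and data; in the paper (Z, X) comes from the real dataset.
d_z, d_x, d_noise, n = 5, 3, 10, 512
Z = torch.randn(n, d_z)
X = Z[:, :d_x] + 0.1 * torch.randn(n, d_x)  # some dependence of X on Z

def mmd2(a, b, bandwidth=1.0):
    """Biased squared-MMD estimate with a Gaussian kernel."""
    k = lambda u, v: torch.exp(-torch.cdist(u, v) ** 2 / (2 * bandwidth ** 2))
    return k(a, a).mean() + k(b, b).mean() - 2 * k(a, b).mean()

# Conditional generator: (Z, noise) -> sample of X given Z.  Only this network
# is trained (no discriminator), which is the practical advantage over a GAN.
gen = nn.Sequential(nn.Linear(d_z + d_noise, 64), nn.ReLU(), nn.Linear(64, d_x))
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

for epoch in range(200):
    noise = torch.randn(n, d_noise)
    X_fake = gen(torch.cat([Z, noise], dim=1))
    # Match the joint distribution of (Z, X_fake) to that of (Z, X) in MMD.
    loss = mmd2(torch.cat([Z, X_fake], dim=1), torch.cat([Z, X], dim=1))
    opt.zero_grad(); loss.backward(); opt.step()
```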

Robustness to model misspecification. A key strength of our proposed test is its fully nonparametric nature, as it does not assume a specific parametric form for the mean functions. Thanks to the universal approximation properties of neural networks, model misspecification is not a concern asymptotically if the network width increases with the sample size. However, in practice, network structures are fixed, which may introduce approximation errors.

Fortunately, the double robustness property of our statistic mitigates sensitivity to these approximation errors, making our method more reliable than approaches that lack this property. As demonstrated by the simulation results in Section 3 for both linear and nonlinear models, our test's performance is comparable to the oracle test.

Official Review
Rating: 3

This paper proposes a novel statistical method for conditional mean independence (CMI) testing. First, the authors introduce a new population-level CMI measure and develop a bootstrap-based hypothesis testing framework that employs generative neural networks to approximate conditional mean functions. The test statistic is constructed to reduce the influence of nonparametric estimation errors, ensuring asymptotic precision. The proposed method performs well in high-dimensional settings and supports multivariate responses. Finally, experiments on simulated and real-world data demonstrate the effectiveness of the proposed method.

Update after rebuttal

After reviewing the rebuttal addressed to me and those for other reviewers, I am willing to maintain my score.

Questions for Authors

No.

Claims and Evidence

Yes, the claims made in the submission appear to be supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes, the proposed hypothesis testing framework makes sense for the CMI testing task.

Theoretical Claims

Yes. I checked some of the proofs, including those of Theorems 4, 5, and 6.

Experimental Design and Analyses

Yes, the experimental designs and analyses are sound.

Supplementary Material

No, I did not read the supplementary materials.

Relation to Prior Literature

This paper proposes a novel framework for the CMI testing problem with strong theoretical guarantees.

Essential References Not Discussed

No, the paper includes all essential and relevant references.

Other Strengths and Weaknesses

Strengths:

  • The method is shown to control Type I error while maintaining nontrivial power.

  • The proofs are clear and sound.

  • Comparisons against existing CMI tests highlight superior empirical performance of the proposed method.

Weaknesses:

  • The method requires training multiple deep neural networks, which increases the computational cost.

  • The theoretical results depend on some strong assumptions, such as Assumption 7. The rationale for these assumptions should be discussed.

Other Comments or Suggestions

See Weaknesses.

Author Response

We greatly appreciate your valuable comments, which have helped lead to a much-improved manuscript. In the following, we present our point-by-point responses to your questions and will take into account all your suggestions in a revised version of our manuscript.

Computational cost. Due to the sample splitting and cross-fitting framework, we need to train four neural networks (two GNNs and two DNNs) to construct our statistic, which is analogous to the two-split pMIT$_M$ test proposed in Cai et al. (2024). Regarding computation time, it takes 41.0 seconds for our method to complete one Monte Carlo simulation for Example A1 with a sample size of $n = 400$. This is longer than competing methods but of the same order as pMIT$_M$; please refer to our reply to Reviewer KpXH for more details on computation time.
The computation time for training the neural networks depends on the complexity of the network, which is primarily determined by the data structures. For our numerical results in Section 3, the network structures used are relatively simple (with only one hidden layer). These simple structures are easy to train and still yield satisfactory empirical performance. Importantly, our method does not rely on a specific machine learning algorithm or network structure. As long as the estimation error meets the requirements in Assumption 7, any new or different machine learning techniques and network architectures can be used to reduce the training cost.

Rationality of the assumptions. Part (a) of Assumption 7 ensures that the (conditional) mean embeddings into the RKHSs, as well as the operator $\Sigma$, are well-defined. This assumption is commonly used in the literature of kernel-based conditional (mean) independence testing and holds for bounded kernels such as the Gaussian and Laplacian kernels.

Part (b) of Assumption 7 requires the estimation errors of the neural networks to decay to zero at rates $n^{-\alpha_1}$ and $n^{-\alpha_2}$ for $\alpha_1, \alpha_2 \in (0, \infty)$ such that $\alpha_1 + \alpha_2 > 1/2$. Similar rate requirements appear in Cai et al. (2024) and Lundborg et al. (2024), where "black-box" estimators such as DNNs are used. As shown in Stone (1982), the minimax nonparametric regression decay rate for $\mathbb{E}[|g_Y(Z) - \hat g_Y(Z)|^2]$ is $n^{-2p/(2p+d_Z)}$. When estimating $g_Y$ with DNNs, Bauer and Kohler (2019) demonstrated that the decay rate can be $n^{-2p/(2p+d^\ast)}$, where $p$ is the smoothness parameter and $d^\ast$ represents the intrinsic dimensionality. Regarding the estimation error of the conditional mean embedding of $P_{X|Z}$, our requirement is actually less restrictive than the assumptions on the total variation distance between $P_{X|Z}$ and its estimator (see Remark 8 in Appendix C), which were used in Shi et al. (2021). Shi et al. (2021) argue that their assumption holds in a wide range of settings, with examples provided in Berrett et al. (2019). Moreover, due to the double robustness property of our test statistic, we do not impose explicit constraints on the individual estimation errors of $g_Y$ and the mean embedding of $P_{X|Z}$. Instead, we only require their product to decay faster than $n^{-1/2}$. Notably, when $g_Y$ is sufficiently smooth, $\alpha_2$ can approach $1/2$, allowing $\alpha_1$ to remain very small to accommodate complex and high-dimensional distributions of $P_{X|Z}$ (e.g., when both $X$ and $Z$ are images).

For Assumption 9, we allow the residual vector $Y - \mathbb{E}[Y|Z]$ to vary under local alternatives. This is a more general setting than in nonparametric regression models, where the residual is assumed to remain the same as under the null hypothesis; see Remark 10 in Appendix C. As suggested, we will include a brief discussion of these assumptions, particularly the more technical ones, in the revised version of our paper.
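To make the rate trade-off concrete (our restatement of the reply above, not a quotation from the paper): if the two nuisance errors decay at rates $n^{-\alpha_1}$ and $n^{-\alpha_2}$, double robustness means only their product must beat the parametric rate,

$$n^{-\alpha_1} \cdot n^{-\alpha_2} \;=\; n^{-(\alpha_1 + \alpha_2)} \;=\; o\!\left(n^{-1/2}\right) \quad\Longleftrightarrow\quad \alpha_1 + \alpha_2 > \tfrac{1}{2},$$

so, for instance, a parametric rate $\alpha_2 = 1/2$ for $\hat g_Y$ leaves only an arbitrarily slow consistency requirement ($\alpha_1 > 0$) on the estimated mean embedding of $P_{X|Z}$.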

References

Stone, C. J. (1982): "Optimal global rates of convergence for nonparametric regression." Ann. Statist.

Bauer and Kohler (2019): "On deep learning as a remedy for the curse of dimensionality in nonparametric regression." Ann. Statist.

Berrett et al. (2019): "The Conditional Permutation Test for Independence While Controlling for Confounders." Journal of the Royal Statistical Society Series B: Statistical Methodology

Official Review
Rating: 4

This paper develops a novel method to test for conditional mean independence that works well in high dimensions, gives asymptotic size control, and has nontrivial power against local alternatives. The method depends on using deep learning to learn g_Y and g_X using a bootstrap sample. They then use the test statistic in the unnumbered equation before equation (2), generate data under a null distribution, and use that to approximate the p-value. They establish these properties theoretically. They then test the performance on synthetic data with regression examples, along with real examples testing whether masking affects prediction accuracy.

Questions for Authors

Unless I missed it, why is the effect of the choice of B not evaluated?

Are there direct comparisons on the image data?

Claims and Evidence

The authors justify the claims that this method works well for testing CMI in high dimensions, both theoretically and empirically. In my opinion their evidence is sufficient, but some of the comparisons are disappointing. The authors don't really explore the quality of the neural network approximations, but given that proving theoretical guarantees even for more rudimentary tasks is difficult, I don't fault them for that. Still, I would have preferred some material in the appendix showing how changing some of the neural network parameters affects the results. It would also be nice to have a sense of how long these methods take to run.

Methods and Evaluation Criteria

Hypothesis testing is straightforward for evaluation criteria. Given that this is designed for high dimensions, high dimensional regression and image questions make sense. One of the problems I do have with this paper is that none of the competitor methods are used on the image data from what I can tell. I would have assumed that it would be a straightforward improvement.

Theoretical Claims

I checked the theoretical claims to the best of my ability.

Experimental Design and Analyses

I checked the analyses. I suppose this is the most appropriate category to point out that, from what I can tell, the paper doesn't analyze how the Monte Carlo method performs along with the number of bootstrap samples. This is important, because one possibility is that this is just a very high-powered but poorly sized test, and the errors introduced by the approximations compensate for that.

Supplementary Material

I made sure that there were files there.

Relation to Prior Literature

This paper improves on an extensive literature on conditional mean independence. Many kernel methods struggle with high-dimensional data. While most tests have size guarantees, at least some of the most pertinent methods also struggle to maintain power at a parametric rate, or at least to back it with theoretical guarantees.

Essential References Not Discussed

The references cited seem quite extensive.

Other Strengths and Weaknesses

None that are not previously described

Other Comments or Suggestions

Please don't put assumptions in the appendix. It's annoying to have to look there. Also, it's odd to have Assumptions 7 and 9 but not 1–6 and 8.

Author Response

We greatly appreciate your valuable comments, which have significantly contributed to improving the quality of our manuscript. Below, we provide point-by-point responses to your questions and will incorporate all of your suggestions in the revised version of our manuscript.

Sensitivity to network parameters and computation time. To evaluate how changes in neural network parameters affect the performance of the proposed test, we repeated the simulation for Example A1 with a fixed sample size of $n = 400$ but varied the network configurations. In the first case, we increased the number of hidden layers in both networks to two, keeping all other parameters unchanged. In the second case, we reduced the number of nodes in $\widehat{G}$ and $\widehat{g}_Y$ by half (to 56 and 128, respectively).

The results were consistent with those in Table 2 of the manuscript: empirical sizes of 5.8% and 6% (close to the original); size-adjusted power under the sparse alternative of 98.6% and 99.8%; and size-adjusted power under the dense alternative of 100% in both cases. This suggests that the test’s performance is robust to moderate changes in key network parameters.

For other hyperparameters (e.g., batch size, learning rate), we used the Optuna automated search package to optimize them by minimizing the loss function defined in Equation (4) of Appendix A.
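A minimal sketch of what such an Optuna search looks like (the search space, ranges, and the toy objective below are hypothetical placeholders; the actual objective is the loss in Equation (4) of Appendix A):

```python
import optuna

def objective(trial):
    # Hypothetical search space; the ranges are illustrative only.
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    # Stand-in for "train the GMMN with (lr, batch_size) and return its
    # validation loss"; here a toy quadratic is used so the sketch runs.
    return (lr - 1e-3) ** 2 + abs(batch_size - 64) * 1e-8

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```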

Regarding computation time, our method takes 41.0 seconds to complete one Monte Carlo simulation for Example A1 with sample size $n = 400$ on an NVIDIA T4 GPU, which is of the same order as the DNN-based test pMIT$_M$. Detailed computation times for our competitors under the same setting are listed below.

  • On an Intel Xeon CPU: DSP: 3.21 seconds; DSP$_M$: 14.9 seconds; DNN-pMIT: 5.13 seconds; DNN-pMIT$_e$: 5.13 seconds; DNN-pMIT$_M$: 25.0 seconds; DNN-pMIT$_{eM}$: 25.0 seconds.
  • On an 11th Gen Intel Core i7-11800H CPU: XGB-pMIT: 0.167 seconds; XGB-pMIT$_e$: 0.167 seconds; XGB-pMIT$_M$: 1.56 seconds; XGB-pMIT$_{eM}$: 1.56 seconds; PCM: 0.20 seconds; PCM$_M$: 1.07 seconds; VIM: 6.667 seconds.

Competitor methods on image data application. We applied some competing methods to the image datasets in our initial submission, but due to space limitations, we have included some details in the appendix. For the image data application in Section 4.1, we compared our method with the DSP$_M$ approach developed by Dai et al. (2022), where a similar application was examined. For the application in Section 4.2, we compared with the pMIT$_M$ method introduced by Cai et al. (2024). The p-values for these comparison methods are included in Figures 1 and 3, and a detailed comparison is provided in Appendix B.

Sensitivity to numbers of Monte Carlo data and bootstrap samples. We have conducted additional simulations to investigate how the number of Monte Carlo synthetic data ($M$) and the bootstrap number ($B$) influence the performance of our test. The results suggest that both the size and power of our testing procedure remain robust against these tuning parameters. Specifically, we repeated the simulation for Example A1 with $n = 400$, varying $M \in \{5, 20, 60\}$ and $B \in \{200, 500\}$. The results are presented in the following table.

| | $M=5$, $B=200$ | $M=5$, $B=500$ | $M=20$, $B=200$ | $M=20$, $B=500$ | $M=60$, $B=200$ | $M=60$, $B=500$ |
|---|---|---|---|---|---|---|
| Empirical Size (%) | 6.2 | 6.6 | 5.8 | 6.2 | 6.8 | 6.4 |
| Power under Sparse Alternative (%) | 99.2 | 99 | 99.6 | 99.6 | 98.6 | 99 |
| Power under Dense Alternative (%) | 100 | 100 | 100 | 100 | 100 | 100 |
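As additional context on the role of $B$: in a bootstrap-calibrated test, the p-value is computed from the $B$ resampled statistics, so $B$ mainly controls the resolution of the p-value (the smallest attainable value is $1/(B+1)$) rather than the statistic itself. A generic sketch of this standard construction (not the paper's exact procedure):

```python
import numpy as np

def bootstrap_p_value(t_obs, t_boot):
    """Generic bootstrap p-value: share of resampled statistics >= observed."""
    t_boot = np.asarray(t_boot)
    return (1 + np.sum(t_boot >= t_obs)) / (len(t_boot) + 1)

rng = np.random.default_rng(0)
p = bootstrap_p_value(2.1, rng.normal(size=500))  # e.g., B = 500 toy statistics
```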

Numbering and location of assumptions. Originally, we chose to place the assumptions in the Appendix to stay within page limits and to keep the focus on explaining the core ideas of our proposed test without introducing excessive technical details. For the next version, we plan to maintain them in the Appendix due to space constraints, but will add a few sentences in the main text to briefly summarize these assumptions. Regarding assumption labeling, we initially used the default formatting from the ICML LaTeX template, but we would be happy to relabel them separately from theorems and remarks if preferred.

Final Decision

This paper introduces a new nonparametric test for conditional mean independence (CMI) that works well in high-dimensional settings by leveraging deep generative neural networks to estimate the conditional mean. A new population-level CMI measure is proposed, and its sample version serves as the test statistic. To mitigate the estimation errors, the method combines sample splitting and cross-fitting with a generative moment matching network (GMMN) for sampling from the conditional distribution of covariates. The theory provides guarantees for both asymptotic size control and power under local alternatives. Numerical experiments support the theory.

One concern is the lack of comparisons on image data with competing methods. The authors addressed this in their response.

Overall, this work makes a significant contribution to the literature of testing CMI.