Liu Molei, Katsevich Eugene, Janson Lucas, Ramdas Aaditya
Department of Biostatistics, Harvard Chan School of Public Health, 677 Huntington Avenue, Boston, Massachusetts 02115, U.S.A.
Department of Statistics and Data Science, Wharton School of the University of Pennsylvania, 265 South 37th Street, Philadelphia, Pennsylvania 19104, U.S.A.
Biometrika. 2022 Jun;109(2):277-293. doi: 10.1093/biomet/asab039. Epub 2021 Jul 8.
We consider the problem of conditional independence testing: given a response Y and covariates (X, Z), we test the null hypothesis that Y ⫫ X | Z. The conditional randomization test was recently proposed as a way to use distributional information about X | Z to exactly and nonasymptotically control Type-I error using any test statistic in any dimensionality without assuming anything about Y | (X, Z). This flexibility, in principle, allows one to derive powerful test statistics from complex prediction algorithms while maintaining statistical validity. Yet the direct use of such advanced test statistics in the conditional randomization test is prohibitively computationally expensive, especially with multiple testing, due to the requirement to recompute the test statistic many times on resampled data. We propose the distilled conditional randomization test, a novel approach to using state-of-the-art machine learning algorithms in the conditional randomization test while drastically reducing the number of times those algorithms need to be run, thereby taking advantage of their power and the conditional randomization test's statistical guarantees without suffering the usual computational expense. In addition to distillation, we propose a number of other tricks, like screening and recycling computations, to further speed up the conditional randomization test without sacrificing its high power and exact validity. Indeed, we show in simulations that all our proposals combined lead to a test that has similar power to the most powerful existing conditional randomization test implementations, but requires orders of magnitude less computation, making it a practical tool even for large datasets. We demonstrate these benefits on a breast cancer dataset by identifying biomarkers related to cancer stage.
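As a rough illustration of the computational contrast described above, the sketch below compares a vanilla conditional randomization test loop, which recomputes its statistic on every resample, with a distilled variant that fits the Y-on-Z model only once. This is a toy example, not the paper's implementation: the Gaussian X | Z model with known mean, the simple least-squares "distillation" fit standing in for a machine learning model, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X | Z is Gaussian with known conditional mean mu_x = Z and
# unit variance; Y genuinely depends on X, so the null Y indep. X | Z is false.
n = 300
z = rng.standard_normal(n)
x = z + rng.standard_normal(n)
y = 0.5 * x + z + rng.standard_normal(n)
mu_x, sigma_x = z, 1.0

def crt_pvalue(y, x, stat, n_resamples=500):
    """Vanilla CRT: resample X | Z and recompute the (possibly costly)
    test statistic once per resample."""
    t_obs = stat(y, x)
    t_null = [stat(y, mu_x + sigma_x * rng.standard_normal(n))
              for _ in range(n_resamples)]
    return (1 + sum(t >= t_obs for t in t_null)) / (1 + n_resamples)

def dcrt_pvalue(y, x, n_resamples=500):
    """Distilled CRT sketch: "distill" Y's dependence on Z with a single
    expensive fit, then each resample only needs a cheap inner product
    with the distilled residual."""
    beta = np.polyfit(z, y, 1)        # the one expensive fit, done once
    d_y = y - np.polyval(beta, z)     # distilled residual of Y given Z
    cheap = lambda xs: abs(np.dot(d_y, xs - mu_x))
    t_obs = cheap(x)
    t_null = [cheap(mu_x + sigma_x * rng.standard_normal(n))
              for _ in range(n_resamples)]
    return (1 + sum(t >= t_obs for t in t_null)) / (1 + n_resamples)

p_crt = crt_pvalue(y, x, lambda yy, xs: abs(np.dot(yy - yy.mean(), xs - mu_x)))
p_dcrt = dcrt_pvalue(y, x)
print(p_crt, p_dcrt)  # both should be small here, since the null is false
```

The point of the contrast is where the model fit sits: in `crt_pvalue` any learned component of `stat` would be retrained inside the resampling loop, whereas `dcrt_pvalue` pays for the fit once and the per-resample cost collapses to an O(n) inner product.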