Huang Tzu-Jung, Luedtke Alex, McKeague Ian W
Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center.
Department of Statistics, University of Washington.
Ann Stat. 2023 Oct;51(5):1965-1988. doi: 10.1214/23-aos2313. Epub 2023 Dec 14.
This paper develops a new approach to post-selection inference for screening high-dimensional predictors of survival outcomes. Post-selection inference for right-censored outcome data has been investigated in the literature, but much remains to be done to make the methods both reliable and computationally scalable in high dimensions. Machine learning tools are commonly used to provide predictions of survival outcomes, but the estimated effect of a selected predictor suffers from confirmation bias unless the selection is taken into account. The new approach involves the construction of semi-parametrically efficient estimators of the linear association between the predictors and the survival outcome, which are used to build a test statistic for detecting the presence of an association between any of the predictors and the outcome. Further, a stabilization technique reminiscent of bagging allows a normal calibration for the resulting test statistic, which enables the construction of confidence intervals for the maximal association between predictors and the outcome and also greatly reduces computational cost. Theoretical results show that this testing procedure is valid even when the number of predictors grows superpolynomially with sample size, and our simulations support this asymptotic guarantee at moderate sample sizes. The new approach is applied to the problem of identifying patterns in viral gene expression associated with the potency of an antiviral drug.
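The confirmation bias mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it assumes an uncensored outcome, uses plain sample correlations rather than semi-parametrically efficient estimators, and substitutes simple sample splitting for the paper's stabilization technique. The function names `max_marginal_correlation` and `simulate` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def max_marginal_correlation(X, y):
    """Return (index, correlation) of the predictor whose sample
    correlation with y is largest in absolute value."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    j = int(np.argmax(np.abs(corr)))
    return j, float(corr[j])


def simulate(n=200, p=1000, reps=50, seed=0):
    """Compare the naive estimate for the selected predictor with an
    estimate computed on held-out data, under a global null."""
    rng = np.random.default_rng(seed)
    naive, split = [], []
    half = n // 2
    for _ in range(reps):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # independent of X: every true correlation is 0

        # Naive: select and estimate on the same data.
        _, r = max_marginal_correlation(X, y)
        naive.append(abs(r))

        # Sample splitting (an illustrative stand-in, not the paper's method):
        # select on the first half, re-estimate on the second half.
        j, _ = max_marginal_correlation(X[:half], y[:half])
        r_holdout = np.corrcoef(X[half:, j], y[half:])[0, 1]
        split.append(abs(r_holdout))
    return float(np.mean(naive)), float(np.mean(split))


naive_mean, split_mean = simulate()
print(f"mean |correlation|, same-data estimate: {naive_mean:.3f}")
print(f"mean |correlation|, held-out estimate:  {split_mean:.3f}")
```

Although no predictor is truly associated with the outcome, the same-data estimate for the selected predictor is systematically far from zero, while the held-out estimate is not. Sample splitting removes the bias at the cost of using only part of the data for each step, which is one motivation for more efficient post-selection procedures such as the one the abstract describes.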