Hong Hyokyoung G, Zheng Qi, Li Yi
Department of Statistics and Probability, Michigan State University, 19 Red Cedar Road, East Lansing, MI 48823, USA.
Department of Bioinformatics and Biostatistics, University of Louisville, 485 East Gray Street, Louisville, KY 40202, USA.
J Multivar Anal. 2019 Sep;173:268-290. doi: 10.1016/j.jmva.2019.02.011. Epub 2019 Mar 5.
Forward regression, a classical variable screening method, has been widely used for model building when the number of covariates is relatively low. However, it is seldom used in high-dimensional settings because of its computational burden and unknown theoretical properties. Recent work has shown that forward regression, coupled with an extended Bayesian information criterion (EBIC)-based stopping rule, can consistently identify all relevant predictors in high-dimensional linear regression settings. However, these results are based on the residual sum of squares from linear models, and it is unclear whether forward regression can be applied to more general regression settings, such as Cox proportional hazards models. We introduce a forward variable selection procedure for Cox models. It selects important variables sequentially according to the increment in partial likelihood and terminates according to an EBIC stopping rule. To our knowledge, this is the first study to investigate partial likelihood-based forward regression in high-dimensional survival settings and to establish selection consistency results. We show that, if the dimension of the true model is finite, forward regression can discover all relevant predictors within a finite number of steps, and that their order of entry is determined by the size of the increment in partial likelihood. Because partial likelihood is not a regular density-based likelihood, we develop new theoretical results on partial likelihood and use them to establish the desired sure screening properties. The practical utility of the proposed method is examined via extensive simulations and an analysis of a subset of the Boston Lung Cancer Survival Cohort study, a hospital-based study for identifying biomarkers related to lung cancer patients' survival.
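The procedure summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cox_loglik_max`, `forward_cox`), the Newton-Raphson fitting routine, the no-ties assumption, and the specific EBIC form used here (penalty per variable of log d + 2γ log p, with d the number of observed events) are all assumptions chosen for illustration, and the simulated data are a toy example rather than anything from the paper.

```python
import numpy as np

def cox_loglik_max(X, time, event, n_iter=30, tol=1e-8):
    """Return the maximised Cox partial log-likelihood (assumes no tied times)."""
    order = np.argsort(-time)            # descending time: risk set of row i is rows 0..i
    X, event = X[order], event[order].astype(bool)
    k = X.shape[1]

    def pieces(beta):
        eta = X @ beta
        eta = eta - eta.max()            # stabilise exp(); the shift cancels in the likelihood
        w = np.exp(eta)
        s0 = np.cumsum(w)
        s1 = np.cumsum(w[:, None] * X, axis=0)
        s2 = np.cumsum(w[:, None, None] * X[:, :, None] * X[:, None, :], axis=0)
        mean = s1 / s0[:, None]
        ll = np.sum((eta - np.log(s0))[event])
        grad = np.sum((X - mean)[event], axis=0)
        info = np.sum((s2 / s0[:, None, None]
                       - mean[:, :, None] * mean[:, None, :])[event], axis=0)
        return ll, grad, info

    beta = np.zeros(k)
    ll, grad, info = pieces(beta)
    for _ in range(n_iter if k else 0):  # the null model needs no iterations
        step = np.linalg.solve(info + 1e-8 * np.eye(k), grad)
        new = pieces(beta + step)
        while new[0] < ll and np.abs(step).max() > 1e-10:  # step-halving safeguard
            step = step / 2
            new = pieces(beta + step)
        beta = beta + step
        gain = new[0] - ll
        ll, grad, info = new
        if gain < tol:
            break
    return ll

def forward_cox(X, time, event, gamma=1.0, max_steps=10):
    """Greedy forward selection by partial-likelihood gain, stopped by EBIC (assumed form)."""
    n, p = X.shape
    d = int(event.sum())                               # number of observed events
    penalty = np.log(d) + 2.0 * gamma * np.log(p)      # assumed EBIC penalty per variable
    selected = []
    ebic = -2.0 * cox_loglik_max(X[:, :0], time, event)  # null model
    for _ in range(min(max_steps, p)):
        rest = [j for j in range(p) if j not in selected]
        lls = [cox_loglik_max(X[:, selected + [j]], time, event) for j in rest]
        best = int(np.argmax(lls))                     # largest partial-likelihood increment
        new_ebic = -2.0 * lls[best] + (len(selected) + 1) * penalty
        if new_ebic >= ebic:                           # stop once EBIC no longer decreases
            break
        selected.append(rest[best])
        ebic = new_ebic
    return selected

# Toy data (hypothetical): 3 relevant covariates out of 20, independent censoring.
rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -1.0, 1.0]
t_event = rng.exponential(1.0 / np.exp(X @ beta_true))
t_cens = rng.exponential(3.0, size=n)
time = np.minimum(t_event, t_cens)
event = (t_event <= t_cens).astype(int)

selected = forward_cox(X, time, event)
print(sorted(selected))
```

Each step refits a Cox model for every remaining candidate, so the cost grows with the number of covariates; this sketch favors readability over speed and is only meant to make the select-then-check-EBIC loop concrete.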