Cai T, Huang J, Tian L
Department of Biostatistics, Harvard University, Boston, Massachusetts 02115, USA.
Biometrics. 2009 Jun;65(2):394-404. doi: 10.1111/j.1541-0420.2008.01074.x.
In the presence of high-dimensional predictors, it is challenging to develop reliable regression models that can be used to accurately predict future outcomes. Further complications arise when the outcome of interest is an event time, which is often not fully observed due to censoring. In this article, we develop robust prediction models for event time outcomes by regularizing Gehan's estimator for the accelerated failure time (AFT) model (Tsiatis, 1996, Annals of Statistics 18, 305-328) with the least absolute shrinkage and selection operator (LASSO) penalty. Unlike existing methods based on inverse probability weighting and the Buckley and James estimator (Buckley and James, 1979, Biometrika 66, 429-436), the proposed approach does not require additional assumptions about the censoring and always yields a convergent solution. Furthermore, the proposed estimator leads to a stable regression model for prediction even if the AFT model fails to hold. To facilitate the adaptive selection of the tuning parameter, we detail an efficient numerical algorithm for obtaining the entire regularization path. The proposed procedures are applied to a breast cancer dataset to derive a reliable regression model for predicting patient survival based on a set of clinical prognostic factors and gene signatures. The finite-sample performance of the procedures is evaluated through a simulation study.
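For readers unfamiliar with the objective being regularized, the following is a minimal sketch of a Gehan-type loss with an L1 penalty, not the authors' implementation. It assumes the standard pairwise form of the Gehan loss on AFT residuals e_i = log T_i - beta'Z_i, namely the sum over pairs of delta_i * max(e_j - e_i, 0), scaled here by 1/n^2. The function names, the scaling, and the penalized (rather than constrained) form are illustrative assumptions; the regularization-path algorithm described in the abstract is not reproduced.

```python
import numpy as np

def gehan_loss(beta, log_time, event, Z):
    """Gehan-type loss on AFT residuals (hypothetical helper, 1/n^2 scaling assumed).

    log_time : log of observed (possibly censored) times, shape (n,)
    event    : 1.0 if the event was observed, 0.0 if censored, shape (n,)
    Z        : covariate matrix, shape (n, p)
    """
    resid = log_time - Z @ beta                 # e_i = log T_i - beta' Z_i
    diff = resid[:, None] - resid[None, :]      # diff[i, j] = e_i - e_j
    neg_part = np.maximum(-diff, 0.0)           # max(e_j - e_i, 0)
    # Only pairs whose first member is an observed event contribute.
    return float((event[:, None] * neg_part).sum() / len(resid) ** 2)

def lasso_gehan_objective(beta, log_time, event, Z, lam):
    """Gehan loss plus an L1 penalty with tuning parameter lam (assumed form)."""
    return gehan_loss(beta, log_time, event, Z) + lam * np.abs(beta).sum()
```

Both terms are convex and piecewise linear in beta, so a minimizer can be sought with linear programming or subgradient methods. In practice the tuning parameter would be chosen adaptively, for example by cross-validation.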