Li Hongzhe, Luan Yihui
Rowe Program in Human Genetics, University of California, Davis, CA 95616, USA.
Bioinformatics. 2005 May 15;21(10):2403-9. doi: 10.1093/bioinformatics/bti324. Epub 2005 Feb 15.
An important area of research in the postgenomic era is relating high-dimensional genetic or genomic data to the clinical phenotypes of patients. Because time to certain clinical events varies widely among patients, studying possibly censored survival phenotypes can be more informative than treating the phenotypes as categorical variables. High dimensionality and censoring make building a predictive model for time to event more difficult than the corresponding classification or linear regression problem. We propose a boosting procedure using smoothing splines for estimating general proportional hazards models. Such a procedure can potentially be used to identify non-linear effects of genes on the risk of developing an event.
Our empirical simulation studies showed that the procedure can indeed recover the true functional forms of the covariates and can identify important variables that are related to the risk of an event. Results from predicting survival after chemotherapy for patients with diffuse large B-cell lymphoma demonstrate that the proposed method can be used for identifying important genes that are related to time to death due to cancer and for building a parsimonious model for predicting the survival of future patients. In addition, there is clear evidence of non-linear effects of some genes on survival time.
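The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes component-wise gradient boosting of the Cox partial log-likelihood with no tied event times, and it swaps in a penalized cubic regression spline (truncated power basis) for a true smoothing spline. All function names, the knot placement, and the parameters `nu`, `lam`, and `n_knots` are hypothetical choices made for illustration.

```python
import numpy as np

def partial_loglik(time, event, F):
    """Cox partial log-likelihood for risk scores F (assumes no tied times)."""
    at_risk = (time[None, :] >= time[:, None]).astype(float)  # [i, j]: t_j >= t_i
    return float(np.sum(event * (F - np.log(at_risk @ np.exp(F)))))

def cox_gradient(time, event, F):
    """Gradient of the partial log-likelihood w.r.t. F (martingale residuals)."""
    at_risk = (time[None, :] >= time[:, None]).astype(float)
    S = at_risk @ np.exp(F)              # risk-set sums S_i
    H = at_risk.T @ (event / S)          # Breslow-type cumulative baseline hazard
    return event - np.exp(F) * H

def spline_basis(x, knots):
    """Cubic truncated power basis: 1, x, x^2, x^3, (x - k)_+^3."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_penalized_spline(B, u, lam):
    """Ridge-penalized least squares; only the knot coefficients are penalized."""
    penalty = np.zeros(B.shape[1])
    penalty[4:] = lam
    return np.linalg.solve(B.T @ B + np.diag(penalty), B.T @ u)

def boost_cox(X, time, event, n_steps=30, nu=0.1, n_knots=4, lam=1.0):
    """Component-wise boosting: each step updates the single best covariate."""
    n, p = X.shape
    probs = np.linspace(0, 1, n_knots + 2)[1:-1]   # interior quantile knots
    bases = [spline_basis(X[:, j], np.quantile(X[:, j], probs)) for j in range(p)]
    F = np.zeros(n)
    selected = []
    for _ in range(n_steps):
        u = cox_gradient(time, event, F)
        best_rss, best_j, best_fit = np.inf, -1, None
        for j, B in enumerate(bases):
            fit = B @ fit_penalized_spline(B, u, lam)
            rss = float(np.sum((u - fit) ** 2))
            if rss < best_rss:
                best_rss, best_j, best_fit = rss, j, fit
        F = F + nu * best_fit            # shrunken update of the additive predictor
        selected.append(best_j)
    return F, selected

def simulate(n=300, p=5, seed=0):
    """Toy data: only covariate 0 matters, through a quadratic effect."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(n, p))
    f = 3.0 * X[:, 0] ** 2
    t_event = rng.exponential(1.0 / np.exp(f))
    t_cens = rng.exponential(1.0, size=n)
    time = np.minimum(t_event, t_cens)
    event = (t_event <= t_cens).astype(float)
    return X, time, event
```

Calling `boost_cox(*simulate())` returns the fitted risk scores and the sequence of selected covariate indices. In a sketch like this, variable selection comes from which covariates are ever chosen, and the number of boosting steps acts as the main tuning parameter, which would normally be picked by cross-validation.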