Sinnott Jennifer A, Cai Tianxi
Department of Biostatistics, Harvard School of Public Health, 655 Huntington Avenue, Boston, Massachusetts 02115, U.S.A.
Biometrics. 2013 Dec;69(4):861-73. doi: 10.1111/biom.12098. Epub 2013 Nov 6.
Integrating genomic information with traditional clinical risk factors to improve the prediction of disease outcomes could profoundly change the practice of medicine. However, the large number of potential markers and the possible complexity of the relationship between markers and disease make it difficult to construct accurate risk prediction models. Standard approaches for identifying important markers often rely on marginal associations or linearity assumptions and may not capture nonlinear or interactive effects. In recent years, much work has been done to group genes into pathways and networks. Integrating such biological knowledge into statistical learning could potentially improve model interpretability and reliability. One effective approach is to employ a kernel machine (KM) framework, which can capture nonlinear effects if nonlinear kernels are used (Schölkopf and Smola, 2002; Liu et al., 2007, 2008). For survival outcomes, KM regression modeling and testing procedures have been derived under a proportional hazards (PH) assumption (Li and Luan, 2003; Cai, Tonini, and Lin, 2011). In this article, we derive testing and prediction methods for KM regression under the accelerated failure time (AFT) model, a useful alternative to the PH model. We approximate the null distribution of our test statistic using resampling procedures. When multiple kernels are of potential interest, it may be unclear in advance which kernel to use for testing and estimation. We propose a robust omnibus test that combines information across kernels, and an approach for selecting the best kernel for estimation. The methods are illustrated with an application in breast cancer.
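To make the idea of combining evidence across kernels concrete, the following is a minimal Python sketch, not the authors' procedure. It makes several simplifying assumptions that are not taken from the article: censoring is ignored (all event times are treated as observed), the null model contains only an intercept on the log-time scale, the null distribution is approximated by permuting residuals rather than by the resampling scheme of the paper, and the kernels are combined through a minimum p-value rule. The function names and the bandwidth parameter `rho` are hypothetical.

```python
import numpy as np

def linear_kernel(X):
    """Linear kernel: K[i, j] = <x_i, x_j>."""
    return X @ X.T

def gaussian_kernel(X, rho):
    """Gaussian kernel: K[i, j] = exp(-||x_i - x_j||^2 / rho).
    rho is a hypothetical bandwidth chosen by the user."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(d2, 0.0) / rho)

def omnibus_kernel_test(log_t, kernels, n_perm=500, seed=0):
    """Score-type statistic Q = r' K r for each kernel, with a
    permutation null and a min-p combination across kernels.
    ASSUMPTION: no censoring; intercept-only null model."""
    rng = np.random.default_rng(seed)
    r = log_t - log_t.mean()  # residuals under the null model
    obs = np.array([r @ K @ r for K in kernels])

    # Null statistics: the same permutation is applied to every kernel,
    # which preserves the dependence between the kernel-specific tests.
    null = np.empty((n_perm, len(kernels)))
    for b in range(n_perm):
        rp = rng.permutation(r)
        null[b] = [rp @ K @ rp for K in kernels]

    # Per-kernel permutation p-values for the observed data.
    p_obs = (1 + (null >= obs).sum(axis=0)) / (n_perm + 1)

    # Convert each null draw to a p-value via its rank within the column
    # (rank 0 is the smallest statistic, hence the largest p-value).
    ranks = null.argsort(axis=0).argsort(axis=0)
    p_null = (n_perm - ranks) / n_perm

    # Omnibus p-value: how often the null minimum p-value is at least
    # as small as the observed minimum p-value.
    min_p_obs = p_obs.min()
    min_p_null = p_null.min(axis=1)
    p_omni = (1 + (min_p_null <= min_p_obs).sum()) / (n_perm + 1)
    return p_obs, p_omni
```

Calling `omnibus_kernel_test(log_t, [linear_kernel(X), gaussian_kernel(X, rho=5.0)])` returns one p-value per kernel and a single combined p-value. Calibrating the minimum p-value against the same permutations avoids a Bonferroni-style correction, which would be conservative when the kernels give correlated statistics.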