Liu Cong, Wang Xujun, Genchev Georgi Z, Lu Hui
Department of Bioengineering, University of Illinois at Chicago, Chicago, USA; Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai, China.
SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China; Department of Bioinformatics and Biostatistics, Shanghai Jiaotong University, Shanghai, China.
Methods. 2017 Jul 15;124:100-107. doi: 10.1016/j.ymeth.2017.06.010. Epub 2017 Jun 13.
New developments in high-throughput genomic technologies have enabled the measurement of diverse types of omics biomarkers in a cost-efficient and clinically feasible manner. Developing computational methods and tools to analyze such genomic data and translate it into clinically relevant information is an active area of investigation. For example, several studies have used an unsupervised learning framework to cluster patients by integrating omics data. Despite these advances, predicting cancer prognosis from integrated omics biomarkers remains a challenge, and there is a shortage of computational tools that do so with supervised learning methods. The current standard approach is to concatenate the different types of omics data into a single linear predictor and fit a Cox regression model, optionally with a penalty term for feature selection. A more powerful approach, however, would be to integrate the data by considering the relationships among omics data types.
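For orientation, the penalized concatenation baseline described above can be written as an L1-penalized Cox partial likelihood. The notation here is an assumption introduced for illustration and does not come from the abstract: x_i is the concatenated feature vector of patient i, δ_i the event indicator, R(t_i) the set of patients still at risk at time t_i, and λ the penalty strength; tied event times are ignored.

```latex
\hat{\beta} = \arg\min_{\beta}\;
  -\frac{1}{n}\sum_{i:\,\delta_i = 1}
  \Big[ x_i^{\top}\beta - \log \sum_{j \in R(t_i)} \exp\big(x_j^{\top}\beta\big) \Big]
  + \lambda \sum_{k=1}^{p} |\beta_k|,
\qquad
x_i = \big(x_i^{\mathrm{mRNA}},\, x_i^{\mathrm{meth}},\, x_i^{\mathrm{CNV}}\big)
```

In this form every feature receives the same penalty λ regardless of which omics layer it comes from, so relationships between the layers play no role in the fit.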
Here we developed two methods, SKI-Cox and wLASSO-Cox, that incorporate the association among different types of omics data. Both methods fit the Cox proportional hazards model and predict a risk score based on mRNA expression profiles. SKI-Cox borrows the information generated by the additional types of omics data to guide variable selection, while wLASSO-Cox incorporates this information as a penalty factor during model fitting.
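The penalty-factor idea can be illustrated with a minimal sketch of a weighted-LASSO Cox fit, in which feature k is penalized by λ·w_k instead of a common λ. This is not the authors' implementation: the proximal-gradient solver, the function names, the simulated data, and the rule that turns a hypothetical per-gene `support` score (standing in for evidence from methylation or copy number data) into penalty factors are all assumptions made for illustration.

```python
import numpy as np


def cox_nll_and_grad(beta, X, time, event):
    """Average negative log partial likelihood and its gradient.

    Assumes no tied event times (Breslow form without tie handling).
    """
    order = np.argsort(-time)          # descending time: prefix sums = risk sets
    Xs, es = X[order], event[order]
    eta = Xs @ beta
    eta = eta - eta.max()              # stabilizing shift; cancels in the loss
    w = np.exp(eta)
    cum_w = np.cumsum(w)               # sum of exp(eta_j) over j with t_j >= t_i
    cum_wx = np.cumsum(w[:, None] * Xs, axis=0)
    n = X.shape[0]
    loss = -np.sum(es * (eta - np.log(cum_w))) / n
    grad = -(es[:, None] * (Xs - cum_wx / cum_w[:, None])).sum(axis=0) / n
    return loss, grad


def soft_threshold(z, t):
    """Proximal operator of the (weighted) L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_weighted_lasso_cox(X, time, event, lam, penalty_factor, n_iter=500):
    """Proximal gradient descent with backtracking.

    Feature k is penalized by lam * penalty_factor[k].
    """
    beta = np.zeros(X.shape[1])
    step = 1.0
    for _ in range(n_iter):
        loss, grad = cox_nll_and_grad(beta, X, time, event)
        while True:
            candidate = soft_threshold(beta - step * grad,
                                       step * lam * penalty_factor)
            diff = candidate - beta
            new_loss, _ = cox_nll_and_grad(candidate, X, time, event)
            # Standard sufficient-decrease check for the smooth part.
            if new_loss <= loss + grad @ diff + diff @ diff / (2 * step) + 1e-12:
                break
            step *= 0.5
        beta = candidate
    return beta


def simulate_survival(n, beta_true, seed=0):
    """Toy data: exponential event times with random exponential censoring."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, len(beta_true)))
    t_event = rng.exponential(1.0 / np.exp(X @ beta_true))
    t_cens = rng.exponential(2.0, size=n)
    time = np.minimum(t_event, t_cens)
    event = (t_event <= t_cens).astype(float)
    return X, time, event


if __name__ == "__main__":
    beta_true = np.array([1.0, -1.0, 0.5] + [0.0] * 17)
    X, time, event = simulate_survival(300, beta_true)
    # Hypothetical support score in [0, 1]: how strongly other omics layers
    # back each gene. Here it is simply set by hand for the demonstration.
    support = np.where(beta_true != 0, 0.9, 0.1)
    penalty_factor = 1.0 / (1.0 + support)   # assumed mapping, not the paper's
    beta_hat = fit_weighted_lasso_cox(X, time, event, lam=0.05,
                                      penalty_factor=penalty_factor)
    print("selected features:", np.flatnonzero(beta_hat))
```

Setting every penalty factor to 1 recovers an ordinary LASSO-Cox fit, which makes the two easy to compare on the same data. The risk score for a patient is then the linear predictor `X @ beta_hat`.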
We show that SKI-Cox and wLASSO-Cox models select more true variables than a LASSO-Cox model in simulation studies. We assess the performance of SKI-Cox and wLASSO-Cox using TCGA glioblastoma multiforme and lung adenocarcinoma data. In each case, mRNA expression, methylation, and copy number variation data are integrated to predict the overall survival time of cancer patients. Our methods achieve better performance in predicting patients' survival in glioblastoma and lung adenocarcinoma.