Ma Baoshan, Meng Fanyu, Yan Ge, Yan Haowen, Chai Bingjie, Song Fengju
College of Information Science and Technology, Dalian Maritime University, Dalian, 116026, China.
Comput Biol Med. 2020 Jun;121:103761. doi: 10.1016/j.compbiomed.2020.103761. Epub 2020 Apr 16.
Accurate diagnostic classification of cancers can greatly help physicians choose surveillance and treatment strategies for patients. With the explosive growth of biological data, the shift from traditional biostatistical methods to computer-aided approaches has made machine learning an integral part of today's cancer prognosis prediction. In this work, we propose a classification model that leverages extreme gradient boosting (XGBoost) and increasingly complex multi-omics data to separate early-stage from late-stage cancers. We applied the XGBoost model to four kinds of cancer data downloaded from TCGA and compared its performance with that of other popular machine-learning methods. The experimental results showed that our method achieved predictive performance that was either statistically significantly better or comparable. The results also revealed that DNA methylation outperforms the other molecular data types (mRNA expression and miRNA expression) in accuracy and stability when discriminating between early-stage and late-stage groups. Furthermore, integrating multi-omics data with an autoencoder can improve the accuracy of cancer stage classification. Finally, we conducted bioinformatics analyses to assess the medical utility of the significant genes, ranked by their importance under the XGBoost algorithm. Extensive comparative experiments demonstrated that the XGBoost method performs remarkably well in predicting the stage of cancer patients from multi-omics data. Moreover, identifying novel candidate genes associated with cancer stage would help to further elucidate disease pathogenesis and to develop novel therapeutics.
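The core idea of the abstract, boosting many weak tree learners to separate two stage groups and then ranking features by how much they contribute, can be illustrated with a minimal sketch. The code below is not the authors' pipeline and is not XGBoost itself: it is a simplified first-order gradient boosting with depth-1 trees (stumps) in plain Python, run on synthetic data rather than TCGA. The function names, the toy data generator, and all hyperparameter values are assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Return (gain, feature, threshold, left_value, right_value) for the best
    single split under squared error, or None if no split is possible."""
    mean = sum(residuals) / len(residuals)
    base_sse = sum((r - mean) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            gain = base_sse - sse  # reduction in squared error from this split
            if best is None or gain > best[0]:
                best = (gain, j, t, lm, rm)
    return best


def fit_boosting(X, y, n_rounds=15, learning_rate=0.3):
    """Gradient boosting for log-loss: each stump is fitted to the residuals
    y - p, which are the negative gradient of the loss w.r.t. the raw score."""
    p = sum(y) / len(y)
    bias = math.log(p / (1.0 - p))  # assumes both classes are present
    scores = [bias] * len(X)
    stumps = []
    importance = [0.0] * len(X[0])  # total split gain per feature
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        gain, j, t, lv, rv = stump
        stumps.append((j, t, lv, rv))
        importance[j] += gain
        scores = [s + learning_rate * (lv if row[j] <= t else rv)
                  for s, row in zip(scores, X)]
    return (bias, stumps, learning_rate), importance


def predict(model, row):
    bias, stumps, learning_rate = model
    score = bias + sum(learning_rate * (lv if row[j] <= t else rv)
                       for j, t, lv, rv in stumps)
    return 1 if sigmoid(score) >= 0.5 else 0


def make_toy_data(n, n_features, rng):
    """Synthetic stand-in for omics features: feature 0 carries the stage
    signal (0 = early, 1 = late); the remaining features are pure noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] = rng.gauss(1.0 if label else -1.0, 0.5)
        X.append(row)
        y.append(label)
    return X, y


rng = random.Random(0)
X_train, y_train = make_toy_data(80, 5, rng)
X_test, y_test = make_toy_data(40, 5, rng)

model, importance = fit_boosting(X_train, y_train)
accuracy = sum(predict(model, row) == label
               for row, label in zip(X_test, y_test)) / len(y_test)
ranking = sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)

print("held-out accuracy:", accuracy)
print("features ranked by total split gain:", ranking)
```

In this sketch the per-feature importance is the summed squared-error reduction over all splits that use the feature, which is one common way to rank features in boosted trees. XGBoost additionally uses second-order gradient information and explicit regularisation, so its scores and rankings on real data would be computed differently.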