Mohammed Akram, Biegert Greyson, Adamec Jiri, Helikar Tomáš
Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, USA.
Oncotarget. 2017 Sep 21;8(49):85692-85715. doi: 10.18632/oncotarget.21127. eCollection 2017 Oct 17.
Machine learning techniques for cancer prediction and biomarker discovery can hasten cancer detection and significantly improve prognosis. Recent "OMICS" studies that combine a variety of cancer and normal tissue samples with machine learning approaches have the potential to further accelerate such discovery. To demonstrate this potential, 2,175 gene expression samples from nine tissue types were obtained to identify gene sets whose expression is characteristic of each cancer class. Using random forests classification and ten-fold cross-validation, we developed nine single-tissue classifiers, two multi-tissue cancer-versus-normal classifiers, and one multi-tissue normal classifier. Given a sample of a specified tissue type, the single-tissue models classified the sample as cancer or normal with a testing accuracy between 85.29% and 100%. Given a sample of non-specific tissue type, the multi-tissue bi-class model classified the sample as cancer or normal with a testing accuracy of 97.89%, and the multi-tissue multi-class model classified the sample both as cancer or normal and as a specific tissue type with a testing accuracy of 97.43%. Given a normal sample of any of the nine tissue types, the multi-tissue normal model classified the sample as a particular tissue type with a testing accuracy of 97.35%. The machine learning classifiers developed in this study identify potential cancer biomarkers with sensitivity and specificity that exceed those of existing biomarkers, and they point to pathways that are critical to tissue-specific tumor development. This study demonstrates the feasibility of predicting the tissue origin of carcinoma in the context of multiple cancer classes.
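The evaluation protocol named in the abstract (ten-fold cross-validation reporting testing accuracy, sensitivity, and specificity) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the function names (`k_fold_indices`, `fit_centroids`, `predict`, `cross_validate`) are hypothetical, and a nearest-centroid rule stands in for the random forests classifier purely to keep the example dependency-free. In practice a random forest implementation would replace the stand-in.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier (assumption, NOT a random forest): per-class mean vector."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def cross_validate(X, y, k=10, positive="cancer", seed=0):
    """Pool held-out predictions over k folds and summarize them."""
    tp = tn = fp = fn = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        # Train only on samples outside the current fold.
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = fit_centroids(train_X, train_y)
        for i in test_idx:
            pred = predict(model, X[i])
            if y[i] == positive:
                tp += pred == positive
                fn += pred != positive
            else:
                tn += pred != positive
                fp += pred == positive
    return {
        "accuracy": (tp + tn) / len(X),
        "sensitivity": tp / (tp + fn),  # fraction of cancer samples detected
        "specificity": tn / (tn + fp),  # fraction of normal samples kept normal
    }


# Synthetic two-feature "expression" data: illustrative only, not from the study.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
X += [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(50)]
y = ["normal"] * 50 + ["cancer"] * 50

metrics = cross_validate(X, y)
print(metrics)
```

Each sample is held out exactly once, so the pooled counts give a single testing accuracy over the whole data set, which is the kind of figure the abstract reports for each classifier.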