Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL, USA.
Committee on Developmental Biology and Regenerative Medicine, The University of Chicago, Chicago, IL, USA.
J Biomed Inform. 2019 Aug;96:103247. doi: 10.1016/j.jbi.2019.103247. Epub 2019 Jul 2.
Extracting genetic information from a full range of sequencing data is important for understanding disease. We propose a novel method to effectively explore the landscape of genetic mutations and aggregate them to predict cancer type.
We applied non-smooth non-negative matrix factorization (nsNMF) and a support vector machine (SVM) to utilize the full range of sequencing data, aiming to better aggregate genetic mutations and improve their power to predict disease type. More specifically, we introduce a novel classifier that distinguishes cancer types using somatic mutations obtained from whole-exome sequencing data. Mutations were identified from multiple cancers, scored using SIFT, PP2, and CADD, and collapsed at the individual gene level. nsNMF was then applied to reduce dimensionality and obtain the coefficient and basis matrices. A feature matrix derived from these matrices was used to train an SVM classifier for cancer type classification.
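The factorization step can be illustrated with a minimal sketch. It assumes a non-negative gene-by-sample matrix `V` and the usual nsNMF form V ≈ W S H, where S = (1 − θ)I + (θ/k)11ᵀ is the smoothing matrix. The function name `nsnmf`, the Frobenius-norm multiplicative updates, the value of `theta`, and the toy data are all assumptions made for illustration; the abstract does not state the objective, the rank, or how the feature matrix was built, and this sketch omits the column normalization of W used in some nsNMF formulations.

```python
import numpy as np

def nsnmf(V, k, theta=0.5, n_iter=200, seed=0, eps=1e-9):
    """Sketch of nsNMF: factor non-negative V (genes x samples) as W @ S @ H.

    W is the basis matrix (genes x k), H the coefficient matrix (k x samples),
    and S the fixed smoothing matrix controlled by theta in [0, 1].
    Uses Frobenius-norm multiplicative updates (an assumption, see lead-in).
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    # Smoothing matrix: identity when theta = 0, uniform averaging when theta = 1.
    S = (1.0 - theta) * np.eye(k) + (theta / k) * np.ones((k, k))
    for _ in range(n_iter):
        SH = S @ H                      # update W against the smoothed coefficients
        W *= (V @ SH.T) / (W @ (SH @ SH.T) + eps)
        WS = W @ S                      # update H against the smoothed basis
        H *= (WS.T @ V) / (WS.T @ WS @ H + eps)
    return W, S, H

# Toy usage with synthetic counts (hypothetical data, not from the study).
toy_rng = np.random.default_rng(42)
V_toy = toy_rng.poisson(lam=2.0, size=(50, 12)).astype(float)
W_toy, S_toy, H_toy = nsnmf(V_toy, k=3)
sample_features = H_toy.T  # one row per sample; one possible input to a classifier
```

Using the transposed coefficient matrix as per-sample features is only one plausible reading of "a feature matrix was derived from the obtained matrices"; a classifier such as an SVM would then be trained on those rows.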
The classifier distinguished four cancer types with reasonable accuracy. In five-fold cross-validation using mutation counts as features, the average prediction accuracy was 80% (SEM = 0.1%), significantly outperforming the baselines and also outperforming models that used mutation scores as features.
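For readers unfamiliar with the reporting convention, the following sketch shows how a mean accuracy and its standard error of the mean (SEM) are computed from a set of accuracy values. The helper `mean_and_sem` and the example values are hypothetical; the abstract does not say whether the SEM was taken across folds or across repeated cross-validation runs.

```python
import math
import statistics

def mean_and_sem(values):
    """Return the mean and the standard error of the mean (sample SD / sqrt(n))."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Hypothetical accuracies, for illustration only (not results from the study).
example_accuracies = [0.79, 0.80, 0.81, 0.80, 0.80]
mean_acc, sem_acc = mean_and_sem(example_accuracies)
print(f"accuracy = {mean_acc:.3f} (SEM = {sem_acc:.4f})")
```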
Using the factor matrices derived from the nsNMF, we identified multiple genes and pathways that are significantly associated with each cancer type. This study presents a generic and complete pipeline to study the associations between somatic mutations and cancers. The proposed method can be adapted to other studies for disease status classification and pathway discovery.