Department of Bioengineering, Stanford University, Stanford, CA 94305, USA.
Genomics. 2011 Oct;98(4):310-7. doi: 10.1016/j.ygeno.2011.06.010. Epub 2011 Jul 7.
High-throughput genotyping and sequencing techniques are rapidly and inexpensively providing large amounts of human genetic variation data. Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability and have been implicated in several human diseases, including cancer. Amino acid mutations resulting from non-synonymous SNPs in coding regions may cause changes in protein function that affect cell proliferation. In this study, we developed a machine learning approach to predict cancer-causing missense variants. We present a Support Vector Machine (SVM) classifier trained on a set of 3163 cancer-causing variants and an equal number of neutral polymorphisms. The method achieves 93% overall accuracy, a correlation coefficient of 0.86, and an area under the ROC curve of 0.98. Compared with previously developed algorithms such as SIFT and CHASM, our method achieves higher prediction accuracy and a higher correlation coefficient in identifying cancer-causing variants.
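The relationship between the reported accuracy and correlation coefficient can be illustrated with a short sketch. It assumes that the "correlation coefficient" is the Matthews correlation coefficient (MCC), which the abstract does not state explicitly. The confusion-matrix counts below are hypothetical, chosen only to match a balanced set of 3163 + 3163 variants at roughly 93% accuracy; they are not the study's actual results.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional value when any row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts (assumption, not taken from the paper):
# 3163 cancer-causing variants and 3163 neutral polymorphisms,
# with the same number of errors in each class.
TP, FN = 2942, 221  # cancer-causing variants: correctly / incorrectly classified
TN, FP = 2942, 221  # neutral polymorphisms: correctly / incorrectly classified

acc = accuracy(TP, TN, FP, FN)
mcc = matthews_corrcoef(TP, TN, FP, FN)
print(f"accuracy = {acc:.2f}, MCC = {mcc:.2f}")  # accuracy = 0.93, MCC = 0.86
```

Under these assumptions (balanced classes and equal error counts in each class), the MCC reduces to 2 × accuracy − 1, so an accuracy of 0.93 and a correlation coefficient of 0.86 are mutually consistent. The area under the ROC curve cannot be derived from a single confusion matrix, because it depends on the classifier's scores across all decision thresholds.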