Biocomputing Group, CIRI-Health Science and Technology/Department of Biology, via San Giacomo 9/2, Bologna, Italy.
BMC Genomics. 2012 Jun 18;13 Suppl 4(Suppl 4):S8. doi: 10.1186/1471-2164-13-S4-S8.
Various computational methods are presently available to classify whether a protein variation is disease-associated or not. However, data derived from recent technological advancements make it feasible to extend the annotation of disease-associated variations to include specific phenotypes. Here we tackle the problem of distinguishing between genetic variations associated with cancer and variations associated with other genetic diseases.
We implement a new method based on Support Vector Machines that takes as input the protein variant and the protein function, as described by its associated Gene Ontology terms. Our approach succeeds in discriminating germline variants that are likely to be cancer-associated from those that are related to other genetic disorders. The method achieves 90% accuracy and a Matthews correlation coefficient of 0.61 on a set comprising 6478 germline variations (16% of which are cancer-associated) in 592 proteins. The sensitivity and the specificity on the cancer class are 69% and 66%, respectively. Furthermore, the method correctly excludes about 96% of 3392 somatic cancer-associated variations in 1983 proteins not included in the training/testing set.
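The reported figures can be cross-checked with a short calculation. The sketch below rebuilds an approximate confusion matrix from the numbers quoted above. It rests on one assumption that the abstract does not state: that "specificity on the cancer class" denotes the fraction of cancer predictions that are correct (often called precision), rather than the true-negative rate. All variable names are illustrative, and the counts are rounded estimates, not values taken from the paper.

```python
import math

# Figures quoted in the abstract.
n_variants = 6478
cancer_fraction = 0.16
sensitivity = 0.69
reported_specificity = 0.66

# Approximate class sizes (rounded estimates).
positives = round(n_variants * cancer_fraction)
negatives = n_variants - positives

# True positives and false negatives follow from the sensitivity.
tp = round(sensitivity * positives)
fn = positives - tp

# ASSUMPTION: the 66% is read as precision, TP / (TP + FP).
fp = round(tp * (1 - reported_specificity) / reported_specificity)
tn = negatives - fp

accuracy = (tp + tn) / n_variants
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Contrast: the 66% read as a true-negative rate, TN / (TN + FP).
tn_alt = round(reported_specificity * negatives)
accuracy_alt = (tp + tn_alt) / n_variants

print(f"precision reading: accuracy={accuracy:.3f}, MCC={mcc:.3f}")
print(f"true-negative-rate reading: accuracy={accuracy_alt:.3f}")
```

Under the precision reading, the sketch gives an accuracy close to 0.89 and a Matthews correlation coefficient close to 0.61, in line with the reported values. Under the true-negative-rate reading, the accuracy would fall to roughly 0.66, which does not match the reported 90%.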
Here we show that a large set of cancer-associated germline protein variations can be successfully discriminated from those associated with other genetic disorders. This is a further step in the process of protein variant annotation. Performance improves markedly when protein function, as encoded by Gene Ontology terms, is considered, corroborating the role of protein function as a key feature for the correct annotation of protein variations.
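As an illustration of how a variant and its protein's Gene Ontology annotation could be combined into one input vector for a classifier such as a Support Vector Machine, a minimal sketch follows. The abstract does not describe the actual encoding, so everything here is an assumption: the one-hot encoding of the wild-type and mutant residues, the multi-hot encoding of GO terms, the function names, and the three-term vocabulary, whose identifiers are placeholders chosen only for the example.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_variant(wild_type, mutant):
    """Hypothetical encoding: one-hot wild-type residue, then one-hot mutant."""
    vec = [0] * (2 * len(AMINO_ACIDS))
    vec[AMINO_ACIDS.index(wild_type)] = 1
    vec[len(AMINO_ACIDS) + AMINO_ACIDS.index(mutant)] = 1
    return vec


def encode_go_terms(protein_terms, vocabulary):
    """Hypothetical encoding: 1 for each vocabulary term annotated to the protein."""
    return [1 if term in protein_terms else 0 for term in vocabulary]


def build_features(wild_type, mutant, protein_terms, vocabulary):
    """Concatenate the variant part and the function part into one vector."""
    return encode_variant(wild_type, mutant) + encode_go_terms(
        protein_terms, vocabulary
    )


# Illustrative vocabulary; a real one would cover every GO term in the dataset.
GO_VOCABULARY = ["GO:0005524", "GO:0006281", "GO:0006915"]

# Example: an R-to-W substitution in a protein annotated with one GO term.
features = build_features("R", "W", {"GO:0006281"}, GO_VOCABULARY)
print(len(features), sum(features))
```

Dropping the GO part of the vector yields a variant-only baseline, which is one way to measure how much the functional annotation contributes to performance.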