Capriotti E, Calabrese R, Casadio R
Laboratory of Biocomputing, CIRB/Department of Biology, University of Bologna, via Irnerio 42, 40126 Bologna, Italy.
Bioinformatics. 2006 Nov 15;22(22):2729-34. doi: 10.1093/bioinformatics/btl423. Epub 2006 Aug 7.
Single nucleotide polymorphisms (SNPs) are the most frequent type of genetic variation in the human population. One of the most important goals of SNP projects is to understand which human genotype variations are related to Mendelian and complex diseases. Great interest is focused on non-synonymous coding SNPs (nsSNPs), which are responsible for single point mutations in proteins. nsSNPs can be neutral or disease-associated. It is known that the mutation of a single residue in a protein sequence can be related to a number of pathological conditions of dramatic social impact, such as Alzheimer's, Parkinson's and Creutzfeldt-Jakob diseases. The quality and completeness of presently available SNP databases allow the application of machine learning techniques to predict, starting from the protein sequence, the onset of human diseases caused by single point protein mutations.
In this paper, we develop a method based on support vector machines (SVMs) that, starting from protein sequence information, can predict whether a new phenotype derived from an nsSNP can be related to a genetic disease in humans. Using a dataset of 21 185 single point mutations from 3587 proteins, 61% of which are disease-related, we show that our predictor can reach more than 74% accuracy in the specific task of predicting whether a single point mutation is disease-related or not. Our method, although based on less information, outperforms other web-available predictors that implement different approaches.
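To make the idea of a sequence-only predictor more concrete, the sketch below shows one plausible way to turn a single point mutation into a fixed-length numeric vector that an SVM could take as input. The abstract does not describe the feature encoding, so everything here is an assumption made for illustration: the function name `encode_mutation`, the 40-element layout (20 values marking the substitution plus 20 residue counts from the surrounding sequence), and the 19-residue window.

```python
# Hypothetical feature encoding for a single point mutation.
# Assumption: the layout and window size are illustrative only;
# the abstract does not state how mutations are encoded.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_mutation(sequence, position, mutant, window=19):
    """Encode a substitution at a 0-based position as a 40-element vector.

    Elements 0-19: -1.0 at the wild-type residue, +1.0 at the mutant residue.
    Elements 20-39: counts of each standard residue inside a window
    centred on the mutated position (truncated at the sequence ends).
    """
    wild_type = sequence[position]
    if wild_type not in INDEX or mutant not in INDEX:
        raise ValueError("only the 20 standard amino acids are supported")
    if wild_type == mutant:
        raise ValueError("mutant residue must differ from the wild type")

    # Part 1: which residue is lost and which one is gained.
    substitution = [0.0] * 20
    substitution[INDEX[wild_type]] = -1.0
    substitution[INDEX[mutant]] = 1.0

    # Part 2: residue composition of the local sequence environment.
    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    environment = [0.0] * 20
    for aa in sequence[start:end]:
        if aa in INDEX:  # skip non-standard symbols such as 'X'
            environment[INDEX[aa]] += 1.0

    return substitution + environment


if __name__ == "__main__":
    # Toy sequence, not taken from the dataset described above.
    vector = encode_mutation("MKTAYIAKQR", position=3, mutant="V")
    print(len(vector))
```

Vectors of this kind, paired with disease/neutral labels, could then be passed to any SVM implementation (for example `sklearn.svm.SVC` in scikit-learn). The kernel and its parameters would need to be chosen by cross-validation, since the abstract does not report them.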
A beta version of the web tool is available at http://gpcr.biocomp.unibo.it/cgi/predictors/PhD-SNP/PhD-SNP.cgi