Capriotti Emidio, Arbiza Leonardo, Casadio Rita, Dopazo Joaquín, Dopazo Hernán, Marti-Renom Marc A
Structural Genomics Unit, Centro de Investigación Príncipe Felipe (CIPF), Valencia, Spain.
Hum Mutat. 2008 Jan;29(1):198-204. doi: 10.1002/humu.20628.
Predicting the functional impact of protein variation is one of the most challenging problems in bioinformatics. A rapidly growing number of genome-scale studies provide large amounts of experimental data, allowing the application of rigorous statistical approaches for predicting whether a given single point mutation has an impact on human health. Until now, existing methods have limited their source data to either protein or gene information. The novelty of this work is that it takes advantage of both, focusing on protein evolutionary information by using selective pressures estimated at the codon level. Here we introduce a new method (SeqProfCod) to predict whether a given protein variant is associated with human disease. Our method relies on a support vector machine (SVM) classifier trained using three sources of information: protein sequence, multiple protein sequence alignments, and the estimation of selective pressure at the codon level. SeqProfCod has been benchmarked on a large dataset of 8,987 single point mutations from 1,434 human proteins from SWISS-PROT. It achieves 82% overall accuracy and a correlation coefficient of 0.59, indicating that the estimation of the selective pressure helps in predicting the functional impact of single point mutations. Moreover, this study demonstrates the synergistic effect of combining two sources of information for predicting the functional effects of protein variants: protein sequence/profile-based information and the evolutionary estimation of the selective pressures at the codon level. The results of a large-scale application of SeqProfCod to all annotated point mutations in SWISS-PROT (available for download at http://sgu.bioinfo.cipf.es/services/Omidios/; last accessed: 24 August 2007) could be used to support clinical studies.
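The abstract reports two summary figures for a binary disease/neutral classifier: overall accuracy and a correlation coefficient. As a minimal sketch of how such figures are typically derived from a confusion matrix, the Python below computes both, assuming that the correlation coefficient is the Matthews correlation coefficient (the abstract does not say which coefficient is meant). The function name `accuracy_and_mcc` and the counts in the usage line are hypothetical and are not taken from the SeqProfCod benchmark.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, Matthews correlation coefficient) for a binary classifier.

    tp/fn: disease-associated variants predicted as disease / as neutral.
    tn/fp: neutral variants predicted as neutral / as disease.
    Assumption: the "correlation coefficient" is the Matthews coefficient.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")

    accuracy = (tp + tn) / total

    # The MCC denominator is zero when any row or column of the confusion
    # matrix sums to zero; the coefficient is conventionally set to 0 then.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    return accuracy, mcc


# Hypothetical counts, for illustration only (not the paper's data).
example_accuracy, example_mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)
```

Reporting a correlation coefficient alongside accuracy is useful when the two classes are unbalanced, because accuracy alone can look high for a classifier that mostly predicts the majority class.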