Bendl Jaroslav, Stourac Jan, Salanda Ondrej, Pavelka Antonin, Wieben Eric D, Zendulka Jaroslav, Brezovsky Jan, Damborsky Jiri
Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic; Department of Information Systems, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic; Center of Biomolecular and Cellular Engineering, International Centre for Clinical Research, St. Anne's University Hospital Brno, Brno, Czech Republic.
Loschmidt Laboratories, Department of Experimental Biology and Research Centre for Toxic Compounds in the Environment, Faculty of Science, Masaryk University, Brno, Czech Republic; Center of Biomolecular and Cellular Engineering, International Centre for Clinical Research, St. Anne's University Hospital Brno, Brno, Czech Republic.
PLoS Comput Biol. 2014 Jan;10(1):e1003440. doi: 10.1371/journal.pcbi.1003440. Epub 2014 Jan 16.
Single nucleotide variants represent a prevalent form of genetic variation. Mutations in coding regions are frequently associated with the development of genetic diseases. Computational tools that predict the effects of mutations on protein function are important for analyzing single nucleotide variants and prioritizing them for experimental characterization. Many such tools are already widely used. Unfortunately, their comparison and further improvement are hindered by large overlaps between training and benchmark datasets, which lead to biased and overly optimistic reported performance. In this study, we constructed three independent datasets by removing all duplicates, inconsistencies and mutations previously used to train the evaluated tools. The benchmark dataset, containing over 43,000 mutations, was used for an unbiased evaluation of eight established prediction tools: MAPP, nsSNPAnalyzer, PANTHER, PhD-SNP, PolyPhen-1, PolyPhen-2, SIFT and SNAP. The six best-performing tools were combined into a consensus classifier, PredictSNP, which showed significantly improved prediction performance and returned results for all mutations, confirming that consensus prediction is an accurate and robust alternative to the predictions delivered by individual tools. A user-friendly web interface provides easy access to all eight prediction tools, the consensus classifier PredictSNP, and annotations from the Protein Mutant Database and the UniProt database. The web server and the datasets are freely available to the academic community at http://loschmidt.chemi.muni.cz/predictsnp.
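The idea of a consensus classifier can be illustrated with a minimal Python sketch. The abstract does not state how the individual predictions are combined, so the unweighted majority vote below is an assumption, as are the function name `consensus_vote`, the labels `"deleterious"` and `"neutral"`, and the tie-breaking rule. The example inputs are illustrative and are not results from the study.

```python
from collections import Counter


def consensus_vote(predictions):
    """Combine per-tool predictions into one label by simple majority vote.

    predictions: dict mapping a tool name to "deleterious", "neutral",
    or None when that tool returned no result for the mutation.
    Returns the majority label, or None if no tool returned a result.
    Assumption: ties are resolved as "deleterious" (a conservative choice
    made for this sketch, not taken from the source).
    """
    # Count only the tools that actually returned a prediction.
    votes = Counter(label for label in predictions.values() if label is not None)
    if not votes:
        return None
    if votes["deleterious"] >= votes["neutral"]:
        return "deleterious"
    return "neutral"


# Illustrative input: one tool gives no result, yet a consensus is still returned.
example = {
    "SIFT": "deleterious",
    "SNAP": "deleterious",
    "MAPP": None,
    "PhD-SNP": "neutral",
}
print(consensus_vote(example))
```

Because tools without a result are skipped rather than counted, the combined classifier can still return a label whenever at least one tool produces a prediction, which is one plausible reading of why a consensus can cover mutations that individual tools miss.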