Magnan Christophe N, Baldi Pierre
Department of Computer Science and Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA.
Bioinformatics. 2014 Sep 15;30(18):2592-7. doi: 10.1093/bioinformatics/btu352. Epub 2014 May 24.
Accurately predicting protein secondary structure and relative solvent accessibility is important for the study of protein evolution, structure and function and as a component of protein 3D structure prediction pipelines. Most predictors use a combination of machine learning and profiles, and thus must be retrained and assessed periodically as the number of available protein sequences and structures continues to grow.
We present newly trained modular versions of the SSpro and ACCpro predictors of secondary structure and relative solvent accessibility, together with their multi-class variants SSpro8 and ACCpro20. We introduce a sharp distinction between the use of sequence similarity alone, typically in the form of sequence profiles at the input level, and the additional use of sequence-based structural similarity, which uses similarity to sequences in the Protein Data Bank to infer annotations at the output level, and we study their relative contributions to modern predictors. Using sequence similarity alone, SSpro's accuracy is between 79 and 80% (79% for ACCpro), and no other predictor seems to exceed 82%. However, when sequence-based structural similarity is added, the accuracy of SSpro rises to 92.9% (90% for ACCpro). Thus, by combining both approaches, these problems now appear to be essentially solved, as an accuracy of 100% cannot be expected for several well-known reasons. These results also point to several open technical challenges, including (i) achieving accuracy on the order of 80% or higher without using any similarity with known proteins and (ii) achieving accuracy on the order of 85% or higher using sequence similarity alone.
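To make the distinction between the two levels concrete, the following minimal sketch shows how per-residue accuracy can be computed and how output-level annotations inferred from similar PDB sequences might be merged with a profile-based prediction. It is an illustration under stated assumptions, not the SSpro or ACCpro implementation: the function names `per_residue_accuracy` and `combine_with_templates`, the three-letter class alphabet (H, E, C), and the simple rule that a template-derived label overrides the machine-learning label wherever one is available are all hypothetical.

```python
from typing import Optional, Sequence


def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Return the fraction of residues whose predicted class equals the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def combine_with_templates(
    ml_prediction: str,
    template_labels: Sequence[Optional[str]],
) -> str:
    """Merge a profile-based prediction with labels inferred from similar PDB sequences.

    Assumption for illustration only: wherever a template-derived label exists
    (not None), it replaces the machine-learning label at that position.
    """
    if len(ml_prediction) != len(template_labels):
        raise ValueError("ml_prediction and template_labels must have the same length")
    return "".join(
        t if t is not None else p
        for p, t in zip(ml_prediction, template_labels)
    )


# Toy example with made-up labels (not data from the paper).
observed = "HHHHEEEECC"
ml_only = "HHHCEEECCC"  # two residues mispredicted
templates = [None, None, None, "H", None, None, None, None, None, None]

combined = combine_with_templates(ml_only, templates)
print(per_residue_accuracy(ml_only, observed))   # accuracy from the profile-based prediction alone
print(per_residue_accuracy(combined, observed))  # accuracy after adding template-derived labels
```

In this toy case the template covers one of the two mispredicted residues, so accuracy improves only where structural similarity supplies an annotation; positions without a similar known structure still depend on the profile-based prediction alone.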
SSpro, SSpro8, ACCpro and ACCpro20 programs, data and web servers are available through the SCRATCH suite of protein structure predictors at http://scratch.proteomics.ics.uci.edu.