Jones S, Stewart M, Michie A, Swindells M B, Orengo C, Thornton J M
Department of Biochemistry and Molecular Biology, University College, London, United Kingdom.
Protein Sci. 1998 Feb;7(2):233-42. doi: 10.1002/pro.5560070202.
A consensus approach for the assignment of structural domains in proteins is presented. The approach combines a number of previously published algorithms and takes advantage of the elevated accuracy obtained when assignments from the individual algorithms are in agreement. The consensus approach is tested on a data set of 55 protein chains for which domain assignments from four automated methods were known, and for which crystallographers' assignments had been reported in the literature. Accuracy was found to increase in this test from 72% using individual algorithms to 100% when all four methods were in agreement. However, a consensus prediction using all four methods was only possible for 52% of the data set. The consensus approach [using three publicly available domain assignment algorithms (PUU, DETECTIVE, DOMAK)] was then used to make domain assignments for a data set of 787 protein chains from the Protein Data Bank. Analysis of the assignments showed that 55.7% of assignments could be made automatically and that, of these, 13.5% were multi-domain proteins. Of the remaining 44.3% that could not be assigned by the consensus procedure, 90.4% had their domain boundaries assigned correctly by at least one of the algorithms. Once identified, these domains were analyzed for trends in their size and secondary structure class. In addition, the discontinuity of each domain along the protein chain was considered.
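The consensus idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how agreement between algorithms is judged, so everything below is an assumption made for illustration: domains are represented as sets of residue numbers, two assignments are taken to agree when they have the same number of domains and each domain pairs with a distinct counterpart at a Jaccard overlap of at least `min_overlap`, and the 0.85 default is arbitrary. The function names `assignments_agree` and `consensus` are hypothetical and are not part of PUU, DETECTIVE, or DOMAK.

```python
from itertools import combinations
from typing import Dict, List, Optional, Set

# One assignment = a list of domains; one domain = a set of residue numbers.
Assignment = List[Set[int]]


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two domains as |A & B| / |A | B| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assignments_agree(x: Assignment, y: Assignment, min_overlap: float = 0.85) -> bool:
    """Assumed agreement rule (not taken from the paper): same domain count,
    and every domain in x pairs with a distinct domain in y at >= min_overlap."""
    if len(x) != len(y):
        return False
    unmatched = list(range(len(y)))
    for dom in x:
        # Pick the best still-unmatched partner for this domain.
        best = max(unmatched, key=lambda j: jaccard(dom, y[j]), default=None)
        if best is None or jaccard(dom, y[best]) < min_overlap:
            return False
        unmatched.remove(best)
    return True


def consensus(assignments: Dict[str, Assignment],
              min_overlap: float = 0.85) -> Optional[Assignment]:
    """Return one assignment if every pair of methods agrees, otherwise None
    (meaning the chain is left for manual inspection)."""
    methods = list(assignments)
    for m1, m2 in combinations(methods, 2):
        if not assignments_agree(assignments[m1], assignments[m2], min_overlap):
            return None
    # All methods agree; the first method's boundaries stand in for the consensus.
    return assignments[methods[0]]


# Made-up example: three methods placing a boundary near residue 100 of a 200-residue chain.
agreeing = {
    "PUU": [set(range(1, 101)), set(range(101, 201))],
    "DETECTIVE": [set(range(1, 99)), set(range(99, 201))],
    "DOMAK": [set(range(1, 103)), set(range(103, 201))],
}
# Same chain, but one method reports a single domain, so no consensus is possible.
disagreeing = dict(agreeing, DOMAK=[set(range(1, 201))])

result_agree = consensus(agreeing)
result_disagree = consensus(disagreeing)
```

Under these assumptions, `result_agree` holds the two-domain assignment and `result_disagree` is `None`, mirroring the abstract's split between chains assigned automatically and chains that the consensus procedure could not assign.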