Exploring the extremes of sequence/structure space with ensemble fold recognition in the program Phyre.

Author information

Bennett-Lovsey Riccardo M, Herbert Alex D, Sternberg Michael J E, Kelley Lawrence A

Affiliation

Structural Bioinformatics Group, Division of Molecular Biosciences, Imperial College London, London SW7 2AY, United Kingdom.

Publication information

Proteins. 2008 Feb 15;70(3):611-25. doi: 10.1002/prot.21688.

DOI:10.1002/prot.21688

PMID:17876813

Abstract

Structural and functional annotation of the large and growing database of genomic sequences is a major problem in modern biology. Protein structure prediction by detecting remote homology to known structures is a well-established and successful annotation technique. However, the broad spectrum of evolutionary change that accompanies the divergence of close homologues to become remote homologues cannot easily be captured with a single algorithm. Recent advances to tackle this problem have involved the use of multiple predictive algorithms available on the Internet. Here we demonstrate how such ensembles of predictors can be designed in-house under controlled conditions and permit significant improvements in recognition by using a concept taken from protein loop energetics and applying it to the general problem of 3D clustering. We have developed a stringent test that simulates the situation where a protein sequence of interest is submitted to multiple different algorithms and not one of these algorithms can make a confident (95%) correct assignment. A method of meta-server prediction (Phyre) that exploits the benefits of a controlled environment for the component methods was implemented. At 95% precision or higher, Phyre identified 64.0% of all correct homologous query-template relationships, and 84.0% of the individual test query proteins could be accurately annotated. In comparison to the improvement that the single best fold recognition algorithm (according to training) has over PSI-Blast, this represents a 29.6% increase in the number of correct homologous query-template relationships, and a 46.2% increase in the number of accurately annotated queries. It has been well recognised in fold prediction, other bioinformatics applications, and in many other areas, that ensemble predictions generally are superior in accuracy to any of the component individual methods. 
However, there is a paucity of information as to why the ensemble methods are superior, and indeed this has never been systematically addressed in fold recognition. Here we show that the source of ensemble power stems from noise reduction in filtering out false positive matches. The results indicate greater coverage of sequence space and improved model quality, which can consequently lead to a reduction in the experimental workload of structural genomics initiatives.
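
The two headline figures above (share of correct query-template relationships recovered, and share of queries annotated, both at 95% precision or higher) can be illustrated with a minimal sketch. This is not the Phyre benchmark code. The `Hit` record, the function name `coverage_at_precision`, and the rule that a query counts as "annotated" once it has at least one correct hit at or above the score cutoff are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Set


class Hit(NamedTuple):
    """One scored query-template match (hypothetical record layout)."""
    query: str     # query identifier
    score: float   # higher means more confident
    correct: bool  # True if the pair is a true homologous relationship


class Coverage(NamedTuple):
    cutoff: float               # lowest score still accepted
    relationship_recall: float  # correct pairs found / all correct pairs
    query_coverage: float       # queries with >= 1 correct hit / all queries


def coverage_at_precision(
    hits: List[Hit],
    total_true: int,
    total_queries: int,
    min_precision: float = 0.95,
) -> Optional[Coverage]:
    """Return coverage at the most permissive score cutoff whose precision
    is still >= min_precision, or None if no cutoff qualifies."""
    if total_true <= 0 or total_queries <= 0:
        raise ValueError("total_true and total_queries must be positive")

    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    best: Optional[Coverage] = None
    true_positives = 0
    annotated: Set[str] = set()

    for i, hit in enumerate(ranked):
        if hit.correct:
            true_positives += 1
            annotated.add(hit.query)
        # Tied scores must be accepted or rejected together, so a cutoff
        # is only evaluated after the last hit sharing a given score.
        last_of_score = i + 1 == len(ranked) or ranked[i + 1].score != hit.score
        if last_of_score and true_positives / (i + 1) >= min_precision:
            # Counts never decrease down the ranking, so the latest
            # qualifying cutoff always has the highest coverage.
            best = Coverage(
                cutoff=hit.score,
                relationship_recall=true_positives / total_true,
                query_coverage=len(annotated) / total_queries,
            )
    return best
```

Under these assumptions, running the function once on the output of a single method and once on the output of an ensemble would give the kind of side-by-side comparison the abstract reports, provided both runs use the same `total_true` and `total_queries`.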
