Can computationally designed protein sequences improve secondary structure prediction?

Affiliation

Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Fort Detrick, MD 21702, USA.

Publication information

Protein Eng Des Sel. 2011 May;24(5):455-61. doi: 10.1093/protein/gzr003. Epub 2011 Jan 31.

DOI:10.1093/protein/gzr003

PMID:21282334

Abstract

Computational sequence design methods are used to engineer proteins with desired properties such as increased thermal stability and novel function. In addition, these algorithms can be used to identify an envelope of sequences that may be compatible with a particular protein fold topology. In this regard, we hypothesized that sequence-property prediction, specifically secondary structure, could be significantly enhanced by using a large database of computationally designed sequences. We performed a large-scale test of this hypothesis with 6511 diverse protein domains and 50 designed sequences per domain. After analysis of the inherent accuracy of the designed sequences database, we realized that it was necessary to put constraints on what fraction of the native sequence should be allowed to change. With mutational constraints, accuracy was improved vs. no constraints, but the diversity of designed sequences, and hence effective size of the database, was moderately reduced. Overall, the best three-state prediction accuracy (Q3) that we achieved was nearly a percentage point improved over using a natural sequence database alone, well below the theoretical possibility for improvement of 8-10 percentage points. Furthermore, our nascent method was used to augment the state-of-the-art PSIPRED program by a percentage point.
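For readers unfamiliar with the metric, three-state accuracy (Q3) is conventionally the fraction of residues whose predicted state (helix H, strand E, or coil C) matches the observed state. The sketch below illustrates that standard definition only; the function name `q3_accuracy` and the example strings are hypothetical and are not taken from the paper, whose exact evaluation protocol is not described in the abstract.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the fraction of residues whose predicted three-state
    label (H, E, or C) equals the observed label."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    allowed = set("HEC")
    if (set(predicted) | set(observed)) - allowed:
        raise ValueError("labels must be one of H, E, C")
    # Count positions where prediction and observation agree.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Illustrative (made-up) 17-residue example with two mismatched positions.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
print(f"Q3 = {100 * q3_accuracy(predicted, observed):.1f}%")
```

Under this definition, the improvement of about one percentage point reported above corresponds to roughly one additional correctly assigned residue per 100 residues.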
