Department of Applied Mathematics, Genome Center, University of California, Davis, 95616, California.
Proteins. 2013 Sep;81(9):1556-70. doi: 10.1002/prot.24307. Epub 2013 Jun 20.
It is well known that protein fold recognition can be greatly improved if models for the underlying evolutionary history of the folds are taken into account. The improvement, however, exists only if such evolutionary information is available. To circumvent this limitation for protein families that have only a small number of representatives in current sequence databases, we follow an alternative approach in which the benefits of including evolutionary information are recreated by using sequences generated by computational protein design algorithms. We explore this strategy on a large database of protein templates with 1747 members from different protein families. An automated method is used to design sequences for these templates. We use the backbones from the experimental structures as fixed templates, thread sequences on these backbones using a self-consistent mean field approach, and score the fitness of the corresponding models using a semi-empirical physical potential. The sequences designed for each template are translated into a hidden Markov model-based profile. We describe the implementation of this method, the optimization of its parameters, and its performance. When the native sequences of the protein templates were tested against the library of these profiles, the class, fold, and family memberships of a large majority (>90%) of these sequences were correctly recognized at an E-value threshold of 1. In contrast, when homologous sequences were tested against the same library, a much smaller fraction (35%) of sequences were recognized; the structural classifications of the protein families corresponding to these sequences, however, were correctly recognized (with an accuracy of >88%).
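As a rough illustration of how sequences designed for one template could be turned into a profile, the sketch below builds a position-specific log-odds matrix from equal-length designed sequences and scores a query against it. This is a simplified stand-in, not the method of the paper: it models match positions only (no insert or delete states, so it is not a full hidden Markov model), and it assumes a uniform background distribution and a fixed pseudocount. The function names `build_profile` and `score` and the toy sequences are hypothetical.

```python
import math
from collections import Counter

# The 20 standard amino acids; other residue codes are not handled here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(sequences, pseudocount=1.0):
    """Build per-position log-odds scores from designed sequences.

    Assumption: all sequences were designed on the same fixed backbone,
    so they share one length and need no alignment.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all designed sequences must share the template length")

    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(sequences) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        # Smoothed residue frequency at this position, as a log-odds score.
        column = {
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        profile.append(column)
    return profile


def score(profile, query):
    """Sum the log-odds of the query residues; higher means a better fit."""
    if len(query) != len(profile):
        raise ValueError("query length must match the profile length")
    return sum(column[aa] for column, aa in zip(profile, query))


# Toy data, invented for illustration only.
designed = ["ACDA", "ACDA", "ACEA"]
profile = build_profile(designed)
print(score(profile, "ACDA") > score(profile, "WWWW"))
```

In practice a profile HMM package would also model insertions and deletions and would report E-values, which this sketch does not attempt.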