School of Informatics, Indiana University Purdue University Indianapolis, and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, 719 Indiana Avenue, Walker Plaza Building Suite 319, Indianapolis, IN 46202, USA.
J Mol Biol. 2011 May 6;408(3):585-95. doi: 10.1016/j.jmb.2011.02.056. Epub 2011 Mar 2.
Worldwide structural genomics projects are increasing structure coverage of sequence space but have not significantly expanded the protein structure space itself (i.e., the number of unique structural folds) since 2007. Discovering new structural folds experimentally by directed evolution and random recombination of secondary-structure blocks has also rarely proved successful. Meanwhile, previous computational efforts at large-scale mapping of protein structure space have been limited to simple model proteins and have led to an inconclusive answer on the completeness of the existing observed protein structure space. Here, we build novel protein structures by extending naturally occurring circular (single-loop) permutation to multiple loop permutations (MLPs). These structures are clustered by a structural similarity measure called TM-score. The computational technique allows us to produce different structural clusters on the same naturally occurring, packed, stable core but with alternatively connected secondary-structure segments. A large-scale MLP of 2936 domains from the structural classification of protein domains reproduces existing structural clusters (63%), mostly as hubs for many nonredundant sequences, and shows newly discovered novel clusters as islands adopted by only a few sequences. Results further show that there exist a significant number of novel, potentially stable clusters for medium-size or large-size single-domain proteins, particularly those longer than 100 amino acid residues, that are either not yet adopted by nature or adopted by only a few sequences. This study suggests that MLP provides a simple yet highly effective tool for the engineering and design of novel protein structures (including naturally knotted proteins). The implication for template-based structure prediction of recovering new-fold targets from the Critical Assessment of Structure Prediction techniques (CASP) by MLP is also discussed.
Our MLP structures are available for download at the publication page of the Web site http://sparks.informatics.iupui.edu.
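The difference between a circular (single-loop) permutation and a multiple loop permutation can be illustrated with a toy enumeration over segment orders. This is only a minimal sketch under stated assumptions, not the authors' procedure: the segment labels and both function names are hypothetical, and the code ignores everything geometric (loop lengths, distances between segment termini, packing), which a real MLP construction would have to respect.

```python
from itertools import permutations

def circular_permutations(segments):
    """Rotations of the native segment order (one loop cut, termini rejoined).

    The native order itself (rotation by 0) is excluded.
    """
    n = len(segments)
    return [tuple(segments[i:] + segments[:i]) for i in range(1, n)]

def multiple_loop_permutations(segments):
    """Every alternative connection order of the same segments.

    Toy assumption: any reordering is allowed. No geometric feasibility
    check is applied, so this is an upper bound on candidate topologies.
    """
    native = tuple(segments)
    return [p for p in permutations(segments) if p != native]

# Hypothetical core with four secondary-structure segments (E = strand, H = helix).
segments = ["E1", "H1", "E2", "H2"]

cps = circular_permutations(segments)
mlps = multiple_loop_permutations(segments)
beyond_circular = [p for p in mlps if p not in set(cps)]

print(len(cps), len(mlps), len(beyond_circular))
```

Even for four segments, most alternative connection orders are not reachable by a circular permutation, which is the intuition behind extending single-loop to multiple loop permutations. Each candidate order would still need to be built as a 3D model and compared by TM-score before it could be assigned to a structural cluster.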