IISc Mathematics Initiative, Indian Institute of Science, Bangalore 560 012, India.
National Centre for Biological Sciences, University of Agricultural Sciences Gandhi Krishi Vignan Kendra Campus, Bangalore 560 065, India.
J Mol Biol. 2014 Feb 20;426(4):962-79. doi: 10.1016/j.jmb.2013.11.026. Epub 2013 Dec 4.
Protein functional annotation relies on the accurate identification of relationships between proteins, and sequence divergence is a key factor in how reliably these can be detected. This is especially evident when distant protein relationships can be demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach that purposefully bridges gaps between related protein families through the directed design of protein-like "linker" sequences. We represented SCOP domain families, integrated with their sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and the other additionally containing the designed sequences. Searches against the augmented database improved fold coverage by up to 30% for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, designed sequences between other families within the same fold established a sequence continuum that demonstrated 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be "plugged into" routine, generic sequence database searches to empower not only remote homology detection but also fold recognition. Our findings, which are well supported statistically, show that complementary searches of both databases increase the effectiveness of sequence-based searches in recognizing all homologues that share a common fold.
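The abstract does not spell out how the roulette wheel-based design operates, so the following is only a minimal sketch of roulette wheel (fitness-proportionate) sampling under stated assumptions: each aligned position is assumed to carry a table of residue weights, and a residue is drawn with probability proportional to its weight. The names `roulette_pick`, `design_sequence`, and `toy_profile`, as well as the weights themselves, are hypothetical and are not taken from the paper.

```python
import random


def roulette_pick(weights, rng):
    """Draw one key with probability proportional to its weight.

    `weights` maps residue letters to non-negative numbers (assumed input,
    e.g. per-column residue frequencies derived from aligned profiles).
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("column needs at least one positive weight")
    spin = rng.random() * total  # position of the pointer on the wheel
    cumulative = 0.0
    for residue, weight in weights.items():
        cumulative += weight
        if spin < cumulative:
            return residue
    return residue  # guard against floating-point round-off


def design_sequence(profile, rng):
    """Build one protein-like sequence, sampling each column independently."""
    return "".join(roulette_pick(column, rng) for column in profile)


# Hypothetical three-column profile, for illustration only.
toy_profile = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.5, "I": 0.3, "V": 0.2},
    {"K": 1.0},
]

rng = random.Random(0)  # fixed seed so repeated runs give the same designs
designs = [design_sequence(toy_profile, rng) for _ in range(5)]
print(designs)
```

Sampling rather than always taking the most frequent residue is what lets repeated draws yield many distinct sequences from a single alignment. How the published method combines the two aligned family profiles into per-position weights is not stated in the abstract and is left open here.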