Department of Computing, Imperial College London, SW7 2AZ, UK.
Bioinformatics. 2014 May 1;30(9):1259-65. doi: 10.1093/bioinformatics/btu020. Epub 2014 Jan 17.
Protein structure alignment is key to transferring information from well-studied proteins to less-studied ones. Structural alignment identifies the most precise mapping of equivalent residues, as structures are more conserved during evolution than sequences. Among the methods for aligning protein structures, maximum Contact Map Overlap (CMO) has received sustained attention during the past decade. Yet, known algorithms exhibit modest performance and are not applicable to large-scale comparison.
Graphlets are small induced subgraphs that are used to design sensitive topological similarity measures between nodes and networks. By generalizing graphlets to ordered graphs, we introduce GR-Align, a CMO heuristic that is suited for database searches. On the Proteus_300 set (44 850 protein domain pairs), GR-Align is several orders of magnitude faster than the state-of-the-art CMO solvers Apurva, MSVNS and AlEigen7, and its similarity score is in better agreement with the structural classification of proteins. In a large-scale experiment on the Gold-standard benchmark dataset (3 207 270 protein domain pairs), GR-Align is several orders of magnitude faster than the state-of-the-art protein structure comparison tools TM-Align, DaliLite, MATT and Yakusa, while achieving similar classification performance. Finally, we illustrate the difference between GR-Align's flexible alignments and the traditional ones by querying a flexible protein in the Astral-40 database (11 154 protein domains). In this experiment, GR-Align's top-scoring alignments are not only in better agreement with the structural classification of proteins, but also allow more information to be transferred across proteins.
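To make the CMO objective concrete, the following minimal Python sketch counts the contacts shared by two structures under a given residue alignment. It is an illustration only, not GR-Align's algorithm: the function names, the 7.5 Å distance threshold, and the use of one coordinate per residue are assumptions made for this example. It only scores a fixed alignment, whereas CMO solvers search for the alignment that maximizes this count.

```python
import math
from itertools import combinations


def contact_map(coords, threshold=7.5):
    """Return the set of residue pairs (i, k), i < k, closer than `threshold`.

    `coords` holds one (x, y, z) point per residue. The 7.5 angstrom
    threshold is an assumed value chosen for illustration.
    """
    contacts = set()
    for i, k in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[k]) <= threshold:
            contacts.add((i, k))
    return contacts


def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that the alignment maps onto contacts of B.

    `alignment` is a list of (residue_in_a, residue_in_b) pairs and is
    assumed to be one-to-one and order-preserving.
    """
    mapping = dict(alignment)
    shared = 0
    for i, k in contacts_a:
        if i in mapping and k in mapping:
            j, l = mapping[i], mapping[k]
            if (min(j, l), max(j, l)) in contacts_b:
                shared += 1
    return shared


# Toy coordinates (hypothetical, not taken from any real protein).
coords_a = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (20, 0, 0)]
coords_b = [(0, 0, 0), (4, 0, 0), (30, 0, 0), (34, 0, 0)]
alignment = [(0, 0), (1, 1), (2, 2)]

contacts_a = contact_map(coords_a)
contacts_b = contact_map(coords_b)
overlap = contact_overlap(contacts_a, contacts_b, alignment)
```

Here the contact between residues 0 and 1 is preserved by the alignment, while the contact between residues 1 and 2 of the first structure has no counterpart in the second, so only one contact is shared.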