Department of Biochemistry & Molecular Biology and The Institute for Genome Sciences, University of Maryland, School of Medicine, BioPark II, Baltimore, MD 21201, USA.
Bioinformatics. 2009 Aug 1;25(15):1869-75. doi: 10.1093/bioinformatics/btp342. Epub 2009 Jun 8.
The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible, for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply aligning vast numbers of sequences in this way is clearly impractical.
This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence-based method hmmalign and the structure-based method PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences.
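As a rough illustration of the Karlin-Altschul statistics mentioned above, the sketch below computes the standard expected number of chance hits, E = K·m·n·exp(−λS), and uses it to pick the best-matching profile for one sequence. It is not taken from the MAPGAPS source: the function names, the profile names, the raw scores, and the λ and K values (figures commonly quoted for ungapped BLOSUM62 scoring) are assumptions chosen only for the example.

```python
import math

# Illustrative parameters only (assumption): values commonly quoted for
# ungapped BLOSUM62 scoring. Real tools estimate lambda and K per scoring system.
LAMBDA = 0.3176
K = 0.134


def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """Convert a raw alignment score to a normalized bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: int, db_len: int,
            lam: float = LAMBDA, k: float = K) -> float:
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def best_profile(raw_scores: dict, query_len: int, db_len: int) -> str:
    """Return the (hypothetical) profile name with the lowest E-value."""
    return min(raw_scores,
               key=lambda name: e_value(raw_scores[name], query_len, db_len))


if __name__ == "__main__":
    # Hypothetical raw scores of one sequence against three subgroup profiles.
    scores = {"profile_A": 95.0, "profile_B": 140.0, "profile_C": 60.0}
    m, n = 180, 65_000_000  # hypothetical profile length and database size
    for name, s in scores.items():
        print(name, f"bits={bit_score(s):.1f}", f"E={e_value(s, m, n):.3g}")
    print("assigned to:", best_profile(scores, m, n))
```

Because the E-value falls exponentially with the raw score, a modest score difference between profiles translates into a large difference in significance, which is what makes a score-based subgroup assignment of this kind plausible.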
A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu.
Supplementary data are available at Bioinformatics online.