Kann Maricel G, Sheetlin Sergey L, Park Yonil, Bryant Stephen H, Spouge John L
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD 20894, USA.
Nucleic Acids Res. 2007;35(14):4678-85. doi: 10.1093/nar/gkm414. Epub 2007 Jun 27.
The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a 'semi-global alignment'. In high-throughput annotation applications, however, it is common to use local alignment, which aligns pieces of domains to subsequences. Local alignment is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
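The distinction between semi-global and local alignment can be illustrated with a minimal dynamic-programming sketch. This is not the HMM-based method used by GLOBAL or HMMer; it is a plain pairwise scorer with hypothetical parameters (match +2, mismatch -1, linear gap -2) and made-up toy sequences, intended only to show that semi-global alignment charges for every domain residue while local alignment may stop wherever the score is highest.

```python
def semi_global_score(domain, protein, match=2, mismatch=-1, gap=-2):
    """Best score for aligning ALL of `domain` to some subsequence of `protein`.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    m = len(protein)
    # Row 0: skipping a prefix of the protein costs nothing.
    prev = [0] * (m + 1)
    for i in range(1, len(domain) + 1):
        # Column 0: domain residues cannot be skipped for free.
        curr = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,  # align two residues
                          prev[j] + gap,      # domain residue against a gap
                          curr[j - 1] + gap)  # protein residue against a gap
        prev = curr
    # Last row: skipping a suffix of the protein costs nothing.
    return max(prev)


def local_score(domain, protein, match=2, mismatch=-1, gap=-2):
    """Best score for aligning any piece of `domain` to any piece of `protein`."""
    m = len(protein)
    best = 0
    prev = [0] * (m + 1)
    for i in range(1, len(domain) + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            # The extra 0 lets the alignment restart anywhere.
            curr[j] = max(0, prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    domain = "ACDEFGHIK"                  # toy "domain"
    complete = "MMMACDEFGHIKMMM"          # protein containing the whole domain
    partial = "MMMACDEFMMM"               # protein containing only a fragment
    for name, protein in (("complete", complete), ("partial", partial)):
        print(name,
              semi_global_score(domain, protein),
              local_score(domain, protein))
```

When the protein contains the whole domain, both scorers agree. When only a fragment is present, the local score still rewards the fragment in full, whereas the semi-global score is pulled down by the unmatched domain residues, which is the behavior one wants when searching for complete domains.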