The identification of complete domains within protein sequences using accurate E-values for semi-global alignment.

Author information

Kann Maricel G, Sheetlin Sergey L, Park Yonil, Bryant Stephen H, Spouge John L

Affiliation

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD 20894, USA.

Publication information

Nucleic Acids Res. 2007;35(14):4678-85. doi: 10.1093/nar/gkm414. Epub 2007 Jun 27.

DOI:10.1093/nar/gkm414

PMID:17596268

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1950549/

Abstract

The sequencing of complete genomes has created a pressing need for automated annotation of gene function. Because domains are the basic units of protein function and evolution, a gene can be annotated from a domain database by aligning domains to the corresponding protein sequence. Ideally, complete domains are aligned to protein subsequences, in a 'semi-global alignment'. Local alignment, which aligns pieces of domains to subsequences, is common in high-throughput annotation applications, however. It is a mature technique, with the heuristics and accurate E-values required for screening large databases and evaluating the screening results. Hidden Markov models (HMMs) provide an alternative theoretical framework for semi-global alignment, but their use is limited because they lack heuristic acceleration and accurate E-values. Our new tool, GLOBAL, overcomes some limitations of previous semi-global HMMs: it has accurate E-values and the possibility of the heuristic acceleration required for high-throughput applications. Moreover, according to a standard of truth based on protein structure, two semi-global HMM alignment tools (GLOBAL and HMMer) had comparable performance in identifying complete domains, but distinctly outperformed two tools based on local alignment. When searching for complete protein domains, therefore, GLOBAL avoids disadvantages commonly associated with HMMs, yet maintains their superior retrieval performance.
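
To make the distinction between semi-global and local alignment concrete, the following is a minimal sketch of a semi-global dynamic program in Python. It is not the GLOBAL tool or its HMM formulation, and it computes no E-values. It assumes a plain match/mismatch score with a linear gap penalty (the function name `semi_global_align` and all scoring parameters are illustrative choices, not taken from the paper). The whole domain must be aligned, while residues of the protein before and after the matched region cost nothing.

```python
def semi_global_align(domain, protein, match=2, mismatch=-1, gap=-2):
    """Align ALL of `domain` to the best-scoring substring of `protein`.

    Illustrative scoring only (assumed values, not those of any real tool).
    Returns (score, start, end) such that protein[start:end] is the region
    matched to the complete domain.
    """
    m, n = len(domain), len(protein)
    # score[i][j]: best score aligning domain[:i] to some suffix of protein[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # start[i][j]: index in protein where that best alignment begins
    start = [list(range(n + 1)) for _ in range(m + 1)]

    for i in range(1, m + 1):
        # With no protein residues consumed, every domain residue is a gap.
        score[i][0] = i * gap
        start[i][0] = 0
        for j in range(1, n + 1):
            sub = match if domain[i - 1] == protein[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + sub, start[i - 1][j - 1]),  # pair residues
                (score[i - 1][j] + gap, start[i - 1][j]),          # gap in protein
                (score[i][j - 1] + gap, start[i][j - 1]),          # gap in domain
            ]
            score[i][j], start[i][j] = max(candidates, key=lambda c: c[0])

    # Row 0 is all zeros, so leading protein residues are free; taking the
    # maximum over the last row makes trailing protein residues free as well.
    end = max(range(n + 1), key=lambda j: score[m][j])
    return score[m][end], start[m][end], end


if __name__ == "__main__":
    # Complete domain embedded in a longer protein.
    print(semi_global_align("ACDE", "GGGACDEGGG"))
    # Truncated domain: the missing residue is penalized as a gap.
    print(semi_global_align("ACDE", "GGACD"))
```

A local alignment would differ in two ways: each cell would also be allowed to reset to zero, and the maximum would be taken over all cells rather than only the last row, so a fragment of the domain could score well on its own. In the sketch above, a protein containing only part of the domain is penalized for the missing residues, which is the behavior that favors complete domains.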
