Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources.

Authors

Stanke Mario, Schöffmann Oliver, Morgenstern Burkhard, Waack Stephan

Affiliation

Institut für Mikrobiologie und Genetik, Universität Göttingen, Göttingen, Germany.

Publication

BMC Bioinformatics. 2006 Feb 9;7:62. doi: 10.1186/1471-2105-7-62.

DOI: 10.1186/1471-2105-7-62

PMID: 16469098

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1409804/

Abstract

BACKGROUND

To improve gene prediction, extrinsic evidence on gene structure can be collected from various sources, such as genome-genome comparisons and EST and protein alignments. However, such evidence is often incomplete and unreliable, and it is usually not sufficient to recover the complete structure of every gene. Extrinsic evidence is therefore most valuable when it is balanced with sequence-intrinsic evidence.

RESULTS

We present a fairly general method for integrating external information. Our method is based on the evaluation of hints to potentially protein-coding regions by means of a Generalized Hidden Markov Model (GHMM) that takes both intrinsic and extrinsic information into account. We used this method to extend the ab initio gene prediction program AUGUSTUS to a versatile tool that we call AUGUSTUS+. In this study, we focus on hints derived from matches to an EST or protein database, but our approach can be used to include arbitrary user-defined hints. Our method is only moderately affected by the length of a database match. Further, it exploits the information that can be derived from the absence of such matches. As a special case, AUGUSTUS+ can predict genes under user-defined constraints, e.g. if the positions of certain exons are known. With hints from EST and protein databases, our new approach was able to predict 89% of the exons in human chromosome 22 correctly.

CONCLUSION

Sensitive probabilistic modeling of extrinsic evidence such as sequence database matches can increase gene prediction accuracy. When a match of a sequence interval to an EST or protein sequence is used, it should be treated as compound information rather than as information about individual positions.
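The idea of treating a database match as compound information, and of using the absence of a match as evidence, can be illustrated with a toy scorer. This is only a minimal sketch and not the AUGUSTUS+ model: the class names, the `supports` rule, and the `BONUS` and `MALUS` values are hypothetical choices made for illustration. A candidate exon receives one fixed reward if any hint interval lies inside it, regardless of how long that hint is, and a penalty if no hint supports it.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    """An interval suggested by an external match (coordinates inclusive)."""
    start: int
    end: int


@dataclass(frozen=True)
class ExonCandidate:
    """A candidate exon with a log-score from the sequence-intrinsic model."""
    start: int
    end: int
    intrinsic_log_score: float


# Hypothetical parameters chosen for illustration, not values from the paper.
BONUS = 10.0  # factor applied when at least one hint supports the candidate
MALUS = 0.5   # factor applied when no hint supports the candidate


def supports(hint: Hint, exon: ExonCandidate) -> bool:
    """Assumed rule: a hint supports an exon if it lies entirely inside it."""
    return exon.start <= hint.start and hint.end <= exon.end


def combined_log_score(exon: ExonCandidate, hints: list[Hint]) -> float:
    """Combine intrinsic and extrinsic evidence in log space.

    The hint acts as one compound observation: the reward is applied once
    per candidate, not once per covered position, so a long match does not
    outweigh a short one.
    """
    supported = any(supports(h, exon) for h in hints)
    factor = BONUS if supported else MALUS
    return exon.intrinsic_log_score + math.log(factor)


exon = ExonCandidate(start=100, end=200, intrinsic_log_score=-5.0)
score_short = combined_log_score(exon, [Hint(120, 130)])
score_long = combined_log_score(exon, [Hint(101, 199)])
score_none = combined_log_score(exon, [])
```

In this sketch `score_short` and `score_long` are equal, which mirrors the claim that match length has limited influence, while `score_none` falls below the intrinsic score because the missing match is itself counted as evidence. A per-position scheme would instead add a reward for every covered base, so the long hint would dominate.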

Figure (image from the article): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e0a/1409804/dc6b551236b3/1471-2105-7-62-1.jpg
