A hidden Markov model that finds genes in E. coli DNA.

Authors

Krogh A, Mian I S, Haussler D

Affiliation

Nordita, Copenhagen, Denmark.

Publication information

Nucleic Acids Res. 1994 Nov 11;22(22):4768-78. doi: 10.1093/nar/22.22.4768.

DOI:10.1093/nar/22.22.4768

PMID:7984429

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC308529/

Abstract

A hidden Markov model (HMM) has been developed to find protein coding genes in E. coli DNA using E. coli genome DNA sequence from the EcoSeq6 database maintained by Kenn Rudd. This HMM includes states that model the codons and their frequencies in E. coli genes, as well as the patterns found in the intergenic region, including repetitive extragenic palindromic sequences and the Shine-Dalgarno motif. To account for potential sequencing errors and/or frameshifts in raw genomic DNA sequence, it allows for the (very unlikely) possibility of insertions and deletions of individual nucleotides within a codon. The parameters of the HMM are estimated using approximately one million nucleotides of annotated DNA in EcoSeq6, and the model is tested on a disjoint set of contigs containing about 325,000 nucleotides. The HMM finds the exact locations of about 80% of the known E. coli genes, and approximate locations for about 10%. It also finds several potentially new genes, and locates several places where insertion or deletion errors and/or frameshifts may be present in the contigs.
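
As a rough illustration of the kind of model the abstract describes, the following sketch runs Viterbi decoding on a toy HMM with one intergenic state and three codon-position states. It is not the authors' model: the state layout, the names `STATES`, `START`, `TRANS`, `EMIT`, and `viterbi`, and every probability are assumptions invented for this example, and the insertion/deletion states and intergenic motif submodels mentioned in the abstract are omitted.

```python
import math

# Toy states: "I" = intergenic, "C1".."C3" = the three codon positions.
STATES = ["I", "C1", "C2", "C3"]

# All probabilities below are invented for illustration, not taken from the paper.
START = {"I": 1.0, "C1": 0.0, "C2": 0.0, "C3": 0.0}
TRANS = {
    "I":  {"I": 0.9, "C1": 0.1},   # stay intergenic or enter a gene
    "C1": {"C2": 1.0},             # codon positions advance in order
    "C2": {"C3": 1.0},
    "C3": {"C1": 0.9, "I": 0.1},   # start another codon or leave the gene
}
EMIT = {
    "I":  {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "C1": {"A": 0.30, "C": 0.20, "G": 0.40, "T": 0.10},
    "C2": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    "C3": {"A": 0.15, "C": 0.30, "G": 0.30, "T": 0.25},
}


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(
                STATES, key=lambda p: score[p] + log(TRANS[p].get(s, 0.0))
            )
            new_score[s] = (
                score[best_prev]
                + log(TRANS[best_prev].get(s, 0.0))
                + log(EMIT[s][base])
            )
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi("ATGGCGAAAGGCTAA"))
```

In this sketch, positions labelled `C1` to `C3` would be read as predicted coding sequence. A model closer to the one in the abstract would emit whole codons with frequencies estimated from annotated genes and would add low-probability insertion and deletion states.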

Similar articles

A hidden Markov model that finds genes in E. coli DNA.

Nucleic Acids Res. 1994 Nov 11;22(22):4768-78. doi: 10.1093/nar/22.22.4768.

GeneMark.hmm: new solutions for gene finding.

Nucleic Acids Res. 1998 Feb 15;26(4):1107-15. doi: 10.1093/nar/26.4.1107.

[Statistical characteristics in primary structures of functional regions of Escherichia coli genome. II. Non-stationary Markov chains].

Mol Biol (Mosk). 1986 Jul-Aug;20(4):1024-33.

Intrinsic and extrinsic approaches for detecting genes in a bacterial genome.

Nucleic Acids Res. 1994 Nov 11;22(22):4756-67. doi: 10.1093/nar/22.22.4756.

[Statistical characteristics of primary structures of the functional regions of the Escherichia coli genome. III. Computer recognition of coding regions].

Mol Biol (Mosk). 1986 Sep-Oct;20(5):1390-8.

Gene recognition in cyanobacterium genomic sequence data using the hidden Markov model.

Proc Int Conf Intell Syst Mol Biol. 1996;4:252-60.

[Statistical characteristics in primary structures of functional regions of Escherichia coli genome. I. Frequency characteristics].

Mol Biol (Mosk). 1986 Jul-Aug;20(4):1014-23.

Finding genes in DNA with a Hidden Markov Model.

J Comput Biol. 1997 Summer;4(2):127-41. doi: 10.1089/cmb.1997.4.127.

Detection of genes in Escherichia coli sequences determined by genome projects and prediction of protein production levels, based on multivariate diversity in codon usage.

Comput Appl Biosci. 1996 Jun;12(3):213-25. doi: 10.1093/bioinformatics/12.3.213.

Recognition of genes in DNA sequence with ambiguities.

Biosystems. 1993;30(1-3):161-71. doi: 10.1016/0303-2647(93)90068-n.

Cited by

tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes.

Nucleic Acids Res. 2021 Sep 20;49(16):9077-9096. doi: 10.1093/nar/gkab688.

Classification of seed members of five riboswitch families as short sequences based on the features extracted by Block Location-Based Feature Extraction (BLBFE) method.

Bioimpacts. 2021;11(2):101-109. doi: 10.34172/bi.2021.17. Epub 2020 Apr 17.

Alignment using genetic programming with causal trees for identification of protein functions.

Nonlinear Anal Theory Methods Appl. 2006 Sep 1;65(5):1070-1093. doi: 10.1016/j.na.2005.09.048. Epub 2005 Nov 28.

Classification of Riboswitch Families Using Block Location-Based Feature Extraction (BLBFE) Method.

Adv Pharm Bull. 2020 Jan;10(1):97-105. doi: 10.15171/apb.2020.012. Epub 2019 Dec 11.

Development of a new oligonucleotide block location-based feature extraction (BLBFE) method for the classification of riboswitches.

Mol Genet Genomics. 2020 Mar;295(2):525-534. doi: 10.1007/s00438-019-01642-z. Epub 2020 Jan 4.

Prediction of Sphingosine protein-coding regions with a self adaptive spectral rotation method.

PLoS One. 2019 Apr 3;14(4):e0214442. doi: 10.1371/journal.pone.0214442. eCollection 2019.

Identification and characterization of Coronaviridae genomes from Vietnamese bats and rats based on conserved protein domains.

Virus Evol. 2018 Dec 15;4(2):vey035. doi: 10.1093/ve/vey035. eCollection 2018 Jul.

Binding site discovery from nucleic acid sequences by discriminative learning of hidden Markov models.

Nucleic Acids Res. 2014 Dec 1;42(21):12995-3011. doi: 10.1093/nar/gku1083. Epub 2014 Nov 11.

Regional effects on chimera formation in 454 pyrosequenced amplicons from a mock community.

J Microbiol. 2014 Jul;52(7):566-73. doi: 10.1007/s12275-014-3485-6. Epub 2014 May 30.

Algorithms for hidden markov models restricted to occurrences of regular expressions.

Biology (Basel). 2013 Nov 8;2(4):1282-95. doi: 10.3390/biology2041282.

References

Sequence length and error analysis of Sequenase and automated Taq cycle sequencing methods.

Biotechniques. 1993 Mar;14(3):442-7.

Structural analysis based on state-space modeling.

Protein Sci. 1993 Mar;2(3):305-14. doi: 10.1002/pro.5560020302.

Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks.

Nucleic Acids Res. 1993 Feb 11;21(3):607-13. doi: 10.1093/nar/21.3.607.

Alternative readings of the genetic code.

Cell. 1993 Aug 27;74(4):591-6. doi: 10.1016/0092-8674(93)90507-m.

Compilation of DNA sequences of Escherichia coli (update 1993).

Nucleic Acids Res. 1993 Jul 1;21(13):2973-3000. doi: 10.1093/nar/21.13.2973.

Hidden Markov models of biological primary sequence information.

Proc Natl Acad Sci U S A. 1994 Feb 1;91(3):1059-63. doi: 10.1073/pnas.91.3.1059.

Protein classification by stochastic modeling and optimal filtering of amino-acid sequences.

Math Biosci. 1994 Jan;119(1):35-75. doi: 10.1016/0025-5564(94)90004-3.

Hidden Markov models in computational biology. Applications to protein modeling.

J Mol Biol. 1994 Feb 4;235(5):1501-31. doi: 10.1006/jmbi.1994.1104.

Using Dirichlet mixture priors to derive hidden Markov models for protein families.

Proc Int Conf Intell Syst Mol Biol. 1993;1:47-55.

Recognition of protein coding regions in DNA sequences.

Nucleic Acids Res. 1982 Sep 11;10(17):5303-18. doi: 10.1093/nar/10.17.5303.
