Krogh A, Mian I S, Haussler D
Nordita, Copenhagen, Denmark.
Nucleic Acids Res. 1994 Nov 11;22(22):4768-78. doi: 10.1093/nar/22.22.4768.
A hidden Markov model (HMM) has been developed to find protein-coding genes in E. coli DNA, using E. coli genomic DNA sequence from the EcoSeq6 database maintained by Kenn Rudd. This HMM includes states that model the codons and their frequencies in E. coli genes, as well as the patterns found in the intergenic region, including repetitive extragenic palindromic sequences and the Shine-Dalgarno motif. To account for potential sequencing errors and/or frameshifts in raw genomic DNA sequence, it allows for the (very unlikely) possibility of insertions and deletions of individual nucleotides within a codon. The parameters of the HMM are estimated using approximately one million nucleotides of annotated DNA in EcoSeq6, and the model is tested on a disjoint set of contigs containing about 325,000 nucleotides. The HMM finds the exact locations of about 80% of the known E. coli genes, and approximate locations for about 10%. It also finds several potentially new genes, and locates several places where insertion or deletion errors and/or frameshifts may be present in the contigs.
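As a rough illustration of the kind of decoding an HMM gene finder performs, the following is a minimal Viterbi sketch over a toy model with one intergenic state and three codon-position states. It is not the model from the paper: the state names, the `viterbi` function, and every probability below are invented for illustration, and the sketch omits the codon-frequency states, the intergenic motif models, and the insertion/deletion states that the abstract describes.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


# Hypothetical states: one intergenic state plus three codon positions.
STATES = ["intergenic", "c1", "c2", "c3"]

# All probabilities are made-up placeholders, not estimates from EcoSeq6.
START = {"intergenic": 0.9, "c1": 0.1, "c2": 0.0, "c3": 0.0}

# Coding states cycle c1 -> c2 -> c3; a gene can only end after c3.
TRANS = {
    "intergenic": {"intergenic": 0.95, "c1": 0.05},
    "c1": {"c2": 1.0},
    "c2": {"c3": 1.0},
    "c3": {"c1": 0.98, "intergenic": 0.02},
}

EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "c1": {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},
    "c2": {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},
    "c3": {"A": 0.15, "C": 0.30, "G": 0.30, "T": 0.25},
}


def viterbi(seq):
    """Return the most probable state path for a DNA string over A, C, G, T."""
    if not seq:
        return []
    # score[s] is the best log probability of any path ending in state s.
    score = {s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Pick the predecessor that maximizes the path score into s.
            best_prev = max(
                STATES, key=lambda p: score[p] + log(TRANS[p].get(s, 0.0))
            )
            new_score[s] = (
                score[best_prev]
                + log(TRANS[best_prev].get(s, 0.0))
                + log(EMIT[s][base])
            )
            pointers[s] = best_prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    dna = "ATGGCGATTTAA"
    for base, state in zip(dna, viterbi(dna)):
        print(base, state)
```

Working in log space avoids numerical underflow on long sequences, and giving forbidden transitions a probability of zero (log value of negative infinity) keeps the decoded path in reading frame. A model closer to the one described would add low-probability transitions that skip or repeat a codon position to absorb frameshifts.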