Parra G, Blanco E, Guigó R
Grup de Recerca en Informàtica Mèdica, Institut Municipal d'Investigació Mèdica (IMIM), Universitat Pompeu Fabra, E-08003 Barcelona, Spain.
Genome Res. 2000 Apr;10(4):511-5. doi: 10.1101/gr.10.4.511.
GeneID is a program, designed with a hierarchical structure, for predicting genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from these sites. Each exon is scored as the sum of the scores of its defining sites plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons so as to maximize the sum of the scores of the assembled exons. In this paper we describe how the PWMs for sites and the Markov model of coding DNA were derived for Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools, but that GeneID is likely to be more efficient in terms of speed and memory usage.
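The three-step hierarchy described above (score sites, score exons, assemble a gene) can be illustrated with a minimal Python sketch. This is not GeneID's implementation: the toy matrix `TOY_PWM`, the function names, and all numeric values are hypothetical, the coding log-likelihood ratio is passed in as a precomputed number instead of being derived from a Markov model, and the assembly step is simplified to choosing non-overlapping exons with the highest total score, ignoring reading frame, strand, and exon type.

```python
from bisect import bisect_right

# Hypothetical toy PWM for a 3-base site: one log-odds score per position and base.
TOY_PWM = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.5},
]


def pwm_score(pwm, window):
    """Step 1a: sum the per-position scores of one sequence window."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan_sites(pwm, seq, threshold=0.0):
    """Step 1b: return (position, score) for each window scoring above threshold."""
    width = len(pwm)
    hits = []
    for pos in range(len(seq) - width + 1):
        score = pwm_score(pwm, seq[pos:pos + width])
        if score > threshold:
            hits.append((pos, score))
    return hits


def exon_score(start_site_score, end_site_score, coding_llr):
    """Step 2: site scores plus the coding log-likelihood ratio (assumed precomputed)."""
    return start_site_score + end_site_score + coding_llr


def assemble(exons):
    """Step 3 (simplified): pick non-overlapping exons with maximal total score.

    Each exon is (start, end, score) with inclusive integer coordinates.
    Returns (best_total_score, chosen_exons).
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # Count the earlier exons that end strictly before this exon starts.
        j = bisect_right(ends, start - 1, 0, i)
        take = (best[j][0] + score, best[j][1] + [(start, end, score)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]


if __name__ == "__main__":
    print(scan_sites(TOY_PWM, "CCAGTCC"))
    candidate_exons = [(0, 10, 3.0), (5, 20, 4.0), (15, 30, 2.5), (25, 40, 1.0)]
    print(assemble(candidate_exons))
```

Sorting exons by end coordinate lets the assembly step run as a dynamic program over prefixes, in which each exon is either skipped or appended to the best chain that ends before it begins.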