Stanke Mario, Waack Stephan
Institut für Mikrobiologie und Genetik, Abteilung Bioinformatik, Universität Göttingen, Göttingen, Germany.
Bioinformatics. 2003 Oct;19 Suppl 2:ii215-25. doi: 10.1093/bioinformatics/btg1080.
The problem of finding the genes in eukaryotic DNA sequences by computational methods is still not satisfactorily solved. Gene finding programs have achieved relatively high accuracy on short genomic sequences but do not perform well on longer sequences that contain an unknown number of genes. On such sequences, existing programs tend to predict many false exons.
We have developed a new program, AUGUSTUS, for the ab initio prediction of protein-coding genes in eukaryotic genomes. The program is based on a Hidden Markov Model and integrates a number of known methods and submodels. It employs a new way of modeling intron lengths. We use a new donor splice site model and a new model, which takes the reading frame into account, for a short region directly upstream of the donor splice site. We also apply a method that allows better GC-content-dependent parameter estimation. On longer sequences, AUGUSTUS accurately predicts far more human and Drosophila genes than the ab initio gene prediction programs we compared it with, while at the same time being more specific.
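As a rough illustration of what Hidden Markov Model based gene prediction involves, the following minimal sketch runs Viterbi decoding on a toy two-state model. It is not the AUGUSTUS model: the states `intergenic` and `coding`, all probabilities, and the function name `viterbi` are hypothetical and chosen only for illustration. In a plain HMM like this one, the self-transition probability implies a geometric distribution of state durations, which is one reason a dedicated length model for a state such as introns can matter.

```python
import math

# Hypothetical toy model: two states, illustrative parameters only.
STATES = ("intergenic", "coding")
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
# Assumption for the sketch: the "coding" state is GC-rich.
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    # score[s]: log-probability of the best path that ends in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s]: best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Calling `viterbi` on a DNA string over the alphabet A, C, G, T returns one state label per base. A realistic gene model would add states for exons in each reading frame, introns, and splice sites, each with its own submodel.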
A web interface for AUGUSTUS and the executable program are located at http://augustus.gobics.de.