Solovyev Victor, Kosarev Peter, Seledsov Igor, Vorobyev Denis
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX, UK.
Genome Biol. 2006;7 Suppl 1(Suppl 1):S10.1-12. doi: 10.1186/gb-2006-7-s1-s10. Epub 2006 Aug 7.
The ENCODE gene prediction workshop (EGASP) was organized to evaluate how well state-of-the-art automatic gene finding methods reproduce the manual and experimental gene annotation of the human genome. We used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. The predictions of gene finding programs were evaluated by their ability to reproduce the ENCODE-HAVANA annotation.
The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of the annotated coding parts of genes found by gene prediction software.
We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome.
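As an illustration of the accuracy figures quoted above, the following minimal Python sketch shows one way nucleotide-level sensitivity and specificity, and the number of base pairs per false positive, might be computed. It is not the EGASP evaluation code. It assumes the convention common in gene prediction assessment, in which specificity is the fraction of predicted coding nucleotides that are also annotated as coding; the abstract does not state the definition used. The function names and the toy coordinates are hypothetical.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    Both arguments are lists of (start, end) coding intervals with
    1-based inclusive coordinates on the same sequence.
    Assumption: specificity = TP / (TP + FP), i.e. the share of
    predicted coding nucleotides that are annotated as coding.
    """
    annotated_positions = set()
    for start, end in annotated:
        annotated_positions.update(range(start, end + 1))

    predicted_positions = set()
    for start, end in predicted:
        predicted_positions.update(range(start, end + 1))

    true_positives = len(annotated_positions & predicted_positions)
    sensitivity = true_positives / len(annotated_positions) if annotated_positions else 0.0
    specificity = true_positives / len(predicted_positions) if predicted_positions else 0.0
    return sensitivity, specificity


def bp_per_false_positive(scanned_length_bp, false_positive_count):
    """Average number of scanned base pairs per false positive prediction."""
    if false_positive_count == 0:
        return float("inf")  # no false positives in the scanned region
    return scanned_length_bp / false_positive_count


# Toy data (hypothetical): two annotated exons, three predicted exons.
annotated_exons = [(101, 200), (301, 400)]
predicted_exons = [(111, 200), (301, 390), (501, 520)]

sn, sp = nucleotide_accuracy(annotated_exons, predicted_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
print(bp_per_false_positive(10_000, 5))
```

Expanding intervals into position sets keeps the sketch short; for sequences on the scale of the 30 Mb ENCODE regions, an interval-overlap computation would be more memory-efficient.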