Reese M G, Hartzell G, Harris N L, Ohler U, Abril J F, Lewis S E
Berkeley Drosophila Genome Project, Department of Molecular and Cell Biology, University of California, Berkeley 94720-3200, USA.
Genome Res. 2000 Apr;10(4):483-501. doi: 10.1101/gr.10.4.483.
Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. 
Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
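The nucleotide-level figures quoted above (for example, the share of coding nucleotides correctly identified) can be illustrated with a minimal sketch. It assumes the convention common in gene-finding evaluations, where sensitivity is the fraction of annotated coding nucleotides that were predicted and "specificity" is the fraction of predicted coding nucleotides that are annotated. The abstract does not define its measures, so this convention, the function names, and the toy coordinates are all assumptions made for illustration, not the GASP evaluation code or data.

```python
def to_positions(intervals):
    """Expand 1-based inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the nucleotide level.

    Assumed definitions (not taken from the abstract):
      sensitivity = TP / annotated coding nucleotides
      specificity = TP / predicted coding nucleotides
    """
    pred = to_positions(predicted)
    true = to_positions(annotated)
    tp = len(pred & true)  # nucleotides that are both predicted and annotated
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy coordinates, invented for illustration only (not GASP data).
annotated_exons = [(100, 199), (300, 399)]
predicted_exons = [(100, 199), (310, 409)]

sn, sp = nucleotide_accuracy(predicted_exons, annotated_exons)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Expanding intervals into position sets keeps the sketch short. For a real contig of several megabases, an interval-overlap computation would use less memory.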