Nielsen Pernille, Krogh Anders
Bioinformatics Centre, Institute of Molecular Biology and Physiology, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark.
Bioinformatics. 2005 Dec 15;21(24):4322-9. doi: 10.1093/bioinformatics/bti701. Epub 2005 Oct 25.
Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. This makes genome comparison difficult and may lead to propagation of errors when questionable assignments are transferred from one genome to another. Genome comparison, on either a large or a small scale, would be facilitated by a single annotation standard that makes transparent why an open reading frame (ORF) is considered to be a gene.
A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to approximately 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between the numbers of annotated and predicted genes confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing from the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than that of the existing annotations.
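The comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The `Gene` record, the `compare_annotation` function, the matching of genes by strand and stop coordinate, the normalization by the number of annotated genes, and the symmetric 5% threshold for under-annotation are all assumptions made for illustration; the abstract states only the >5% cutoff for over-annotation and that the measure is a difference between the two counts.

```python
from typing import Iterable, NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (not an EasyGene or GenBank type)."""
    strand: str  # "+" or "-"
    start: int   # coordinate of the start codon
    stop: int    # coordinate of the stop codon


def compare_annotation(annotated: Iterable[Gene],
                       predicted: Iterable[Gene],
                       threshold: float = 0.05) -> dict:
    # Assumption: two genes are "the same" if they share strand and stop
    # coordinate, so a gene with a different start codon still matches.
    ann = {(g.strand, g.stop): g.start for g in annotated}
    pred = {(g.strand, g.stop): g.start for g in predicted}
    shared = ann.keys() & pred.keys()

    annotated_only = len(ann) - len(shared)   # annotated, not predicted
    predicted_only = len(pred) - len(shared)  # predicted, not annotated
    start_mismatch = sum(1 for key in shared if ann[key] != pred[key])

    # Assumption: the difference is expressed relative to the number of
    # annotated genes; the abstract does not give the exact normalization.
    diff = (annotated_only - predicted_only) / len(ann) if ann else 0.0

    if diff > threshold:
        label = "over-annotated"
    elif diff < -threshold:  # assumed symmetric cutoff
        label = "under-annotated"
    else:
        label = "consistent"

    return {
        "annotated_only": annotated_only,
        "predicted_only": predicted_only,
        "start_mismatch_fraction": start_mismatch / len(shared) if shared else 0.0,
        "fractional_difference": diff,
        "label": label,
    }
```

Matching on the stop coordinate rather than on the full interval separates the two kinds of disagreement the abstract discusses: a gene present in only one set counts towards over- or under-annotation, while a gene present in both sets with different start coordinates counts towards the start-codon mismatch fraction.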