Berrios Louis, Ely Bert
Department of Biological Sciences, University of South Carolina, Columbia, SC, 29208, USA.
Curr Microbiol. 2018 Dec;75(12):1642-1648. doi: 10.1007/s00284-018-1572-3. Epub 2018 Sep 26.
Annotated sequence data are instrumental in nearly all areas of biology. However, the advent of next-generation sequencing has rapidly created an imbalance between accurate sequence data and accurate annotation data. To increase the annotation accuracy of the Caulobacter vibrioides CB13b1a (CB13) genome, we compared the PGAP and RAST annotations of the CB13 genome. The PGAP annotation contained 64 unique genes that were either completely or partially absent from the RAST annotation, and the RAST annotation contained 16 genes that were not included in the PGAP annotation. Moreover, PGAP identified 73 frameshifted genes and 22 genes with an internal stop. In contrast, RAST annotated the larger segment of these frameshifted genes without indicating that a change in reading frame may have occurred. The RAST annotation did not include any genes with internal stop codons, since it chose start codons that were after the internal stop. To confirm the discrepancies between the two annotations and verify the accuracy of the CB13 genome sequence data, we re-sequenced and re-annotated the entire genome and obtained an identical sequence, except in a small number of homopolymer regions. A genome sequence comparison between the two versions allowed us to determine the correct number of bases in each homopolymer region, which eliminated frameshifts for 31 genes annotated as frameshifted genes and removed 24 pseudogenes from the PGAP annotation. Both annotation systems correctly identified genes that were missed by the other system. In addition, PGAP identified conserved gene fragments that represented the beginning of genes, but it employed no corrective method to adjust the reading frame of frameshifted genes or the start sites of genes harboring an internal stop codon. As a result, the PGAP annotation included a large number of pseudogenes, which may reflect evolutionary history but likely do not produce gene products. These results demonstrate that re-sequencing and annotation comparisons can be used to increase the accuracy of genomic data and the corresponding gene annotation.
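The abstract does not state how genes were matched between the PGAP and RAST annotations. As an illustration only, one common approach is to match genes by strand and stop-codon coordinate, since two annotations of the same gene often disagree on the start codon but share the stop. The minimal sketch below assumes that approach; the `Gene` record, the `compare_annotations` function, and the toy coordinates are all hypothetical and are not taken from the paper.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """Hypothetical minimal gene record (not the authors' data model)."""
    start: int   # 1-based leftmost coordinate on the genome
    end: int     # 1-based rightmost coordinate on the genome
    strand: str  # "+" or "-"


def stop_key(gene):
    """Key a gene by strand and the coordinate of its 3' end (stop codon).

    On the "+" strand the stop lies at the rightmost coordinate; on the
    "-" strand it lies at the leftmost coordinate.
    """
    return (gene.strand, gene.end if gene.strand == "+" else gene.start)


def compare_annotations(first, second):
    """Split two gene lists into unique and shared-but-different genes.

    Assumption: genes with the same strand and stop coordinate are the
    same gene, even if the annotated start sites differ.
    """
    by_key_first = {stop_key(g): g for g in first}
    by_key_second = {stop_key(g): g for g in second}

    only_first = [g for k, g in by_key_first.items() if k not in by_key_second]
    only_second = [g for k, g in by_key_second.items() if k not in by_key_first]
    # Same stop, different coordinates: the two tools chose different starts.
    start_differs = [
        (g, by_key_second[k])
        for k, g in by_key_first.items()
        if k in by_key_second and g != by_key_second[k]
    ]
    return only_first, only_second, start_differs


# Toy coordinates for illustration; these are not CB13 genes.
pgap = [Gene(100, 400, "+"), Gene(500, 900, "-"), Gene(1000, 1300, "+")]
rast = [Gene(130, 400, "+"), Gene(500, 900, "-"), Gene(1500, 1800, "-")]

only_pgap, only_rast, start_differs = compare_annotations(pgap, rast)
print("PGAP only:", only_pgap)
print("RAST only:", only_rast)
print("Different start sites:", start_differs)
```

Matching on the stop coordinate would not pair a frameshifted gene with its corrected version when the frameshift moves the stop codon, so cases like the homopolymer errors described above would still need a sequence-level comparison.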