Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3PD, Wales, UK.
Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, Wales, UK.
Nucleic Acids Res. 2023 Nov 27;51(21):11504-11517. doi: 10.1093/nar/gkad814.
Large regions of prokaryotic genomes currently lack any annotation, in part because of well-established limitations of annotation tools. For example, genes that use alternative start codons are routinely misreported or omitted entirely. We therefore present StORF-Reporter, a tool that takes an annotated genome and returns regions of its unannotated sequence that may contain missing CDS genes. StORF-Reporter works in two steps. First, it extracts the unannotated regions from an annotated genome. Second, it identifies Stop-ORFs (StORFs) in these unannotated regions. StORFs are open reading frames delimited by stop codons, so they can capture the genes most often missing from genome annotations. We show that this methodology recovers genes missing from canonical genome annotations. We examine the results for the genomes of model organisms, the pangenome of Escherichia coli, and a set of 5109 prokaryotic genomes from 247 genera in the Ensembl Bacteria database. StORF-Reporter extended the core, soft-core and accessory gene collections, identified novel gene families and extended existing families into additional genera. The high levels of sequence conservation observed between genera suggest that many of these StORFs are likely to be functional genes that should now be considered for inclusion in canonical annotations.
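The two steps described in the abstract (extracting unannotated regions, then finding stop-to-stop ORFs inside them) can be illustrated with a minimal Python sketch. This is not StORF-Reporter's implementation. The function names, the 0-based half-open coordinate convention, the `min_len` thresholds, and the forward-strand-only scan are all assumptions made for illustration. The stop codons TAA, TAG and TGA are those of the standard bacterial genetic code.

```python
# Illustrative sketch only: names, thresholds and coordinate conventions are
# hypothetical and do not reproduce StORF-Reporter's actual code or defaults.

STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard bacterial code


def unannotated_regions(genome_length, annotated, min_len=1):
    """Return 0-based half-open (start, end) gaps not covered by any annotation.

    `annotated` is an iterable of 0-based half-open (start, end) intervals,
    which may overlap each other.
    """
    gaps = []
    cursor = 0  # end of the furthest annotation seen so far
    for start, end in sorted(annotated):
        if start - cursor >= min_len:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if genome_length - cursor >= min_len:
        gaps.append((cursor, genome_length))
    return gaps


def find_storfs(seq, min_len=30):
    """Find stop-to-stop ORFs on the forward strand of `seq`.

    Returns (start, end, frame) tuples, where [start, end) is the sequence
    between two consecutive in-frame stop codons, excluding both stops.
    """
    seq = seq.upper()
    storfs = []
    for frame in range(3):
        prev_stop_end = None  # index just past the previous in-frame stop
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if prev_stop_end is not None and i - prev_stop_end >= min_len:
                    storfs.append((prev_stop_end, i, frame))
                prev_stop_end = i + 3
    return storfs


def storfs_in_unannotated(genome_seq, annotated, min_len=30):
    """Combine both steps and report StORFs in genome coordinates."""
    hits = []
    for region_start, region_end in unannotated_regions(len(genome_seq), annotated):
        region = genome_seq[region_start:region_end]
        for start, end, frame in find_storfs(region, min_len):
            hits.append((region_start + start, region_start + end, frame))
    return hits


if __name__ == "__main__":
    toy = "TAAATGAAACCCTAGGG"  # one stop-to-stop ORF in frame 0
    print(find_storfs(toy, min_len=9))
    print(unannotated_regions(100, [(10, 30), (25, 50), (80, 100)]))
```

Because a StORF is bounded by stop codons and does not depend on locating a start codon, a scan of this kind does not rely on a start-codon model, which is the property the abstract highlights. A fuller sketch would also scan the reverse complement and would need to decide how to handle ORFs that run past the edge of an unannotated region.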