Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America.
Center for Computational Biology, Johns Hopkins University, Baltimore, Maryland, United States of America.
PLoS Comput Biol. 2021 Feb 26;17(2):e1008727. doi: 10.1371/journal.pcbi.1008727. eCollection 2021 Feb.
Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.
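As a rough illustration of what a temporal convolutional network computes over an amino-acid sequence, the sketch below one-hot encodes a protein string and applies a single dilated causal 1D convolution, the generic building block of such networks. It is not Balrog's actual architecture or code: the function names `one_hot` and `causal_dilated_conv`, the 20-letter alphabet, and the zero-padding choice are all assumptions made for illustration.

```python
# Illustrative sketch only: a generic dilated causal 1D convolution, the basic
# building block of a temporal convolutional network. This is NOT Balrog's
# model; all names and choices here are hypothetical.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)


def one_hot(protein):
    """Encode a protein string as a list of 20-dimensional one-hot vectors."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for aa in protein:
        vec = [0.0] * len(AMINO_ACIDS)
        vec[index[aa]] = 1.0
        encoded.append(vec)
    return encoded


def causal_dilated_conv(inputs, weights, dilation):
    """Compute one output channel of a causal dilated convolution.

    inputs:  list of T feature vectors, each of length C
    weights: list of K weight vectors, each of length C; weights[k] is
             applied to the input k * dilation positions earlier
    Positions before the start of the sequence contribute zero.
    """
    outputs = []
    for t in range(len(inputs)):
        total = 0.0
        for k, w in enumerate(weights):
            src = t - k * dilation  # look back k * dilation positions
            if src >= 0:
                total += sum(wi * xi for wi, xi in zip(w, inputs[src]))
        outputs.append(total)
    return outputs
```

With a kernel whose two taps both respond to methionine, `causal_dilated_conv(one_hot("MKMK"), [one_hot("M")[0]] * 2, 2)` counts an `M` at the current position plus an `M` two positions back. Stacking such layers with growing dilation is what lets a network of this kind see long stretches of sequence with few parameters.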