Kananen Kathryn, Veseli Iva, Quiles Pérez Christian J, Miller Samuel E, Eren A Murat, Bradley Patrick H
Department of Microbiology, The Ohio State University, Columbus, OH 43210, United States.
Helmholtz Institute for Functional Marine Biodiversity, 26129 Oldenburg, Germany.
Bioinform Adv. 2025 Mar 21;5(1):vbaf039. doi: 10.1093/bioadv/vbaf039. eCollection 2025.
Gene function annotation in microbial genomes and metagenomes is a fundamental first step toward understanding metabolic potential and determinants of fitness. The Kyoto Encyclopedia of Genes and Genomes publishes a curated list of profile hidden Markov models to identify orthologous gene families (KOfams) with roles in metabolism. However, the computational tools that rely upon KOfams yield different annotations for the same set of genomes, leading to different downstream biological inferences.
Here, we apply three open-source software tools that can annotate KOfams to genomes of phylogenetically diverse bacterial families from host-associated and free-living biomes. We use multiple computational approaches to benchmark these methods and investigate individual case studies where they differ. Our results show that despite their fundamental similarities, these methods have different annotation rates and quality. In particular, a method that adaptively tunes sequence similarity thresholds substantially improves sensitivity while maintaining high accuracy. We observe particularly large improvements for protein families with few reference sequences, or when annotating genomes from nonmodel organisms (such as gut-dwelling microbes). Our findings show that small improvements in annotation workflows can maximize the utility of existing databases and meaningfully improve characterizations of microbial metabolism.
Anvi'o is available at https://anvio.org under the GNU GPL license. Scripts and workflow are available at https://github.com/pbradleylab/2023-anvio-comparison under the MIT license.
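The abstract does not say how the adaptive method adjusts its thresholds, so the following is only a minimal sketch of one way such a rule could look, not the procedure used by any of the benchmarked tools. It assumes each profile HMM hit carries a bit score and an E-value, and that each KOfam has a curated bit score threshold. The names `Hit`, `annotate`, `relax_fraction`, and `max_evalue`, and the default values 0.5 and 1e-5, are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One profile HMM hit of a gene against a KOfam (hypothetical record)."""
    gene: str
    ko: str
    bitscore: float
    evalue: float


def annotate(hits, thresholds, relax_fraction=0.5, max_evalue=1e-5):
    """Assign at most one KO per gene using a strict pass, then a relaxed pass.

    Strict pass: keep hits whose bit score meets the KO's curated threshold.
    Relaxed pass (assumed rule): for genes with no strict hit, accept a KO
    only if every remaining candidate hit points to the same KO, has a low
    E-value, and scores at least `relax_fraction` of that KO's threshold.
    """
    # Group hits by gene so each gene is decided independently.
    by_gene = {}
    for hit in hits:
        by_gene.setdefault(hit.gene, []).append(hit)

    annotations = {}
    for gene, gene_hits in by_gene.items():
        strict = [h for h in gene_hits if h.bitscore >= thresholds[h.ko]]
        if strict:
            # Take the highest-scoring hit among those above threshold.
            annotations[gene] = max(strict, key=lambda h: h.bitscore).ko
            continue

        relaxed = [
            h for h in gene_hits
            if h.evalue <= max_evalue
            and h.bitscore >= relax_fraction * thresholds[h.ko]
        ]
        # Only rescue the gene when the relaxed candidates agree on one KO.
        if relaxed and len({h.ko for h in relaxed}) == 1:
            annotations[gene] = relaxed[0].ko
    return annotations


if __name__ == "__main__":
    thresholds = {"K00001": 100.0, "K00002": 100.0}
    hits = [
        Hit("g1", "K00001", 120.0, 1e-30),  # passes the curated threshold
        Hit("g2", "K00002", 60.0, 1e-10),   # rescued by the relaxed pass
        Hit("g3", "K00001", 60.0, 1e-10),   # ambiguous: two KOs compete
        Hit("g3", "K00002", 70.0, 1e-12),
        Hit("g4", "K00001", 30.0, 1e-3),    # too weak even when relaxed
    ]
    print(annotate(hits, thresholds))
```

In this toy input, only the gene with a single consistent near-threshold hit is rescued by the relaxed pass; the ambiguous and weak genes stay unannotated, which is the kind of trade-off between sensitivity and accuracy the abstract describes.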