Louca Stilianos
Department of Biology, University of Oregon, Eugene, OR 97403, United States.
Institute of Ecology and Evolution, University of Oregon, Eugene, OR 97403, United States.
NAR Genom Bioinform. 2025 Jun 19;7(2):lqaf090. doi: 10.1093/nargab/lqaf090. eCollection 2025 Jun.
The relationship between gene content differences and microbial taxonomic divergence remains poorly understood, and algorithms for delineating novel microbial taxa above genus level based on multiple genome similarity metrics are lacking. Addressing these gaps is important for macroevolutionary theory, biodiversity assessments, and discovery of novel taxa in metagenomes. Here, I develop machine learning classifier models, based on multiple genome similarity metrics, to determine whether any two marine bacterial and archaeal (prokaryotic) metagenome-assembled genomes (MAGs) belong to the same taxon, at each taxonomic level from genus up to phylum. Metrics include average amino acid and nucleotide identities, and fractions of shared genes within various categories, applied to 14,390 previously published non-redundant MAGs. At all taxonomic levels, the balanced accuracy (average of the true-positive and true-negative rates) of classifiers exceeded 92%, suggesting that simple genome similarity metrics serve as good taxon differentiators. Predictor selection and sensitivity analyses revealed gene categories, e.g. those involved in metabolism of cofactors and vitamins, that are particularly correlated with taxon divergence. Predicted taxon delineations were further used to enumerate marine prokaryotic taxa. Statistical analyses of those enumerations suggest that over half of extant marine prokaryotic phyla, classes, and orders have already been recovered by genome-resolved metagenomic surveys.
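The balanced accuracy defined in the abstract, the average of the true-positive and true-negative rates, can be illustrated with a minimal Python sketch. The function name `balanced_accuracy` and the toy labels are hypothetical and are not taken from the paper. Here a label of 1 is assumed to mean "the two MAGs belong to the same taxon" and 0 to mean "different taxa".

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the true-positive rate and the true-negative rate."""
    pairs = list(zip(y_true, y_pred))
    # Count correctly predicted positives and negatives.
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    # Count actual positives and negatives.
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy data: 2 same-taxon pairs, 8 different-taxon pairs.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Plain accuracy counts every pair equally.
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain, balanced)
```

In this toy case the plain accuracy is 0.9 while the balanced accuracy is 0.75, because one of the two same-taxon pairs is missed. A common reason to report balanced accuracy is that it is not dominated by the majority class when one class is much rarer than the other.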