Buck Moritz, Mehrshad Maliheh, Bertilsson Stefan
Department of Aquatic Sciences and Assessment, Swedish University of Agricultural Sciences, Lennart Hjelms väg 9, 75651 Uppsala, Sweden.
NAR Genom Bioinform. 2022 Aug 15;4(3):lqac060. doi: 10.1093/nargab/lqac060. eCollection 2022 Sep.
Recent advances in sequencing and bioinformatics have expanded the tree of life by providing genomes for uncultured, environmentally relevant clades, either through metagenome-assembled genomes or through single-cell genomes. While this expanded diversity can provide novel insights into microbial population structure, most tools available for core-genome estimation are sensitive to genome completeness. Consequently, a major portion of the huge phylogenetic diversity uncovered by environmental genomic approaches remains excluded from such analyses. We present mOTUpan, a novel iterative Bayesian method for computing the core genome of sets of genomes that span a wide range of completeness. The likelihood that each gene cluster belongs to the core or the accessory genome is estimated by computing the probability of its presence/absence pattern in the target genome set. The core-genome prediction is computationally efficient and scales to thousands of genomes. It yields estimates comparable to those of the state-of-the-art tools Roary and PPanGGOLiN for high-quality genomes and can also use genomes at lower completeness thresholds. mOTUpan includes a bootstrapping procedure to estimate the quality of a specific core-genome prediction, as the accuracy of each run depends on the completeness distribution and the number of genomes in the dataset under scrutiny. mOTUpan is implemented in the mOTUlizer software package and is available at github.com/moritzbuck/mOTUlizer under the GPL 3.0 license.
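The iterative idea described in the abstract can be illustrated with a minimal sketch. This is not the mOTUpan implementation: the function name `estimate_core`, its parameters, and in particular the accessory model (presence probability taken as completeness times the observed frequency of the cluster) are simplifying assumptions made for illustration only, and the bootstrapping step is omitted. Each gene cluster is assigned to the core if its presence/absence pattern is more likely under a "core" hypothesis than under an "accessory" one, after which genome completeness is re-estimated from the current core and the procedure repeats until the core stops changing.

```python
import math


def _clamp(p, eps):
    """Keep a probability strictly inside (0, 1) so that log() is defined."""
    return min(max(p, eps), 1.0 - eps)


def estimate_core(presence, completeness, max_iter=20, eps=1e-3):
    """Illustrative iterative core-genome estimate (hypothetical helper).

    presence:     dict genome -> set of gene-cluster ids found in that genome
    completeness: dict genome -> prior completeness estimate in [0, 1]
    Returns (core_clusters, re-estimated completeness per genome).
    """
    genomes = list(presence)
    clusters = set().union(*presence.values())
    comp = {g: _clamp(completeness[g], eps) for g in genomes}
    core = set()

    for _ in range(max_iter):
        new_core = set()
        for cl in clusters:
            n_present = sum(cl in presence[g] for g in genomes)
            freq = _clamp(n_present / len(genomes), eps)
            ll_core = ll_acc = 0.0
            for g in genomes:
                p_core = comp[g]        # core gene: seen if the genome covers it
                p_acc = comp[g] * freq  # ASSUMED accessory model, illustration only
                if cl in presence[g]:
                    ll_core += math.log(p_core)
                    ll_acc += math.log(p_acc)
                else:
                    ll_core += math.log(1.0 - p_core)
                    ll_acc += math.log(1.0 - p_acc)
            if ll_core > ll_acc:
                new_core.add(cl)

        if new_core == core:  # converged: the core assignment is stable
            break
        core = new_core
        if core:
            # Re-estimate completeness as the fraction of core clusters observed.
            comp = {g: _clamp(len(core & presence[g]) / len(core), eps)
                    for g in genomes}
    return core, comp


# Toy data (made up): cluster C is missing from the incomplete genome g3,
# cluster X occurs in a single genome.
toy_presence = {
    "g1": {"A", "B", "C", "X"},
    "g2": {"A", "B", "C"},
    "g3": {"A", "B"},
    "g4": {"A", "B", "C"},
}
toy_completeness = {g: 0.7 for g in toy_presence}

core, comp = estimate_core(toy_presence, toy_completeness)
print(sorted(core), comp)
```

On this toy input, clusters A, B and C are assigned to the core and X to the accessory genome, while the completeness of g3 is re-estimated as two thirds because it carries two of the three core clusters. With so few genomes the decision margins are small, which is in line with the abstract's point that accuracy depends on the completeness distribution and the number of genomes.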