Department of Medical Microbiology, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands.
Department of Global Health, Amsterdam Institute for Global Health and Development, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands.
Microb Genom. 2022 Mar;8(3). doi: 10.1099/mgen.0.000799.
Phylogenetic analyses are widely used in microbiological research, for example to trace the progression of bacterial outbreaks based on whole-genome sequencing data. In practice, multiple analysis steps such as assembly, alignment and phylogenetic inference are combined to form phylogenetic workflows. Comprehensive benchmarking of the accuracy of complete phylogenetic workflows is lacking. To benchmark different phylogenetic workflows, we simulated bacterial evolution under a wide range of evolutionary models, varying the relative rates of substitution, insertion, deletion, gene duplication, gene loss and lateral gene transfer events. The generated datasets corresponded to the genetic diversity usually observed within bacterial species (≥95% average nucleotide identity). We replicated each simulation three times to assess replicability. In total, we benchmarked 19 distinct phylogenetic workflows using 8 different simulated datasets. We found that recently developed k-mer alignment methods such as kSNP and ska achieve accuracy similar to that of reference mapping. The high accuracy of k-mer alignment methods can be explained by the large fractions of genomes these methods can align, relative to other approaches. We also found that the choice of assembly algorithm influences the accuracy of phylogenetic reconstruction, with workflows employing SPAdes or skesa outperforming those employing Velvet. Finally, we found that the results of phylogenetic benchmarking are highly variable between replicates. We conclude that for phylogenomic reconstruction, k-mer alignment methods are relevant alternatives to reference mapping at the species level, especially in the absence of suitable reference genomes. We show genome assembly accuracy to be an underappreciated parameter required for accurate phylogenomic reconstruction.
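The general idea behind reference-free k-mer alignment can be illustrated with a minimal Python sketch. It is not the kSNP or ska implementation: the function names `split_kmers` and `variable_sites`, the `flank` parameter and the toy sequences are all hypothetical, and the sketch ignores reverse complements, sequencing errors and read data, which real tools must handle. It assumes each genome is a plain string and reports positions where identical flanking sequences surround a differing middle base.

```python
from collections import defaultdict


def split_kmers(seq, flank=3):
    """Map each (left flank, right flank) pair to the set of middle bases seen."""
    table = defaultdict(set)
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        key = (window[:flank], window[flank + 1:])
        table[key].add(window[flank])
    return table


def variable_sites(genomes, flank=3):
    """Return flank pairs shared by all genomes whose middle base differs.

    Flank pairs with more than one middle base inside a single genome are
    skipped as ambiguous (an assumption of this sketch).
    """
    tables = {name: split_kmers(seq, flank) for name, seq in genomes.items()}
    shared = set.intersection(*(set(t) for t in tables.values()))
    sites = {}
    for key in sorted(shared):
        if any(len(t[key]) != 1 for t in tables.values()):
            continue  # ambiguous within at least one genome
        column = {name: next(iter(t[key])) for name, t in tables.items()}
        if len(set(column.values())) > 1:
            sites[key] = column
    return sites


# Toy example: two sequences that differ at a single position.
toy = {"a": "AACCGTTGG", "b": "AACCATTGG"}
result = variable_sites(toy, flank=2)
```

In this toy case only the flank pair `("CC", "TT")` is shared by both sequences, so a single variable site is reported. The columns collected this way could be concatenated into an alignment for phylogenetic inference, without mapping to a reference genome.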