Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, Stanford, CA, USA.
Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada.
Microb Genom. 2020 Aug;6(8). doi: 10.1099/mgen.0.000418. Epub 2020 Jul 31.
Pathogen genomic data are increasingly used to characterize global and local transmission patterns of important human pathogens and to inform public health interventions. Yet there is currently no consensus on how to measure genomic variation. To test the effect of the variant-identification approach on transmission inferences, we conducted an experiment in which five genomic epidemiology groups applied variant-identification pipelines to the same outbreak sequence data. We compared the variants identified by each group, as well as the transmission and phylogenetic inferences made with each variant set. To measure the performance of commonly used variant-identification tools, we simulated an outbreak. We compared the performance of three mapping algorithms, five variant callers and two variant filters in recovering true outbreak variants. Finally, we investigated the effect of applying increasingly stringent filters on transmission inferences and phylogenies. We found that variant-calling approaches used by different groups do not recover consistent sets of variants, which can lead to conflicting transmission inferences. Further, performance in recovering true variation varied widely across approaches. While no single variant-identification approach outperforms the others in recovering both true genome-wide and true outbreak-level variation, variant-identification algorithms that are calibrated upon real sequence data or that incorporate local reassembly outperform others in recovering true pairwise differences between isolates. The choice of variant filters contributed to extensive differences across pipelines, and applying increasingly stringent filters rapidly eroded the accuracy of transmission inferences and the quality of phylogenies reconstructed from outbreak variation. Commonly used approaches to identify genomic variation have variable performance, particularly when predicting potential transmission links from pairwise genetic distances. Phylogenetic reconstruction may be improved by less stringent variant filtering. Approaches that improve variant identification in repetitive, hypervariable regions, such as long-read assemblies, may improve transmission inference.
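To make concrete how variant filtering can change transmission inferences drawn from pairwise genetic distances, the following is a minimal sketch on invented toy data. It is not the authors' pipeline: the isolate names, variant positions, masked positions and the distance threshold of 2 are all hypothetical, and each isolate is assumed to be represented simply as a mapping from genome position to alternate allele.

```python
from itertools import combinations

def snp_distance(a, b):
    """Count positions at which two isolates carry different alleles.

    A position missing from an isolate's dict is treated as reference.
    """
    return sum(a.get(pos) != b.get(pos) for pos in a.keys() | b.keys())

def pairwise_distances(isolates):
    """Return {(name_a, name_b): SNP distance} for every pair of isolates."""
    return {
        (x, y): snp_distance(isolates[x], isolates[y])
        for x, y in combinations(sorted(isolates), 2)
    }

def potential_links(distances, threshold):
    """Pairs whose distance is at or below the (hypothetical) threshold."""
    return {pair for pair, d in distances.items() if d <= threshold}

def apply_mask(isolates, masked_positions):
    """Drop every variant call that falls at a masked position."""
    return {
        name: {pos: alt for pos, alt in calls.items() if pos not in masked_positions}
        for name, calls in isolates.items()
    }

# Toy variant calls (position -> alternate allele); values are invented.
isolates = {
    "iso1": {100: "A", 250: "T"},
    "iso2": {100: "A", 250: "T", 900: "G"},
    "iso3": {100: "A", 400: "C", 650: "T", 900: "G", 1200: "A"},
}

unfiltered = pairwise_distances(isolates)
links_unfiltered = potential_links(unfiltered, threshold=2)

# A deliberately over-stringent mask that removes most informative sites.
stringent = pairwise_distances(apply_mask(isolates, {400, 650, 900, 1200}))
links_stringent = potential_links(stringent, threshold=2)

print(unfiltered)
print(links_unfiltered)
print(stringent)
print(links_stringent)
```

In this toy case only one pair falls under the threshold before masking, whereas after the stringent mask every pair does, which illustrates how removing true variation can create apparent transmission links.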