Department of Microbiology, New York University School of Medicine, New York, New York 10016, USA.
Parasites and Microbes, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom.
Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24.
The routine use of genomics for disease surveillance provides the opportunity for high-resolution bacterial epidemiology. Current whole-genome clustering and multilocus typing approaches do not fully exploit core and accessory genomic variation, and they cannot both automatically identify, and subsequently expand, clusters of significantly similar isolates in large data sets spanning entire species. Here, we describe PopPUNK (Population Partitioning Using Nucleotide K-mers), a software implementing scalable and expandable annotation- and alignment-free methods for population analysis and clustering. Variable-length k-mer comparisons are used to distinguish isolates' divergence in shared sequence and gene content, which we demonstrate to be accurate over multiple orders of magnitude using data from both simulations and genomic collections representing 10 taxonomically widespread species. Connections between closely related isolates of the same strain are robustly identified, despite interspecies variation in the pairwise distance distributions that reflects species' diverse evolutionary patterns. PopPUNK can process 10^3 to 10^4 genomes in a single batch, with minimal memory use and runtimes up to 200-fold faster than existing model-based methods. Clusters of strains remain consistent as new batches of genomes are added, which is achieved without needing to reanalyze all genomes de novo. This facilitates real-time surveillance with consistent cluster naming between studies and allows for outbreak detection using hundreds of genomes in minutes. Interactive visualization and online publication are streamlined through the automatic output of results to multiple platforms. PopPUNK has been designed as a flexible platform that addresses important issues with currently used whole-genome clustering and typing methods, and has potential uses across bacterial genetics and public health research.
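The idea of comparing k-mers at several lengths to separate divergence in shared sequence from divergence in gene content can be illustrated with a minimal sketch. The abstract gives no formula, so everything here is an assumption made for illustration: exact k-mer sets stand in for whatever sketching the software uses, and the Jaccard similarity J(k) is assumed to follow log J(k) ≈ log(1 - accessory) + k · log(1 - core), so that the slope of a straight-line fit reflects per-base mismatch in shared sequence and the intercept reflects differing content. The function names `kmers`, `jaccard`, and `fit_core_accessory` are hypothetical and are not part of PopPUNK.

```python
import math


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the exact k-mer sets of two sequences."""
    set_a, set_b = kmers(a, k), kmers(b, k)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def fit_core_accessory(a, b, ks):
    """Fit log J(k) = intercept + slope * k by ordinary least squares.

    Assumed illustrative model (not taken from the abstract):
      slope     = log(1 - core),      intercept = log(1 - accessory).
    Returns (core, accessory), each clamped to the range [0, 1).
    """
    # Keep only k values where the sequences still share at least one k-mer.
    points = [(k, jaccard(a, b, k)) for k in ks]
    points = [(k, math.log(j)) for k, j in points if j > 0]
    if len(points) < 2:
        raise ValueError("need at least two k values with shared k-mers")

    n = len(points)
    mean_k = sum(k for k, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((k - mean_k) ** 2 for k, _ in points)
    if sxx == 0:
        raise ValueError("k values must not all be identical")

    slope = sum((k - mean_k) * (y - mean_y) for k, y in points) / sxx
    intercept = mean_y - slope * mean_k

    # Clamp at zero so noise cannot produce negative distances.
    core = 1 - math.exp(min(slope, 0.0))
    accessory = 1 - math.exp(min(intercept, 0.0))
    return core, accessory


if __name__ == "__main__":
    ref = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
    # Same sequence with a single substitution near the middle.
    query = ref[:15] + "T" + ref[16:]
    print(fit_core_accessory(ref, query, [3, 4, 5, 6, 7]))
```

Identical sequences give a distance of zero on both axes under this model. Real genome comparisons would use much longer k-mers and an approximate set representation to stay within memory limits, which this toy example does not attempt.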