NIHR Health Protection Research Unit in Respiratory Infections, National Heart and Lung Institute, Imperial College London, London W2 1PG, United Kingdom.
European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom.
Genome Res. 2024 Oct 29;34(10):1661-1673. doi: 10.1101/gr.279449.124.
Sequence variation observed in populations of pathogens can be used for important public health and evolutionary genomic analyses, especially outbreak analysis and transmission reconstruction. Identifying this variation is typically achieved by aligning sequence reads to a reference genome, but this approach is susceptible to reference biases and requires careful filtering of called genotypes. There is a need for tools that can process this growing volume of bacterial genome data, providing rapid results, but that remain simple so they can be used without highly trained bioinformaticians, expensive data analysis, and long-term storage and processing of large files. Here we describe split k-mer analysis (SKA2), a method that supports both reference-free and reference-based mapping to quickly and accurately genotype populations of bacteria using sequencing reads or genome assemblies. SKA2 is highly accurate for closely related samples, and in outbreak simulations, we show superior variant recall compared with reference-based methods, with no false positives. SKA2 can also accurately map variants to a reference and be used with recombination detection methods to rapidly reconstruct vertical evolutionary history. SKA2 is many times faster than comparable methods and can be used to add new genomes to an existing call set, allowing sequential use without the need to reanalyze entire collections. With an inherent absence of reference bias, high accuracy, and a robust implementation, SKA2 has the potential to become the tool of choice for genotyping bacteria. SKA2 is implemented in Rust and is freely available as open-source software.
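The general idea behind split k-mers can be illustrated with a minimal sketch. A split k-mer is a pair of flanking sequences around a single middle base; when two samples share the same flanks but carry different middle bases, the position is a candidate variant, and no reference genome is needed to find it. The Python below is an illustration only and is not the SKA2 implementation: the function names `split_kmers` and `middle_base_differences` are hypothetical, and the sketch assumes error-free sequence on a single strand (no reverse complements, canonical k-mers, or read-count filtering).

```python
from typing import Dict, List, Tuple

FlankPair = Tuple[str, str]


def split_kmers(seq: str, flank: int) -> Dict[FlankPair, str]:
    """Map each (left flank, right flank) pair to its middle base.

    Flank pairs observed with more than one middle base within the
    same sequence are dropped as ambiguous (a simplifying assumption).
    """
    found: Dict[FlankPair, str] = {}
    ambiguous = set()
    width = 2 * flank + 1
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        key = (window[:flank], window[flank + 1:])
        middle = window[flank]
        if key in ambiguous:
            continue
        if key in found and found[key] != middle:
            # Same flanks, different middle base within one sample.
            del found[key]
            ambiguous.add(key)
        else:
            found[key] = middle
    return found


def middle_base_differences(
    sample_a: str, sample_b: str, flank: int
) -> List[Tuple[str, str, str, str]]:
    """Return (left flank, base in A, base in B, right flank) for every
    flank pair shared by both samples whose middle bases differ."""
    kmers_a = split_kmers(sample_a, flank)
    kmers_b = split_kmers(sample_b, flank)
    shared = kmers_a.keys() & kmers_b.keys()
    return sorted(
        (left, kmers_a[(left, right)], kmers_b[(left, right)], right)
        for (left, right) in shared
        if kmers_a[(left, right)] != kmers_b[(left, right)]
    )


# Two toy sequences that differ by one substitution (C -> T).
print(middle_base_differences("ACGTACGGTCA", "ACGTATGGTCA", flank=3))
# prints [('GTA', 'C', 'T', 'GGT')]
```

Only the window centred on the substitution shares both flanks between the two toy samples; every other window overlaps the changed base in a flank and so does not match. This also suggests why the approach suits closely related samples, where variants are sparse enough that their flanks are usually intact.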