Saltykova Assia, Wuyts Véronique, Mattheus Wesley, Bertrand Sophie, Roosens Nancy H C, Marchal Kathleen, De Keersmaecker Sigrid C J
Platform Biotechnology and Molecular Biology, Scientific Institute of Public Health, Brussels, Belgium.
Department of Information Technology, IDLab, Ghent University, IMEC, Ghent, Belgium.
PLoS One. 2018 Feb 6;13(2):e0192504. doi: 10.1371/journal.pone.0192504. eCollection 2018.
Whole genome sequencing represents a promising new technology for subtyping of bacterial pathogens. Besides the technological advances that have pushed the approach forward, recent years have been marked by considerable evolution of whole genome sequencing data analysis methods. Before the technology can be applied as a routine epidemiological typing tool, however, reliable and efficient data analysis strategies need to be identified among the wide variety of methodologies that have emerged. In this work, we compared three existing SNP-based subtyping workflows using a benchmark dataset of 32 Salmonella enterica subsp. enterica serovar Typhimurium and serovar 1,4,[5],12:i:- isolates, including five isolates from a confirmed outbreak and three isolates obtained from the same patient at different time points. The analysis was carried out using the original (high-coverage) and down-sampled (low-coverage) datasets and two different reference genomes. All three tested workflows, namely the CSI Phylogeny-based workflow, the CFSAN-based workflow and the PHEnix-based workflow, correctly grouped the confirmed outbreak isolates and the isolates from the same patient with all combinations of reference genomes and datasets. However, the workflows differed strongly in the SNP distances between isolates and in their sensitivity to sequencing coverage, which could be linked to the specific data analysis strategies used in each. To demonstrate the effect of particular data analysis steps, several modifications of the existing workflows were also tested. This allowed us to propose the data analysis schemes most suitable for routine SNP-based subtyping of S. Typhimurium and S. 1,4,[5],12:i:-. The results presented in this study illustrate the importance of using correct data analysis strategies and of defining, benchmarking and fine-tuning the parameters applied within routine data analysis pipelines to obtain optimal results.
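The abstract compares workflows by the SNP distances they report between isolates. As a rough illustration of what such a distance is, the following minimal sketch counts differing positions between aligned SNP strings. It is not the method used by CSI Phylogeny, CFSAN or PHEnix; the function names, the rule of skipping positions with `N` or `-`, and the toy sequences are all assumptions made for this example.

```python
from itertools import combinations


def snp_distance(a: str, b: str, missing: str = "N-") -> int:
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a missing/ambiguous call are
    skipped (an assumption for this sketch; real workflows differ in
    how they treat such sites).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for x, y in zip(a, b)
        if x != y and x not in missing and y not in missing
    )


def pairwise_distances(alignment: dict[str, str]) -> dict[tuple[str, str], int]:
    """Return the SNP distance for every pair of isolates (names sorted)."""
    return {
        (p, q): snp_distance(alignment[p], alignment[q])
        for p, q in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment; these sequences are invented, not study data.
toy = {
    "outbreak_1": "ACGTACGT",
    "outbreak_2": "ACGTACGT",
    "outbreak_3": "ACGTACNT",  # one uncalled site, e.g. from low coverage
    "sporadic_1": "ATGTTCGA",
}

if __name__ == "__main__":
    for pair, dist in pairwise_distances(toy).items():
        print(pair, dist)
```

In this toy case the uncalled site in `outbreak_3` is simply ignored, which hints at why the handling of low-coverage positions can change the distances a workflow reports.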