A machine learning-based typing scheme refinement for Listeria monocytogenes core genome multilocus sequence typing with high discriminatory power for common source outbreak tracking.

Affiliations

Department of Public Health, China Medical University, Taichung, Taiwan.

Institute of Medical Science and Technology, National Sun Yat-sen University, Kaohsiung, Taiwan.

Publication information

PLoS One. 2021 Nov 19;16(11):e0260293. doi: 10.1371/journal.pone.0260293. eCollection 2021.

DOI:10.1371/journal.pone.0260293

PMID:34797875

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8604304/

Abstract

BACKGROUND

As whole-genome sequencing of pathogen genomes becomes increasingly common, gene-by-gene typing methods such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST) are being routinely implemented in molecular epidemiology. However, some intrinsic problems remain. For example, differences in read depth, read length, and assembler affect the genome assemblies, introducing erroneous or missing alleles into the generated allelic profiles. These erroneous and missing alleles might create "specious discrepancy" among closely related isolates, making accurate epidemiological interpretation challenging. In addition, the rapid growth of the cgMLST allelic profile database can cause storage and maintenance problems as well as long query search times.
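To make the "specious discrepancy" problem concrete, here is a minimal sketch of an allelic distance between two profiles. It assumes a common convention that is not stated in the abstract: a missing allele is stored as `None` and a locus is compared only when both isolates have a call. The profiles and the `allelic_distance` helper are hypothetical illustrations, not data or code from the study.

```python
from typing import Optional, Sequence

Allele = Optional[int]  # None marks a missing allele call (assumed encoding)


def allelic_distance(a: Sequence[Allele], b: Sequence[Allele]) -> int:
    """Count loci where both profiles have a call and the calls differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    return sum(
        1
        for x, y in zip(a, b)
        if x is not None and y is not None and x != y
    )


# Hypothetical 6-locus profiles for two isolates from the same source.
isolate_a = [1, 4, 2, 7, 3, 5]
isolate_b = [1, 4, 2, 7, 3, 5]

# Isolate B again, after a poor assembly: locus 3 is missing, locus 5 is miscalled.
isolate_b_noisy = [1, 4, None, 7, 9, 5]

print(allelic_distance(isolate_a, isolate_b))
print(allelic_distance(isolate_a, isolate_b_noisy))
```

Under this convention the missing locus is skipped, but the miscalled locus still counts as a difference between two isolates that are in fact identical. The more loci a scheme contains, the more opportunities there are for such spurious differences.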

METHODS

We attempted to resolve these issues by decreasing the scheme size, which reduces the occurrence of erroneous and missing alleles, alleviates the storage burden, and shortens the query search time. The challenge in this approach is maintaining the typing resolution when using fewer loci. We achieved this by using a popular machine learning technique, XGBoost, coupled with Shapley additive explanations (SHAP) for feature selection. In the end, 370 of the original 1701 cgMLST loci of Listeria monocytogenes were selected.
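The selection step can be sketched as follows. This is only an illustration of ranking features by attribution. It assumes that per-isolate, per-locus attribution values (for example, SHAP values from a trained tree model) have already been computed elsewhere. It also assumes that loci are ranked by mean absolute attribution, a common convention that the abstract does not confirm. The locus names, the values, and both helper functions are hypothetical.

```python
from typing import List, Sequence


def rank_loci(attributions: Sequence[Sequence[float]], loci: Sequence[str]) -> List[str]:
    """Order loci by mean absolute attribution, highest first."""
    if not attributions:
        raise ValueError("need at least one row of attributions")
    n_rows = len(attributions)
    scores = {
        locus: sum(abs(row[j]) for row in attributions) / n_rows
        for j, locus in enumerate(loci)
    }
    return sorted(loci, key=lambda locus: scores[locus], reverse=True)


def select_top_loci(
    attributions: Sequence[Sequence[float]], loci: Sequence[str], k: int
) -> List[str]:
    """Keep the k loci with the largest mean absolute attribution."""
    return rank_loci(attributions, loci)[:k]


# Hypothetical attribution matrix: one row per isolate, one column per locus.
loci = ["locus_A", "locus_B", "locus_C", "locus_D"]
attributions = [
    [0.50, -0.02, 0.10, 0.00],
    [-0.40, 0.01, 0.30, 0.00],
    [0.60, 0.00, -0.20, 0.01],
]

print(select_top_loci(attributions, loci, k=2))
```

In the study the same idea would operate on 1701 loci with 370 retained. How the attributions were aggregated and how the cut-off was chosen are described in the full paper rather than in this abstract.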

RESULTS

Although the final scheme (LmScheme_370) was approximately 80% smaller than the original cgMLST scheme (370 versus 1701 loci), its discriminatory power, tested on 35 outbreaks, was concordant with that of the original scheme. We used L. monocytogenes as a demonstration in this study, but the approach can be applied to other schemes and pathogens. Our findings might help elucidate gene-by-gene-based epidemiology.
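The "approximately 80%" figure follows directly from the locus counts, as the short sketch below shows. The sketch also computes one widely used measure of discriminatory power, the Hunter-Gaston index. The abstract does not say which measure the authors used, so treating this index as the comparison metric is an assumption, and the type labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def size_reduction(original_loci: int, reduced_loci: int) -> float:
    """Fraction by which the reduced scheme is smaller than the original."""
    return 1 - reduced_loci / original_loci


def hunter_gaston(types: Sequence[Hashable]) -> float:
    """Probability that two isolates drawn without replacement have different types."""
    n = len(types)
    if n < 2:
        raise ValueError("need at least two isolates")
    same_type_pairs = sum(c * (c - 1) for c in Counter(types).values())
    return 1 - same_type_pairs / (n * (n - 1))


# Locus counts taken from the abstract.
print(f"{size_reduction(1701, 370):.1%}")

# Hypothetical type assignments for six isolates under each scheme.
types_full = ["t1", "t1", "t2", "t3", "t3", "t3"]
types_reduced = ["a", "a", "b", "c", "c", "c"]

# The two schemes partition the isolates identically, so the index is equal.
print(hunter_gaston(types_full) == hunter_gaston(types_reduced))
```

The first line printed is the exact reduction (78.2%), which the abstract rounds to roughly 80%. Equal index values are a necessary but not sufficient sign of concordance, since two different partitions can share the same index. A fuller check would compare cluster membership outbreak by outbreak.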

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/21c2/8604304/a4690af9c0df/pone.0260293.g001.jpg
