Department of Public Health, China Medical University, Taichung, Taiwan.
Institute of Medical Science and Technology, National Sun Yat-sen University, Kaohsiung, Taiwan.
PLoS One. 2021 Nov 19;16(11):e0260293. doi: 10.1371/journal.pone.0260293. eCollection 2021.
As whole-genome sequencing of pathogen genomes becomes increasingly common, gene-by-gene typing methods such as core genome multilocus sequence typing (cgMLST) and whole-genome multilocus sequence typing (wgMLST) are being routinely implemented in molecular epidemiology. However, some intrinsic problems remain. For example, differences in read depth, read length, and assembler affect genome assemblies, introducing erroneous or missing alleles into the resulting allelic profiles. These erroneous and missing alleles might create "specious discrepancy" among closely related isolates, making accurate epidemiological interpretation challenging. In addition, the rapid growth of cgMLST allelic profile databases can cause storage and maintenance problems as well as long query times.
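The effect of erroneous and missing alleles on gene-by-gene comparison can be illustrated with a minimal sketch. It assumes that an allelic profile is a list of integer allele identifiers with `None` marking a missing call; the function name `allelic_distance`, the `ignore_missing` flag, and the two example isolates are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence

Allele = Optional[int]  # None marks a missing allele call (assumed encoding)


def allelic_distance(a: Sequence[Allele], b: Sequence[Allele],
                     ignore_missing: bool = True) -> int:
    """Count the loci at which two allelic profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same loci")
    distance = 0
    for x, y in zip(a, b):
        if x is None or y is None:
            # A locus with a missing call is either skipped or, in strict
            # mode, counted as a difference when only one side is missing.
            if not ignore_missing and x != y:
                distance += 1
            continue
        if x != y:
            distance += 1
    return distance


# Two hypothetical isolates from the same outbreak, five loci each.
isolate_a = [1, 4, 2, 7, 3]
isolate_b = [1, 4, None, 7, 9]  # one missing allele, one miscalled allele

strict = allelic_distance(isolate_a, isolate_b, ignore_missing=False)
lenient = allelic_distance(isolate_a, isolate_b, ignore_missing=True)
```

In this toy case every counted difference comes from an assembly artifact rather than from true divergence, which is the kind of discrepancy a smaller scheme is meant to make less frequent.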
We attempted to resolve these issues by reducing the scheme size, which lowers the occurrence of erroneous and missing alleles, alleviates the storage burden, and shortens query times. The challenge in this approach is maintaining typing resolution with fewer loci. We addressed this by using XGBoost, a widely used gradient-boosted tree method, coupled with Shapley additive explanations (SHAP) for feature selection. In the end, 370 of the original 1701 cgMLST loci of Listeria monocytogenes were selected.
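The final selection step can be sketched as follows, under stated assumptions. The sketch does not train XGBoost or compute SHAP values; it assumes that one importance score per locus is already available (for example, a mean absolute SHAP value, which is a common aggregation but is not confirmed by the abstract) and simply keeps the top-ranked loci. The names `select_top_loci` and `reduce_profile`, the locus names, and the scores are all hypothetical.

```python
from typing import Dict, List


def select_top_loci(importance: Dict[str, float], k: int) -> List[str]:
    """Return the k loci with the highest importance; ties are broken by name."""
    ranked = sorted(importance, key=lambda locus: (-importance[locus], locus))
    return ranked[:k]


def reduce_profile(profile: Dict[str, int], loci: List[str]) -> Dict[str, int]:
    """Keep only the selected loci of one allelic profile."""
    return {locus: profile[locus] for locus in loci if locus in profile}


# Hypothetical per-locus importance scores (assumed to be precomputed).
scores = {"locus_A": 0.42, "locus_B": 0.03, "locus_C": 0.31, "locus_D": 0.00}
selected = select_top_loci(scores, k=2)

# A hypothetical allelic profile restricted to the selected loci.
profile = {"locus_A": 5, "locus_B": 1, "locus_C": 9, "locus_D": 2}
reduced = reduce_profile(profile, selected)
```

Applied to a real scheme, the same ranking with `k=370` would yield a reduced locus list of the size reported here, although the study's actual selection criteria may differ from a plain top-k cut.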
Although the final scheme (LmScheme_370) was approximately 80% smaller than the original cgMLST scheme, its discriminatory power, tested on 35 outbreaks, was concordant with that of the original scheme. We used L. monocytogenes as a demonstration in this study, but the approach can be applied to other schemes and pathogens. Our findings might help elucidate gene-by-gene-based epidemiology.
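The size reduction follows directly from the locus counts, and one simple way to think about concordance is to ask whether a reduced scheme separates the same pairs of isolates as the full scheme. The sketch below works under that assumption; the pairwise criterion, the function `pairwise_concordance`, and the toy profiles are illustrative only and are not the evaluation procedure used in the study.

```python
from itertools import combinations

FULL_LOCI = 1701     # loci in the original cgMLST scheme
REDUCED_LOCI = 370   # loci in LmScheme_370
reduction = 1 - REDUCED_LOCI / FULL_LOCI  # fraction of loci removed


def pairwise_concordance(profiles, kept):
    """Fraction of isolate pairs that are distinguished (or not) identically
    by the full profile and by the profile restricted to the `kept` indices."""
    pairs = list(combinations(range(len(profiles)), 2))
    agree = 0
    for i, j in pairs:
        full_differs = profiles[i] != profiles[j]
        reduced_differs = any(profiles[i][k] != profiles[j][k] for k in kept)
        agree += full_differs == reduced_differs
    return agree / len(pairs)


# Four hypothetical isolates typed at four loci.
profiles = [
    (1, 1, 1, 1),
    (1, 1, 1, 1),
    (2, 1, 3, 1),
    (2, 1, 3, 4),
]
concordance_good = pairwise_concordance(profiles, kept=[0, 3])  # informative loci
concordance_poor = pairwise_concordance(profiles, kept=[1])     # invariant locus
```

The removed fraction works out to about 78%, consistent with the "approximately 80%" figure. In the toy data, keeping two informative loci preserves every pairwise distinction, whereas keeping only an invariant locus does not.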