Division for Epidemiology and Microbial Genomics, National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark.
Université PARIS-EST, Agence Nationale de Sécurité Sanitaire de L'Alimentation, de L'Environnement et du Travail (ANSES), Laboratory for Food Safety, Maisons-Alfort, France.
Risk Anal. 2019 Jun;39(6):1397-1413. doi: 10.1111/risa.13239. Epub 2018 Nov 21.
Next-generation sequencing (NGS) data offer untapped potential to improve microbial risk assessment (MRA) through increased specificity and redefinition of the hazard. Most MRA models do not account for differences in survivability and virulence among strains. This study explored the potential of machine learning algorithms to predict the risk/health burden at the population level from large and complex NGS data, with Listeria monocytogenes as a case study. The Listeria data consisted of a percentage similarity matrix derived from genome assemblies of 38 strains of clinical origin and 207 strains of food origin. The Basic Local Alignment Search Tool (BLAST) was used to align the assemblies against a database of 136 virulence and stress resistance genes. The outcome variable was frequency of illness, defined as the percentage of reported cases associated with each strain. These frequency data were discretized into seven ordinal outcome categories and used for supervised machine learning and model selection among five ensemble algorithms. There was no significant difference in accuracy between the models, and a support vector machine with a linear kernel was chosen for further inference (accuracy of 89% [95% CI: 68%, 97%]). The virulence genes FAM002725, FAM002728, FAM002729, InlF, InlJ, InlK, IisY, IisD, IisX, IisH, IisB, lmo2026, and FAM003296 were important predictors of higher frequency of illness. InlF was uniquely truncated in the sequence type 121 strains. Most important risk predictor genes occurred at highest prevalence among strains from ready-to-eat, dairy, and composite foods. The findings and approaches described offer the potential for rethinking current approaches in MRA.
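As an illustration only, the discretization of the outcome variable and the reporting of accuracy with a confidence interval might be sketched as follows in Python, using only the standard library. The abstract does not state the cut points or the interval method, so equal-width binning and the Wilson score interval are assumptions here, as are the function names, the example frequencies, and the count of 17 correct predictions out of 19 held-out strains.

```python
from statistics import NormalDist


def discretize(frequencies, n_bins=7):
    """Map illness frequencies (percent) to ordinal classes 0..n_bins-1.

    Assumption: equal-width bins between the observed minimum and maximum.
    The abstract only says that seven ordinal categories were used.
    """
    lo, hi = min(frequencies), max(frequencies)
    if hi == lo:
        # All strains share one frequency, so only one class is needed.
        return [0 for _ in frequencies]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((f - lo) / width), n_bins - 1) for f in frequencies]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true ordinal class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def wilson_interval(correct, total, confidence=0.95):
    """Wilson score interval for a binomial proportion such as accuracy.

    Assumption: the abstract does not say which interval method was used.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * (p * (1 - p) / total + z * z / (4 * total * total)) ** 0.5 / denom
    return centre - half, centre + half


# Hypothetical frequencies of illness (percent of reported cases per strain).
example_frequencies = [0, 10, 20, 30, 40, 50, 60, 70]
classes = discretize(example_frequencies)

# Hypothetical held-out evaluation: 17 of 19 strains classified correctly.
low, high = wilson_interval(17, 19)
print(classes)
print(f"accuracy = {17 / 19:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

Equal-width bins keep the sketch short. Quantile-based or expert-defined cut points would fit the same interface if the frequencies are heavily skewed.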