Research Group for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kemitorvet, Building 204, 2800 Kgs. Lyngby, Denmark.
Int J Food Microbiol. 2019 Mar 2;292:72-82. doi: 10.1016/j.ijfoodmicro.2018.11.016. Epub 2018 Dec 4.
The ever-decreasing cost and increasing throughput of next-generation sequencing (NGS) techniques have resulted in a rapid increase in the availability of NGS data. Such data have the potential for rapid, reproducible and highly discriminative characterization of pathogens. This provides an opportunity in microbial risk assessment to account for variations in survivability and virulence among strains. A major challenge for such attempts remains the high dimensionality of genomic data relative to the number of isolates. Machine learning-based (ML) predictive risk modelling provides a solution to this "curse of dimensionality" while accounting for individual effects that depend on interactions with other genetic and environmental factors. This pilot study explores the potential of ML in the prediction of health endpoints resulting from shigatoxigenic E. coli (STEC) infection. Amino acid sequences of accessory genes were used as model input to predict and differentiate health outcomes in STEC infections, including diarrhea, bloody diarrhea, hemolytic uremic syndrome (HUS) and their combinations. Outcome severity was also distinguished by hospitalization. A matrix of percent similarity between accessory genes and the E. coli genomes was generated and subsequently used as input for ML. The performance of the ML algorithms random forest, support vector machine (radial and linear kernel), gradient boosting, and logit boost was compared. Logit boost was the best model, showing an outcome prediction accuracy of 0.75 (95% CI: 0.60, 0.86), an excellent or substantial performance (Kappa = 0.72).
Important genetic predictors of riskier STEC clinical outcomes included proteins involved in initial attachment to the host cell, persistence of plasmids or genomic islands, conjugative plasmid transfer and formation of sex pili, regulation of locus of enterocyte effacement expression, post-translational acetylation of proteins, facilitation of the rearrangement or deletion of sections within the pathogenic islands, and transport of macromolecules across the cell envelope. Further studies are proposed on the proteins with undefined or unclear functionality. One protein family in particular predicted HUS outcome. Toxin-antitoxin systems are potential stress adaptation markers which may mediate environmental persistence of strains in diverse sources. We foresee the application of the ML approach to the set-up of real-time online analysis of whole genome sequence data to estimate the human health risk at the population or strain level. The ML approach is envisaged to support the prediction of more specific STEC clinical endpoint types by inputting isolate sequence data.
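The accuracy and Kappa values reported above can be illustrated with a minimal sketch of how both statistics are computed from true and predicted outcome labels. This is not the authors' code: the function names, the toy labels (D, BD, HUS) and the choice of the Landis and Koch scale for interpreting Kappa are assumptions made for illustration, and the abstract does not state which interpretation scale was used.

```python
from collections import Counter
import math


def accuracy_and_kappa(y_true, y_pred):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of isolates whose outcome was predicted correctly.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: sum over classes of P(true = c) * P(predicted = c).
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if math.isclose(expected, 1.0):
        # Kappa is undefined when chance agreement is already perfect.
        return observed, float("nan")
    return observed, (observed - expected) / (1.0 - expected)


def landis_koch_label(kappa):
    """Map kappa to the Landis and Koch (1977) agreement bands (assumed scale)."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


if __name__ == "__main__":
    # Hypothetical toy labels, not data from the study.
    y_true = ["D", "D", "D", "D", "BD", "BD", "BD", "HUS", "HUS", "HUS"]
    y_pred = ["D", "D", "D", "BD", "BD", "BD", "HUS", "HUS", "HUS", "D"]
    acc, kappa = accuracy_and_kappa(y_true, y_pred)
    print(f"accuracy={acc:.2f} kappa={kappa:.2f} ({landis_koch_label(kappa)})")
```

Under this assumed scale, the reported Kappa of 0.72 falls in the "substantial" band. Kappa is reported alongside accuracy because it corrects for the agreement expected by chance, which matters when outcome classes are unevenly represented among isolates.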