Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA.
Department of Nutrition and Food Science, University of Maryland, College Park, MD 20742, USA; Center for Food Safety and Security Systems, University of Maryland, College Park, MD 20742, USA.
Food Res Int. 2022 Jan;151:110817. doi: 10.1016/j.foodres.2021.110817. Epub 2021 Nov 22.
The past few years have seen a significant increase in the availability of whole genome sequencing (WGS) information, allowing for its incorporation in predictive modeling for foodborne pathogens to account for inter- and intra-species differences in their virulence. However, this is hindered by the inability of traditional statistical methods to analyze data in which the number of genomic features far exceeds the number of observations (isolates). In this study, we explored the applicability of machine learning (ML) models to predict disease outcome while identifying features that exert a significant effect on the prediction. The study was conducted on Salmonella enterica, a major foodborne pathogen with considerable inter- and intra-serovar variation. WGS data of isolates obtained from various sources (i.e., human, chicken, and swine) were used as input to four ML models (ridge-regularized logistic regression, random forest, support vector machine, and AdaBoost) to classify isolates by disease severity (extraintestinal vs. gastrointestinal) in the host. The predictive performance of all models was tested with and without Elastic Net regularization to combat dimensionality issues. The Elastic Net-regularized logistic regression model showed the best area under the receiver operating characteristic curve (AUC-ROC; 0.86) and outcome prediction accuracy (0.76). Additionally, genes coding for transcriptional regulation; acidic, oxidative, and anaerobic stress response; and antibiotic resistance were found to be significant predictors of disease severity. These genes, which were significantly associated with each outcome, could be used as inputs to amended, gene-expression-specific predictive models to estimate the virulence pattern-specific effect of Salmonella and other foodborne pathogens on human health.
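To illustrate the kind of model the abstract describes, the following is a minimal sketch of an Elastic Net-regularized logistic regression fitted to a binary gene presence/absence matrix and scored by AUC-ROC and accuracy. It is not the authors' pipeline: the data are synthetic, and the function names, the penalty strength `alpha`, the mixing parameter `l1_ratio`, the learning rate, and the train/test split are all assumptions chosen for illustration. The fit uses plain proximal gradient descent from the standard library only; a real analysis would more likely rely on an established ML library.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a weight toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net_logreg(X, y, alpha=0.05, l1_ratio=0.5, lr=0.1, epochs=150):
    """Proximal gradient descent on mean log-loss plus an Elastic Net penalty."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        b -= lr * grad_b / n  # the intercept is not penalized
        for j in range(p):
            # Gradient step on the log-loss and the L2 (ridge) part ...
            step = w[j] - lr * (grad_w[j] / n + alpha * (1.0 - l1_ratio) * w[j])
            # ... followed by the proximal step for the L1 (lasso) part.
            w[j] = soft_threshold(step, lr * alpha * l1_ratio)
    return w, b


def predict_proba(X, w, b):
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def auc_roc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Synthetic stand-in for WGS-derived features (assumption, not real data):
# rows are isolates, columns are genes (1 = present, 0 = absent);
# label 1 = extraintestinal, 0 = gastrointestinal.
rng = random.Random(0)
n_isolates, n_genes, n_informative = 120, 100, 5
y = [rng.randint(0, 1) for _ in range(n_isolates)]
X = []
for label in y:
    row = []
    for j in range(n_genes):
        if j < n_informative:
            p_present = 0.8 if label == 1 else 0.2  # gene linked to outcome
        else:
            p_present = 0.5  # uninformative gene
        row.append(1 if rng.random() < p_present else 0)
    X.append(row)

# More genes (100) than training isolates (80), mimicking the dimensionality issue.
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

w, b = fit_elastic_net_logreg(X_train, y_train)
probs = predict_proba(X_test, w, b)
auc = auc_roc(y_test, probs)
accuracy = sum((p >= 0.5) == (t == 1) for p, t in zip(probs, y_test)) / len(y_test)
selected = [j for j, wj in enumerate(w) if wj != 0.0]

print(f"AUC-ROC: {auc:.2f}, accuracy: {accuracy:.2f}")
print(f"genes with non-zero weight: {len(selected)} of {n_genes}")
```

The L1 part of the penalty can set the weights of uninformative genes exactly to zero, which is what lets this kind of model double as a feature selector when features outnumber isolates; the L2 part keeps the weights of correlated genes stable.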