Tao Dandan, Zhang Dongyu, Hu Ruofan, Rundensteiner Elke, Feng Hao
Vanke School of Public Health, Tsinghua University, Beijing 100084, China.
Data Science Program, Worcester Polytechnic Institute, Worcester, MA 01609, USA.
Foods. 2023 Oct 19;12(20):3825. doi: 10.3390/foods12203825.
Foodborne diseases are a significant but avoidable public health issue, and identifying the source of contamination is a key step in outbreak investigation and in preventing further illness. Historical foodborne outbreaks provide rich data on critical attributes such as outbreak factors, food vehicles, and etiologies, and a better understanding of the relationships between these attributes could inform the development of effective food safety interventions. The purpose of this study was to identify hidden patterns in the relations between the critical attributes of historical foodborne outbreaks using data mining approaches. Statistical analysis was used to identify associations between outbreak factors and food sources, and the factors with strongly significant associations were selected as predictors of food vehicles. A multinomial prediction model was built on the selected factors to predict "simple" foods (beef, dairy, and vegetables) as sources of outbreaks. In addition, the relations between food vehicles and common etiologies were investigated through text mining approaches (support vector machines, logistic regression, random forest, and naïve Bayes). A support vector machine model was identified as the optimal model for predicting etiologies from the occurrence of food vehicles. Association rules also indicated the specific food vehicles that are strongly related to the etiologies. Meanwhile, a food ingredient network describing the relationships between foods and ingredients was constructed and combined with Monte Carlo simulation to predict the ingredients likely to be involved when a given food causes an outbreak. The simulated results were confirmed against foods and ingredients already known to have caused historical foodborne outbreaks. The method could help predict the possible ingredient sources of contamination when given the name of a food.
The results could provide insights into the early identification of food sources of contamination and assist in future outbreak investigations. The data-driven approach will provide a new perspective and strategies for discovering hidden knowledge from massive data.
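As a rough illustration of how a food ingredient network can be combined with Monte Carlo simulation, the sketch below samples weighted random walks from a food down to its base ingredients and reports how often each ingredient is reached. It is not the authors' implementation: the abstract does not describe how the network is weighted or how the simulation is run, so the toy network, its weights, and the function names `sample_ingredient` and `simulate` are all hypothetical, and the network is assumed to contain no cycles.

```python
import random
from collections import Counter

# Hypothetical toy network: each food maps to its components with an
# illustrative weight. None of these values come from the study.
NETWORK = {
    "cheeseburger": {"beef patty": 3, "cheese": 2, "lettuce": 1, "bun": 1},
    "beef patty": {"ground beef": 1},
    "cheese": {"milk": 1},
    "bun": {"wheat flour": 1},
}


def sample_ingredient(food, network, rng):
    """Follow weighted random edges from `food` until a node with no
    listed components (a base ingredient) is reached.
    Assumes the network is acyclic."""
    node = food
    while node in network:
        components = list(network[node])
        weights = [network[node][c] for c in components]
        node = rng.choices(components, weights=weights, k=1)[0]
    return node


def simulate(food, network, n_trials=10_000, seed=0):
    """Estimate how often each base ingredient is reached from `food`,
    returned as relative frequencies sorted from most to least common."""
    rng = random.Random(seed)
    counts = Counter(
        sample_ingredient(food, network, rng) for _ in range(n_trials)
    )
    return {name: count / n_trials for name, count in counts.most_common()}


if __name__ == "__main__":
    for ingredient, share in simulate("cheeseburger", NETWORK).items():
        print(f"{ingredient}: {share:.3f}")
```

In this toy setup the estimated frequencies simply approach the normalized edge weights; in practice the weights would have to be derived from outbreak or recipe data, which the abstract does not specify.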