Computer Network Information Center, Chinese Academy of Sciences, Beijing, China.
School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China.
Foodborne Pathog Dis. 2021 Aug;18(8):590-598. doi: 10.1089/fpd.2020.2913. Epub 2021 Apr 26.
The China National Center for Food Safety Risk Assessment (CFSA) uses the Foodborne Disease Monitoring and Reporting System (FDMRS) to monitor foodborne disease outbreaks across the country. However, underreporting and erroneous reporting in the FDMRS significantly increase the cost of the related epidemic investigations. To address this problem, we designed a model that identifies suspected outbreaks from the data generated by the CFSA's FDMRS. Machine learning classification models were fitted to the data, and recall and F1-score were used as evaluation metrics to compare their classification performance. Feature importance and pathogenic factors were identified and analyzed using tree-based and gradient boosting models. The best-performing model was then evaluated on three real foodborne disease outbreaks. Furthermore, SHapley Additive exPlanation (SHAP) values were used to identify the effect of individual features. Among all the machine learning classification models, the eXtreme Gradient Boosting (XGBoost) model achieved the best performance, with the highest recall and F1-score of 0.9699 and 0.9582, respectively. In validation, the model judged the real outbreaks correctly. In the feature importance analysis with the XGBoost model, the health status of other people with the same exposure had the highest weight, reaching 0.65. The machine learning model built in this study exhibits high accuracy in recognizing foodborne disease outbreaks, thus reducing the manual burden on medical staff. The model also helped us identify the confounding factors of foodborne disease outbreaks. Attention should be paid not only to the health status of those with the same exposure but also to the similarity of cases in time and space.
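The two evaluation metrics named in the abstract, recall and F1-score, can be illustrated with a minimal sketch. The labels below are hypothetical toy values (1 = outbreak-related case, 0 = sporadic case), not data from the FDMRS, and the helper name `recall_and_f1` is an assumption introduced here for illustration; the study's own evaluation code is not given in the abstract.

```python
def recall_and_f1(y_true, y_pred):
    """Return (recall, F1) for binary labels where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return recall, f1


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

recall, f1 = recall_and_f1(y_true, y_pred)
print(f"recall={recall:.4f}, F1={f1:.4f}")
```

Recall is a natural primary metric in this setting because a missed outbreak (a false negative) is the underreporting problem the abstract describes, while F1 also accounts for false alarms through precision.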