Department of Medical Statistics, School of Public Health & Center for Health Information Research & Sun Yat-sen Global Health Institute, Sun Yat-sen University, Guangzhou, China.
Department of Infectious Disease Control and Prevention, Guangzhou Center for Disease Control and Prevention, Guangzhou, China.
BMC Infect Dis. 2024 Feb 26;24(1):265. doi: 10.1186/s12879-024-09138-x.
Infectious diarrhea remains a major public health problem worldwide. This study used a stacking ensemble to develop a predictive model for the incidence of infectious diarrhea, aiming to achieve better prediction performance.
Using surveillance data on infectious diarrhea cases, related symptoms, and meteorological factors in Guangzhou from 2016 to 2021, we developed four base prediction models: artificial neural networks (ANN), long short-term memory networks (LSTM), support vector regression (SVR), and extreme gradient boosting regression trees (XGBoost). These were then combined by stacking to obtain the final prediction model. All models were evaluated with three metrics: mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE).
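The stacking step described above can be illustrated with a minimal sketch. It is not the study's implementation. Two trivial forecasting rules (`naive_last` and `moving_average`, both hypothetical) stand in for the four base learners. The meta-learner is assumed to be an ordinary least-squares blend without intercept, which the abstract does not specify. The weekly counts are made-up toy numbers, not surveillance data.

```python
def naive_last(history):
    """Stand-in base model 1: predict the most recent weekly count."""
    return history[-1]


def moving_average(history, window=4):
    """Stand-in base model 2: mean of the last `window` weekly counts."""
    recent = history[-window:]
    return sum(recent) / len(recent)


# Hypothetical stand-ins for the ANN, LSTM, SVR and XGBoost base models.
BASE_MODELS = (naive_last, moving_average)


def base_predictions(series, start, end):
    """One-step-ahead base-model predictions for weeks start..end-1.

    Each prediction for week t only sees series[:t], so the meta-learner
    is fitted on out-of-sample base predictions.
    """
    return [tuple(model(series[:t]) for model in BASE_MODELS)
            for t in range(start, end)]


def fit_meta_weights(rows, targets):
    """Least-squares fit of y ~ w1*p1 + w2*p2 via the 2x2 normal equations."""
    a11 = sum(p1 * p1 for p1, _ in rows)
    a12 = sum(p1 * p2 for p1, p2 in rows)
    a22 = sum(p2 * p2 for _, p2 in rows)
    b1 = sum(p1 * y for (p1, _), y in zip(rows, targets))
    b2 = sum(p2 * y for (_, p2), y in zip(rows, targets))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # Base predictions are collinear: fall back to a plain average.
        return 0.5, 0.5
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det


def stacked_forecast(series, weights):
    """Blend the base models' next-week predictions with the meta weights."""
    preds = [model(series) for model in BASE_MODELS]
    return sum(w * p for w, p in zip(weights, preds))


# Toy weekly case counts (illustrative only, not data from the study).
weekly_cases = [120, 135, 150, 160, 155, 170, 190, 210, 205, 220, 240, 260]
meta_start = 4  # weeks before this index only serve as history
rows = base_predictions(weekly_cases, meta_start, len(weekly_cases))
targets = weekly_cases[meta_start:]
weights = fit_meta_weights(rows, targets)
print("meta weights:", weights)
print("stacked next-week forecast:", stacked_forecast(weekly_cases, weights))
```

Because the weights (1, 0) and (0, 1) are feasible solutions, the fitted blend can do no worse than either base rule in squared error on the weeks used to fit the meta-learner. Performance on new weeks still has to be checked on held-out data.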
Base models that combined symptom surveillance data with the weekly number of infectious diarrhea cases achieved lower RMSEs, MAEs, and MAPEs than models that combined meteorological data with the weekly number of cases. The LSTM had the best prediction performance among the four base models, with an RMSE, MAE, and MAPE of 84.85, 57.50, and 15.92%, respectively. The stacking ensemble outperformed all four base models, with an RMSE, MAE, and MAPE of 75.82, 55.93, and 15.70%, respectively.
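For readers unfamiliar with the three metrics reported here, the sketch below computes them under their standard textbook definitions. The abstract does not give the exact formulas, so these definitions are an assumption. The function names and the example numbers are illustrative and do not come from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error: square root of the mean squared difference."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual, predicted):
    """Mean absolute error: mean of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when any actual value is zero, so that case is rejected.
    """
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)


# Illustrative weekly counts (made up, not from the study).
actual = [100, 200]
predicted = [110, 180]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

RMSE and MAE are in the units of the target (weekly cases), while MAPE is scale-free. This is why the reported RMSE and MAE are plain numbers and the MAPE is a percentage.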
Incorporating symptom surveillance data could improve the predictive accuracy of infectious diarrhea prediction models, and symptom surveillance data were more effective than meteorological data in enhancing model performance. Combining multiple prediction models through stacking alleviated the difficulty of selecting the optimal model and yielded a model that performed better than the base models.