Center for Marine Environmental Studies (CMES), Ehime University, Matsuyama, Japan.
Graduate School of Science and Engineering, Ehime University, Matsuyama, Ehime, Japan.
PLoS Negl Trop Dis. 2024 Oct 21;18(10):e0012599. doi: 10.1371/journal.pntd.0012599. eCollection 2024 Oct.
Spatiotemporal dengue forecasting using machine learning (ML) can contribute to the development of prevention and control strategies for impending dengue outbreaks. However, training data for dengue incidence are often zero-inflated because cases are rare, which lowers prediction accuracy. This study had three aims: (i) to understand how the spatiotemporal resolution of the data influences the accuracy of dengue incidence prediction using ML models; (ii) to understand how this influence differs between quantitative and qualitative predictions of dengue incidence; and (iii) to improve the accuracy of dengue incidence prediction with zero-inflated data.
We predicted dengue incidence at six spatiotemporal resolutions and compared the prediction accuracy across them. Six ML algorithms were compared: generalized additive models, random forests, conditional inference forests, artificial neural networks, support vector machines and regression, and extreme gradient boosting. Data from 2009 to 2012 were used for training, and data from 2013 were used for model validation with quantitative and qualitative dengue variables. To address the inaccuracy in the quantitative prediction of dengue incidence caused by zero-inflated data at fine spatiotemporal scales, we developed a hybrid approach in which the second-stage quantitative prediction is performed only when and where the first-stage qualitative model predicts the occurrence of dengue cases.
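The two-stage logic of the hybrid approach can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `hybrid_predict`, the two feature columns (rainfall and temperature), and the threshold-based stand-in models are all hypothetical and chosen only to show the control flow. In practice, the two callables would be a trained qualitative (occurrence) model and a trained quantitative (incidence) model.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]


def hybrid_predict(
    occurs: Callable[[Features], bool],
    incidence: Callable[[Features], float],
    rows: Sequence[Features],
) -> List[float]:
    """Two-stage prediction.

    Stage 1: the qualitative model decides whether any cases occur.
    Stage 2: the quantitative model is run only for rows where stage 1
    predicts occurrence; all other rows are predicted as zero.
    """
    predictions: List[float] = []
    for row in rows:
        if occurs(row):
            predictions.append(incidence(row))
        else:
            predictions.append(0.0)
    return predictions


# Hypothetical stand-ins for trained models; each row is
# (rainfall_mm, temperature_c). These rules are illustrative only.
def toy_classifier(row: Features) -> bool:
    return row[0] > 50.0 and row[1] > 25.0


def toy_regressor(row: Features) -> float:
    return 0.1 * row[0] + 0.5 * (row[1] - 25.0)


rows = [(10.0, 22.0), (80.0, 28.0), (60.0, 24.0)]
result = hybrid_predict(toy_classifier, toy_regressor, rows)
print(result)
```

Only the second row passes the first stage, so the quantitative model is evaluated once and the other two space-time units are predicted as zero.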
At higher resolutions, the dengue incidence data were zero-inflated and did not allow ML to extract quantitative relationships between dengue incidence and environmental variables. Qualitative models, which treat dengue occurrence as a binary variable, eased the effect of the data distribution. Our novel hybrid approach of combining qualitative and quantitative predictions demonstrated high potential for predicting zero-inflated or rare phenomena, such as dengue.
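The link between resolution and zero inflation can be seen in a small Python sketch. The weekly counts below are made-up toy values, not data from the study, and the helper names `zero_fraction` and `aggregate` are hypothetical. The sketch covers only temporal aggregation; spatial aggregation would follow the same idea.

```python
from typing import List, Sequence


def zero_fraction(counts: Sequence[int]) -> float:
    """Share of observations that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)


def aggregate(counts: Sequence[int], window: int) -> List[int]:
    """Sum consecutive, non-overlapping windows (e.g. 4 weeks per block)."""
    return [sum(counts[i:i + window]) for i in range(0, len(counts), window)]


# Toy weekly case counts for one location (illustrative only).
weekly = [0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 3]
four_weekly = aggregate(weekly, 4)

print(zero_fraction(weekly))       # share of zero weeks
print(zero_fraction(four_weekly))  # share of zero 4-week blocks
```

In this toy series, 12 of 16 weeks are zero, but only 1 of the 4 four-week blocks is zero, which shows why finer resolutions leave a quantitative model with far fewer non-zero examples.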
Our research contributes valuable insights to the field of spatiotemporal dengue prediction and provides an alternative way to enhance prediction accuracy for zero-inflated data in settings where hurdle or zero-inflated models cannot be applied.