Jiang Dong, Hao Mengmeng, Ding Fangyu, Fu Jingying, Li Meng
State Key Laboratory of Resources and Environmental Information System, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, 100101, China; College of Resources and Environment, University of Chinese Academy of Sciences, Beijing, 100049, China.
Acta Trop. 2018 Sep;185:391-399. doi: 10.1016/j.actatropica.2018.06.021. Epub 2018 Jun 19.
Zika virus, which has been linked to severe congenital abnormalities, is exacerbating global public health problems through its rapid transnational expansion, fueled by increased global travel and trade. Suitability mapping of Zika virus transmission risk is essential for drafting public health plans and disease control strategies, especially in areas where medical resources are relatively scarce. Prediction of Zika virus outbreak risk has been studied in recent years, but the published literature rarely includes comparisons of multiple models or analysis of predictive uncertainty. Here, three relatively popular machine learning models, backward propagation neural network (BPNN), gradient boosting machine (GBM) and random forest (RF), were adopted to map the probability of Zika epidemic outbreak at the global level, pairing high-dimensional multidisciplinary covariate layers with comprehensive location data on recorded Zika virus infection in humans. The results show that the predicted high-risk areas for Zika transmission are concentrated in four regions: Southeastern North America, Eastern South America, Central Africa and Eastern Asia. To evaluate the performance of the machine learning models, 50 modeling runs were conducted on a training dataset. The BPNN model obtained the highest predictive accuracy, with a 10-fold cross-validation area under the curve (AUC) of 0.966 [95% confidence interval (CI) 0.965-0.967], followed by the GBM model (10-fold cross-validation AUC = 0.964 [0.963-0.965]) and the RF model (10-fold cross-validation AUC = 0.963 [0.962-0.964]). Based on training samples, the prediction accuracies achieved by the GBM and RF models differ significantly from that of the BPNN-based model (p = 0.0258* and p = 0.0001***, respectively). Importantly, the prediction uncertainty introduced by the selection of absence data was quantified, which could provide more accurate fundamental and scientific information for further study on disease transmission prediction and risk assessment.
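The reporting format used above (a mean AUC over 50 runs with a 95% CI, where each run uses a different selection of absence data) can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not say how the CI was computed, so a normal approximation across runs is assumed here. The scores are synthetic toy values rather than model output, and the names auc, mean_ci95, presence_scores and absence_pool are hypothetical.

```python
import math
import random
import statistics


def auc(labels, scores):
    """Rank-based AUC: the fraction of (presence, absence) pairs in which
    the presence point scores higher, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_ci95(values):
    """Mean and normal-approximation 95% CI across repeated runs
    (an assumption; the abstract does not state the CI method)."""
    m = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return m, m - half, m + half


# Toy data (assumption): predicted risk scores for recorded presence points
# and for a large pool of candidate absence points.
rng = random.Random(0)
presence_scores = [rng.gauss(2.0, 1.0) for _ in range(200)]
absence_pool = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# 50 runs, each with a freshly drawn absence sample, so the spread of the
# resulting AUC values reflects the uncertainty from absence selection.
run_aucs = []
for _ in range(50):
    absence_scores = rng.sample(absence_pool, 200)
    labels = [1] * len(presence_scores) + [0] * len(absence_scores)
    run_aucs.append(auc(labels, presence_scores + absence_scores))

mean_auc, lo, hi = mean_ci95(run_aucs)
print(f"AUC = {mean_auc:.3f} [95% CI {lo:.3f}-{hi:.3f}]")
```

In a real analysis each run would also refit the model on its own presence/absence sample and score held-out folds. The sketch holds the scores fixed so that only the effect of absence selection is visible.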