BACKGROUND

Accurate and reliable predictions of infectious diseases are valuable to public health organizations that plan interventions to reduce or prevent disease transmission. A wide variety of models have been developed for this task, but their performance varies across data series. Hepatitis E, an acute liver disease, has been a major public health problem. Which model is most appropriate for predicting the incidence of hepatitis E? In this paper, we apply three different methods and compare their performance.

METHODS

Autoregressive integrated moving average (ARIMA), support vector machine (SVM), and long short-term memory (LSTM) recurrent neural network models were adopted and compared. ARIMA was implemented in Python with statsmodels. SVM was implemented in MATLAB with the LIBSVM library. We designed our own LSTM with Keras, a deep learning library. To counter the overfitting caused by the limited number of training samples, we applied dropout and regularization in the LSTM model. The experimental data were the monthly incidence and number of cases of hepatitis E from January 2005 to December 2017 in Shandong province, China. Data from July 2015 to December 2017 were used to validate the models, and the rest served as the training set. Three metrics were used to compare model performance: root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE).
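The chronological split and the three error metrics described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names (`month_labels`, `split_by_month`, `rmse`, `mae`, `mape`) are hypothetical, and the `actual` and `predicted` values are made-up toy numbers rather than study data. MAPE is expressed as a percentage and assumes no observed value is zero.

```python
import math


def month_labels(start_year, start_month, end_year, end_month):
    """Return 'YYYY-MM' labels for every month in the inclusive range."""
    labels = []
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


def split_by_month(labels, first_validation_month):
    """Split chronologically: everything before the given month is training."""
    idx = labels.index(first_validation_month)
    return labels[:idx], labels[idx:]


def rmse(actual, predicted):
    """Root mean square error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Study period from the abstract: January 2005 to December 2017,
# with July 2015 onward held out for validation.
months = month_labels(2005, 1, 2017, 12)
train_months, validation_months = split_by_month(months, "2015-07")
print(len(train_months), len(validation_months))

# Toy values only, to show how the metrics are called.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

With the dates given in the abstract, this split yields 126 training months and 30 validation months. The split is chronological rather than random, so the models are always validated on months later than those they were trained on.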

RESULTS

Based on analysis of the data, we selected ARIMA(1, 1, 1) and ARIMA(3, 1, 2) as the prediction models for monthly incidence and number of cases, respectively. Cross-validation and grid search were used to optimize the SVM parameters: the penalty coefficient C and kernel function parameter g were set to 8 and 0.125 for incidence prediction, and to 22 and 0.01 for case-number prediction. The LSTM had 4 nodes, and the dropout and L2 regularization parameters were set to 0.15 and 0.001, respectively. For ARIMA, SVM, and LSTM, respectively, the RMSE was 0.022, 0.0204, and 0.01 for incidence prediction and 22.25, 20.0368, and 11.75 for case-number prediction. The MAPE was 23.5%, 21.7%, and 15.08% for incidence prediction and 23.6%, 21.44%, and 13.6% for case-number prediction. The MAE was 0.018, 0.0167, and 0.011 for incidence prediction and 18.003, 16.5815, and 9.984 for case-number prediction.
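To make the comparison easier to read, the reported figures can be tabulated and ranked with a short script. The sketch below uses only the values stated above. The names `REPORTED`, `best_model`, and `relative_reduction` are hypothetical helpers, and the "relative reduction versus ARIMA" is a derived illustration, not a quantity reported by the authors.

```python
# Error metrics as reported in the abstract, ordered (ARIMA, SVM, LSTM).
# MAPE values are percentages; lower is better for every metric.
MODELS = ("ARIMA", "SVM", "LSTM")

REPORTED = {
    "incidence": {
        "RMSE": (0.022, 0.0204, 0.01),
        "MAPE": (23.5, 21.7, 15.08),
        "MAE": (0.018, 0.0167, 0.011),
    },
    "cases": {
        "RMSE": (22.25, 20.0368, 11.75),
        "MAPE": (23.6, 21.44, 13.6),
        "MAE": (18.003, 16.5815, 9.984),
    },
}


def best_model(scores):
    """Return the name of the model with the lowest error."""
    best_index = min(range(len(MODELS)), key=lambda i: scores[i])
    return MODELS[best_index]


def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


for target, metrics in REPORTED.items():
    for name, scores in metrics.items():
        arima_score, _, lstm_score = scores
        reduction = relative_reduction(arima_score, lstm_score)
        print(
            f"{target:9s} {name:4s} best={best_model(scores):5s} "
            f"LSTM error is {reduction:.1f}% lower than ARIMA"
        )
```

On every metric and for both targets, the lowest reported error belongs to LSTM, with SVM second and ARIMA third.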

CONCLUSIONS

Comparing ARIMA, SVM, and LSTM, we found that the nonlinear models (SVM, LSTM) outperformed the linear model (ARIMA). LSTM performed best on all three metrics (RMSE, MAPE, and MAE). Hence, of the three models, LSTM is the most suitable for predicting the monthly incidence and number of cases of hepatitis E.
