School of Public Health, Guangxi Medical University, Nanning, China.
Guangxi Key Laboratory of AIDS Prevention and Treatment, Guangxi Medical University, Nanning, China.
J Med Internet Res. 2023 Oct 30;25:e49400. doi: 10.2196/49400.
Internet-derived data, together with autoregressive integrated moving average (ARIMA) models and ARIMA models with explanatory variables (ARIMAX), are widely used for infectious disease surveillance. However, the effectiveness of the Baidu search index (BSI) in predicting the incidence of scarlet fever remains uncertain.
Our objective was to investigate whether a low-cost BSI monitoring system could potentially function as a valuable complement to traditional scarlet fever surveillance in China.
ARIMA and ARIMAX models were developed to predict the incidence of scarlet fever in China using data from the National Health Commission of the People's Republic of China between January 2011 and August 2022. The procedures included establishing a keyword database, keyword selection and filtering through Spearman rank correlation and cross-correlation analyses, construction of the scarlet fever comprehensive search index (CSI), modeling with the training sets, predicting with the testing sets, and comparing the prediction performances.
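The keyword filtering step can be illustrated with a minimal sketch that computes a Spearman rank correlation between each keyword's search volume and reported cases at several lags. This is not the authors' code: the function names, the 0.4 threshold, and the maximum lag are hypothetical, since the abstract does not state the cut-offs used.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def lagged_spearman(search, cases, lag):
    """Correlate search volume in month t with cases in month t + lag."""
    if lag == 0:
        return spearman(search, cases)
    return spearman(search[:-lag], cases[lag:])


def select_keywords(keyword_series, cases, max_lag=3, threshold=0.4):
    """Keep keywords whose best lagged correlation reaches the threshold.

    max_lag and threshold are assumed values, not taken from the study.
    Returns {keyword: (best_lag, correlation)}.
    """
    selected = {}
    for name, series in keyword_series.items():
        best_lag = max(range(max_lag + 1),
                       key=lambda k: lagged_spearman(series, cases, k))
        best_r = lagged_spearman(series, cases, best_lag)
        if best_r >= threshold:
            selected[name] = (best_lag, best_r)
    return selected
```

The keywords that survive such a filter would then be combined into a composite index like the CSI; how the study weights them is not described in this abstract.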
The average monthly incidence of scarlet fever was 4462.17 (SD 3011.75) cases, and annual incidence exhibited an upward trend until 2019. The keyword database contained 52 keywords, but only 6 highly relevant ones were selected for modeling. A high Spearman rank correlation was observed between the scarlet fever reported cases and the scarlet fever CSI (r=0.881). We developed the ARIMA(4,0,0)(0,1,2) model, and the ARIMA(4,0,0)(0,1,2) + CSI (Lag=0) and ARIMAX(1,0,2)(2,0,0) models were combined with the BSI. The 3 models had a good fit and passed the residuals Ljung-Box test. The ARIMA(4,0,0)(0,1,2), ARIMA(4,0,0)(0,1,2) + CSI (Lag=0), and ARIMAX(1,0,2)(2,0,0) models demonstrated favorable predictive capabilities, with mean absolute errors of 1692.16 (95% CI 584.88-2799.44), 1067.89 (95% CI 402.02-1733.76), and 639.75 (95% CI 188.12-1091.38), respectively; root mean squared errors of 2036.92 (95% CI 929.64-3144.20), 1224.92 (95% CI 559.04-1890.79), and 830.80 (95% CI 379.17-1282.43), respectively; and mean absolute percentage errors of 4.33% (95% CI 0.54%-8.13%), 3.36% (95% CI -0.24% to 6.96%), and 2.16% (95% CI -0.69% to 5.00%), respectively. The 2 models that incorporated search data outperformed the ARIMA model alone, with smaller values on all 3 error measures.
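For reference, the 3 error measures reported above can be computed as in the following minimal sketch, which uses the standard textbook definitions. The function names and the sample numbers are illustrative assumptions, not the study's data or code.

```python
import math


def mean_absolute_error(actual, predicted):
    """MAE: average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """RMSE: square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mean_absolute_percentage_error(actual, predicted):
    """MAPE in percent; every observed value must be nonzero."""
    ratios = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100 * ratios / len(actual)


# Made-up monthly case counts, for illustration only.
observed = [100, 200, 400]
forecast = [110, 180, 400]
print(mean_absolute_error(observed, forecast))
print(root_mean_squared_error(observed, forecast))
print(mean_absolute_percentage_error(observed, forecast))
```

Lower values on all 3 measures indicate forecasts closer to the observed counts. MAPE is scale-free, which makes it easier to compare across series with different case volumes.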
This study demonstrated that the BSI can be used for the early warning and prediction of scarlet fever, serving as a valuable supplement to traditional surveillance systems.