School of Public Health, Sun Yat-sen University, Guangzhou, China.
School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, China.
BMJ Open. 2020 Mar 24;10(3):e036098. doi: 10.1136/bmjopen-2019-036098.
Internet search engine data have been widely used to monitor and predict infectious diseases. Existing studies have found correlations between search data and HIV/AIDS epidemics. We aimed to extend the literature by exploring the feasibility of using search data to monitor and predict the number of newly diagnosed cases of HIV/AIDS, syphilis and gonorrhoea in China.
This paper used a vector autoregressive model to combine the number of newly diagnosed cases with the Baidu search index in order to predict monthly newly diagnosed cases of HIV/AIDS, syphilis and gonorrhoea in China. The procedure comprised: (1) keyword selection and filtering; (2) construction of a composite search index; (3) modelling with training data from January 2011 to October 2016 and calculating prediction performance with validation data from November 2016 to October 2017.
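As an illustration of the modelling step, the following is a minimal sketch of a bivariate vector autoregressive model fitted by ordinary least squares. It assumes a lag order of 1, which the abstract does not state, and the function names (`solve_linear`, `fit_var1`, `forecast_var1`) are hypothetical. It is not the authors' implementation; it only shows how monthly case counts and a composite search index can each be regressed on the previous month's values of both series.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_var1(cases, index):
    """OLS fit of a VAR(1) (assumed lag order): each series is regressed
    on an intercept and the previous month's value of both series."""
    X = [[1.0, cases[t - 1], index[t - 1]] for t in range(1, len(cases))]
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    coefs = []
    for target in (cases, index):
        y = target[1:]
        xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
        coefs.append(solve_linear(xtx, xty))
    # Row 0: equation for cases; row 1: equation for the search index.
    # Each row is [intercept, coefficient on cases, coefficient on index].
    return coefs


def forecast_var1(coefs, last_cases, last_index, steps):
    """Iterate the fitted equations forward to forecast both series."""
    out = []
    c, s = last_cases, last_index
    for _ in range(steps):
        c, s = (coefs[0][0] + coefs[0][1] * c + coefs[0][2] * s,
                coefs[1][0] + coefs[1][1] * c + coefs[1][2] * s)
        out.append((c, s))
    return out
```

Under this sketch, the model would be fitted on the training months only, and `forecast_var1` would then be called with the last training observation and `steps=12` to produce forecasts for the validation months.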
The analysis showed a close correlation between the monthly number of newly diagnosed cases and the composite search index (Spearman's rank correlation coefficients were 0.777 for HIV/AIDS, 0.590 for syphilis and 0.633 for gonorrhoea; p<0.05 for all). The R² values were all above 85% and the mean absolute percentage errors were all below 11%, indicating that the vector autoregressive model fitted the data well and predicted well in this setting.
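The three statistics reported above can be computed as in the following sketch. It uses the standard textbook definitions (Spearman's coefficient as the Pearson correlation of average ranks, mean absolute percentage error, and the coefficient of determination); the abstract does not say exactly how the authors computed them, so the formulas and the function names here are assumptions for illustration.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, fitted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot
```

In the setting described, `spearman` would be applied to the monthly case counts and the composite search index, `r_squared` to the fitted values on the training period, and `mape` to the forecasts on the validation period.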
Our study indicates that using Baidu search data to monitor and predict the number of newly diagnosed cases of HIV/AIDS, syphilis and gonorrhoea in China is potentially feasible.