University of California Institute for Prediction Technology, Department of Family Medicine, University of California Los Angeles, Los Angeles, California, United States of America.
Department of Systems Engineering and Engineering Management, City University of Hong Kong, Kowloon, Hong Kong.
PLoS One. 2018 Jul 12;13(7):e0199527. doi: 10.1371/journal.pone.0199527. eCollection 2018.
A large and growing body of "big data" is generated by internet search engines, such as Google. Because people often search for information about public health and medical issues, researchers may be able to use search engine data to monitor and predict public health problems, such as HIV. We sought to assess the feasibility of using Google search data to analyze and predict new HIV diagnoses in the United States.
From 2007 to 2014, we collected search volume data on HIV-related Google search keywords across the United States. State-level data on new HIV diagnoses were collected from the Centers for Disease Control and Prevention (CDC) and AIDSVu.org and combined with the Google search data. We developed a negative binomial model to predict HIV cases using a subset of significant predictor keywords identified by LASSO. We used historical data to train the model and predicted new HIV diagnoses from 2011 to 2014, obtaining an average R² of 0.99 between predicted and actual cases and an average root-mean-square error (RMSE) of 108.75.
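As a reading aid for the two accuracy figures reported above, the following minimal sketch shows how RMSE and R² can be computed from paired actual and predicted case counts. It is not the authors' code. The abstract does not say which definition of R² was used, so the coefficient of determination (1 - SS_res / SS_tot) is assumed here. The function names and the four sample values are hypothetical and are not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical state-level case counts, for illustration only.
actual_cases = [100, 200, 300, 400]
predicted_cases = [110, 190, 310, 390]

print("RMSE:", rmse(actual_cases, predicted_cases))
print("R^2:", r_squared(actual_cases, predicted_cases))
```

Because RMSE is expressed in the same units as the outcome (cases), a value such as 108.75 is best read relative to the typical size of state-level case counts, whereas R² is unitless.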
The results indicate that Google Trends is a feasible tool for predicting new cases of HIV at the state level. We discuss the implications of integrating visualization maps and tools based on these models into public health and HIV monitoring and surveillance.