Department of Statistics and Data Science, The University of Texas at Austin, Austin, Texas, United States of America.
athenaResearch, Watertown, Massachusetts, United States of America.
PLoS Comput Biol. 2018 Sep 4;14(9):e1006236. doi: 10.1371/journal.pcbi.1006236. eCollection 2018 Sep.
Forecasting the emergence and spread of influenza viruses is an important public health challenge. Timely and accurate estimates of influenza prevalence, particularly of severe cases requiring hospitalization, can improve control measures to reduce transmission and mortality. Here, we extend a previously published machine learning method for influenza forecasting to integrate multiple diverse data sources, including traditional surveillance data, electronic health records, internet search traffic, and social media activity. Our hierarchical framework uses multi-linear regression to combine forecasts from multiple data sources and greedy optimization with forward selection to sequentially choose the most predictive combinations of data sources. We show that the systematic integration of complementary data sources can substantially improve forecast accuracy over single data sources. When forecasting the Centers for Disease Control and Prevention (CDC) influenza-like-illness reports (ILINet) from week 48 through week 20, the optimal combination of predictors includes public health surveillance data and commercially available electronic medical records, but neither search engine nor social media data.
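The combination step described above (a linear regression over per-source forecasts, with greedy forward selection of sources) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hold-out RMSE criterion, the stopping rule, and the synthetic series below are all assumptions made for illustration, and the source names are placeholders rather than the study's actual data.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least-squares fit with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply coefficients from fit_linear to a new design matrix."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    return A @ coef


def forward_select(forecasts, target, n_train):
    """Greedy forward selection of forecast sources (hypothetical helper).

    forecasts: dict mapping source name -> 1-D array of per-week forecasts
    target:    1-D array of observed values (e.g. an ILI-like series)
    n_train:   number of leading weeks used for fitting; the rest is held out

    At each step, add the source that most reduces hold-out RMSE of the
    combined regression; stop when no remaining source improves it.
    """
    selected, best_err = [], np.inf
    remaining = list(forecasts)
    while remaining:
        trial_errs = {}
        for name in remaining:
            cols = selected + [name]
            X = np.column_stack([forecasts[c] for c in cols])
            coef = fit_linear(X[:n_train], target[:n_train])
            pred = predict_linear(coef, X[n_train:])
            trial_errs[name] = np.sqrt(np.mean((pred - target[n_train:]) ** 2))
        best_name = min(trial_errs, key=trial_errs.get)
        if trial_errs[best_name] >= best_err:
            break  # no candidate improves the hold-out error
        best_err = trial_errs[best_name]
        selected.append(best_name)
        remaining.remove(best_name)
    return selected, best_err


# Synthetic demonstration only; these series are invented, not real surveillance data.
rng = np.random.default_rng(0)
t = np.linspace(0, 8 * np.pi, 200)
target = 2.0 + np.sin(t)
forecasts = {
    "surveillance": target + rng.normal(0, 0.2, t.size),  # informative
    "ehr": target + rng.normal(0, 0.3, t.size),           # informative
    "search": rng.normal(0, 1.0, t.size),                 # pure noise
    "social": rng.normal(0, 1.0, t.size),                 # pure noise
}
selected, best_err = forward_select(forecasts, target, n_train=150)
print("selected sources:", selected)
print("hold-out RMSE:", round(float(best_err), 3))
```

In this toy setup the two informative series carry independent noise, so combining them can lower the hold-out error below what either achieves alone, while the noise-only series usually add nothing. That mirrors, in spirit only, the abstract's finding that complementary sources help and uninformative ones are left out.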