Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America.
University of Colorado Boulder, Boulder, Colorado, United States of America.
PLoS Comput Biol. 2019 Oct 1;15(10):e1007165. doi: 10.1371/journal.pcbi.1007165. eCollection 2019 Oct.
Seasonal influenza is a disease with a sometimes surprisingly large impact, causing thousands of deaths per year along with much additional morbidity. Timely knowledge of the outbreak state is valuable for managing an effective response. The current state of the art is to gather this knowledge using in-person patient contact. While accurate, this approach is time-consuming and expensive. This cost has motivated inquiry into new approaches using internet activity traces, based on the theory that lay observations of health status lead to informative features in internet data. These approaches risk being deceived by activity traces that have a coincidental, rather than informative, relationship to disease incidence; to our knowledge, this risk has not yet been quantitatively explored. We evaluated both simulated and real activity traces of varying deceptiveness for influenza incidence estimation using linear regression. We found that knowledge of deceptiveness does reduce error in such estimates, that it may help automatically selected features perform as well as or better than features that require human curation, and that a semantic distance measure derived from the Wikipedia article category tree serves as a useful proxy for deceptiveness. This suggests that disease incidence estimation models should incorporate not only data about how internet features map to incidence but also additional data to estimate feature deceptiveness. By doing so, we may move one step further along the path to accurate, reliable disease incidence estimation using internet data. This capability would improve public health by decreasing the cost and increasing the timeliness of such estimates.
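The idea of a deceptive activity trace can be illustrated with a small simulation. The sketch below is a toy, not the procedure used in the study. It assumes two 52-week seasons, a sinusoidal incidence curve, and a known per-feature deceptiveness score in [0, 1]. All names (`simulate_traces`, `fit_ols`, `compare_models`), the noise levels, and the thresholding rule are hypothetical choices made for illustration. Deceptive traces track incidence during the first (training) season only by coincidence. Ordinary least squares is then fitted once with every trace and once with only the traces whose deceptiveness falls below a threshold.

```python
import math
import random


def simulate_traces(n_weeks=104, n_informative=3, n_deceptive=3, seed=0):
    """Toy data: weekly incidence plus informative and deceptive traces."""
    rng = random.Random(seed)
    rows, incidence = [], []
    for week in range(n_weeks):
        season = math.sin(2 * math.pi * week / 52)
        y = 2.0 + season  # toy incidence, arbitrary units
        # Deceptive traces match incidence in season 1 only; in season 2
        # they follow an unrelated, opposite-phase pattern (assumption).
        driver = y if week < 52 else 2.0 - season
        row = [y + rng.gauss(0.0, 0.3) for _ in range(n_informative)]
        row += [driver + rng.gauss(0.0, 0.1) for _ in range(n_deceptive)]
        rows.append(row)
        incidence.append(y)
    # Hypothetical deceptiveness scores, assumed known here.
    deceptiveness = [0.0] * n_informative + [1.0] * n_deceptive
    return rows, incidence, deceptiveness


def fit_ols(rows, targets):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           + [sum(d[i] * t for d, t in zip(design, targets))]
           for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


def predict(coef, rows):
    return [coef[0] + sum(c * x for c, x in zip(coef[1:], r)) for r in rows]


def rmse(estimates, truth):
    squared = sum((e - t) ** 2 for e, t in zip(estimates, truth))
    return math.sqrt(squared / len(truth))


def compare_models(threshold=0.5, seed=0):
    """Train on season 1, test on season 2, with and without filtering."""
    rows, incidence, deceptiveness = simulate_traces(seed=seed)
    feature_sets = {
        "all_features": list(range(len(deceptiveness))),
        "filtered": [j for j, d in enumerate(deceptiveness) if d < threshold],
    }
    errors = {}
    for name, cols in feature_sets.items():
        sub = [[r[j] for j in cols] for r in rows]
        coef = fit_ols(sub[:52], incidence[:52])
        errors[name] = rmse(predict(coef, sub[52:]), incidence[52:])
    return errors


if __name__ == "__main__":
    for name, err in compare_models().items():
        print(f"{name}: test RMSE = {err:.3f}")
```

In this toy setup the unfiltered model leans heavily on the deceptive traces, because they are the cleanest predictors during training, and its error grows once their coincidental relationship breaks in the second season. A hard threshold is only one way to use deceptiveness information; penalizing features in proportion to their estimated deceptiveness would be another.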