Priedhorsky Reid, Osthus Dave, Daughton Ashlynn R, Moran Kelly R, Generous Nicholas, Fairchild Geoffrey, Deshpande Alina, Del Valle Sara Y
High Performance Computing (HPC) Division.
Computer, Computational, and Statistical Sciences (CCS) Division.
CSCW Conf Comput Support Coop Work. 2017 Feb-Mar;2017:1812-1834. doi: 10.1145/2998181.2998183.
Effective disease monitoring provides a foundation for effective public health systems. This has historically been accomplished with patient contact and bureaucratic aggregation, which tends to be slow and expensive. Recent internet-based approaches promise to be real-time and cheap, with few parameters. However, the question of whether these approaches work remains open. We addressed this question using Wikipedia access logs and category links. Our experiments, replicable and extensible using our open source code and data, test the effect of semantic article filtering, amount of training data, forecast horizon, and model staleness by comparing across 6 diseases and 4 countries using thousands of individual models. We found that our minimal-configuration, language-agnostic article selection process based on semantic relatedness is effective for improving predictions, and that our approach is relatively insensitive to the amount and age of training data. We also found, in contrast to prior work, very little forecasting value, and we argue that this is consistent with theoretical considerations about the nature of forecasting. These mixed results lead us to propose that the currently observational field of internet-based disease surveillance must pivot to include theoretical models of information flow as well as controlled experiments based on simulations of disease.