Harvard University, School of Engineering and Applied Sciences, Cambridge, Massachusetts, USA.
Li Ka Shing Knowledge Institute, St. Michael's Hospital, Toronto, Ontario, Canada.
J Am Med Inform Assoc. 2019 Nov 1;26(11):1355-1359. doi: 10.1093/jamia/ocz112.
We assessed whether machine learning can be used to efficiently extract infectious disease activity information from online media reports.
We curated a data set of labeled media reports (n = 8322), each labeled according to whether the article contains an update about disease activity, and trained a classifier on this data set. To validate our system, we evaluated the classifier on a held-out test set and compared the articles it identified with the World Health Organization (WHO) Disease Outbreak News reports.
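The held-out validation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `held_out_split`, the 20% test fraction, the fixed seed, and the toy records are all assumptions, since the abstract does not state how the split was made.

```python
import random


def held_out_split(records, test_fraction=0.2, seed=0):
    """Shuffle labeled records and reserve a fraction as a held-out test set.

    The 20% default and the seed are illustrative assumptions, not values
    reported in the abstract.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled records are never shown to the classifier.
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for labeled media reports:
# (text, contains_disease_activity_update)
reports = [(f"report {i}", i % 3 == 0) for i in range(10)]
train, test = held_out_split(reports)
```

Keeping the test records out of training is what lets the reported recall and precision be read as estimates of performance on unseen articles.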
Our classifier achieved a recall of 88.8% and a precision of 86.1%. The overall surveillance system detected 94% of the WHO-identified outbreaks that were covered by online media (89%), and did so 43.4 days earlier on average (IQR: 9.5-61).
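For readers unfamiliar with the metrics, the sketch below shows how recall, precision, and a detection lead time of this kind are conventionally computed. The helper names `precision_recall` and `lead_days` and all example values are hypothetical; the toy data are not the study's data, and the abstract does not describe the authors' exact computation.

```python
from datetime import date


def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels (truthy = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)      # true positives
    fp = sum(1 for t, p in pairs if not t and p)  # false positives
    fn = sum(1 for t, p in pairs if t and not p)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def lead_days(media_date, who_date):
    """Days by which a media-based detection preceded the WHO report."""
    return (who_date - media_date).days


# Toy labels: 1 = article contains a disease activity update.
y_true = [1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0]
precision, recall = precision_recall(y_true, y_pred)

# Hypothetical dates for a single outbreak.
lead = lead_days(date(2018, 1, 1), date(2018, 1, 11))
```

Recall measures how many true disease activity updates the classifier catches, while precision measures how many flagged articles are genuine updates; a positive lead time means the media-based system flagged the outbreak before the WHO report appeared.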
We constructed a global real-time disease activity database that surveils 114 illnesses and syndromes. The system still needs further assessment for bias, representativeness, granularity, and accuracy.
Machine learning, natural language processing, and human expertise can be used to efficiently identify disease activity from digital media reports.