Robert Koch Institute (RKI), Berlin, Germany.
Osnabrück University, Osnabrück, Lower Saxony, Germany.
PLoS Comput Biol. 2020 Nov 20;16(11):e1008277. doi: 10.1371/journal.pcbi.1008277. eCollection 2020 Nov.
According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of public health agents sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural language processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at the RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles' key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps. In the first step, EpiTator, an open-source epidemiological annotation tool, suggested many candidate entities for each of these key data. In the second step, we selected the key entity among the candidates: we extracted the key country and disease using a heuristic, with good results, and we trained a naive Bayes classifier to find the key date and confirmed-case count, using the RKI's EBS database as labels; this classifier performed modestly. Then, for relevance scoring, we defined two classes to which any article might belong: an article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers using bag-of-words, document embeddings, and word embeddings. The best classifier, a logistic regression, achieved a sensitivity of 0.82 and an index balanced accuracy of 0.61. Finally, we integrated these functionalities into a web application called EventEpi, where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, which will be analyzed in the same way and added to the database.
Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, already works well and can be used in production, promising improvements in EBS. The source code and data are publicly available under open licenses.
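
To make the two reported metrics concrete, the following is a minimal sketch of how sensitivity and index balanced accuracy (IBA) can be computed from a binary confusion matrix. It assumes the common definition IBA = (1 + alpha * (TPR - TNR)) * TPR * TNR with alpha = 0.1; the abstract does not state which alpha or which variant was used, so both are assumptions. The function names and the confusion counts are hypothetical and chosen only for illustration; they are not the study's data.

```python
from math import isclose


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of relevant articles that were found."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of irrelevant articles that were rejected."""
    return tn / (tn + fp)


def index_balanced_accuracy(tp: int, fn: int, tn: int, fp: int,
                            alpha: float = 0.1) -> float:
    """IBA under the assumed definition (alpha = 0.1 is an assumption).

    The squared geometric mean (TPR * TNR) is weighted by the dominance
    (TPR - TNR), which rewards classifiers that favor the positive class.
    """
    tpr = sensitivity(tp, fn)
    tnr = specificity(tn, fp)
    dominance = tpr - tnr
    return (1 + alpha * dominance) * tpr * tnr


# Invented toy counts, NOT taken from the paper:
# 50 relevant articles (40 found), 150 irrelevant articles (90 rejected).
tp, fn, tn, fp = 40, 10, 90, 60

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"IBA         = {index_balanced_accuracy(tp, fn, tn, fp):.4f}")
```

Because relevant articles are much rarer than irrelevant ones in media screening, a metric such as IBA is more informative than plain accuracy, which a classifier could maximize by labeling every article irrelevant.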