Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, United States of America.
Department of Pathobiological Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, United States of America.
PLoS One. 2020 Feb 5;15(2):e0228105. doi: 10.1371/journal.pone.0228105. eCollection 2020.
The use of natural language data for animal population surveillance represents a valuable opportunity to gather information about potential disease outbreaks, emerging zoonotic diseases, or bioterrorism threats. In this study, we evaluate machine learning methods for conducting syndromic surveillance using free-text veterinary necropsy reports. We train a system to detect whether a necropsy report from the Wisconsin Veterinary Diagnostic Laboratory contains evidence of gastrointestinal, respiratory, or urinary pathology. We evaluate the performance of several machine learning algorithms, including deep learning with a long short-term memory network. Although no single algorithm was superior, random forest using feature vectors of TF-IDF statistics ranked among the top-performing models, with F1 scores of 0.923 (gastrointestinal), 0.960 (respiratory), and 0.888 (urinary). This model was applied to over 33,000 necropsy reports and was used to describe temporal and spatial features of diseases over a 14-year period, exposing epidemiological trends and detecting a potential focus of gastrointestinal disease from a single submitting producer in the fall of 2016.
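To make the two quantities named in the abstract concrete, the following is a minimal sketch of how TF-IDF feature vectors and an F1 score can be computed. It is not the authors' pipeline: the whitespace tokenizer, the smoothed IDF variant, the function names, and the example report snippets are all assumptions made for illustration, and the random forest classifier itself is omitted.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocab = sorted(doc_freq)
    # Assumption: a smoothed IDF variant, ln((1 + N) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = pathology present, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical report snippets, not taken from the study's data.
    reports = ["lung congestion pneumonia", "kidney lesion", "lung edema"]
    vocab, vectors = tfidf_vectors(reports)
    for report, vec in zip(reports, vectors):
        print(report, [round(w, 3) for w in vec])
```

In a setup like the one the abstract describes, vectors of this kind would be passed to a classifier trained separately for each syndrome, and the F1 score would be computed on held-out labeled reports.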