Rabatel Julien, Arsevska Elena, Roche Mathieu
Cirad, Montpellier, France.
ASTRE, Cirad, INRA, Montpellier, France.
Data Brief. 2018 Dec 23;22:643-646. doi: 10.1016/j.dib.2018.12.063. eCollection 2019 Feb.
Monitoring animal health worldwide, especially the early detection of outbreaks of emerging pathogens, is one of the means of preventing the introduction of infectious diseases into countries (Collier et al., 2008) [3]. In this context, we developed PADI-web, a Platform for Automated extraction of animal Disease Information from the Web (Arsevska et al., 2016, 2018). PADI-web is a text-mining tool that automatically detects, categorizes and extracts disease outbreak information from Web news articles. It currently monitors the Web for five emerging animal infectious diseases: African swine fever, avian influenza (both highly pathogenic and low pathogenic), foot-and-mouth disease, bluetongue, and Schmallenberg virus infection. PADI-web collects Web news articles in near-real time through RSS feeds, currently from Google News, chosen for its international and multilingual coverage. We implemented machine learning techniques to identify the relevant disease information in texts (i.e., the location and date of an outbreak, the affected hosts, their numbers, and clinical signs). To train the model for Information Extraction (IE) from news articles, domain experts manually labeled a corpus in English. This labeled corpus (Rabatel et al., 2017) is presented in this data paper.
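The collection step described above (reading news items from an RSS feed and keeping those that mention a monitored disease) can be illustrated with a minimal sketch. This is not PADI-web's actual implementation: the function names `match_diseases` and `filter_feed`, the keyword variants, and the sample feed are all hypothetical, and simple keyword matching stands in for the machine learning classification and extraction the abstract refers to.

```python
import xml.etree.ElementTree as ET

# The five diseases named in the abstract. The keyword variants are
# illustrative assumptions, not PADI-web's real query terms.
DISEASE_KEYWORDS = {
    "African swine fever": ["african swine fever"],
    "avian influenza": ["avian influenza", "bird flu"],
    "foot-and-mouth disease": ["foot-and-mouth disease", "foot and mouth disease"],
    "bluetongue": ["bluetongue"],
    "Schmallenberg virus infection": ["schmallenberg"],
}


def match_diseases(text):
    """Return the monitored diseases whose keywords occur in the text."""
    lowered = text.lower()
    return [
        disease
        for disease, keywords in DISEASE_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def filter_feed(rss_xml):
    """Parse an RSS 2.0 document and keep items mentioning a monitored disease."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        diseases = match_diseases(title + " " + description)
        if diseases:
            hits.append(
                {
                    "title": title,
                    "link": item.findtext("link", default=""),
                    "diseases": diseases,
                }
            )
    return hits


# Made-up feed content, used only to exercise the functions offline.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>African swine fever confirmed on pig farm</title>
<link>https://example.org/a</link><description>Made-up item.</description></item>
<item><title>Local football results</title>
<link>https://example.org/b</link><description>Unrelated.</description></item>
<item><title>Bird flu found in wild ducks</title>
<link>https://example.org/c</link><description>Made-up avian influenza item.</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for hit in filter_feed(SAMPLE_RSS):
        print(hit["diseases"], hit["title"], hit["link"])
```

In a real pipeline the feed text would be fetched over HTTP, and the retained articles would then go on to classification and information extraction; both steps are outside the scope of this sketch.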