Pafilis Evangelos, Frankild Sune P, Fanini Lucia, Faulwetter Sarah, Pavloudi Christina, Vasileiadou Aikaterini, Arvanitidis Christos, Jensen Lars Juhl
Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Greece.
PLoS One. 2013 Jun 18;8(6):e65390. doi: 10.1371/journal.pone.0065390. Print 2013.
The exponential growth of the biomedical literature is making the need for efficient, accurate text-mining tools increasingly clear. The identification of named biological entities in text is a central and difficult task. We have developed an efficient algorithm and implementation of a dictionary-based approach to named entity recognition, which we use here to identify names of species and other taxa in text. The tool, SPECIES, is more than an order of magnitude faster than existing tools and as accurate. Precision and recall were assessed both on an existing gold-standard corpus and on a new corpus of 800 abstracts, which were manually annotated after the development of the tool. The new corpus comprises abstracts from journals selected to represent many taxonomic groups, which gives insight into which types of organism names are hard to detect and which are easy. Finally, we have tagged organism names in the entire Medline database and developed a web resource, ORGANISMS, that makes the results accessible to the broad community of biologists. The SPECIES software is open source and can be downloaded from http://species.jensenlab.org along with dictionary files and the manually annotated gold-standard corpus. The ORGANISMS web resource can be found at http://organisms.jensenlab.org.
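To make the idea of dictionary-based named entity recognition concrete, the following is a minimal sketch of longest-match dictionary tagging followed by a precision and recall calculation over exact spans. It does not reproduce the SPECIES algorithm or its dictionary format: the toy `DICTIONARY`, the functions `tag` and `precision_recall`, and the exact-span matching criterion are all assumptions made for illustration.

```python
import re

# Hypothetical toy dictionary (an assumption, not the SPECIES dictionary):
# lowercase name -> NCBI Taxonomy identifier.
DICTIONARY = {
    "escherichia coli": 562,
    "e. coli": 562,
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
}

WORD = re.compile(r"\w+")

# Longest dictionary name, measured in word tokens.
MAX_WORDS = max(len(WORD.findall(name)) for name in DICTIONARY)


def tag(text):
    """Return (start, end, taxon_id) tuples, taking the longest match at each word."""
    tokens = [(m.start(), m.end()) for m in WORD.finditer(text)]
    matches = []
    i = 0
    while i < len(tokens):
        step = 1
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            start, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = " ".join(text[start:end].lower().split())
            if candidate in DICTIONARY:
                matches.append((start, end, DICTIONARY[candidate]))
                step = n  # skip past the matched words
                break
        i += step
    return matches


def precision_recall(predicted, gold):
    """Precision and recall over exact (start, end, taxon_id) matches."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


text = "Human isolates of E. coli were compared with Escherichia coli K-12."
found = tag(text)
print([text[s:e] for s, e, _ in found])
```

A real tagger would additionally need to handle abbreviations, ambiguous names, and case sensitivity; the sketch only shows the basic lookup and how predicted spans can be scored against a manually annotated corpus.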