Faculty of Life Sciences, University of Manchester, Manchester, M13 9PT, UK.
BMC Bioinformatics. 2010 Feb 11;11:85. doi: 10.1186/1471-2105-11-85.
The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles.
In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance against several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS achieves 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. The system also resolves the large majority of ambiguous species mentions: 97% of all mentions in PubMed Central full-text documents are resolved to unambiguous NCBI taxonomy identifiers.
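To make the dictionary-based approach concrete, the following is a minimal sketch and not the LINNAEUS implementation. It assumes a character trie as a simple stand-in for the deterministic finite-state automaton, longest-match scanning at word boundaries, and a single illustrative heuristic (prefer a candidate species that is mentioned unambiguously elsewhere in the same document). The function names, the toy dictionary, and the placeholder identifier 999999 are all hypothetical; the other identifiers are NCBI Taxonomy IDs.

```python
from typing import Dict, List, Tuple

END = "$ids"  # marks an accepting state; cannot collide with one-character keys

# Toy dictionary (assumption): lowercase name -> candidate taxonomy IDs.
# 999999 is a made-up placeholder standing in for a second species that
# shares the abbreviation "C. elegans".
TOY_DICTIONARY: Dict[str, List[int]] = {
    "caenorhabditis elegans": [6239],
    "c. elegans": [6239, 999999],
    "homo sapiens": [9606],
    "human": [9606],
}

Mention = Tuple[int, int, List[int]]  # (start offset, end offset, candidate IDs)


def build_trie(dictionary: Dict[str, List[int]]) -> dict:
    """Build a character trie; each node is a state, each key a transition."""
    root: dict = {}
    for name, ids in dictionary.items():
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = list(ids)
    return root


def find_mentions(text: str, trie: dict) -> List[Mention]:
    """Scan left to right, keeping the longest dictionary match at each start.

    Assumes lowercasing does not change the text length (true for ASCII).
    """
    lowered = text.lower()
    n = len(lowered)
    mentions: List[Mention] = []
    i = 0
    while i < n:
        # Only start a match at a word boundary.
        if i > 0 and lowered[i - 1].isalnum():
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept only if the match also ends at a word boundary.
            if END in node and (j == n or not lowered[j].isalnum()):
                best = (j, node[END])
        if best is not None:
            end, ids = best
            mentions.append((i, end, ids))
            i = end
        else:
            i += 1
    return mentions


def resolve(mentions: List[Mention]) -> List[Mention]:
    """Illustrative heuristic: narrow an ambiguous mention to the one
    candidate that also occurs unambiguously in the same document."""
    seen = {ids[0] for _, _, ids in mentions if len(ids) == 1}
    resolved: List[Mention] = []
    for start, end, ids in mentions:
        if len(ids) > 1:
            narrowed = [t for t in ids if t in seen]
            if len(narrowed) == 1:
                ids = narrowed
        resolved.append((start, end, ids))
    return resolved
```

Calling `resolve(find_mentions(text, build_trie(TOY_DICTIONARY)))` on a document returns character offsets together with the remaining candidate identifiers; a mention that the heuristic cannot narrow keeps its full candidate list, so it stays visibly ambiguous.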
LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.