OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents.

Author information

Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada.

Publication information

Bioinformatics. 2011 Oct 1;27(19):2721-9. doi: 10.1093/bioinformatics/btr452. Epub 2011 Aug 9.

DOI:10.1093/bioinformatics/btr452

PMID:21828087

Abstract

MOTIVATION

Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and if possible link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation.

RESULTS

We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%.
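The detect, normalize and ground sequence described above can be illustrated with a minimal Python sketch. This is not the OrganismTagger's implementation: the tiny hand-written lexicon, the regular expressions and the function name `tag_organisms` are all assumptions made for illustration, and the abbreviation step simply looks up the lexicon rather than using document context. The NCBI Taxonomy IDs shown are the real identifiers for the three species listed.

```python
import re

# Hypothetical toy lexicon: canonical scientific name -> NCBI Taxonomy ID.
# A real system would generate this from a copy of the NCBI Taxonomy database.
TAXONOMY_IDS = {
    "Escherichia coli": 562,
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
}

# Hypothetical common-name table mapping to canonical names.
COMMON_NAMES = {"human": "Homo sapiens", "mouse": "Mus musculus"}

FULL_NAME = re.compile(r"\b[A-Z][a-z]+ [a-z]+\b")          # e.g. "Escherichia coli"
ABBREVIATED = re.compile(r"\b([A-Z])\. ([a-z]+)\b")        # e.g. "E. coli"
COMMON = re.compile(r"\b(" + "|".join(COMMON_NAMES) + r")s?\b", re.IGNORECASE)


def tag_organisms(text):
    """Return organism mentions with canonical name and taxonomy ID."""
    mentions = []  # tuples of (start offset, surface form, canonical name)

    # Detection of full binomial names by lexicon lookup.
    for m in FULL_NAME.finditer(text):
        if m.group(0) in TAXONOMY_IDS:
            mentions.append((m.start(), m.group(0), m.group(0)))

    # Normalization of abbreviated names: expand the genus initial,
    # accepting the expansion only when it is unambiguous in the lexicon.
    for m in ABBREVIATED.finditer(text):
        initial, species = m.groups()
        candidates = [
            name for name in TAXONOMY_IDS
            if name.startswith(initial) and name.split()[1] == species
        ]
        if len(candidates) == 1:
            mentions.append((m.start(), m.group(0), candidates[0]))

    # Normalization of common names to scientific names.
    for m in COMMON.finditer(text):
        mentions.append((m.start(), m.group(0), COMMON_NAMES[m.group(1).lower()]))

    # Grounding: attach the taxonomy ID of the canonical name.
    mentions.sort()
    return [
        {"mention": surface, "canonical": canonical, "taxid": TAXONOMY_IDS[canonical]}
        for _, surface, canonical in mentions
    ]


if __name__ == "__main__":
    sample = ("Human cells were infected with Escherichia coli. "
              "E. coli counts in mouse tissue were lower.")
    for entry in tag_organisms(sample):
        print(entry)
```

Keeping the canonical name separate from the surface form mirrors the distinction the abstract draws between normalization (choosing a canonical name) and grounding (attaching a database ID). Strain detection, which the abstract says relies on machine learning, is outside the scope of this sketch.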

AVAILABILITY

The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at http://www.semanticsoftware.info/organism-tagger.

CONTACT

witte@semanticsoftware.info.
