OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents.

Affiliation

Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada.

Publication information

Bioinformatics. 2011 Oct 1;27(19):2721-9. doi: 10.1093/bioinformatics/btr452. Epub 2011 Aug 9.

Abstract

MOTIVATION

Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and, if possible, link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation.

RESULTS

We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%.
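The normalization and grounding steps described above can be illustrated with a minimal sketch. This is not the OrganismTagger implementation: the tiny `LEXICON`, the `MENTION` pattern and the `tag_organisms` function are hypothetical names introduced here, and the sketch assumes that a genus abbreviation such as "E." can be resolved only against a genus already seen in full earlier in the same text. The real system generates its lexical resources from a copy of the NCBI Taxonomy database and also handles common names, acronyms and strains, none of which is covered here.

```python
import re

# Hypothetical mini-lexicon: scientific name -> NCBI Taxonomy ID.
# A real system would generate this from a copy of the NCBI Taxonomy database.
LEXICON = {
    "Escherichia coli": 562,
    "Homo sapiens": 9606,
    "Saccharomyces cerevisiae": 4932,
}

# Candidate mentions: a full binomial ("Escherichia coli") or an
# abbreviated genus followed by a species epithet ("E. coli").
MENTION = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s+([a-z]{3,})\b")


def tag_organisms(text):
    """Return (mention, canonical name, taxonomy ID) for each grounded mention."""
    results = []
    seen_genera = {}  # initial letter -> set of genera seen in full so far
    for match in MENTION.finditer(text):
        genus, species = match.group(1), match.group(2)
        if genus.endswith("."):
            # Normalization: expand the abbreviation using genera seen earlier.
            candidates = [
                g for g in seen_genera.get(genus[0], set())
                if f"{g} {species}" in LEXICON
            ]
            if len(candidates) != 1:
                continue  # unresolved or ambiguous abbreviation: skip it
            genus = candidates[0]
        name = f"{genus} {species}"
        if name in LEXICON:
            # Grounding: attach the taxonomy ID of the canonical name.
            seen_genera.setdefault(genus[0], set()).add(genus)
            results.append((match.group(0), name, LEXICON[name]))
    return results
```

Calling `tag_organisms` on a text that first mentions "Escherichia coli" and later "E. coli" would ground both mentions to the same ID, while an abbreviation whose genus never appears in full is left untagged. Skipping unresolved abbreviations favors precision over recall, which is a design choice of this sketch rather than a documented property of the system.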

AVAILABILITY

The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at http://www.semanticsoftware.info/organism-tagger.

CONTACT

witte@semanticsoftware.info.
