OrganismTagger: detection, normalization and grounding of organism entities in biomedical documents.

Affiliation

Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada.

Publication information

Bioinformatics. 2011 Oct 1;27(19):2721-9. doi: 10.1093/bioinformatics/btr452. Epub 2011 Aug 9.

DOI: 10.1093/bioinformatics/btr452
PMID: 21828087

Abstract

MOTIVATION

Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name and if possible link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation.

RESULTS

We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%.
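
The normalization and grounding steps described above can be illustrated with a minimal sketch. This is not the OrganismTagger implementation: the tiny lexicon, the common-name table, and the function names `normalize` and `ground` are assumptions made for illustration, and the real system derives its resources from a copy of the NCBI Taxonomy database. The sketch expands a genus abbreviation such as "E. coli" to a canonical scientific name, then looks up the corresponding NCBI Taxonomy ID.

```python
import re

# Toy lexicon (hypothetical stand-in for resources generated from NCBI Taxonomy):
# canonical scientific name -> NCBI Taxonomy ID.
LEXICON = {
    "Escherichia coli": 562,
    "Homo sapiens": 9606,
    "Saccharomyces cerevisiae": 4932,
}

# Illustrative common-name table; the entries are assumptions for this example.
COMMON_NAMES = {
    "human": "Homo sapiens",
    "baker's yeast": "Saccharomyces cerevisiae",
}

# Matches abbreviated binomials such as "E. coli" or "S.cerevisiae".
ABBREVIATION = re.compile(r"^([A-Z])\.\s*([a-z]+)$")


def normalize(mention, context_names=()):
    """Map a detected mention to a canonical scientific name, or None."""
    mention = mention.strip()
    if mention in LEXICON:
        return mention
    common = COMMON_NAMES.get(mention.lower())
    if common is not None:
        return common
    match = ABBREVIATION.match(mention)
    if match:
        initial, species = match.groups()
        candidates = [
            name for name in LEXICON
            if name.startswith(initial) and name.split()[1] == species
        ]
        # Prefer a candidate whose full name already appeared in the document.
        for name in candidates:
            if name in context_names:
                return name
        # Otherwise accept the expansion only if it is unambiguous.
        if len(candidates) == 1:
            return candidates[0]
    return None


def ground(mention, context_names=()):
    """Return (canonical name, taxonomy ID) for a mention, or None."""
    name = normalize(mention, context_names)
    if name is None:
        return None
    return name, LEXICON[name]
```

For example, `ground("E. coli")` yields the pair of "Escherichia coli" and its taxonomy ID, while a mention that matches nothing in the lexicon returns `None` rather than a guess. The `context_names` argument shows one possible way to break ties between genera sharing an initial; how the actual system resolves such cases is not stated in the abstract.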

AVAILABILITY

The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end user and developer documentation, is freely available under an open-source license at http://www.semanticsoftware.info/organism-tagger.

CONTACT

witte@semanticsoftware.info.
