NTTMUNSW BioC modules for recognizing and normalizing species and gene/protein mentions.

Authors

Dai Hong-Jie, Singh Onkar, Jonnagaddala Jitendra, Su Emily Chia-Yu

Affiliations

Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan

Interdisciplinary Program of Green and Information Technology, National Taitung University, Taitung, Taiwan

Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan.

Publication information

Database (Oxford). 2016 Jul 27;2016. doi: 10.1093/database/baw111. Print 2016.

DOI:10.1093/database/baw111

PMID:27465130

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC4962763/

Abstract

In recent years, the number of published biomedical articles has increased as researchers have focused on biological domains to investigate the functions of biological objects, such as genes and proteins. However, the ambiguous nature of genes and their products has rendered the literature more complex for readers and curators of molecular interaction databases. To address this challenge, a normalization technique that can link variants of biological objects to a single, standardized form was applied. In this work, we developed a species normalization module, which recognizes species names and normalizes them to NCBI Taxonomy IDs. Unlike most previous work, which ignored the prefix of a gene name that abbreviates the name of the species to which the gene belongs, the recognition results of our module include the prefixed species. The developed species normalization module achieved an overall F-score of 0.954 on an instance-level species normalization corpus. For gene normalization, two separate modules were employed: one recognizes gene mentions, and the other normalizes those mentions to their Entrez Gene IDs using a multistage normalization algorithm developed for processing full-text articles. All of the developed modules are BioC-compatible .NET framework libraries and are publicly available from the NuGet gallery.

Database URL: https://sites.google.com/site/hjdairesearch/Projects/isn-corpus
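To make the idea of prefixed species concrete, the following is a minimal Python sketch of dictionary-based species normalization that also treats a lowercase prefix on a gene symbol (such as the "h" in "hZIP2") as a species mention. It is not the NTTMUNSW implementation, which is distributed as .NET libraries; the lexicons, the regular expressions, and the function name `normalize_species` are assumptions made for illustration, and only the NCBI Taxonomy IDs themselves (9606 for human, 10090 for mouse, 10116 for rat) are real.

```python
import re

# Illustrative lexicon (an assumption, far smaller than a real one):
# lowercase surface form -> NCBI Taxonomy ID.
SPECIES_LEXICON = {
    "human": "9606",
    "homo sapiens": "9606",
    "mouse": "10090",
    "mus musculus": "10090",
    "rat": "10116",
    "rattus norvegicus": "10116",
}

# Hypothetical prefix table: one-letter species abbreviation -> Taxonomy ID.
PREFIX_LEXICON = {"h": "9606", "m": "10090", "r": "10116"}

# Longest names first so multi-word names win over shorter alternatives.
_NAME_RE = re.compile(
    r"\b("
    + "|".join(sorted(map(re.escape, SPECIES_LEXICON), key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# Heuristic (an assumption): a lowercase h/m/r directly followed by an
# uppercase letter in a token that also contains a digit, e.g. "hZIP2".
# Requiring a digit avoids tagging names such as "mTOR".
_PREFIX_RE = re.compile(r"\b([hmr])(?=[A-Z][A-Za-z0-9]*\d)")


def normalize_species(text):
    """Return (start, end, surface, taxonomy_id) tuples sorted by offset."""
    mentions = []
    # Full species names found by dictionary lookup.
    for m in _NAME_RE.finditer(text):
        tax_id = SPECIES_LEXICON[m.group(0).lower()]
        mentions.append((m.start(), m.end(), m.group(0), tax_id))
    # Species prefixes attached to gene symbols.
    for m in _PREFIX_RE.finditer(text):
        tax_id = PREFIX_LEXICON[m.group(1)]
        mentions.append((m.start(1), m.end(1), m.group(1), tax_id))
    return sorted(mentions)


if __name__ == "__main__":
    for mention in normalize_species("Expression of hZIP2 in mouse cells"):
        print(mention)
```

Each mention is reported with its character offsets, which corresponds to the instance-level evaluation mentioned in the abstract: every occurrence is normalized on its own rather than once per document. A realistic system would need a much larger lexicon and contextual disambiguation, which this sketch leaves out.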
