
Integrating various resources for gene name normalization.

Affiliation

School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning, China.

Publication information

PLoS One. 2012;7(9):e43558. doi: 10.1371/journal.pone.0043558. Epub 2012 Sep 12.

DOI: 10.1371/journal.pone.0043558
PMID: 22984434
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3440407/

Abstract

The recognition and normalization of gene mentions in biomedical literature are crucial steps in biomedical text mining. We present a system for extracting gene names from biomedical literature and normalizing them to gene identifiers in databases. The system consists of four major components: gene name recognition, entity mapping, disambiguation and filtering. The first component is a gene name recognizer based on dictionary matching and semi-supervised learning, which uses co-occurrence information from a large number of unlabeled MEDLINE abstracts to enhance the feature representation of gene named entities. In the entity mapping stage, we combine exact and approximate matching strategies to link gene names in context to the EntrezGene database. For gene names that map to more than one database identifier, we develop a disambiguation method based on semantic similarity derived from the Gene Ontology and MEDLINE abstracts. To remove the noise produced in the previous steps, we design a filtering method based on the confidence scores in the dictionary used for NER. The system can adjust the trade-off between precision and recall based on the result of filtering. It achieves an F-measure of 83% (precision: 82.5%, recall: 83.5%) on the BioCreative II Gene Normalization (GN) dataset, which is comparable to the current state of the art.
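
The mapping, disambiguation and filtering stages described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon, the placeholder identifiers (`G1` to `G4`, not real EntrezGene IDs), the keyword profiles that stand in for Gene Ontology and MEDLINE-derived semantic similarity, and the use of `difflib` string similarity for approximate matching. It is not the authors' implementation, and the recognition stage is replaced by a hand-written list of mentions with confidence scores.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicon: normalized gene name -> candidate identifiers.
# The identifiers are placeholders, not real EntrezGene IDs.
LEXICON = {
    "brca1": ["G1"],
    "tumornecrosisfactor": ["G2"],
    "cat": ["G3", "G4"],  # ambiguous name with two candidates
}

# Hypothetical keyword profiles per identifier, a crude stand-in for
# semantic similarity derived from the Gene Ontology and MEDLINE.
PROFILES = {
    "G1": {"breast", "cancer", "repair"},
    "G2": {"inflammation", "cytokine"},
    "G3": {"catalase", "peroxide", "oxidative"},
    "G4": {"acetyltransferase", "chloramphenicol"},
}


def normalize(name):
    """Lowercase a name and drop every non-alphanumeric character."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def map_mention(mention, min_ratio=0.85):
    """Entity mapping: exact match first, then the closest lexicon key."""
    key = normalize(mention)
    if key in LEXICON:
        return LEXICON[key]
    best_key, best_ratio = None, 0.0
    for candidate in LEXICON:
        ratio = SequenceMatcher(None, key, candidate).ratio()
        if ratio > best_ratio:
            best_key, best_ratio = candidate, ratio
    return LEXICON[best_key] if best_ratio >= min_ratio else []


def disambiguate(candidates, context):
    """Pick the candidate whose profile shares the most words with the context."""
    words = set(context.lower().split())
    return max(candidates, key=lambda gid: len(PROFILES[gid] & words))


def normalize_mentions(mentions, context, threshold):
    """mentions: (text, confidence) pairs, as a recognizer might emit them."""
    scored = []
    for text, confidence in mentions:
        candidates = map_mention(text)
        if candidates:
            scored.append((disambiguate(candidates, context), confidence))
    # Filtering: keep only identifiers whose mention confidence is high enough.
    return {gid for gid, confidence in scored if confidence >= threshold}


mentions = [("BRCA-1", 0.9), ("CAT", 0.8), ("tumour necrosis factor", 0.7)]
context = "CAT reduces oxidative stress by breaking down peroxide"
loose = normalize_mentions(mentions, context, threshold=0.5)
strict = normalize_mentions(mentions, context, threshold=0.75)
```

Raising `threshold` drops the lower-confidence mention, which mirrors how a filtering step can trade recall for precision.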

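As a quick consistency check on the reported figures, the short Python sketch below recomputes the F-measure from the stated precision and recall. It assumes the abstract means the balanced F-measure (F1, the harmonic mean of precision and recall); the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
score = f_measure(0.825, 0.835)
print(f"{score:.3f}")  # prints 0.830
```

Under that assumption the result rounds to 83%, which agrees with the value given in the abstract.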

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdfd/3440407/314c9614bce3/pone.0043558.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdfd/3440407/e3af7c9d38cc/pone.0043558.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bdfd/3440407/f10d1081f49a/pone.0043558.g003.jpg
